Systematically Identifying, Defining and Organizing Knowledge Components for Data Science Problem Solving through Human-LLM Collaboration

Department

Program

Citation of Original Publication

Priyanka Rani, FNU, Maryam Alomair, Shimei Pan, and Lujie K. Chen. “Systematically Identifying, Defining and Organizing Knowledge Components for Data Science Problem Solving through Human-LLM Collaboration.” Proceedings of the Twelfth ACM Conference on Learning @ Scale, L@S ’25, July 17, 2025, 341–45. https://doi.org/10.1145/3698205.3733952.

Rights

This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.

Abstract

As demand grows for job-ready data science professionals, there is increasing recognition that traditional training often falls short in cultivating the higher-order reasoning and real-world problem-solving skills essential to the field. A foundational step toward addressing this gap is the identification and organization of knowledge components (KCs) that underlie data science problem solving (DSPS). KCs represent conditional knowledge-knowing about appropriate actions given particular contexts or conditions-and correspond to the critical decisions data scientists must make throughout the problem-solving process. While existing taxonomies in data science education support curriculum development, they often lack the granularity and focus needed to support the assessment and development of DSPS skills. In this paper, we present a novel framework that combines the strengths of large language models (LLMs) and human expertise to identify, define, and organize KCs specific to DSPS. We treat LLMs as ''knowledge engineering assistants'' capable of generating candidate KCs by drawing on their extensive training data, which includes a vast amount of domain knowledge and diverse sets of real-world DSPS cases. Our process involves prompting multiple LLMs to generate decision points, synthesizing and refining KC definitions across models, and using sentence-embedding models to infer the underlying structure of the resulting taxonomy. Human experts then review and iteratively refine the taxonomy to ensure validity. This human-AI collaborative workflow offers a scalable and efficient proof-of-concept for LLM-assisted knowledge engineering. The resulting KC taxonomy lays the groundwork for developing fine-grained assessment tools and adaptive learning systems that support deliberate practice in DSPS. Furthermore, the framework illustrates the potential of LLMs not just as content generators but as partners in structuring domain knowledge to inform instructional design. Future work will involve extending the framework by generating a directed graph of KCs based on their input-output dependencies and validating the taxonomy through expert consensus and learner studies. This approach contributes to both the practical advancement of DSPS coaching in data science education and the broader methodological toolkit for AI-supported knowledge engineering.