Systematically Identifying, Defining and Organizing Knowledge Components for Data Science Problem Solving through Human-LLM Collaboration

dc.contributor.authorPriyanka Rani, Fnu
dc.contributor.authorAlomair, Maryam
dc.contributor.authorPan, Shimei
dc.contributor.authorChen, Lujie Karen
dc.date.accessioned2026-02-03T18:14:41Z
dc.date.issued2025-07-17
dc.descriptionL@S '25: Twelfth ACM Conference on Learning @ Scale, Palermo, Italy, July 22-23, 2025
dc.description.abstractAs demand grows for job-ready data science professionals, there is increasing recognition that traditional training often falls short in cultivating the higher-order reasoning and real-world problem-solving skills essential to the field. A foundational step toward addressing this gap is the identification and organization of knowledge components (KCs) that underlie data science problem solving (DSPS). KCs represent conditional knowledge (knowing about appropriate actions given particular contexts or conditions) and correspond to the critical decisions data scientists must make throughout the problem-solving process. While existing taxonomies in data science education support curriculum development, they often lack the granularity and focus needed to support the assessment and development of DSPS skills. In this paper, we present a novel framework that combines the strengths of large language models (LLMs) and human expertise to identify, define, and organize KCs specific to DSPS. We treat LLMs as "knowledge engineering assistants" capable of generating candidate KCs by drawing on their extensive training data, which includes a vast amount of domain knowledge and diverse sets of real-world DSPS cases. Our process involves prompting multiple LLMs to generate decision points, synthesizing and refining KC definitions across models, and using sentence-embedding models to infer the underlying structure of the resulting taxonomy. Human experts then review and iteratively refine the taxonomy to ensure validity. This human-AI collaborative workflow offers a scalable and efficient proof-of-concept for LLM-assisted knowledge engineering. The resulting KC taxonomy lays the groundwork for developing fine-grained assessment tools and adaptive learning systems that support deliberate practice in DSPS. Furthermore, the framework illustrates the potential of LLMs not just as content generators but as partners in structuring domain knowledge to inform instructional design. Future work will involve extending the framework by generating a directed graph of KCs based on their input-output dependencies and validating the taxonomy through expert consensus and learner studies. This approach contributes to both the practical advancement of DSPS coaching in data science education and the broader methodological toolkit for AI-supported knowledge engineering.
dc.description.sponsorshipThis material is based upon work supported by the National Science Foundation under Grant No. 2429590
dc.description.urihttps://dl.acm.org/doi/10.1145/3698205.3733952
dc.format.extent6 pages
dc.genreconference papers and proceedings
dc.identifierdoi:10.13016/m21kh8-izfy
dc.identifier.citationPriyanka Rani, FNU, Maryam Alomair, Shimei Pan, and Lujie K. Chen. “Systematically Identifying, Defining and Organizing Knowledge Components for Data Science Problem Solving through Human-LLM Collaboration.” Proceedings of the Twelfth ACM Conference on Learning @ Scale, L@S ’25, July 17, 2025, 341–45. https://doi.org/10.1145/3698205.3733952.
dc.identifier.urihttps://doi.org/10.1145/3698205.3733952
dc.identifier.urihttp://hdl.handle.net/11603/41652
dc.language.isoen
dc.publisherACM
dc.relation.isAvailableAtThe University of Maryland, Baltimore County (UMBC)
dc.relation.ispartofUMBC Faculty Collection
dc.relation.ispartofUMBC Data Science
dc.relation.ispartofUMBC Student Collection
dc.relation.ispartofUMBC Information Systems Department
dc.rightsThis item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.subjectUMBC Accelerated Cognitive Cybersecurity Laboratory
dc.subjectUMBC Lab for Informatics for Human Flourishing
dc.subjectUMBC NLP and Social Computing Lab
dc.titleSystematically Identifying, Defining and Organizing Knowledge Components for Data Science Problem Solving through Human-LLM Collaboration
dc.typeText
dcterms.creatorhttps://orcid.org/0009-0005-7606-5884
dcterms.creatorhttps://orcid.org/0009-0008-8343-5814
dcterms.creatorhttps://orcid.org/0000-0002-5989-8543
dcterms.creatorhttps://orcid.org/0000-0002-7185-8405
