Network Traffic Data Preprocessing: An Agentic AI Framework

dc.contributor.advisor: Jim, Carol
dc.contributor.advisor: Salem, Ahmed
dc.contributor.author: MacDowell, Christopher
dc.contributor.department: Hood College Computer Science and Information Technology
dc.contributor.program: Information Technology
dc.date.accessioned: 2025-11-17T14:03:23Z
dc.date.issued: 2025-11-14
dc.description.abstract: The application of machine learning (ML) in modern Intrusion Detection Systems (IDS) depends critically on the quality of network traffic data. However, the data preprocessing stage, which transforms complex, "dirty" raw data into a clean, ML-ready format, remains a time-consuming manual bottleneck that relies heavily on domain expertise. This research addresses this gap by proposing and validating a novel, holistic framework that uses agentic Generative AI (GenAI) to fully automate the end-to-end data preprocessing pipeline. The core of this framework is a multi-step "prompt-chain" that compels a Large Language Model (LLM) to act as an expert data scientist. This agentic process forces the AI to move beyond simple code generation: it must first perform rigorous exploratory data analysis, construct a feature selection plan, design a methodologically sound preprocessing strategy (preventing data leakage and handling class imbalance), generate the code, and validate its own work through iterative debugging and "hostile code" reviews. This three-phase agentic framework was experimentally validated by comparing three state-of-the-art GenAI reasoning models (DeepSeek V3, Google Gemini 2.5 Pro, and OpenAI GPT-5) against a human-derived manual baseline. The evaluation was conducted on three network traffic datasets of increasing complexity: the curated UNSW-NB15 (as a control), the raw IDSIoT2024, and the highly complex, "dirty" VNFCYBERDATA. The resulting data artifacts were evaluated using four standard ML classifiers: K-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF), and Gaussian Naïve Bayes (GNB). The results validate the hypothesis. On the already-curated UNSW-NB15 dataset, the framework "over-cleaned" the data and caused a slight performance degradation, but on the two raw datasets its performance was definitively superior to the manual baseline.
The GenAI framework consistently and significantly outperformed the manual method on the IDSIoT2024 and VNFCYBERDATA datasets. Notably, the GenAI-processed data enabled sensitive classifiers such as GNB to function, whereas the manual baseline caused them to fail catastrophically. The framework also proved definitively superior in preparing the data for the critical task of minority class (rare attack) detection, where the manual baseline failed. This research demonstrates that a prompt-chain-driven GenAI framework is not merely a viable alternative but a more robust, comprehensive, and powerful method than traditional manual preprocessing for raw, real-world network traffic data.
dc.format.extent: 353 pages
dc.genre: Thesis (M.S.)
dc.identifier.uri: http://hdl.handle.net/11603/40770
dc.language.iso: en
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/us/
dc.subject: Artificial Intelligence (AI)
dc.subject: Generative Artificial Intelligence
dc.subject: Data Science
dc.subject: Machine Learning
dc.subject: Cyber Security
dc.subject: Network Traffic Analysis
dc.title: Network Traffic Data Preprocessing: An Agentic AI Framework
dc.type: Text
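
The "prompt-chain" described in the abstract can be pictured with a minimal sketch, in which each stage's output is fed into the prompt for the next stage. This is only an illustration under stated assumptions: the stage names are paraphrased from the abstract, and `run_prompt_chain`, `fake_llm`, and the prompt layout are hypothetical names invented here, not the thesis's actual prompts or code.

```python
from typing import Callable, List

# Hypothetical stage list, paraphrased from the abstract (not the thesis's real prompts).
STAGES: List[str] = [
    "exploratory data analysis",
    "feature selection plan",
    "preprocessing strategy (no data leakage, class imbalance handled)",
    "code generation",
    "debugging and hostile code review",
]


def run_prompt_chain(dataset_summary: str, call_llm: Callable[[str], str]) -> List[str]:
    """Run the stages in order, passing all earlier outputs into each new prompt."""
    outputs: List[str] = []
    for stage in STAGES:
        context = "\n".join(outputs)  # everything produced so far
        prompt = f"Dataset: {dataset_summary}\nPrior work:\n{context}\nTask: {stage}"
        outputs.append(call_llm(prompt))
    return outputs


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model call: echoes the requested task."""
    task = prompt.rsplit("Task: ", 1)[1]
    return f"[result of {task}]"


if __name__ == "__main__":
    for line in run_prompt_chain("UNSW-NB15 sample", fake_llm):
        print(line)
```

In practice `fake_llm` would be replaced by a call to whichever model is under test. The chain logic itself does not depend on the model.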

Files

Original bundle

Name: MacDowell_Christopher_Hood_College_Thesis_Final_Draft.pdf
Size: 7.14 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.65 KB
Format: Item-specific license agreed upon to submission