Network Traffic Data Preprocessing: An Agentic AI Framework

Department

Hood College Computer Science and Information Technology

Program

Information Technology

Rights

Attribution-NonCommercial-NoDerivs 3.0 United States

Abstract

The application of machine learning (ML) in modern Intrusion Detection Systems (IDS) depends critically on the quality of network traffic data. However, the data preprocessing stage, which transforms complex, "dirty" raw data into a clean, ML-ready format, remains a time-consuming manual bottleneck that relies heavily on domain expertise. This research addresses this gap by proposing and validating a novel, holistic framework that uses agentic Generative AI (GenAI) to automate the end-to-end data preprocessing pipeline. The core of the framework is a multi-step "prompt-chain" that directs a Large Language Model (LLM) to act as an expert data scientist. This agentic process pushes the AI beyond simple code generation: it must first perform rigorous exploratory data analysis, construct a feature selection plan, design a methodologically sound preprocessing strategy (one that prevents data leakage and handles class imbalance), generate the code, and validate its own work through iterative debugging and "hostile code" reviews. The three-phase agentic framework was validated experimentally by comparing three state-of-the-art GenAI reasoning models (DeepSeek V3, Google Gemini 2.5 Pro, and OpenAI GPT-5) against a human-derived manual baseline. The evaluation used three network traffic datasets of increasing complexity: the curated UNSW-NB15 (as a control), the raw IDSIoT2024, and the highly complex, "dirty" VNFCYBERDATA. The resulting data artifacts were evaluated with four standard ML classifiers: K-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF), and Gaussian Naïve Bayes (GNB). The results validate the hypothesis for raw data. On the already-curated UNSW-NB15 dataset, the framework "over-cleaned" the data and caused a slight performance degradation, but on the two raw datasets its output was clearly superior to the manual baseline.
On the IDSIoT2024 and VNFCYBERDATA datasets, the GenAI framework consistently and significantly outperformed the manual method. Notably, the GenAI-processed data enabled sensitive classifiers such as GNB to function, whereas the manually preprocessed data caused them to fail catastrophically. The framework also prepared the data better for minority-class (rare attack) detection, a critical task on which the manual baseline failed. This research demonstrates that a prompt-chain-driven GenAI framework is not merely a viable alternative to traditional manual preprocessing but a more robust and comprehensive method for raw, real-world network traffic data.
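
The prompt-chain idea described in the abstract can be illustrated with a minimal sketch in which each stage's output is fed into the next stage's prompt. The stage order follows the abstract (exploratory analysis, feature selection plan, preprocessing strategy, code generation, hostile review), but everything else here is an assumption made for illustration: the prompt wording, the names `STAGES`, `run_chain`, and `echo_llm`, and the stub that stands in for a real model call. It is not the prompts or code used in the work itself, and it runs a single pass rather than the iterative debugging loop the abstract mentions.

```python
from typing import Callable, Dict, List, Tuple

# Stage names mirror the steps listed in the abstract.
# The prompt templates are hypothetical placeholders, not the author's prompts.
STAGES: List[Tuple[str, str]] = [
    ("eda", "Perform exploratory data analysis on this dataset:\n{dataset}"),
    ("feature_plan", "Propose a feature selection plan from this EDA report:\n{eda}"),
    ("strategy", "Design a leakage-free, imbalance-aware preprocessing strategy "
                 "for this plan:\n{feature_plan}"),
    ("code", "Write code that implements this strategy:\n{strategy}"),
    ("review", "Act as a hostile reviewer and list defects in this code:\n{code}"),
]


def run_chain(llm: Callable[[str], str], dataset: str) -> Dict[str, str]:
    """Run the stages in order; each prompt is built from earlier outputs."""
    context: Dict[str, str] = {"dataset": dataset}
    for name, template in STAGES:
        # str.format ignores unused keys, so every earlier output stays available.
        prompt = template.format(**context)
        context[name] = llm(prompt)
    return context


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: echoes the prompt's first line."""
    return "<response to: " + prompt.splitlines()[0] + ">"


if __name__ == "__main__":
    outputs = run_chain(echo_llm, "flows.csv with a binary 'attack' label")
    for stage_name, _ in STAGES:
        print(stage_name, "->", outputs[stage_name])
```

Passing the model as a plain callable keeps the chain independent of any particular provider, so the same stage list could in principle be run against each of the compared models by swapping the callable.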