Network Traffic Data Preprocessing: An Agentic AI Framework
Department
Hood College Computer Science and Information Technology
Program
Information Technology
Rights
Attribution-NonCommercial-NoDerivs 3.0 United States
Abstract
The application of machine learning (ML) in modern Intrusion Detection
Systems (IDS) depends critically on the quality of network traffic data. However,
the data preprocessing stage, which transforms complex, "dirty" raw data into a clean,
ML-ready format, remains a time-consuming, manual bottleneck that relies heavily on
domain expertise. This research addresses this gap by proposing and validating a novel,
holistic framework that uses agentic Generative AI (GenAI) to fully automate the
end-to-end data preprocessing pipeline.
The core of this framework is a multi-step "prompt-chain" that
compels a Large Language Model (LLM) to act as an expert data scientist. This agentic
process forces the AI to move beyond simple code generation. In order, it must perform
rigorous exploratory data analysis, construct a feature selection plan, design a
methodologically sound preprocessing strategy (preventing data leakage and handling
class imbalance), generate the code, and validate its own work through iterative
debugging and "hostile code" reviews.
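
The chaining idea described above can be illustrated with a minimal sketch. This is not the framework's actual prompt set: the stage names paraphrase the steps listed in the abstract, the prompt wording is invented for illustration, and `run_chain`, `STAGES`, and the stand-in model `echo_llm` are hypothetical names. The iterative debugging loop is omitted for brevity. The only point shown is that each stage receives the accumulated output of every earlier stage.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stage list: names and prompt wording are illustrative assumptions,
# loosely following the steps named in the abstract.
STAGES: List[Tuple[str, str]] = [
    ("eda", "Perform exploratory data analysis on the dataset described below."),
    ("feature_plan", "Using the findings so far, construct a feature selection plan."),
    ("strategy", "Design a preprocessing strategy that prevents data leakage "
                 "and handles class imbalance."),
    ("code", "Generate code that implements the strategy."),
    ("review", "Act as a hostile reviewer and list defects in the generated code."),
]


def run_chain(llm: Callable[[str], str], dataset_description: str) -> Dict[str, str]:
    """Run the stages in order, appending each reply to the shared context."""
    outputs: Dict[str, str] = {}
    context = f"DATASET:\n{dataset_description}"
    for name, instruction in STAGES:
        prompt = f"{instruction}\n\n{context}"
        outputs[name] = llm(prompt)
        # Later stages see everything produced by earlier stages.
        context += f"\n\n[{name}]\n{outputs[name]}"
    return outputs


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call: echoes the instruction it received."""
    return "stub reply to: " + prompt.splitlines()[0]


if __name__ == "__main__":
    results = run_chain(echo_llm, "flow records with a label column")
    for stage, reply in results.items():
        print(stage, "->", reply)
```

Any callable that maps a prompt string to a reply string can replace `echo_llm`, so the same chain could in principle be pointed at different models for comparison.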
This three-phase agentic framework was experimentally validated by
comparing the performance of three state-of-the-art GenAI reasoning models
(DeepSeek V3, Google Gemini 2.5 Pro, and OpenAI GPT-5) against a human-derived
manual baseline. The evaluation was conducted on three distinct network traffic
datasets of increasing complexity: the curated UNSW-NB15 (as a control), the raw
IDSIoT2024, and the highly complex, "dirty" VNFCYBERDATA. The resulting data
artifacts were evaluated using four standard ML classifiers: K-Nearest Neighbors
(KNN), Decision Tree (DT), Random Forest (RF), and Gaussian Naïve Bayes (GNB).
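
As a rough illustration of how a preprocessed artifact might be scored with the four classifiers named above, the following sketch uses scikit-learn on synthetic data. The data, the hyperparameters, the 70/30 stratified split, and the macro-F1 metric are all assumptions made for this example, since the abstract does not state the evaluation settings. Scaling is fitted inside a pipeline on the training split only, which is one common way to avoid the data leakage mentioned earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def evaluate(X, y, seed=0):
    """Return macro-F1 per classifier on a held-out, stratified test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    # Hyperparameters are library defaults or arbitrary choices, not the study's.
    models = {
        "KNN": KNeighborsClassifier(),
        "DT": DecisionTreeClassifier(random_state=seed),
        "RF": RandomForestClassifier(n_estimators=50, random_state=seed),
        "GNB": GaussianNB(),
    }
    scores = {}
    for name, model in models.items():
        # The scaler is fitted on the training split only, inside the pipeline.
        pipe = make_pipeline(StandardScaler(), model)
        pipe.fit(X_train, y_train)
        scores[name] = f1_score(y_test, pipe.predict(X_test), average="macro")
    return scores


if __name__ == "__main__":
    # Synthetic, imbalanced stand-in for a preprocessed traffic dataset.
    X, y = make_classification(
        n_samples=600, n_features=10, n_informative=5,
        weights=[0.9, 0.1], random_state=0,
    )
    for name, score in evaluate(X, y).items():
        print(f"{name}: macro-F1 = {score:.3f}")
```

Macro-averaged F1 is used here because it weights the rare class equally with the majority class, which matters when minority (rare attack) detection is the concern.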
The results validate the hypothesis for raw data. While the framework's
application on the already-curated UNSW-NB15 dataset led to "over-cleaning" and a
slight performance degradation, it consistently and significantly outperformed the
manual baseline on the two raw datasets, IDSIoT2024 and VNFCYBERDATA.
Notably, the GenAI-processed data enabled sensitive
classifiers like GNB to function, whereas the manual baseline caused them to fail
catastrophically. The framework was also superior in preparing
the data for the critical task of minority class (rare attack) detection, where the manual
baseline failed. This research demonstrates that a prompt-chain-driven GenAI
framework is not merely a viable alternative but a more robust and comprehensive
method than traditional manual preprocessing for raw, real-world network
traffic data.
