Improving Robustness of Data Mining Models in Cybersecurity Applications


Author/Creator ORCID




Information Systems


Information Systems

Citation of Original Publication


Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan thorugh a local library, pending author/copyright holder's permission.
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.


Machine learning models have been widely used to build detection models for cyber security applications such as spam filtering, fraud detection, virus or malware detection, and intrusion detection. However, it is well-known that adversaries can adapt their attacks to evade detection. One most prominent example is email spam, where spammers routinely modify emails to get past classifier-based spam filters. There has been some work on making machine learning models more robust to such attacks. However, one simple but promising approach called randomization is under-explored. In addition, most existing works focus on models with differentiable error functions while tree-based models do not have such error functions but are quite popular because they are easy to interpret. This paper proposes a novel approach that uses randomness to improve robustness of data mining models against attacks that try to evade detection by adapting to these models. This approach differs from existing work on related topics such as adversarial learning or robust data mining in that it does not use a deterministic model. The rationale is that according to the Minimum Description Length (MDL) principle, a deterministic model should be concise to avoid overfitting the training data. This makes deterministic models vulnerable to simple attacks. For example, for a decision tree model, an attacker can simply send several testing data instances to the model and based on the output guess the features and conditions in the decision tree. One can try to create a more complicated decision tree with more features and conditions to make the attackers' job more difficult, but this violates the MDL principle. Another possible solution is to use ensemble learning. However, the goal of ensemble learning is to improve mining quality, and it is likely that the models built from ensemble learning still use a small set of features more frequently and are vulnerable to adaptation attacks. Our approach uses randomness to increase the uncertainty of models but at the same time without violating the MDL principle for each individual model. Our contributions include: 1) methods to build a pool of diverse mining models by optimizing the tradeoff between robustness and mining quality; 2) methods to randomly select a subset of models at run time (when the model is used for detection) to further boost robustness; 3) a theoretical framework for decision tree models that bounds the minimal number of features an attacker needs to modify given a set of selected models.