Benchmarking Discretisation Level of Continuous Attributes: Theoretical and Experimental Approaches
Citation of Original Publication
W. Chen, C. Wang, J. Li, B. Yang, Y. Liu and J. Wang, "Benchmarking Discretisation Level of Continuous Attributes: Theoretical and Experimental Approaches," 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 2019, pp. 3623-3631, doi: 10.1109/BigData47090.2019.9006513.
Rights
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract
The discretisation of an attribute refers to partitioning its continuous numerical values into intervals, each of which is associated with a categorical label. The number of distinct categorical labels is called the target discretisation level of the continuous attribute. Discretisation is necessary for data mining algorithms that can only work on discrete data, and it can also make the original data more concise and interpretable. However, it is challenging to balance the target discretisation level against the information lost during the discretisation process. In this paper, we propose, for the first time, to use the entropy of a continuous attribute as a benchmark for determining its target discretisation level. We also propose a naive entropy-based unsupervised discretisation approach, which shows clear advantages in terms of both data reduction and accuracy; accuracy is evaluated by running classifiers on datasets whose continuous attributes have been discretised with the proposed approach. Our experiments on 28 datasets and 9 popular classifiers show that the accuracy of a discretisation approach is substantially affected when the target discretisation level of each continuous attribute is lower than the entropy benchmark. Meanwhile, increasing the target discretisation level beyond the benchmark does not always improve the accuracy of the discretiser. These findings can provide valuable guidance for exploring or optimising approaches to the discretisation of continuous attributes.
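
The abstract does not state how the entropy benchmark is computed, so the following is only an illustrative sketch and not the authors' method. It assumes the benchmark is derived from the Shannon entropy H (in bits) of the attribute's empirical value distribution, and that the target discretisation level is the smallest number of labels k with log2(k) >= H. The function names (shannon_entropy_bits, benchmark_level, equal_width_discretise) and the use of equal-width intervals are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def shannon_entropy_bits(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def benchmark_level(values):
    """ASSUMPTION (not from the paper): smallest k such that log2(k) >= H."""
    h = shannon_entropy_bits(values)
    # The small tolerance guards against floating-point noise around integers.
    return max(1, math.ceil(2 ** h - 1e-9))


def equal_width_discretise(values, k):
    """Map each value to an integer label in 0..k-1 using k equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; use a single label.
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]


if __name__ == "__main__":
    attribute = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
    k = benchmark_level(attribute)
    print(k, equal_width_discretise(attribute, k))
```

Under these assumptions, an attribute whose observed values are all distinct has empirical entropy log2(n), so the sketch would return one label per value; a practical estimator for truly continuous data would need a different entropy estimate.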
