Nicholas, Charles
Joyce, Robert J
2021-09-01
2021-09-01
2020-01-01
12166
http://hdl.handle.net/11603/22879
The malware analysis community has no diverse, up-to-date reference dataset with ground truth labels. Consequently, automatic malware classifiers are typically evaluated using custom datasets with near ground truth labels. However, classifier evaluation using near ground truth labels can yield erroneous or biased results. We propose an alternative classifier evaluation framework that does not require reference labels. We introduce the concept of a ground truth refinement and propose potential methods for constructing an approximation of one from a malware dataset. We prove that a ground truth refinement makes it possible to compute lower bounds on precision and error rate, as well as upper bounds on recall and accuracy, without ground truth reference labels. We perform a case study on the popular AVClass malware labeler using our proposed evaluation framework.
application:pdf
Classifier Evaluation; Data Science; Malware Analysis; Malware Classification
Evaluating Automatic Malware Classifiers in the Absence of Reference Labels
Text