Authors: Bejarano, Jeremy; Bose, Koushiki; Brannan, Tyler; Thomas, Anita; Adragni, Kofi; Neerchal, Nagaraj K.; Ostrouchov, George
Dates: 2018-10-25 (deposited); 2011 (issued)
URI: http://hdl.handle.net/11603/11692
Abstract: Due to current data collection technology, our ability to gather data has surpassed our ability to analyze it. In particular, k-means, one of the simplest and fastest clustering algorithms, is ill-equipped to handle extremely large datasets on even the most powerful machines. Our new algorithm uses a sample from a dataset to decrease runtime by reducing the amount of data analyzed. We perform a simulation study to compare our sampling-based k-means to the standard k-means algorithm by analyzing both the speed and accuracy of the two methods. Results show that our algorithm is significantly more efficient than the existing algorithm, with comparable accuracy.
Extent: 11 pages
Language: en-US
Rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Subjects: cluster large datasets; sample size; k-means; tolerance and confidence intervals; UMBC High Performance Computing Facility (HPCF)
Title: Sampling Within k-Means Algorithm to Cluster Large Datasets
Type: Text
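The following is a minimal sketch, not the authors' code, of the idea described in the abstract: run k-means on a random subsample of the data, then make a single pass over the full dataset to assign every point to the nearest centroid found on the sample. The report's keywords indicate the sample size is chosen from tolerance and confidence intervals; this sketch instead uses a fixed sample fraction, and all function names, parameter values, and the toy data are illustrative assumptions.

```python
import numpy as np


def kmeans(X, k, n_iter=100, tol=1e-6, rng=None):
    """Plain Lloyd's k-means on X (n_samples x n_features); illustrative only."""
    rng = np.random.default_rng(rng)
    # Initialize centroids as k distinct random points from X.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids as cluster means (keep the old centroid if a cluster empties).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels


def sampling_kmeans(X, k, sample_frac=0.1, rng=None):
    """Cluster a large dataset by running k-means on a random subsample only.

    sample_frac is an assumed knob; the report derives the sample size from
    tolerance and confidence intervals rather than a fixed fraction.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    sample_size = max(k, int(sample_frac * n))
    sample = X[rng.choice(n, size=sample_size, replace=False)]
    centroids, _ = kmeans(sample, k, rng=rng)
    # One final pass assigns every point in the full dataset to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)


if __name__ == "__main__":
    # Toy data: three well-separated Gaussian blobs in 2-D.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(10_000, 2)) for c in (0, 5, 10)])
    centroids, labels = sampling_kmeans(X, k=3, sample_frac=0.05, rng=1)
    print(centroids)
```

The speedup in this sketch comes from running the iterative centroid updates on only sample_frac of the rows; the full dataset is touched just once for the final assignment, which mirrors the abstract's claim of reduced runtime with comparable accuracy when the sample is representative.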