Sampling Within k-Means Algorithm to Cluster Large Datasets

Bejarano, Jeremy; Bose, Koushiki; Brannan, Tyler; Thomas, Anita; Adragni, Kofi; Neerchal, Nagaraj K.; Ostrouchov, George

dc.contributor.author	Bejarano, Jeremy
dc.contributor.author	Bose, Koushiki
dc.contributor.author	Brannan, Tyler
dc.contributor.author	Thomas, Anita
dc.contributor.author	Adragni, Kofi
dc.contributor.author	Neerchal, Nagaraj K.
dc.contributor.author	Ostrouchov, George
dc.date.accessioned	2018-10-25T15:37:49Z
dc.date.available	2018-10-25T15:37:49Z
dc.date.issued	2011
dc.description.abstract	Due to current data collection technology, our ability to gather data has surpassed our ability to analyze it. In particular, k-means, one of the simplest and fastest clustering algorithms, is ill-equipped to handle extremely large datasets on even the most powerful machines. Our new algorithm uses a sample from a dataset to decrease runtime by reducing the amount of data analyzed. We perform a simulation study to compare our sampling-based k-means to the standard k-means algorithm by analyzing both the speed and accuracy of the two methods. Results show that our algorithm is significantly more efficient than the existing algorithm while achieving comparable accuracy.	en
dc.description.sponsorship	This research was conducted during Summer 2011 in the REU Site: Interdisciplinary Program in High Performance Computing (www.umbc.edu/hpcreu) in the UMBC Department of Mathematics and Statistics, funded by the National Science Foundation (grant no. DMS–0851749). This program is also supported by UMBC, the Department of Mathematics and Statistics, the Center for Interdisciplinary Research and Consulting (CIRC), and the UMBC High Performance Computing Facility (HPCF). The computational hardware in HPCF (www.umbc.edu/hpcf) is partially funded by the National Science Foundation through the MRI program (grant no. CNS–0821258) and the SCREMS program (grant no. DMS–0821311), with additional substantial support from UMBC.	en
dc.description.uri	https://userpages.umbc.edu/~gobbert/papers/REU2011Team2.pdf	en
dc.format.extent	11 pages	en
dc.genre	technical report	en
dc.identifier	doi:10.13016/M2QV3C732
dc.identifier.uri	http://hdl.handle.net/11603/11692
dc.language.iso	en	en
dc.relation.isAvailableAt	The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof	UMBC Mathematics Department Collection
dc.relation.ispartof	UMBC Faculty Collection
dc.relation.ispartof	UMBC Student Collection
dc.relation.ispartofseries	HPCF Technical Report;HPCF–2011–12
dc.rights	This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.subject	Cluster Large Datasets	en
dc.subject	sample size	en
dc.subject	k-means	en
dc.subject	tolerance and confidence intervals	en
dc.subject	UMBC High Performance Computing Facility (HPCF)	en
dc.title	Sampling Within k-Means Algorithm to Cluster Large Datasets	en
dc.type	Text	en
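
The abstract above describes clustering a sample instead of the full dataset. As a rough illustration only, and not the report's actual algorithm (which this record does not specify), a minimal Python sketch might run plain Lloyd k-means on a uniform random sample and then assign every point to the nearest resulting centroid. The function names, the uniform sampling, the caller-chosen sample size, and the random initialisation are all assumptions made for this sketch.

```python
import numpy as np


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    # Squared Euclidean distances, shape (n_points, k).
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def lloyd_kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd iterations on `points`; returns the k centroids."""
    # Assumption: initialise from k distinct, randomly chosen data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids


def sampled_kmeans(points, k, sample_size, seed=0):
    """Hypothetical helper: fit on a random sample, then label all points."""
    rng = np.random.default_rng(seed)
    # Assumption: a simple uniform sample without replacement.
    idx = rng.choice(len(points), size=sample_size, replace=False)
    centroids = lloyd_kmeans(points[idx], k, rng)
    # Only the final assignment pass touches the full dataset.
    return centroids, assign(points, centroids)
```

In this sketch the iterative part of k-means only ever sees `sample_size` points, so its cost no longer grows with the full dataset; how large the sample must be for a given accuracy is the question the report itself addresses.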

Files

Original bundle

Name: REU2011Team2.pdf
Size: 145.83 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.68 KB
Format: Item-specific license agreed upon to submission

Collections

UMBC Mathematics and Statistics Department
UMBC Faculty Collection
UMBC Student Collection