Sampling Within k-Means Algorithm to Cluster Large Datasets

dc.contributor.author: Bejarano, Jeremy
dc.contributor.author: Bose, Koushiki
dc.contributor.author: Brannan, Tyler
dc.contributor.author: Thomas, Anita
dc.contributor.author: Adragni, Kofi
dc.contributor.author: Neerchal, Nagaraj K.
dc.contributor.author: Ostrouchov, George
dc.date.accessioned: 2018-10-25T15:37:49Z
dc.date.available: 2018-10-25T15:37:49Z
dc.date.issued: 2011
dc.description.abstract: Due to current data collection technology, our ability to gather data has surpassed our ability to analyze it. In particular, k-means, one of the simplest and fastest clustering algorithms, is ill-equipped to handle extremely large datasets on even the most powerful machines. Our new algorithm uses a sample from a dataset to decrease runtime by reducing the amount of data analyzed. We perform a simulation study to compare our sampling-based k-means to the standard k-means algorithm by analyzing both the speed and accuracy of the two methods. Results show that our algorithm is significantly more efficient than the existing algorithm with comparable accuracy. [en_US]
dc.description.sponsorship: This research was conducted during Summer 2011 in the REU Site: Interdisciplinary Program in High Performance Computing (www.umbc.edu/hpcreu) in the UMBC Department of Mathematics and Statistics, funded by the National Science Foundation (grant no. DMS–0851749). This program is also supported by UMBC, the Department of Mathematics and Statistics, the Center for Interdisciplinary Research and Consulting (CIRC), and the UMBC High Performance Computing Facility (HPCF). The computational hardware in HPCF (www.umbc.edu/hpcf) is partially funded by the National Science Foundation through the MRI program (grant no. CNS–0821258) and the SCREMS program (grant no. DMS–0821311), with additional substantial support from UMBC. [en_US]
dc.description.uri: https://userpages.umbc.edu/~gobbert/papers/REU2011Team2.pdf [en_US]
dc.format.extent: 11 pages [en_US]
dc.genre: technical report [en_US]
dc.identifier: doi:10.13016/M2QV3C732
dc.identifier.uri: http://hdl.handle.net/11603/11692
dc.language.iso: en_US [en_US]
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Mathematics Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.relation.ispartof: UMBC Student Collection
dc.relation.ispartofseries: HPCF Technical Report; HPCF–2011–12
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.subject: Cluster Large Datasets [en_US]
dc.subject: sample size [en_US]
dc.subject: k-means [en_US]
dc.subject: tolerance and confidence intervals [en_US]
dc.subject: UMBC High Performance Computing Facility (HPCF) [en_US]
dc.title: Sampling Within k-Means Algorithm to Cluster Large Datasets [en_US]
dc.type: Text [en_US]

Files

Original bundle
Name: REU2011Team2.pdf
Size: 145.83 KB
Format: Adobe Portable Document Format