Clustering for Monitoring Distributed Data Streams

dc.contributor.advisorKogan, Jacob
dc.contributor.advisorMalinovsky, Yaakov
dc.contributor.authorBarouti, Maria
dc.contributor.departmentMathematics and Statistics
dc.contributor.programMathematics, Applied
dc.date.accessioned2019-10-11T14:01:50Z
dc.date.available2019-10-11T14:01:50Z
dc.date.issued2016-01-01
dc.description.abstractData mining is a challenging research area of computer science with profound applications in database industries and resulting market needs. Data mining is the computational process of discovering patterns in big data sets. This process enables us to extract valuable information from large data by involving methods at the intersection of different topics such as machine learning, statistics, and artificial intelligence. Over the last years there has been a growing interest in data analysis research by monitoring data streams in a distributed system. In this study we propose to monitor arbitrary threshold functions over distributed data streams while minimizing communication overhead. To illustrate this further, assume that we have a number of sensors that are spread in the space and we would like to monitor the average of their measurements while minimizing communication between the sensors. Each sensor represents a node that produces time varying vectors derived from the stream of measurements. Thus we are interested to check if a function evaluated at the vectors' average at each time is greater than zero while communication between the nodes is minimized. Motivated by recent contributions based on geometric ideas, and after reviewing some well known clustering algorithms we present an alternative approach that combines system theory techniques, clustering and statistical approaches. Our approach enables monitoring values of an arbitrary threshold function over distributed data streams through a set of constraints applied independently on each stream and/or clusters of streams. The clusters are designed to evolve in time and to adapt themselves to the data stream. A correct choice of clusters yields a reduction in communication load. Unlike many clustering algorithms that attempt to collect together similar data items, monitoring requires clusters with dissimilar vectors canceling each other as much as possible. In particular, subclusters of a good cluster do not have to be good. This novel type of clustering dictated by the problem at hand requires development of new algorithms and/or modification of the existing ones, and this thesis is a step in this direction. We report experiments on real-world data with a newly devised clustering algorithm. The experiments detect instances where communication between nodes is required, and show that the clustering approach reduces communication load. Last but not least, we indicate new future directions and discuss possible methodologies that can be involved into my future research agenda.
dc.genredissertations
dc.identifierdoi:10.13016/m2jhfw-ta4f
dc.identifier.other11510
dc.identifier.urihttp://hdl.handle.net/11603/15676
dc.languageen
dc.relation.isAvailableAtThe University of Maryland, Baltimore County (UMBC)
dc.relation.ispartofUMBC Mathematics and Statistics Department Collection
dc.relation.ispartofUMBC Theses and Dissertations Collection
dc.relation.ispartofUMBC Graduate School Collection
dc.relation.ispartofUMBC Student Collection
dc.rightsThis item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
dc.sourceOriginal File Name: Barouti_umbc_0434D_11510.pdf
dc.subjectclustering
dc.subjectdata mining
dc.subjectdistributed data streams
dc.subjectfeature selection
dc.subjectmonitoring
dc.subjectoptimization
dc.titleClustering for Monitoring Distributed Data Streams
dc.typeText
dcterms.accessRightsDistribution Rights granted to UMBC by the author.

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Barouti_umbc_0434D_11510.pdf
Size:
971.19 KB
Format:
Adobe Portable Document Format
License bundle
Now showing 1 - 1 of 1
No Thumbnail Available
Name:
Barouti_Clustering_Open.pdf
Size:
45.56 KB
Format:
Adobe Portable Document Format
Description: