Center-Based Clustering with Divergence

Author/Creator

Author/Creator ORCID

Date

2010-01-01

Department

Mathematics and Statistics

Program

Mathematics, Applied

Citation of Original Publication

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu.
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.

Abstract

Families of center-based clustering methods are capable of handling the high-dimensional sparse data arising in text-mining applications. Current center-based algorithms seek to minimize a particular objective function in an attempt to improve upon the standard k-means algorithm. This dissertation links the traditional partition-based approach to k-means with the optimization treatment of the clustering problem. The interplay between the two approaches sheds light on the solutions generated by optimization techniques. We investigate the Hessian of the objective function, i.e., the second-order optimality condition. When an algorithm arrives at a critical point where the Hessian fails to be positive definite, we propose a way to move away from that critical point; such a move may lead to improvement.

In many cases additional information about the desired type of clusters is available, and incorporating it into the clustering process may produce better results. We consider pairwise constrained clustering, in which we may have information about pairs of vectors that must not belong to the same cluster (cannot-links), pairs of vectors that must belong to the same cluster (must-links), or both. We focus on two k-means-type clustering algorithms and two distance-like functions: the algorithms are k-means and spherical k-means, and the distance-like functions are the reverse Bregman divergence and cosine similarity. We show that for these algorithms and distance-like functions the pairwise constrained clustering problem can be reduced to clustering with cannot-link constraints only. We then replace the cannot-link constraints with a penalty and propose clustering algorithms that handle clustering with penalties.
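The center-based clustering loop the abstract builds on can be sketched as follows. This is a minimal illustrative sketch, not the dissertation's algorithm: it uses squared Euclidean distance, the simplest Bregman divergence (generated by phi(x) = ||x||^2), and the function name and parameters are assumptions for illustration.

```python
import numpy as np

def kmeans_bregman(X, k, n_iter=50, seed=0):
    """Center-based clustering sketch: alternate an assignment step
    and a centroid-update step. Squared Euclidean distance stands in
    for a general Bregman divergence; for any Bregman divergence the
    optimal center of a cluster is still its mean."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        # under the divergence.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster
        # (empty clusters keep their previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Swapping in another Bregman divergence (e.g., KL divergence for text data) changes only the assignment-step distance; the mean remains the optimal center, which is what makes the Bregman family attractive here.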
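The substitution of cannot-link constraints by a penalty could look roughly like the sketch below: the assignment step pays an extra cost for placing a point in a cluster that already contains one of its cannot-link partners. The penalty weight, the greedy point ordering, and the function name are illustrative assumptions, not the dissertation's exact formulation.

```python
import numpy as np

def assign_with_cannot_link_penalty(X, centers, cannot_links, penalty=100.0):
    """Sequentially assign each point to the center minimizing
    squared distance plus a penalty for every violated cannot-link
    with an already-assigned point. The penalty weight and the
    greedy ordering are illustrative choices."""
    n = len(X)
    labels = np.full(n, -1)
    for i in range(n):
        # Base cost: squared Euclidean distance to each center.
        cost = ((centers - X[i]) ** 2).sum(axis=1)
        # Add the penalty to any cluster holding a cannot-link partner.
        for (a, b) in cannot_links:
            other = b if a == i else a if b == i else None
            if other is not None and labels[other] != -1:
                cost[labels[other]] += penalty
        labels[i] = int(cost.argmin())
    return labels
```

With a large penalty the assignment behaves like a hard cannot-link constraint; with a small one the constraint becomes soft and can be traded off against distance, which is the flexibility the penalty formulation buys.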