Graduation Year

2004

Document Type

Thesis

Degree

M.S.C.S.

Degree Granting Department

Computer Science

Major Professor

Hall, Lawrence

Keywords

ensemble, merging, filtering, disputed examples, extrema

Abstract

Clustering large data sets recently has emerged as an important area of research. The ever-increasing size of data sets and poor scalability of clustering algorithms has drawn attention to distributed clustering for partitioning large data sets. Sometimes, centrally pooling the distributed data is also expensive. There might be also constraints on data sharing between different distributed locations due to privacy, security, or proprietary nature of the data. In this work we propose an algorithm to cluster large-scale data sets without centrally pooling the data. Data at distributed sites are clustered independently i.e. without any communication among them. After partitioning the local/distributed sites we send only the centroids of each site to a central location. Thus there is very little bandwidth cost in a wide area network scenario.

The distributed sites/subsets neither exchange cluster labels nor individual data features thus providing the framework for privacy preserving distributive clustering. Centroids from each local site form an ensemble of centroids at the central site. Our assumption is that data in all distributed locations are from the same underlying distribution and the set of centroids obtained by partitioning the data in each subset/distributed location gives us partial information about the position of the cluster centroids in that distribution. Now, the problem of finding a global partition using the limited knowledge of the ensemble of centroids can be viewed as the problem of reaching a global consensus on the position of cluster centroids.

A global consensus on the position of cluster centroids of the global data using only the very limited statistics of the position of centroids from each local site is reached by grouping the centroids into consensus chains and computing the weighted mean of centroids in a consensus chain to represent a global cluster centroid. We compute the Euclidean distance of each example from the global set of centroids, and assign it to the centroid nearest to it. Experimental results show that quality of clusters generated by our algorithm is similar to the quality of clusters generated by clustering all the data at a time. We have shown that the disputed examples between the clusters generated by our algorithm and clustering all the data at a time lay on the border of clusters as expected. We also proposed a centroid-filtering algorithm to make partitions formed by our algorithm better.

Share

COinS