Graduation Year

2007

Document Type

Dissertation

Degree

Ph.D.

Degree Granting Department

Computer Science and Engineering

Major Professor

Lawrence O. Hall, Ph.D.

Keywords

Partitioning, Hard-c-means, Fuzzy-c-means, Scalability, Merging, Streaming

Abstract

Clustering algorithms are an important tool for data mining and data analysis purposes. Clustering algorithms fall under the category of unsupervised learning algorithms, which can group patterns without an external teacher or labels using some kind of similarity metric. Clustering algorithms are generally iterative in nature and computationally intensive. They will have disk accesses in every iteration for data sets larger than memory, making the algorithms unacceptably slow. Data could be processed in chunks, which fit into memory, to provide a scalable framework. Multiple processors may be used to process chunks in parallel. Clustering solutions from each chunk together form an ensemble and can be merged to provide a global solution. So, merging multiple clustering solutions, an ensemble, is important for providing a scalable framework.

Combining multiple clustering solutions or partitions, is also important for obtaining a robust clustering solution, merging distributed clustering solutions, and providing a knowledge reuse and privacy preserving data mining framework. Here we address combining multiple clustering solutions in a scalable framework. We also propose algorithms for incrementally clustering large or very large data sets. We propose an algorithm that can cluster large data sets through a single pass. This algorithm is also extended to handle clustering infinite data streams. These types of incremental/online algorithms can be used for real time processing as they don't revisit data and are capable of processing data streams under the constraint of limited buffer size and computational time. Thus, different frameworks/algorithms have been proposed to address scalability issues in different settings.

To our knowledge we are the first to introduce scalable algorithms for merging cluster ensembles, in terms of time and space complexity, on large real world data sets. We are also the first to introduce single pass and streaming variants of the fuzzy c means algorithm. We have evaluated the performance of our proposed frameworks/algorithms both on artificial and large real world data sets. A comparison of our algorithms with other relevant algorithms is discussed. These comparisons show the scalability and effectiveness of the partitions created by these new algorithms.

Share

COinS