Graduation Year

2007

Document Type

Dissertation

Degree

Ph.D.

Degree Granting Department

Computer Science and Engineering

Major Professor

Lawrence O. Hall, Ph.D.

Keywords

Machine learning, Spatial inhomogeneity, Parallel processing, Complex simulations, Ensemble methods

Abstract

This dissertation explores machine learning in the context of computationally intensive simulations. Complex simulations such as those performed at Sandia National Laboratories for the Advanced Strategic Computing program may contain multiple terabytes of data. The amount of data is so large that it is computationally infeasible to transfer it between nodes on a supercomputer. In order to create the simulation, data is distributed spatially. For example, if this dissertation were to be broken apart spatially, the binding might be one partition, the first fifty pages another partition, the top three inches of every remaining page another partition, and the remainder confined to the last partition. This distribution of data is not conducive to learning with existing machine learning algorithms, as it violates some standard assumptions, the most important being that data is independently and identically distributed (i.i.d.).

Unique algorithms must be created in order to deal with the spatially distributed data. Another problem that this dissertation addresses is learning from large data sets in general. The pervasive spread of computers into so many areas has enabled data capture from places that previously did not have available data. Various algorithms for speeding up classification of small and medium-sized data sets have been developed over the past several years. Most of these build a multiple classifier system in which the fusion of many classifiers results in higher accuracy than that obtained by any single classifier. Most also have a direct application to the problem of learning from large data sets. In this dissertation, a thorough statistical analysis of several of these algorithms is provided on 57 publicly available data sets. Random forests, in particular, are able to achieve some of the highest accuracy results while speeding up classification significantly.

Random forests, through a classifier fusion strategy known as Probabilistic Majority Voting (PMV) and a variant referred to as Weighted Probabilistic Majority Voting (wPMV), were used on two simulations. The first simulation is of a canister being crushed in the same fashion as a human might crush a soda can. Each of half a million physical data points in the simulation contains nine attributes. In the second simulation, a casing is dropped on the ground. This simulation contains 21 attributes and over 1,500,000 data points. Results show that reasonable accuracy can be obtained by using PMV or wPMV, but this accuracy is not as high as that obtained by using all of the data in a non-spatially partitioned environment. In order to increase the accuracy, a semi-supervised algorithm was developed.

This algorithm is capable of increasing the accuracy by several percentage points over that obtained by using all of the non-partitioned data, and it offers additional benefits, such as reducing the number of labeled examples that scientists would otherwise have to identify manually. It also more accurately reflects the real-world usage situations that scientists encounter when applying these machine learning techniques to new simulations.
