Graduation Year

2010

Document Type

Dissertation

Degree

Ph.D.

Degree Granting Department

Computer Science and Engineering

Major Professor

Lawrence O. Hall, Ph.D.

Keywords

IFP, Feature selection, Classification, T-test, Data mining, SVM-RFE, SVM

Abstract

Gene expression microarray datasets often consist of a limited number of samples with a large number of gene expression measurements, usually on the order of thousands. These characteristics can negatively impact the prediction accuracy of a classification model, so dimensionality reduction is critical prior to any classification task. We introduce the iterative feature perturbation method (IFP), an embedded gene selector that iteratively discards non-relevant features. Relevant features are those that, when perturbed with noise, cause a change in the predictive accuracy of the classification model; non-relevant features cause no such change. We apply IFP to four cancer microarray datasets: colon, leukemia, Moffitt colon, and lung. We compare results obtained by IFP to those of SVM-RFE and the t-test, using a linear support vector machine as the classifier.
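
As a rough illustration of the IFP idea only (not the dissertation's implementation), the sketch below perturbs each surviving feature with noise, measures the change in a linear SVM's validation accuracy, and discards the features whose perturbation changes accuracy the least. The noise scale, the fraction of features dropped per iteration, and the stopping rule are assumptions made for the example.

```python
# Minimal sketch of the iterative feature perturbation (IFP) idea.
# Assumes scikit-learn and NumPy; parameter choices are illustrative.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def ifp_select(X_train, y_train, X_val, y_val,
               drop_frac=0.1, noise_scale=1.0, min_features=10, seed=0):
    rng = np.random.default_rng(seed)
    active = np.arange(X_train.shape[1])          # indices of surviving features
    while active.size > min_features:
        clf = SVC(kernel="linear").fit(X_train[:, active], y_train)
        base_acc = accuracy_score(y_val, clf.predict(X_val[:, active]))

        # Perturb each active feature with noise and record the accuracy change.
        impact = np.empty(active.size)
        for j in range(active.size):
            X_pert = X_val[:, active].copy()
            X_pert[:, j] += rng.normal(
                0.0, noise_scale * X_pert[:, j].std() + 1e-12,
                size=X_pert.shape[0])
            impact[j] = abs(base_acc - accuracy_score(y_val, clf.predict(X_pert)))

        # Discard the features whose perturbation changes accuracy the least.
        n_drop = max(1, int(drop_frac * active.size))
        active = active[np.argsort(impact)[n_drop:]]
    return active
```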

We use the entire set of features in each dataset, as well as a preselected set of 200 features per dataset (based on p-values from the t-test). When using the entire set of features, IFP achieves accuracy comparable to (and at some points higher than) SVM-RFE on three of the four datasets. The t-test feature ranking produces classifiers with the highest accuracy across the four datasets. When using 200 features, the accuracy results show up to a 3% performance improvement for both IFP and SVM-RFE across the four datasets. We corroborate these results with an AUC analysis and a statistical analysis using the Friedman/Holm test. In addition to the t-test, we use information gain and relief as filters and compare all three. The AUC analysis shows that IFP and SVM-RFE obtain the highest AUC values when applied to the t-test-filtered datasets, a result further corroborated by the statistical analysis.

The percentage of overlap between the gene sets selected by any two methods across the four datasets indicates that different sets of genes can yield similar accuracies. We also create ensembles of classifiers using the bagging technique with IFP, SVM-RFE, and the t-test, and show that their performance can be equivalent to, and in some cases better than, that of the non-bagging cases.
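
For illustration only, the following is a minimal sketch of bagging a linear SVM on top of a generic feature selector, where `select_features` stands in for IFP, SVM-RFE, or the t-test ranking; the number of bags, the binary 0/1 labels, and the majority-vote rule are assumptions of the example, not the dissertation's setup.

```python
# Minimal sketch of bagging a linear SVM over a pluggable feature selector.
import numpy as np
from sklearn.svm import SVC

def bagged_predict(X_train, y_train, X_test, select_features, n_bags=25, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros((n_bags, X_test.shape[0]))
    n = X_train.shape[0]
    for b in range(n_bags):
        idx = rng.integers(0, n, size=n)                      # bootstrap sample
        feats = select_features(X_train[idx], y_train[idx])   # e.g. an IFP ranking
        clf = SVC(kernel="linear").fit(X_train[idx][:, feats], y_train[idx])
        votes[b] = clf.predict(X_test[:, feats])
    # Majority vote across the bagged classifiers (binary 0/1 labels assumed).
    return (votes.mean(axis=0) >= 0.5).astype(int)
```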
