Graduation Year

2020

Document Type

Thesis

Degree

M.S.

Degree Name

Master of Science (M.S.)

Degree Granting Department

Computer Science and Engineering

Major Professor

Lawrence Hall, Ph.D.

Committee Member

Dmitry B. Goldgof, Ph.D.

Committee Member

Yu Sun, Ph.D.

Keywords

Concordance Correlation Coefficient Random Subspace Method, Gene Expression, Gini Index, High Dimensional Data, Random Subspace Method

Abstract

Feature selection is crucial in many applications, including computational biology, image classification, and risk management. In biology, gene expression microarray data sets have been used extensively in many areas of research. These data sets typically suffer from an important problem: the ratio of the number of features to the number of examples is very high. This problem mainly affects prediction accuracy, because it is preferable to have more labeled examples than features. A correlation-based random subspace ensemble feature selector (CCC_RSM) was proposed to handle this problem [5]. The approach first determines the features most relevant for prediction. Next, it groups these features based on their correlation with each other. Then, one feature is randomly chosen from each correlated group, and the selected features form a feature subset. The CCC_RSM algorithm repeats this selection step a predefined number of times. The algorithm’s performance is evaluated with ensembles of either decision trees or Support Vector Machines. Combining these models’ predictions can significantly increase prediction accuracy. In these ensembles, each classifier casts a vote, and the majority vote produces the final class prediction. This design modifies the random subspace method ensemble [13].
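
The subset-generation and voting steps described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `sample_feature_subsets` and `majority_vote`, the example `groups`, and the fixed seed are all assumptions made for illustration, and the relevance-filtering and correlation-grouping steps are taken as already done.

```python
import random
from collections import Counter


def sample_feature_subsets(groups, n_subsets, seed=0):
    """Build n_subsets feature subsets, each with one random pick per group.

    `groups` is a list of lists of feature indices; features in the same
    inner list are assumed to be mutually correlated.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [[rng.choice(group) for group in groups] for _ in range(n_subsets)]


def majority_vote(votes):
    """Return the most frequent class label among the classifiers' votes.

    Ties go to the label that was seen first (Counter keeps insertion order).
    """
    return Counter(votes).most_common(1)[0][0]


# Hypothetical groups of correlated feature indices (not from the thesis).
groups = [[0, 3, 7], [1, 4], [2, 5, 6, 8]]

# One subset per base classifier in the ensemble.
subsets = sample_feature_subsets(groups, n_subsets=5)

# Hypothetical votes from five base classifiers for a single test example.
prediction = majority_vote(["tumor", "normal", "tumor", "tumor", "normal"])
```

In a full pipeline, each entry of `subsets` would be used to train one decision tree or Support Vector Machine, and `majority_vote` would combine their predictions for each test example.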

This study focuses on finding alternative feature selectors for the first step so that the CCC_RSM algorithm can obtain good, or even better, classification performance. We used four microarray gene expression data sets in the experiments. Starting from the original algorithm, we used the Gini Index in place of Relief-F. The alternative method’s outputs were analyzed in detail using (1) overall and average accuracy, (2) sensitivity, (3) specificity, and (4) F-measure. The alternative method gave the highest F-measure score for the Leukemia (1.00), Breast (0.98), Colon (0.91) and CNS (0.81) data sets.
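
For reference, the evaluation measures listed above can be computed from a binary confusion matrix as in the following sketch. It assumes the standard definitions (with F-measure taken as the harmonic mean of precision and sensitivity) and covers only overall accuracy. The abstract does not state the exact formulas used, and the counts in the example are made up for illustration, not taken from the thesis.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity and F-measure.

    tp, fp, tn, fn are the true-positive, false-positive, true-negative
    and false-negative counts of a binary classifier.
    """
    precision = _ratio(tp, tp + fp)
    sensitivity = _ratio(tp, tp + fn)  # also called recall
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, tn + fp),
        "f_measure": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Hypothetical counts, chosen only to exercise the function.
metrics = binary_metrics(tp=8, fp=2, tn=9, fn=1)
```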
