Graduation Year


Document Type




Degree Granting Department

Chemical and Biomedical Engineering

Major Professor

Steven A. Eschrich, Ph.D.

Co-Major Professor

Dmitry Goldgof, Ph.D.

Committee Member

John Heine, Ph.D.

Committee Member

Rangachar Kasturi, Ph.D.

Committee Member

Ji-Hyun Lee, Dr.PH.


Keywords

quantization, survival analysis, random subspaces, cost-sensitive analysis, biological covariates


Cancer can develop through a series of genetic events in combination with external influential factors that alter the progression of the disease. Gene expression studies are designed to provide an enhanced understanding of the progression of cancer and to develop clinically relevant biomarkers of disease, prognosis and response to treatment. One of the main aims of microarray gene expression analyses is to develop signatures that are highly predictive of specific biological states, such as the molecular stage of cancer. This dissertation analyzes the classification complexity inherent in gene expression studies, proposing both techniques for measuring complexity and algorithms for reducing this complexity.

Classifier algorithms that generate predictive signatures of cancer models must generalize to independent datasets for successful translation to clinical practice. The predictive performance of classifier models is shown to be dependent on the inherent complexity of the gene expression data. Three specific quantitative measures of classification complexity are proposed and one measure (f) is shown to correlate highly (R² = 0.82) with classifier accuracy in experimental data.
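To make the reported correlation concrete, the following minimal sketch computes a coefficient of determination between a per-dataset complexity score and classifier accuracy. The helper name `r_squared` and all numbers are hypothetical placeholders, not values from the dissertation, and the abstract does not define the measure f itself.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


# Placeholder values: one complexity score and one accuracy per dataset.
measure_f = [0.2, 0.5, 0.9, 1.4, 2.0]
accuracy = [0.55, 0.62, 0.70, 0.81, 0.88]
print(round(r_squared(measure_f, accuracy), 3))
```

A value near 1 would indicate that the complexity score explains most of the variation in accuracy across datasets under a linear fit.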

Three quantization methods are proposed to enhance contrast in gene expression data and reduce classification complexity. The accuracy for cancer prognosis prediction is shown to improve using quantization in two datasets studied: from 67% to 90% in lung cancer and from 56% to 68% in colorectal cancer. A corresponding reduction in classification complexity is also observed.
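The abstract does not describe the three quantization methods. As one illustrative possibility only, the sketch below maps each gene's expression values to a small number of discrete levels using per-gene quantile cut points. The function names and the choice of quantile cuts are assumptions made for this example, not the dissertation's methods.

```python
from statistics import quantiles


def quantize_gene(values, n_levels=3):
    """Map one gene's expression values to integer levels 0..n_levels-1.

    Cut points are the gene's own quantiles (an assumed scheme).
    """
    cuts = quantiles(values, n=n_levels)  # returns n_levels - 1 cut points
    # A value's level is the number of cut points it exceeds.
    return [sum(v > c for c in cuts) for v in values]


def quantize_matrix(genes_by_samples, n_levels=3):
    """Quantize each gene (row) independently of the others."""
    return [quantize_gene(row, n_levels) for row in genes_by_samples]


# Toy expression matrix: 2 genes x 6 samples (placeholder numbers).
expression = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [7.2, 0.4, 3.3, 9.1, 0.9, 5.0],
]
for levels in quantize_matrix(expression):
    print(levels)
```

Quantizing per gene discards small within-level differences, which is one way such a step could sharpen contrast between samples before a classifier is trained.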

A random-subspace-based multivariable feature selection approach using cost-sensitive analysis is proposed to model the underlying heterogeneous cancer biology and address complexity due to multiple molecular pathways and unbalanced distribution of samples into classes. The technique is shown to be more accurate than the univariate t-test method. The classifier accuracy improves from 56% to 68% for colorectal cancer prognosis prediction.
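The general idea of random-subspace feature ranking with class-dependent costs can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's algorithm: the nearest-centroid classifier, the inverse-frequency costs, the function names, and scoring on the training samples (a real study would score on held-out data) are all choices made here for brevity.

```python
import random
from collections import defaultdict


def centroid(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_accuracy(X, y, features, costs):
    """Cost-weighted training accuracy of a nearest-centroid rule."""
    sub = [[row[j] for j in features] for row in X]
    cents = {c: centroid([r for r, lab in zip(sub, y) if lab == c])
             for c in set(y)}
    gained = total = 0.0
    for r, lab in zip(sub, y):
        pred = min(cents, key=lambda c: sq_dist(r, cents[c]))
        total += costs[lab]
        if pred == lab:
            gained += costs[lab]
    return gained / total


def random_subspace_ranking(X, y, subspace_size=2, n_rounds=200, seed=0):
    """Rank features by the mean score of the random subsets they joined."""
    rng = random.Random(seed)
    counts = {c: y.count(c) for c in set(y)}
    # Rarer classes get a larger cost, so their errors weigh more.
    costs = {c: len(y) / (len(counts) * n) for c, n in counts.items()}
    score_sum, times_drawn = defaultdict(float), defaultdict(int)
    for _ in range(n_rounds):
        features = rng.sample(range(len(X[0])), subspace_size)
        score = weighted_accuracy(X, y, features, costs)
        for j in features:
            score_sum[j] += score
            times_drawn[j] += 1
    mean_score = {j: score_sum[j] / times_drawn[j] for j in times_drawn}
    return sorted(mean_score, key=mean_score.get, reverse=True)


# Toy data: feature 0 separates the classes, features 1-3 do not.
X = [[0.1, 1.0, 0.0, 1.0], [0.2, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0],
     [0.3, 0.0, 0.0, 1.0], [9.8, 1.0, 0.0, 0.0], [10.1, 0.0, 1.0, 1.0]]
y = [0, 0, 0, 0, 1, 1]
ranking = random_subspace_ranking(X, y)
print(ranking)
```

Because features are scored in groups rather than one at a time, a feature that is only informative in combination with others can still rank well, which a univariate t-test cannot capture.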

A published gene expression signature to predict radiosensitivity of tumor cells is augmented with clinical indicators to enhance modeling of the data and represent the underlying biology more closely. Statistical tests and experiments indicate that the improvement in the model fit is a result of modeling the underlying biology rather than statistical over-fitting of the data, thereby accommodating classification complexity through the use of additional variables.
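The abstract does not name the statistical tests used. One common way to check whether added covariates improve a nested model beyond what chance would give is a likelihood-ratio test, sketched below as an assumption rather than as the dissertation's procedure. The function names and log-likelihood values are hypothetical placeholders, and the chi-square tail is implemented only for one or two added parameters, where closed forms exist.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, for df in {1, 2}."""
    if x < 0:
        raise ValueError("statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only supports df = 1 or df = 2")


def likelihood_ratio_test(loglik_base, loglik_augmented, extra_params):
    """Compare a base model with a nested model that adds covariates.

    Returns the test statistic and its approximate p-value.
    """
    stat = max(2.0 * (loglik_augmented - loglik_base), 0.0)
    return stat, chi2_sf(stat, extra_params)


# Placeholder log-likelihoods: signature alone vs. signature plus one
# clinical indicator.
stat, p_value = likelihood_ratio_test(-120.0, -115.0, extra_params=1)
print(f"LR statistic = {stat:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the added variable improves the fit more than an uninformative variable would be expected to, which is one piece of evidence against over-fitting; validation on independent data remains the stronger check.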