Graduation Year


Document Type




Degree Name

Doctor of Philosophy (Ph.D.)

Degree Granting Department

Biology (Cell Biology, Microbiology, Molecular Biology)

Major Professor

Bin Xue, Ph.D.

Committee Member

Robert Deschenes, Ph.D.

Committee Member

Vladimir Uversky, Ph.D.

Committee Member

Yu Sun, Ph.D.

Committee Member

Brant Burkhardt, Ph.D.


Gene Expression, MicroRNA:mRNA Prediction, Protein-protein Interaction


The function, behavior, and environmental response of biological systems are essentially determined by the complex interaction and regulation of the biomolecules inside them. It is therefore critical to characterize these inter-molecular interactions and regulatory relationships. To this end, many experimental techniques have been developed and applied to many different model systems under various conditions, and a massive amount of data has been generated as a result. These data cover multiple aspects of molecular interaction and regulation, such as protein-protein interaction, microRNA-RNA interaction, and gene expression profiles. While carrying rich information, these data may also contain significant levels of error. In addition, decoding these data to extract meaningful molecular mechanisms is still very challenging.

In this dissertation, our recent efforts on data cleaning and data mining are summarized in the following aspects: (1) Microarray gene expression data analysis. Traditional techniques for gene expression analysis are largely dependent on p-values calculated from statistical models, and their accuracy and reproducibility raise serious concerns. Therefore, we designed a distance method based on the Euclidean distance calculated from the expression levels of a set of pre-ranked genes in both control and treatment groups. The genes are pre-ranked by fold-change. As more and more pre-ranked genes are included in the calculation, the distance normally decreases monotonically. Therefore, by selecting a specific cutoff value, a group of genes is identified. In practice, a standard deviation cutoff associated with a sliding window on the curve of distance as a function of the number of pre-ranked genes was used to select the group of genes. This group of genes determines the genotypical and phenotypical differences between control samples and treatment samples. On data sets developed in the Microarray Quality Control project, the true-positive rate of the distance method was much higher than that of traditional methods.
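The distance method described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the abstract does not give the normalization of the distance, so a per-gene average (root-mean-square difference of group means on a log2 scale) is assumed here, since a raw Euclidean distance would grow rather than shrink as genes are added. The function names, the window size, and the cutoff value are all hypothetical.

```python
import numpy as np

def distance_curve(control, treatment):
    """Rank genes by fold-change and compute a distance for each top-k prefix.

    control, treatment: arrays of shape (genes, samples), assumed log2 scale.
    Returns the ranked gene indices and the distance for k = 1..n genes.
    """
    # Log fold-change between group means (assumption: data are log2 values).
    log_fc = treatment.mean(axis=1) - control.mean(axis=1)
    # Pre-rank genes by absolute fold-change, largest first.
    order = np.argsort(-np.abs(log_fc))
    squared = log_fc[order] ** 2
    k = np.arange(1, len(squared) + 1)
    # Assumed normalization: Euclidean distance over the top-k genes,
    # divided by sqrt(k), so the curve is non-increasing in k.
    dist = np.sqrt(np.cumsum(squared) / k)
    return order, dist

def select_genes(order, dist, window=5, sd_cutoff=0.05):
    """Keep the genes ranked before the first 'flat' window of the curve.

    A window is called flat when the standard deviation of the distance
    values inside it falls below sd_cutoff (hypothetical default values).
    """
    for start in range(len(dist) - window + 1):
        if np.std(dist[start:start + window]) < sd_cutoff:
            return order[:start]
    return order  # the curve never flattened; keep every gene

# Toy data: 30 genes, 3 samples per group, three genes with a real shift.
control = np.full((30, 3), 5.0)
shift = np.zeros(30)
shift[:3] = [4.0, -3.0, 2.0]
treatment = control + shift[:, None]

order, dist = distance_curve(control, treatment)
selected = select_genes(order, dist)
```

With this normalization the curve decays slowly, so the number of genes returned is sensitive to the window size and the cutoff. Both would need to be tuned on reference data such as the Microarray Quality Control sets mentioned above.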

(2) MicroRNA:mRNA interaction prediction. The differentially expressed genes (DEGs) identified from microarray data analysis represent genes at the RNA level, but not at the protein level. It is well known that a high level of RNA may not necessarily result in a high level of protein, because the translation process from RNA to protein is regulated by multiple factors. For example, microRNAs may interact with an mRNA to inhibit its translation or to degrade it, and thus lead to a low level of protein molecules. Therefore, by integrating RNA-level gene expression analysis and microRNA:mRNA interaction prediction, it is feasible to determine gene expression at the protein level, at least for genes whose translation is mainly regulated by microRNAs. For this purpose, a novel meta-strategy was developed for microRNA target prediction. This strategy improved prediction accuracy significantly compared with many other microRNA target predictors.
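The integration step described above can be illustrated with a small sketch. The abstract does not say how the meta-strategy combines individual predictors, so a simple majority vote is used here purely as a stand-in. The predictor names, the `min_votes` threshold, and both function names are hypothetical.

```python
def consensus_targets(predictions, min_votes=2):
    """Keep microRNA:mRNA pairs reported by at least min_votes predictors.

    predictions: dict mapping a predictor name to an iterable of
    (mirna, mrna) pairs. Majority voting is an assumed combination rule.
    """
    votes = {}
    for predicted_pairs in predictions.values():
        for pair in set(predicted_pairs):  # count each predictor once per pair
            votes[pair] = votes.get(pair, 0) + 1
    return {pair for pair, count in votes.items() if count >= min_votes}

def protein_level_degs(up_regulated, targets, expressed_mirnas):
    """Drop up-regulated genes whose mRNA is targeted by an expressed microRNA."""
    repressed = {mrna for mirna, mrna in targets if mirna in expressed_mirnas}
    return set(up_regulated) - repressed

# Hypothetical predictor outputs.
predictions = {
    "toolA": [("miR-1", "G1"), ("miR-2", "G2")],
    "toolB": [("miR-1", "G1")],
    "toolC": [("miR-1", "G1"), ("miR-2", "G2"), ("miR-3", "G3")],
}
targets = consensus_targets(predictions)
remaining = protein_level_degs({"G1", "G2", "G4"}, targets, {"miR-1"})
```

In this toy case G1 is up-regulated at the RNA level but is removed because miR-1, which is expressed, is predicted to target it. G2 stays because miR-2 is not among the expressed microRNAs.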

(3) Biology-specific and microRNA-regulated protein-protein interaction networks. Proteins frequently interact with each other to regulate various biological processes. Experimentally validated protein-protein interaction data have been deposited into databases. Nonetheless, the interaction data in these databases are not specific to tissue, cell, or biological condition. In addition, the interaction data are not consistent between databases. Finally, many IDs in these databases are mislabeled; the fraction of mislabeled IDs in some databases is as high as 15%. In this project, an automatic pipeline was developed to remove mislabeled IDs, and multiple protein-protein interaction databases were integrated. Thus, the protein-protein interaction information in the new database is more accurate. Furthermore, gene expression analysis under specific biological conditions and microRNA-regulated gene expression at the protein level are integrated into protein-protein interaction networks to generate biology-specific, microRNA-regulated protein-protein interaction networks.
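The cleaning, merging, and filtering steps described above could look roughly like the following sketch. It rests on several assumptions: each database is reduced to a list of ID pairs, a mislabeled ID is detected simply by its absence from a trusted reference set (the abstract does not say how the actual pipeline detects mislabeling), and interactions are undirected. All names and IDs are hypothetical.

```python
def clean_edges(edges, valid_ids):
    """Drop interactions that involve an ID missing from the reference set."""
    cleaned = set()
    for a, b in edges:
        if a in valid_ids and b in valid_ids:
            # frozenset makes (a, b) and (b, a) the same undirected edge.
            cleaned.add(frozenset((a, b)))
    return cleaned

def merge_databases(databases, valid_ids):
    """Union of the cleaned edges from every source database."""
    merged = set()
    for edges in databases:
        merged |= clean_edges(edges, valid_ids)
    return merged

def condition_specific_network(merged, expressed, mirna_repressed):
    """Keep edges whose proteins are expressed and not microRNA-repressed."""
    present = set(expressed) - set(mirna_repressed)
    return {edge for edge in merged if edge <= present}

# Hypothetical inputs: two small databases and a reference ID set.
db1 = [("P1", "P2"), ("P2", "P3"), ("X9", "P1")]  # "X9" is not a valid ID
db2 = [("P2", "P1"), ("P3", "P4")]
valid_ids = {"P1", "P2", "P3", "P4"}

merged = merge_databases([db1, db2], valid_ids)
network = condition_specific_network(
    merged, expressed={"P1", "P2", "P3", "P4"}, mirna_repressed={"P4"}
)
```

Storing each edge as a `frozenset` removes the duplicate that arises when two databases list the same interaction in opposite orders, which is one simple way to reconcile inconsistent sources.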

In summary, we developed a novel strategy to identify differentially expressed genes at the RNA level. We also developed a high-accuracy, meta-strategy-based miRNA target predictor to predict microRNA:mRNA base pairing and to identify mRNAs that may not be translated into proteins. We further removed mislabeled IDs from PPI databases and combined multiple PPI databases into a new database. By combining all of the above results, we developed a novel strategy and pipeline to determine high-accuracy protein-protein interaction networks, as well as the rewiring of these networks regulated by specific gene expression and miRNA-induced inhibition of protein synthesis.