Graduation Year


Document Type




Degree Name

Doctor of Philosophy (Ph.D.)

Degree Granting Department

Computer Science and Engineering

Major Professor

Xinming Ou, Ph.D.

Committee Member

Lawrence Hall, Ph.D.

Committee Member

Jarred Ligatti, Ph.D.

Committee Member

Nasir Ghani, Ph.D.

Committee Member

Jiyong Jang, Ph.D.


Binary Similarity Analysis, Malware Clustering, Malware Detection, Machine Learning


Malware analysis and detection continue to be one of the central battlefields for the cybersecurity industry. In the desktop malware domain, we have observed multiple significant ransomware attacks in the past several years; for example, the 2017 WannaCry ransomware attack was estimated to have affected more than 200,000 computers across 150 countries, causing hundreds of millions in damages. Similarly, we have witnessed the growing impact of Android malware on individuals worldwide, driven by the popularity of smartphones and IoT devices. In this dissertation, we describe novel similarity-comparison-based techniques that can be applied to large-scale desktop and Android malware analysis, and we examine the practical implications of machine-learning-based approaches for malware detection.

First, we propose a generic and effective solution for accurate and efficient binary similarity analysis of desktop malware. Binary similarity analysis is an essential technique for a variety of security analysis tasks, including malware detection and malware clustering. Although various solutions have been developed, existing binary similarity analysis methods still suffer from limited efficiency, accuracy, and usability. In this work, we propose a novel graphical fuzzy hashing scheme that addresses these limitations. We first abstract the control flow graphs (CFGs) of binary code to extract blended n-gram graphical features, and then encode these features into numeric vectors (called graph signatures), so that similarity can be measured by comparing graph signatures. We further leverage a fuzzy hashing technique to convert the numeric graph signatures into smaller, fixed-size fuzzy hash outputs for efficient comparison. Our comprehensive evaluation demonstrates that our CFG comparison based on blended n-gram graphical features is more effective and efficient than existing CFG comparison techniques. Based on our CFG comparison method, we develop BingSim, a binary similarity analysis tool, and show that BingSim outperforms existing binary similarity analysis tools when performing similarity-based malware detection and malware clustering.
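The general idea of turning a CFG into a comparable numeric signature can be illustrated with a minimal Python sketch. It is not the dissertation's algorithm: the block labeling (out-degree only), the reading of "blended" as mixing path n-grams of several lengths, the CRC32 bucketing, and the cosine comparison are all assumptions chosen for illustration, and the fuzzy hashing step is omitted.

```python
import math
import zlib
from collections import Counter


def node_label(cfg, node):
    # Assumed abstraction: a basic block is described only by its out-degree.
    return str(len(cfg[node]))


def path_ngrams(cfg, max_n=3):
    """Count label sequences along every directed path of 1..max_n nodes."""
    counts = Counter()

    def walk(node, labels):
        labels = labels + (node_label(cfg, node),)
        counts[labels] += 1
        if len(labels) < max_n:  # depth bound keeps cycles from looping forever
            for succ in cfg[node]:
                walk(succ, labels)

    for node in cfg:
        walk(node, ())
    return counts


def signature(cfg, dims=64, max_n=3):
    """Fold the n-gram counts into a fixed-length numeric vector."""
    vec = [0] * dims
    for gram, count in path_ngrams(cfg, max_n).items():
        bucket = zlib.crc32("/".join(gram).encode()) % dims
        vec[bucket] += count
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy CFGs as adjacency lists (hypothetical data).
cfg_a = {"entry": ["check"], "check": ["body", "exit"], "body": ["check"], "exit": []}
cfg_b = {"start": ["test"], "test": ["loop", "done"], "loop": ["test"], "done": []}  # same shape, renamed
cfg_c = {"entry": ["a"], "a": ["b"], "b": ["exit"], "exit": []}  # straight-line code

sim_same = cosine(signature(cfg_a), signature(cfg_b))
sim_diff = cosine(signature(cfg_a), signature(cfg_c))
print(f"same shape: {sim_same:.3f}, different shape: {sim_diff:.3f}")
```

Because the signature depends only on graph structure, the two renamed but structurally identical graphs compare as equal, while the straight-line graph scores lower. A real system would need far richer block abstractions than out-degree alone.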

Second, we identify the challenges faced by overall-similarity-based Android malware clustering and design a specialized system to address them. Clustering has been well studied in desktop malware analysis as an effective triage method. Conventional similarity-based clustering techniques, however, cannot be directly applied to Android malware analysis because of the heavy use of third-party libraries in Android application development and the widespread use of repackaging in malware development. We design and implement an Android malware clustering system that iteratively mines malicious payloads and checks whether malware samples share the same version of a malicious payload. Our system uses a hierarchical clustering technique and an efficient bit-vector format to represent Android apps. Experimental results demonstrate that our clustering approach achieves a precision of 0.90 and a recall of 0.75 on the Android Genome malware dataset, and an average precision of 0.98 and recall of 0.96 with respect to manually verified ground truth.
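A rough Python sketch of how a bit-vector representation supports both shared-code extraction and hierarchical clustering follows. It is an illustration under stated assumptions, not the system described above: the feature indices, the Jaccard similarity, the single-linkage merge rule, and the 0.4 threshold are all hypothetical, and the iterative payload mining and version checking are not reproduced.

```python
def to_bitvector(feature_ids):
    """Pack a set of feature indices into one Python integer, one bit per feature."""
    bits = 0
    for i in feature_ids:
        bits |= 1 << i
    return bits


def popcount(bits):
    return bin(bits).count("1")


def jaccard(a, b):
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def cluster(vectors, threshold):
    """Single-linkage agglomerative clustering: repeatedly merge the most
    similar pair of clusters until no pair reaches the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while True:
        best, pair = threshold, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = max(jaccard(vectors[i], vectors[j])
                            for i in clusters[x] for j in clusters[y])
                if score >= best:
                    best, pair = score, (x, y)
        if pair is None:
            return clusters
        x, y = pair
        clusters[x] += clusters.pop(y)  # y > x, so index x stays valid


# Hypothetical apps: features 0-2 and 5-7 stand for two shared code fragments,
# the remaining indices for app-specific code.
apps = [{0, 1, 2, 10, 11}, {0, 1, 2, 20, 21}, {5, 6, 7, 30}, {5, 6, 7, 31}]
vectors = [to_bitvector(features) for features in apps]

shared = vectors[0] & vectors[1]      # bitwise AND keeps only the common features
clusters = cluster(vectors, threshold=0.4)
print(bin(shared), clusters)
```

Storing each app as a single integer makes the intersection of two apps one AND operation, which is what makes pairwise comparison cheap at scale.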

Third, we study the fundamental issues faced by traditional machine learning (ML) based Android malware detection systems and examine the role of ML for Android malware detection in practice. This leads to a revised evaluation strategy that assesses an ML-based malware detection system by its zero-day detection capability. Existing ML-based Android malware research obtains its ground truth by consulting AV products and uses the same label set for training and testing. However, there is a mismatch between how such ML systems have been evaluated and the true purpose of using an ML system in practice. The goal of applying ML is not to reproduce or verify the same potentially imperfect knowledge, but rather to produce something better, that is, closer to the ultimate ground truth about the apps' maliciousness. It is therefore more meaningful to measure a system's zero-day detection capability than its detection accuracy on known malware. This evaluation strategy is aligned with how an ML algorithm can benefit malware detection in practice, because it acknowledges that any ML classifier has to be trained on imperfect knowledge and that such knowledge evolves over time. In addition to traditional malware prediction approaches, we also examine mislabel identification approaches. Through extensive experiments, we demonstrate that: (a) it is feasible to evaluate ML-based Android malware detection systems with regard to their zero-day malware detection capabilities; and (b) both malware prediction and mislabel identification approaches can achieve verifiable zero-day malware detection, even when trained on an old and noisy ground truth dataset.
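One way to picture this evaluation strategy is the small Python sketch below. It assumes two label snapshots for the same apps: an older one available at training time and a newer one obtained later. Apps that are benign in the old snapshot but malicious in the new one are treated as zero-day samples. The function name, the label encoding (1 for malicious, 0 for benign), and the toy data are hypothetical; the dissertation's actual protocol may differ.

```python
def zero_day_report(old_labels, new_labels, predictions):
    """Score a detector only on apps that the OLD labels considered benign."""
    old_benign = {app for app, label in old_labels.items() if label == 0}
    # Zero-day samples: benign under the old labels, malicious under the newer ones.
    zero_day = {app for app in old_benign if new_labels[app] == 1}
    # Apps where the detector disagrees with the old labels.
    flagged = {app for app in old_benign if predictions[app] == 1}
    hits = zero_day & flagged
    return {
        "zero_day_total": len(zero_day),
        "flagged": len(flagged),
        "detected": len(hits),
        "recall": len(hits) / len(zero_day) if zero_day else 0.0,
        "precision": len(hits) / len(flagged) if flagged else 0.0,
    }


# Toy data (hypothetical): 1 = malicious, 0 = benign.
old_labels = {"a": 1, "b": 0, "c": 0, "d": 0, "e": 0}    # knowledge at training time
new_labels = {"a": 1, "b": 1, "c": 1, "d": 0, "e": 0}    # knowledge obtained later
predictions = {"a": 1, "b": 1, "c": 0, "d": 1, "e": 0}   # detector trained on old_labels

report = zero_day_report(old_labels, new_labels, predictions)
print(report)
```

Under this scoring, a classifier that merely reproduces the old labels detects no zero-day samples, however high its conventional accuracy, which is the mismatch the paragraph above describes.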