Graduation Year

2008

Document Type

Thesis

Degree

M.S.C.S.

Degree Granting Department

Computer Science

Major Professor

Lawrence O. Hall, Ph.D.

Keywords

Graph mining, Compound matching, Graph kernel, Graph dataset, Classifier

Abstract

Information contained in chemical compounds, fingerprint databases, social networks, and interactions between websites all have one thing in common: they can be represented as graphs. The need to analyze, compare, and classify graph datasets has become more evident over the last decade. The graph isomorphism problem is known to belong to the NP class, and the subgraph isomorphism problem is known to be an NP-complete problem. Several error-tolerant graph matching techniques have been developed during the last two decades in order to overcome the computational complexity associated with these problems. Some of these techniques rely upon similarity measures based on the topology of the graphs. Random walks and edit distance kernels are examples of such methods. In conjunction with learning algorithms like back-propagation neural networks, k-nearest neighbor, and support vector machines (SVM), these methods provide a way of classifying graphs based on a training set of labeled instances. This thesis presents a novel approach to error-tolerant graph matching based on current flow analysis. Analysis of current flow in electrical networks is a technique that uses the voltages and currents obtained through nodal analysis of a graph representing an electrical circuit. Current flow analysis in electrical networks shares some interesting connections with the number of random walks along the graph. We propose an algorithm to calculate a similarity measure between two graphs based on the current flows along geodesics of the same degree. This similarity measure can be applied over large graph datasets, allowing these datasets to be compared in a reasonable amount of time. This thesis investigates the classification potential of several data mining algorithms based on the information extracted from a graph dataset and represented as current flow vectors. We describe our operational prototype and evaluate its effectiveness on the NCI-HIV dataset.

Share

COinS