USF Tampa Graduate Theses and Dissertations

Combining Natural Language Processing and Statistical Text Mining: A Study of Specialized Versus Common Languages

Jay Jarman, University of South Florida

Graduation Year

2011

Document Type

Dissertation

Degree

Ph.D.

Degree Granting Department

Business Administration

Major Professor

Donald J. Berndt, Ph.D.

Committee Member

Stephen L. Luther, Ph.D.

Committee Member

Balaji Padmanabhan, Ph.D.

Committee Member

Rosann W. Collins, Ph.D.

Keywords

data mining, machine learning, computational linguistics, decision tree, rule mining

Abstract

This dissertation focuses on developing and evaluating hybrid approaches for analyzing free-form text in the medical domain. This research draws on natural language processing (NLP) techniques that are used to parse and extract concepts based on a controlled vocabulary. Once important concepts are extracted, additional machine learning algorithms, such as association rule mining and decision tree induction, are used to discover classification rules for specific targets. This multi-stage pipeline approach is contrasted with traditional statistical text mining (STM) methods based on term counts and term-by-document frequencies. The aim is to create effective text analytic processes by adapting and combining individual methods. The methods are evaluated on an extensive set of real clinical notes annotated by experts to provide benchmark results.
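As an illustration of the term-count representation that STM methods start from, the following minimal sketch builds a term-by-document count matrix in plain Python. The tokenizer, the function names, and the two example notes are invented for illustration; they are not taken from the dissertation or its data.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i]."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Invented toy notes, not real clinical text.
notes = [
    "Patient denies chest pain.",
    "Chest pain radiating to left arm; pain severe.",
]
vocab, matrix = term_document_matrix(notes)
```

Each row is one document and each column one term, so `matrix[1][vocab.index("pain")]` gives the count of "pain" in the second note.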

There are two main research questions for this dissertation. First, can information (specialized language) be extracted from clinical progress notes that will represent the notes without loss of predictive information? Second, can classifiers be built for clinical progress notes that are represented by specialized language? Three experiments were conducted to answer these questions by investigating specific challenges in extracting information from unstructured clinical notes and in classifying the documents that are so important in the medical domain.

The first experiment addresses the first research question by focusing on whether relevant patterns within clinical notes reside more in the highly technical, medically relevant terminology or in the passages expressed in common language. The results from this experiment informed the subsequent experiments. They also show that predictive patterns are preserved when text documents are preprocessed with a grammatical NLP system that separates specialized language from common language, and that this preprocessing is an acceptable method of data reduction for the purpose of STM.
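The idea of separating specialized from common language can be sketched, in a much simplified form, as a lookup against a controlled vocabulary. This swaps a plain dictionary lookup in for the grammatical NLP system the abstract mentions, and the vocabulary, function name, and example sentence are all hypothetical.

```python
import re

# Hypothetical toy controlled vocabulary. A real system would rely on a
# curated medical terminology and grammatical parsing, not a word list.
CONTROLLED_VOCABULARY = {"fracture", "hip", "syncope"}


def split_language(text, vocabulary=CONTROLLED_VOCABULARY):
    """Split tokens into (specialized, common) by membership in the vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    specialized = [t for t in tokens if t in vocabulary]
    common = [t for t in tokens if t not in vocabulary]
    return specialized, common


# Invented example sentence, not real clinical text.
specialized, common = split_language(
    "Patient slipped in the kitchen and sustained a hip fracture."
)
```

Keeping only the `specialized` tokens reduces each document to a much smaller representation, which is the kind of data reduction the experiment evaluates.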

Experiments two and three address the second research question. Experiment two focuses on applying rule-mining techniques to the output of the information extraction effort from experiment one, with the ultimate goal of creating rule-based classifiers. There are several contributions of this experiment. First, it uses a novel approach to create classification rules from specialized language and to build a classifier. The data is split by classification, and then rules are generated. Second, several toolkits were assembled to create the automated process by which the rules were created. Third, this automated process created interpretable rules. Finally, the resulting model provided good accuracy. Its performance was slightly lower than that of the classifier from experiment one, but it had the benefit of interpretable rules.
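One possible reading of "split by classification, then generate rules" is sketched below: candidate concept sets are drawn from each class's documents separately, then kept as rules if they are frequent within that class and rarely seen outside it. This is an assumption about the general shape of such a process, not the dissertation's actual algorithm; the thresholds, function name, concept names, and labels are invented.

```python
from collections import defaultdict
from itertools import combinations


def mine_class_rules(records, min_support=0.5, min_confidence=0.8, max_size=2):
    """records: list of (set_of_concepts, label).

    Returns rules as (antecedent, label, support_within_class, confidence).
    """
    by_class = defaultdict(list)
    for concepts, label in records:
        by_class[label].append(frozenset(concepts))

    rules = []
    for label, docs in by_class.items():
        # Candidate itemsets come only from this class's documents.
        candidates = set()
        for doc in docs:
            for size in range(1, max_size + 1):
                candidates.update(combinations(sorted(doc), size))
        for itemset in sorted(candidates):
            items = set(itemset)
            in_class = sum(1 for doc in docs if items <= doc)
            overall = sum(1 for concepts, _ in records if items <= set(concepts))
            support = in_class / len(docs)
            confidence = in_class / overall  # overall >= in_class >= 1
            if support >= min_support and confidence >= min_confidence:
                rules.append((itemset, label, support, confidence))
    return rules


# Invented toy records: extracted concepts paired with a class label.
records = [
    ({"fall", "fracture"}, "target"),
    ({"fall", "dizziness"}, "target"),
    ({"dizziness"}, "other"),
    ({"cough"}, "other"),
]
rules = mine_class_rules(records)
```

Each rule reads as "if these concepts appear, predict this label", which is what makes a rule-based classifier interpretable.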

Experiment three focuses on using decision tree induction (DTI) as a rule discovery approach to classification, which also addresses the second research question. DTI is another rule-centric method for creating a classifier. The contribution of this experiment is showing that DTI can be used to create an accurate and interpretable classifier using specialized language. Additionally, the resulting rule sets are simple, easily interpretable, and created using a highly automated process.
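To show how a decision tree yields interpretable rules, here is a minimal sketch of entropy-based tree induction (in the style of ID3) over concept presence or absence, followed by reading one rule off each root-to-leaf path. The abstract does not say which DTI algorithm or toolkit was used, so this is a generic illustration; the function names and toy records are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def build_tree(records, features):
    """records: list of (set_of_concepts, label); features: set of concept names.

    Returns a label (leaf) or a tuple (feature, subtree_if_present, subtree_if_absent).
    """
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority

    def gain(feature):
        present = [l for c, l in records if feature in c]
        absent = [l for c, l in records if feature not in c]
        if not present or not absent:
            return 0.0
        remainder = (len(present) * entropy(present)
                     + len(absent) * entropy(absent)) / len(labels)
        return entropy(labels) - remainder

    best = max(sorted(features), key=gain)
    if gain(best) <= 1e-12:  # no feature separates the classes any further
        return majority
    rest = features - {best}
    present = [(c, l) for c, l in records if best in c]
    absent = [(c, l) for c, l in records if best not in c]
    return (best, build_tree(present, rest), build_tree(absent, rest))


def tree_to_rules(tree, path=()):
    """Flatten a tree into (conditions, label) rules, one per leaf."""
    if not isinstance(tree, tuple):
        return [(path, tree)]
    feature, if_present, if_absent = tree
    return (tree_to_rules(if_present, path + ((feature, True),))
            + tree_to_rules(if_absent, path + ((feature, False),)))


# Invented toy records: concept "a" fully determines the label here.
records = [
    ({"a", "b"}, "target"),
    ({"a"}, "target"),
    ({"b"}, "other"),
    (set(), "other"),
]
tree = build_tree(records, {"a", "b"})
rule_list = tree_to_rules(tree)
```

In this toy case the tree splits once on concept `a`, so the rule set has two rules: `a` present implies "target", and `a` absent implies "other".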

Scholar Commons Citation

Jarman, Jay, "Combining Natural Language Processing and Statistical Text Mining: A Study of Specialized Versus Common Languages" (2011). USF Tampa Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/3166

Included in

American Studies Commons, Databases and Information Systems Commons
