USF Tampa Graduate Theses and Dissertations

Applying the Wrapper Approach for Auto Discovery of Under-Sampling and Over-Sampling Percentages on Skewed Datasets

Ajay D. Joshi, University of South Florida

Graduation Year

2004

Document Type

Thesis

Degree

M.S.C.S.

Degree Granting Department

Computer Science

Major Professor

Lawrence O. Hall, Ph.D.

Committee Member

Rafael Perez, Ph.D.

Committee Member

Sudeep Sarkar, Ph.D.

Keywords

Machine Learning, Data Mining, SMOTE, RIPPER, imbalance, C4.5, F-value

Abstract

Machine learning applications are plagued by the imbalance observed among the class sizes in many real world datasets. A dataset is said to be skewed or imbalanced when its classes are very unequally represented. A naÃ¯ve classifier learned from these skewed datasets is always biased towards the majority classes which constitute a major percentage of the samples in the dataset. As a result the accuracy on the minority classes is hampered. In many real world applications like network intrusion detection, cancer detection from mammography images, etc. the events of interest are very rare and the cost of not detecting these events is very high. Hence it very important to improve accuracies on the minority classes. It has been proposed previously that under-sampling of the majority classes can reduce the bias of the learned classifier and over-sampling of the minority classes - especially SMOTE (Synthetic Minority Over-sampling TEchnique) can boost the classifier accuracy on minority classes. But the question of how much under-sampling and over-sampling to be done for a particular induction learning algorithm and dataset remains. We present a wrapper approach for searching for the under-sampling and over-sampling (i.e. SMOTE) percentages for a particular learning algorithm for a given skewed dataset. We compare the results obtained by the classifiers built on wrapper selected under sampled and SMOTEd datasets with the ones obtained by classifiers built on the original datasets to show a statistically significant improvement in accuracies over minority classes. This proves the efficacy of the wrapper approach in searching for the under-sampling and over-sampling percentages. Further, it provides an automated method to select the number of synthetic examples to be created.

Scholar Commons Citation

Joshi, Ajay D., "Applying the Wrapper Approach for Auto Discovery of Under-Sampling and Over-Sampling Percentages on Skewed Datasets" (2004). USF Tampa Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/1101

Download

Included in

American Studies Commons

COinS

USF Tampa Graduate Theses and Dissertations

Applying the Wrapper Approach for Auto Discovery of Under-Sampling and Over-Sampling Percentages on Skewed Datasets

Graduation Year

Document Type

Degree

Degree Granting Department

Major Professor

Committee Member

Committee Member

Keywords

Abstract

Scholar Commons Citation

Included in

Search

Browse By

Useful Links

USF Tampa Graduate Theses and Dissertations

Applying the Wrapper Approach for Auto Discovery of Under-Sampling and Over-Sampling Percentages on Skewed Datasets

Author

Graduation Year

Document Type

Degree

Degree Granting Department

Major Professor

Committee Member

Committee Member

Keywords

Abstract

Scholar Commons Citation

Included in

Share

Search

Browse By

Useful Links