Classification techniques for noisy and imbalanced data
- Date Issued: 2009
- Summary: Machine learning techniques allow useful insight to be distilled from the increasingly massive repositories of data being stored. As these data mining techniques can only learn patterns actually present in the data, it is important that the desired knowledge be faithfully and discernibly contained therein. Two common data quality issues that often affect important real-world classification applications are class noise and class imbalance. Class noise, where dependent attribute values are recorded erroneously, misleads a classifier and reduces predictive performance. Class imbalance occurs when one class represents only a small portion of the examples in a dataset; in such cases, classifiers often display poor accuracy on the minority class. The reduction in classification performance becomes even worse when the two issues occur simultaneously. To address the magnified difficulty caused by this interaction, this dissertation performs thorough empirical investigations of several techniques for dealing with class noise and imbalanced data. Comprehensive experiments are performed to assess the effects of the classification techniques on classifier performance, as well as how the level of class imbalance, the level of class noise, and the distribution of class noise among the classes affect results. An empirical analysis of classifier-based noise detection efficiency appears first. Subsequently, an intelligent data sampling technique, based on noise detection, is proposed and tested. Several hybrid classifier ensemble techniques for addressing class noise and imbalance are introduced. Finally, a detailed empirical investigation of classification filtering is performed to determine best practices.
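
The summary mentions classifier-based noise detection and classification filtering without describing them in detail. As a rough illustration only, the sketch below flags an example as potentially mislabeled when a classifier trained on the remaining data disagrees with its recorded class. It is not the dissertation's procedure: the toy dataset, the leave-one-out 3-nearest-neighbor learner, and the names `knn_predict` and `classification_filter` are all assumptions made for this example.

```python
from collections import Counter


def knn_predict(train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def classification_filter(data, k=3):
    """Return indices whose recorded label disagrees with a leave-one-out k-NN prediction."""
    flagged = []
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # hold out the example being checked
        if knn_predict(rest, x, k) != label:
            flagged.append(i)
    return flagged


# Hypothetical imbalanced toy data: 16 majority ("neg") and 4 minority ("pos") examples.
majority = [(float(v), "neg") for v in range(0, 16)]
minority = [(float(v), "pos") for v in range(30, 34)]
noisy = majority + minority

# Inject class noise by flipping one label in each class.
noisy[5] = (5.0, "pos")    # true class: neg
noisy[17] = (31.0, "neg")  # true class: pos

flagged = classification_filter(noisy)
print(flagged)  # [5, 17]
```

On this toy data the filter flags exactly the two flipped labels. With only four minority examples, a single mislabeled neighbor already makes up a third of a 3-nearest-neighbor vote, which hints at why noise and imbalance together are harder to handle than either alone.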
Title: | Classification techniques for noisy and imbalanced data. |
---|---|---|
Name(s): | Napolitano, Amri E.; College of Engineering and Computer Science; Department of Computer and Electrical Engineering and Computer Science | |
Type of Resource: | text | |
Genre: | Electronic Thesis Or Dissertation | |
Issuance: | monographic | |
Date Issued: | 2009 | |
Publisher: | Florida Atlantic University | |
Physical Form: | electronic | |
Extent: | xvii, 218 p. : ill. | |
Language(s): | English | |
Identifier: | 501313430 (oclc), 369201 (digitool), FADT369201 (IID), fau:4271 (fedora) | |
Note(s): | by Amri Napolitano. Thesis (Ph.D.)--Florida Atlantic University, 2009. Includes bibliography. Electronic reproduction. Boca Raton, Fla., 2009. Mode of access: World Wide Web. | |
Subject(s): | Combinatorial group theory; Data mining -- Technological innovations; Decision trees; Machine learning; Filters (Mathematics) | |
Held by: | FBoU FAUER | |
Persistent Link to This Record: | http://purl.flvc.org/FAU/369201 | |
Use and Reproduction: | http://rightsstatements.org/vocab/InC/1.0/ | |
Host Institution: | FAU |