You are here

Classification techniques for noisy and imbalanced data

Download pdf | Full Screen View

Date Issued:
2009
Summary:
Machine learning techniques allow useful insight to be distilled from the increasingly massive repositories of data being stored. As these data mining techniques can only learn patterns actually present in the data, it is important that the desired knowledge be faithfully and discernibly contained therein. Two common data quality issues that often affect important real life classification applications are class noise and class imbalance. Class noise, where dependent attribute values are recorded erroneously, misleads a classifier and reduces predictive performance. Class imbalance occurs when one class represents only a small portion of the examples in a dataset, and, in such cases, classifiers often display poor accuracy on the minority class. The reduction in classification performance becomes even worse when the two issues occur simultaneously. To address the magnified difficulty caused by this interaction, this dissertation performs thorough empirical investigations of several techniques for dealing with class noise and imbalanced data. Comprehensive experiments are performed to assess the effects of the classification techniques on classifier performance, as well as how the level of class imbalance, level of class noise, and distribution of class noise among the classes affects results. An empirical analysis of classifier based noise detection efficiency appears first. Subsequently, an intelligent data sampling technique, based on noise detection, is proposed and tested. Several hybrid classifier ensemble techniques for addressing class noise and imbalance are introduced. Finally, a detailed empirical investigation of classification filtering is performed to determine best practices.
Title: Classification techniques for noisy and imbalanced data.
5237 views
5074 downloads
Name(s): Napolitano, Amri E.
College of Engineering and Computer Science
Department of Computer and Electrical Engineering and Computer Science
Type of Resource: text
Genre: Electronic Thesis Or Dissertation
Issuance: monographic
Date Issued: 2009
Publisher: Florida Atlantic University
Physical Form: electronic
Extent: xvii, 218 p. : ill.
Language(s): English
Summary: Machine learning techniques allow useful insight to be distilled from the increasingly massive repositories of data being stored. As these data mining techniques can only learn patterns actually present in the data, it is important that the desired knowledge be faithfully and discernibly contained therein. Two common data quality issues that often affect important real life classification applications are class noise and class imbalance. Class noise, where dependent attribute values are recorded erroneously, misleads a classifier and reduces predictive performance. Class imbalance occurs when one class represents only a small portion of the examples in a dataset, and, in such cases, classifiers often display poor accuracy on the minority class. The reduction in classification performance becomes even worse when the two issues occur simultaneously. To address the magnified difficulty caused by this interaction, this dissertation performs thorough empirical investigations of several techniques for dealing with class noise and imbalanced data. Comprehensive experiments are performed to assess the effects of the classification techniques on classifier performance, as well as how the level of class imbalance, level of class noise, and distribution of class noise among the classes affects results. An empirical analysis of classifier based noise detection efficiency appears first. Subsequently, an intelligent data sampling technique, based on noise detection, is proposed and tested. Several hybrid classifier ensemble techniques for addressing class noise and imbalance are introduced. Finally, a detailed empirical investigation of classification filtering is performed to determine best practices.
Identifier: 501313430 (oclc), 369201 (digitool), FADT369201 (IID), fau:4271 (fedora)
Note(s): by Amri Napolitano.
Thesis (Ph.D.)--Florida Atlantic University, 2009.
Includes bibliography.
Electronic reproduction. Boca Raton, Fla., 2009. Mode of access: World Wide Web.
Subject(s): Combinatorial group theory
Data mining -- Technological innovations
Decision trees
Machine learning
Filters (Mathematics)
Held by: FBoU FAUER
Persistent Link to This Record: http://purl.flvc.org/FAU/369201
Use and Reproduction: http://rightsstatements.org/vocab/InC/1.0/
Host Institution: FAU