Data Quality in Data Mining and Machine Learning

Date Issued:
2007
Summary:
With advances in data storage and data transmission technologies, and given the increasing use of computers by both individuals and corporations, organizations are accumulating an ever-increasing amount of information in data warehouses and databases. The huge surge in data, however, has made the process of extracting useful, actionable, and interesting knowledge from the data extremely difficult. In response to the challenges posed by operating in a data-intensive environment, the fields of data mining and machine learning (DM/ML) have successfully provided solutions to help uncover knowledge buried within data. DM/ML techniques use automated (or semi-automated) procedures to process vast quantities of data in search of interesting patterns. DM/ML techniques do not create knowledge; instead, the implicit assumption is that knowledge is present within the data, and these procedures are needed to uncover interesting, important, and previously unknown relationships. Therefore, the quality of the data is absolutely critical in ensuring successful analysis. Having high-quality data, i.e., data that is (relatively) free from errors and suitable for use in data mining tasks, is a necessary precondition for extracting useful knowledge. In response to the important role played by data quality, this dissertation investigates data quality and its impact on DM/ML. First, we propose several innovative procedures for coping with low-quality data. Another aspect of data quality, the occurrence of missing values, is also explored. Finally, a detailed experimental evaluation on learning from noisy and imbalanced datasets is provided, supplying valuable insight into how class noise in skewed datasets affects learning algorithms.
Title: Data Quality in Data Mining and Machine Learning.
Name(s): Van Hulse, Jason
Khoshgoftaar, Taghi M., Thesis advisor
Florida Atlantic University, Degree Grantor
College of Engineering and Computer Science
Department of Computer and Electrical Engineering and Computer Science
Type of Resource: text
Genre: Electronic Thesis Or Dissertation
Date Created: 2007
Publisher: Florida Atlantic University
Place of Publication: Boca Raton, Fla.
Physical Form: application/pdf
Extent: 226 p.
Language(s): English
Identifier: FA00000858 (IID)
Degree granted: Dissertation (Ph.D.)--Florida Atlantic University, 2007.
Collection: FAU Electronic Theses and Dissertations Collection
Note(s): Includes bibliography.
Subject(s): Data mining--Quality control
Machine learning
Electronic data processing--Quality control
Held by: Florida Atlantic University Libraries
Persistent Link to This Record: http://purl.flvc.org/fau/fd/FA00000858
Sublocation: Digital Library
Use and Reproduction: Copyright © is held by the author with permission granted to Florida Atlantic University to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Use and Reproduction: http://rightsstatements.org/vocab/InC/1.0/
Host Institution: FAU
Is Part of Series: Florida Atlantic University Digital Library Collections.