NOVEL TECHNIQUES FOR HANDLING IMBALANCED DATA WITH UNSUPERVISED METHODS
- Date Issued: 2024
- Abstract/Description:
- In the modern data landscape, vast amounts of unlabeled data are continuously generated, necessitating the development of robust unsupervised techniques for handling such data. This is the case in fraud detection and healthcare-sector analyses, where data is often significantly imbalanced. This dissertation focuses on novel techniques for handling imbalanced data, with specific emphasis on a novel unsupervised class labeling technique for unlabeled fraud detection datasets and unlabeled cognitive datasets. Traditional supervised machine learning relies on labeled data, which is often expensive and difficult to create, particularly in domains requiring expert input. Additionally, such datasets suffer from class imbalance, where one class has significantly fewer examples than another, which complicates model training and significantly reduces performance. The primary objectives of this dissertation are to develop a novel unsupervised cleaning method and an innovative unsupervised class labeling method. We validate and evaluate our methods across several datasets: two Medicare fraud detection datasets, a credit card fraud detection dataset, and three datasets used for detecting cognitive decline. Our approach uses an unsupervised autoencoder to learn from dataset features and synthesize labels. Primarily targeting imbalanced datasets, but still effective for balanced datasets, our method calculates an error metric for each instance. This metric is used to distinguish between fraudulent and legitimate cases, allowing us to assign a binary class label. To further improve label generation, we integrate an unsupervised feature selection method that ranks and identifies the most important features without using class labels.
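
The labeling idea described in the abstract (train an autoencoder without labels, compute a per-instance error, then threshold it into a binary label) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the synthetic data, the tiny NumPy network, the use of mean squared reconstruction error as the error metric, and the 2% threshold (`assumed_minority_rate`) are all assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 980 "legitimate" rows lying near a 2-D subspace
# of an 8-D feature space, plus 20 "fraud-like" rows scattered widely.
n_normal, n_rare, n_features = 980, 20, 8
mixing = rng.normal(size=(2, n_features))
normal = rng.normal(size=(n_normal, 2)) @ mixing
normal += 0.1 * rng.normal(size=(n_normal, n_features))
rare = 4.0 * rng.normal(size=(n_rare, n_features))
X = np.vstack([normal, rare])
# Ground truth is kept only to check the result; it is never used for training.
y_true = np.r_[np.zeros(n_normal, dtype=int), np.ones(n_rare, dtype=int)]

# Standardize each feature to zero mean and unit variance.
X = (X - X.mean(axis=0)) / X.std(axis=0)


def train_autoencoder(X, hidden=2, epochs=500, lr=0.1, seed=0):
    """Fit a tanh-encoder / linear-decoder autoencoder by full-batch gradient descent."""
    init = np.random.default_rng(seed)
    n, d = X.shape
    W1 = init.normal(0.0, 0.1, size=(d, hidden))
    b1 = np.zeros(hidden)
    W2 = init.normal(0.0, 0.1, size=(hidden, d))
    b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)           # encode
        X_hat = H @ W2 + b2                # decode
        G = 2.0 * (X_hat - X) / (n * d)    # gradient of the mean squared error w.r.t. X_hat
        dZ = (G @ W2.T) * (1.0 - H ** 2)   # backpropagate through tanh
        W2 -= lr * (H.T @ G)
        b2 -= lr * G.sum(axis=0)
        W1 -= lr * (X.T @ dZ)
        b1 -= lr * dZ.sum(axis=0)
    return W1, b1, W2, b2


def reconstruction_error(X, params):
    """Per-instance mean squared reconstruction error."""
    W1, b1, W2, b2 = params
    X_hat = np.tanh(X @ W1 + b1) @ W2 + b2
    return ((X_hat - X) ** 2).mean(axis=1)


params = train_autoencoder(X)
errors = reconstruction_error(X, params)

# Assumption: label the 2% of rows with the highest error as the positive class.
assumed_minority_rate = 0.02
threshold = np.percentile(errors, 100 * (1 - assumed_minority_rate))
labels = (errors > threshold).astype(int)

recall = (labels[y_true == 1] == 1).mean()
print(f"flagged {labels.sum()} of {len(labels)} rows; "
      f"recall on planted rare rows: {recall:.2f}")
```

The sketch relies on the majority class dominating training, so the network reconstructs typical rows well and unusual rows poorly. How the threshold should be chosen in practice is not stated in this record, so the fixed percentile above is only a placeholder.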
Title: | NOVEL TECHNIQUES FOR HANDLING IMBALANCED DATA WITH UNSUPERVISED METHODS. |
---|---|---|
Name(s): | Kennedy, Robert Kwan Lee, author; Khoshgoftaar, Taghi M., Thesis advisor; Florida Atlantic University, Degree grantor; Department of Computer and Electrical Engineering and Computer Science; College of Engineering and Computer Science | |
Type of Resource: | text | |
Genre: | Electronic Thesis Or Dissertation | |
Date Created: | 2024 | |
Date Issued: | 2024 | |
Publisher: | Florida Atlantic University | |
Place of Publication: | Boca Raton, Fla. | |
Physical Form: | application/pdf | |
Extent: | 171 p. | |
Language(s): | English | |
Identifier: | FA00014547 (IID) | |
Degree granted: | Dissertation (PhD)--Florida Atlantic University, 2024. | |
Collection: | FAU Electronic Theses and Dissertations Collection | |
Note(s): | Includes bibliography. | |
Subject(s): | Machine learning; Big data; Computer science | |
Persistent Link to This Record: | http://purl.flvc.org/fau/fd/FA00014547 | |
Use and Reproduction: | Copyright © is held by the author with permission granted to Florida Atlantic University to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder. | |
Use and Reproduction: | http://rightsstatements.org/vocab/InC/1.0/ | |
Host Institution: | FAU |