Current Search: Charles E. Schmidt College of Science » Khoshgoftaar, Taghi M.
- Title
- OCR2SEQ: A NOVEL MULTI-MODAL DATA AUGMENTATION PIPELINE FOR WEAK SUPERVISION.
- Creator
- Lowe, Michael A., Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
- With the recent large-scale adoption of Large Language Models in multidisciplinary research and commercial spaces, the need for large amounts of labeled data has become more crucial than ever to evaluate potential use cases for opportunities in applied intelligence. Most domain-specific fields require a substantial shift that involves extremely large amounts of heterogeneous data to have a meaningful impact on the pre-computed weights of most large language models. We explore extending the capabilities of a state-of-the-art unsupervised pre-training method, the Transformer-based Sequential Denoising Auto-Encoder (TSDAE). In this study we show various opportunities for using OCR2Seq, a multi-modal generative augmentation strategy, to further enhance and measure the quality of the noise samples used with TSDAE as a pretraining task. This study is a first-of-its-kind work that leverages converting both generalized and sparse domains of relational data into multi-modal sources. Our primary objective is measuring the quality of augmentation in relation to the current implementation of the sentence transformers library. Further work includes the effect on ranking, language understanding, and corrective quality.
- Date Issued
- 2023
- PURL
- http://purl.flvc.org/fau/fd/FA00014367
- Subject Headings
- Natural language processing (Computer science), Deep learning (Machine learning)
- Format
- Document (PDF)
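The abstract above refers to the noise samples used when TSDAE serves as a pretraining task. As a rough illustration only, the sketch below assumes token-deletion noise, the kind of corruption TSDAE is usually described with; the function name `delete_noise`, the whitespace tokenization, and the default ratio of 0.6 are assumptions for this example and are not taken from the thesis or from OCR2Seq.

```python
import random


def delete_noise(sentence, del_ratio=0.6, seed=None):
    """Corrupt a sentence by randomly deleting whitespace-separated tokens.

    del_ratio is the probability of dropping each token (an assumed default).
    At least one token is always kept so the encoder input is never empty.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    if not tokens:
        return sentence
    # Keep a token when its random draw is at or above the deletion ratio.
    kept = [tok for tok in tokens if rng.random() >= del_ratio]
    if not kept:
        kept = [rng.choice(tokens)]
    return " ".join(kept)


if __name__ == "__main__":
    original = "the need for large amounts of labeled data has become crucial"
    # A denoising auto-encoder is trained to reconstruct `original` from `noisy`.
    noisy = delete_noise(original, seed=42)
    print(noisy)
```

An augmentation pipeline would swap a different corruption function in here; the denoising objective itself only needs pairs of (noisy, original) text.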
- Title
- A COMPARATIVE STUDY OF STRUCTURED VERSUS UNSTRUCTURED TEXT DATA.
- Creator
- Cardenas, Erika, Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
- In today’s world, data is generated at an unprecedented rate, and a significant portion of it is unstructured text data. The recent advancements in Natural Language Processing have enabled computers to understand and interpret human language. Data mining techniques were once unable to use text data due to the high dimensionality of text processing models. This limitation was overcome with the ability to represent data as text. This thesis aims to compare the predictive performance of structured versus unstructured text data in two different applications. The first application is in the field of real estate. We compare the performance of tabular real-estate data and unstructured text descriptions of homes to predict the house price. The second application is in translating Electronic Health Records (EHR) tabular data to text data for survival classification of COVID-19 patients. Lastly, we present a range of strategies and perspectives for future research.
- Date Issued
- 2023
- PURL
- http://purl.flvc.org/fau/fd/FA00014220
- Subject Headings
- Natural language processing (Computer science), Text data mining
- Format
- Document (PDF)
- Title
- DATA AUGMENTATION IN DEEP LEARNING.
- Creator
- Shorten, Connor, Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
- Recent successes of Deep Learning-powered AI are largely due to the trio of algorithms, GPU computing, and big data. Data could take the shape of hospital records, satellite images, or the text in this paragraph. Deep Learning algorithms typically need massive collections of data before they can make reliable predictions. This limitation inspired investigation into a class of techniques referred to as Data Augmentation. Data Augmentation was originally developed as a set of label-preserving transformations used in order to simulate large datasets from small ones. For example, imagine developing a classifier that categorizes images as either a “cat” or a “dog”. After initial collection and labeling, there may only be 500 of these images, which are not enough data points to train a Deep Learning model. By transforming these images with Data Augmentations such as rotations and brightness modifications, more labeled images are available for model training and classification! In addition to applications for learning from limited labeled data, Data Augmentation can also be used for generalization testing. For example, we can augment the test set to set the visual style of images to “winter” and see how that impacts the performance of a stop sign detector. The dissertation begins with an overview of Deep Learning methods such as neural network architectures, gradient descent optimization, and generalization testing. Following an initial description of this technology, the dissertation explains overfitting. Overfitting is the central problem of Deep Learning methods, in which improvements on the training set do not lead to improvements on the testing set. To the rescue are Data Augmentation techniques: the dissertation presents an overview of the augmentations used for both image and text data, as well as the promising potential of generative data augmentation with models such as ChatGPT. The dissertation then describes three major experimental works revolving around CIFAR-10 image classification, language modeling on a novel dataset of Keras information, and patient survival classification from COVID-19 Electronic Health Records. The dissertation concludes with a reflection on the evolution of limitations of Deep Learning and directions for future work.
- Date Issued
- 2023
- PURL
- http://purl.flvc.org/fau/fd/FA00014228
- Subject Headings
- Deep learning (Machine learning), Artificial intelligence, Data augmentation
- Format
- Document (PDF)
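The cat-versus-dog example in the abstract above, where rotations and brightness changes turn a few labeled images into many, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the dissertation's code: images are plain nested lists of grayscale pixels in 0..255, rotations are restricted to multiples of 90 degrees, and the brightness offset of 40 is arbitrary. The names `rotate90`, `adjust_brightness`, and `augment` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image, pixel values in 0..255


def rotate90(image: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def adjust_brightness(image: Image, delta: int) -> Image:
    """Add delta to every pixel, clamping to the valid 0..255 range."""
    return [[max(0, min(255, p + delta)) for p in row] for row in image]


def augment(image: Image, label: str) -> List[Tuple[Image, str]]:
    """Return label-preserving variants of one labeled image.

    Produces the original, three rotations, and two brightness shifts,
    all carrying the same label as the input.
    """
    variants = [image]
    rotated = image
    for _ in range(3):
        rotated = rotate90(rotated)
        variants.append(rotated)
    variants.append(adjust_brightness(image, 40))   # brighter copy
    variants.append(adjust_brightness(image, -40))  # darker copy
    return [(v, label) for v in variants]


if __name__ == "__main__":
    tiny_cat = [[10, 200], [30, 250]]
    for img, lbl in augment(tiny_cat, "cat"):
        print(lbl, img)
```

Each input image yields six labeled examples here, so 500 collected images would become 3,000 training examples. Whether a given transformation really preserves the label depends on the task, so the set of transformations has to be chosen per dataset.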
- Title
- ADDRESSING HIGHLY IMBALANCED BIG DATA CHALLENGES FOR MEDICARE FRAUD DETECTION.
- Creator
- Johnson, Justin M., Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
- Access to affordable healthcare is a nationwide concern that impacts most of the United States population. Medicare is a federal government healthcare program that aims to provide affordable health insurance to the elderly population and individuals with select disabilities. Unfortunately, there is a significant amount of fraud, waste, and abuse within the Medicare system that inevitably raises premiums and costs taxpayers billions of dollars each year. Dedicated task forces investigate the most severe fraudulent cases, but with millions of healthcare providers and more than 60 million active Medicare beneficiaries, manual fraud detection efforts are not able to make a widespread, meaningful impact. Through the proliferation of electronic health records and continuous breakthroughs in data mining and machine learning, there is a great opportunity to develop and leverage advanced machine learning systems for automating healthcare fraud detection. This dissertation identifies key challenges associated with predictive modeling for large-scale Medicare fraud detection and presents innovative solutions to address these challenges in order to provide state-of-the-art results on multiple real-world Medicare fraud data sets. Our methodology for curating nine distinct Medicare fraud classification data sets is presented with comprehensive details describing data accumulation, data pre-processing, data aggregation techniques, data enrichment strategies, and improved fraud labeling. Data-level and algorithm-level methods for treating severe class imbalance, including a flexible output thresholding method and a cost-sensitive framework, are evaluated using deep neural networks and ensemble learners. Novel encoding techniques and representation learning methods for high-dimensional categorical features are proposed to create expressive representations of provider attributes and billing procedure codes.
- Date Issued
- 2022
- PURL
- http://purl.flvc.org/fau/fd/FA00014057
- Subject Headings
- Medicare fraud, Big data, Machine learning
- Format
- Document (PDF)
- Title
- MACHINE LEARNING ALGORITHMS FOR PREDICTING BOTNET ATTACKS IN IOT NETWORKS.
- Creator
- Leevy, Joffrey, Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
- The proliferation of Internet of Things (IoT) devices in various networks is being matched by an increase in related cybersecurity risks. To help counter these risks, big datasets such as Bot-IoT were designed to train machine learning algorithms on network-based intrusion detection for IoT devices. From a binary classification perspective, there is a high class imbalance in Bot-IoT between each of the attack categories and the normal category, and also between the combined attack categories and the normal category. Within the scope of predicting botnet attacks in IoT networks, this dissertation demonstrates the usefulness and efficiency of novel machine learning methods, such as an easy-to-classify method and a unique set of ensemble feature selection techniques. The focus of this work is on the full Bot-IoT dataset, as well as each of the four attack categories of Bot-IoT, namely, Denial-of-Service (DoS), Distributed Denial-of-Service (DDoS), Reconnaissance, and Information Theft. Since resources and services become inaccessible during DoS and DDoS attacks, this interruption is costly to an organization in terms of both time and money. Reconnaissance attacks often signify the first stage of a cyberattack, and preventing them from occurring usually means the end of the intended cyberattack. Information Theft attacks not only erode consumer confidence but may also compromise intellectual property and national security. For the DoS experiment, the ensemble feature selection approach led to the best performance, while for the DDoS experiment, the full set of Bot-IoT features resulted in the best performance. Regarding the Reconnaissance experiment, the ensemble feature selection approach yielded the best performance. In relation to the Information Theft experiment, the ensemble feature selection techniques did not affect performance, positively or negatively. However, the ensemble feature selection approach is recommended for this experiment because feature reduction eases computational burden and may provide clarity through improved data visualization. For the full Bot-IoT big dataset, an explainable machine learning approach was taken using the Decision Tree classifier. An easy-to-learn Decision Tree model for predicting attacks was obtained with only three features, which is a significant result for big data.
- Date Issued
- 2022
- PURL
- http://purl.flvc.org/fau/fd/FA00013933
- Subject Headings
- Machine learning, Internet of things--Security measures, Big data, Intrusion detection systems (Computer security)
- Format
- Document (PDF)
- Title
- COLLECTION AND ANALYSIS OF SLOW DENIAL OF SERVICE ATTACKS USING MACHINE LEARNING ALGORITHMS.
- Creator
- Kemp, Clifford, Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Application-layer attacks are becoming an increasingly attractive way for hackers to target computer networks. From complex rootkits to Denial of Service (DoS) attacks, hackers look to compromise computer networks. Web and application servers can get shut down by various application-layer DoS attacks, which exhaust CPU or memory resources. The HTTP protocol has become a popular target for application-layer DoS attacks. These exploits consume less bandwidth than traditional DoS attacks. Furthermore, this type of DoS attack is hard to detect because its network traffic resembles legitimate network requests. Being able to detect these DoS attacks effectively is a critical component of any robust cybersecurity system. Machine learning can help detect DoS attacks by identifying patterns in network traffic. With machine learning methods, predictive models can automatically detect network threats. This dissertation offers a novel framework for collecting several attack datasets on a live production network, where producing quality representative data is a requirement. Our approach builds datasets from collected Netflow and Full Packet Capture (FPC) data. We evaluate a wide range of machine learning classifiers, which allows us to analyze slow DoS detection models more thoroughly. To identify attacks, we look at each dataset's unique traffic patterns and distinguishing properties. This research evaluates and investigates appropriate feature selection evaluators and search strategies. Features are assessed for their predictive value and degree of redundancy to build a subset of features. Feature subsets with high class correlation but low intercorrelation are favored. Experimental results indicate Netflow and FPC features are discriminating enough to detect DoS attacks accurately. We conduct a comparative examination of performance metrics to determine the capability of several machine learning classifiers. Additionally, we improve upon our performance scores by investigating a variety of feature selection optimization strategies. Overall, this dissertation proposes a novel machine learning approach for detecting slow DoS attacks. Our machine learning results demonstrate that a single subset of features trained on Netflow data can effectively detect slow application-layer DoS attacks.
- Date Issued
- 2021
- PURL
- http://purl.flvc.org/fau/fd/FA00013848
- Subject Headings
- Machine learning, Algorithms, Denial of service attacks
- Format
- Document (PDF)
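The abstract above favors feature subsets with high class correlation but low intercorrelation. One common way to score that trade-off is the correlation-based feature selection (CFS) merit, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-to-class correlation and r_ff the mean feature-to-feature correlation. The abstract does not name the evaluator actually used, so treating it as CFS with Pearson correlation is an assumption, and the helper names below are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cfs_merit(features, target):
    """Score a feature subset: reward class correlation, penalize redundancy.

    features: list of feature columns; target: the class column.
    """
    k = len(features)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


if __name__ == "__main__":
    f1 = [1, -1, 1, -1]
    f2 = [1, 1, -1, -1]
    target = [2, 0, 0, -2]
    # Two complementary features score higher than the same feature twice.
    print(cfs_merit([f1, f2], target), cfs_merit([f1, f1], target))
```

A search strategy such as best-first or greedy forward selection would then add or remove features to maximize this score.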
- Title
- A REVIEW AND ANALYSIS OF BOT-IOT SECURITY DATA FOR MACHINE LEARNING.
- Creator
- Peterson, Jared M., Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Machine learning is having an increased impact on the Cyber Security landscape. The ability of predictive models to accurately identify attack patterns in security data is set to overtake more traditional detection methods. Industry demand has led to an uptick in research on the application of machine learning for Cyber Security. To facilitate this research, many datasets have been created and made public. This thesis provides an in-depth analysis of one of the newest datasets, Bot-IoT. The full dataset contains about 73 million instances (big data), 3 dependent features, and 43 independent features. The purpose of this thesis is to provide researchers with a foundational understanding of Bot-IoT, its development, its features, its composition, and its pitfalls. It will also summarize many of the published works that utilize Bot-IoT and will propose new areas of research based on the issues identified in the current research and in the dataset.
- Date Issued
- 2021
- PURL
- http://purl.flvc.org/fau/fd/FA00013838
- Subject Headings
- Machine learning, Cyber security, Big data
- Format
- Document (PDF)
- Title
- MACHINE LEARNING ALGORITHMS FOR THE DETECTION AND ANALYSIS OF WEB ATTACKS.
- Creator
- Zuech, Richard, Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- The Internet has provided humanity with many great benefits, but it has also introduced new risks and dangers. E-commerce and other web portals have become large industries with big data. Criminals and other bad actors constantly seek to exploit these web properties through web attacks. Being able to properly detect these web attacks is a crucial component in the overall cybersecurity landscape. Machine learning is one tool that can assist in detecting web attacks. However, properly using machine learning to detect web attacks does not come without its challenges. Classification algorithms can have difficulty with severe levels of class imbalance. Class imbalance occurs when one class label disproportionately outnumbers another class label. For example, in cybersecurity, it is common for the negative (normal) label to severely outnumber the positive (attack) label. Another difficulty encountered in machine learning is that models can be complex, thus making it difficult for even subject matter experts to truly understand a model’s detection process. Moreover, it is important for practitioners to determine which input features to include or exclude in their models for optimal detection performance. This dissertation studies machine learning algorithms in detecting web attacks with big data. Severe class imbalance is a common problem in cybersecurity, and mainstream machine learning research does not sufficiently consider this with web attacks. Our research first investigates the problems associated with severe class imbalance and rarity. Rarity is an extreme form of class imbalance where the positive class has an extremely low instance count, thus making it difficult for the classifiers to discriminate. In reducing imbalance, we demonstrate random undersampling can effectively mitigate the class imbalance and rarity problems associated with web attacks. Furthermore, our research introduces a novel feature popularity technique which produces easier-to-understand models by only including the fewer, most popular features. Feature popularity granted us new insights into the web attack detection process, even though we had already intensely studied it. Even so, we proceed cautiously in selecting the best input features, as we determined that the “most important” Destination Port feature might be contaminated by lopsided traffic distributions.
- Date Issued
- 2021
- PURL
- http://purl.flvc.org/fau/fd/FA00013823
- Subject Headings
- Machine learning, Computer security, Algorithms, Cybersecurity
- Format
- Document (PDF)
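Random undersampling, which the abstract above reports as an effective remedy for class imbalance and rarity, keeps every minority (attack) instance and discards a random share of the majority (normal) instances. The sketch below is a generic illustration, not the dissertation's implementation; the function name, the `ratio` parameter (negatives kept per positive), and the fixed seed are assumptions.

```python
import random


def random_undersample(rows, labels, ratio=1.0, positive=1, seed=0):
    """Keep all positive instances and a random sample of the negatives.

    ratio is the desired number of negatives per positive, so ratio=1.0
    yields a balanced 50:50 dataset. Original row order is preserved.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == positive]
    neg = [i for i, y in enumerate(labels) if y != positive]
    # Never request more negatives than exist.
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [rows[i] for i in kept], [labels[i] for i in kept]


if __name__ == "__main__":
    labels = [0] * 95 + [1] * 5          # 95 normal, 5 attack instances
    rows = list(range(100))              # stand-in feature rows
    X, y = random_undersample(rows, labels, ratio=1.0)
    print(len(y), sum(y))                # 10 instances, 5 of them attacks
```

Undersampling is applied to the training split only; the test split keeps its original distribution so that reported metrics reflect real traffic.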
- Title
- DEEP MAXOUT NETWORKS FOR CLASSIFICATION PROBLEMS ACROSS MULTIPLE DOMAINS.
- Creator
- Castaneda, Gabriel, Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Machine learning techniques such as deep neural networks have become an indispensable tool for a wide range of applications such as image classification, speech recognition, and sentiment analysis in text. An activation function is a mathematical equation that determines the output of each neuron in the neural network. In deep learning architectures the choice of activation functions is very important to the network’s performance. Activation functions determine the output of the model, its computational efficiency, and its ability to train and converge after multiple iterations of training epochs. The selection of an activation function is critical to building and training an effective and efficient neural network. In real-world applications of deep neural networks, the activation function is a hyperparameter. We have observed a lack of consensus on how to select a good activation function for a deep neural network, and that a specific function may not be suitable for all domain-specific applications.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013362
- Subject Headings
- Classification, Machine learning--Technique, Neural networks (Computer science)
- Format
- Document (PDF)
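The record above concerns maxout networks, and its abstract describes an activation function as the equation that determines each neuron's output. As a small illustration, a maxout unit outputs the maximum over k affine functions of its input, max_j (w_j · x + b_j). The sketch below is a plain-Python rendering of a single unit; the function name and argument layout are assumptions, and it is not code from the dissertation.

```python
def maxout(x, weights, biases):
    """Output of one maxout unit.

    x: input vector; weights: k weight vectors, each the same length as x;
    biases: k scalars. Returns the largest of the k affine responses.
    """
    return max(
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )


if __name__ == "__main__":
    # With pieces (x, 0) the unit reproduces ReLU; with (x, -x) it gives |x|.
    relu_like = ([[1.0], [0.0]], [0.0, 0.0])
    abs_like = ([[1.0], [-1.0]], [0.0, 0.0])
    for value in (-3.0, 2.0):
        print(value, maxout([value], *relu_like), maxout([value], *abs_like))
```

Because the pieces are learned, a maxout unit can approximate different activation shapes, at the cost of k times as many parameters per unit.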
- Title
- DATA COLLECTION FRAMEWORK AND MACHINE LEARNING ALGORITHMS FOR THE ANALYSIS OF CYBER SECURITY ATTACKS.
- Creator
- Calvert, Chad, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- The integrity of network communications is constantly being challenged by more sophisticated intrusion techniques. Attackers are shifting to stealthier and more complex forms of attacks in an attempt to bypass known mitigation strategies. Also, many detection methods for popular network attacks have been developed using outdated or non-representative attack data. To effectively develop modern detection methodologies, there exists a need to acquire data that can fully encompass the behaviors of persistent and emerging threats. When collecting modern day network traffic for intrusion detection, substantial amounts of traffic can be collected, much of which consists of relatively few attack instances as compared to normal traffic. This skewed distribution between normal and attack data can lead to high levels of class imbalance. Machine learning techniques can be used to aid in attack detection, but large levels of imbalance between normal (majority) and attack (minority) instances can lead to inaccurate detection results.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013289
- Subject Headings
- Machine learning, Algorithms, Anomaly detection (Computer security), Intrusion detection systems (Computer security), Big data
- Format
- Document (PDF)
- Title
- INVESTIGATING MACHINE LEARNING ALGORITHMS WITH IMBALANCED BIG DATA.
- Creator
- Hasanin, Tawfiq, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Recent technological developments have engendered an expeditious production of big data and also enabled machine learning algorithms to produce high-performance models from such data. Nonetheless, class imbalance (in binary classifications) between the majority and minority classes in big data can skew the predictive performance of the classification algorithms toward the majority (negative) class, whereas the minority (positive) class usually holds greater value for the decision makers. Such bias may lead to adverse consequences, some of them even life-threatening, when false negatives are generally costlier than false positives. The size of the minority class can vary from fair to extraordinarily small, which can lead to different performance scores for machine learning algorithms. Class imbalance is a well-studied area for traditional data, i.e., not big data. However, there is limited research focusing on both rarity and severe class imbalance in big data.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013316
- Subject Headings
- Algorithms, Machine learning, Big data--Data processing, Big data
- Format
- Document (PDF)
- Title
- PREDICTING MELANOMA RISK FROM ELECTRONIC HEALTH RECORDS WITH MACHINE LEARNING TECHNIQUES.
- Creator
- Richter, Aaron N., Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Melanoma is one of the fastest-growing cancers in the world and can affect patients earlier in life than most other cancers. Therefore, it is imperative to be able to identify patients at high risk for melanoma and enroll them in screening programs to detect the cancer early. Electronic health records collect an enormous amount of data about real-world patient encounters, treatments, and outcomes. This data can be mined to increase our understanding of melanoma as well as build personalized models to predict risk of developing the cancer. Cancer risk models built from structured clinical data are limited in current research, with most studies involving just a few variables from institutional databases or registries. This dissertation presents data processing and machine learning approaches to build melanoma risk models from a large database of de-identified electronic health records. The database contains consistently captured structured data, enabling the extraction of hundreds of thousands of data points each from millions of patient records. Several experiments are performed to build effective models, particularly to predict sentinel lymph node metastasis in known melanoma patients and to predict individual risk of developing melanoma. Data for these models suffer from high dimensionality and class imbalance. Thus, classifiers such as logistic regression, support vector machines, random forest, and XGBoost are combined with advanced modeling techniques such as feature selection and data sampling. Risk factors are evaluated using regression model weights and decision trees, while personalized predictions are provided through random forest decomposition and Shapley additive explanations. Random undersampling on the melanoma risk dataset shows that many majority samples can be removed without a decrease in model performance. To determine how much data is truly needed, we explore learning curve approximation methods on the melanoma data and three publicly available large-scale biomedical datasets. We apply an inverse power law model as well as introduce a novel semi-supervised curve creation method that utilizes a small amount of labeled data.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013342
- Subject Headings
- Melanoma, Electronic Health Records, Machine learning--Technique, Big Data
- Format
- Document (PDF)
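The abstract above mentions fitting an inverse power law to approximate a learning curve and estimate how much data is needed. The exact functional form and fitting procedure are not given there, so the sketch below assumes the simplest variant, error(n) = a · n^(−b), fitted by ordinary least squares on log-transformed values; the function names and the synthetic measurements are illustrative only.

```python
import math


def fit_inverse_power_law(sizes, errors):
    """Fit error(n) = a * n**(-b) by linear regression in log-log space.

    sizes: training-set sizes; errors: observed error at each size
    (all values must be positive). Returns the pair (a, b).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    count = len(xs)
    mx, my = sum(xs) / count, sum(ys) / count
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    # log(error) = log(a) - b * log(n), so a = exp(intercept), b = -slope.
    return math.exp(intercept), -slope


def predict_error(a, b, n):
    """Extrapolate the fitted curve to a training-set size n."""
    return a * n ** (-b)


if __name__ == "__main__":
    sizes = [100, 200, 400, 800]
    errors = [2.0 * n ** -0.5 for n in sizes]  # synthetic, noise-free measurements
    a, b = fit_inverse_power_law(sizes, errors)
    print(a, b, predict_error(a, b, 1600))
```

Variants with an added asymptote term (an irreducible error floor) need a nonlinear fit instead of the log-log regression used here.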
- Title
- Big Data Analytics and Engineering for Medicare Fraud Detection.
- Creator
- Herland, Matthew Andrew, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
The United States (U.S.) healthcare system produces an enormous volume of data with a vast number of financial transactions generated by physicians administering healthcare services. This makes healthcare fraud difficult to detect, especially when there are considerably fewer fraudulent transactions than non-fraudulent. Fraud is an extremely important issue for healthcare, as fraudulent activities within the U.S. healthcare system contribute to significant financial losses. In the U.S., the elderly population continues to rise, increasing the need for programs, such as Medicare, to help with associated medical expenses. Unfortunately, due to healthcare fraud, these programs are being adversely affected, draining resources and reducing the quality and accessibility of necessary healthcare services. In response, advanced data analytics have recently been explored to detect possible fraudulent activities. The Centers for Medicare and Medicaid Services (CMS) released several ‘Big Data’ Medicare claims datasets for different parts of their Medicare program to help facilitate this effort. In this dissertation, we employ three CMS Medicare Big Data datasets to evaluate the fraud detection performance available using advanced data analytics techniques, specifically machine learning. We use two distinct approaches, designated as anomaly detection and traditional fraud detection, where each has very distinct data processing and feature engineering. Anomaly detection experiments classify by provider specialty, determining whether outlier physicians within the same specialty signal fraudulent behavior. Traditional fraud detection refers to the experiments directly classifying physicians as fraudulent or non-fraudulent, leveraging machine learning algorithms to discriminate between classes.
We present our novel data engineering approaches for both anomaly detection and traditional fraud detection including data processing, fraud mapping, and the creation of a combined dataset consisting of all three Medicare parts. We incorporate the List of Excluded Individuals and Entities database to identify real world fraudulent physicians for model evaluation. Regarding features, the final datasets for anomaly detection contain only claim counts for every procedure a physician submits while traditional fraud detection incorporates aggregated counts and payment information, specialty, and gender. Additionally, we compare cross-validation to the real world application of building a model on a training dataset and evaluating on a separate test dataset for severe class imbalance and rarity.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013215
- Subject Headings
- Big data, Medicare fraud, Data analytics, Machine learning
- Format
- Document (PDF)
- Title
- An Evaluation of Deep Learning with Class Imbalanced Big Data.
- Creator
- Johnson, Justin Matthew, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Effective classification with imbalanced data is an important area of research, as high class imbalance is naturally inherent in many real-world applications, e.g. anomaly detection. Modeling such skewed data distributions is often very difficult, and non-standard methods are sometimes required to combat these negative effects. These challenges have been studied thoroughly using traditional machine learning algorithms, but very little empirical work exists in the area of deep learning with class imbalanced big data. Following an in-depth survey of deep learning methods for addressing class imbalance, we evaluate various methods for addressing imbalance on the task of detecting Medicare fraud, a big data problem characterized by extreme class imbalance. Case studies herein demonstrate the impact of class imbalance on neural networks, evaluate the efficacy of data-level and algorithm-level methods, and achieve state-of-the-art results on the given Medicare data set. Results indicate that combining under-sampling and over-sampling maximizes both performance and efficiency.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013221
- Subject Headings
- Deep learning, Big data, Medicare fraud--Prevention
- Format
- Document (PDF)
- Title
- An Exploration into Synthetic Data and Generative Adversarial Networks.
- Creator
- Shorten, Connor M., Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
This Thesis surveys the landscape of Data Augmentation for image datasets. Completing this survey inspired further study into a method of generative modeling known as Generative Adversarial Networks (GANs). A survey on GANs was conducted to understand recent developments and the problems related to training them. Following this survey, four experiments were proposed to test the application of GANs for data augmentation and to contribute to the quality improvement in GAN-generated data. Experimental results demonstrate the effectiveness of GAN-generated data as a pre-training metric. The other experiments discuss important characteristics of GAN models such as the refining of prior information, transferring generative models from large datasets to small data, and automating the design of Deep Neural Networks within the context of the GAN framework. This Thesis will provide readers with a complete introduction to Data Augmentation and Generative Adversarial Networks, as well as insights into the future of these techniques.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013263
- Subject Headings
- Neural networks (Computer science), Computer vision, Images, Generative adversarial networks, Data sets
- Format
- Document (PDF)
- Title
- Machine Learning Algorithms with Big Medicare Fraud Data.
- Creator
- Bauder, Richard Andrew, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Healthcare is an integral component in people's lives, especially for the rising elderly population, and must be affordable. The United States Medicare program is vital in serving the needs of the elderly. The growing number of people enrolled in the Medicare program, along with the enormous volume of money involved, increases the appeal for, and risk of, fraudulent activities. For many real-world applications, including Medicare fraud, the interesting observations tend to be less frequent than the normative observations. This difference between the normal observations and those observations of interest can create highly imbalanced datasets. The problem of class imbalance, to include the classification of rare cases indicating extreme class imbalance, is an important and well-studied area in machine learning. Research on the effects of class imbalance with big data in the real-world Medicare fraud application domain, however, is limited. In particular, the impact of detecting fraud in Medicare claims is critical in lessening the financial and personal impacts of these transgressions. Fortunately, the healthcare domain is one such area where the successful detection of fraud can garner meaningful positive results. The application of machine learning techniques, plus methods to mitigate the adverse effects of class imbalance and rarity, can be used to detect fraud and lessen the impacts for all Medicare beneficiaries. This dissertation presents the application of machine learning approaches to detect Medicare provider claims fraud in the United States. We discuss novel techniques to process three big Medicare datasets and create a new, combined dataset, which includes mapping fraud labels associated with known excluded providers. We investigate the ability of machine learning techniques, unsupervised and supervised, to detect Medicare claims fraud and leverage data sampling methods to lessen the impact of class imbalance and increase fraud detection performance.
Additionally, we extend the study of class imbalance to assess the impacts of rare cases in big data for Medicare fraud detection.
- Date Issued
- 2018
- PURL
- http://purl.flvc.org/fau/fd/FA00013108
- Subject Headings
- Medicare fraud, Big data, Machine learning, Algorithms
- Format
- Document (PDF)
- Title
- Enhancement of Deep Neural Networks and Their Application to Text Mining.
- Creator
- Prusa, Joseph Daniel, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Many current application domains of machine learning and artificial intelligence involve knowledge discovery from text, such as sentiment analysis, document ontology, and spam detection. Humans have years of experience and training with language, enabling them to understand complicated, nuanced text passages with relative ease. A text classifier attempts to emulate or replicate this knowledge so that computers can discriminate between concepts encountered in text; however, learning high-level concepts from text, such as those found in many applications of text classification, is a challenging task due to the many challenges associated with text mining and classification. Recently, classifiers trained using artificial neural networks have been shown to be effective for a variety of text mining tasks. Convolutional neural networks have been trained to classify text from character-level input, automatically learning high-level abstract representations and avoiding the need for human engineered features. This dissertation proposes two new techniques for character-level learning, log(m) character embedding and convolutional window classification. Log(m) embedding is a new character-vector representation for text data that is more compact and memory efficient than previous embedding vectors. Convolutional window classification is a technique for classifying long documents, i.e. documents with lengths exceeding the input dimension of the neural network. Additionally, we investigate the performance of convolutional neural networks combined with long short-term memory networks, explore how document length impacts classification performance and compare performance of neural networks against non-neural network-based learners in text classification tasks.
- Date Issued
- 2018
- PURL
- http://purl.flvc.org/fau/fd/FA00005959
- Subject Headings
- Text Mining, Neural networks (Computer science), Machine learning
- Format
- Document (PDF)
- Title
- An evaluation of Unsupervised Machine Learning Algorithms for Detecting Fraud and Abuse in the U.S. Medicare Insurance Program.
- Creator
- Da Rosa, Raquel C., Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
The population of people ages 65 and older has increased since the 1960s and current estimates indicate it will double by 2060. Medicare is a federal health insurance program for people 65 or older in the United States. Medicare claims fraud and abuse is an ongoing issue that wastes a large amount of money every year resulting in higher health care costs and taxes for everyone. In this study, an empirical evaluation of several unsupervised machine learning approaches is performed which indicates reasonable fraud detection results. We employ two unsupervised machine learning algorithms, Isolation Forest and Unsupervised Random Forest, which have not been previously used for the detection of fraud and abuse on Medicare data. Additionally, we implement three other machine learning methods previously applied on Medicare data which include: Local Outlier Factor, Autoencoder, and k-Nearest Neighbor. For our dataset, we combine the 2012 to 2015 Medicare provider utilization and payment data and add fraud labels from the List of Excluded Individuals/Entities (LEIE) database. Results show that Local Outlier Factor is the best model to use for Medicare fraud detection.
- Date Issued
- 2018
- PURL
- http://purl.flvc.org/fau/fd/FA00013042
- Subject Headings
- Machine learning, Medicare fraud, Algorithms
- Format
- Document (PDF)
- Title
- Machine Learning Algorithms for the Analysis of Social Media and Detection of Malicious User Generated Content.
- Creator
- Heredia, Brian, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
One of the defining characteristics of the modern Internet is its massive connectedness, with information and human connection simply a few clicks away. Social media and online retailers have revolutionized how we communicate and purchase goods or services. User generated content on the web, through social media, plays a large role in modern society; Twitter has been in the forefront of political discourse, with politicians choosing it as their platform for disseminating information, while websites like Amazon and Yelp allow users to share their opinions on products via online reviews. The information available through these platforms can provide insight into a host of relevant topics through the process of machine learning. Specifically, this process involves text mining for sentiment analysis, which is an application domain of machine learning involving the extraction of emotion from text. Unfortunately, there are still those with malicious intent, and with the changes to how we communicate and conduct business come changes to their malicious practices. Social bots and fake reviews plague the web, providing incorrect information and swaying the opinion of unaware readers. The detection of these false users or posts from reading the text is difficult, if not impossible, for humans. Fortunately, text mining provides us with methods for the detection of harmful user generated content. This dissertation expands the current research in sentiment analysis, fake online review detection and election prediction. We examine cross-domain sentiment analysis using tweets and reviews. Novel techniques combining ensemble and feature selection methods are proposed for the domain of online spam review detection. We investigate the ability of the Twitter platform to predict the United States 2016 presidential election. In addition, we determine how social bots influence this prediction.
- Date Issued
- 2018
- PURL
- http://purl.flvc.org/fau/fd/FA00013067
- Subject Headings
- Machine learning, Text mining, User-generated content, Social media
- Format
- Document (PDF)
- Title
- Parallel Distributed Deep Learning on Cluster Computers.
- Creator
- Kennedy, Robert Kwan Lee, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Deep Learning is an increasingly important subdomain of artificial intelligence. Deep Learning architectures, artificial neural networks characterized by having both a large breadth of neurons and a large depth of layers, benefit from training on Big Data. The size and complexity of the model combined with the size of the training data make the training procedure very computationally and temporally expensive. Accelerating the training procedure of Deep Learning using cluster computers faces many challenges ranging from distributed optimizers to the large communication overhead specific to a system with off-the-shelf networking components. In this thesis, we present a novel synchronous data parallel distributed Deep Learning implementation on HPCC Systems, a cluster computer system. We discuss research that has been conducted on the distribution and parallelization of Deep Learning, as well as the concerns relating to cluster environments. Additionally, we provide case studies that evaluate and validate our implementation.
- Date Issued
- 2018
- PURL
- http://purl.flvc.org/fau/fd/FA00013080
- Subject Headings
- Deep learning, Neural networks (Computer science), Artificial intelligence, Machine learning
- Format
- Document (PDF)
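Note on the distributed Deep Learning record above: the abstract describes synchronous data parallel training. The sketch below illustrates that general scheme in a single Python process and is not the thesis's HPCC Systems implementation. Each simulated worker computes a gradient on its own data shard using the same weight, the gradients are averaged, and one shared update is applied. The one-parameter linear model, the function names, the learning rate, and the toy data are all illustrative assumptions.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def sync_data_parallel_sgd(shards, w=0.0, lr=0.01, steps=200):
    """Simulate synchronous data-parallel SGD over equally sized shards."""
    for _ in range(steps):
        # Every worker evaluates its gradient against the same current weight.
        grads = [gradient(w, shard) for shard in shards]
        # Synchronization point: average the gradients, then update once.
        w -= lr * sum(grads) / len(grads)
    return w

# Toy data generated from y = 3x, split across two simulated workers.
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[0::2], data[1::2]]

w = sync_data_parallel_sgd(shards)
print(w)
```

Because the shards are the same size, the averaged gradient equals the full-batch gradient, so this simulation follows the same trajectory as single-machine gradient descent. On a real cluster the averaging step is where communication overhead is incurred.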