Current Search: Khoshgoftaar, Taghi M.
- Title
- A REVIEW AND ANALYSIS OF BOT-IOT SECURITY DATA FOR MACHINE LEARNING.
- Creator
- Peterson, Jared M., Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
- Machine learning is having an increased impact on the Cyber Security landscape. The ability for predictive models to accurately identify attack patterns in security data is set to overtake more traditional detection methods. Industry demand has led to an uptick in research in the application of machine learning for Cyber Security. To facilitate this research many datasets have been created and made public. This thesis provides an in-depth analysis of one of the newest datasets, Bot-IoT. The full dataset contains about 73 million instances (big data), 3 dependent features, and 43 independent features. The purpose of this thesis is to provide researchers with a foundational understanding of Bot-IoT, its development, its features, its composition, and its pitfalls. It will also summarize many of the published works that utilize Bot-IoT and will propose new areas of research based on the issues identified in the current research and in the dataset.
- Date Issued
- 2021
- PURL
- http://purl.flvc.org/fau/fd/FA00013838
- Subject Headings
- Machine learning, Cyber security, Big data
- Format
- Document (PDF)
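As a rough companion to the abstract above, the sketch below streams a Bot-IoT CSV in chunks and tallies the three label columns. The file name and the label-column names ('attack', 'category', 'subcategory') are assumptions and may need adjusting to the actual dataset export.

```python
# Minimal sketch: stream a Bot-IoT CSV in chunks and tally the label columns.
import pandas as pd

label_cols = ["attack", "category", "subcategory"]   # assumed dependent features
counts = {c: pd.Series(dtype="float64") for c in label_cols}

for chunk in pd.read_csv("UNSW_2018_IoT_Botnet_Full.csv",   # hypothetical path
                         usecols=label_cols, chunksize=1_000_000):
    for c in label_cols:
        counts[c] = counts[c].add(chunk[c].value_counts(), fill_value=0)

for c in label_cols:
    print(f"\n{c} distribution:\n{counts[c].sort_values(ascending=False)}")
```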
- Title
- ADDRESSING HIGHLY IMBALANCED BIG DATA CHALLENGES FOR MEDICARE FRAUD DETECTION.
- Creator
- Johnson, Justin M., Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
- Access to affordable healthcare is a nationwide concern that impacts most of the United States population. Medicare is a federal government healthcare program that aims to provide affordable health insurance to the elderly population and individuals with select disabilities. Unfortunately, there is a significant amount of fraud, waste, and abuse within the Medicare system that inevitably raises premiums and costs taxpayers billions of dollars each year. Dedicated task forces investigate the most severe fraudulent cases, but with millions of healthcare providers and more than 60 million active Medicare beneficiaries, manual fraud detection efforts are not able to make widespread, meaningful impact. Through the proliferation of electronic health records and continuous breakthroughs in data mining and machine learning, there is a great opportunity to develop and leverage advanced machine learning systems for automating healthcare fraud detection. This dissertation identifies key challenges associated with predictive modeling for large-scale Medicare fraud detection and presents innovative solutions to address these challenges in order to provide state-of-the-art results on multiple real-world Medicare fraud data sets. Our methodology for curating nine distinct Medicare fraud classification data sets is presented with comprehensive details describing data accumulation, data pre-processing, data aggregation techniques, data enrichment strategies, and improved fraud labeling. Data-level and algorithm-level methods for treating severe class imbalance, including a flexible output thresholding method and a cost-sensitive framework, are evaluated using deep neural network and ensemble learners. Novel encoding techniques and representation learning methods for high-dimensional categorical features are proposed to create expressive representations of provider attributes and billing procedure codes.
- Date Issued
- 2022
- PURL
- http://purl.flvc.org/fau/fd/FA00014057
- Subject Headings
- Medicare fraud, Big data, Machine learning
- Format
- Document (PDF)
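The output-thresholding idea mentioned in the abstract above can be illustrated with a small, self-contained sketch: train any probabilistic classifier on highly imbalanced data, then pick the decision threshold that maximizes a balanced metric on validation scores. Synthetic data and logistic regression stand in for the Medicare features and deep networks used in the dissertation.

```python
# Minimal sketch of output thresholding for severe class imbalance: sweep the
# decision threshold and keep the one maximizing the geometric mean of TPR/TNR.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=50_000, n_features=30, n_informative=10,
                           weights=[0.99, 0.01], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_val)[:, 1]

def gmean(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return np.sqrt(tpr * tnr)

thresholds = np.linspace(0.01, 0.99, 99)
best_t = max(thresholds, key=lambda t: gmean(y_val, (scores >= t).astype(int)))
print(f"best threshold {best_t:.2f}, "
      f"g-mean {gmean(y_val, (scores >= best_t).astype(int)):.3f} "
      f"(default 0.5 g-mean {gmean(y_val, (scores >= 0.5).astype(int)):.3f})")
```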
- Title
- An evaluation of machine learning algorithms for tweet sentiment analysis.
- Creator
- Prusa, Joseph D., Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Sentiment analysis of tweets is an application of mining Twitter, and is growing in popularity as a means of determining public opinion. Machine learning algorithms are used to perform sentiment analysis; however, data quality issues such as high dimensionality, class imbalance or noise may negatively impact classifier performance. Machine learning techniques exist for targeting these problems, but have not been applied to this domain, or have not been studied in detail. In this thesis we discuss research that has been conducted on tweet sentiment classification, its accompanying data concerns, and methods of addressing these concerns. We test the impact of feature selection, data sampling and ensemble techniques in an effort to improve classifier performance. We also evaluate the combination of feature selection and ensemble techniques and examine the effects of high dimensionality when combining multiple types of features. Additionally, we provide strategies and insights for potential avenues of future work.
- Date Issued
- 2015
- PURL
- http://purl.flvc.org/fau/fd/FA00004460
- Subject Headings
- Social media, Natural language processing (Computer science), Machine learning, Algorithms, Fuzzy expert systems, Artificial intelligence
- Format
- Document (PDF)
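A minimal sketch of the kind of pipeline evaluated in the thesis above: bag-of-words features, chi-squared feature selection to cut dimensionality, and a soft-voting ensemble. The inline tweets and the specific learners are placeholders, not the thesis's actual experimental setup.

```python
# Minimal sketch: TF-IDF features -> chi-squared feature selection -> ensemble.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

tweets = ["love this phone", "worst service ever", "so happy today",
          "terrible update ruined everything", "great game last night",
          "awful traffic and rain"]
labels = [1, 0, 1, 0, 1, 0]                      # 1 = positive, 0 = negative

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("select", SelectKBest(chi2, k=10)),          # keep the 10 best features
    ("ensemble", VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("nb", MultinomialNB())],
        voting="soft")),
])
pipeline.fit(tweets, labels)
print(pipeline.predict(["this is great", "this is awful"]))
```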
- Title
- An evaluation of Unsupervised Machine Learning Algorithms for Detecting Fraud and Abuse in the U.S. Medicare Insurance Program.
- Creator
- Da Rosa, Raquel C., Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- The population of people ages 65 and older has increased since the 1960s and current estimates indicate it will double by 2060. Medicare is a federal health insurance program for people 65 or older in the United States. Medicare claims fraud and abuse is an ongoing issue that wastes a large amount of money every year resulting in higher health care costs and taxes for everyone. In this study, an empirical evaluation of several unsupervised machine learning approaches is performed which indicates reasonable fraud detection results. We employ two unsupervised machine learning algorithms, Isolation Forest and Unsupervised Random Forest, which have not been previously used for the detection of fraud and abuse on Medicare data. Additionally, we implement three other machine learning methods previously applied on Medicare data which include: Local Outlier Factor, Autoencoder, and k-Nearest Neighbor. For our dataset, we combine the 2012 to 2015 Medicare provider utilization and payment data and add fraud labels from the List of Excluded Individuals/Entities (LEIE) database. Results show that Local Outlier Factor is the best model to use for Medicare fraud detection.
- Date Issued
- 2018
- PURL
- http://purl.flvc.org/fau/fd/FA00013042
- Subject Headings
- Machine learning, Medicare fraud, Algorithms
- Format
- Document (PDF)
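Two of the unsupervised detectors named in the abstract above, Isolation Forest and Local Outlier Factor, are available in scikit-learn; the sketch below applies them to synthetic provider-style feature vectors rather than the real Medicare utilization data.

```python
# Minimal sketch: flag outliers with Isolation Forest and Local Outlier Factor.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(2000, 8))        # typical providers
outliers = rng.normal(6.0, 1.0, size=(20, 8))        # unusually extreme billing
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
iso_flags = iso.predict(X) == -1                      # -1 marks outliers

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
lof_flags = lof.fit_predict(X) == -1

print("IsolationForest flagged:", iso_flags.sum())
print("LocalOutlierFactor flagged:", lof_flags.sum())
print("flagged by both:", (iso_flags & lof_flags).sum())
```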
- Title
- An Evaluation of Deep Learning with Class Imbalanced Big Data.
- Creator
- Johnson, Justin Matthew, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Effective classification with imbalanced data is an important area of research, as high class imbalance is naturally inherent in many real-world applications, e.g. anomaly detection. Modeling such skewed data distributions is often very difficult, and non-standard methods are sometimes required to combat these negative effects. These challenges have been studied thoroughly using traditional machine learning algorithms, but very little empirical work exists in the area of deep learning with class imbalanced big data. Following an in-depth survey of deep learning methods for addressing class imbalance, we evaluate various methods for addressing imbalance on the task of detecting Medicare fraud, a big data problem characterized by extreme class imbalance. Case studies herein demonstrate the impact of class imbalance on neural networks, evaluate the efficacy of data-level and algorithm-level methods, and achieve state-of-the-art results on the given Medicare data set. Results indicate that combining under-sampling and over-sampling maximizes both performance and efficiency.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013221
- Subject Headings
- Deep learning, Big data, Medicare fraud--Prevention
- Format
- Document (PDF)
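A minimal sketch of the "combine under-sampling and over-sampling" finding, using the imbalanced-learn package (an assumption; the thesis's own tooling may differ) with a small MLP standing in for a deep network.

```python
# Minimal sketch: SMOTE over-sampling followed by random under-sampling.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=20_000, n_features=20, n_informative=8,
                           weights=[0.995, 0.005], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=1)

# Over-sample the minority class part of the way with SMOTE, then under-sample
# the majority class down to a 2:1 ratio.
X_os, y_os = SMOTE(sampling_strategy=0.1, random_state=1).fit_resample(X_tr, y_tr)
X_bal, y_bal = RandomUnderSampler(sampling_strategy=0.5,
                                  random_state=1).fit_resample(X_os, y_os)

clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300,
                    random_state=1).fit(X_bal, y_bal)
print("test ROC AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```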
- Title
- Analysis of machine learning algorithms on bioinformatics data of varying quality.
- Creator
- Shanab, Ahmad Abu, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- One of the main applications of machine learning in bioinformatics is the construction of classification models which can accurately classify new instances using information gained from previous instances. With the help of machine learning algorithms (such as supervised classification and gene selection) new meaningful knowledge can be extracted from bioinformatics datasets that can help in disease diagnosis and prognosis as well as in prescribing the right treatment for a disease. One particular challenge encountered when analyzing bioinformatics datasets is data noise, which refers to incorrect or missing values in datasets. Noise can be introduced as a result of experimental errors (e.g. faulty microarray chips, insufficient resolution, image corruption, and incorrect laboratory procedures), as well as other errors (errors during data processing, transfer, and/or mining). A special type of data noise, called class noise, occurs when an instance/example is mislabeled. Previous research showed that class noise has a detrimental impact on machine learning algorithms (e.g. worsened classification performance and unstable feature selection). In addition to data noise, gene expression datasets can suffer from the problems of high dimensionality (a very large feature space) and class imbalance (unequal distribution of instances between classes). As a result of these inherent problems, constructing accurate classification models becomes more challenging.
- Date Issued
- 2015
- PURL
- http://purl.flvc.org/fau/fd/FA00004425
- Subject Headings
- Artificial intelligence, Bioinformatics, Machine learning, System design, Theory of computation
- Format
- Document (PDF)
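The class-noise problem described above can be demonstrated by flipping a fraction of training labels and watching test performance degrade; the sketch below does this on synthetic data standing in for gene-expression measurements.

```python
# Minimal sketch: inject class noise into training labels and measure the impact.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=200, n_informative=20,
                           weights=[0.8, 0.2], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=7)

rng = np.random.default_rng(7)
for noise_rate in (0.0, 0.1, 0.2, 0.3):
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < noise_rate       # mislabel this fraction
    y_noisy[flip] = 1 - y_noisy[flip]
    clf = RandomForestClassifier(n_estimators=200, random_state=7)
    clf.fit(X_tr, y_noisy)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"class-noise rate {noise_rate:.0%}: test AUC {auc:.3f}")
```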
- Title
- Big Data Analytics and Engineering for Medicare Fraud Detection.
- Creator
- Herland, Matthew Andrew, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- The United States (U.S.) healthcare system produces an enormous volume of data with a vast number of financial transactions generated by physicians administering healthcare services. This makes healthcare fraud difficult to detect, especially when there are considerably fewer fraudulent transactions than non-fraudulent ones. Fraud is an extremely important issue for healthcare, as fraudulent activities within the U.S. healthcare system contribute to significant financial losses. In the U.S., the elderly population continues to rise, increasing the need for programs, such as Medicare, to help with associated medical expenses. Unfortunately, due to healthcare fraud, these programs are being adversely affected, draining resources and reducing the quality and accessibility of necessary healthcare services. In response, advanced data analytics have recently been explored to detect possible fraudulent activities. The Centers for Medicare and Medicaid Services (CMS) released several ‘Big Data’ Medicare claims datasets for different parts of their Medicare program to help facilitate this effort. In this dissertation, we employ three CMS Medicare Big Data datasets to evaluate the fraud detection performance available using advanced data analytics techniques, specifically machine learning. We use two distinct approaches, designated as anomaly detection and traditional fraud detection, each of which has very distinct data processing and feature engineering. Anomaly detection experiments classify by provider specialty, determining whether outlier physicians within the same specialty signal fraudulent behavior. Traditional fraud detection refers to the experiments directly classifying physicians as fraudulent or non-fraudulent, leveraging machine learning algorithms to discriminate between classes. We present our novel data engineering approaches for both anomaly detection and traditional fraud detection, including data processing, fraud mapping, and the creation of a combined dataset consisting of all three Medicare parts. We incorporate the List of Excluded Individuals and Entities database to identify real-world fraudulent physicians for model evaluation. Regarding features, the final datasets for anomaly detection contain only claim counts for every procedure a physician submits, while traditional fraud detection incorporates aggregated counts and payment information, specialty, and gender. Additionally, we compare cross-validation to the real-world application of building a model on a training dataset and evaluating on a separate test dataset for severe class imbalance and rarity.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013215
- Subject Headings
- Big data, Medicare fraud, Data analytics, Machine learning
- Format
- Document (PDF)
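A minimal sketch of the fraud-mapping step described above: join aggregated CMS provider data to the LEIE exclusion list on NPI and label matched providers as fraudulent. File paths and column names are assumptions, and the dissertation additionally filters exclusions by type and date.

```python
# Minimal sketch: derive fraud labels by matching provider NPIs against LEIE.
import pandas as pd

claims = pd.read_csv("medicare_partb_aggregated.csv")     # hypothetical path
leie = pd.read_csv("LEIE.csv", usecols=["NPI"])            # hypothetical columns

excluded_npis = set(leie["NPI"].dropna().astype("int64"))
excluded_npis.discard(0)                                   # 0 is used when no NPI is listed

claims["fraud_label"] = claims["npi"].isin(excluded_npis).astype(int)
print(claims["fraud_label"].value_counts())
```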
- Title
- Design of a Test Framework for the Evaluation of Transfer Learning Algorithms.
- Creator
- Weiss, Karl Robert, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- A traditional machine learning environment is characterized by the training and testing data being drawn from the same domain, therefore, having similar distribution characteristics. In contrast, a transfer learning environment is characterized by the training data having different distribution characteristics from the testing data. Previous research on transfer learning has focused on the development and evaluation of transfer learning algorithms using real-world datasets. Testing with real-world datasets exposes an algorithm to a limited number of data distribution differences and does not exercise an algorithm's full capability and boundary limitations. In this research, we define, implement, and deploy a transfer learning test framework to test machine learning algorithms. The transfer learning test framework is designed to create a wide range of distribution differences that are typically encountered in a transfer learning environment. By testing with many different distribution differences, an algorithm's strong and weak points can be discovered and evaluated against other algorithms. This research additionally performs case studies that use the transfer learning test framework. The first case study focuses on measuring the impact of exposing algorithms to the Domain Class Imbalance distortion profile. The next case study uses the entire transfer learning test framework to evaluate both transfer learning and traditional machine learning algorithms. The final case study uses the transfer learning test framework in conjunction with real-world datasets to measure the impact of the base traditional learner on the performance of transfer learning algorithms. Two additional experiments are performed that are focused on using unique real-world datasets. The first experiment uses transfer learning techniques to predict fraudulent Medicare claims. The second experiment uses a heterogeneous transfer learning method to predict phishing webpages. These case studies will be of interest to researchers who develop and improve transfer learning algorithms. This research will also be of benefit to machine learning practitioners in the selection of high-performing transfer learning algorithms.
- Date Issued
- 2017
- PURL
- http://purl.flvc.org/fau/fd/FA00005925
- Subject Headings
- Dissertations, Academic--Florida Atlantic University, Machine learning, Algorithms, Machine learning--Development
- Format
- Document (PDF)
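One distortion profile in the spirit of the framework above, domain class imbalance, can be simulated by generating source and target domains that share the same concept but differ in class proportions; the generator and proportions below are illustrative assumptions.

```python
# Minimal sketch: train on a balanced source domain, test on skewed target domains.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def make_domain(minority_fraction):
    # A fixed random_state keeps the class centroids (the "concept") identical
    # across domains; only the class mixture changes.
    return make_classification(n_samples=5000, n_features=20, n_informative=10,
                               weights=[1 - minority_fraction, minority_fraction],
                               flip_y=0.01, random_state=3)

X_src, y_src = make_domain(0.50)                      # balanced source domain
clf = LogisticRegression(max_iter=1000).fit(X_src, y_src)

for target_fraction in (0.20, 0.05, 0.01):
    X_tgt, y_tgt = make_domain(target_fraction)       # increasingly skewed targets
    auc = roc_auc_score(y_tgt, clf.predict_proba(X_tgt)[:, 1])
    print(f"target minority fraction {target_fraction:.0%}: AUC {auc:.3f}")
```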
- Title
- Enhancement of Deep Neural Networks and Their Application to Text Mining.
- Creator
- Prusa, Joseph Daniel, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Many current application domains of machine learning and artificial intelligence involve knowledge discovery from text, such as sentiment analysis, document ontology, and spam detection. Humans have years of experience and training with language, enabling them to understand complicated, nuanced text passages with relative ease. A text classifier attempts to emulate or replicate this knowledge so that computers can discriminate between concepts encountered in text; however, learning high-level concepts from text, such as those found in many applications of text classification, is a challenging task due to the many challenges associated with text mining and classification. Recently, classifiers trained using artificial neural networks have been shown to be effective for a variety of text mining tasks. Convolutional neural networks have been trained to classify text from character-level input, automatically learning high-level abstract representations and avoiding the need for human-engineered features. This dissertation proposes two new techniques for character-level learning, log(m) character embedding and convolutional window classification. Log(m) embedding is a new character-vector representation for text data that is more compact and memory efficient than previous embedding vectors. Convolutional window classification is a technique for classifying long documents, i.e. documents with lengths exceeding the input dimension of the neural network. Additionally, we investigate the performance of convolutional neural networks combined with long short-term memory networks, explore how document length impacts classification performance and compare performance of neural networks against non-neural network-based learners in text classification tasks.
- Date Issued
- 2018
- PURL
- http://purl.flvc.org/fau/fd/FA00005959
- Subject Headings
- Text Mining, Neural networks (Computer science), Machine learning
- Format
- Document (PDF)
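A possible reading of the log(m) character embedding described above: encode an alphabet of m characters with ceil(log2 m) binary dimensions per character instead of an m-dimensional one-hot vector. The exact bit assignment in the dissertation may differ; this is only an illustration.

```python
# Minimal sketch of a compact log(m)-style binary character encoding.
import math
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 .,!?'"   # 42 characters
bits = math.ceil(math.log2(len(alphabet) + 1))            # 6 dims vs. 42 for one-hot

def encode(text, max_len=64):
    """Encode a string as a (max_len, bits) binary matrix; padding stays all-zero."""
    out = np.zeros((max_len, bits), dtype=np.float32)
    for i, ch in enumerate(text.lower()[:max_len]):
        idx = alphabet.find(ch)
        if idx < 0:
            continue                          # characters outside the alphabet stay zero
        code = idx + 1                        # reserve the all-zero code for padding
        for b in range(bits):
            out[i, b] = (code >> b) & 1       # binary expansion of the character index
    return out

x = encode("hello, world!")
print("log(m) shape:", x.shape, "vs. one-hot shape:", (64, len(alphabet)))
```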
- Title
- Wavelet de-noising applied to vibrational envelope analysis methods.
- Creator
- Bertot, Edward Max, Khoshgoftaar, Taghi M., Beaujean, Pierre-Philippe, Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- In the field of machine prognostics, vibration analysis is a proven method for detecting and diagnosing bearing faults in rotating machines. One popular method for interpreting vibration signals is envelope demodulation, which allows a technician to clearly identify an impulsive fault source and its severity. However, incipient faults - faults in early stages - are masked by in-band noise, which can make the associated impulses difficult to detect and interpret. In this thesis, Wavelet De-Noising (WDN) is implemented after envelope demodulation to improve the accuracy of bearing fault diagnostics. This contrasts with the typical approach of de-noising as a preprocessing step. When manually measuring time-domain impulse amplitudes, the algorithm shows varying improvements in Signal-to-Noise Ratio (SNR) relative to background vibrational noise. A frequency-domain measure of SNR agrees with this result.
- Date Issued
- 2014
- PURL
- http://purl.flvc.org/fau/fd/FA00004080
- Subject Headings
- Fluid dynamics, Signal processing, Structural dynamics, Wavelet (Mathematics)
- Format
- Document (PDF)
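A minimal sketch of the processing order studied in the thesis above: envelope demodulation first (Hilbert transform), wavelet de-noising second. The synthetic signal, wavelet choice ('db4'), and universal soft threshold are illustrative assumptions; SciPy and PyWavelets are required.

```python
# Minimal sketch: envelope demodulation followed by wavelet de-noising.
import numpy as np
import pywt
from scipy.signal import hilbert

fs = 20_000                                   # sample rate (Hz)
t = np.arange(0, 1.0, 1 / fs)
carrier = np.sin(2 * np.pi * 3000 * t)        # resonance excited by bearing impacts
impulses = (np.sin(2 * np.pi * 97 * t) > 0.995).astype(float)   # ~97 Hz fault impacts
signal = impulses * carrier + 0.3 * np.random.default_rng(0).normal(size=t.size)

envelope = np.abs(hilbert(signal))            # 1) envelope demodulation

coeffs = pywt.wavedec(envelope, "db4", level=5)        # 2) wavelet de-noising
sigma = np.median(np.abs(coeffs[-1])) / 0.6745         # noise estimate from finest scale
thresh = sigma * np.sqrt(2 * np.log(envelope.size))    # universal threshold
coeffs = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
denoised = pywt.waverec(coeffs, "db4")[: envelope.size]

print("raw envelope std:", envelope.std(), "denoised std:", denoised.std())
```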
- Title
- PREDICTING MELANOMA RISK FROM ELECTRONIC HEALTH RECORDS WITH MACHINE LEARNING TECHNIQUES.
- Creator
- Richter, Aaron N., Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Melanoma is one of the fastest growing cancers in the world, and can affect patients earlier in life than most other cancers. Therefore, it is imperative to be able to identify patients at high risk for melanoma and enroll them in screening programs to detect the cancer early. Electronic health records collect an enormous amount of data about real-world patient encounters, treatments, and outcomes. This data can be mined to increase our understanding of melanoma as well as build personalized models to predict risk of developing the cancer. Cancer risk models built from structured clinical data are limited in current research, with most studies involving just a few variables from institutional databases or registries. This dissertation presents data processing and machine learning approaches to build melanoma risk models from a large database of de-identified electronic health records. The database contains consistently captured structured data, enabling the extraction of hundreds of thousands of data points each from millions of patient records. Several experiments are performed to build effective models, particularly to predict sentinel lymph node metastasis in known melanoma patients and to predict individual risk of developing melanoma. Data for these models suffer from high dimensionality and class imbalance. Thus, classifiers such as logistic regression, support vector machines, random forest, and XGBoost are combined with advanced modeling techniques such as feature selection and data sampling. Risk factors are evaluated using regression model weights and decision trees, while personalized predictions are provided through random forest decomposition and Shapley additive explanations. Random undersampling on the melanoma risk dataset shows that many majority samples can be removed without a decrease in model performance. To determine how much data is truly needed, we explore learning curve approximation methods on the melanoma data and three publicly-available large-scale biomedical datasets. We apply an inverse power law model as well as introduce a novel semi-supervised curve creation method that utilizes a small amount of labeled data.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013342
- Subject Headings
- Melanoma, Electronic Health Records, Machine learning--Technique, Big Data
- Format
- Document (PDF)
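The inverse power law learning-curve model mentioned above can be fit with a few lines of SciPy: measure error at increasing training-set sizes, fit error(n) = a * n^(-b) + c, and extrapolate. Synthetic data stands in for the melanoma and public biomedical datasets.

```python
# Minimal sketch: fit an inverse-power-law learning curve and extrapolate.
import numpy as np
from scipy.optimize import curve_fit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def inverse_power_law(n, a, b, c):
    return a * np.power(n, -b) + c

X, y = make_classification(n_samples=30_000, n_features=40, n_informative=15,
                           weights=[0.9, 0.1], random_state=5)
X_pool, X_te, y_pool, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=5)

sizes = np.array([250, 500, 1000, 2000, 4000, 8000])
errors = []
for n in sizes:
    clf = LogisticRegression(max_iter=1000).fit(X_pool[:n], y_pool[:n])
    errors.append(1 - roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

params, _ = curve_fit(inverse_power_law, sizes, errors, p0=(1.0, 0.5, 0.01),
                      maxfev=10_000)
print("fitted (a, b, c):", np.round(params, 4))
print("predicted error at n=16000:", inverse_power_law(16_000, *params))
```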
- Title
- Parallel Distributed Deep Learning on Cluster Computers.
- Creator
- Kennedy, Robert Kwan Lee, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Deep Learning is an increasingly important subdomain of artificial intelligence. Deep Learning architectures, artificial neural networks characterized by having both a large breadth of neurons and a large depth of layers, benefit from training on Big Data. The size and complexity of the model combined with the size of the training data make the training procedure very computationally and temporally expensive. Accelerating the training procedure of Deep Learning using cluster computers faces many challenges ranging from distributed optimizers to the large communication overhead specific to a system with off-the-shelf networking components. In this thesis, we present a novel synchronous data parallel distributed Deep Learning implementation on HPCC Systems, a cluster computer system. We discuss research that has been conducted on the distribution and parallelization of Deep Learning, as well as the concerns relating to cluster environments. Additionally, we provide case studies that evaluate and validate our implementation.
- Date Issued
- 2018
- PURL
- http://purl.flvc.org/fau/fd/FA00013080
- Subject Headings
- Deep learning, Neural networks (Computer science), Artificial intelligence, Machine learning
- Format
- Document (PDF)
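A minimal sketch of synchronous data-parallel training as described above: each simulated worker computes a gradient on its own shard, the gradients are averaged (an all-reduce on a real cluster such as HPCC Systems), and all replicas apply the same update. A linear model keeps the example short.

```python
# Minimal sketch: simulated synchronous data-parallel SGD with gradient averaging.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0, 0.5])
X = rng.normal(size=(4096, 3))
y = X @ true_w + 0.1 * rng.normal(size=4096)

n_workers, lr = 4, 0.1
w = np.zeros(3)                                    # replicated model parameters
shards = np.array_split(np.arange(X.shape[0]), n_workers)

for step in range(200):
    grads = []
    for shard in shards:                           # in reality these run in parallel
        Xs, ys = X[shard], y[shard]
        grads.append(2 * Xs.T @ (Xs @ w - ys) / len(shard))   # per-shard MSE gradient
    w -= lr * np.mean(grads, axis=0)               # synchronous all-reduce average

print("learned weights:", np.round(w, 3))
```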
- Title
- Machine Learning Algorithms with Big Medicare Fraud Data.
- Creator
- Bauder, Richard Andrew, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Healthcare is an integral component in people's lives, especially for the rising elderly population, and must be affordable. The United States Medicare program is vital in serving the needs of the elderly. The growing number of people enrolled in the Medicare program, along with the enormous volume of money involved, increases the appeal for, and risk of, fraudulent activities. For many real-world applications, including Medicare fraud, the interesting observations tend to be less frequent than the normative observations. This difference between the normal observations and those observations of interest can create highly imbalanced datasets. The problem of class imbalance, to include the classification of rare cases indicating extreme class imbalance, is an important and well-studied area in machine learning. Research on the effects of class imbalance with big data in the real-world Medicare fraud application domain, however, is limited. In particular, the impact of detecting fraud in Medicare claims is critical in lessening the financial and personal impacts of these transgressions. Fortunately, the healthcare domain is one such area where the successful detection of fraud can garner meaningful positive results. The application of machine learning techniques, plus methods to mitigate the adverse effects of class imbalance and rarity, can be used to detect fraud and lessen the impacts for all Medicare beneficiaries. This dissertation presents the application of machine learning approaches to detect Medicare provider claims fraud in the United States. We discuss novel techniques to process three big Medicare datasets and create a new, combined dataset, which includes mapping fraud labels associated with known excluded providers. We investigate the ability of machine learning techniques, unsupervised and supervised, to detect Medicare claims fraud and leverage data sampling methods to lessen the impact of class imbalance and increase fraud detection performance. Additionally, we extend the study of class imbalance to assess the impacts of rare cases in big data for Medicare fraud detection.
- Date Issued
- 2018
- PURL
- http://purl.flvc.org/fau/fd/FA00013108
- Subject Headings
- Medicare fraud, Big data, Machine learning, Algorithms
- Format
- Document (PDF)
- Title
- Machine learning algorithms for the analysis and detection of network attacks.
- Creator
- Najafabadi, Maryam Mousaarab, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- The Internet and computer networks have become an important part of our organizations and everyday life. With the increase in our dependence on computers and communication networks, malicious activities have become increasingly prevalent. Network attacks are an important problem in today’s communication environments. The network traffic must be monitored and analyzed to detect malicious activities and attacks to ensure reliable functionality of the networks and security of users’ information. Recently, machine learning techniques have been applied toward the detection of network attacks. Machine learning models are able to extract similarities and patterns in the network traffic. Unlike signature based methods, there is no need for manual analyses to extract attack patterns. Applying machine learning algorithms can automatically build predictive models for the detection of network attacks. This dissertation reports an empirical analysis of the usage of machine learning methods for the detection of network attacks. For this purpose, we study the detection of three common attacks in computer networks: SSH brute force, Man In The Middle (MITM) and application layer Distributed Denial of Service (DDoS) attacks. Using outdated and non-representative benchmark data, such as the DARPA dataset, in the intrusion detection domain, has caused a practical gap between building detection models and their actual deployment in a real computer network. To alleviate this limitation, we collect representative network data from a real production network for each attack type. Our analysis of each attack includes a detailed study of the usage of machine learning methods for its detection. This includes the motivation behind the proposed machine learning based detection approach, the data collection process, feature engineering, building predictive models and evaluating their performance. We also investigate the application of feature selection in building detection models for network attacks. Overall, this dissertation presents a thorough analysis on how machine learning techniques can be used to detect network attacks. We not only study a broad range of network attacks, but also study the application of different machine learning methods including classification, anomaly detection and feature selection for their detection at the host level and the network level.
- Date Issued
- 2017
- PURL
- http://purl.flvc.org/fau/fd/FA00004882
- Subject Headings
- Machine learning, Computer security, Data protection, Computer networks--Security measures
- Format
- Document (PDF)
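In the spirit of the per-attack studies above, the sketch below trains a classifier on simple per-source flow features; the features and their distributions are entirely hypothetical stand-ins for the traffic captured from a production network.

```python
# Minimal sketch: classify SSH brute-force vs. normal traffic from flow features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(11)
n_normal, n_attack = 3000, 150

# hypothetical features: [connections per minute, mean bytes per connection,
#                         failed-login ratio]
normal = np.column_stack([rng.poisson(2, n_normal),
                          rng.normal(4000, 800, n_normal),
                          rng.beta(1, 20, n_normal)])
attack = np.column_stack([rng.poisson(40, n_attack),
                          rng.normal(600, 150, n_attack),
                          rng.beta(15, 2, n_attack)])
X = np.vstack([normal, attack])
y = np.array([0] * n_normal + [1] * n_attack)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=11)
clf = RandomForestClassifier(n_estimators=100, random_state=11).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```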
- Title
- Microarray deconvolution: a web application.
- Creator
- Canny, Stephanie, Petrie, Howard, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Microarray gene expression profiling is used in biology for a variety of purposes including identifying disease biomarkers or understanding cellular processes. Biological samples composed of multiple cell or tissue types pose a problem because different compositions of cell-types in samples affect the gene expression profile and also the expression profile of individual components of the sample may be of interest. Physical methods to separate mixed samples are time-consuming and expensive. Consequently, several computational methods have been developed to deconvolute heterogeneous samples into individual components. Different software packages and applications are available to perform these calculations. Microarray Deconvolution is a web application that provides a simple-to-use interface that fills some gaps left by other packages in performing heterogeneous sample microarray deconvolution including microarray raw data processing and normalization, cell-type proportion estimation and simple linear deconvolution.
- Date Issued
- 2013
- PURL
- http://purl.flvc.org/fau/fd/FA00004243
- Format
- Document (PDF)
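A minimal sketch of the linear deconvolution step described above: given a signature matrix of pure cell-type expression profiles, estimate the cell-type proportions in a mixed sample with non-negative least squares. The data is simulated; a real run would use normalized microarray intensities.

```python
# Minimal sketch: estimate cell-type proportions by non-negative least squares.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
n_genes, n_cell_types = 500, 3
signatures = rng.gamma(shape=2.0, scale=50.0, size=(n_genes, n_cell_types))

true_props = np.array([0.6, 0.3, 0.1])                            # hidden mixture proportions
mixture = signatures @ true_props + rng.normal(0, 5.0, n_genes)   # noisy mixed sample

est, _ = nnls(signatures, mixture)                     # proportions constrained to be >= 0
est /= est.sum()                                       # normalize to sum to one
print("true:", true_props, "estimated:", np.round(est, 3))
```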
- Title
- Alleviating class imbalance using data sampling: Examining the effects on classification algorithms.
- Creator
- Napolitano, Amri E., Florida Atlantic University, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Imbalanced class distributions typically cause poor classifier performance on the minority class, which also tends to be the class with the highest cost of misclassification. Data sampling is a common solution to this problem, and numerous sampling techniques have been proposed to address it. Prior research examining the performance of these techniques has been narrow and limited. This work uses thorough empirical experimentation to compare the performance of seven existing data sampling techniques using five different classifiers and four different datasets. The work addresses which sampling techniques produce the best performance in the presence of class imbalance, which classifiers are most robust to the problem, as well as which sampling techniques perform better or worse with each classifier. Extensive statistical analysis of these results is provided, in addition to an examination of the qualitative effects of the sampling techniques on the types of predictions made by the C4.5 classifier.
- Date Issued
- 2006
- PURL
- http://purl.flvc.org/fcla/dt/13413
- Subject Headings
- Combinatorial group theory, Data mining, Decision trees, Machine learning
- Format
- Document (PDF)
- Title
- Classification of software quality with tree modeling using C4.5 algorithm.
- Creator
- Ponnuswamy, Viswanathan Kolathupalayam., Florida Atlantic University, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Developing highly reliable software is a must in today's competitive environment. However, quality control is a costly and time-consuming process. If the quality of software modules being developed can be predicted early in their life cycle, resources can be effectively allocated, improving quality and reducing cost and development time. This study examines the C4.5 algorithm as a tool for building classification trees that classify software modules as either fault-prone or not fault-prone. The classification tree models were developed based on four consecutive releases of a very large legacy telecommunication system. The first two releases were used as training data sets and the subsequent two releases were used as test data sets to evaluate the models. We found that C4.5 was able to build compact classification tree models with balanced misclassification rates.
- Date Issued
- 2001
- PURL
- http://purl.flvc.org/fcla/dt/12855
- Subject Headings
- Computer software--Quality control, Software measurement
- Format
- Document (PDF)
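A minimal sketch of the release-based evaluation described above, with scikit-learn's CART-style DecisionTreeClassifier standing in for C4.5 (which has no scikit-learn implementation). Software-metric data is simulated: earlier "releases" form the training set, later ones the test sets.

```python
# Minimal sketch: train a classification tree on releases 1-2, test on 3-4.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(4)

def make_release(n_modules):
    """Simulate per-module metrics (e.g. size, complexity, churn) and fault labels."""
    metrics = rng.normal(size=(n_modules, 6))
    risk = metrics @ np.array([0.8, 0.6, 0.4, 0.2, 0.1, 0.05]) + rng.normal(0, 1, n_modules)
    labels = (risk > 1.0).astype(int)                  # 1 = fault-prone
    return metrics, labels

X1, y1 = make_release(2000)      # release 1 (training)
X2, y2 = make_release(2000)      # release 2 (training)
X3, y3 = make_release(2000)      # release 3 (test)
X4, y4 = make_release(2000)      # release 4 (test)

tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=50, random_state=4)
tree.fit(np.vstack([X1, X2]), np.concatenate([y1, y2]))

for name, (Xt, yt) in {"release 3": (X3, y3), "release 4": (X4, y4)}.items():
    tn, fp, fn, tp = confusion_matrix(yt, tree.predict(Xt)).ravel()
    print(f"{name}: type I rate {fp / (fp + tn):.3f}, type II rate {fn / (fn + tp):.3f}")
```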
- Title
- Data Quality in Data Mining and Machine Learning.
- Creator
- Van Hulse, Jason, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- With advances in data storage and data transmission technologies, and given the increasing use of computers by both individuals and corporations, organizations are accumulating an ever-increasing amount of information in data warehouses and databases. The huge surge in data, however, has made the process of extracting useful, actionable, and interesting knowledge from the data extremely difficult. In response to the challenges posed by operating in a data-intensive environment, the fields of data mining and machine learning (DM/ML) have successfully provided solutions to help uncover knowledge buried within data. DM/ML techniques use automated (or semi-automated) procedures to process vast quantities of data in search of interesting patterns. DM/ML techniques do not create knowledge, instead the implicit assumption is that knowledge is present within the data, and these procedures are needed to uncover interesting, important, and previously unknown relationships. Therefore, the quality of the data is absolutely critical in ensuring successful analysis. Having high quality data, i.e., data which is (relatively) free from errors and suitable for use in data mining tasks, is a necessary precondition for extracting useful knowledge. In response to the important role played by data quality, this dissertation investigates data quality and its impact on DM/ML. First, we propose several innovative procedures for coping with low quality data. Another aspect of data quality, the occurrence of missing values, is also explored. Finally, a detailed experimental evaluation on learning from noisy and imbalanced datasets is provided, supplying valuable insight into how class noise in skewed datasets affects learning algorithms.
- Date Issued
- 2007
- PURL
- http://purl.flvc.org/fau/fd/FA00000858
- Subject Headings
- Data mining--Quality control, Machine learning, Electronic data processing--Quality control
- Format
- Document (PDF)
- Title
- Detection of change-prone telecommunications software modules.
- Creator
- Weir, Ronald Eugene., Florida Atlantic University, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Accurately classifying the quality of software is a major problem in any software development project. Software engineers develop models that provide early estimates of quality metrics which allow them to take actions against emerging quality problems. The use of a neural network as a tool to classify programs as a low, medium, or high risk for errors or change is explored using multiple software metrics as input. It is demonstrated that a neural network, trained using the back-propagation supervised learning strategy, produced the desired mapping between the static software metrics and the software quality classes. The neural network classification methodology is compared to the discriminant analysis classification methodology in this experiment. The comparison is based on two and three class predictive models developed using variables resulting from principal component analysis of software metrics.
- Date Issued
- 1995
- PURL
- http://purl.flvc.org/fcla/dt/15183
- Subject Headings
- Computer software--Evaluation, Software engineering, Neural networks (Computer science)
- Format
- Document (PDF)
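A minimal sketch of the setup described above: principal components of software metrics feed both a back-propagation neural network and a discriminant analysis classifier, each predicting a risk class. The metric data is simulated, and scikit-learn's MLPClassifier and LinearDiscriminantAnalysis stand in for the thesis's specific models.

```python
# Minimal sketch: PCA of software metrics feeding a neural network and LDA.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# three classes stand in for low/medium/high risk
X, y = make_classification(n_samples=3000, n_features=12, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=9)

models = {
    "neural network": make_pipeline(StandardScaler(), PCA(n_components=5),
                                    MLPClassifier(hidden_layer_sizes=(16,),
                                                  max_iter=1000, random_state=9)),
    "discriminant analysis": make_pipeline(StandardScaler(), PCA(n_components=5),
                                           LinearDiscriminantAnalysis()),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: accuracy {accuracy_score(y_te, model.predict(X_te)):.3f}")
```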
- Title
- Modeling fault-prone modules of subsystems.
- Creator
- Thaker, Vishal Kirit., Florida Atlantic University, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- In software engineering, software quality has become a topic of major concern. It has also been recognized that the role of the maintenance organization is to understand and estimate the cost of maintenance releases of software systems. Planning the next release so as to maximize the increase in functionality and the improvement in quality is essential to successful maintenance management. With the growing collection of software in organizations, this cost is becoming substantial. In this research we compared two software quality models: a model built on the entire system that predicts a subsystem, and a model built on that subsystem which predicts the same subsystem, to see whether they yield similar, better, or worse classification results. We used the Classification And Regression Tree (CART) algorithm to build the classification models. The case study is based on a very large telecommunication system.
- Date Issued
- 2000
- PURL
- http://purl.flvc.org/fcla/dt/12700
- Subject Headings
- Computer software--Quality control, Software engineering
- Format
- Document (PDF)