Current Search: Document (PDF) » FAU » info:fedora/fau:smc » Khoshgoftaar, Taghi M.
- Title
- A COMPARATIVE STUDY OF STRUCTURED VERSUS UNSTRUCTURED TEXT DATA.
- Creator
- Cardenas, Erika, Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
-
In today’s world, data is generated at an unprecedented rate, and a significant portion of it is unstructured text data. The recent advancements in Natural Language Processing have enabled computers to understand and interpret human language. Data mining techniques were once unable to use text data due to the high dimensionality of text processing models. This limitation was overcome with the ability to represent data as text. This thesis aims to compare the predictive performance of structured versus unstructured text data in two different applications. The first application is in the field of real estate. We compare the performance of tabular real-estate data and unstructured text descriptions of homes to predict the house price. The second application is in translating Electronic Health Records (EHR) tabular data to text data for survival classification of COVID-19 patients. Lastly, we present a range of strategies and perspectives for future research.
- Date Issued
- 2023
- PURL
- http://purl.flvc.org/fau/fd/FA00014220
- Subject Headings
- Natural language processing (Computer science), Text data mining
- Format
- Document (PDF)
- Title
- A Comparison of Model Checking Tools for Service Oriented Architectures.
- Creator
- Venkat, Raghava, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Recently, most of the research pertaining to Service-Oriented Architecture (SOA) is based on web services and how secure they are in terms of efficiency and effectiveness. This requires validation, verification, and evaluation of web services. Verification and validation should be collaborative when web services from different vendors are integrated together to carry out a coherent task. For this purpose, novel model checking technologies have been devised and applied to web services. "Model Checking" is a promising technique for verification and validation of software systems. WS-BPEL (Business Process Execution Language for Web Services) is an emerging standard language to describe web service composition behavior. The advanced features of BPEL such as concurrency and hierarchy make it challenging to verify BPEL models. Based on all such factors, my thesis surveys a few important technologies (tools) for model checking and compares each of them based on their "functional" and "non-functional" properties. The comparison is based on three case studies (the first being a small case, the second medium, and the third a large case) where we construct synthetic web service compositions for each case (as there are not many publicly available compositions [1]). The first case study is "Enhanced Loan Approval Process" and is considered a small case. The second is "Enhanced Purchase Order Process" which is of medium size, and the third, and largest, is based on a scientific workflow pattern, called the "Service Oriented Architecture Implementing BOINC Workflow" based on BOINC (Berkeley Open Infrastructure Network Computing) architecture.
- Date Issued
- 2007
- PURL
- http://purl.flvc.org/fau/fd/FA00012565
- Subject Headings
- Computer network architectures, Expert systems (Computer science), Software engineering, Web servers--Management
- Format
- Document (PDF)
- Title
- A REVIEW AND ANALYSIS OF BOT-IOT SECURITY DATA FOR MACHINE LEARNING.
- Creator
- Peterson, Jared M., Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
-
Machine learning is having an increased impact on the Cyber Security landscape. The ability of predictive models to accurately identify attack patterns in security data is set to overtake more traditional detection methods. Industry demand has led to an uptick in research in the application of machine learning for Cyber Security. To facilitate this research, many datasets have been created and made public. This thesis provides an in-depth analysis of one of the newest datasets, Bot-IoT. The full dataset contains about 73 million instances (big data), 3 dependent features, and 43 independent features. The purpose of this thesis is to provide researchers with a foundational understanding of Bot-IoT, its development, its features, its composition, and its pitfalls. It will also summarize many of the published works that utilize Bot-IoT and will propose new areas of research based on the issues identified in the current research and in the dataset.
- Date Issued
- 2021
- PURL
- http://purl.flvc.org/fau/fd/FA00013838
- Subject Headings
- Machine learning, Cyber security, Big data
- Format
- Document (PDF)
- Title
- A review of the stability of feature selection techniques for bioinformatics data.
- Creator
- Awada, Wael, Khoshgoftaar, Taghi M., Dittman, David, Wald, Randall, Napolitano, Amri E., Graduate College
- Date Issued
- 2013-04-12
- PURL
- http://purl.flvc.org/fcla/dt/3361293
- Subject Headings
- Bioinformatics, DNA microarrays, Data mining
- Format
- Document (PDF)
- Title
- ADDRESSING HIGHLY IMBALANCED BIG DATA CHALLENGES FOR MEDICARE FRAUD DETECTION.
- Creator
- Johnson, Justin M., Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
-
Access to affordable healthcare is a nationwide concern that impacts most of the United States population. Medicare is a federal government healthcare program that aims to provide affordable health insurance to the elderly population and individuals with select disabilities. Unfortunately, there is a significant amount of fraud, waste, and abuse within the Medicare system that inevitably raises premiums and costs taxpayers billions of dollars each year. Dedicated task forces investigate the most severe fraudulent cases, but with millions of healthcare providers and more than 60 million active Medicare beneficiaries, manual fraud detection efforts are not able to make widespread, meaningful impact. Through the proliferation of electronic health records and continuous breakthroughs in data mining and machine learning, there is a great opportunity to develop and leverage advanced machine learning systems for automating healthcare fraud detection. This dissertation identifies key challenges associated with predictive modeling for large-scale Medicare fraud detection and presents innovative solutions to address these challenges in order to provide state-of-the-art results on multiple real-world Medicare fraud data sets. Our methodology for curating nine distinct Medicare fraud classification data sets is presented with comprehensive details describing data accumulation, data pre-processing, data aggregation techniques, data enrichment strategies, and improved fraud labeling. Data-level and algorithm-level methods for treating severe class imbalance, including a flexible output thresholding method and a cost-sensitive framework, are evaluated using deep neural network and ensemble learners. Novel encoding techniques and representation learning methods for high-dimensional categorical features are proposed to create expressive representations of provider attributes and billing procedure codes.
- Date Issued
- 2022
- PURL
- http://purl.flvc.org/fau/fd/FA00014057
- Subject Headings
- Medicare fraud, Big data, Machine learning
- Format
- Document (PDF)
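The output-thresholding idea mentioned in the abstract above can be illustrated with a small sketch: rather than the default 0.5 cut-off, the decision threshold on a model's fraud probabilities is tuned to maximize the geometric mean of true-positive and true-negative rates. The scores, labels, and chosen metric below are hypothetical toy values, not the dissertation's Medicare data or its exact method.

```python
# Hedged sketch of output thresholding for severe class imbalance:
# pick the probability cut-off that maximizes the geometric mean of
# the true-positive rate (TPR) and true-negative rate (TNR).

def g_mean(scores, labels, t):
    """Geometric mean of TPR and TNR at decision threshold t."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
    pos = sum(labels)
    neg = len(labels) - pos
    return ((tp / pos) * (tn / neg)) ** 0.5

def best_threshold(scores, labels):
    # Candidate thresholds are simply the observed scores themselves.
    return max(sorted(set(scores)), key=lambda t: g_mean(scores, labels, t))

# Imbalanced toy data: 2 fraud (1) vs. 8 non-fraud (0) instances.
scores = [0.05, 0.10, 0.15, 0.20, 0.22, 0.25, 0.30, 0.35, 0.40, 0.90]
labels = [0,    0,    0,    0,    0,    0,    0,    0,    1,    1   ]

t = best_threshold(scores, labels)
print(t)  # 0.40: both fraud cases caught, all non-fraud rejected
```

With the default 0.5 cut-off this toy model would miss one of the two fraud cases; tuning the threshold recovers both without sacrificing the majority class.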
- Title
- Alleviating class imbalance using data sampling: Examining the effects on classification algorithms.
- Creator
- Napolitano, Amri E., Florida Atlantic University, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Imbalanced class distributions typically cause poor classifier performance on the minority class, which also tends to be the class with the highest cost of misclassification. Data sampling is a common solution to this problem, and numerous sampling techniques have been proposed to address it. Prior research examining the performance of these techniques has been narrow and limited. This work uses thorough empirical experimentation to compare the performance of seven existing data sampling techniques using five different classifiers and four different datasets. The work addresses which sampling techniques produce the best performance in the presence of class imbalance, which classifiers are most robust to the problem, as well as which sampling techniques perform better or worse with each classifier. Extensive statistical analysis of these results is provided, in addition to an examination of the qualitative effects of the sampling techniques on the types of predictions made by the C4.5 classifier.
- Date Issued
- 2006
- PURL
- http://purl.flvc.org/fcla/dt/13413
- Subject Headings
- Combinatorial group theory, Data mining, Decision trees, Machine learning
- Format
- Document (PDF)
- Title
- An Empirical Study of Ordinal and Non-ordinal Classification Algorithms for Intrusion Detection in WLANs.
- Creator
- Gopalakrishnan, Leelakrishnan, Khoshgoftaar, Taghi M., Florida Atlantic University
- Abstract/Description
-
Ordinal classification refers to an important category of real-world problems, in which the attributes of the instances to be classified and the classes are linearly ordered. Many applications of machine learning frequently involve situations exhibiting an order among the different categories represented by the class attribute. In ordinal classification the class value is converted into a numeric quantity and regression algorithms are applied to the transformed data. The data is later translated back into a discrete class value in a postprocessing step. This thesis is devoted to an empirical study of ordinal and non-ordinal classification algorithms for intrusion detection in WLANs. We used ordinal classification in conjunction with nine classifiers for the experiments in this thesis. All classifiers are part of the WEKA machine learning workbench. The results indicate that most of the classifiers give similar or better results with ordinal classification compared to non-ordinal classification.
- Date Issued
- 2006
- PURL
- http://purl.flvc.org/fau/fd/FA00012521
- Subject Headings
- Wireless LANs--Security measures, Computer networks--Security measures, Data structures (Computer science), Multivariate analysis
- Format
- Document (PDF)
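The ordinal-classification procedure described in the abstract above (map ordered classes to numbers, fit a regressor, translate predictions back to discrete classes) can be sketched minimally. The class names, toy data, and one-dimensional least-squares model are hypothetical stand-ins for the WEKA classifiers the thesis actually evaluates.

```python
# Minimal sketch of ordinal classification via regression:
# ordered class labels are encoded as integers, a regression model is
# fit, and numeric predictions are rounded back to the nearest class.

CLASSES = ["low", "medium", "high"]           # linearly ordered classes
TO_NUM = {c: i for i, c in enumerate(CLASSES)}

def fit_1d_least_squares(xs, ys):
    """Closed-form slope/intercept for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def predict_class(model, x):
    a, b = model
    y = a * x + b
    # Postprocessing step: translate the numeric prediction back
    # into the nearest discrete class value.
    idx = min(range(len(CLASSES)), key=lambda i: abs(i - y))
    return CLASSES[idx]

# Toy training data: one feature value vs. ordered severity class.
train_x = [0.1, 0.4, 0.5, 0.9, 1.1, 1.6]
train_y = [TO_NUM[c] for c in ["low", "low", "medium", "medium", "high", "high"]]

model = fit_1d_least_squares(train_x, train_y)
print(predict_class(model, 0.2))   # "low"
print(predict_class(model, 1.5))   # "high"
```

The key design point is that rounding to the *nearest* ordered class preserves the ordering information that a plain one-vs-all classifier would discard.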
- Title
- An Empirical Study of Performance Metrics for Classifier Evaluation in Machine Learning.
- Creator
- Bruhns, Stefan, Khoshgoftaar, Taghi M., Florida Atlantic University
- Abstract/Description
-
A variety of classifiers for solving classification problems is available from the domain of machine learning. Commonly used classifiers include support vector machines, decision trees and neural networks. These classifiers can be configured by modifying internal parameters. The large number of available classifiers and the different configuration possibilities result in a large number of combinations of classifier and configuration settings, leaving the practitioner with the problem of evaluating the performance of different classifiers. This problem can be solved by using performance metrics. However, the large number of available metrics causes difficulty in deciding which metrics to use and when comparing classifiers on the basis of multiple metrics. This paper uses the statistical method of factor analysis in order to investigate the relationships between several performance metrics and introduces the concept of relative performance, which has the potential to ease the process of comparing several classifiers. The relative performance metric is also used to evaluate different support vector machine classifiers and to determine if the default settings in the Weka data mining tool are reasonable.
- Date Issued
- 2008
- PURL
- http://purl.flvc.org/fau/fd/FA00012508
- Subject Headings
- Machine learning, Computer algorithms, Pattern recognition systems, Data structures (Computer science), Kernel functions, Pattern perception--Data processing
- Format
- Document (PDF)
- Title
- An Empirical Study of Random Forests for Mining Imbalanced Data.
- Creator
- Golawala, Moiz M., Khoshgoftaar, Taghi M., Florida Atlantic University
- Abstract/Description
-
Skewed or imbalanced data presents a significant problem for many standard learners which focus on optimizing the overall classification accuracy. When the class distribution is skewed, priority is given to classifying examples from the majority class, at the expense of the often more important minority class. The random forest (RF) classification algorithm, which is a relatively new learner with appealing theoretical properties, has received almost no attention in the context of skewed datasets. This work presents a comprehensive suite of experimentation evaluating the effectiveness of random forests for learning from imbalanced data. Reasonable parameter settings (for the Weka implementation) for ensemble size and number of random features selected are determined through experimentation on 10 datasets. Further, the application of seven different data sampling techniques that are common methods for handling imbalanced data, in conjunction with RF, is also assessed. Finally, RF is benchmarked against 10 other commonly-used machine learning algorithms, and is shown to provide very strong performance. A total of 35 imbalanced datasets are used, and over one million classifiers are constructed in this work.
- Date Issued
- 2007
- PURL
- http://purl.flvc.org/fau/fd/FA00012520
- Subject Headings
- Data mining--Case studies, Machine learning--Case studies, Data structure (Computer science), Trees (Graph theory)--Case studies
- Format
- Document (PDF)
- Title
- An Evaluation of Deep Learning with Class Imbalanced Big Data.
- Creator
- Johnson, Justin Matthew, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Effective classification with imbalanced data is an important area of research, as high class imbalance is naturally inherent in many real-world applications, e.g. anomaly detection. Modeling such skewed data distributions is often very difficult, and non-standard methods are sometimes required to combat these negative effects. These challenges have been studied thoroughly using traditional machine learning algorithms, but very little empirical work exists in the area of deep learning with class imbalanced big data. Following an in-depth survey of deep learning methods for addressing class imbalance, we evaluate various methods for addressing imbalance on the task of detecting Medicare fraud, a big data problem characterized by extreme class imbalance. Case studies herein demonstrate the impact of class imbalance on neural networks, evaluate the efficacy of data-level and algorithm-level methods, and achieve state-of-the-art results on the given Medicare data set. Results indicate that combining under-sampling and over-sampling maximizes both performance and efficiency.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013221
- Subject Headings
- Deep learning, Big data, Medicare fraud--Prevention
- Format
- Document (PDF)
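The combined sampling strategy that the abstract above reports as most effective can be sketched in a few lines: randomly under-sample the majority class and randomly over-sample the minority class until both reach a target size. The class labels, target size, and random-duplication scheme below are illustrative choices, not the thesis's exact procedure.

```python
import random

# Hedged sketch of hybrid data sampling for class imbalance:
# under-sample classes above the target size, over-sample (with
# replacement) classes below it.

def hybrid_sample(data, labels, target_per_class, seed=0):
    rng = random.Random(seed)
    out_data, out_labels = [], []
    for cls in sorted(set(labels)):
        members = [d for d, y in zip(data, labels) if y == cls]
        if len(members) >= target_per_class:
            # Majority class: random under-sampling without replacement.
            sampled = rng.sample(members, target_per_class)
        else:
            # Minority class: keep all members, then duplicate at random.
            sampled = members + [rng.choice(members)
                                 for _ in range(target_per_class - len(members))]
        out_data += sampled
        out_labels += [cls] * target_per_class
    return out_data, out_labels

# 9 majority (0) instances vs. 1 minority (1) instance.
data = list(range(10))
labels = [0] * 9 + [1]

bal_data, bal_labels = hybrid_sample(data, labels, target_per_class=5)
print(bal_labels.count(0), bal_labels.count(1))  # 5 5
```

Meeting in the middle this way avoids both the information loss of aggressive under-sampling and the training-cost blow-up of pure over-sampling, which is the efficiency argument the abstract makes.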
- Title
- An evaluation of machine learning algorithms for tweet sentiment analysis.
- Creator
- Prusa, Joseph D., Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Sentiment analysis of tweets is an application of mining Twitter, and is growing in popularity as a means of determining public opinion. Machine learning algorithms are used to perform sentiment analysis; however, data quality issues such as high dimensionality, class imbalance or noise may negatively impact classifier performance. Machine learning techniques exist for targeting these problems, but have not been applied to this domain, or have not been studied in detail. In this thesis we discuss research that has been conducted on tweet sentiment classification, its accompanying data concerns, and methods of addressing these concerns. We test the impact of feature selection, data sampling and ensemble techniques in an effort to improve classifier performance. We also evaluate the combination of feature selection and ensemble techniques and examine the effects of high dimensionality when combining multiple types of features. Additionally, we provide strategies and insights for potential avenues of future work.
- Date Issued
- 2015
- PURL
http://purl.flvc.org/fau/fd/FA00004460
- Subject Headings
Social media, Natural language processing (Computer science), Machine learning, Algorithms, Fuzzy expert systems, Artificial intelligence
- Format
- Document (PDF)
- Title
- An evaluation of Unsupervised Machine Learning Algorithms for Detecting Fraud and Abuse in the U.S. Medicare Insurance Program.
- Creator
- Da Rosa, Raquel C., Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
The population of people ages 65 and older has increased since the 1960s and current estimates indicate it will double by 2060. Medicare is a federal health insurance program for people 65 or older in the United States. Medicare claims fraud and abuse is an ongoing issue that wastes a large amount of money every year resulting in higher health care costs and taxes for everyone. In this study, an empirical evaluation of several unsupervised machine learning approaches is performed which indicates reasonable fraud detection results. We employ two unsupervised machine learning algorithms, Isolation Forest and Unsupervised Random Forest, which have not been previously used for the detection of fraud and abuse on Medicare data. Additionally, we implement three other machine learning methods previously applied on Medicare data which include: Local Outlier Factor, Autoencoder, and k-Nearest Neighbor. For our dataset, we combine the 2012 to 2015 Medicare provider utilization and payment data and add fraud labels from the List of Excluded Individuals/Entities (LEIE) database. Results show that Local Outlier Factor is the best model to use for Medicare fraud detection.
- Date Issued
- 2018
- PURL
- http://purl.flvc.org/fau/fd/FA00013042
- Subject Headings
- Machine learning, Medicare fraud, Algorithms
- Format
- Document (PDF)
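The k-Nearest-Neighbor style of unsupervised anomaly scoring evaluated in the abstract above can be illustrated minimally: each instance is scored by its distance to its k-th nearest neighbor, and the largest scores are flagged as potential fraud. The one-dimensional "billing amounts" below are made-up toy values, not the Medicare data, and real methods like Local Outlier Factor normalize these distances against each neighborhood's density.

```python
# Hedged sketch of distance-based unsupervised anomaly scoring:
# score = distance to the k-th nearest neighbor; the largest score
# marks the most anomalous instance.

def knn_score(points, k):
    scores = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])   # distance to k-th nearest neighbor
    return scores

# Nine typical claim amounts and one outlier.
amounts = [100, 102, 98, 101, 99, 103, 97, 100, 102, 500]
scores = knn_score(amounts, k=3)
flagged = max(range(len(amounts)), key=lambda i: scores[i])
print(amounts[flagged])  # 500: the anomalous claim stands out
```

No fraud labels are needed to compute the scores, which is exactly why such unsupervised methods suit Medicare data, where confirmed fraud labels (e.g. from the LEIE) are scarce.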
- Title
- An Exploration into Synthetic Data and Generative Adversarial Networks.
- Creator
- Shorten, Connor M., Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
This thesis surveys the landscape of Data Augmentation for image datasets. Completing this survey inspired further study into a method of generative modeling known as Generative Adversarial Networks (GANs). A survey on GANs was conducted to understand recent developments and the problems related to training them. Following this survey, four experiments were proposed to test the application of GANs for data augmentation and to contribute to the quality improvement in GAN-generated data. Experimental results demonstrate the effectiveness of GAN-generated data as a pre-training metric. The other experiments discuss important characteristics of GAN models such as the refining of prior information, transferring generative models from large datasets to small data, and automating the design of Deep Neural Networks within the context of the GAN framework. This thesis will provide readers with a complete introduction to Data Augmentation and Generative Adversarial Networks, as well as insights into the future of these techniques.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013263
- Subject Headings
- Neural networks (Computer science), Computer vision, Images, Generative adversarial networks, Data sets
- Format
- Document (PDF)
- Title
- Analysis of machine learning algorithms on bioinformatics data of varying quality.
- Creator
- Shanab, Ahmad Abu, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
One of the main applications of machine learning in bioinformatics is the construction of classification models which can accurately classify new instances using information gained from previous instances. With the help of machine learning algorithms (such as supervised classification and gene selection) new meaningful knowledge can be extracted from bioinformatics datasets that can help in disease diagnosis and prognosis as well as in prescribing the right treatment for a disease. One particular challenge encountered when analyzing bioinformatics datasets is data noise, which refers to incorrect or missing values in datasets. Noise can be introduced as a result of experimental errors (e.g. faulty microarray chips, insufficient resolution, image corruption, and incorrect laboratory procedures), as well as other errors (errors during data processing, transfer, and/or mining). A special type of data noise, called class noise, occurs when an instance/example is mislabeled. Previous research showed that class noise has a detrimental impact on machine learning algorithms (e.g. worsened classification performance and unstable feature selection). In addition to data noise, gene expression datasets can suffer from the problems of high dimensionality (a very large feature space) and class imbalance (unequal distribution of instances between classes). As a result of these inherent problems, constructing accurate classification models becomes more challenging.
- Date Issued
- 2015
- PURL
http://purl.flvc.org/fau/fd/FA00004425
- Subject Headings
- Artificial intelligence, Bioinformatics, Machine learning, System design, Theory of computation
- Format
- Document (PDF)
- Title
- Big Data Analytics and Engineering for Medicare Fraud Detection.
- Creator
- Herland, Matthew Andrew, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
The United States (U.S.) healthcare system produces an enormous volume of data with a vast number of financial transactions generated by physicians administering healthcare services. This makes healthcare fraud difficult to detect, especially when there are considerably less fraudulent transactions than non-fraudulent. Fraud is an extremely important issue for healthcare, as fraudulent activities within the U.S. healthcare system contribute to significant financial losses. In the U.S., the...
Show moreThe United States (U.S.) healthcare system produces an enormous volume of data with a vast number of financial transactions generated by physicians administering healthcare services. This makes healthcare fraud difficult to detect, especially when there are considerably less fraudulent transactions than non-fraudulent. Fraud is an extremely important issue for healthcare, as fraudulent activities within the U.S. healthcare system contribute to significant financial losses. In the U.S., the elderly population continues to rise, increasing the need for programs, such as Medicare, to help with associated medical expenses. Unfortunately, due to healthcare fraud, these programs are being adversely affected, draining resources and reducing the quality and accessibility of necessary healthcare services. In response, advanced data analytics have recently been explored to detect possible fraudulent activities. The Centers for Medicare and Medicaid Services (CMS) released several ‘Big Data’ Medicare claims datasets for different parts of their Medicare program to help facilitate this effort. In this dissertation, we employ three CMS Medicare Big Data datasets to evaluate the fraud detection performance available using advanced data analytics techniques, specifically machine learning. We use two distinct approaches, designated as anomaly detection and traditional fraud detection, where each have very distinct data processing and feature engineering. Anomaly detection experiments classify by provider specialty, determining whether outlier physicians within the same specialty signal fraudulent behavior. Traditional fraud detection refers to the experiments directly classifying physicians as fraudulent or non-fraudulent, leveraging machine learning algorithms to discriminate between classes. 
We present our novel data engineering approaches for both anomaly detection and traditional fraud detection, including data processing, fraud mapping, and the creation of a combined dataset consisting of all three Medicare parts. We incorporate the List of Excluded Individuals and Entities database to identify real-world fraudulent physicians for model evaluation. Regarding features, the final datasets for anomaly detection contain only claim counts for every procedure a physician submits, while traditional fraud detection incorporates aggregated counts and payment information, specialty, and gender. Additionally, we compare cross-validation to the real-world practice of building a model on a training dataset and evaluating it on a separate test dataset under severe class imbalance and rarity.
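The within-specialty outlier idea can be illustrated with a minimal sketch. The claim counts, physician names, and threshold below are hypothetical stand-ins; the dissertation's actual feature engineering and detection methods are far richer.

```python
import statistics

def flag_outliers(by_specialty, z_threshold=3.5):
    """Flag physicians whose claim count is a robust outlier (modified
    z-score via the median absolute deviation) within their own specialty."""
    flagged = []
    for specialty, counts in by_specialty.items():
        values = sorted(counts.values())
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values)
        if mad == 0:  # no spread within the specialty; nothing to flag
            continue
        for physician, count in counts.items():
            if 0.6745 * abs(count - med) / mad > z_threshold:
                flagged.append((specialty, physician))
    return flagged

# Hypothetical per-physician claim counts, grouped by specialty.
claims = {
    "cardiology": {"dr_a": 120, "dr_b": 135, "dr_c": 128, "dr_d": 940},
    "dermatology": {"dr_e": 60, "dr_f": 55, "dr_g": 58},
}
print(flag_outliers(claims))  # [('cardiology', 'dr_d')]
```

A median-based statistic is used here because, with heavy outliers and small groups, a plain z-score can fail to flag even an extreme physician.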
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013215
- Subject Headings
- Big data, Medicare fraud, Data analytics, Machine learning
- Format
- Document (PDF)
- Title
- CBR-based software quality models and quality of data.
- Creator
- Xiao, Yudong., Florida Atlantic University, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
The performance accuracy of software quality estimation models is influenced by several factors, including two important ones: the performance of the prediction algorithm and the quality of the data. This dissertation addresses these two factors and consists of two components: (1) a proposed genetic algorithm (GA) based optimization of software quality models for accuracy enhancement, and (2) a proposed partitioning- and rule-based filter (PRBF) for noise detection toward improvement of data quality. We construct a generalized framework for our embedded GA-optimizer and instantiate it for three optimization problems in software quality engineering: parameter optimization for case-based reasoning (CBR) models; module rank optimization for module-order modeling (MOM); and structural optimization for our multi-strategy classification modeling approach, denoted RB2CBL. Empirical case studies using software measurement data from real-world software systems were performed for each optimization problem. The GA-optimization approaches improved software quality prediction accuracy, highlighting the practical benefits of using GA to solve optimization problems in software engineering. The proposed noise detection approach, PRBF, was empirically evaluated using data categorized into two classes. Empirical studies on artificially corrupted datasets and datasets with known (natural) noise demonstrated that PRBF can effectively detect both artificial and natural noise. The proposed filter is a stable and robust technique and consistently provided optimal or near-optimal noise detection results. In addition, it is applicable to datasets with nominal and numerical attributes, as well as those with missing values. The PRBF technique supports two methods of noise detection: class noise detection and cost-sensitive noise detection.
The former is an easy-to-use method that requires no parameter settings, while the latter is suited for applications where each class has a specific misclassification cost. PRBF can also be used iteratively to investigate the two general types of data noise: attribute noise and class noise.
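The GA-based parameter optimization described above can be sketched with a toy fitness function standing in for model accuracy. All names, operators, and settings here are illustrative assumptions, not the dissertation's actual GA design.

```python
import random

def ga_maximize(fitness, lo, hi, pop_size=40, generations=80, seed=1):
    """Minimal real-valued GA: tournament selection, blend crossover,
    and Gaussian mutation, with children clipped to [lo, hi]."""
    rng = random.Random(seed)
    pop = [rng.uniform(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        # Tournament selection: keep the fitter of two random individuals.
        parents = [max(rng.sample(pop, 2), key=fitness) for _ in range(pop_size)]
        pop = []
        for _ in range(pop_size):
            a, b = rng.sample(parents, 2)
            child = (a + b) / 2 + rng.gauss(0, (hi - lo) * 0.02)
            pop.append(min(hi, max(lo, child)))
    return max(pop, key=fitness)

# Toy fitness function standing in for model accuracy, peaking at 3.0.
best = ga_maximize(lambda x: -(x - 3.0) ** 2, lo=0.0, hi=10.0)
print(best)
```

In the dissertation's setting, the fitness function would be the accuracy of a CBR or module-order model evaluated with the candidate parameter values.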
- Date Issued
- 2005
- PURL
- http://purl.flvc.org/fcla/dt/12141
- Subject Headings
- Computer software--Quality control, Genetic programming (Computer science), Software engineering, Case-based reasoning, Combinatorial optimization, Computer network architecture
- Format
- Document (PDF)
- Title
- Choosing software reliability models.
- Creator
- Woodcock, Timothy G., Florida Atlantic University, Khoshgoftaar, Taghi M.
- Abstract/Description
-
One of the important problems software engineers face is determining which software reliability model should be used for a particular system. Some recent attempts to compare different models used complementary graphical and analytical techniques. These techniques require an excessive amount of time for plotting the data and running the analyses, and they remain rather subjective as to which model is best. A simpler technique is therefore needed that yields a less subjective measure of goodness of fit. The Akaike Information Criterion (AIC) is proposed as a new approach for selecting the best model. The performance of AIC is measured by Monte Carlo simulation and by comparison to published data sets. The AIC chooses the correct model 95% of the time.
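The AIC selection rule itself is simple to state in code. The model names and log-likelihood values below are hypothetical stand-ins for fitted reliability models, not figures from the thesis.

```python
def aic(log_likelihood, num_params):
    """Akaike Information Criterion: 2k - 2*ln(L); lower is better."""
    return 2 * num_params - 2 * log_likelihood

# Hypothetical fit results for two candidate reliability models on the same
# failure data: (maximized log-likelihood, number of free parameters).
candidates = {
    "model_A": (-130.2, 2),
    "model_B": (-129.8, 3),
}
best = min(candidates, key=lambda name: aic(*candidates[name]))
print(best)  # model_A: the extra parameter does not pay for the small likelihood gain
```

The penalty term 2k is what makes the criterion less subjective than visual goodness-of-fit comparisons: a more complex model must improve the likelihood enough to earn its extra parameters.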
- Date Issued
- 1989
- PURL
- http://purl.flvc.org/fcla/dt/14561
- Subject Headings
- Computer software--Testing, Computer software--Reliability
- Format
- Document (PDF)
- Title
- Classification of software quality using Bayesian belief networks.
- Creator
- Dong, Yuhong., Florida Atlantic University, Khoshgoftaar, Taghi M.
- Abstract/Description
-
In today's competitive environment for software products, quality has become an increasingly important asset to software development organizations. Software quality models are tools for focusing efforts to find faults early in development, since delaying corrections can lead to higher costs. In this research, the classification Bayesian network modeling technique was used to predict software quality by classifying program modules as either fault-prone or not fault-prone. A general classification rule was applied to yield classification Bayesian belief network models. Six classification Bayesian belief network models were developed based on quality metrics data records of two very large window application systems. The fit data set was used to build each model and the test data set was used to evaluate it. The first two models used a median-based data clustering technique; the second two used the median as the critical value to cluster metrics with the Generalized Boolean Discriminant Function; and the third two used the Kolmogorov-Smirnov test to select the critical value for clustering metrics with the Generalized Boolean Discriminant Function. All six models used the product metrics (FAULT or CDCHURN) as predictors.
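As a simplified analogue of the classification Bayesian belief network models, a naive Bayes classifier over binarized metrics captures the same idea of predicting fault-proneness from module metrics. All data, metric names, and values below are hypothetical; the thesis's actual networks and clustering steps are more elaborate.

```python
import math
from collections import Counter

def train_nb(data):
    """Fit a naive Bayes classifier (the simplest Bayesian network: one class
    node, conditionally independent metric nodes) with Laplace smoothing."""
    class_counts = Counter(label for _, label in data)
    feat_counts = {c: Counter() for c in class_counts}
    for feats, label in data:
        for name, value in feats.items():
            feat_counts[label][(name, value)] += 1
    return class_counts, feat_counts, len(data)

def classify(model, feats):
    class_counts, feat_counts, n = model
    scores = {}
    for c, cc in class_counts.items():
        score = math.log(cc / n)  # log prior for the class
        for name, value in feats.items():
            # Laplace-smoothed conditional probability of each binary metric.
            score += math.log((feat_counts[c][(name, value)] + 1) / (cc + 2))
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical modules: metrics binarized (e.g. above/below the median,
# echoing the clustering step described above) plus a fault-prone label.
modules = [
    ({"loc_high": 1, "churn_high": 1}, True),
    ({"loc_high": 1, "churn_high": 0}, True),
    ({"loc_high": 0, "churn_high": 1}, False),
    ({"loc_high": 0, "churn_high": 0}, False),
    ({"loc_high": 0, "churn_high": 0}, False),
]
model = train_nb(modules)
pred = classify(model, {"loc_high": 1, "churn_high": 1})
print(pred)  # True
```

A full belief network would additionally model dependencies between the metric nodes rather than assuming them independent given the class.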
- Date Issued
- 2002
- PURL
- http://purl.flvc.org/fcla/dt/12918
- Subject Headings
- Computer software--Quality control, Software measurement, Bayesian statistical decision theory
- Format
- Document (PDF)
- Title
- Classification of software quality using tree modeling with the S-Plus algorithm.
- Creator
- Deng, Jianyu., Florida Atlantic University, Khoshgoftaar, Taghi M.
- Abstract/Description
-
In today's competitive environment for software products, quality has become an increasingly important asset to software development organizations. Software quality models are tools for focusing efforts to find faults early in development, since delaying corrections can lead to higher costs. In this research, the classification tree modeling technique was used to predict software quality by classifying program modules as either fault-prone or not fault-prone. The S-Plus regression tree algorithm and a general classification rule were applied to yield classification tree models. Two classification tree models were developed based on four consecutive releases of a very large legacy telecommunications system. The first release was used as the training data set and the subsequent three releases were used as evaluation data sets. The first model used twenty-four product metrics and four execution metrics as candidate predictors; the second model added fourteen process metrics as candidate predictors.
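The core step that a classification tree algorithm repeats recursively is choosing the best split of the training data. A minimal sketch follows; Gini impurity is used here for concreteness (S-Plus trees use deviance), and the module metrics and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a set of binary labels; 0 means the set is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Pick the (feature index, threshold) minimizing weighted Gini impurity,
    the step a classification tree applies recursively to grow the tree."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best[1], best[2]

# Hypothetical module metrics: (lines_of_code, churn); label 1 = fault-prone.
rows = [(120, 3), (450, 9), (90, 1), (700, 12), (60, 0), (510, 8)]
labels = [0, 1, 0, 1, 0, 1]
print(best_split(rows, labels))  # (0, 120): split on lines_of_code <= 120
```

A general classification rule of the kind the abstract mentions would then assign each leaf a class by comparing its class proportions against prior probabilities and misclassification costs, rather than by simple majority vote.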
- Date Issued
- 1999
- PURL
- http://purl.flvc.org/fcla/dt/15707
- Subject Headings
- Computer software--Quality control, Software measurement, Computer software--Evaluation
- Format
- Document (PDF)
- Title
- Classification of software quality using tree modeling with the SPRINT/SLIQ algorithm.
- Creator
- Mao, Wenlei., Florida Atlantic University, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Providing high quality software products is the common goal of all software engineers, and finding faults early can produce large savings over the software life cycle. Software quality has therefore become a central subject in our research field. This thesis presents a series of studies on a very large legacy telecommunications system. The system has significantly more than ten million lines of code written in a high-level language similar to Pascal. Software quality models were developed to predict the class of each module as either fault-prone or not fault-prone. We used the SPRINT/SLIQ algorithm to build the classification tree models and found that SPRINT/SLIQ, as an improved CART algorithm, can give us tree models with more accuracy, more balance, and less overfitting. We also found that software process metrics can significantly improve the predictive accuracy of software quality models.
- Date Issued
- 2000
- PURL
- http://purl.flvc.org/fcla/dt/15767
- Subject Headings
- Computer software--Quality control, Software engineering, Software measurement
- Format
- Document (PDF)