Current Search: Khoshgoftaar, Taghi M. » College of Engineering and Computer Science
- Title
- Information theory and software measurement.
- Creator
- Allen, Edward B., Florida Atlantic University, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Development of reliable, high-quality software requires study and understanding at each step of the development process. A basic assumption in the field of software measurement is that metrics of internal software attributes somehow relate to the intrinsic difficulty in understanding a program. Measuring the information content of a program attempts to indirectly quantify the comprehension task. Information theory based software metrics are attractive because they quantify the amount of information in a well-defined framework. However, most information theory based metrics have been proposed with little reference to measurement theory fundamentals, and empirical validation of predictive quality models has been lacking. This dissertation proves that representative information theory based software metrics can be "meaningful" components of software quality models in the context of measurement theory. To this end, members of a major class of metrics are shown to be regular representations of Minimum Description Length or Variety of software attributes, and are interval scale. An empirical validation case study is presented that predicted faults in modules based on Operator Information. This metric is closely related to Harrison's Average Information Content Classification, which is the entropy of the operators. New general methods for calculating synthetic complexity at the system level and module level are presented, quantifying the joint information of an arbitrary set of primitive software measures. Since all kinds of information are not equally relevant to software quality factors, components of synthetic module complexity are also defined. Empirical case studies illustrate the potential usefulness of the proposed synthetic metrics. A metrics database is often the key to a successful ongoing software metrics program. The contribution of any proposed metric is defined in terms of measured variation using information theory, irrespective of the metric's usefulness in quality models. This is of interest when full validation is not practical. Case studies illustrate the method.
- Date Issued
- 1995
- PURL
- http://purl.flvc.org/fcla/dt/12412
- Subject Headings
- Software engineering, Computer software--Quality control, Information theory
- Format
- Document (PDF)
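The metric at the heart of this dissertation's validation study, Operator Information, is closely related to the entropy of a program's operator distribution (Harrison's Average Information Content Classification). As a minimal illustration, not the dissertation's implementation, here is Shannon entropy over operator frequencies; the token stream and function name are hypothetical:

```python
# Hypothetical sketch: Shannon entropy of a module's operator
# distribution, in the spirit of the entropy-of-operators metric
# described in the abstract above.
from collections import Counter
from math import log2

def operator_entropy(operators):
    """Entropy (bits) of the empirical operator distribution."""
    counts = Counter(operators)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Toy token stream for one module; real input would come from a parser.
tokens = ["+", "*", "+", "=", "if", "+", "=", "while"]
print(f"Operator information: {operator_entropy(tokens):.3f} bits")
```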
- Title
- Predicting failure of remote battery backup systems.
- Creator
- Aranguren, Pachano Liz Jeannette, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Uninterruptible Power Supply (UPS) systems have become essential to modern industries that require continuous power supply to manage critical operations. Since a failure of a single battery will affect the entire backup system, UPS system providers must replace any battery before it runs dead. In this regard, automated monitoring tools are required to determine when a battery needs replacement. Nowadays, a primitive method for monitoring the battery backup system is used for this task. This thesis presents a classification model that uses data mining cleansing and processing techniques to remove useless information from the data obtained from the sensors installed in the batteries, in order to improve the quality of the data and determine at a given moment whether a battery should be replaced. This prediction model will help UPS system providers increase the efficiency of battery monitoring procedures.
- Date Issued
- 2013
- PURL
- http://purl.flvc.org/fau/fd/FA0004002
- Subject Headings
- Electric power systems -- Equipment and supplies, Energy storing -- Testing, Lead acid batteries, Power electronics, Protective relays
- Format
- Document (PDF)
- Title
- Machine Learning Algorithms with Big Medicare Fraud Data.
- Creator
- Bauder, Richard Andrew, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Healthcare is an integral component of people's lives, especially for the rising elderly population, and must be affordable. The United States Medicare program is vital in serving the needs of the elderly. The growing number of people enrolled in the Medicare program, along with the enormous volume of money involved, increases the appeal for, and risk of, fraudulent activities. For many real-world applications, including Medicare fraud, the interesting observations tend to be less frequent than the normative observations. This difference between the normal observations and those observations of interest can create highly imbalanced datasets. The problem of class imbalance, including the classification of rare cases indicating extreme class imbalance, is an important and well-studied area in machine learning. Research on the effects of class imbalance with big data in the real-world Medicare fraud application domain, however, is limited. In particular, detecting fraud in Medicare claims is critical in lessening the financial and personal impacts of these transgressions. Fortunately, the healthcare domain is one such area where the successful detection of fraud can garner meaningful positive results. The application of machine learning techniques, plus methods to mitigate the adverse effects of class imbalance and rarity, can be used to detect fraud and lessen the impacts for all Medicare beneficiaries. This dissertation presents the application of machine learning approaches to detect Medicare provider claims fraud in the United States. We discuss novel techniques to process three big Medicare datasets and create a new, combined dataset, which includes mapping fraud labels associated with known excluded providers. We investigate the ability of machine learning techniques, unsupervised and supervised, to detect Medicare claims fraud and leverage data sampling methods to lessen the impact of class imbalance and increase fraud detection performance. Additionally, we extend the study of class imbalance to assess the impacts of rare cases in big data for Medicare fraud detection.
- Date Issued
- 2018
- PURL
- http://purl.flvc.org/fau/fd/FA00013108
- Subject Headings
- Medicare fraud, Big data, Machine learning, Algorithms
- Format
- Document (PDF)
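The data sampling methods this dissertation leverages against class imbalance can be sketched generically. The following is an illustrative example only, using scikit-learn and imbalanced-learn on synthetic data; the sampling ratio and learner are assumptions, not the study's configuration:

```python
# Illustrative sketch (not the author's pipeline): random under-sampling
# of the majority class before training, a common remedy for the class
# imbalance described above. Requires scikit-learn and imbalanced-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

# Synthetic stand-in for a highly imbalanced fraud dataset (~1% positive).
X, y = make_classification(n_samples=50_000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Under-sample negatives so positives make up ~10% of the training data.
rus = RandomUnderSampler(sampling_strategy=0.1, random_state=0)
X_bal, y_bal = rus.fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```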
- Title
- Wavelet de-noising applied to vibrational envelope analysis methods.
- Creator
- Bertot, Edward Max, Khoshgoftaar, Taghi M., Beaujean, Pierre-Philippe, Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- In the field of machine prognostics, vibration analysis is a proven method for detecting and diagnosing bearing faults in rotating machines. One popular method for interpreting vibration signals is envelope demodulation, which allows a technician to clearly identify an impulsive fault source and its severity. However, incipient faults (faults in early stages) are masked by in-band noise, which can make the associated impulses difficult to detect and interpret. In this thesis, Wavelet De-Noising (WDN) is implemented after envelope demodulation to improve the accuracy of bearing fault diagnostics. This contrasts with the typical approach of de-noising as a preprocessing step. When time-domain impulse amplitudes are measured manually, the algorithm shows varying improvements in Signal-to-Noise Ratio (SNR) relative to background vibrational noise. A frequency-domain measure of SNR agrees with this result.
- Date Issued
- 2014
- PURL
- http://purl.flvc.org/fau/fd/FA00004080
- Subject Headings
- Fluid dynamics, Signal processing, Structural dynamics, Wavelet (Mathematics)
- Format
- Document (PDF)
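The thesis's key design choice, de-noising after envelope demodulation rather than before, can be sketched as follows. This is a generic illustration assuming PyWavelets and SciPy; the test signal, wavelet choice, and threshold rule are not taken from the thesis:

```python
# Sketch of the processing order described above: envelope demodulation
# first, wavelet de-noising (soft thresholding) second.
import numpy as np
import pywt
from scipy.signal import hilbert

fs = 10_000
t = np.arange(0, 1, 1 / fs)
bursts = (np.sin(2 * np.pi * 97 * t) > 0.999).astype(float)  # fault-like impulse train
signal = bursts * np.sin(2 * np.pi * 3_000 * t) + 0.3 * np.random.randn(t.size)

envelope = np.abs(hilbert(signal))               # step 1: envelope demodulation

coeffs = pywt.wavedec(envelope, "db4", level=5)  # step 2: wavelet decomposition
sigma = np.median(np.abs(coeffs[-1])) / 0.6745   # noise estimate from finest scale
thr = sigma * np.sqrt(2 * np.log(envelope.size)) # universal threshold
denoised = pywt.waverec(
    [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]],
    "db4")
```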
- Title
- Rough Set-Based Software Quality Models and Quality of Data.
- Creator
- Bullard, Lofton A., Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- In this dissertation we address two significant issues of concern: software quality modeling and data quality assessment. Software quality can be measured by software reliability. Reliability is often measured in terms of the time between system failures. A failure is caused by a fault, which is a defect in the executable software product. The time between system failures depends both on the presence and the usage pattern of the software. Finding faulty components in the development cycle of a software system can lead to a more reliable final system and will reduce development and maintenance costs. The issue of software quality is investigated by proposing a new approach, the rule-based classification model (RBCM), which uses rough set theory to generate decision rules to predict software quality. The new model minimizes over-fitting by balancing the Type I and Type II misclassification error rates. We also propose a model selection technique for rule-based models called rule-based model selection (RBMS). The proposed rule-based model selection technique utilizes the complete and partial matching rule sets of candidate RBCMs to determine the model with the least amount of over-fitting. In the experiments that were performed, the RBCMs were effective at identifying faulty software modules, and the RBMS technique was able to identify RBCMs that minimized over-fitting. Good data quality is a critical component for building effective software quality models. We address the significance of the quality of data on the classification performance of learners by conducting a comprehensive comparative study. Several trends were observed in the experiments. Class and attribute noise had the greatest impact on the performance of learners when they occurred simultaneously in the data. Class noise had a significant impact on the performance of learners, while attribute noise had no impact when it occurred in less than 40% of the most significant independent attributes. Random Forest (RF100), a group of 100 decision trees, was the most accurate and robust learner in all the experiments with noisy data.
- Date Issued
- 2008
- PURL
- http://purl.flvc.org/fau/fd/FA00012567
- Subject Headings
- Computer software--Quality control, Computer software--Reliability, Software engineering, Computer arithmetic
- Format
- Document (PDF)
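The over-fitting control described above balances Type I and Type II misclassification rates. A generic way to express that balancing idea, for a probabilistic classifier rather than the rough-set rules used in the dissertation, is to choose the decision threshold where the false positive and false negative rates are approximately equal:

```python
# Generic sketch of balancing Type I and Type II error rates by choosing
# a decision threshold where FPR ~= FNR. This illustrates the balancing
# idea only; it is not the rough-set rule generation the dissertation uses.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=2_000, weights=[0.8], random_state=1)
proba = LogisticRegression(max_iter=1_000).fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, proba)
fnr = 1 - tpr                                   # Type II rate at each threshold
balanced = thresholds[np.argmin(np.abs(fpr - fnr))]
print(f"Threshold balancing Type I/II rates: {balanced:.3f}")
```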
- Title
- DATA COLLECTION FRAMEWORK AND MACHINE LEARNING ALGORITHMS FOR THE ANALYSIS OF CYBER SECURITY ATTACKS.
- Creator
- Calvert, Chad, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- The integrity of network communications is constantly being challenged by more sophisticated intrusion techniques. Attackers are shifting to stealthier and more complex forms of attacks in an attempt to bypass known mitigation strategies. Also, many detection methods for popular network attacks have been developed using outdated or non-representative attack data. To effectively develop modern detection methodologies, there exists a need to acquire data that can fully encompass the behaviors of persistent and emerging threats. When collecting modern-day network traffic for intrusion detection, substantial amounts of traffic can be collected, much of which consists of relatively few attack instances as compared to normal traffic. This skewed distribution between normal and attack data can lead to high levels of class imbalance. Machine learning techniques can be used to aid in attack detection, but large levels of imbalance between normal (majority) and attack (minority) instances can lead to inaccurate detection results.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013289
- Subject Headings
- Machine learning, Algorithms, Anomaly detection (Computer security), Intrusion detection systems (Computer security), Big data
- Format
- Document (PDF)
- Title
- Microarray deconvolution: a web application.
- Creator
- Canny, Stephanie, Petrie, Howard, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Microarray gene expression profiling is used in biology for a variety of purposes, including identifying disease biomarkers and understanding cellular processes. Biological samples composed of multiple cell or tissue types pose a problem because different compositions of cell types in samples affect the gene expression profile, and the expression profiles of individual components of the sample may themselves be of interest. Physical methods to separate mixed samples are time-consuming and expensive. Consequently, several computational methods have been developed to deconvolute heterogeneous samples into individual components, and different software packages and applications are available to perform these calculations. Microarray Deconvolution is a web application that provides a simple-to-use interface and fills some gaps left by other packages in performing heterogeneous-sample microarray deconvolution, including microarray raw data processing and normalization, cell-type proportion estimation, and simple linear deconvolution.
- Date Issued
- 2013
- PURL
- http://purl.flvc.org/fau/fd/FA00004243
- Format
- Document (PDF)
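The "simple linear deconvolution" the application performs models a mixed sample's expression as a nonnegative combination of pure cell-type signatures, mixed ≈ S·p. Here is a minimal sketch with synthetic data, using non-negative least squares as one common solver choice (not necessarily the application's):

```python
# Sketch of simple linear deconvolution: a mixed sample's expression is
# modeled as mixed ~= S @ p, where S holds pure cell-type signatures and
# p the cell-type proportions. Solved here with non-negative least squares.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
S = rng.uniform(0, 10, size=(500, 3))      # 500 genes x 3 pure cell types
p_true = np.array([0.6, 0.3, 0.1])         # true mixing proportions
mixed = S @ p_true + rng.normal(0, 0.1, 500)

p_est, _ = nnls(S, mixed)
p_est /= p_est.sum()                       # normalize to proportions
print("Estimated proportions:", np.round(p_est, 3))
```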
- Title
- A COMPARATIVE STUDY OF STRUCTURED VERSUS UNSTRUCTURED TEXT DATA.
- Creator
- Cardenas, Erika, Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
- In today's world, data is generated at an unprecedented rate, and a significant portion of it is unstructured text data. Recent advancements in Natural Language Processing have enabled computers to understand and interpret human language. Data mining techniques were once unable to use text data due to the high dimensionality of text processing models; this limitation was overcome with the ability to represent text as data. This thesis aims to compare the predictive performance of structured versus unstructured text data in two different applications. The first application is in the field of real estate: we compare the performance of tabular real-estate data and unstructured text descriptions of homes in predicting house prices. The second application translates Electronic Health Records (EHR) tabular data to text data for survival classification of COVID-19 patients. Lastly, we present a range of strategies and perspectives for future research.
- Date Issued
- 2023
- PURL
- http://purl.flvc.org/fau/fd/FA00014220
- Subject Headings
- Natural language processing (Computer science), Text data mining
- Format
- Document (PDF)
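The thesis's comparison can be sketched schematically: train one model on tabular features and another on a TF-IDF representation of free text, then compare predictive performance. The tiny real-estate-flavored dataset below is purely illustrative:

```python
# Schematic of the structured-vs-text comparison: one model on tabular
# features, one on TF-IDF text features, same labels. Data is illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tabular = np.array([[3, 2_100], [2, 950], [4, 3_000], [1, 700]])  # beds, sqft
texts = [
    "spacious family home with renovated kitchen",
    "cozy starter condo near downtown",
    "luxury estate with pool and large yard",
    "small fixer-upper, great investment",
]
labels = [1, 0, 1, 0]  # e.g., above/below median price

tab_model = LogisticRegression().fit(tabular, labels)

vec = TfidfVectorizer()
X_txt = vec.fit_transform(texts)
txt_model = LogisticRegression().fit(X_txt, labels)

print(tab_model.score(tabular, labels), txt_model.score(X_txt, labels))
```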
- Title
- DEEP MAXOUT NETWORKS FOR CLASSIFICATION PROBLEMS ACROSS MULTIPLE DOMAINS.
- Creator
- Castaneda, Gabriel, Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
- Machine learning techniques such as deep neural networks have become an indispensable tool for a wide range of applications such as image classification, speech recognition, and sentiment analysis in text. An activation function is a mathematical equation that determines the output of each neuron in the neural network. In deep learning architectures, the choice of activation functions is very important to the network's performance. Activation functions determine the output of the model, its computational efficiency, and its ability to train and converge after multiple iterations of training epochs. The selection of an activation function is critical to building and training an effective and efficient neural network. In real-world applications of deep neural networks, the activation function is a hyperparameter. We have observed a lack of consensus on how to select a good activation function for a deep neural network, and that a specific function may not be suitable for all domain-specific applications.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013362
- Subject Headings
- Classification, Machine learning--Technique, Neural networks (Computer science)
- Format
- Document (PDF)
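The maxout activation studied in this dissertation takes, for each output unit, the maximum over k affine transformations of the input. A minimal numpy sketch, with shapes and values purely illustrative:

```python
# Minimal numpy sketch of a maxout layer: each output unit takes the
# maximum over k independent affine transformations of the input.
import numpy as np

def maxout(x, W, b):
    """x: (batch, d_in); W: (d_in, d_out, k); b: (d_out, k)."""
    z = np.einsum("bi,iok->bok", x, W) + b  # (batch, d_out, k) affine pieces
    return z.max(axis=-1)                   # max over the k pieces

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 16, 3))  # 16 output units, k=3 pieces each
b = rng.normal(size=(16, 3))
print(maxout(x, W, b).shape)     # (4, 16)
```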
- Title
- An evaluation of Unsupervised Machine Learning Algorithms for Detecting Fraud and Abuse in the U.S. Medicare Insurance Program.
- Creator
- Da Rosa, Raquel C., Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- The population of people ages 65 and older has increased since the 1960s, and current estimates indicate it will double by 2060. Medicare is a federal health insurance program for people 65 or older in the United States. Medicare claims fraud and abuse is an ongoing issue that wastes a large amount of money every year, resulting in higher health care costs and taxes for everyone. In this study, an empirical evaluation of several unsupervised machine learning approaches is performed, indicating reasonable fraud detection results. We employ two unsupervised machine learning algorithms, Isolation Forest and Unsupervised Random Forest, which have not previously been used for the detection of fraud and abuse on Medicare data. Additionally, we implement three other machine learning methods previously applied to Medicare data: Local Outlier Factor, Autoencoder, and k-Nearest Neighbor. For our dataset, we combine the 2012 to 2015 Medicare provider utilization and payment data and add fraud labels from the List of Excluded Individuals/Entities (LEIE) database. Results show that Local Outlier Factor is the best model to use for Medicare fraud detection.
- Date Issued
- 2018
- PURL
- http://purl.flvc.org/fau/fd/FA00013042
- Subject Headings
- Machine learning, Medicare fraud, Algorithms
- Format
- Document (PDF)
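Scikit-learn ships implementations of two of the detectors evaluated in this study, Isolation Forest and Local Outlier Factor. A minimal sketch on synthetic data follows; the contamination rate and cluster placement are illustrative, not the study's Medicare setup:

```python
# Illustrative use of two unsupervised detectors named in the abstract.
# fit_predict returns -1 for points flagged as outliers.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (990, 4)),   # "normal" provider profiles
               rng.normal(6, 1, (10, 4))])   # anomalous profiles

iso = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01).fit_predict(X)
print("IsolationForest flagged:", (iso == -1).sum(),
      "| LOF flagged:", (lof == -1).sum())
```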
- Title
- Machine learning techniques for alleviating inherent difficulties in bioinformatics data.
- Creator
- Dittman, David, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- In response to the massive amounts of data that make up a large number of bioinformatics datasets, it has become increasingly necessary for researchers to use computers to aid them in their endeavors. With difficulties such as high dimensionality, class imbalance, noisy data, and difficult-to-learn class boundaries present within the data, bioinformatics datasets are a challenge to work with. One potential source of assistance is the domain of data mining and machine learning, a field which focuses on working with these large amounts of data, develops techniques to discover new trends and patterns hidden within the data, and increases the capability of researchers and practitioners to work with this data. Within this domain there are techniques designed to eliminate irrelevant or redundant features, balance the membership of the classes, handle errors found in the data, and build predictive models for future data.
- Date Issued
- 2015
- PURL
- http://purl.flvc.org/fau/fd/FA00004362
- Subject Headings
- Artificial intelligence, Bioinformatics, Machine learning, System design, Theory of computation
- Format
- Document (PDF)
- Title
- Ensemble Learning Algorithms for the Analysis of Bioinformatics Data.
- Creator
- Fazelpour, Alireza, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Developments in advanced technologies, such as DNA microarrays, have generated tremendous amounts of data available to researchers in the field of bioinformatics. These state-of-the-art technologies present not only unprecedented opportunities to study biological phenomena of interest, but significant challenges in terms of processing the data. Furthermore, these datasets inherently exhibit a number of challenging characteristics, such as class imbalance, high dimensionality, small dataset size, noisy data, and complexity in terms of hard-to-distinguish decision boundaries between classes within the data. In recognition of the aforementioned challenges, this dissertation utilizes a variety of machine-learning and data-mining techniques, such as ensemble classification algorithms in conjunction with data sampling and feature selection techniques, to alleviate these problems while improving the classification results of models built on these datasets. In building classification models, however, researchers and practitioners encounter the challenge that there is no single classifier that performs relatively well in all cases. Thus, numerous classification approaches, such as ensemble learning methods, have been developed to address this problem successfully in a majority of circumstances. Ensemble learning is a promising technique that generates multiple classification models and then combines their decisions into a single final result. Ensemble learning often performs better than single-base classifiers in classification tasks. This dissertation conducts thorough empirical research by implementing a series of case studies to evaluate how ensemble learning techniques can be utilized to enhance overall classification performance, as well as to improve the generalization ability of ensemble models. This dissertation investigates ensemble learning techniques of the boosting, bagging, and random forest algorithms, and proposes a number of modifications to the existing ensemble techniques in order to further improve the classification results. This dissertation examines the effectiveness of ensemble learning techniques in accounting for the challenging characteristics of class imbalance and difficult-to-learn class decision boundaries. Next, it looks into ensemble methods that are relatively tolerant to class noise, and not only account for the problem of class noise but improve classification performance. This dissertation also examines the joint effects of data sampling along with ensemble techniques to determine whether sampling techniques can further improve the classification performance of built ensemble models.
- Date Issued
- 2016
- PURL
- http://purl.flvc.org/fau/fd/FA00004588
- Subject Headings
- Bioinformatics, Data mining--Technological innovations, Machine learning
- Format
- Document (PDF)
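The three ensemble families the dissertation investigates (bagging, boosting, and random forest) have standard scikit-learn forms. An illustrative comparison on synthetic imbalanced data, with parameters that are not the dissertation's:

```python
# Illustrative comparison of the three ensemble families named above,
# scored by cross-validated AUC on synthetic imbalanced data.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, weights=[0.9], random_state=0)
for name, clf in [("bagging", BaggingClassifier(n_estimators=50)),
                  ("boosting", AdaBoostClassifier(n_estimators=50)),
                  ("random forest", RandomForestClassifier(n_estimators=50))]:
    print(name, cross_val_score(clf, X, y, scoring="roc_auc", cv=5).mean())
```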
- Title
- Count models for software quality estimation.
- Creator
- Gao, Kehan, Florida Atlantic University, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- The primary aim of software engineering is to produce quality software that is delivered on time, within budget, and fulfils all its requirements. A timely estimation of software quality can serve as a prerequisite in achieving high reliability of software-based systems. More specifically, software quality assurance efforts can be prioritized to target program modules that are most likely to have a high number of faults. Software quality estimation models are generally of two types: a classification model that predicts the class membership of modules into two or more quality-based classes, and a quantitative prediction model that estimates the number of faults (or some other software quality factor) likely to occur in software modules. In the literature, a variety of techniques have been developed for software quality estimation, most of which are suited for either prediction or classification but not for both, e.g., multiple linear regression (only for prediction) and logistic regression (only for classification).
- Date Issued
- 2003
- PURL
- http://purl.flvc.org/fcla/dt/12042
- Subject Headings
- Computer software--Quality control, Software engineering, Econometrics, Regression analysis
- Format
- Document (PDF)
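A count model in this sense estimates the expected number of faults per module, and thresholding that estimate also yields a classification, covering both model types the abstract distinguishes. Here is a sketch using Poisson regression on synthetic metrics data; the dissertation's specific count models may differ:

```python
# Sketch of a count model for fault prediction: Poisson regression
# estimates the expected number of faults per module, and thresholding
# that estimate also yields a fault-prone/not-fault-prone classification.
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))                # e.g., size/complexity metrics
faults = rng.poisson(np.exp(X @ [1.5, 0.8, 0.3]))   # synthetic fault counts

model = PoissonRegressor().fit(X, faults)
expected = model.predict(X)                         # predicted fault counts
fault_prone = expected > 2                          # derived classification
print(expected[:5].round(2), fault_prone[:5])
```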
- Title
- INVESTIGATING MACHINE LEARNING ALGORITHMS WITH IMBALANCED BIG DATA.
- Creator
- Hasanin, Tawfiq, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Recent technological developments have engendered an expeditious production of big data and have also enabled machine learning algorithms to produce high-performance models from such data. Nonetheless, class imbalance (in binary classification) between the majority and minority classes in big data can skew the predictive performance of classification algorithms toward the majority (negative) class, whereas the minority (positive) class usually holds greater value for decision makers. Such bias may lead to adverse consequences, some of them even life-threatening, since false negatives are generally costlier than false positives. The size of the minority class can vary from fair to extraordinarily small, which can lead to different performance scores for machine learning algorithms. Class imbalance is a well-studied area for traditional data, i.e., not big data. However, there is limited research focusing on both rarity and severe class imbalance in big data.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013316
- Subject Headings
- Algorithms, Machine learning, Big data--Data processing, Big data
- Format
- Document (PDF)
- Title
- Effects of gene selection and data sampling on prediction of breast cancer treatments.
- Creator
- Heredia, Brian, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- In recent years, more and more researchers have begun to use data mining and machine learning tools to analyze gene microarray data. In this thesis we have collected a selection of datasets revolving around prediction of patient response in the specific area of breast cancer treatment. The datasets collected here are all obtained from gene chips, which have become the industry standard in measurement of gene expression. We discuss the methods and procedures used in the original studies to analyze the datasets and their effects on treatment prediction, with a particular interest in the selection of genes for predicting patient response. We also analyze the datasets on our own in a uniform manner to determine the validity of these datasets in terms of learning potential, and provide strategies for future work exploring how best to identify gene signatures.
- Date Issued
- 2014
- PURL
- http://purl.flvc.org/fau/fd/FA00004292
- Subject Headings
- Antineoplastic agents -- Development, Breast -- Cancer -- Treatment, Cancer -- Genetic aspects, DNA microarrays, Estimation theory, Gene expression
- Format
- Document (PDF)
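Gene selection and data sampling, the two factors this thesis examines, interact with evaluation methodology: performing both inside each cross-validation fold avoids optimistic bias. An illustrative pipeline on synthetic microarray-like data; the selector, sampler, and learner are assumptions, not the thesis's choices:

```python
# Sketch of gene (feature) selection plus data sampling evaluated together:
# wrapping both in a pipeline keeps them inside each cross-validation fold.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Synthetic stand-in for microarray data: few samples, many "genes".
X, y = make_classification(n_samples=120, n_features=2_000, n_informative=30,
                           weights=[0.75], random_state=0)
pipe = Pipeline([("select", SelectKBest(f_classif, k=50)),
                 ("sample", SMOTE(random_state=0)),
                 ("clf", LogisticRegression(max_iter=1_000))])
print(cross_val_score(pipe, X, y, scoring="roc_auc", cv=5).mean())
```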
- Title
- Machine Learning Algorithms for the Analysis of Social Media and Detection of Malicious User Generated Content.
- Creator
- Heredia, Brian, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- One of the defining characteristics of the modern Internet is its massive connectedness, with information and human connection simply a few clicks away. Social media and online retailers have revolutionized how we communicate and purchase goods or services. User generated content on the web, through social media, plays a large role in modern society; Twitter has been at the forefront of political discourse, with politicians choosing it as their platform for disseminating information, while websites like Amazon and Yelp allow users to share their opinions on products via online reviews. The information available through these platforms can provide insight into a host of relevant topics through the process of machine learning. Specifically, this process involves text mining for sentiment analysis, which is an application domain of machine learning involving the extraction of emotion from text. Unfortunately, there are still those with malicious intent, and with the changes to how we communicate and conduct business come changes to their malicious practices. Social bots and fake reviews plague the web, providing incorrect information and swaying the opinion of unaware readers. The detection of these false users or posts from reading the text is difficult, if not impossible, for humans. Fortunately, text mining provides us with methods for the detection of harmful user generated content. This dissertation expands the current research in sentiment analysis, fake online review detection, and election prediction. We examine cross-domain sentiment analysis using tweets and reviews. Novel techniques combining ensemble and feature selection methods are proposed for the domain of online spam review detection. We investigate the ability of the Twitter platform to predict the United States 2016 presidential election. In addition, we determine how social bots influence this prediction.
- Date Issued
- 2018
- PURL
- http://purl.flvc.org/fau/fd/FA00013067
- Subject Headings
- Machine learning, Text mining, User-generated content, Social media
- Format
- Document (PDF)
- Title
- Big Data Analytics and Engineering for Medicare Fraud Detection.
- Creator
- Herland, Matthew Andrew, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- The United States (U.S.) healthcare system produces an enormous volume of data, with a vast number of financial transactions generated by physicians administering healthcare services. This makes healthcare fraud difficult to detect, especially when there are considerably fewer fraudulent transactions than non-fraudulent ones. Fraud is an extremely important issue for healthcare, as fraudulent activities within the U.S. healthcare system contribute to significant financial losses. In the U.S., the elderly population continues to rise, increasing the need for programs, such as Medicare, to help with associated medical expenses. Unfortunately, due to healthcare fraud, these programs are being adversely affected, draining resources and reducing the quality and accessibility of necessary healthcare services. In response, advanced data analytics have recently been explored to detect possible fraudulent activities. The Centers for Medicare and Medicaid Services (CMS) released several ‘Big Data’ Medicare claims datasets for different parts of their Medicare program to help facilitate this effort. In this dissertation, we employ three CMS Medicare Big Data datasets to evaluate the fraud detection performance available using advanced data analytics techniques, specifically machine learning. We use two distinct approaches, designated as anomaly detection and traditional fraud detection, each with very distinct data processing and feature engineering. Anomaly detection experiments classify by provider specialty, determining whether outlier physicians within the same specialty signal fraudulent behavior. Traditional fraud detection refers to the experiments directly classifying physicians as fraudulent or non-fraudulent, leveraging machine learning algorithms to discriminate between classes. We present our novel data engineering approaches for both anomaly detection and traditional fraud detection, including data processing, fraud mapping, and the creation of a combined dataset consisting of all three Medicare parts. We incorporate the List of Excluded Individuals and Entities database to identify real-world fraudulent physicians for model evaluation. Regarding features, the final datasets for anomaly detection contain only claim counts for every procedure a physician submits, while traditional fraud detection incorporates aggregated counts and payment information, specialty, and gender. Additionally, we compare cross-validation to the real-world application of building a model on a training dataset and evaluating on a separate test dataset, under severe class imbalance and rarity.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013215
- Subject Headings
- Big data, Medicare fraud, Data analytics, Machine learning
- Format
- Document (PDF)
- Title
- Performance analysis of a new object-based I/O architecture for PCs and workstations.
- Creator
- Huynh, Khoa Dang, Florida Atlantic University, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- In this dissertation, an object-based I/O architecture for personal computers (PCs) and workstations is proposed. The proposed architecture allows the flexibility of having I/O processing performed as much as possible by intelligent I/O adapters, or by the host processor, or by any processor in the system, depending on application requirements and underlying hardware capabilities. It keeps many good features of current I/O architectures, while providing more flexibility to take advantage of new hardware technologies, promote architectural openness, provide better performance, and achieve higher reliability. The proposed architecture introduces a new definition of I/O subsystems and makes use of concurrent object-oriented technology. It combines the notions of object and thread into something called an active object. All concurrency abstractions required by the proposed architecture are provided through external libraries on top of existing sequential object-oriented languages, without any changes to the syntax and semantics of these languages. We also evaluate the performance of optimal implementations of the proposed I/O architecture against other I/O architectures in three popular, PC-based, distributed environments: network file server, video server, and video conferencing. Using the RESearch Queueing Modeling Environment (RESQME), we have developed detailed simulation models for various implementations of the proposed I/O architecture and two other existing I/O architectures: a conventional, interrupt-based I/O architecture and a peer-to-peer I/O architecture. Our simulation results indicate that, on several different hardware platforms, the proposed I/O architecture outperforms both existing architectures in all three distributed environments considered.
- Date Issued
- 1994
- PURL
- http://purl.flvc.org/fcla/dt/12386
- Subject Headings
- Local area networks (Computer networks), Computer input-output equipment, Computer networks, Videoconferencing, Client/server computing, Object-oriented programming (Computer science)
- Format
- Document (PDF)
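As a loose illustration of the kind of queueing simulation such performance studies rest on (a generic M/M/1 sketch in Python, not one of the RESQME models described above), a single-server I/O adapter can be simulated in a few lines:

```python
# Tiny illustration of a queueing simulation: an M/M/1 model of a single
# I/O adapter. Generic sketch only; not the dissertation's RESQME models.
import random

def mm1_mean_response(arrival_rate, service_rate, n_jobs=100_000, seed=0):
    rng = random.Random(seed)
    clock = server_free = total_resp = 0.0
    for _ in range(n_jobs):
        clock += rng.expovariate(arrival_rate)       # next job arrival
        start = max(clock, server_free)              # wait if adapter busy
        server_free = start + rng.expovariate(service_rate)
        total_resp += server_free - clock            # response time
    return total_resp / n_jobs

# Theory check: mean response of M/M/1 is 1/(mu - lambda) = 1/(10-7) ~ 0.333
print(mm1_mean_response(arrival_rate=7.0, service_rate=10.0))
```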
- Title
- ADDRESSING HIGHLY IMBALANCED BIG DATA CHALLENGES FOR MEDICARE FRAUD DETECTION.
- Creator
- Johnson, Justin M., Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
- Access to affordable healthcare is a nationwide concern that impacts most of the United States population. Medicare is a federal government healthcare program that aims to provide affordable health insurance to the elderly population and individuals with select disabilities. Unfortunately, there is a significant amount of fraud, waste, and abuse within the Medicare system that inevitably raises premiums and costs taxpayers billions of dollars each year. Dedicated task forces investigate the most severe fraudulent cases, but with millions of healthcare providers and more than 60 million active Medicare beneficiaries, manual fraud detection efforts are not able to make widespread, meaningful impact. Through the proliferation of electronic health records and continuous breakthroughs in data mining and machine learning, there is a great opportunity to develop and leverage advanced machine learning systems for automating healthcare fraud detection. This dissertation identifies key challenges associated with predictive modeling for large-scale Medicare fraud detection and presents innovative solutions to address these challenges in order to provide state-of-the-art results on multiple real-world Medicare fraud data sets. Our methodology for curating nine distinct Medicare fraud classification data sets is presented with comprehensive details describing data accumulation, data pre-processing, data aggregation techniques, data enrichment strategies, and improved fraud labeling. Data-level and algorithm-level methods for treating severe class imbalance, including a flexible output thresholding method and a cost-sensitive framework, are evaluated using deep neural network and ensemble learners. Novel encoding techniques and representation learning methods for high-dimensional categorical features are proposed to create expressive representations of provider attributes and billing procedure codes.
- Date Issued
- 2022
- PURL
- http://purl.flvc.org/fau/fd/FA00014057
- Subject Headings
- Medicare fraud, Big data, Machine learning
- Format
- Document (PDF)
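Two of the imbalance treatments named above, output thresholding and cost-sensitive learning, can be sketched generically. The class weighting and threshold criterion below are illustrative assumptions, not the dissertation's exact methods:

```python
# Generic sketch of two imbalance treatments named in the abstract: a
# cost-sensitive learner (via class weights) and output thresholding
# (sweeping the decision threshold to maximize the geometric mean of
# true positive rate and true negative rate).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)

# Cost-sensitive: errors on the rare (fraud) class are weighted 99:1.
clf = LogisticRegression(class_weight={0: 1, 1: 99}, max_iter=1_000).fit(X, y)

# Output thresholding: pick the threshold maximizing sqrt(TPR * (1 - FPR)).
fpr, tpr, thresholds = roc_curve(y, clf.predict_proba(X)[:, 1])
best = thresholds[np.argmax(np.sqrt(tpr * (1 - fpr)))]
print(f"Selected threshold: {best:.3f}")
```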
- Title
- An Evaluation of Deep Learning with Class Imbalanced Big Data.
- Creator
- Johnson, Justin Matthew, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
- Effective classification with imbalanced data is an important area of research, as high class imbalance is naturally inherent in many real-world applications, e.g. anomaly detection. Modeling such skewed data distributions is often very difficult, and non-standard methods are sometimes required to combat these negative effects. These challenges have been studied thoroughly using traditional machine learning algorithms, but very little empirical work exists in the area of deep learning with class imbalanced big data. Following an in-depth survey of deep learning methods for addressing class imbalance, we evaluate various methods for addressing imbalance on the task of detecting Medicare fraud, a big data problem characterized by extreme class imbalance. Case studies herein demonstrate the impact of class imbalance on neural networks, evaluate the efficacy of data-level and algorithm-level methods, and achieve state-of-the-art results on the given Medicare data set. Results indicate that combining under-sampling and over-sampling maximizes both performance and efficiency.
- Date Issued
- 2019
- PURL
- http://purl.flvc.org/fau/fd/FA00013221
- Subject Headings
- Deep learning, Big data, Medicare fraud--Prevention
- Format
- Document (PDF)
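The concluding result, that combining under-sampling and over-sampling works best, is commonly expressed as SMOTE-style over-sampling of the minority followed by random under-sampling of the majority. A sketch with imbalanced-learn; the ratios (and the choice of SMOTE rather than another over-sampler) are illustrative, not necessarily the thesis's:

```python
# One standard way to combine over- and under-sampling: over-sample the
# minority with SMOTE, then randomly under-sample the majority.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=100_000, weights=[0.999], random_state=0)
print("before:", Counter(y))

X, y = SMOTE(sampling_strategy=0.1, random_state=0).fit_resample(X, y)
X, y = RandomUnderSampler(sampling_strategy=0.5, random_state=0).fit_resample(X, y)
print("after: ", Counter(y))
```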