-
-
Title
-
INVESTIGATING MACHINE LEARNING ALGORITHMS WITH IMBALANCED BIG DATA.
-
Creator
-
Hasanin, Tawfiq, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
-
Abstract/Description
-
Recent technological developments have engendered an expeditious production of big data and also enabled machine learning algorithms to produce high-performance models from such data. Nonetheless, class imbalance (in binary classification) between the majority and minority classes in big data can skew the predictive performance of classification algorithms toward the majority (negative) class, whereas the minority (positive) class usually holds greater value for decision makers. Such bias may lead to adverse consequences, some of them even life-threatening, when false negatives are generally costlier than false positives. The size of the minority class can vary from fair to extraordinarily small, which can lead to different performance scores for machine learning algorithms. Class imbalance is a well-studied area for traditional data, i.e., not big data. However, there is limited research focusing on both rarity and severe class imbalance in big data.
-
Date Issued
-
2019
-
PURL
-
http://purl.flvc.org/fau/fd/FA00013316
-
Subject Headings
-
Algorithms, Machine learning, Big data--Data processing, Big data
-
Format
-
Document (PDF)
-
-
Title
-
Big Data Analytics and Engineering for Medicare Fraud Detection.
-
Creator
-
Herland, Matthew Andrew, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
-
Abstract/Description
-
The United States (U.S.) healthcare system produces an enormous volume of data with a vast number of financial transactions generated by physicians administering healthcare services. This makes healthcare fraud difficult to detect, especially when there are considerably fewer fraudulent transactions than non-fraudulent ones. Fraud is an extremely important issue for healthcare, as fraudulent activities within the U.S. healthcare system contribute to significant financial losses. In the U.S., the elderly population continues to rise, increasing the need for programs, such as Medicare, to help with associated medical expenses. Unfortunately, due to healthcare fraud, these programs are being adversely affected, draining resources and reducing the quality and accessibility of necessary healthcare services. In response, advanced data analytics have recently been explored to detect possible fraudulent activities. The Centers for Medicare and Medicaid Services (CMS) released several ‘Big Data’ Medicare claims datasets for different parts of their Medicare program to help facilitate this effort. In this dissertation, we employ three CMS Medicare Big Data datasets to evaluate the fraud detection performance available using advanced data analytics techniques, specifically machine learning. We use two distinct approaches, designated as anomaly detection and traditional fraud detection, each of which has very distinct data processing and feature engineering. Anomaly detection experiments classify by provider specialty, determining whether outlier physicians within the same specialty signal fraudulent behavior. Traditional fraud detection refers to the experiments directly classifying physicians as fraudulent or non-fraudulent, leveraging machine learning algorithms to discriminate between classes.
We present our novel data engineering approaches for both anomaly detection and traditional fraud detection including data processing, fraud mapping, and the creation of a combined dataset consisting of all three Medicare parts. We incorporate the List of Excluded Individuals and Entities database to identify real world fraudulent physicians for model evaluation. Regarding features, the final datasets for anomaly detection contain only claim counts for every procedure a physician submits while traditional fraud detection incorporates aggregated counts and payment information, specialty, and gender. Additionally, we compare cross-validation to the real world application of building a model on a training dataset and evaluating on a separate test dataset for severe class imbalance and rarity.
-
Date Issued
-
2019
-
PURL
-
http://purl.flvc.org/fau/fd/FA00013215
-
Subject Headings
-
Big data, Medicare fraud, Data analytics, Machine learning
-
Format
-
Document (PDF)
-
-
Title
-
FRAUD DETECTION IN HIGHLY IMBALANCED BIG DATA WITH NOVEL AND EFFICIENT DATA REDUCTION TECHNIQUES.
-
Creator
-
Hancock III, John T., Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
-
Abstract/Description
-
The rapid growth of digital transactions and the increasing sophistication of fraudulent activities have necessitated the development of robust and efficient fraud detection techniques, particularly in the financial and healthcare sectors. This dissertation focuses on the use of novel data reduction techniques for addressing the unique challenges associated with detecting fraud in highly imbalanced Big Data, with a specific emphasis on credit card transactions and Medicare claims. The highly imbalanced nature of these datasets, where fraudulent instances constitute less than one percent of the data, poses significant challenges for traditional machine learning algorithms. This dissertation explores novel data reduction techniques tailored for fraud detection in highly imbalanced Big Data. The primary objectives include developing efficient data preprocessing and feature selection methods to reduce data dimensionality while preserving the most informative features, investigating various machine learning algorithms for their effectiveness in handling imbalanced data, and evaluating the proposed techniques on real-world credit card and Medicare fraud datasets. This dissertation covers a comprehensive examination of datasets, learners, experimental methodology, sampling techniques, feature selection techniques, and hybrid techniques. Key contributions include the analysis of performance metrics in the context of newly available Big Medicare Data, experiments using Big Medicare data, application of a novel ensemble supervised feature selection technique, and the combined application of data sampling and feature selection. The research demonstrates that, across both domains, the combined application of random undersampling and ensemble feature selection significantly improves classification performance.
-
Date Issued
-
2024
-
PURL
-
http://purl.flvc.org/fau/fd/FA00014424
-
Subject Headings
-
Fraud, Big data, Data reduction, Credit card fraud, Medicare fraud
-
Format
-
Document (PDF)
-
-
Title
-
Machine Learning Algorithms with Big Medicare Fraud Data.
-
Creator
-
Bauder, Richard Andrew, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
-
Abstract/Description
-
Healthcare is an integral component in people's lives, especially for the rising elderly population, and must be affordable. The United States Medicare program is vital in serving the needs of the elderly. The growing number of people enrolled in the Medicare program, along with the enormous volume of money involved, increases the appeal for, and risk of, fraudulent activities. For many real-world applications, including Medicare fraud, the interesting observations tend to be less frequent than the normative observations. This difference between the normal observations and those observations of interest can create highly imbalanced datasets. The problem of class imbalance, including the classification of rare cases indicating extreme class imbalance, is an important and well-studied area in machine learning. Research on the effects of class imbalance with big data in the real-world Medicare fraud application domain, however, is limited. In particular, detecting fraud in Medicare claims is critical in lessening the financial and personal impacts of these transgressions. Fortunately, the healthcare domain is one such area where the successful detection of fraud can garner meaningful positive results. The application of machine learning techniques, plus methods to mitigate the adverse effects of class imbalance and rarity, can be used to detect fraud and lessen the impacts for all Medicare beneficiaries. This dissertation presents the application of machine learning approaches to detect Medicare provider claims fraud in the United States. We discuss novel techniques to process three big Medicare datasets and create a new, combined dataset, which includes mapping fraud labels associated with known excluded providers. We investigate the ability of machine learning techniques, unsupervised and supervised, to detect Medicare claims fraud and leverage data sampling methods to lessen the impact of class imbalance and increase fraud detection performance.
Additionally, we extend the study of class imbalance to assess the impacts of rare cases in big data for Medicare fraud detection.
-
Date Issued
-
2018
-
PURL
-
http://purl.flvc.org/fau/fd/FA00013108
-
Subject Headings
-
Medicare fraud, Big data, Machine learning, Algorithms
-
Format
-
Document (PDF)
-
-
Title
-
A REVIEW AND ANALYSIS OF BOT-IOT SECURITY DATA FOR MACHINE LEARNING.
-
Creator
-
Peterson, Jared M., Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
-
Abstract/Description
-
Machine learning is having an increased impact on the Cyber Security landscape. The ability of predictive models to accurately identify attack patterns in security data is set to overtake more traditional detection methods. Industry demand has led to an uptick in research in the application of machine learning for Cyber Security. To facilitate this research, many datasets have been created and made public. This thesis provides an in-depth analysis of one of the newest datasets, Bot-IoT. The full dataset contains about 73 million instances (big data), 3 dependent features, and 43 independent features. The purpose of this thesis is to provide researchers with a foundational understanding of Bot-IoT, its development, its features, its composition, and its pitfalls. It will also summarize many of the published works that utilize Bot-IoT and will propose new areas of research based on the issues identified in the current research and in the dataset.
-
Date Issued
-
2021
-
PURL
-
http://purl.flvc.org/fau/fd/FA00013838
-
Subject Headings
-
Machine learning, Cyber security, Big data
-
Format
-
Document (PDF)
-
-
Title
-
ADDRESSING HIGHLY IMBALANCED BIG DATA CHALLENGES FOR MEDICARE FRAUD DETECTION.
-
Creator
-
Johnson, Justin M., Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
-
Abstract/Description
-
Access to affordable healthcare is a nationwide concern that impacts most of the United States population. Medicare is a federal government healthcare program that aims to provide affordable health insurance to the elderly population and individuals with select disabilities. Unfortunately, there is a significant amount of fraud, waste, and abuse within the Medicare system that inevitably raises premiums and costs taxpayers billions of dollars each year. Dedicated task forces investigate the most severe fraudulent cases, but with millions of healthcare providers and more than 60 million active Medicare beneficiaries, manual fraud detection efforts are not able to make a widespread, meaningful impact. Through the proliferation of electronic health records and continuous breakthroughs in data mining and machine learning, there is a great opportunity to develop and leverage advanced machine learning systems for automating healthcare fraud detection. This dissertation identifies key challenges associated with predictive modeling for large-scale Medicare fraud detection and presents innovative solutions to address these challenges in order to provide state-of-the-art results on multiple real-world Medicare fraud data sets. Our methodology for curating nine distinct Medicare fraud classification data sets is presented with comprehensive details describing data accumulation, data pre-processing, data aggregation techniques, data enrichment strategies, and improved fraud labeling. Data-level and algorithm-level methods for treating severe class imbalance, including a flexible output thresholding method and a cost-sensitive framework, are evaluated using deep neural network and ensemble learners. Novel encoding techniques and representation learning methods for high-dimensional categorical features are proposed to create expressive representations of provider attributes and billing procedure codes.
-
Date Issued
-
2022
-
PURL
-
http://purl.flvc.org/fau/fd/FA00014057
-
Subject Headings
-
Medicare fraud, Big data, Machine learning
-
Format
-
Document (PDF)
-
-
Title
-
SPATIAL NETWORK BIG DATABASE APPROACH TO RESOURCE ALLOCATION PROBLEMS.
-
Creator
-
Qutbuddin, Ahmad, Yang, KwangSoo, Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
-
Abstract/Description
-
Resource allocation for Spatial Network Big Database is challenging due to the large size of spatial networks, the variety of spatial data types, and the fast update rate of spatial and temporal elements. It is challenging to learn, manage and process the collected data and produce meaningful information in a limited time. Produced information must be concise and easy to understand. At the same time, the information must be very descriptive and useful. My research aims to address these challenges through the development of fundamental data processing components for advanced spatial network queries that clearly and briefly deliver critical information. This thesis proposal studied two challenging Spatial Network Big Database problems: (1) Multiple Resource Network Voronoi Diagram and (2) Node-attributed Spatial Graph Partitioning. To address the challenge of query processing for multiple resource allocation in preparing for or after a disaster, we investigated the problem of the Multiple Resource Network Voronoi Diagram (MRNVD). Given a spatial network and a set of service centers from k different resource types, an MRNVD partitions the spatial network into a set of Service Areas that can minimize the total cycle-distances of graph-nodes to allotted k service centers with different resource types. The MRNVD problem is important for critical societal applications such as assigning essential survival supplies (e.g., food, water, gas, and medical assistance) to residents impacted by man-made or natural disasters. The MRNVD problem is NP-hard; it is computationally challenging due to the large size of the transportation network. Previous work proposed the Distance bounded Pruning (DP) approach to produce an optimal solution for MRNVD. However, we found that DP can be generalized to reduce the computational cost for the minimum cycle-distance.
We extend our prior work and propose a novel approach that reduces the computational cost. Experiments using real-world datasets from five different regions demonstrate that the proposed approach creates MRNVD and significantly reduces the computational cost.
-
Date Issued
-
2021
-
PURL
-
http://purl.flvc.org/fau/fd/FA00013854
-
Subject Headings
-
Spatial data infrastructures, Big data--Data processing, Resource allocation, Voronoi polygons
-
Format
-
Document (PDF)
-
-
Title
-
An Evaluation of Deep Learning with Class Imbalanced Big Data.
-
Creator
-
Johnson, Justin Matthew, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
-
Abstract/Description
-
Effective classification with imbalanced data is an important area of research, as high class imbalance is naturally inherent in many real-world applications, e.g. anomaly detection. Modeling such skewed data distributions is often very difficult, and non-standard methods are sometimes required to combat these negative effects. These challenges have been studied thoroughly using traditional machine learning algorithms, but very little empirical work exists in the area of deep learning with class imbalanced big data. Following an in-depth survey of deep learning methods for addressing class imbalance, we evaluate various methods for addressing imbalance on the task of detecting Medicare fraud, a big data problem characterized by extreme class imbalance. Case studies herein demonstrate the impact of class imbalance on neural networks, evaluate the efficacy of data-level and algorithm-level methods, and achieve state-of-the-art results on the given Medicare data set. Results indicate that combining under-sampling and over-sampling maximizes both performance and efficiency.
-
Date Issued
-
2019
-
PURL
-
http://purl.flvc.org/fau/fd/FA00013221
-
Subject Headings
-
Deep learning, Big data, Medicare fraud--Prevention
-
Format
-
Document (PDF)
-
-
Title
-
SPATIAL NETWORK BIG DATA APPROACHES TO EMERGENCY MANAGEMENT INFORMATION SYSTEMS.
-
Creator
-
Herschelman, Roxana M., Yang, KwangSoo, Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
-
Abstract/Description
-
Emergency Management Information Systems (EMIS) are defined as a set of tools that aid decision-makers in risk assessment and response for significant multi-hazard threats and disasters. Over the past three decades, EMIS have grown in importance as a major component for understanding, managing, and governing transportation-related systems. To increase resilience against potential threats, the main goal of EMIS is the timely use of spatial and network datasets about (1) locations of hazard areas, (2) shelters and resources, and (3) how to respond to emergencies. The main concern about these datasets has always been the very large size, variety, and update rate required to ensure the timely delivery of useful emergency information and response for disastrous events. Another key issue is that the information should be concise and easy to understand, but at the same time very descriptive and useful in the case of emergency or disaster. Advancement in EMIS is urgently needed to develop fundamental data processing components for advanced spatial network queries that clearly and succinctly deliver critical information in emergencies. To address these challenges, we investigate Spatial Network Database Systems and study three challenging Transportation Resilience problems: producing large-scale evacuation plans, identifying major traffic patterns during emergency evacuations, and identifying the areas most in need of resources.
-
Date Issued
-
2020
-
PURL
-
http://purl.flvc.org/fau/fd/FA00013576
-
Subject Headings
-
Emergency management, Big data, Emergency management--Information technology
-
Format
-
Document (PDF)
-
-
Title
-
PREDICTING MELANOMA RISK FROM ELECTRONIC HEALTH RECORDS WITH MACHINE LEARNING TECHNIQUES.
-
Creator
-
Richter, Aaron N., Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
-
Abstract/Description
-
Melanoma is one of the fastest growing cancers in the world, and can affect patients earlier in life than most other cancers. Therefore, it is imperative to be able to identify patients at high risk for melanoma and enroll them in screening programs to detect the cancer early. Electronic health records collect an enormous amount of data about real-world patient encounters, treatments, and outcomes. This data can be mined to increase our understanding of melanoma as well as build personalized models to predict risk of developing the cancer. Cancer risk models built from structured clinical data are limited in current research, with most studies involving just a few variables from institutional databases or registries. This dissertation presents data processing and machine learning approaches to build melanoma risk models from a large database of de-identified electronic health records. The database contains consistently captured structured data, enabling the extraction of hundreds of thousands of data points each from millions of patient records. Several experiments are performed to build effective models, particularly to predict sentinel lymph node metastasis in known melanoma patients and to predict individual risk of developing melanoma. Data for these models suffer from high dimensionality and class imbalance. Thus, classifiers such as logistic regression, support vector machines, random forest, and XGBoost are combined with advanced modeling techniques such as feature selection and data sampling. Risk factors are evaluated using regression model weights and decision trees, while personalized predictions are provided through random forest decomposition and Shapley additive explanations. Random undersampling on the melanoma risk dataset shows that many majority samples can be removed without a decrease in model performance.
To determine how much data is truly needed, we explore learning curve approximation methods on the melanoma data and three publicly-available large-scale biomedical datasets. We apply an inverse power law model as well as introduce a novel semi-supervised curve creation method that utilizes a small amount of labeled data.
-
Date Issued
-
2019
-
PURL
-
http://purl.flvc.org/fau/fd/FA00013342
-
Subject Headings
-
Melanoma, Electronic Health Records, Machine learning--Technique, Big Data
-
Format
-
Document (PDF)
-
-
Title
-
REAL-TIME HIGHWAY TRAFFIC FLOW AND ACCIDENT SEVERITY PREDICTION IN VEHICULAR NETWORKS USING DISTRIBUTED MACHINE LEARNING AND BIG DATA ANALYSIS.
-
Creator
-
Alnami, Hani Mohammed, Mahgoub, Imadeldin, Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
-
Abstract/Description
-
In recent years, the state of Florida recorded thousands of abnormal traffic flows on highways that were caused by traffic incidents. Highway traffic congestion cost the US economy 101 billion dollars in 2020. Therefore, it is imperative to develop effective real-time traffic flow prediction schemes to mitigate the impact of traffic congestion. In this dissertation, we utilized real-life highway segment-based traffic and incident data obtained from the Florida Department of Transportation (FDOT) for real-time incident prediction. We used eight years of FDOT real-life traffic and incident data for the Florida I-95 highway to build prediction models for traffic accident severity. Accurate severity prediction is beneficial for responders since it allows the emergency center to dispatch the right number of vehicles without wasting additional resources.
-
Date Issued
-
2022
-
PURL
-
http://purl.flvc.org/fau/fd/FA00014089
-
Subject Headings
-
Traffic flow, Traffic accidents, Machine learning, Big data, Traffic estimation
-
Format
-
Document (PDF)
-
-
Title
-
ADVANCING ONE-CLASS CLASSIFICATION: A COMPREHENSIVE ANALYSIS FROM THEORY TO NOVEL APPLICATIONS.
-
Creator
-
Abdollah, Zadeh Azadeh, Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
-
Abstract/Description
-
This dissertation explores one-class classification (OCC) in the context of big data and fraud detection, addressing challenges posed by imbalanced datasets. A detailed survey of OCC-related literature forms a core part of the study, categorizing works into outlier detection, novelty detection, and deep learning applications. This survey reveals a gap in the application of OCC to the inherent problems of big data, such as class rarity and noisy data. Building upon the foundational insights gained from the comprehensive literature review on OCC, the dissertation progresses to a detailed comparative analysis between OCC and binary classification methods. This comparison is pivotal in understanding their respective strengths and limitations across various applications, emphasizing their roles in addressing imbalanced datasets. The research then specifically evaluates binary and OCC using credit card fraud data. This practical application highlights the nuances and effectiveness of these classification methods in real-world scenarios, offering insights into their performance in detecting fraudulent activities. After the evaluation of binary and OCC using credit card fraud data, the dissertation extends this inquiry with a detailed investigation into the effectiveness of both methodologies in fraud detection. This extended analysis involves utilizing not only the Credit Card Fraud Detection Dataset but also the Medicare Part D dataset. The findings show the comparative performance and suitability of these classification methods in practical fraud detection scenarios. Finally, the dissertation examines the impact of training OCC algorithms on majority versus minority classes, using the two previously mentioned datasets in addition to Medicare Part B and Durable Medical Equipment, Prosthetics, Orthotics and Supplies (DMEPOS) datasets.
This exploration offers critical insights into model training strategies and their implications, suggesting that training on the majority class can often lead to more robust classification results. In summary, this dissertation provides a deep understanding of OCC, effectively bridging theoretical concepts with novel applications in big data and fraud detection. It contributes to the field by offering a comprehensive analysis of OCC methodologies, their practical implications, and their effectiveness in addressing class imbalance in big data.
Show less
-
Date Issued
-
2024
-
PURL
-
http://purl.flvc.org/fau/fd/FA00014387
-
Subject Headings
-
Classification, Big data, Deep learning (Machine learning), Computer engineering
-
Format
-
Document (PDF)
-
-
Title
-
Big data and analytics: the future of music marketing.
-
Creator
-
Capodilupo, Daniella, Abrams, Ira, Florida Atlantic University, College of Business, Department of Management
-
Abstract/Description
-
This is a comprehensive study of how Big Data and analytics will be the future of music marketing. There has been a recent trend of being able to turn metrics into quantifiable, real-world predictions. With an increase in online music consumption along with the use of social media, there is now a clearer view than ever before about how this will happen. Instead of solely relying on big record companies for an artist to make it to the big time, there is now a plethora of data and analytics available not just to a small number of big companies, but to anyone.
-
Date Issued
-
2015
-
PURL
-
http://purl.flvc.org/fau/fd/FA00004353
-
Subject Headings
-
Big data -- Economic aspects, Consumer behavior, Internet marketing, Marketing -- Data processing, Music and the Internet, Musical analysis -- Data processing
-
Format
-
Document (PDF)
-
-
Title
-
Multimedia Big Data Processing Using HPCC Systems.
-
Creator
-
Chinta, Vishnu, Kalva, Hari, Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
-
Abstract/Description
-
There is now more data being created than ever before, and this data can take any form: textual, multimedia, spatial, etc. To process this data, several big data processing platforms have been developed, including Hadoop, based on the MapReduce model, and LexisNexis’ HPCC Systems. In this thesis we evaluate the HPCC Systems framework with a special interest in multimedia data analysis and propose a framework for multimedia data processing. It is important to note that multimedia data encompasses a wide variety of data including but not limited to image data, video data, audio data and even textual data. While developing a unified framework for such a wide variety of data, we have to consider computational complexity in dealing with the data. Preliminary results show that HPCC can potentially reduce the computational complexity significantly.
-
Date Issued
-
2017
-
PURL
-
http://purl.flvc.org/fau/fd/FA00004875
-
Subject Headings
-
Big data, High performance computing, Software engineering, Artificial intelligence--Data processing, Management information systems, Multimedia systems
-
Format
-
Document (PDF)
-
-
Title
-
HPCC based Platform for COPD Readmission Risk Analysis with implementation of Dimensionality reduction and balancing techniques.
-
Creator
-
Jain, Piyush, Agarwal, Ankur, Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
-
Abstract/Description
-
Hospital readmission rates are considered to be an important indicator of quality of care because they may be a consequence of actions of commission or omission made during the initial hospitalization of the patient, or a consequence of a poorly managed transition of the patient back into the community. The negative impact on patient quality of life and the huge burden on the healthcare system have made reducing hospital readmissions a central goal of healthcare delivery and payment reform efforts. In this study, we propose a framework for how readmission analysis and other healthcare models could be deployed in the real world, along with a machine learning based solution that uses patients' discharge summaries as a dataset to train and test the machine learning model. Current systems do not take into consideration one very important aspect of solving the readmission problem: Big data. This study addresses the Big data aspect of solutions that can be deployed in the field for real-world use. We have used the HPCC compute platform, which provides a distributed parallel programming platform to create, run and manage applications that involve large amounts of data. We have also proposed some feature engineering and data balancing techniques which have been shown to greatly enhance machine learning model performance. This was achieved by reducing the dimensionality of the data and fixing the imbalance in the dataset. The system presented in this study provides real-world, machine learning based predictive modeling for reducing readmissions, which could be templatized for other diseases.
-
Date Issued
-
2020
-
PURL
-
http://purl.flvc.org/fau/fd/FA00013560
-
Subject Headings
-
Machine learning, Big data, Patient Readmission, Hospitals--Admission and discharge--Data processing, High performance computing
-
Format
-
Document (PDF)
-
-
Title
-
DATA COLLECTION FRAMEWORK AND MACHINE LEARNING ALGORITHMS FOR THE ANALYSIS OF CYBER SECURITY ATTACKS.
-
Creator
-
Calvert, Chad, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
-
Abstract/Description
-
The integrity of network communications is constantly being challenged by more sophisticated intrusion techniques. Attackers are shifting to stealthier and more complex forms of attacks in an attempt to bypass known mitigation strategies. Also, many detection methods for popular network attacks have been developed using outdated or non-representative attack data. To effectively develop modern detection methodologies, there exists a need to acquire data that can fully encompass the behaviors of persistent and emerging threats. When collecting modern day network traffic for intrusion detection, substantial amounts of traffic can be collected, much of which consists of relatively few attack instances as compared to normal traffic. This skewed distribution between normal and attack data can lead to high levels of class imbalance. Machine learning techniques can be used to aid in attack detection, but large levels of imbalance between normal (majority) and attack (minority) instances can lead to inaccurate detection results.
-
Date Issued
-
2019
-
PURL
-
http://purl.flvc.org/fau/fd/FA00013289
-
Subject Headings
-
Machine learning, Algorithms, Anomaly detection (Computer security), Intrusion detection systems (Computer security), Big data
-
Format
-
Document (PDF)
-
-
Title
-
Real-time traffic incidents prediction in vehicular networks using big data analytics.
-
Creator
-
Al-Najada, Hamzah, Mahgoub, Imad, Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
-
Abstract/Description
-
The United States has been going through a road accident crisis for many years. The National Safety Council estimates 40,000 people were killed and 4.57 million injured on U.S. roads in 2017. Direct and indirect loss from traffic congestion alone is more than $140 billion every year. Vehicular Ad-hoc Networks (VANETs) are envisioned as the future of Intelligent Transportation Systems (ITSs). They have a great potential to enable all kinds of applications that will enhance road safety and transportation efficiency. In this dissertation, we have aggregated seven years of real-life traffic and incidents data, obtained from the Florida Department of Transportation District 4. We have studied and investigated the causes of road incidents by applying machine learning approaches to this aggregated big dataset. A scalable, reliable, and automatic system for predicting road incidents is an integral part of any effective ITS. For this purpose, we propose a cloud-based system for VANET that aims at preventing or at least decreasing traffic congestion as well as crashes in real-time. We have created, tested, and validated a VANET traffic dataset by applying the connected vehicle behavioral changes to our aggregated dataset. To achieve scalability, speed, and fault-tolerance in our developed system, we built our system in a lambda architecture fashion using Apache Spark and Spark Streaming with Kafka. We used our system in creating optimal and safe trajectories for autonomous vehicles based on the user preferences. We extended the use of our developed system to predicting the clearance time on the highway in real-time, as an important component of the traffic incident management system. We implemented time series analysis and forecasting in our real-time system as a component for predicting traffic flow. Our system can be applied to use dedicated short-range communications (DSRC), cellular, or hybrid communication schema to receive streaming data and send back the safety messages. The performance of the proposed system has been extensively tested on FAU's High Performance Computing Cluster (HPCC), as well as on a single node virtual machine. Results and findings confirm the applicability of the proposed system in predicting traffic incidents with low processing latency.
-
Date Issued
-
2018
-
PURL
-
http://purl.flvc.org/fau/fd/FA00013114
-
Subject Headings
-
Vehicular ad hoc networks (Computer networks), Big data, Intelligent transportation systems, Prediction, traffic incidents
-
Format
-
Document (PDF)
-
-
Title
-
MACHINE LEARNING ALGORITHMS FOR PREDICTING BOTNET ATTACKS IN IOT NETWORKS.
-
Creator
-
Leevy, Joffrey, Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
-
Abstract/Description
-
The proliferation of Internet of Things (IoT) devices in various networks is being matched by an increase in related cybersecurity risks. To help counter these risks, big datasets such as Bot-IoT were designed to train machine learning algorithms on network-based intrusion detection for IoT devices. From a binary classification perspective, there is a high class imbalance in Bot-IoT between each of the attack categories and the normal category, and also between the combined attack categories and the normal category. Within the scope of predicting botnet attacks in IoT networks, this dissertation demonstrates the usefulness and efficiency of novel machine learning methods, such as an easy-to-classify method and a unique set of ensemble feature selection techniques. The focus of this work is on the full Bot-IoT dataset, as well as each of the four attack categories of Bot-IoT, namely, Denial-of-Service (DoS), Distributed Denial-of-Service (DDoS), Reconnaissance, and Information Theft. Since resources and services become inaccessible during DoS and DDoS attacks, this interruption is costly to an organization in terms of both time and money. Reconnaissance attacks often signify the first stage of a cyberattack, and preventing them from occurring usually means the end of the intended cyberattack. Information Theft attacks not only erode consumer confidence but may also compromise intellectual property and national security. For the DoS experiment, the ensemble feature selection approach led to the best performance, while for the DDoS experiment, the full set of Bot-IoT features resulted in the best performance. Regarding the Reconnaissance experiment, the ensemble feature selection approach yielded the best performance. In relation to the Information Theft experiment, the ensemble feature selection techniques did not affect performance, positively or negatively. However, the ensemble feature selection approach is recommended for this experiment because feature reduction eases computational burden and may provide clarity through improved data visualization. For the full Bot-IoT big dataset, an explainable machine learning approach was taken using the Decision Tree classifier. An easy-to-learn Decision Tree model for predicting attacks was obtained with only three features, which is a significant result for big data.
-
Date Issued
-
2022
-
PURL
-
http://purl.flvc.org/fau/fd/FA00013933
-
Subject Headings
-
Machine learning, Internet of things--Security measures, Big data, Intrusion detection systems (Computer security)
-
Format
-
Document (PDF)