Current Search: Data mining.
- Title
- A review of the stability of feature selection techniques for bioinformatics data.
- Creator
- Awada, Wael, Khoshgoftaar, Taghi M., Dittman, David, Wald, Randall, Napolitano, Amri E., Graduate College
- Date Issued
- 2013-04-12
- PURL
- http://purl.flvc.org/fcla/dt/3361293
- Subject Headings
- Bioinformatics, DNA microarrays, Data mining
- Format
- Document (PDF)
- Title
- Ensemble Learning Algorithms for the Analysis of Bioinformatics Data.
- Creator
- Fazelpour, Alireza, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Developments in advanced technologies, such as DNA microarrays, have generated tremendous amounts of data available to researchers in the field of bioinformatics. These state-of-the-art technologies present not only unprecedented opportunities to study biological phenomena of interest, but also significant challenges in terms of processing the data. Furthermore, these datasets inherently exhibit a number of challenging characteristics, such as class imbalance, high dimensionality, small dataset size, noisy data, and complexity of data in terms of hard-to-distinguish decision boundaries between classes within the data. In recognition of the aforementioned challenges, this dissertation utilizes a variety of machine-learning and data-mining techniques, such as ensemble classification algorithms in conjunction with data sampling and feature selection techniques, to alleviate these problems while improving the classification results of models built on these datasets. However, in building classification models, researchers and practitioners encounter the challenge that there is not a single classifier that performs relatively well in all cases. Thus, numerous classification approaches, such as ensemble learning methods, have been developed to address this problem successfully in a majority of circumstances. Ensemble learning is a promising technique that generates multiple classification models and then combines their decisions into a single final result. Ensemble learning often performs better than single base classifiers in performing classification tasks. This dissertation conducts thorough empirical research by implementing a series of case studies to evaluate how ensemble learning techniques can be utilized to enhance overall classification performance, as well as improve the generalization ability of ensemble models.
This dissertation investigates ensemble learning techniques based on the boosting, bagging, and random forest algorithms, and proposes a number of modifications to the existing ensemble techniques to further improve classification results. It examines the effectiveness of ensemble learning techniques in accounting for the challenging characteristics of class imbalance and difficult-to-learn class decision boundaries. Next, it looks into ensemble methods that are relatively tolerant to class noise, and that not only account for the problem of class noise but also improve classification performance. This dissertation also examines the joint effects of data sampling and ensemble techniques, to determine whether sampling can further improve the classification performance of the built ensemble models.
- Date Issued
- 2016
- PURL
- http://purl.flvc.org/fau/fd/FA00004588
- Subject Headings
- Bioinformatics., Data mining -- Technological innovations., Machine learning.
- Format
- Document (PDF)
- Title
- Data warehousing and mining: Customer churn analysis in the wireless industry.
- Creator
- Nath, Shyam Varan., Florida Atlantic University, Behara, Ravi
- Abstract/Description
-
This study looks at the database techniques of data warehousing and data mining to analyze the business problems related to customer churn in the wireless industry. Customer churn due to new industry regulations has hit the wireless industry hard. The study uses data warehousing and data mining to model the customer database to predict churn rates and suggest timely recommendations to increase customer retention and thereby increase overall profitability. The Naive Bayes algorithm for supervised learning was the prediction algorithm used for data modeling in the study. The data set used in the study consists of one hundred thousand real wireless customers. The study uses database tools such as Oracle Database with data mining options and JDeveloper for implementing the models. The data model developed with the calibration data was used to predict the churn for the wireless customers, along with the predictive accuracy and probabilities of the results.
- Date Issued
- 2003
- PURL
- http://purl.flvc.org/fcla/dt/12992
- Subject Headings
- Data warehousing, Data mining, Telecommunication--Customer services
- Format
- Document (PDF)
- Title
- Machine Learning Methods to Understand Textual Data.
- Creator
- Sohangir, Sahar, Wang, Dingding, Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
The amount of textual data produced every minute on the internet is extremely high. Processing this tremendous volume of mostly unstructured data is not a straightforward task. But the enormous amount of useful information contained in it motivates scientists to investigate efficient and effective techniques and algorithms to discover meaningful patterns. Social network applications, such as chat, comments, and discussion boards, provide opportunities for people around the world to be in contact and share their valuable knowledge. People usually do not care about spelling and accurate grammatical construction of a sentence in everyday conversations. Therefore, extracting information from such datasets is more complicated. Text mining can be a solution to this problem. Text mining is a knowledge discovery process used to extract patterns from natural language. Application of text mining techniques to social networking websites can reveal a significant amount of information. Text mining in conjunction with social networks can be used for finding a general opinion about any specific subject, human thinking patterns, and group identification. In this study, we investigate machine learning methods for textual data in six chapters.
- Date Issued
- 2018
- PURL
- http://purl.flvc.org/fau/fd/FA00013107
- Subject Headings
- Machine learning, Internet--Data processing, Text Mining
- Format
- Document (PDF)
- Title
- Alleviating class imbalance using data sampling: Examining the effects on classification algorithms.
- Creator
- Napolitano, Amri E., Florida Atlantic University, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Imbalanced class distributions typically cause poor classifier performance on the minority class, which also tends to be the class with the highest cost of misclassification. Data sampling is a common solution to this problem, and numerous sampling techniques have been proposed to address it. Prior research examining the performance of these techniques has been narrow and limited. This work uses thorough empirical experimentation to compare the performance of seven existing data sampling techniques using five different classifiers and four different datasets. The work addresses which sampling techniques produce the best performance in the presence of class imbalance, which classifiers are most robust to the problem, and which sampling techniques perform better or worse with each classifier. Extensive statistical analysis of these results is provided, in addition to an examination of the qualitative effects of the sampling techniques on the types of predictions made by the C4.5 classifier.
- Date Issued
- 2006
- PURL
- http://purl.flvc.org/fcla/dt/13413
- Subject Headings
- Combinatorial group theory, Data mining, Decision trees, Machine learning
- Format
- Document (PDF)
- Title
- Binary representation of DNA sequences towards developing useful algorithms in bioinformatic data-mining.
- Creator
- Pandya, Shivani., Florida Atlantic University, Neelakanta, Perambur S., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
This thesis describes research on the use of a binary representation of DNA for the purpose of developing useful algorithms for bioinformatics. Pertinent studies address the use of a binary form of the DNA base chemicals in an information-theoretic framework so as to identify symmetry between DNA and complementary DNA (cDNA). This study also addresses "fuzzy" (codon-noncodon) considerations in delineating codon and noncodon regimes in DNA sequences. The research further includes a comparative analysis of the test results of these efforts using different statistical metrics, such as the Hamming distance and the Kullback-Leibler measure. The observed details support the symmetry between DNA and cDNA strands. The work also demonstrates the capability of identifying noncodon regions in DNA even under diffused (overlapped) fuzzy states.
- Date Issued
- 2003
- PURL
- http://purl.flvc.org/fcla/dt/13089
- Subject Headings
- Bioinformatics, Data mining, Nucleotide sequence--Databases, Computer algorithms
- Format
- Document (PDF)
- Title
- A comparative study of classification algorithms for network intrusion detection.
- Creator
- Wang, Yunling., Florida Atlantic University, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
As network-based computer systems play increasingly vital roles in modern society, they have become the targets of criminals. Network security has never been more important a subject than in today's extensively interconnected computer world. Intrusion Detection Systems (IDS) have been used along with data mining techniques to detect intrusions. In this thesis, we present a comparative study of intrusion detection using a decision-tree learner (C4.5), two rule-based learners (RIPPER and Ridor), a learner that combines decision trees and rules (PART), and two instance-based learners (IBk and NNge). We investigate and compare the performance of IDSs based on the six techniques, with respect to a case study of the DARPA KDD-1999 network intrusion detection project. Investigation results demonstrated that data mining techniques are very useful in the area of intrusion detection.
- Date Issued
- 2004
- PURL
- http://purl.flvc.org/fcla/dt/13102
- Subject Headings
- Computer networks--Security measures, Data mining, Decision trees
- Format
- Document (PDF)
- Title
- Studies on information-theoretics based data-sequence pattern-discriminant algorithms: Applications in bioinformatic data mining.
- Creator
- Arredondo, Tomas Vidal., Florida Atlantic University, Neelakanta, Perambur S., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
This research comprises studies on information-theoretic (IT) aspects of data-sequence patterns and the development of discriminant algorithms that distinguish the features of underlying sequence patterns having characteristic, inherent stochastic attributes. The potential applications of such algorithms include bioinformatic data mining efforts. Consistent with this scope, the research considers specific details of information theory and entropy vis-a-vis sequence patterns with stochastic attributes, such as the DNA sequences of molecular biology. Applying information-theoretic concepts (essentially in Shannon's sense), the following distinct sets of metrics are developed and applied in the algorithms developed for data-sequence pattern-discrimination applications: (i) divergence or cross-entropy algorithms of the Kullback-Leibler type and of the general Csiszár class; (ii) statistical distance measures; (iii) ratio-metrics; (iv) a Fisher-type linear-discriminant measure; and (v) a complexity metric based on information redundancy. These measures are judiciously adopted in ascertaining codon-noncodon delineations in DNA sequences that consist of crisp and/or fuzzy nucleotide domains across their chains. The Fisher measure is also used in codon-noncodon delineation and in motif detection. Relevant algorithms are used to test DNA sequences of human and some bacterial organisms. The relative efficacy of the metrics and the algorithms is determined and discussed. The potential of such algorithms to supplement the prevailing methods is indicated. Scope for future studies is identified in terms of persisting open questions.
- Date Issued
- 2003
- PURL
- http://purl.flvc.org/fau/fd/FADT12057
- Subject Headings
- Data mining, Bioinformatics, Discriminant analysis, Information theory in biology
- Format
- Document (PDF)
- Title
- A COMPARATIVE STUDY OF STRUCTURED VERSUS UNSTRUCTURED TEXT DATA.
- Creator
- Cardenas, Erika, Khoshgoftaar, Taghi M., Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
-
In today’s world, data is generated at an unprecedented rate, and a significant portion of it is unstructured text data. The recent advancements in Natural Language Processing have enabled computers to understand and interpret human language. Data mining techniques were once unable to use text data due to the high dimensionality of text processing models. This limitation was overcome with the ability to represent data as text. This thesis aims to compare the predictive performance of structured versus unstructured text data in two different applications. The first application is in the field of real estate. We compare the performance of tabular real-estate data and unstructured text descriptions of homes to predict the house price. The second application is in translating Electronic Health Records (EHR) tabular data to text data for survival classification of COVID-19 patients. Lastly, we present a range of strategies and perspectives for future research.
- Date Issued
- 2023
- PURL
- http://purl.flvc.org/fau/fd/FA00014220
- Subject Headings
- Natural language processing (Computer science), Text data mining
- Format
- Document (PDF)
- Title
- Real-Time Data Analytics and Optimization for Computational Advertising.
- Creator
- Liu, Hui, Zhu, Xingquan, Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Online advertising has built a market of hundreds of billions of dollars and still continues to grow. With well-developed techniques in big data storage, data mining and analytics, online advertising is able to reach targeted audiences effectively. Real-time bidding refers to the buying and selling of online ad impressions through ad inventory auctions which occur in the time it takes a webpage to load. How to determine the bidding price and how to allocate the budget of advertising is the key to successful ad campaigns. Both of these aspects are fundamental to most campaign optimizations and we will introduce both of them in this thesis. For bidding price determination, we improved the estimation of CTR (Click Through Rate) (one of the most important factors of determining the bidding price) by using a refined hierarchical tree structure for the estimation. The result of the experiment and the A/B test showed our proposal can provide stable improvement. For budget allocation, we introduce SCO (Single Campaign Optimization) and CCO (Cross Campaign Optimization). SCO has been applied by our commercial partner while CCO needs more research. We will first introduce the methods of SCO and then give our proposal about CCO. We modeled CCO as a LP (Linear Programming) problem as well as designed an effective procedure to implement optimal impressions distribution. Our simulation showed our proposal can significantly increase global Gross Profit (GP).
- Date Issued
- 2017
- PURL
- http://purl.flvc.org/fau/fd/FA00004940
- Subject Headings
- Internet marketing--Technological innovations., Internet advertising--Technological innovations., Data mining., Web usage mining., Business--Data processing.
- Format
- Document (PDF)
- Title
- Mining and fusing data for ocean turbine condition monitoring.
- Creator
- Duhaney, Janell A., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
An ocean turbine extracts the kinetic energy from ocean currents to generate electricity. Machine Condition Monitoring (MCM) / Prognostic Health Monitoring (PHM) systems allow for self-checking and automated fault detection, and are integral in the construction of a highly reliable ocean turbine. MCM/PHM systems enable real-time health assessment, prognostics and advisory generation by interpreting data from sensors installed on the machine being monitored. To effectively utilize sensor readings for determining the health of individual components, macro-components and the overall system, these measurements must somehow be combined or integrated to form a holistic picture. The process used to perform this combination is called data fusion. Data mining and machine learning techniques allow for the analysis of these sensor signals, any maintenance history and other available information (like expert knowledge) to automate decision making and other such processes within MCM/PHM systems. ... This dissertation proposes an MCM/PHM software architecture employing those techniques which were determined from the experiments to be ideal for this application. Our work also offers a data fusion framework applicable to ocean machinery MCM/PHM. Finally, it presents a software tool for monitoring ocean turbines and other submerged vessels, implemented according to industry standards.
- Date Issued
- 2012
- PURL
- http://purl.flvc.org/FAU/3358556
- Subject Headings
- Marine turbines, Mathematical models, Fluid dynamics, Data mining, Machine learning, Multisensor data fusion
- Format
- Document (PDF)
- Title
- Data Quality in Data Mining and Machine Learning.
- Creator
- Van Hulse, Jason, Khoshgoftaar, Taghi M., Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
With advances in data storage and data transmission technologies, and given the increasing use of computers by both individuals and corporations, organizations are accumulating an ever-increasing amount of information in data warehouses and databases. The huge surge in data, however, has made the process of extracting useful, actionable, and interesting knowledge from the data extremely difficult. In response to the challenges posed by operating in a data-intensive environment, the fields of data mining and machine learning (DM/ML) have successfully provided solutions to help uncover knowledge buried within data. DM/ML techniques use automated (or semi-automated) procedures to process vast quantities of data in search of interesting patterns. DM/ML techniques do not create knowledge; instead, the implicit assumption is that knowledge is present within the data, and these procedures are needed to uncover interesting, important, and previously unknown relationships. Therefore, the quality of the data is absolutely critical in ensuring successful analysis. Having high quality data, i.e., data which is (relatively) free from errors and suitable for use in data mining tasks, is a necessary precondition for extracting useful knowledge. In response to the important role played by data quality, this dissertation investigates data quality and its impact on DM/ML. First, we propose several innovative procedures for coping with low quality data. Another aspect of data quality, the occurrence of missing values, is also explored. Finally, a detailed experimental evaluation on learning from noisy and imbalanced datasets is provided, supplying valuable insight into how class noise in skewed datasets affects learning algorithms.
- Date Issued
- 2007
- PURL
- http://purl.flvc.org/fau/fd/FA00000858
- Subject Headings
- Data mining--Quality control, Machine learning, Electronic data processing--Quality control
- Format
- Document (PDF)
- Title
- Evolutionary Methods for Mining Data with Class Imbalance.
- Creator
- Drown, Dennis J., Khoshgoftaar, Taghi M., Florida Atlantic University
- Abstract/Description
-
Class imbalance tends to cause inferior performance in data mining learners, particularly with regard to predicting the minority class, which generally imposes a higher misclassification cost. This work explores the benefits of using genetic algorithms (GA) to develop classification models which are better able to deal with the problems encountered when mining datasets which suffer from class imbalance. Using GA we evolve configuration parameters suited for skewed datasets for three different learners: artificial neural networks, C4.5 decision trees, and RIPPER. We also propose a novel technique called evolutionary sampling which works to remove noisy and unnecessary duplicate instances so that the sampled training data will produce a superior classifier for the imbalanced dataset. Our GA fitness function uses metrics appropriate for dealing with class imbalance, in particular the area under the ROC curve. We perform extensive empirical testing on these techniques and compare the results with seven existing sampling methods.
- Date Issued
- 2007
- PURL
- http://purl.flvc.org/fau/fd/FA00012515
- Subject Headings
- Combinatorial group theory, Data mining, Machine learning, Data structure (Computer science)
- Format
- Document (PDF)
- Title
- TACKLING BIAS, PRIVACY, AND SCARCITY CHALLENGES IN HEALTH DATA ANALYTICS.
- Creator
- Wang, Shuwen, Zhu, Xingquan, Florida Atlantic University, Department of Computer and Electrical Engineering and Computer Science, College of Engineering and Computer Science
- Abstract/Description
-
Health data analysis has emerged as a critical domain with immense potential to revolutionize healthcare delivery, disease management, and medical research. However, it is confronted by formidable challenges, including sample bias, data privacy concerns, and the cost and scarcity of labeled data. These challenges collectively impede the development of accurate and robust machine learning models for various healthcare applications, from disease diagnosis to treatment recommendations. Sample bias and specificity refer to the inherent challenges in working with health datasets that may not be representative of the broader population or may exhibit disparities in their distributions. These biases can significantly impact the generalizability and effectiveness of machine learning models in healthcare, potentially leading to suboptimal outcomes for certain patient groups. Data privacy and locality are paramount concerns in the era of digital health records and wearable devices. The need to protect sensitive patient information while still extracting valuable insights from these data sources poses a delicate balancing act. Moreover, the geographic and jurisdictional differences in data regulations further complicate the use of health data in a global context. Label cost and scarcity pertain to the often labor-intensive and expensive process of obtaining ground-truth labels for supervised learning tasks in healthcare. The limited availability of labeled data can hinder the development and deployment of machine learning models, particularly in specialized medical domains.
- Date Issued
- 2023
- PURL
- http://purl.flvc.org/fau/fd/FA00014336
- Subject Headings
- Data analytics, Data mining, Ensemble learning (Machine learning), Machine learning, Health
- Format
- Document (PDF)
- Title
- Analyzing software repository data to synthesize and visualize relationships between development artifacts.
- Creator
- Mulcahy, James J., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
As computing technology continues to advance, it has become increasingly difficult to find businesses that do not rely, at least in part, upon the collection and analysis of data for the purpose of project management and process improvement. The cost of software tends to increase over time due to its complexity and the cost of employing humans to develop, maintain, and evolve it. To help control the costs, organizations often seek to improve the process by which software systems are developed and evolved. Improvements can be realized by discovering previously unknown or hidden relationships between the artifacts generated as a result of developing a software system. The objective of the work described in this thesis is to provide a visualization tool that helps managers and engineers better plan for future projects by discovering new knowledge gained by synthesizing and visualizing data mined from software repository records from previous projects.
Show less - Date Issued
- 2011
- PURL
- http://purl.flvc.org/FAU/3333053
- Subject Headings
- Data mining, Mathematical models, Software engineering, Information visualization, Data processing, Application software, Development, Object-oriented programming (Computer science)
- Format
- Document (PDF)
- Title
- An Empirical Study of Random Forests for Mining Imbalanced Data.
- Creator
- Golawala, Moiz M., Khoshgoftaar, Taghi M., Florida Atlantic University
- Abstract/Description
-
Skewed or imbalanced data presents a significant problem for many standard learners which focus on optimizing the overall classification accuracy. When the class distribution is skewed, priority is given to classifying examples from the majority class, at the expense of the often more important minority class. The random forest (RF) classification algorithm, which is a relatively new learner with appealing theoretical properties, has received almost no attention in the context of skewed datasets. This work presents a comprehensive suite of experimentation evaluating the effectiveness of random forests for learning from imbalanced data. Reasonable parameter settings (for the Weka implementation) for ensemble size and number of random features selected are determined through experimentation on 10 datasets. Further, the application of seven different data sampling techniques that are common methods for handling imbalanced data, in conjunction with RF, is also assessed. Finally, RF is benchmarked against 10 other commonly-used machine learning algorithms, and is shown to provide very strong performance. A total of 35 imbalanced datasets are used, and over one million classifiers are constructed in this work.
- Date Issued
- 2007
- PURL
- http://purl.flvc.org/fau/fd/FA00012520
- Subject Headings
- Data mining--Case studies, Machine learning--Case studies, Data structure (Computer science), Trees (Graph theory)--Case studies
- Format
- Document (PDF)
- Title
- Intrusion detection in wireless networks: A data mining approach.
- Creator
- Nath, Shyam Varan., Florida Atlantic University, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
The security of wireless networks has gained considerable importance due to the rapid proliferation of wireless communications. While computer network heuristics and rules are being used to control and monitor the security of Wireless Local Area Networks (WLANs), mining and learning behaviors of network users can provide a deeper level of security analysis. The objective and contribution of this thesis is threefold: exploring the security vulnerabilities of the IEEE 802.11 standard for wireless networks; extracting features or metrics, from a security point of view, for modeling network traffic in a WLAN; and proposing a data mining-based approach to intrusion detection in WLANs. A clustering- and expert-based approach to intrusion detection in a wireless network is presented in this thesis. The case study data is obtained from a real-world WLAN and contains over one million records. Given the clusters of network traffic records, a distance-based heuristic measure is proposed for labeling clusters as either normal or intrusive. The empirical results demonstrate the promise of the proposed approach, laying the groundwork for a clustering-based framework for intrusion detection in computer networks.
- Date Issued
- 2005
- PURL
- http://purl.flvc.org/fcla/dt/13246
- Subject Headings
- Wireless communication systems, Data warehousing, Data mining, Telecommunication--Security measures, Computer networks--Security measures, Computer security
- Format
- Document (PDF)
- Title
- Asset identification using image descriptors.
- Creator
- Friedel, Reena Ursula., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Asset management is a time consuming and error prone process. Information Technology (IT) personnel typically perform this task manually by visually inspecting assets to identify misplaced assets. If this process is automated and provided to IT personnel it would prove very useful in keeping track of assets in a server rack. A mobile based solution is proposed to automate this process. The asset management application on the tablet captures images of assets and searches an annotated database to identify the asset. We evaluate the matching performance and speed of asset matching using three different image feature descriptors. Methods to reduce feature extraction and matching complexity were developed. Performance and accuracy tradeoffs were studied, domain specific problems were identified, and optimizations for mobile platforms were made. The results show that the proposed methods reduce complexity of asset matching by 67% when compared to the matching process using unmodified image feature descriptors.
- Date Issued
- 2012
- PURL
- http://purl.flvc.org/FAU/3342051
- Subject Headings
- Data mining, Technological innovations, Mobile computing, User-centered system design, Application software, Development
- Format
- Document (PDF)
- Title
- Collaborative filtering using machine learning and statistical techniques.
- Creator
- Su, Xiaoyuan., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Collaborative filtering (CF), a very successful recommender system, is one of the applications of data mining for incomplete data. The main objective of CF is to make accurate recommendations from highly sparse user rating data. My contributions to this research topic include proposing the frameworks of imputation-boosted collaborative filtering (IBCF) and imputed neighborhood based collaborative filtering (INCF). We also proposed a model-based CF technique, TAN-ELR CF, and two hybrid CF algorithms, sequential mixture CF and joint mixture CF. Empirical results show that our proposed CF algorithms have very good predictive performances. In the investigation of applying imputation techniques in mining incomplete data, we proposed imputation-helped classifiers, and VCI predictors (voting on classifications from imputed learning sets), both of which resulted in significant improvement in classification performance for incomplete data over conventional machine learned classifiers, including kNN, neural network, one rule, decision table, SVM, logistic regression, decision tree (C4.5), random forest, and decision list (PART), and the well known Bagging predictors. The main imputation techniques involved in these algorithms include EM (expectation maximization) and BMI (Bayesian multiple imputation).
- Date Issued
- 2008
- PURL
- http://purl.flvc.org/FAU/186301
- Subject Headings
- Filters (Mathematics), Machine learning, Data mining, Technological innovations, Database management, Combinatorial group theory
- Format
- Document (PDF)
- Title
- Classification techniques for noisy and imbalanced data.
- Creator
- Napolitano, Amri E., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Machine learning techniques allow useful insight to be distilled from the increasingly massive repositories of data being stored. As these data mining techniques can only learn patterns actually present in the data, it is important that the desired knowledge be faithfully and discernibly contained therein. Two common data quality issues that often affect important real life classification applications are class noise and class imbalance. Class noise, where dependent attribute values are recorded erroneously, misleads a classifier and reduces predictive performance. Class imbalance occurs when one class represents only a small portion of the examples in a dataset, and, in such cases, classifiers often display poor accuracy on the minority class. The reduction in classification performance becomes even worse when the two issues occur simultaneously. To address the magnified difficulty caused by this interaction, this dissertation performs thorough empirical investigations of several techniques for dealing with class noise and imbalanced data. Comprehensive experiments are performed to assess the effects of the classification techniques on classifier performance, as well as how the level of class imbalance, level of class noise, and distribution of class noise among the classes affect results. An empirical analysis of classifier based noise detection efficiency appears first. Subsequently, an intelligent data sampling technique, based on noise detection, is proposed and tested. Several hybrid classifier ensemble techniques for addressing class noise and imbalance are introduced. Finally, a detailed empirical investigation of classification filtering is performed to determine best practices.
- Date Issued
- 2009
- PURL
- http://purl.flvc.org/FAU/369201
- Subject Headings
- Combinatorial group theory, Data mining, Technological innovations, Decision trees, Machine learning, Filters (Mathematics)
- Format
- Document (PDF)