- Title
- Comparison of Data Sampling Approaches for Imbalanced Bioinformatics Data.
- Creator
- Dittman, David, Wald, Randall, Napolitano, Amri E., Graduate College, Khoshgoftaar, Taghi M.
- Abstract/Description
-
Class imbalance is a frequent problem found in bioinformatics datasets. Unfortunately, the minority class is usually also the class of interest. One of the methods to improve this situation is data sampling. There are a number of different data sampling methods, each with its own strengths and weaknesses, which makes choosing one difficult. In our work we compare three data sampling techniques (Random Undersampling, Random Oversampling, and SMOTE) on six bioinformatics datasets with varying levels of class imbalance. Additionally, we apply two different classifiers to the problem (5-NN and SVM) and use feature selection to reduce our datasets to 25 features prior to applying sampling. Our results show that there is very little difference between the data sampling techniques, although Random Undersampling is the most frequent top-performing data sampling technique for both of our classifiers. We also performed statistical analysis, which confirms that there is no statistically significant difference between the techniques. Therefore, our recommendation is to use Random Undersampling when choosing a data sampling technique, because it is less computationally expensive to implement than SMOTE and it also reduces the size of the dataset, which will reduce subsequent computational costs without sacrificing classification performance.
- Date Issued
- 2014
- PURL
- http://purl.flvc.org/fau/fd/FA00005811
- Format
- Document (PDF)
- Title
- A review of the stability of feature selection techniques for bioinformatics data.
- Creator
- Awada, Wael, Khoshgoftaar, Taghi M., Dittman, David, Wald, Randall, Napolitano, Amri E., Graduate College
- Date Issued
- 2013-04-12
- PURL
- http://purl.flvc.org/fcla/dt/3361293
- Subject Headings
- Bioinformatics, DNA microarrays, Data mining
- Format
- Document (PDF)
- Title
- Classification techniques for noisy and imbalanced data.
- Creator
- Napolitano, Amri E., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Machine learning techniques allow useful insight to be distilled from the increasingly massive repositories of data being stored. As these data mining techniques can only learn patterns actually present in the data, it is important that the desired knowledge be faithfully and discernibly contained therein. Two common data quality issues that often affect important real-life classification applications are class noise and class imbalance. Class noise, where dependent attribute values are recorded erroneously, misleads a classifier and reduces predictive performance. Class imbalance occurs when one class represents only a small portion of the examples in a dataset, and, in such cases, classifiers often display poor accuracy on the minority class. The reduction in classification performance becomes even worse when the two issues occur simultaneously. To address the magnified difficulty caused by this interaction, this dissertation performs thorough empirical investigations of several techniques for dealing with class noise and imbalanced data. Comprehensive experiments are performed to assess the effects of the classification techniques on classifier performance, as well as how the level of class imbalance, level of class noise, and distribution of class noise among the classes affect results. An empirical analysis of classifier-based noise detection efficiency appears first. Subsequently, an intelligent data sampling technique, based on noise detection, is proposed and tested. Several hybrid classifier ensemble techniques for addressing class noise and imbalance are introduced. Finally, a detailed empirical investigation of classification filtering is performed to determine best practices.
- Date Issued
- 2009
- PURL
- http://purl.flvc.org/FAU/369201
- Subject Headings
- Combinatorial group theory, Data mining, Technological innovations, Decision trees, Machine learning, Filters (Mathematics)
- Format
- Document (PDF)
- Title
- Alleviating class imbalance using data sampling: Examining the effects on classification algorithms.
- Creator
- Napolitano, Amri E., Florida Atlantic University, Khoshgoftaar, Taghi M., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Imbalanced class distributions typically cause poor classifier performance on the minority class, which also tends to be the class with the highest cost of misclassification. Data sampling is a common solution to this problem, and numerous sampling techniques have been proposed to address it. Prior research examining the performance of these techniques has been narrow and limited. This work uses thorough empirical experimentation to compare the performance of seven existing data sampling techniques using five different classifiers and four different datasets. The work addresses which sampling techniques produce the best performance in the presence of class imbalance, which classifiers are most robust to the problem, and which sampling techniques perform better or worse with each classifier. Extensive statistical analysis of these results is provided, in addition to an examination of the qualitative effects of the sampling techniques on the types of predictions made by the C4.5 classifier.
- Date Issued
- 2006
- PURL
- http://purl.flvc.org/fcla/dt/13413
- Subject Headings
- Combinatorial group theory, Data mining, Decision trees, Machine learning
- Format
- Document (PDF)