- Title
- Feature selection techniques and applications in bioinformatics.
- Creator
- Dittman, David, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Possibly the largest problem when working in bioinformatics is the large amount of data that must be sifted through to find useful information. This thesis shows that feature selection (a method of removing irrelevant and redundant information from a dataset) is a useful and even necessary technique for these large datasets. This thesis also presents a new method for comparing classes to each other through their features. It also provides a thorough analysis of various feature selection techniques and classifiers in different scenarios from bioinformatics. Overall, this thesis shows the importance of feature selection in bioinformatics.
- Date Issued
- 2011
- PURL
- http://purl.flvc.org/FAU/3175016
- Subject Headings
- Bioinformatics, Data mining, Technological innovations, Computational biology, Combinatorial group theory, Filters (Mathematics), Ranking and selection (Statistics)
- Format
- Document (PDF)
- Title
- Stability analysis of feature selection approaches with low quality data.
- Creator
- Altidor, Wilker, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
One of the greatest challenges to data mining is erroneous or noisy data. Several studies have noted the weak performance of classification models trained from low quality data. This dissertation shows that low quality data can also impact the effectiveness of feature selection, and considers the effect of class noise on various feature ranking techniques. It presents a novel approach to feature ranking based on ensemble learning and assesses these ensemble feature selection techniques in terms of their robustness to class noise. It presents a noise-based stability analysis that measures the degree of agreement between a feature ranking technique's output on a clean dataset and its outputs on the same dataset corrupted with different combinations of noise level and noise distribution. It then considers the classification performance of models built with a subset of the original features obtained after applying feature ranking techniques to noisy data. It proposes the focused ensemble feature ranking as a noise-tolerant approach to feature selection and compares focused ensembles with general ensembles in terms of the ability of the selected features to withstand the impact of class noise when used to build classification models. Finally, it explores three approaches for addressing the combined problem of high dimensionality and class imbalance. Collectively, this research shows the importance of considering class noise when performing feature selection.
- Date Issued
- 2011
- PURL
- http://purl.flvc.org/FAU/3174501
- Subject Headings
- Data mining, Technological innovations, Combinatorial group theory, Filters (Mathematics), Ranking and selection (Statistics)
- Format
- Document (PDF)
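The noise-based stability analysis described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the dissertation: the ranker (absolute difference of class means), the noise model (independent label flips), the agreement measure (top-k overlap), the synthetic data, and all function names are hypothetical stand-ins.

```python
import random

def rank_features(X, y):
    """Rank feature indices, best first, by absolute difference of class means.

    This simple filter is an illustrative stand-in, not the dissertation's ranker.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        # Guard against an empty class after heavy noise injection.
        if not pos or not neg:
            scores.append(0.0)
            continue
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

def inject_class_noise(y, rate, rng):
    """Flip each binary label independently with probability `rate` (assumed noise model)."""
    return [1 - label if rng.random() < rate else label for label in y]

def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of the top-k features shared by two rankings (1.0 means identical sets)."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k

if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data: features 0 and 1 carry class signal, the other 8 are pure noise.
    y = [i % 2 for i in range(200)]
    X = [
        [label * 2.0 + rng.gauss(0, 1), label * 1.5 + rng.gauss(0, 1)]
        + [rng.gauss(0, 1) for _ in range(8)]
        for label in y
    ]
    clean_ranking = rank_features(X, y)
    for rate in (0.0, 0.1, 0.3):
        noisy_ranking = rank_features(X, inject_class_noise(y, rate, rng))
        print(rate, top_k_overlap(clean_ranking, noisy_ranking, k=3))
```

The overlap score compares the clean ranking against each noisy ranking; lower values at higher noise rates would indicate a less stable ranker under this assumed setup.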
- Title
- Sensitivity analysis of predictive data analytic models to attributes.
- Creator
- Chiou, James, Zhu, Xingquan, Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Classification algorithms represent a rich set of tools, which train a classification model from a given training and test set to classify previously unseen test instances. Although existing methods have studied classification algorithm performance with respect to feature selection, noise condition, and sample distributions, existing studies have not addressed an important issue: how classification algorithm performance relates to feature deletion and addition. In this thesis, we carry out a sensitivity study of classification algorithms by using feature deletion and addition. Three types of classifiers, (1) weak classifiers, (2) generic and strong classifiers, and (3) ensemble classifiers, are validated on three types of data: (1) feature dimension data, (2) gene expression data, and (3) biomedical document data. In the experiments, we continuously add redundant features to the training and test set in order to observe the classification algorithm performance, and also continuously remove features to find the performance of the underlying classifiers. Our studies yield a number of important findings, which will help the data mining and machine learning community understand the genuine performance of common classification algorithms on real-world data.
- Date Issued
- 2014
- PURL
- http://purl.flvc.org/fau/fd/FA00004274
- Subject Headings
- Data mining, Forecasting -- Mathematical models, Social sciences -- Statistical methods, Ubiquitous computing
- Format
- Document (PDF)
- Title
- Pattern mining and visualization for molecular dynamics simulation.
- Creator
- Kong, Xue, Zhu, Xingquan, Florida Atlantic University, College of Computer Science and Engineering, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Molecular dynamics is a computer simulation technique for expressing the ultimate details of individual particle motions and can be used in many fields, such as chemical physics, materials science, and the modeling of biomolecules. In this thesis, we study visualization and pattern mining in molecular dynamics simulation. The molecular data set has a large number of atoms in each frame and a large range of frames. The features of the data set include atom ID; frame number; position in the x, y, and z planes; charge; and mass. The three main challenges of this thesis are to display a larger number of atoms and range of frames, to visualize this large data set in three dimensions, and to cluster the abnormally shifting atoms that move with the same pace and direction in different frames. Focusing on these three challenges, this thesis makes three contributions. First, we design an abnormal pattern mining and visualization framework for molecular dynamics simulation. The proposed framework can visualize the clusters of abnormally shifting atom groups in a three-dimensional space and show their temporal relationships. Second, we propose a pattern mining method to detect abnormal atom groups that share similar movement and have large variance compared to the majority of atoms. Third, we propose a general molecular dynamics simulation tool, which can visualize a large number of atoms, including their movement and temporal relationships, to help domain experts study molecular dynamics simulation results. The main functions of this visualization and pattern mining tool include atom number, cluster visualization, search across different frames, multiple frame range search, frame range switch, and line demonstration for atom motions in different frames. Therefore, this visualization and pattern mining tool can be used in the fields of chemical physics, materials science, and the modeling of biomolecules to study molecular dynamics simulation outcomes.
- Date Issued
- 2014
- PURL
- http://purl.flvc.org/fau/fd/FA00004212
- Subject Headings
- Data mining, Information visualization, Molecular dynamics -- Computer simulation, Molecules -- Mathematical models, Pattern perception
- Format
- Document (PDF)
- Title
- Application level intrusion detection using a sequence learning algorithm.
- Creator
- Dong, Yuhong, Florida Atlantic University, Hsu, Sam, Rajput, Saeed, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
An unsupervised learning algorithm for application-level intrusion detection, named Graph Sequence Learning Algorithm (GSLA), is proposed in this dissertation. Experiments prove its effectiveness. As in most intrusion detection algorithms, in GSLA the normal profile needs to be learned first. The normal profile is built using a session learning method, which is combined with the one-way Analysis of Variance method (ANOVA) to determine the value of an anomaly threshold. In the proposed approach, a hash table is used to store a sparse data matrix in triple data format that is collected from a web transition log, instead of an n-by-n matrix. Furthermore, in GSLA, the sequence learning matrix can be dynamically changed according to the volume of the data sets. Therefore, this approach is more efficient, easy to manipulate, and saves memory space. To validate the effectiveness of the algorithm, extensive simulations have been conducted by applying the GSLA algorithm to the homework submission system at our computer science and engineering department. The performance of GSLA is evaluated and compared with the traditional Markov Model (MM) and K-means algorithms. Specifically, three major experiments have been done: (1) A small data set is collected as sample data and is applied to the GSLA, MM, and K-means algorithms to illustrate the operation of the proposed algorithm and demonstrate the detection of abnormal behaviors. (2) The Random Walk-Through sampling method is used to generate a larger sample data set, and the resultant anomaly score is classified into several clusters in order to visualize and demonstrate the normal and abnormal behaviors with the K-means and GSLA algorithms. (3) Multiple professors' data sets are collected and used to build the normal profiles, and the ANOVA method is used to test for significant differences among professors' normal profiles.
The GSLA algorithm can be packaged as a module and plugged into an IDS as an anomaly detection system.
- Date Issued
- 2006
- PURL
- http://purl.flvc.org/fcla/dt/12220
- Subject Headings
- Data mining, Parallel processing (Electronic computers), Computer algorithms, Computer security, Pattern recognition systems
- Format
- Document (PDF)
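The abstract above mentions storing a sparse transition matrix in a hash table in triple format rather than as an n-by-n matrix. A minimal sketch of that general idea follows, assuming a web log that has already been split into per-session page sequences. The function names, page names, and session data are hypothetical, and this is not the GSLA implementation.

```python
from collections import defaultdict

def build_transition_table(sessions):
    """Count page-to-page transitions in a hash table keyed by (source, destination).

    Only transitions that actually occur take up memory, unlike a dense
    n-by-n matrix over all pages.
    """
    table = defaultdict(int)
    for pages in sessions:
        # Pair each page with its successor within the same session.
        for src, dst in zip(pages, pages[1:]):
            table[(src, dst)] += 1
    return table

def to_triples(table):
    """Export the table as sorted (source, destination, count) triples."""
    return sorted((src, dst, count) for (src, dst), count in table.items())

if __name__ == "__main__":
    # Hypothetical sessions from a homework submission site.
    sessions = [
        ["login", "list", "submit", "logout"],
        ["login", "list", "logout"],
    ]
    for triple in to_triples(build_transition_table(sessions)):
        print(triple)
```

Because the table grows only when a new transition is observed, it can be extended as more log data arrives without reallocating a full matrix.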
- Title
- Evaluating indirect and direct classification techniques for network intrusion detection.
- Creator
- Ibrahim, Nawal H., Florida Atlantic University, Khoshgoftaar, Taghi M.
- Abstract/Description
-
Increasing aggressions through cyber terrorism pose a constant threat to information security in our day-to-day life. Implementing effective intrusion detection systems (IDSs) is an essential task due to the great dependence on networked computers for the operational control of various infrastructures. Building effective IDSs, unfortunately, has remained an elusive goal owing to the great technical challenges involved, and applied data mining techniques are increasingly being utilized in attempts to overcome the difficulties. This thesis presents a comparative study of the traditional "direct" approaches with the recently explored "indirect" approaches of classification, which use class binarization and combiner techniques for intrusion detection. We evaluate and compare the performance of IDSs based on various data mining algorithms, in the context of a well-known network intrusion evaluation data set. It is empirically shown that data mining algorithms, when applied using the indirect classification approach, yield better intrusion detection models.
- Date Issued
- 2004
- PURL
- http://purl.flvc.org/fcla/dt/13128
- Subject Headings
- Computer networks--Security measures, Computer security, Software measurement, Data mining
- Format
- Document (PDF)
- Title
- Resource-sensitive intrusion detection models for network traffic.
- Creator
- Abushadi, Mohamed E., Florida Atlantic University, Khoshgoftaar, Taghi M.
- Abstract/Description
-
Network security is an important subject in today's extensively interconnected computer world. Industry, academic institutions, small and large businesses, and even residences are now greatly at risk from the increasing onslaught of computer attacks. Such malicious efforts cause damage ranging from mere violation of confidentiality and issues of privacy up to actual financial loss if business operations are compromised, or even further, loss of human lives in the case of mission-critical networked computer applications. Intrusion Detection Systems (IDS) have been used along with data mining modeling efforts to detect intruders, yet given limited organizational resources it is unreasonable to inspect every network alarm raised by the IDS. Modified Expected Cost of Misclassification (MECM) is a model selection measure that is resource-aware and cost-sensitive at the same time, and has proven to be effective for the identification of the best resource-based intrusion detection model.
- Date Issued
- 2003
- PURL
- http://purl.flvc.org/fcla/dt/13054
- Subject Headings
- Computer networks--Security measures--Automation, Computers--Access control, Data mining, Computer security
- Format
- Document (PDF)
- Title
- Visualization of Impact Analysis on Configuration Management Data for Software Process Improvement.
- Creator
- Lo, Christopher Hoi-Yin, Huang, Shihong, Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
The software development process is an incremental and iterative activity. Source code is constantly altered to reflect changing requirements, to respond to testing results, and to address problem reports. Proper software measurement that derives meaningful numeric values for some attributes of a software product or process can help in identifying problem areas and development bottlenecks. Impact analysis is the evaluation of the risks associated with change requests or problem reports, including estimates of effects on resources, effort, and schedule. This thesis presents a methodology called VITA for applying software analysis techniques to configuration management repository data with the aim of identifying the impact on file changes due to change requests and problem reports. The repository data can be analyzed and visualized in a semi-automated manner according to user-selectable criteria. The approach is illustrated with a model problem concerning software process improvement of an embedded software system in the context of performing high-quality software maintenance.
- Date Issued
- 2007
- PURL
- http://purl.flvc.org/fau/fd/FA00012535
- Subject Headings
- Software measurement, Software engineering--Quality control, Data mining--Quality control
- Format
- Document (PDF)
- Title
- Text Mining and Topic Modeling for Social and Medical Decision Support.
- Creator
- Hurtado, Jose Luis, Zhu, Xingquan, Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Effective decision support plays a vital role in people's daily lives, as well as for professional practitioners such as health care providers. Without correct information and timely derived knowledge, a decision is often suboptimal and may result in significant financial loss or compromised performance. In this dissertation, we study text mining and topic modeling and propose to use text mining methods, in combination with topic models, to discover knowledge from texts widely available from a variety of sources, such as research publications, news, and medical diagnosis notes, and further employ the discovered knowledge to assist social and medical decision support. Examples of such decisions include hospital patient readmission prediction, which is a national initiative for health care cost reduction, academic research topic discovery and trend modeling, and social preference modeling for friend recommendation in social networks. To carry out text mining, our research, in Chapter 3, first focuses on single-document analysis to investigate textual stylometric features for user profiling and recognition. Our research confirms that by using properly designed features, it is possible to identify the author who wrote an article, using a number of sample articles written by the author as the training data. This study serves as the basis for asserting that text mining is a powerful tool for capturing knowledge in texts for better decision making. In Chapter 4, we advance our research from single documents to documents with interdependency relationships, and propose to model and predict citation relationships between documents. Given a collection of documents with known linkage relationships, our research discovers effective features to train prediction models and predicts the likelihood of two documents being involved in a citation relationship.
This study will help accurately model social network linkage relationships, and can be used to assist effective decision making for friend recommendation in social networking and reference recommendation in scientific writing. In Chapter 5, we advance a topic discovery and trend prediction principle to discover meaningful topics from a data collection and further model the evolution trend of each topic. By proposing techniques to discover topics from text, and using temporal correlation between trends for prediction, our techniques can be used to summarize a large collection of documents as meaningful topics and further forecast the popularity of a topic in the near future. This study can help design systems to discover popular topics in social media, and further assist resource planning and scheduling based on the discovered topics and their evolution trends. In Chapter 6, we apply both text mining and topic modeling to the medical domain for effective decision making. The goal is to discover knowledge from medical notes to predict the risk of a patient being re-admitted in the near future. Our research focuses on the challenge that re-admitted patients are only a small portion of the patient population, although they bring significant financial loss. As a result, the datasets are highly imbalanced, which often results in poor accuracy for decision making. Our research proposes to use latent topic modeling to carry out localized sampling, and to combine models trained from multiple copies of sampled data for accurate prediction. This study can be directly used to assist hospital re-admission assessment for early warning and decision support. The text mining and topic modeling techniques investigated in the dissertation can be applied to many other domains involving texts and social relationships, towards pattern- and knowledge-based effective decision making.
- Date Issued
- 2016
- PURL
- http://purl.flvc.org/fau/fd/FA00004782
- Subject Headings
- Social sciences--Research--Methodology, Data mining, Machine learning, Database searching, Discourse analysis--Data processing, Communication--Network analysis, Medical care--Quality control
- Format
- Document (PDF)
- Title
- Racial Inequalities in America: Examining Socioeconomic Statistics Using the Semantic Web.
- Creator
- Terrell, David J, Shankar, Ravi, Florida Atlantic University, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
The visualization of recent episodes regarding apparently unjustifiable deaths of minorities, caused by police and federal law enforcement agencies, has been amplified through today's social media and television networks. Such events may seem to imply that issues concerning racial inequalities in America are getting worse. However, we do not know whether such indications are factual; whether this is a recent phenomenon, whether racial inequality is escalating relative to earlier decades, or whether it is better in certain regions of the nation compared to others. We have built a semantic engine for the purpose of querying statistics on various metropolitan areas, based on a database of individual deaths. Separately, we have built a database of demographic data on poverty, income, educational attainment, and crime statistics for the top 25 most populous metropolitan areas. These data will ultimately be combined with government data to evaluate this hypothesis and provide a tool for predictive analytics. In this thesis, we provide preliminary results in that direction. The methodology in our research consisted of multiple steps. We initially described our requirements and drew data from numerous datasets, which contained information on the 23 highest populated Metropolitan Statistical Areas in the United States. After all of the required data was obtained, we decomposed the Metropolitan Statistical Area records into domain components and created an Ontology/Taxonomy via Protege to determine a hierarchy level of nouns, towards identifying significant keywords throughout the datasets to use as search queries. Next, we used a Semantic Web implementation together with the Python programming language and FuXi to build and instantiate a vocabulary. The Ontology was then parsed for the entered search query and returned corresponding results, providing a semantically organized and relevant output in RDF/XML format.
- Date Issued
- 2015
- PURL
- http://purl.flvc.org/fau/fd/FA00004550
- Subject Headings
- Data mining, Education -- Demographic aspects -- United States -- Statistics, Minorities -- United States -- Social conditions, Minorities -- United States -- Statistics, Race -- United States -- Statistics, Semantic Web, United States -- Ethnic relations -- Statistics, United States -- Race relations -- Statistics
- Format
- Document (PDF)