This document evaluates several supervised machine learning algorithms for classifying gene expression data from microarray experiments. It analyzes two gene expression datasets, the leukemia and DLBCL datasets, using k-nearest neighbors, naive Bayes, decision trees, and support vector machines, with and without feature selection. The results show that support vector machines achieved the best overall performance and that feature selection improved the accuracy of all the algorithms.
Machine Learning Based Approaches for Cancer Classification Using Gene Expres... (mlaij)
The classification of different types of tumors is of great importance in cancer diagnosis and drug discovery. Earlier studies on cancer classification had limited diagnostic ability. The recent development of DNA microarray technology has made it possible to monitor the expression of thousands of genes simultaneously. Using this abundance of gene expression data, researchers are exploring the possibilities of cancer classification. A number of methods have been proposed with good results, but many issues still need to be addressed. This paper presents an overview of various cancer classification methods and evaluates them based on their classification accuracy, computational time, and ability to reveal gene information. We have also evaluated and introduced various proposed gene selection methods. Several issues related to cancer classification are also discussed.
Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif... (ahmad abdelhafeez)
Abstract- The goal of this paper is to compare different classifiers, and their fusion into multi-classifiers, with respect to accuracy in detecting breast cancer on four different datasets. We present an implementation of the best-known classification techniques in this field on four breast cancer datasets: two for diagnosis and two for prognosis. We fuse classifiers to find the best multi-classifier fusion approach for each dataset individually. Classification accuracy is obtained from a confusion matrix built within a 10-fold cross-validation procedure, and fusion uses majority voting (the mode of the classifier outputs). The experimental results show that no classification technique is better than the others across all datasets, since the classification task is affected by the type of dataset. Using multi-classifier fusion, accuracy improved on three of the four datasets.
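The fusion scheme this abstract describes (10-fold cross-validated predictions, a confusion matrix, and majority voting over classifier outputs) can be sketched with scikit-learn. This is an illustrative sketch, not the paper's implementation: sklearn's built-in Wisconsin breast cancer data stands in for the four datasets, and the three base classifiers are arbitrary common choices.

```python
# Sketch: majority-voting fusion of three classifiers, evaluated with a
# confusion matrix built from 10-fold cross-validated predictions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hard voting returns the mode of the individual classifier outputs.
fusion = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)

y_pred = cross_val_predict(fusion, X, y, cv=10)  # out-of-fold predictions
cm = confusion_matrix(y, y_pred)
acc = accuracy_score(y, y_pred)
print(cm)
print(f"10-fold CV accuracy: {acc:.3f}")
```

Swapping the estimator list per dataset mirrors the paper's finding that the best fusion differs from dataset to dataset.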
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER... (IJDKP)
Over the past few years, there has been a considerable spread of microarray technology in many biological settings, particularly those pertaining to cancers such as leukemia, prostate, and colon cancer. The primary bottleneck in properly understanding such datasets lies in their dimensionality, so studying them efficiently and effectively requires reducing their dimension to a large extent. This study suggests different algorithms and approaches for reducing the dimensionality of such microarray datasets. It exploits the matrix-like structure of microarray data and uses a popular technique called Non-Negative Matrix Factorization (NMF) to reduce the dimensionality, primarily in the field of biological data. Classification accuracies are then compared for these algorithms. This technique gives an accuracy of 98%.
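A minimal sketch of the NMF-then-classify pipeline described above. The rank (16 components) and the k-NN classifier are illustrative assumptions, and sklearn's digits data (non-negative, like microarray intensities) stands in for the cancer datasets; the 98% figure from the abstract is not reproduced here.

```python
# Sketch: reduce dimensionality with NMF (X ~= W @ H), then classify in the
# reduced space W instead of the original feature space.
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)   # 64 non-negative features per sample
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Factorize the training matrix; W is the low-dimensional representation.
nmf = NMF(n_components=16, init="nndsvda", max_iter=500, random_state=0)
W_tr = nmf.fit_transform(X_tr)
W_te = nmf.transform(X_te)           # project test data onto the same basis

clf = KNeighborsClassifier(n_neighbors=5).fit(W_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(W_te))
print(f"accuracy after NMF to 16 dims: {acc:.3f}")
```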
Classification of medical datasets using back propagation neural network powe... (IJECEIAES)
Classification is one of the most indispensable domains in data mining and machine learning. The classification process has a good reputation in computer-aided disease diagnosis, where progress in smart computing technologies can be invested in diagnosing various diseases based on real patient data documented in databases. The paper introduces a methodology for diagnosing a set of diseases, including two types of cancer (breast and lung), and two datasets for diabetes and heart attack. A back-propagation neural network plays the role of the classifier. The performance of the neural net is enhanced by a genetic algorithm, which provides the classifier with the optimal features to raise the classification rate as high as possible. The system showed high efficiency in dealing with databases that differ from each other in size, number of features, and nature of the data, as the results illustrate: the classification rate reached 100% on most datasets.
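The genetic-algorithm feature selection the abstract describes can be sketched as a search over binary feature masks scored by a classifier's cross-validated accuracy. This is a hedged sketch, not the paper's system: the paper scores masks with a back-propagation network, while this sketch substitutes logistic regression for speed, and the population size, generation count, and mutation rate are illustrative choices.

```python
# Sketch: GA over binary feature masks; fitness = 3-fold CV accuracy of a
# classifier trained on the selected features only.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

def fitness(mask):
    if not mask.any():
        return 0.0
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

pop = rng.random((10, n_features)) < 0.5          # random initial masks
for gen in range(5):
    scores = np.array([fitness(m) for m in pop])
    order = np.argsort(scores)[::-1]
    parents = pop[order[:5]]                      # truncation selection
    children = []
    for _ in range(len(pop) - len(parents)):
        a, b = parents[rng.integers(5, size=2)]
        cut = rng.integers(1, n_features)         # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_features) < 0.05      # bit-flip mutation
        children.append(child ^ flip)
    pop = np.vstack([parents] + [children])

scores = np.array([fitness(m) for m in pop])
best_mask = pop[scores.argmax()]
best_acc = scores.max()
print(f"selected {best_mask.sum()} features, CV accuracy {best_acc:.3f}")
```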
A new model for large dataset dimensionality reduction based on teaching lear... (TELKOMNIKA JOURNAL)
One of the human diseases with a high rate of mortality each year is breast cancer (BC). Among all forms of cancer, BC is the commonest cause of death among women globally. Data mining and classification methods are effective ways of classifying such data. These methods are particularly useful in the medical field because medical datasets contain irrelevant and redundant attributes, which are not needed to obtain an accurate estimate of a disease diagnosis. Teaching-learning-based optimization (TLBO) is a new metaheuristic that has been successfully applied to several intractable optimization problems in recent years. This paper presents the use of a multi-objective TLBO algorithm for the selection of feature subsets in automatic BC diagnosis. For the classification task in this work, the logistic regression (LR) method was deployed. From the results, the proposed method produced better BC dataset classification accuracy (classified into malignant and benign). This result showed that the proposed TLBO is an efficient feature optimization technique for sustaining data-based decision-making systems.
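A minimal sketch of TLBO-style feature-subset selection with logistic regression, in the spirit of the paper's setup. Only the teacher phase is shown (the learner phase and the paper's multi-objective handling are omitted); positions are continuous in [0, 1] and a feature is selected when its coordinate exceeds 0.5. Class size and iteration count are illustrative assumptions.

```python
# Sketch: TLBO teacher phase for feature selection. Each "learner" is a
# candidate feature subset scored by logistic regression CV accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X, y = load_breast_cancer(return_X_y=True)
d = X.shape[1]

def score(pos):
    mask = pos > 0.5
    if not mask.any():
        return 0.0
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

learners = rng.random((8, d))                  # the "class" of solutions
fit = np.array([score(p) for p in learners])

for it in range(5):
    teacher = learners[fit.argmax()]           # best learner is the teacher
    mean = learners.mean(axis=0)
    TF = rng.integers(1, 3)                    # teaching factor: 1 or 2
    r = rng.random((len(learners), d))
    new = np.clip(learners + r * (teacher - TF * mean), 0, 1)
    new_fit = np.array([score(p) for p in new])
    improved = new_fit > fit                   # greedy acceptance
    learners[improved] = new[improved]
    fit[improved] = new_fit[improved]

best_mask = learners[fit.argmax()] > 0.5
best_acc = fit.max()
print(f"TLBO selected {best_mask.sum()} features, CV accuracy {best_acc:.3f}")
```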
Classification of pneumonia from X-ray images using siamese convolutional net... (TELKOMNIKA JOURNAL)
Pneumonia is one of the leading global causes of death, especially for children under 5 years old. This is mainly because of the difficulty of identifying the cause of pneumonia; as a result, the treatment given may not be suitable for each pneumonia case. Recent studies have used deep learning approaches to obtain better classification of the cause of pneumonia. In this research, we used a siamese convolutional network (SCN) to classify chest X-ray pneumonia images into 3 classes: normal conditions, bacterial pneumonia, and viral pneumonia. A siamese convolutional network is a neural network architecture that learns similarity between pairs of image inputs based on the differences between their features. One of the important benefits of classifying data with an SCN is the availability of comparable images that can be used as a reference when determining the class. Using an SCN, our best model achieved 80.03% accuracy and a 79.59% F1 score, and improved result reasoning by providing the comparable images.
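The inference side of a siamese classifier (assign a test image the class of its most similar reference image) can be sketched without the network itself. This sketch is a simplification under strong assumptions: a real SCN learns a convolutional embedding from image pairs, whereas here raw pixel vectors of sklearn's digits data stand in for that learned embedding, purely to show the pair-comparison mechanism.

```python
# Sketch: nearest-reference classification, the comparison step of a
# siamese network, with raw pixels replacing the learned embedding.
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)

# One reference image per class: the "comparable images" the abstract
# mentions as a benefit for explaining a decision.
classes = np.unique(y)
refs = np.stack([X[y == c][0] for c in classes])

def predict(img):
    # Pairwise distance plays the role of the siamese similarity score;
    # smaller distance means more similar.
    dists = np.linalg.norm(refs - img, axis=1)
    return classes[dists.argmin()]

test_idx = rng.choice(len(X), 200, replace=False)
preds = np.array([predict(X[i]) for i in test_idx])
acc = (preds == y[test_idx]).mean()
print(f"nearest-reference accuracy: {acc:.3f}")
```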
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA... (ijsc)
As the size of biomedical databases grows day by day, finding essential features for disease prediction has become more complex due to high-dimensionality and sparsity problems. Also, due to the availability of a large number of microarray datasets in biomedical repositories, it is difficult to analyze, predict, and interpret feature information using traditional feature-selection-based classification models. Most traditional feature-selection-based classification algorithms have computational issues such as dimension reduction, uncertainty, and class imbalance on microarray datasets. The ensemble classifier is one of the scalable models for extreme learning machines due to its high efficiency and fast processing speed for real-time applications. The main objective of feature-selection-based ensemble learning models is to classify high-dimensional data with high computational efficiency and a high true positive rate. In the proposed model, an optimized particle swarm optimization (PSO) based ensemble classification model was developed on high-dimensional microarray datasets. Experimental results proved that the proposed model has high computational efficiency compared to traditional feature-selection-based classification models as far as accuracy, true positive rate, and error rate are concerned.
Sample Work For Engineering Literature Review and Gap Identification (PhD Assistance)
Sample Work For Engineering Literature Review and Gap Identification - PhD Assistance - http://bit.ly/2E9fAVq
2.1 INTRODUCTION
2.2 RESEARCH GAPS IN EXISTING METHODS
2.3 OBJECTIVES OF THIS WORK
Read More : http://bit.ly/2Rl7XT5
An approach for breast cancer diagnosis classification using neural network (acijjournal)
Artificial neural networks have been widely used as an intelligent tool in various fields in recent years, such as artificial intelligence, pattern recognition, medical diagnosis, and machine learning. The classification of breast cancer is a medical application that poses a great challenge for researchers and scientists. Recently, the neural network has become a popular tool in the classification of cancer datasets. Classification is one of the most active research and application areas of neural networks. The major disadvantages of the artificial neural network (ANN) classifier are its sluggish convergence and its tendency to become trapped in local minima. To overcome this problem, the differential evolution (DE) algorithm has been used to determine optimal or near-optimal values for ANN parameters. DE has been applied successfully to improve ANN learning in previous studies. However, there are still some issues with the DE approach, such as longer training time and lower classification accuracy. To overcome these problems, an island-based model is proposed in this system. The aim of our study is to propose an approach for distinguishing between different classes of breast cancer. This approach is based on the Wisconsin Diagnostic and Prognostic Breast Cancer datasets. The proposed system implements the island-based training method to achieve better accuracy and less training time by using and analyzing two different migration topologies.
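Using differential evolution to set ANN parameters, as described above, can be sketched with a tiny one-hidden-layer network whose weights are the DE search vector. This is a hedged illustration, not the paper's system: the island model and migration topologies are omitted, iris stands in for the breast cancer data, and the layer sizes and DE settings (DE/rand/1/bin, F = 0.6, CR = 0.9) are illustrative choices.

```python
# Sketch: DE/rand/1/bin searching the weight vector of a small network,
# minimizing cross-entropy on the training data (no gradients used).
import numpy as np
from sklearn.datasets import load_iris

rng = np.random.default_rng(3)
X, y = load_iris(return_X_y=True)
X = (X - X.mean(0)) / X.std(0)
n_in, n_hid, n_out = 4, 5, 3
sizes = [n_in * n_hid, n_hid, n_hid * n_out, n_out]
n_params = sum(sizes)

def forward(w, X):
    W1, b1, W2, b2 = np.split(w, np.cumsum(sizes)[:-1])
    h = np.tanh(X @ W1.reshape(n_in, n_hid) + b1)
    return h @ W2.reshape(n_hid, n_out) + b2

def loss(w):                                   # mean cross-entropy
    z = forward(w, X)
    z = z - z.max(1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(1, keepdims=True)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

NP, F, CR = 30, 0.6, 0.9
pop = rng.normal(0, 0.5, (NP, n_params))
fit = np.array([loss(w) for w in pop])
for gen in range(100):
    for i in range(NP):
        a, b, c = pop[rng.choice(NP, 3, replace=False)]
        cross = rng.random(n_params) < CR      # binomial crossover
        trial = np.where(cross, a + F * (b - c), pop[i])
        f = loss(trial)
        if f <= fit[i]:                        # greedy selection
            pop[i], fit[i] = trial, f

best = pop[fit.argmin()]
best_acc = (forward(best, X).argmax(1) == y).mean()
print(f"DE-trained network accuracy: {best_acc:.3f}")
```

An island model would run several such populations in parallel and periodically migrate their best individuals between islands.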
Large datasets are not available for some diseases, such as brain tumors. This presentation and part 2 show how to find an actionable solution from a difficult cancer dataset.
Regularized Weighted Ensemble of Deep Classifiers (ijcsa)
An ensemble of classifiers increases classification performance, since the decisions of many experts are fused together to generate the resultant decision for prediction. Deep learning is a classification approach in which, along with the basic learning technique, fine-tuning is done for improved precision of learning. Deep classifier ensemble learning has good scope for research. Feature subset selection is another way of creating individual classifiers to be fused for ensemble learning. All these ensemble techniques face the ill-posed problem of overfitting. A regularized weighted ensemble of deep support vector machines performs prediction analysis on three UCI repository problems (the Iris, Ionosphere, and Seeds datasets), thereby increasing the generalization of the boundary between the classes of the dataset. Singular value decomposition with reduced norm-2 regularization in the two-level deep classifier ensemble gives the best result in our experiments.
Classification of Microarray Gene Expression Data by Gene Combinations using ... (IJCSEA Journal)
Feature selection has attracted a huge amount of interest in both the research and application communities of data mining. Among the large number of genes present in gene expression data, only a small fraction is effective for performing a certain diagnostic test. Hence, one of the major tasks with gene expression data is to find groups of co-regulated genes whose collective expression is strongly associated with the sample categories or response variables. A framework is proposed in this paper to find informative gene combinations and to classify each gene combination into its relevant subtype using fuzzy logic. The genes are ranked based on their statistical scores, and highly informative genes are filtered. Such genes are fuzzified to identify 2-gene and 3-gene combinations, and the intermediate value for each gene is calculated to select the top gene combinations, which are then used to classify lymphoma subtypes with fuzzy rules. Finally, the accuracy of the top gene combinations is compared with clustering results. The classification is done using the gene combinations and analyzed to assess the accuracy of the results. The work is implemented in the Java language.
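The first stages of the pipeline above (rank genes by a statistical score, keep the top-ranked genes, fuzzify their expression) can be sketched on synthetic data. The t-statistic as the "statistical score", the triangular low/medium/high memberships, and the synthetic dataset are all assumptions for illustration; the paper's fuzzy rules and gene-combination step are not reproduced.

```python
# Sketch: per-gene two-sample t-statistic ranking, then triangular
# fuzzification of a top gene into low/medium/high memberships.
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_genes = 40, 200
labels = np.repeat([0, 1], n_samples // 2)
expr = rng.normal(0, 1, (n_samples, n_genes))
expr[labels == 1, :5] += 2.0               # plant 5 informative genes

# Statistical score: two-sample t-statistic per gene.
a, b = expr[labels == 0], expr[labels == 1]
t = (a.mean(0) - b.mean(0)) / np.sqrt(a.var(0) / len(a) + b.var(0) / len(b))
top = np.argsort(-np.abs(t))[:10]           # keep the 10 most informative

def fuzzify(x, lo, hi):
    """Triangular memberships for low / medium / high expression."""
    mid = (lo + hi) / 2
    low = np.clip((mid - x) / (mid - lo), 0, 1)
    high = np.clip((x - mid) / (hi - mid), 0, 1)
    medium = 1 - low - high
    return low, medium, high

g = expr[:, top[0]]
low, med, high = fuzzify(g, g.min(), g.max())
print("top-ranked genes:", sorted(top.tolist()))
```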
Paper Annotated: SinGAN-Seg: Synthetic Training Data Generation for Medical I... (Devansh16)
YouTube video: https://www.youtube.com/watch?v=Ao-19L0sLOI
SinGAN-Seg: Synthetic Training Data Generation for Medical Image Segmentation
Vajira Thambawita, Pegah Salehi, Sajad Amouei Sheshkal, Steven A. Hicks, Hugo L.Hammer, Sravanthi Parasa, Thomas de Lange, Pål Halvorsen, Michael A. Riegler
Processing medical data to find abnormalities is a time-consuming and costly task, requiring tremendous effort from medical experts. Therefore, AI has become a popular tool for the automatic processing of medical data, acting as a supportive tool for doctors. AI tools depend heavily on data for training the models. However, there are several constraints on access to large amounts of medical data for training machine learning algorithms in the medical domain, e.g., privacy concerns and the costly, time-consuming medical data annotation process. To address this, in this paper we present a novel synthetic data generation pipeline called SinGAN-Seg to produce synthetic medical data with the corresponding annotated ground truth masks. We show that these synthetic data generation pipelines can be used as an alternative to bypass privacy concerns and to produce artificial segmentation datasets with corresponding ground truth masks, avoiding the tedious medical data annotation process. As a proof of concept, we used an open polyp segmentation dataset. By training UNet++ on both the real polyp segmentation dataset and the corresponding synthetic dataset generated by the SinGAN-Seg pipeline, we show that the synthetic data can achieve performance very close to the real data when the real segmentation datasets are large enough. In addition, we show that synthetic data generated by the SinGAN-Seg pipeline improves the performance of segmentation algorithms when the training dataset is very small. Since the SinGAN-Seg pipeline is applicable to any medical dataset, it can be used with any other segmentation dataset.
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2107.00471 [eess.IV]
(or arXiv:2107.00471v1 [eess.IV] for this version)
Identification of Disease in Leaves using Genetic Algorithm (ijtsrd)
Plant disease is an impairment of the normal state of a plant that interrupts or modifies its vital functions. Many leaf diseases are caused by pathogens. Agriculture is the mainstay of the Indian economy. The perception of the human eye is not strong enough to observe minute variations in the infected part of a leaf. In this paper, we provide a software solution to automatically detect and classify plant leaf diseases. We use image processing techniques to classify diseases so that a diagnosis can be carried out quickly for each disease. This approach will enhance the productivity of crops. It includes image processing techniques from image acquisition through preprocessing, testing, and training. K. Beulah Suganthy, "Identification of Disease in Leaves using Genetic Algorithm", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume-3, Issue-3, April 2019, URL: https://www.ijtsrd.com/papers/ijtsrd22901.pdf
Paper URL: https://www.ijtsrd.com/engineering/electronics-and-communication-engineering/22901/identification-of-disease-in-leaves-using-genetic-algorithm/k-beulah-suganthy
Accounting for variance in machine learning benchmarks (Devansh16)
Accounting for Variance in Machine Learning Benchmarks
Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent
Strong empirical evidence that one machine-learning algorithm A outperforms another algorithm B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization, and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator better approaches the ideal estimator, at a 51-times reduction in compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
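The paper's core recommendation can be shown in miniature: repeat the same benchmark while varying a seed-controlled source of variation and report the spread, not a single score. The dataset, model, number of trials, and the choice of data splitting as the varied source are illustrative stand-ins, not the paper's experimental setup.

```python
# Sketch: 20 trials of the same pipeline, varying only the train/test
# split seed, to expose the variance hidden by a single-number benchmark.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
scores = []
for seed in range(20):                      # vary the data-sampling seed
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed)
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores.append(clf.fit(X_tr, y_tr).score(X_te, y_te))

scores = np.array(scores)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} "
      f"over {len(scores)} trials")
```

Two algorithms whose intervals overlap under this protocol should not be declared different on the strength of one lucky seed.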
A Classification of Cancer Diagnostics based on Microarray Gene Expression Pr... (IJTET Journal)
Abstract— Pattern recognition (PR) plays an important role in the field of bioinformatics. PR is concerned with processing raw measurement data by a computer to arrive at a prediction that can be used to formulate a decision. The important problems to which pattern recognition is applied have in common that they are too complex to model explicitly. Diverse PR methods are used to analyze, segment, and manage high-dimensional microarray gene data for classification. PR is concerned with the development of systems that learn to solve a given problem using a set of instances, each represented by a number of features. Microarray expression technologies make it possible to monitor the expression levels of thousands of genes simultaneously. The large amount of data generated by microarrays has stimulated the development of various computational methods for studying different biological processes by gene expression profiling. Microarray gene expression profiling (MGEP) is important in bioinformatics; it yields high-dimensional data used in various clinical applications such as cancer diagnostics and drug design. In this work a new scheme has been developed for classifying unknown malignant tumors into known classes. The scheme includes the transformation of very high-dimensional microarray data into Mahalanobis space before classification. The eligibility of the proposed classification scheme was proved on 10 commonly available cancer gene datasets, containing both binary and multiclass datasets. To improve classification performance, a gene selection method is applied to the datasets as a preprocessing and data extraction step.
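Transforming data into "Mahalanobis space" before classification, as the abstract describes, amounts to whitening by the inverse square root of the covariance, so that Euclidean distance in the new space equals Mahalanobis distance in the original. A minimal sketch, assuming a pooled covariance and a nearest-centroid rule, with iris standing in for the cancer gene datasets:

```python
# Sketch: whiten x -> C^{-1/2} (x - mean), then classify by nearest
# class centroid in the whitened (Mahalanobis) space.
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

mean = X.mean(0)
cov = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(cov)            # cov = V diag(e) V^T
W = evecs @ np.diag(evals ** -0.5) @ evecs.T  # C^{-1/2}
Z = (X - mean) @ W                            # Mahalanobis-space data

centroids = np.stack([Z[y == c].mean(0) for c in np.unique(y)])
preds = np.linalg.norm(Z[:, None, :] - centroids, axis=2).argmin(1)
acc = (preds == y).mean()
print(f"nearest-centroid accuracy in Mahalanobis space: {acc:.3f}")
```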
Controlling informative features for improved accuracy and faster predictions... (Damian R. Mingle, MBA)
Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current machine learning approaches are either too complex or perform poorly.
For more information:
http://societyofdatascientists.com/controlling-informative-features-for-improved-accuracy-and-faster-predictions-in-omentum-cancer-models/?src=slideshare
A new model for large dataset dimensionality reduction based on teaching lear...TELKOMNIKA JOURNAL
One of the human diseases with a high rate of mortality each year is breast cancer (BC). Among all the forms of cancer, BC is the commonest cause of death among women globally. Some of the effective ways of data classification are data mining and classification methods. These methods are particularly efficient in the medical field due to the presence of irrelevant and redundant attributes in medical datasets. Such redundant attributes are not needed to obtain an accurate estimation of disease diagnosis. Teaching learning-based optimization (TLBO) is a new metaheuristic that has been successfully applied to several intractable optimization problems in recent years. This paper presents the use of a multi-objective TLBO algorithm for the selection of feature subsets in automatic BC diagnosis. For the classification task in this work, the logistic regression (LR) method was deployed. From the results, the projected method produced better BC dataset classification accuracy (classified into malignant and benign). This result showed that the projected TLBO is an efficient features optimization technique for sustaining data-based decision-making systems.
Classification of pneumonia from X-ray images using siamese convolutional net...TELKOMNIKA JOURNAL
Pneumonia is one of the highest global causes of deaths especially for children under 5 years old. This happened mainly because of the difficulties in identifying the cause of pneumonia. As a result, the treatment given may not be suitable for each pneumonia case. Recent studies have used deep learning approaches to obtain better classification within the cause of pneumonia. In this research, we used siamese convolutional network (SCN) to classify chest x-ray pneumonia image into 3 classes, namely normal conditions, bacterial pneumonia, and viral pneumonia. Siamese convolutional network is a neural network architecture that learns similarity knowledge between pairs of image inputs based on the differences between its features. One of the important benefits of classifying data with SCN is the availability of comparable images that can be used as a reference when determining class. Using SCN, our best model achieved 80.03% accuracy, 79.59% f1 score, and an improved result reasoning by providing the comparable images.
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...ijsc
As the size of the biomedical databases are growing day by day, finding an essential features in the disease prediction have become more complex due to high dimensionality and sparsity problems. Also, due to the
availability of a large number of micro-array datasets in the biomedical repositories, it is difficult to analyze, predict and interpret the feature information using the traditional feature selection based classification models. Most of the traditional feature selection based classification algorithms have computational issues such as dimension reduction, uncertainty and class imbalance on microarray datasets. Ensemble classifier is one of the scalable models for extreme learning machine due to its high efficiency, the fast processing speed for real-time applications. The main objective of the feature selection
based ensemble learning models is to classify the high dimensional data with high computational efficiency
and high true positive rate on high dimensional datasets. In this proposed model an optimized Particle swarm optimization (PSO) based Ensemble classification model was developed on high dimensional microarray
datasets. Experimental results proved that the proposed model has high computational efficiency compared to the traditional feature selection based classification models in terms of accuracy , true positive rate and error rate are concerned.
Sample Work For Engineering Literature Review and Gap IdentificationPhD Assistance
Sample Work For Engineering Literature Review and Gap Identification - PhD Assistance - http://bit.ly/2E9fAVq
2.1 INTRODUCTION
2.2 RESEARCH GAPS IN EXISTING METHODS
2.3 OBJECTIVES OF THIS WORK
Read More : http://bit.ly/2Rl7XT5
#gapanalysis #strategicmanagement #datagapanalysis #gapanalysisppt #gapanalysishealthcare #gapanalysisfinance #gapanalysisEngineering
An approach for breast cancer diagnosis classification using neural networkacijjournal
Artificial neural networks have been widely used as intelligent tools in recent years in fields such as artificial intelligence, pattern recognition, medical diagnosis, and machine learning. The classification of breast cancer is a medical application that poses a great challenge for researchers and scientists. Recently, the neural network has become a popular tool for classifying cancer datasets, and classification is one of the most active research and application areas of neural networks. The major disadvantages of the artificial neural network (ANN) classifier are its sluggish convergence and its tendency to become trapped in local minima. To overcome this problem, the differential evolution (DE) algorithm has been used to determine optimal or near-optimal values for the ANN parameters, and DE has been applied successfully to improve ANN learning in previous studies. However, the DE approach still has some issues, such as long training time and low classification accuracy. To overcome these problems, an island-based model is proposed in this system. The aim of our study is to propose an approach for distinguishing between different classes of breast cancer, based on the Wisconsin Diagnostic and Prognostic Breast Cancer datasets. The proposed system implements the island-based training method to achieve better accuracy and shorter training time by using and comparing two different migration topologies.
Large datasets are not available for some diseases, such as brain tumors. This presentation and part 2 show how to find an actionable solution from a difficult cancer dataset.
Regularized Weighted Ensemble of Deep Classifiers ijcsa
An ensemble of classifiers increases classification performance, since the decisions of many experts are fused to generate the resulting prediction. Deep learning is a classification approach in which, along with the basic learning technique, fine-tuning is performed for improved learning precision. Deep classifier ensemble learning has good scope for research. Feature subset selection is another way of creating the individual classifiers to be fused in ensemble learning. All these ensemble techniques face the ill-posed problem of overfitting. A regularized weighted ensemble of deep support vector machines performs prediction analysis on three UCI repository problems (Iris, Ionosphere, and Seeds), thereby increasing the generalization of the boundary between the classes of the dataset. Singular value decomposition reduced norm-2 regularization with the two-level deep classifier ensemble gives the best result in our experiments.
Classification of Microarray Gene Expression Data by Gene Combinations using ...IJCSEA Journal
Feature selection has attracted a huge amount of interest in both the research and application communities of data mining. Among the large number of genes present in gene expression data, only a small fraction are effective for performing a certain diagnostic test. Hence, one of the major tasks with gene expression data is to find groups of co-regulated genes whose collective expression is strongly associated with the sample categories or response variables. A framework is proposed in this paper to find informative gene combinations and to classify gene combinations into their relevant subtypes using fuzzy logic. The genes are ranked based on their statistical scores, and highly informative genes are filtered. Such genes are fuzzified to identify 2-gene and 3-gene combinations, and the intermediate value for each gene is calculated to select top gene combinations, which are then used to classify lymphoma subtypes with fuzzy rules. Finally, the accuracy of the top gene combinations is compared with clustering results. The classification is done using the gene combinations and is analyzed to predict the accuracy of the results. The work is implemented in the Java language.
Paper Annotated: SinGAN-Seg: Synthetic Training Data Generation for Medical I...Devansh16
YouTube video: https://www.youtube.com/watch?v=Ao-19L0sLOI
SinGAN-Seg: Synthetic Training Data Generation for Medical Image Segmentation
Vajira Thambawita, Pegah Salehi, Sajad Amouei Sheshkal, Steven A. Hicks, Hugo L.Hammer, Sravanthi Parasa, Thomas de Lange, Pål Halvorsen, Michael A. Riegler
Processing medical data to find abnormalities is a time-consuming and costly task, requiring tremendous effort from medical experts. Therefore, AI has become a popular tool for the automatic processing of medical data, acting as a supportive tool for doctors. AI tools depend heavily on data for training the models. However, there are several constraints on access to large amounts of medical data for training machine learning algorithms in the medical domain, e.g., due to privacy concerns and the costly, time-consuming medical data annotation process. To address this, in this paper we present a novel synthetic data generation pipeline, called SinGAN-Seg, to produce synthetic medical data with the corresponding annotated ground truth masks. We show that this synthetic data generation pipeline can be used as an alternative that bypasses privacy concerns and as an alternative way to produce artificial segmentation datasets with corresponding ground truth masks, avoiding the tedious medical data annotation process. As a proof of concept, we used an open polyp segmentation dataset. By training UNet++ using both the real polyp segmentation dataset and the corresponding synthetic dataset generated by the SinGAN-Seg pipeline, we show that the synthetic data can achieve performance very close to the real data when the real segmentation datasets are large enough. In addition, we show that synthetic data generated by the SinGAN-Seg pipeline improve the performance of segmentation algorithms when the training dataset is very small. Since our SinGAN-Seg pipeline is applicable to any medical dataset, it can be used with any other segmentation dataset.
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2107.00471 [eess.IV]
(or arXiv:2107.00471v1 [eess.IV] for this version)
Identification of Disease in Leaves using Genetic Algorithmijtsrd
Plant disease is an impairment of the normal state of a plant that interrupts or modifies its vital functions. Many leaf diseases are caused by pathogens. Agriculture is the mainstay of the Indian economy. The perception of the human eye is not strong enough to observe minute variations in the infected part of a leaf. In this paper, we provide a software solution to automatically detect and classify plant leaf diseases. We use image processing techniques to classify diseases so that diagnosis can be carried out quickly for each disease. This approach will enhance the productivity of crops. It includes image processing techniques starting from image acquisition, preprocessing, testing, and training. K. Beulah Suganthy "Identification of Disease in Leaves using Genetic Algorithm" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-3, April 2019, URL: https://www.ijtsrd.com/papers/ijtsrd22901.pdf
Paper URL: https://www.ijtsrd.com/engineering/electronics-and-communication-engineering/22901/identification-of-disease-in-leaves-using-genetic-algorithm/k-beulah-suganthy
Accounting for variance in machine learning benchmarksDevansh16
Accounting for Variance in Machine Learning Benchmarks
Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent
Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization, and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in the light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator better approaches the ideal estimator, at a 51-fold reduction in compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
A Classification of Cancer Diagnostics based on Microarray Gene Expression Pr...IJTET Journal
Abstract— Pattern Recognition (PR) plays an important role in the field of Bioinformatics. PR is concerned with processing raw measurement data by a computer to arrive at a prediction that can be used to formulate a decision. The important problems to which pattern recognition is applied have in common that they are too complex to model explicitly. Diverse PR methods are used to analyze, segment, and manage high-dimensional microarray gene data for classification. PR is concerned with the development of systems that learn to solve a given problem using a set of instances, each represented by a number of features. Microarray expression technologies make it possible to monitor the expression levels of thousands of genes simultaneously. The large amount of data generated by microarrays has stimulated the development of various computational methods for studying different biological processes by gene expression profiling. Microarray Gene Expression Profiling (MGEP) is important in Bioinformatics; it yields high-dimensional data used in various clinical applications such as cancer diagnostics and drug design. In this work, a new scheme has been developed for the classification of unknown malignant tumors into known classes. The new classification scheme includes the transformation of very high-dimensional microarray data into Mahalanobis space before classification. The eligibility of the proposed classification scheme has been proved on 10 commonly available cancer gene datasets, containing both binary and multiclass data sets. To improve the performance of the classification, a gene selection method is applied to the datasets as a preprocessing and data extraction step.
Controlling informative features for improved accuracy and faster predictions...Damian R. Mingle, MBA
Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current machine learning approaches are either too complex or perform poorly.
For more information:
http://societyofdatascientists.com/controlling-informative-features-for-improved-accuracy-and-faster-predictions-in-omentum-cancer-models/?src=slideshare
Performance enhancement of machine learning algorithm for breast cancer diagn...IJECEIAES
Breast cancer is the most fatal women’s cancer, and accurate diagnosis of this disease in the initial phase is crucial to abate death rates worldwide. The demand for computer-aided disease diagnosis technologies in healthcare is growing significantly to assist physicians in ensuring the effectual treatment of critical diseases. The vital purpose of this study is to analyze and evaluate the classification efficiency of several machine learning algorithms with hyperparameter optimization techniques using grid search and random search to reveal an efficient breast cancer diagnosis approach. Choosing the optimal combination of hyperparameters using hyperparameter optimization for machine learning models has a straight influence on the performance of models. According to the findings of several evaluation studies, the k-nearest neighbor is addressed in this study for effective diagnosis of breast cancer, which got a 100.00% recall value with hyperparameters found utilizing grid search. k-nearest neighbor, logistic regression, and multilayer perceptron obtained 99.42% accuracy after utilizing hyperparameter optimization. All machine learning models showed higher efficiency in breast cancer diagnosis with grid search-based hyperparameter optimization except for XGBoost. Therefore, the evaluation outcomes strongly validate the effectiveness and reliability of the proposed technique for breast cancer diagnosis.
Breast cancer is the leading cause of death for women worldwide. Cancer can be discovered early, lowering the rate of death. Machine learning techniques are a hot field of research, and they have been shown to be helpful in cancer prediction and early detection. The primary purpose of this research is to identify which machine learning algorithms are the most successful in predicting and diagnosing breast cancer, according to five criteria: specificity, sensitivity, precision, accuracy, and F1 score. The project was implemented in the Anaconda environment, using Python's NumPy and SciPy numerical and scientific libraries as well as matplotlib and pandas. In this study, the Wisconsin diagnostic breast cancer dataset was used to evaluate eleven machine learning classifiers: decision tree, quadratic discriminant analysis, AdaBoost, Bagging meta-estimator, extremely randomized trees, Gaussian process classifier, Ridge, Gaussian naive Bayes, k-nearest neighbors, multilayer perceptron, and support vector classifier. During performance analysis, extremely randomized trees outperformed all other classifiers with an F1-score of 96.77% after data collection and data analysis.
Classification of Breast Cancer Diseases using Data Mining Techniquesinventionjournals
Medical data mining has great potential for exploring new knowledge from large amounts of data. Classification is one of the most important data mining techniques. In this research work, we have used various data mining based classification techniques to classify whether or not a patient has a cancer disease. We applied the Breast Cancer-Wisconsin (Original) data set to different data mining techniques and compared the accuracy of the models with two different data partitions. BayesNet achieved the highest accuracy, 97.13%, in the case of 10-fold data partitions. We have also applied the info-gain feature selection technique to BayesNet and the Support Vector Machine (SVM) and achieved the best accuracy, 97.28%, with BayesNet in the case of a 6-feature subset.
Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif...ahmad abdelhafeez
The goal of this paper is to compare different classifiers and multi-classifier fusion with respect to accuracy in discovering breast cancer on four different data sets. We present an implementation of various classification techniques, representing the best-known algorithms in this field, on four different breast cancer datasets: two for diagnosis and two for prognosis. We present a fusion between classifiers to find the best multi-classifier fusion approach for each data set individually. Classification accuracy is obtained from the confusion matrix, built using the 10-fold cross-validation technique, together with fusion by majority voting (the mode of the classifier outputs). The experimental results show that no classification technique is better than the others when used for all datasets, since the classification task is affected by the type of dataset. By using multi-classifier fusion, the results show that accuracy improved in three datasets out of four.
Improving Prediction Accuracy Results by Using Q-Statistic Algorithm in High ...rahulmonikasharma
Classification problems in high-dimensional data with few observations have become more common, particularly in microarray data. The increasing amount of text data on internet sites also affects clustering analysis. Text clustering is a useful analysis technique for partitioning a huge amount of data into clusters; the most important problem affecting the text clustering technique is the presence of uninformative and sparse features in text documents. A broad class of boosting algorithms can be viewed as performing coordinate-wise gradient descent to minimize some potential function of the margins of a data set. This paper proposes a novel evaluation measure, the Q-statistic, that incorporates the stability of the selected feature set in addition to the prediction accuracy. We then propose the Booster of an FS algorithm, which enhances the value of the Q-statistic of the algorithm applied.
SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fu...IJECEIAES
In the classification of many diseases an accurate gene analysis is needed, for which the selection of the most informative genes is very important, requiring a decision technique that works in a complex context of ambiguity. Traditional methods for selecting the most significant genes include statistical analyses such as the 2-sample t-test (2STT), entropy, and the signal-to-noise ratio (SNR). This paper evaluates gene selection and classification on the basis of accurate gene selection using a structured complex decision technique (SCDT) and classifies the result using a fuzzy cluster-based nearest neighbor classifier (FC-NNC). The effectiveness of the proposed SCDT and FC-NNC is evaluated using the leave-one-out cross-validation (LOOCV) metric, along with sensitivity, specificity, precision, and F1-score, against four different classifiers, namely 1) radial basis function (RBF), 2) multi-layer perceptron (MLP), 3) feed forward (FF), and 4) support vector machine (SVM), for three different datasets: DLBCL, Leukemia, and Prostate tumor. The proposed SCDT & FC-NNC exhibits superior results, making it the more accurate decision mechanism.
Inference of Nonlinear Gene Regulatory Networks through Optimized Ensemble of...Arinze Akutekwe
Comprehensive understanding of gene regulatory networks (GRNs) is a major challenge in systems biology. Most methods for modeling and inferring the dynamics of GRNs, such as those based on state space models, vector autoregressive models, and the G1DBN algorithm, assume linear dependencies among genes. However, this strong assumption does not truly represent the time-course relationships across the genes, which are inherently nonlinear. Nonlinear modeling methods such as S-systems and causal structure identification (CSI) have been proposed, but are known to be statistically inefficient and analytically intractable in high dimensions. To overcome these limitations, we propose an optimized ensemble approach based on support vector regression (SVR) and dynamic Bayesian networks (DBNs). The method, called SVR-DBN, uses nonlinear kernels of the SVR to infer the temporal relationships among genes within the DBN framework. The two-stage ensemble is further improved by SVR parameter optimization using Particle Swarm Optimization. Results on eight in silico-generated datasets and two real-world datasets of Drosophila melanogaster and Escherichia coli show that our method outperformed the G1DBN algorithm by a total average accuracy of 12%. We further applied our method to model the time-course relationships of ovarian carcinoma. From our results, four hub genes were discovered. Stratified analysis further showed that the expression levels of the prostate differentiation factor and BTG family member 2 genes were significantly increased by the cisplatin and oxaliplatin platinum drugs, while the expression levels of the Polo-like kinase and Cyclin B1 genes were both decreased by the platinum drugs. These hub genes might be potential biomarkers for ovarian carcinoma.
Multivariate sample similarity measure for feature selection with a resemblan...IJECEIAES
Feature selection improves the classification performance of machine learning models. It also identifies the important features and eliminates those with little significance. Furthermore, feature selection reduces the dimensionality of training and testing data points. This study proposes a feature selection method that uses a multivariate sample similarity measure. The method selects features with significant contributions using a machine-learning model. The multivariate sample similarity measure is evaluated using the University of California, Irvine heart disease dataset and compared with existing feature selection methods. The multivariate sample similarity measure is evaluated with metrics such as minimum subset selected, accuracy, F1-score, and area under the curve (AUC). The results show that the proposed method is able to diagnose chest pain, thallium scan, and major vessels scanned using X-rays with a high capability to distinguish between healthy and heart disease patients with a 99.6% accuracy.
Mining of Important Informative Genes and Classifier Construction for Cancer ...ijsc
Microarray is a useful technique for measuring expression data for thousands of genes simultaneously. One of the challenges in cancer classification using high-dimensional gene expression data is to select a minimal number of relevant genes that can maximize classification accuracy. Because of the distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and robust gene identification methods is fundamental. Many gene selection methods, as well as their corresponding classifiers, have been proposed. In the proposed method, a single gene with high class-discrimination capability is selected, and classification rules are generated for cancer based on gene expression profiles. The method first computes an importance factor for each gene of the experimental cancer dataset by counting the number of linguistic terms (defined in terms of different discrete quantities) with high class-discrimination capability according to their dependent degree of classes. Initial important genes are then selected according to their high importance factors to form an initial reduct. The traditional k-means clustering algorithm is then applied to each selected gene of the initial reduct to compute the misclassification error of each individual gene. The final reduct is formed by selecting the most important genes with the lowest misclassification errors. A classifier is then constructed based on decision rules induced by the selected important (single) genes from the training dataset, to classify cancerous and non-cancerous samples of the experimental test dataset. The proposed method was tested on four publicly available cancerous gene expression test datasets. In most cases, accurate classification outcomes are obtained by just using important (single) genes, and genes that are highly correlated with the pathogenesis of cancer are identified. To prove the robustness of the proposed method, the outcomes (correctly classified instances) are also compared with some existing well-known classifiers.
A CLASSIFICATION MODEL ON TUMOR CANCER DISEASE BASED MUTUAL INFORMATION AND F...Kiogyf
A CLASSIFICATION MODEL ON TUMOR CANCER DISEASE BASED MUTUAL INFORMATION AND FIREFLY ALGORITHM
ABSTRACT
Cancer is a globally recognized cause of death. A proper cancer analysis demands the classification of several types of tumor. Investigations into microarray gene expressions seem to be a successful platform for studying genetic diseases. Although standard machine learning (ML) approaches have been efficient in identifying significant genes and in classifying new types of cancer cases, their medical and practical application has faced several drawbacks, such as the limitations of DNA microarray data analysis, which involves an enormous number of features and a relatively small number of instances. To obtain reasonable and efficient information from a DNA microarray dataset, there is a need to extend the level of interpretability of the forecasting approach while maintaining a high level of precision. In this work, a novel way of classifying cancer based on gene expression profiles is presented. This method is a combination of the Firefly algorithm and the Mutual Information method. First, Mutual Information is used to select the features, before the Firefly algorithm is applied for feature reduction. Finally, the Support Vector Machine is used to classify cancer into types. The performance of the proposed system was evaluated by using it to classify datasets from colon cancer; the results of the evaluation were compared with some recent approaches.
Keywords: Feature Selection, Firefly Algorithm, Cancer Disease, Mutual Information
Similar to CSCI 6505 Machine Learning Project (20)
1. Evaluation of Supervised Learning Algorithms on Gene Expression Data CSCI 6505 – Machine Learning Adan Cosgaya [email_address] Winter 2006 Dalhousie University Machine Learning Prediction
Editor's Notes
(at the end) We used Weka to perform the experiments. We evaluated KNN, NB, DT, and SVM; each has its own strengths and limitations, and it would be difficult to say which one gives the best results, so it is necessary to evaluate them on the same datasets and with common evaluation criteria. In our experiments, we perform comparative studies using the full set of features as well as a subset of them. A DNA microarray is a collection of microscopic DNA spots attached to a solid surface, such as a glass, plastic, or silicon chip, forming an array. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously.
In a classification problem, we are given m training instances and l classes, where the instances consist of n features, and the known class labels C. The goal is to predict the class label for a new given instance. For our problem, we consider the features to be gene expression coefficients, and the instances correspond to patients. Here, n >> m. Overfitting: building models that are very good on the training set but perform poorly on future independent samples. How can we guard against overfitting? Split the data into a training set and a cross-validation set, use the latter to monitor the generalization performance, and stop the training process when overfitting sets in. Finding disease markers (classifiers) from gene expression data with machine learning algorithms carries a high risk of overfitting the data, due to the abundance of attributes (simultaneously measured gene expression values) and the shortage of available examples (observations). DNA microarray experiments on biological samples generate thousands of gene expression measurements; the resulting datasets are highly dimensional and often noisy due to the processes involved in the experiments. This is a challenging problem whose results can be used to diagnose a disease or predict the survival of a patient. The approach taken by this project is to provide comparative results indicating that a small number of instances can be used to create a useful model, and that feature selection improves the classification accuracy.
Golub et al. … their results demonstrate the feasibility of cancer classification based solely on gene expression. A. Rosenwald et al. … for diffuse large-B-cell lymphoma. Furey et al. … their results indicate that SVM is able to classify this kind of data and can be used to identify the presence of a disease. Guyon et al. … their results show an increase in the overall performance of SVM classification with the reduced set of features.
KNN - To classify a given instance I, the algorithm ranks the neighbors of I and uses the class labels of the k most similar neighbors to predict the class of I. After gathering the class labels of the neighbors, a majority vote is taken, and I is assigned the class label with the greatest number of votes among the k nearest neighbors. The best choice of k depends on the dataset.
NB - The training phase consists of calculating the conditional probability P(x|c) of an instance given a class label, and the prior probability P(c) of the class. To classify an unseen instance, the posterior probability of each class given the instance is calculated, and the instance is assigned the class with the highest probability.
DT - The algorithm builds a tree from a training dataset: it recursively partitions the set by choosing an attribute and creating a separate branch for each value of the chosen attribute. The best attribute to split on is the one with the highest information gain, i.e., the lowest entropy. To classify an instance, the method starts at the root node, tests the attribute specified by the node, then moves down the branch corresponding to the value of that attribute in the given instance. This process is repeated for the subtree rooted at the new node until a leaf is encountered, and the instance is labeled with the class indicated by the leaf.
SVM - The support vector machine (SVM) method finds a linear discriminant, called a hyperplane, that separates the classes in a given dataset. The best hyperplane is the one that keeps the maximum separation between the classes in order to better generalize the model, so we look for the maximum-margin hyperplane.
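The KNN voting procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration (the experiments themselves used Weka); the Euclidean distance metric and the tiny toy expression vectors are assumptions made for the example.

```python
from collections import Counter
import math

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training instances."""
    # Rank all training instances by Euclidean distance to the query.
    dists = sorted((math.dist(x, query), c) for x, c in zip(train, labels))
    # Majority vote over the class labels of the k closest neighbors.
    votes = Counter(c for _, c in dists[:k])
    return votes.most_common(1)[0][0]

# Toy "expression profiles": two instances per class (hypothetical values).
train = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
labels = ["ALL", "ALL", "AML", "AML"]
print(knn_predict(train, labels, (0.85, 0.85), k=3))  # AML
```

With k=3 the two nearby AML instances outvote the single nearest ALL instance, which is the behavior the note describes: the best k is data-dependent.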
The datasets used for this evaluation were obtained from the Kent Ridge Biomedical Data Set Repository. They correspond to gene expression data obtained from DNA microarrays. Leukemia dataset: the gene expression data were taken from bone marrow samples and blood samples. Diffuse large-B-cell lymphoma (DLBCL) dataset: this dataset consists of biopsy samples of 240 patients that were examined for gene expression with DNA microarrays. The number of microarray features is 7399, and each sample belongs to one of two classes, Alive or Dead. The two classes correspond to the prediction of survival after chemotherapy for diffuse large-B-cell lymphoma.
FEATURE SELECTION Due to the high-dimensional nature of this type of data, we chose a smaller set of features from the original feature set. Another reason to perform feature selection is that having a number of features much greater than the number of instances increases the potential problem of overfitting. TESTING METHODOLOGY We divided both datasets with different train/test ratios (66/34, 80/20, and 90/10) and averaged over the results (macroaveraging). However, given that our datasets are small, we also evaluated accuracy using 10-fold cross-validation. The major advantage of cross-validation is that all the cases in the dataset are used for testing and nearly all the cases are used for training the classifier, so this resampling technique can provide a good estimate of the accuracy.
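A rough sketch of this protocol: rank features by an information-based score and estimate accuracy with 10-fold cross-validation. Mutual information stands in here for the gain ratio criterion mentioned later, since scikit-learn does not ship a gain ratio scorer; the sample sizes and k = 50 selected features are illustrative assumptions.

```python
# Sketch of the evaluation protocol: information-based feature ranking
# (mutual information as a stand-in for gain ratio) + 10-fold CV.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Many more features than instances, as in microarray data.
X, y = make_classification(n_samples=100, n_features=500, n_informative=20,
                           random_state=0)

# Keep only the 50 best-ranked features.
X_reduced = SelectKBest(mutual_info_classif, k=50).fit_transform(X, y)

scores = cross_val_score(GaussianNB(), X_reduced, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f}")
```

Note that selecting features on the full dataset before cross-validating, as done here for brevity, leaks test information into the selection step; wrapping `SelectKBest` and the classifier in a `Pipeline` inside `cross_val_score` would avoid that.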
The classification of the data corresponds to a binary classification task: we want to determine whether a patient is alive or dead, or which of two types of leukemia a patient has. However, using only accuracy can yield misleadingly overoptimistic estimates, which is why, to evaluate the performance of the classification algorithms, we also use precision, recall, and F-measure. Precision is the proportion of instances that actually have class C among all those classified as class C. Recall is the proportion of instances classified as class C among all instances that truly have class C, i.e., how much of the class was captured. In order to give equal importance to each class, we average the values of precision, recall, and F-measure obtained for each class C. Since the classes are almost evenly represented in the training samples, we can also trust accuracy as a measure of performance.
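The per-class averaging described above corresponds to macro-averaging in scikit-learn's metrics. A small illustration on toy binary labels (not the paper's data):

```python
# Macro-averaged precision, recall, and F-measure: each metric is computed
# per class and then averaged, giving both classes equal weight.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 1, 1, 1, 1, 1]   # ground-truth labels
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]   # classifier predictions

p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
f = f1_score(y_true, y_pred, average="macro")
print(f"precision={p:.3f} recall={r:.3f} F-measure={f:.3f}")
```

With these toy labels, class 0 has precision and recall 2/3 and class 1 has 4/5, so all three macro-averages come out to 11/15 ≈ 0.733.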
For both datasets there is an intuitive agreement between the evaluation over an independent test set and cross-validation; however, the cross-validation results are lower, most likely because cross-validation uses nearly all the data for both training and testing, giving a more realistic estimate. In the Leukemia dataset, the classification accuracies under both evaluation methods are remarkably high, as there are features that almost completely determine the class, and the Naive Bayes and SVM algorithms tend to slightly outperform KNN and DT. For SVM, this is because the classes are linearly separable; for NB, the success of its feature-independence assumption suggests that at least some features determine the class almost completely, despite possible redundant or noisy features. For the DLBCL dataset, the accuracy is significantly lower for all algorithms, with KNN (66.92% and 62.91%) being the best classifier. Decision Trees gave the lowest accuracy, likely due to the large number of features involved. Surprisingly, KNN outperforms SVM on DLBCL and almost matches it on Leukemia.
We must point out that reducing the dimensionality to the best-ranked features increases the accuracy compared with using the full set of features. The results obtained from the independent test set evaluations and from cross-validation still intuitively agree, with the cross-validation measures again slightly lower. For the Leukemia dataset, the reduced dimensionality brought a slight increase in overall accuracy, indicating that this dataset can be described to a high degree of accuracy by a reduced number of features. For the DLBCL dataset, feature selection significantly increased the overall performance of all the algorithms, with Naive Bayes (78.84% and 70.83%) and SVM (75.37% and 71.25%) achieving the highest accuracies.
Since cross-validation gives a more realistic view of the algorithms' behavior, the table summarizes the best performance of each classifier, with and without feature selection, in terms of 10-fold cross-validation. The figure shows the variation of the F-measure for each algorithm on both datasets, reinforcing the observation that SVM outperforms the rest. It is interesting to note that the measures are consistent across all the algorithms within each dataset: for example, Leukemia with all features lies in the range [0.847, 0.985], while DLBCL with feature selection lies in the range [0.612, 0.706].
Performance depends … This is confirmed by the remarkably high results obtained with the Leukemia dataset, which drop dramatically on the DLBCL data. Feature selection … No matter which algorithm is used, all of them benefit from feature selection, which increases performance. This is especially important for algorithms such as KNN, where distances must be computed over the features. The use of an information-gain-based method such as gain ratio seems to preserve the underlying correlation between the selected features and the class labels. SVM … As initially suspected, SVM classification gave the best results; however, despite performing well with high-dimensional data, we have shown that SVM can also benefit from reducing the dimensionality with feature selection. Decision Trees … It is widely known that they do not behave well with high-dimensional and noisy datasets.
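The gain ratio score mentioned above is the information gain of a feature normalized by the entropy of the feature's own value distribution, which penalizes many-valued attributes. A minimal from-scratch sketch for a single discrete feature, using the standard formulas (the toy feature/label vectors are made up for illustration):

```python
# Gain ratio for one discrete feature:
#   gain_ratio = (H(labels) - H(labels | feature)) / H(feature)
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of a discrete label/value sequence, in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(feature, labels):
    n = len(labels)
    # Partition labels by feature value and compute conditional entropy.
    groups = {}
    for f, c in zip(feature, labels):
        groups.setdefault(f, []).append(c)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - cond
    split_info = entropy(feature)  # entropy of the feature's distribution
    return info_gain / split_info if split_info else 0.0

feature = [0, 0, 1, 1, 1, 0]  # toy binary feature
labels  = [0, 0, 1, 1, 0, 0]  # toy class labels
print(f"gain ratio: {gain_ratio(feature, labels):.3f}")
```

Ranking all features by this score and keeping the top-ranked ones is the selection scheme the results describe.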
Surprisingly, KNN … Its relatively strong performance makes it a good baseline choice when applied to gene expression data. The DLBCL dataset … The reason for the low results might be that predicting whether a patient is dead or alive a certain time after chemotherapy involves other circumstances, such as the patient's living environment and care, which cannot be numerically measured yet do affect the final prediction.
While our results indicate that SVM, by its very nature, deals well with high-dimensional gene expression data, we have shown that other methods work surprisingly well too. The datasets used contain relatively few instances and do not allow any one method to demonstrate absolute superiority. We have also shown that there is no single approach that works well in all situations, and the use of one algorithm over the others should be evaluated on a case-by-case basis.
Knowing that data transformation methods destroy the underlying meaning of the feature set, it would be interesting to see whether algorithms such as SVM, and Naive Bayes, which assumes feature independence, benefit from such transformations. Another direction for future research is the statistical analysis of the effect of noisy gene expression data on the reliability of the classifier. This is interesting because the methods used to obtain this type of data can be subject to "noise"; it is therefore crucial to determine the effects of noise on the results and to assess the robustness of an algorithm in the presence of noisy measurements or mislabeled classes. Finally, more experiments with other datasets should be performed before drawing final conclusions.
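The mislabeled-class robustness study proposed above could be prototyped as follows. This is a hypothetical experiment sketch, not something the paper ran: a fraction of binary labels is flipped at random and the 10-fold cross-validated accuracy is tracked as the noise level grows (synthetic data and noise levels are assumptions).

```python
# Hypothetical robustness probe: flip a fraction of the labels and watch
# how cross-validated accuracy degrades.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
rng = np.random.default_rng(0)

for noise in (0.0, 0.1, 0.2):
    y_noisy = y.copy()
    # Flip `noise` fraction of the binary labels at random positions.
    flip = rng.choice(len(y), size=int(noise * len(y)), replace=False)
    y_noisy[flip] = 1 - y_noisy[flip]
    acc = cross_val_score(SVC(kernel="linear"), X, y_noisy, cv=10).mean()
    print(f"label noise {noise:.0%}: mean CV accuracy {acc:.3f}")
```

A fuller study would repeat each noise level over many random seeds and compare the degradation curves across classifiers.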