A Comprehensive Evaluation of Multicategory Classification Methods for Microarray Gene Expression Cancer Diagnosis

Presented by: Renikko Alleyne
Outline
•   Motivation
•   Major Concerns
•   Methods
     – SVMs
     – Non-SVMs
     – Ensemble Classification
•   Datasets
•   Experimental Design
•   Gene Selection
•   Performance Metrics
•   Overall Design
•   Results
•   Discussion & Limitations
•   Contributions
•   Conclusions
Why?

Clinical applications of gene expression microarray technology:
• Gene Discovery
• Disease Diagnosis (cancer, infectious diseases)
• Drug Discovery
• Prediction of clinical outcomes in response to treatment
GEMS (Gene Expression Model Selector)

• Takes microarray data and supports the creation of powerful and reliable cancer diagnostic models.
• Equipped with the best classifier, gene selection, and cross-validation methods.
• Built on an evaluation of major algorithms for multicategory classification, gene selection methods, ensemble classifier methods, and 2 cross-validation designs.
• Evaluation uses 11 datasets spanning 74 diagnostic categories, 41 cancer types and 12 normal tissue types.
Major Concerns
• Prior studies conducted limited experiments in terms of the number of classifiers, gene selection algorithms, datasets and types of cancer involved.

• From those studies it cannot be determined which classifier performs best.

• The best combinations of classification and gene selection algorithms across most array-based cancer datasets are poorly understood.

• Overfitting.

• Underfitting.
Goals for the Development of an Automated
System that creates high-quality diagnostic
models for use in clinical applications

• Which classifier currently available for gene expression diagnosis performs best across many cancer types

• How classifiers interact with existing gene selection methods in datasets with varying sample sizes, numbers of genes and cancer types

• Whether diagnostic performance can be increased further using meta-learning in the form of ensemble classification

• How to parameterize the classifiers and gene selection procedures to avoid overfitting
Why use Support Vector Machines
             (SVMs)?
• Achieve superior classification performance
  compared to other learning algorithms

• Fairly insensitive to the curse of dimensionality

• Efficient enough to handle very large-scale classification problems, in terms of both samples and variables
How SVMs Work




•   Objects in the input space are mapped using a set of mathematical
    functions (kernels).

•   The mapped objects in the feature (transformed) space are linearly
    separable, and instead of drawing a complex curve, an optimal line
    (maximum-margin hyperplane) can be found to separate the two
    classes.
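As a concrete illustration of the kernel idea, the following is a minimal sketch using scikit-learn's SVC on toy concentric-circle data (the data, kernels and Python/scikit-learn tooling are illustrative assumptions, not the paper's MATLAB setup): data that is not linearly separable in the input space becomes separable after an implicit RBF-kernel mapping.

# Illustrative sketch only: compare a linear kernel with an RBF kernel
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)  # poor: no separating line exists in the input space
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)        # high: linearly separable in the feature space
print(f"linear accuracy {linear_acc:.2f}, RBF accuracy {rbf_acc:.2f}")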
SVM Classification Methods

• Binary SVMs
• Multiclass SVMs: OVR, OVO, DAGSVM, WW, CS
Binary SVMs

• Main idea is to identify the maximum-margin hyperplane that separates the training instances.

• Selects the hyperplane that maximizes the width of the gap (margin) between the two classes.

• The hyperplane is specified by a subset of the training instances called support vectors.

• New instances are classified according to the side of the hyperplane on which they fall (a minimal sketch follows below).
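A minimal scikit-learn sketch of a binary maximum-margin classifier on hypothetical 2-D data (illustrative only; the study used its own SVM implementations):

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])  # two toy classes
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors:", len(clf.support_vectors_))  # the points that specify the hyperplane
print("predicted class:", clf.predict([[1.5, 1.5]]))             # side of the hyperplane decides the label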
1. Multiclass SVMs: one-versus-rest (OVR)

                      •   Simplest MC-SVM

                      •   Construct k binary SVM
                          classifiers:
                          – Each class (positive) vs all
                            other classes (negatives).

                      •   Computationally Expensive
                          because there are k quadratic
                          programming (QP) optimization
                          problems of size n to solve.
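A hand-rolled one-versus-rest sketch (hypothetical helper functions for illustration; in practice a library implementation would be used):

import numpy as np
from sklearn.svm import SVC

def ovr_fit(X, y):
    # one binary SVM per class: that class (positive) vs. all other classes (negative)
    classes = np.unique(y)
    models = {c: SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in classes}
    return classes, models

def ovr_predict(classes, models, X):
    # assign each sample to the class whose binary SVM is most confident
    scores = np.column_stack([models[c].decision_function(X) for c in classes])
    return classes[np.argmax(scores, axis=1)]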
2. Multiclass SVMs: one-versus-one (OVO)

                     • Involves construction of
                       binary SVM classifiers for all
                       pairs of classes

                     • A decision function assigns
                       an instance to a class that
                       has the largest number of
                       votes (Max Wins strategy)

                     • Computationally less
                       expensive
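A one-versus-one sketch with Max Wins voting (hypothetical helpers, illustration only):

import itertools
import numpy as np
from sklearn.svm import SVC

def ovo_fit(X, y):
    # one binary SVM for every pair of classes
    models = {}
    for a, b in itertools.combinations(np.unique(y), 2):
        mask = np.isin(y, [a, b])
        models[(a, b)] = SVC(kernel="linear").fit(X[mask], y[mask])
    return models

def ovo_predict(models, X, classes):
    # Max Wins: each pairwise SVM casts a vote; the class with the most votes is chosen
    index = {c: i for i, c in enumerate(classes)}
    votes = np.zeros((len(X), len(classes)), dtype=int)
    for model in models.values():
        for i, winner in enumerate(model.predict(X)):
            votes[i, index[winner]] += 1
    return np.asarray(classes)[votes.argmax(axis=1)]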
3. Multiclass SVMs: DAGSVM

                    •   Constructs a decision tree

                    •   Each node is a binary SVM for
                        a pair of classes

                    •   k leaves: k classification
                        decisions

                    •   Non-leaf (p, q): two edges
                         – Left edge: not p decision
                         – Right edge: not q decision
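A DAGSVM prediction sketch that reuses the pairwise models from the one-versus-one step; each node rules out one class until a single decision remains (the node ordering is an assumed convention, illustration only):

import numpy as np

def dag_predict_one(models, x, classes):
    # walk the DAG: at node (p, q) the binary SVM takes a "not p" or "not q" decision
    remaining = list(classes)
    while len(remaining) > 1:
        p, q = remaining[0], remaining[-1]
        key = (p, q) if (p, q) in models else (q, p)
        winner = models[key].predict(x.reshape(1, -1))[0]
        remaining.remove(q if winner == p else p)  # eliminate the losing class
    return remaining[0]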
4 & 5. Multiclass SVMs: Weston & Watkins
(WW) and Crammer & Singer (CS)

• Each constructs a single classifier by maximizing the margin between all the classes simultaneously.

• Both require the solution of a single QP problem of size (k-1)n, but the CS MC-SVM uses fewer slack variables in the constraints of the optimization problem, making it computationally less expensive (see the sketch below).
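The Crammer & Singer single-machine formulation is exposed in scikit-learn's LinearSVC; a Weston & Watkins solver would need a dedicated package. Minimal sketch, with hypothetical X_train/y_train:

from sklearn.svm import LinearSVC

cs_svm = LinearSVC(multi_class="crammer_singer", C=1.0, max_iter=10000)
# cs_svm.fit(X_train, y_train)   # a single QP over all classes at once
# cs_svm.predict(X_test)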
Non-SVM Classification Methods

• K-Nearest Neighbors (KNN)
• Backpropagation Neural Networks (NN)
• Probabilistic Neural Networks (PNN)
K-Nearest Neighbors (KNN)

• For each case to be classified, locate the k closest members of the training dataset.

• A Euclidean distance measure is used to calculate the distance between the training dataset members and the target case.

• The weighted sum of the variable of interest is found for the k nearest neighbors.

• The procedure is repeated for the other target cases (a sketch follows below).
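A KNN sketch matching the description above, using Euclidean distance with distance-weighted voting (k and the data are hypothetical):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean", weights="distance")
# knn.fit(X_train, y_train)
# knn.predict(X_test)   # each target case takes the distance-weighted vote of its 5 nearest training samples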
Backpropagation Neural Networks (NN) &
      Probabilistic Neural Networks (PNNs)

•   Back Propagation Neural Networks:
     – Feed forward neural networks with
       signals propagated forward through
       the layers of units.

     – The unit connections have weights
       which are adjusted when there is an
       error, by the backpropagation
       learning algorithm.

•   Probabilistic Neural Networks:
     – Design similar to NNs except that the
       hidden layer is made up of a
       competitive layer and a pattern layer
       and the unit connections do not have
       weights.
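A minimal feed-forward network trained by backpropagation, via scikit-learn's MLPClassifier (a stand-in for illustration; PNNs have no scikit-learn implementation and would need separate code):

from sklearn.neural_network import MLPClassifier

nn = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0)
# nn.fit(X_train, y_train)   # connection weights adjusted by backpropagation of the error
# nn.predict(X_test)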
Ensemble Classification Methods

In order to improve performance, the outputs of N base classifiers (Classifier 1, Classifier 2, ..., Classifier N producing Output 1, Output 2, ..., Output N) are combined into an ensembled classifier (a sketch follows below).

Combination techniques: majority voting, decision trees, MC-SVM (OVR, OVO, DAGSVM).
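A majority-voting ensemble sketch over hypothetical base learners (the paper's ensembles combine the MC-SVM and non-SVM classifiers evaluated above; this is not its exact setup):

from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear")),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("nn", MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000)),
    ],
    voting="hard",  # each base classifier casts one vote; the majority label wins
)
# ensemble.fit(X_train, y_train); ensemble.predict(X_test)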
Datasets & Data Preparatory Steps

• Nine multicategory cancer diagnosis datasets

• Two binary cancer diagnosis datasets

• All datasets were produced by oligonucleotide-based
  technology

• The oligonucleotides or genes with absent calls in all samples
  were excluded from analysis to reduce any noise.
Datasets
Experimental Designs

                 •   Two Experimental Designs to obtain
                     reliable performance estimates and
                     avoid overfitting.

                 •   Data split into mutually exclusive
                     sets.

• Outer loop estimates performance by training on all splits but one; the remaining split is used for testing.

• Inner loop determines the best parameters of the classifier.
Experimental Designs

• Design I uses stratified 10-fold cross-validation in both loops, while Design II uses 10-fold cross-validation in its inner loop and leave-one-out cross-validation in its outer loop.

• Building the final diagnostic model involves:
   – Finding the best parameters for the classifier using a single loop of cross-validation
   – Building the classifier on all data using the previously found best parameters
   – Estimating a conservative bound on the classifier’s accuracy using either design (a nested cross-validation sketch follows below)
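A nested cross-validation sketch in the spirit of Design I (10-fold in both loops); swapping the outer splitter for LeaveOneOut would approximate Design II. The data and parameter grid are hypothetical:

from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # tunes the classifier's parameters
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)   # estimates performance on held-out folds

tuned_svm = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=inner)
# scores = cross_val_score(tuned_svm, X, y, cv=outer)   # conservative outer-loop performance estimate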
Gene Selection

Gene selection methods:
• Ratio of genes' between-categories to within-category sum of squares (BW)
• Signal-to-noise scores (S2N), applied one-versus-rest (S2N-OVR) and one-versus-one (S2N-OVO)
• Kruskal-Wallis non-parametric one-way ANOVA (KW)
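A gene-ranking sketch using the Kruskal-Wallis criterion (one of the methods above); genes with the smallest p-values are retained. The expression matrix X (samples × genes) and the cutoff are hypothetical:

import numpy as np
from scipy.stats import kruskal

def kw_rank_genes(X, y, n_keep=50):
    classes = np.unique(y)
    # Kruskal-Wallis p-value for each gene across the diagnostic categories
    pvals = np.array([
        kruskal(*[X[y == c, j] for c in classes]).pvalue
        for j in range(X.shape[1])
    ])
    return np.argsort(pvals)[:n_keep]   # column indices of the top-ranked genes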
Performance Metrics

• Accuracy
   –   Easy to interpret
   –   Simplifies statistical testing
   –   Sensitive to prior class probabilities
   –   Does not describe the actual difficulty of the decision problem
       for unbalanced distributions

• Relative classifier information (RCI)
   – Corrects for the differences in:
        • Prior probabilities of the diagnostic categories
        • Number of categories
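A sketch of RCI under a common definition, mutual information between true and predicted labels normalized by the entropy of the true class distribution; this definition is an assumption here, not a formula quoted from the paper (integer-encoded labels assumed):

import numpy as np
from sklearn.metrics import mutual_info_score

def rci(y_true, y_pred):
    counts = np.bincount(y_true)
    p = counts[counts > 0] / counts.sum()
    class_entropy = -(p * np.log(p)).sum()                    # entropy of the diagnostic categories
    return mutual_info_score(y_true, y_pred) / class_entropy  # 0 = uninformative, 1 = perfect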
Overall Research Design

Stage 1: Conducted a factorial design involving datasets and classifiers without gene selection.

Stage 2: Conducted a factorial design with gene selection, using the datasets for which the full gene sets yielded poor performance.

2.6 million diagnostic models were generated.

One model was selected for each combination of algorithm and dataset.
Statistical Comparison among Classifiers

To test that differences between the best method and the other methods are non-random:

• Null hypothesis (H0): classification algorithm X is as good as Y.

• Obtain the permutation distribution of the difference ∆XY by repeatedly rearranging the outcomes of X and Y at random.

• Compute the p-value of ∆XY being greater than or equal to the observed difference over 10,000 permutations (a sketch follows below).

• If p < 0.05: reject H0; algorithm X is not as good as Y in terms of classification accuracy.

• If p > 0.05: accept H0; algorithm X is as good as Y in terms of classification accuracy.
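A permutation-test sketch for comparing two classifiers: per-sample correctness indicators for X and Y are swapped at random to build the null distribution of the accuracy difference (the input arrays are hypothetical):

import numpy as np

def permutation_pvalue(correct_x, correct_y, n_perm=10000, seed=0):
    rng = np.random.RandomState(seed)
    observed = correct_x.mean() - correct_y.mean()
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        swap = rng.rand(len(correct_x)) < 0.5          # randomly rearrange the two methods' outcomes
        px = np.where(swap, correct_y, correct_x)
        py = np.where(swap, correct_x, correct_y)
        diffs[i] = px.mean() - py.mean()
    return (diffs >= observed).mean()                   # probability of a difference at least this large under H0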
Performance Results (Accuracies) without Gene Selection Using Design I

Performance Results (RCI) without Gene Selection Using Design I
Total Time of Classification Experiments w/o gene
selection for all 11 datasets and two experimental
designs

                            •   Executed in a Matlab R13
                                environment on 8 dual-CPU
                                workstations connected in a
                                cluster.

                            •   Fastest MC-SVMs: WW & CS
                            •   Fastest overall algorithm: KNN

                            •   Slowest MC-SVM: OVR
                            •   Slowest overall algorithms: NN
                                and PNN
Performance Results (Accuracies) with Gene Selection Using Design I

Improvement by gene selection: the 4 gene selection methods were applied to the 4 most challenging datasets.
Performance Results (RCI) with Gene Selection Using Design I

Improvement by gene selection: the 4 gene selection methods were applied to the 4 most challenging datasets.
Discussion & Limitations
• Limitations:
   – Use of the two performance metrics
   – Choice of KNN, PNN and NN classifiers


• Future Research:
   – Improve existing gene selection procedures by selecting the optimal number of genes via cross-validation
   – Apply multivariate Markov blanket and local neighborhood algorithms
   – Extend comparisons with more MC-SVMs as they become available
   – Update the GEMS system to make it more user-friendly
Contributions of Study
• Conducted the most comprehensive systematic evaluation to
  date of multicategory diagnosis algorithms applied to the
  majority of multicategory cancer-related gene expression
  human datasets.

• Creation of the GEMS system that automates the experimental
  procedures in the study in order to:
   – Develop optimal classification models for the domain of cancer
     diagnosis with microarray gene expression data.
   – Estimate their performance in future patients.
Conclusions
• MC-SVMs are the best family of algorithms for these types of data and medical tasks; they outperform non-SVM machine learning techniques.

• Among MC-SVM methods, OVR, CS and WW are the best with respect to classification performance.

• Gene selection can improve the performance of both MC-SVM and non-SVM methods.

• Ensemble classification does not further improve the classification performance of the best MC-SVM methods.


Editor's Notes

  1. Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology.
  2. Most real life diagnostic tasks are not binary