Hierarchical information representation
and efficient classification
of gene expression microarray data
PhD candidate:
Mattia Bosio
Advisors:
Philippe Salembier
Albert Oliveras Vergés
27/06/2014 Mattia Bosio PhD thesis defense 1
Thesis objective
Develop algorithms for microarray
classification
–Predictive performance
–Results stability
–Biological interpretability
Roadmap
1- Microarrays
2- Challenges & Opportunities
3- Contributions
4- How did we get there?
5- Conclusions
1- Microarrays
A platform to measure gene expression
• Give a picture of the whole cellular state
• Thousands of parallel measures
• Measure how much each gene is being used
• Can be used to discriminate between
populations
Microarrays: what do they measure
Microarrays: what do they look like
45’000 ‘Genes’ × 72 Samples
2- CHALLENGES &
OPPORTUNITIES
Challenges
Lack of structure
Noise
Sample size vs dimensions
45’000 ‘Genes’ × 72 Samples
Opportunities
• Established tool for research, but no optimal
algorithm yet for classification
• Machine learning has already been used
– Good results that can be improved
• Signal processing has dealt with similar problems
3- CONTRIBUTIONS
Two-step classification framework
[Diagram] Genes → Feature set enhancement → Metagenes → Feature selection → Classifier → Class estimations
Inputs: Train data, Validation data
Contributions:
1. Metagenes
2. IFFS
3. Ensemble
4. Knowledge integration
5. Multiclass algorithm
4- HOW DID WE GET THERE?
4.1 FEATURE SET ENHANCEMENT
A structure is inferred from the data and new metagenes are created.
Feature set enhancement
Addresses Noise and Lack of structure
• A binary tree is inferred
• Each node is a new feature
• New features are called
metagenes
• Metagenes reduce noise by
clustering similar genes
Feature set enhancement
The iterative process of metagene generation
• Iterative process based on
Treelets [1]
• The two most similar features
are substituted by a metagene
• Two key elements:
– Similarity Metric
– Metagene generation algorithm
[1] A. B. Lee, B. Nadler, L. Wasserman, Treelets - an adaptive multi-scale basis
for sparse unordered data, Annals of Applied Statistics 2 (2) (2008) 435–471.
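As a rough illustration of the iterative merging described on this slide, the Python sketch below repeatedly replaces the two most similar features with a metagene and records the resulting binary tree. It assumes Pearson correlation as the similarity metric and the first principal component of the pair (a local PCA, in the spirit of Treelets [1]) as the metagene. Both choices, and the function name `build_metagenes`, are assumptions made for illustration and are not necessarily the ones used in the thesis.

```python
import numpy as np

def build_metagenes(X, n_merges):
    """X: array (n_features, n_samples). Returns (merges, remaining features).

    Each merge is (left_id, right_id, metagene_id): one node of the binary tree.
    Constant features are not handled (their correlation is undefined).
    """
    features = {i: np.asarray(row, dtype=float) for i, row in enumerate(X)}
    next_id = len(features)
    merges = []
    for _ in range(n_merges):
        ids = list(features)
        if len(ids) < 2:
            break
        # Similarity metric (assumed): Pearson correlation between profiles.
        corr = np.corrcoef(np.vstack([features[i] for i in ids]))
        np.fill_diagonal(corr, -np.inf)  # ignore self-similarity
        a, b = np.unravel_index(np.argmax(corr), corr.shape)
        left, right = ids[a], ids[b]
        # Metagene generation (assumed): first principal component of the pair.
        pair = np.vstack([features[left], features[right]])
        _, vecs = np.linalg.eigh(np.cov(pair))
        pc = vecs[:, -1]          # eigenvector of the largest eigenvalue
        if pc.sum() < 0:          # fix the sign so the metagene follows its genes
            pc = -pc
        features[next_id] = pc @ pair
        del features[left], features[right]
        merges.append((left, right, next_id))
        next_id += 1
    return merges, features
```

With p initial genes, running p − 1 merges yields a full binary tree whose internal nodes are the metagenes.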
4.2 FEATURE SELECTION: IFFS
How to select the right features to discriminate between classes with an iterative, wrapper
algorithm
IFFS:Find the few best features to classify
• “Improved Sequential Floating Forward
Selection (IFFS)” [2]:
– Sequential, deterministic wrapper algorithm
• Flexible method: at each iteration, decide whether to
add, delete or substitute a feature
• Alternatives are compared by a J(·) score
[2] S. Nakariyakul, D. Casasent, An improvement on floating search algorithms for feature subset selection, Pattern
Recognition.
IFFS:Find the few best features to classify
Deterministic sequential wrapper algorithm
• All the decisions determined by a J(·) score
• Usually J(·) is an error rate estimation
– Ties are frequent due to the sample scarcity
J(·) score tailored for microarrays
Reliability measure to break ties in J(·)
Three rules to define the score
combining error rate and reliability:
1. Lexicographic sorting
2. Exponential penalization
3. Linear combination
J(·) score depends on 2 parameters:
1. Error rate
2. Reliability
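The first of the three rules, lexicographic sorting, can be sketched in a few lines of Python: candidates are compared on error rate first, and reliability is consulted only to break ties. The `Candidate` type, its field names and the example values are hypothetical. The slide does not give formulas for the exponential penalization and linear combination rules, so they are not sketched here.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    error_rate: float   # lower is better
    reliability: float  # higher is better

def lexicographic_best(candidates):
    """Pick the lowest error rate; among ties, the highest reliability."""
    return min(candidates, key=lambda c: (c.error_rate, -c.reliability))

candidates = [
    Candidate("A", 0.10, 0.2),
    Candidate("B", 0.10, 0.7),  # ties with A on error, wins on reliability
    Candidate("C", 0.15, 0.9),  # most reliable, but higher error
]
best = lexicographic_best(candidates)
```

With few samples, many candidate subsets share the same error estimate, which is why the secondary key matters in practice.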
IFFS: Experimental setup
• Datasets from MAQC study phase II [4]
• 7 datasets with hundreds of samples
– 30.000+ models evaluated
– Independent validation sets available
– Common evaluation procedure
[4] L. Shi, et al., The microarray quality control (MAQC)-II study of common practices for the development and
validation of microarray-based predictive models., Nature biotechnology 28 (2010) 827-38.
IFFS: experiment objectives
• Evaluate if metagenes are useful
• Benchmark with state of the art
• Comparison following MAQC standard:
Matthews Correlation Coefficient
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
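For reference, the Matthews Correlation Coefficient can be computed directly from the four confusion-matrix counts. The sketch below follows the standard definition, with a square root in the denominator. Returning 0 when the denominator vanishes is a common convention and is an assumption here, not something stated on the slide.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom
```

The value ranges from −1 (every prediction wrong) to +1 (every prediction right), and unlike plain accuracy it is not inflated by class imbalance.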
Results: Metagenes are useful
• Introducing metagenes gives better results
The proposed framework improves state
of the art results
[Figure: bar chart of MCC results; bar values 0.423, 0.486, 0.495, 0.490]
Observations
• The proposed framework works with both its
key elements
• Metagenes are useful (contrib #1)
• IFFS adapted to microarrays improves the state
of the art (contrib #2)
4.3 FEATURE SELECTION: ENSEMBLE
How to select the right features to discriminate between classes with a novel ensemble
learning algorithm
Ensemble learning - voting scheme
• Ensembles combine experts with a voting
scheme
• One expert for each available feature
– Expert = Trained Classifier output on analyzed data
– 1 Expert = 1 feature
• The feature selection becomes an Expert
subset selection problem
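A minimal Python sketch of the voting idea: each row holds the binary class labels predicted by one expert (one feature), and the ensemble output is the class chosen by most experts. The slide does not specify the voting scheme, so the unweighted majority vote and the tie rule used here are assumptions.

```python
import numpy as np

def majority_vote(expert_outputs):
    """expert_outputs: array (n_experts, n_samples) of 0/1 labels.

    Returns one 0/1 label per sample; exact ties go to class 0 (assumed rule).
    """
    votes = np.asarray(expert_outputs)
    return (votes.mean(axis=0) > 0.5).astype(int)

outputs = [[1, 0, 1, 0],   # expert built on feature 1
           [1, 0, 0, 0],   # expert built on feature 2
           [1, 1, 1, 0]]   # expert built on feature 3
ensemble_labels = majority_vote(outputs)
```

Selecting features then amounts to choosing which rows of this matrix are allowed to vote.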
Accuracy In Diversity [7]
the original algorithm
• Starts with p experts : One for each feature
• Sequentially removes the expert with worst
error rate on a subset S
• A simpler version, the Kun algorithm, is
defined in [6]
[6] L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. Wiley, 2004.
[7] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, “A new ensemble diversity measure applied to
thinning ensembles,” in Multiple Classifier Systems, ser. Lecture Notes in Computer Science, T. Windeatt and F. Roli,
Eds., vol. 2709. Springer, 2003, pp. 306–316.
Accuracy In Diversity
the original algorithm
• PCDM (d) = % of experts correctly classifying sample i
• S set formed of samples with 𝑙𝑏 ≤ 𝑑 ≤ 𝑈𝑏
• The expert with worst error rate on S is excluded
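[Figure: experts × samples vote matrix with the PCDM value of each sample (90%, 50%, 80%, 100%, 100%)]
Bounds defining S:
AID: $lb = \mu \cdot d + \frac{1 - d}{n}$, $Ub = \alpha \cdot d + \mu (1 - d)$
Kun: $lb = 10\%$, $Ub = 90\%$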
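The thinning loop on this slide can be sketched as follows, using the fixed Kun bounds (10% and 90%). The input is a boolean matrix saying which expert classifies which sample correctly. Stopping at a target ensemble size `n_keep`, the function name and breaking ties by position are assumptions for illustration; the tree-based tie-break used in the thesis is not implemented here.

```python
import numpy as np

def thin_ensemble(correct, n_keep, lb=0.10, ub=0.90):
    """correct: bool array (n_experts, n_samples), True where the expert is right.

    Returns the indices of the experts that survive the thinning.
    """
    correct = np.asarray(correct, dtype=bool)
    active = list(range(correct.shape[0]))
    while len(active) > n_keep:
        sub = correct[active]
        # PCDM: fraction of active experts correctly classifying each sample.
        pcdm = sub.mean(axis=0)
        in_s = (pcdm >= lb) & (pcdm <= ub)   # samples the experts disagree on
        if not in_s.any():
            break                            # nothing left to discriminate experts
        error_on_s = 1.0 - sub[:, in_s].mean(axis=1)
        worst = int(np.argmax(error_on_s))   # first expert wins ties (assumed)
        active.pop(worst)
    return active
```

Restricting the error estimate to S focuses the comparison on samples where experts actually disagree, rather than on samples that nearly all or nearly none of them classify correctly.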
Adaptations to microarrays
• Nonexpert: Exclude experts unable to find 2
classes in the training set
• Metagenes : included as experts
• Tie-break rule: the expert higher in the tree is
excluded
Ensemble: experiment objectives
• Comparison between AID and Kun ensemble
algorithms.
• Benchmark with state of the art.
• Comparison following MAQC standard:
Matthews Correlation Coefficient
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
Ensemble algorithms improve the state of
the art
• Both algorithms improve state of the art
• The simpler Kun algorithm is the best option
[Figure: bar chart of MCC results; bar values 0.230, 0.490, 0.495, 0.514, 0.533]
Observations
• Ensemble learning feature selection led to
encouraging results.
• The proposed ensemble learning improves the
state of the art. (contrib #3)
• Tailoring the algorithm to the data benefits the
results.
4.4 KNOWLEDGE INTEGRATION
Introducing prior biological knowledge to improve the metagene generation phase. The aim is
to obtain more robust performance and more biologically interpretable gene selections.
Integration of external biological data
when producing metagenes
[Diagram] Genes and Biological Knowledge (MSigDb...) → Feature set enhancement → New metagenes → Feature selection → Classifier → Class estimations
Inputs: Train data, Validation data
Objectives of this section
• Measures to quantify biological similarity
• Develop ways to integrate both sources of information:
numerical correlation and biological similarity
• Benchmarking:
Predictive power | Results stability | Biological interpretability
Distances and merging algorithms
• 4 similarity metrics studied:
Goodall | Smirnov | NoisyOR | Anderberg
• 2 criteria to merge numerical and biological information:
Average | pdf equalization
Experimental setup
• 7 MAQC datasets
• 50-run Monte Carlo experiments
• Novel scoring system integrating Numerical
results and Biological analysis tools
Comparative scoring system
Predictive performance
$d = \frac{\mu}{\epsilon + \sigma}$ from MCC values
Rank by decreasing $d$ (rank 1 = best)
Biological analysis
4 parallel analysis tools
GSEA | Biograph | Genie | Enrichr
4 parallel rankings
Average biological rankings
Final score = rank average
Example: predictive rank 1; biological rankings 1, 3, 6, 2 (average 3); final score 2
The best algorithm has the smallest final score
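The scoring scheme can be sketched in Python as below. The sketch reads μ and σ as the mean and standard deviation of the MCC values over the Monte Carlo runs, and the final score as the average of the predictive rank and the mean of the four biological ranks. These readings, the value of ε, the use of the population standard deviation and all function names are assumptions for illustration.

```python
import numpy as np

def stability_score(mcc_values, eps=1e-3):
    """d = mu / (eps + sigma): rewards high and stable MCC values."""
    v = np.asarray(mcc_values, dtype=float)
    return v.mean() / (eps + v.std())

def rank_decreasing(scores):
    """Rank 1 goes to the largest score."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

def final_score(predictive_rank, biological_ranks):
    """Average of the predictive rank and the mean biological rank."""
    return (predictive_rank + np.mean(biological_ranks)) / 2.0

# An algorithm ranked 1st on d and 1st, 3rd, 6th, 2nd by the four biological tools.
example = final_score(1, [1, 3, 6, 2])
```

Working on ranks rather than raw values keeps the four biological tools comparable even though their native scores are on different scales.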
Predictive power scoring & ranking shows
G_pdf as the best solution
The smallest Final Score is the best alternative
[Figure: predictive ranking (MCC), biological analysis ranking (BIO) and final score for the pdf_equalization and average merging criteria]
Compared with the state of the art, G_pdf
is confirmed as the best alternative
The smallest final score is the best alternative
[Figure: MCC ranking, BIO ranking and final score for G_pdf and state-of-the-art alternatives]
Observations about knowledge
integration
• Improved results in terms of results stability
and interpretability
• Goodall similarity with the pdf-equalization scheme
is the best way to integrate prior databases
• G-pdf performance confirmed against state of
the art alternatives too (contrib #4)
4.5 MULTICLASS CLASSIFICATION
Study of a novel algorithm for multiclass classification applying coding theory on multiple
binary classifiers
Multiclass approach combining multiple
binary classifiers
• Common methods like One Against All (OAA)
or One Against One (OAO) can be improved.
• Information coding → good results [119]
• Propose a novel approach with ECOC ideas
[119] E. Tapia, L. Ornella, P. Bulacio, and L. Angelone. Multiclass classification of microarray data samples with
a reduced number of genes. BMC Bioinformatics 2011.
Our proposal: OAA+PAA
• Choice to combine several experts:
– OAA = one classifier per class
– PAA = one classifier separating each class-pair
• Expert = bit in a codeword
• Class estimation by distance with reference words
Reference codewords: rows are the N = 4 classes, columns are the M binary classifiers h1 h2 … hM
𝑐1: 1 0 0 0 1 1 1 0 0 0
𝑐2: 0 1 0 0 1 0 0 1 1 0
𝑐3: 0 0 1 0 0 1 0 1 0 1
𝑐4: 0 0 0 1 0 0 1 0 1 1
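The codeword table can be generated and used for decoding with a short Python sketch: one OAA column per class, one PAA column per class pair, and class estimation by the closest reference word. The slide only says "distance with reference words", so the Hamming distance and the function names are assumptions for illustration.

```python
from itertools import combinations
import numpy as np

def oaa_paa_codebook(n_classes):
    """Rows = classes, columns = OAA classifiers followed by PAA classifiers."""
    oaa = np.eye(n_classes, dtype=int)                 # one classifier per class
    pairs = list(combinations(range(n_classes), 2))    # one classifier per class pair
    paa = np.zeros((n_classes, len(pairs)), dtype=int)
    for j, (a, b) in enumerate(pairs):
        paa[[a, b], j] = 1                             # pair {a, b} against all others
    return np.hstack([oaa, paa])

def decode(bits, codebook):
    """Return the index of the class whose reference word is closest (Hamming, assumed)."""
    distances = np.abs(codebook - np.asarray(bits)).sum(axis=1)
    return int(np.argmin(distances))

codebook = oaa_paa_codebook(4)
# Codeword of the second class with two classifier outputs flipped.
noisy = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]
estimated_class = decode(noisy, codebook)
```

For four classes, any two reference words differ in 6 bits, so up to two erroneous binary classifiers still lead to the correct class under this distance.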
Experiments on 7 public datasets
• Binary classifiers trained with Treelet + IFFS
• Compared with OAA, OAO and state of the art
alternatives [119]
• 50-run Monte Carlo repetition of 4:1 cross-validation
Average accuracy
OAA+PAA is better than OAA, OAO and state of the art alternatives
[Figure: bar chart of average accuracy (axis from 70% to 85%) for OAA, OAO, [119] LDPC, [119] OAA and OAA+PAA L1]
Observations about OAA+PAA
• It consistently outperforms OAA and OAO
algorithms
• Obtains better accuracy than state of the art
alternatives from [119]
• OAA+PAA is a valid multiclass algorithm
(contrib #5)
5- CONCLUSIONS
Two-step approach is the main
contribution
• Feature set enhancement
– Addresses lack of structure
– Addresses noise
• Feature selection & classification
– Choose the best variables among thousands
available with new algorithms
Validated contributions
• Metagenes are helpful for classification
• Tailored IFFS algorithm improves the state of
the art
• Ensemble learning algorithm led to interesting
results
• Knowledge integration framework improves
interpretability and robustness
• OAA+PAA as a valid multiclass algorithm
Publications
Bosio M, Bellot P, Salembier P, Oliveras A. “Gene Expression Data Classification Combining Hierarchical Representation and Efficient Feature Selection”. Journal of Biological Systems. 2012;20:349-375.
Bosio M, Bellot P, Salembier P, Oliveras A. “Feature set enhancement via hierarchical clustering for microarray classification”. In: IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2011); 2011. pp. 226-229.
Bosio M, Bellot P, Salembier P, Oliveras A. “Microarray classification with hierarchical data representation and novel feature selection criteria”. In: IEEE 12th International Conference on BioInformatics and BioEngineering. Larnaca, Cyprus; 2012.
Bosio M, Bellot P, Salembier P, Oliveras A. “Multiclass cancer microarray classification algorithm with Pair-Against-All redundancy”. In: The 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS’12). Washington, DC, USA; 2012.
Bosio M, Salembier P, Bellot P, Oliveras A. “Hierarchical clustering combining numerical and biological similarities for gene expression data classification”. In: 35th Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'13). Osaka, Japan; 07/2013.
Bosio M, Salembier P, Oliveras A, Bellot P. “Ensemble feature selection and hierarchical data representation for microarray classification”. In: 13th IEEE International Conference on BioInformatics and BioEngineering (BIBE). Chania, Crete; 2013.
Future research directions
• Study a better use of the tree structure
• Integrate more information sources
• Deepen knowledge for ensemble learning
• Study applicability to Next Generation Sequencing
analysis or other ‘omics’ platforms
27/06/2014 Mattia Bosio PhD thesis defense 56
Hierarchical information representation
and efficient classification
of gene expression microarray data
PhD candidate:
Mattia Bosio
Advisors:
Philippe Salembier
Albert Oliveras Vergés
27/06/2014 Mattia Bosio PhD thesis defense 57
Hierarchical information representation
and efficient classification
of gene expression microarray data
PhD candidate:
Mattia Bosio
Advisors:
Philippe Salembier
Albert Oliveras Vergés
27/06/2014 Mattia Bosio PhD thesis defense 58
Hierarchical information representation
and efficient classification
of gene expression microarray data
PhD candidate:
Mattia Bosio
Advisors:
Philippe Salembier
Albert Oliveras Vergés
27/06/2014 Mattia Bosio PhD thesis defense 59

More Related Content

Similar to phd ppt2 sample reference download1.pptx

Structural bioinformatics and ELIXIR UK by Christine Orengo
Structural bioinformatics and ELIXIR UK by Christine OrengoStructural bioinformatics and ELIXIR UK by Christine Orengo
Structural bioinformatics and ELIXIR UK by Christine OrengoELIXIR UK
 
50_Research methodology and Biostatistics.pdf
50_Research methodology and Biostatistics.pdf50_Research methodology and Biostatistics.pdf
50_Research methodology and Biostatistics.pdfVamsi kumar
 
Aminullah assagaf p13 15-metode penelitian (2)_lanjutan
Aminullah assagaf p13 15-metode penelitian (2)_lanjutanAminullah assagaf p13 15-metode penelitian (2)_lanjutan
Aminullah assagaf p13 15-metode penelitian (2)_lanjutanAminullah Assagaf
 
Applying ‘best fit’ frameworks to systematic review data extraction
Applying ‘best fit’ frameworks to systematic review data extractionApplying ‘best fit’ frameworks to systematic review data extraction
Applying ‘best fit’ frameworks to systematic review data extractionAndrea Miller-Nesbitt
 
Predicting student performance using aggregated data sources
Predicting student performance using aggregated data sourcesPredicting student performance using aggregated data sources
Predicting student performance using aggregated data sourcesOlugbenga Wilson Adejo
 
Omics Logic - Bioinformatics 2.0
Omics Logic - Bioinformatics 2.0Omics Logic - Bioinformatics 2.0
Omics Logic - Bioinformatics 2.0Elia Brodsky
 
SYNOPSIS on Parse representation and Linear SVM.
SYNOPSIS on Parse representation and Linear SVM.SYNOPSIS on Parse representation and Linear SVM.
SYNOPSIS on Parse representation and Linear SVM.bhavinecindus
 
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses Dmitry Grapov
 
ELSS use cases and strategy
ELSS use cases and strategyELSS use cases and strategy
ELSS use cases and strategyAnton Yuryev
 
Lec1-Into
Lec1-IntoLec1-Into
Lec1-Intobutest
 
Lecture 1- Introduction.pptx
Lecture 1- Introduction.pptxLecture 1- Introduction.pptx
Lecture 1- Introduction.pptxssuserb0d8b4
 
Predicting Life Expectancy of Hepatitis B Patients
Predicting Life Expectancy of Hepatitis B PatientsPredicting Life Expectancy of Hepatitis B Patients
Predicting Life Expectancy of Hepatitis B Patientsnabeelali11101999
 
Survey Research Methods with Lynn Silipigni Connaway
Survey Research Methods with Lynn Silipigni ConnawaySurvey Research Methods with Lynn Silipigni Connaway
Survey Research Methods with Lynn Silipigni ConnawayLynn Connaway
 

Similar to phd ppt2 sample reference download1.pptx (20)

Structural bioinformatics and ELIXIR UK by Christine Orengo
Structural bioinformatics and ELIXIR UK by Christine OrengoStructural bioinformatics and ELIXIR UK by Christine Orengo
Structural bioinformatics and ELIXIR UK by Christine Orengo
 
50_Research methodology and Biostatistics.pdf
50_Research methodology and Biostatistics.pdf50_Research methodology and Biostatistics.pdf
50_Research methodology and Biostatistics.pdf
 
Aminullah assagaf p13 15-metode penelitian (2)_lanjutan
Aminullah assagaf p13 15-metode penelitian (2)_lanjutanAminullah assagaf p13 15-metode penelitian (2)_lanjutan
Aminullah assagaf p13 15-metode penelitian (2)_lanjutan
 
Applying ‘best fit’ frameworks to systematic review data extraction
Applying ‘best fit’ frameworks to systematic review data extractionApplying ‘best fit’ frameworks to systematic review data extraction
Applying ‘best fit’ frameworks to systematic review data extraction
 
Basic quantitative research
Basic quantitative researchBasic quantitative research
Basic quantitative research
 
Predicting student performance using aggregated data sources
Predicting student performance using aggregated data sourcesPredicting student performance using aggregated data sources
Predicting student performance using aggregated data sources
 
Omics Logic - Bioinformatics 2.0
Omics Logic - Bioinformatics 2.0Omics Logic - Bioinformatics 2.0
Omics Logic - Bioinformatics 2.0
 
Model management for systems biology projects
Model management for systems biology projectsModel management for systems biology projects
Model management for systems biology projects
 
SYNOPSIS on Parse representation and Linear SVM.
SYNOPSIS on Parse representation and Linear SVM.SYNOPSIS on Parse representation and Linear SVM.
SYNOPSIS on Parse representation and Linear SVM.
 
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses
 
ELSS use cases and strategy
ELSS use cases and strategyELSS use cases and strategy
ELSS use cases and strategy
 
Lec1-Into
Lec1-IntoLec1-Into
Lec1-Into
 
Bolouri qualitative method
Bolouri qualitative methodBolouri qualitative method
Bolouri qualitative method
 
Lecture 1- Introduction.pptx
Lecture 1- Introduction.pptxLecture 1- Introduction.pptx
Lecture 1- Introduction.pptx
 
Systematic Reviews and Research Synthesis, Part 2
Systematic Reviews and Research Synthesis, Part 2Systematic Reviews and Research Synthesis, Part 2
Systematic Reviews and Research Synthesis, Part 2
 
An Introduction to Biology with Computers
An Introduction to Biology with ComputersAn Introduction to Biology with Computers
An Introduction to Biology with Computers
 
Predicting Life Expectancy of Hepatitis B Patients
Predicting Life Expectancy of Hepatitis B PatientsPredicting Life Expectancy of Hepatitis B Patients
Predicting Life Expectancy of Hepatitis B Patients
 
(2012) The Role of Test Administrator and Error proposal
(2012) The Role of Test Administrator and Error proposal(2012) The Role of Test Administrator and Error proposal
(2012) The Role of Test Administrator and Error proposal
 
Validation Studies in Simulation-based Education - Deb Rooney
Validation Studies in Simulation-based Education - Deb RooneyValidation Studies in Simulation-based Education - Deb Rooney
Validation Studies in Simulation-based Education - Deb Rooney
 
Survey Research Methods with Lynn Silipigni Connaway
Survey Research Methods with Lynn Silipigni ConnawaySurvey Research Methods with Lynn Silipigni Connaway
Survey Research Methods with Lynn Silipigni Connaway
 

Recently uploaded

Electrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission lineElectrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission lineJulioCesarSalazarHer1
 
Complex plane, Modulus, Argument, Graphical representation of a complex numbe...
Complex plane, Modulus, Argument, Graphical representation of a complex numbe...Complex plane, Modulus, Argument, Graphical representation of a complex numbe...
Complex plane, Modulus, Argument, Graphical representation of a complex numbe...MohammadAliNayeem
 
Insurance management system project report.pdf
Insurance management system project report.pdfInsurance management system project report.pdf
Insurance management system project report.pdfKamal Acharya
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsMathias Magdowski
 
15-Minute City: A Completely New Horizon
15-Minute City: A Completely New Horizon15-Minute City: A Completely New Horizon
15-Minute City: A Completely New HorizonMorshed Ahmed Rahath
 
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptxSLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptxCHAIRMAN M
 
Interfacing Analog to Digital Data Converters ee3404.pdf
Interfacing Analog to Digital Data Converters ee3404.pdfInterfacing Analog to Digital Data Converters ee3404.pdf
Interfacing Analog to Digital Data Converters ee3404.pdfragupathi90
 
EMPLOYEE MANAGEMENT SYSTEM FINAL presentation
EMPLOYEE MANAGEMENT SYSTEM FINAL presentationEMPLOYEE MANAGEMENT SYSTEM FINAL presentation
EMPLOYEE MANAGEMENT SYSTEM FINAL presentationAmayJaiswal4
 
Theory for How to calculation capacitor bank
Theory for How to calculation capacitor bankTheory for How to calculation capacitor bank
Theory for How to calculation capacitor banktawat puangthong
 
Lab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docxLab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docxRashidFaridChishti
 
Quiz application system project report..pdf
Quiz application system project report..pdfQuiz application system project report..pdf
Quiz application system project report..pdfKamal Acharya
 
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisSeismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisDr.Costas Sachpazis
 
Introduction to Arduino Programming: Features of Arduino
Introduction to Arduino Programming: Features of ArduinoIntroduction to Arduino Programming: Features of Arduino
Introduction to Arduino Programming: Features of ArduinoAbhimanyu Sangale
 
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...Prakhyath Rai
 
Circuit Breaker arc phenomenon.pdf engineering
Circuit Breaker arc phenomenon.pdf engineeringCircuit Breaker arc phenomenon.pdf engineering
Circuit Breaker arc phenomenon.pdf engineeringKanchhaTamang
 
ChatGPT Prompt Engineering for project managers.pdf
ChatGPT Prompt Engineering for project managers.pdfChatGPT Prompt Engineering for project managers.pdf
ChatGPT Prompt Engineering for project managers.pdfqasastareekh
 
Online book store management system project.pdf
Online book store management system project.pdfOnline book store management system project.pdf
Online book store management system project.pdfKamal Acharya
 
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWINGBRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWINGKOUSTAV SARKAR
 
5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...archanaece3
 
E-Commerce Shopping using MERN Stack where different modules are present
E-Commerce Shopping using MERN Stack where different modules are presentE-Commerce Shopping using MERN Stack where different modules are present
E-Commerce Shopping using MERN Stack where different modules are presentjatinraor66
 

Recently uploaded (20)

Electrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission lineElectrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission line
 
Complex plane, Modulus, Argument, Graphical representation of a complex numbe...
Complex plane, Modulus, Argument, Graphical representation of a complex numbe...Complex plane, Modulus, Argument, Graphical representation of a complex numbe...
Complex plane, Modulus, Argument, Graphical representation of a complex numbe...
 
Insurance management system project report.pdf
Insurance management system project report.pdfInsurance management system project report.pdf
Insurance management system project report.pdf
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility Applications
 
15-Minute City: A Completely New Horizon
15-Minute City: A Completely New Horizon15-Minute City: A Completely New Horizon
15-Minute City: A Completely New Horizon
 
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptxSLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
 
Interfacing Analog to Digital Data Converters ee3404.pdf
Interfacing Analog to Digital Data Converters ee3404.pdfInterfacing Analog to Digital Data Converters ee3404.pdf
Interfacing Analog to Digital Data Converters ee3404.pdf
 
EMPLOYEE MANAGEMENT SYSTEM FINAL presentation
EMPLOYEE MANAGEMENT SYSTEM FINAL presentationEMPLOYEE MANAGEMENT SYSTEM FINAL presentation
EMPLOYEE MANAGEMENT SYSTEM FINAL presentation
 
Theory for How to calculation capacitor bank
Theory for How to calculation capacitor bankTheory for How to calculation capacitor bank
Theory for How to calculation capacitor bank
 
Lab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docxLab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docx
 
Quiz application system project report..pdf
Quiz application system project report..pdfQuiz application system project report..pdf
Quiz application system project report..pdf
 
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisSeismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
 
Introduction to Arduino Programming: Features of Arduino
Introduction to Arduino Programming: Features of ArduinoIntroduction to Arduino Programming: Features of Arduino
Introduction to Arduino Programming: Features of Arduino
 
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
 
Circuit Breaker arc phenomenon.pdf engineering
Circuit Breaker arc phenomenon.pdf engineeringCircuit Breaker arc phenomenon.pdf engineering
Circuit Breaker arc phenomenon.pdf engineering
 
ChatGPT Prompt Engineering for project managers.pdf
ChatGPT Prompt Engineering for project managers.pdfChatGPT Prompt Engineering for project managers.pdf
ChatGPT Prompt Engineering for project managers.pdf
 
Online book store management system project.pdf
Online book store management system project.pdfOnline book store management system project.pdf
Online book store management system project.pdf
 
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWINGBRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
 
5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...
 
E-Commerce Shopping using MERN Stack where different modules are present
E-Commerce Shopping using MERN Stack where different modules are presentE-Commerce Shopping using MERN Stack where different modules are present
E-Commerce Shopping using MERN Stack where different modules are present
 

phd ppt2 sample reference download1.pptx

  • 1. Hierarchical information representation and efficient classification of gene expression microarray data PhD candidate: Mattia Bosio Advisors: Philippe Salembier Albert Oliveras Vergés 27/06/2014 Mattia Bosio PhD thesis defense 1
  • 2. Thesis objective: Develop algorithms for microarray classification – Predictive performance – Results stability – Biological interpretability
  • 3. Roadmap 1- Microarrays 2- Challenges & Opportunities 3- Contributions 4- How did we get there? 5- Conclusions
  • 4. 1- Microarrays
  • 5. A platform to measure gene expression • Give a picture of the whole cellular state • Thousands of parallel measurements • Measure how much each gene is being used • Can be used to discriminate between populations
  • 6. Microarrays: what do they measure?
  • 7. Microarrays: what do they look like? 45,000 ‘Genes’, 72 Samples
  • 8. 2- CHALLENGES & OPPORTUNITIES
  • 9. Challenges • Lack of structure • Noise • Sample size vs dimensions: 45,000 ‘Genes’, 72 Samples
  • 10. Opportunities • Established tool for research, but no optimal classification algorithm yet • Machine learning has already been used – Good results that can be improved • Signal processing has dealt with similar problems
  • 11. 3- CONTRIBUTIONS
  • 12. Two-step classification framework [Diagram: Genes → Feature set Enhancement → Metagenes → Feature Selection → Classifier → Class Estimations, with Train Data and Validation Data as inputs] Contributions: 1. Metagenes 2. IFFS 3. Ensemble 4. Knowledge Integration 5. Multiclass algorithm
  • 13. 4- HOW DID WE GET THERE?
  • 14. 4.1 FEATURE SET ENHANCEMENT A structure is inferred from the data and new metagenes are created.
  • 15. Feature set enhancement: addresses noise and lack of structure • A binary tree is inferred • Each node is a new feature • New features are called metagenes • Metagenes reduce noise by clustering similar genes
  • 16. Feature set enhancement: the iterative process of metagene generation • Iterative process based on Treelets [1] • The two most similar features are replaced by a metagene • Two key elements: – Similarity metric – Metagene generation algorithm [1] A. B. Lee, B. Nadler, L. Wasserman, Treelets - an adaptive multi-scale basis for sparse unordered data, Annals of Applied Statistics 2 (2) (2008) 435–471.
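The iterative merging described on slide 16 can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes Pearson correlation as the similarity metric and a plain average as the metagene generation rule, whereas the deck only states that both elements were studied in several variants. The function name `build_metagenes` is hypothetical.

```python
import numpy as np

def build_metagenes(X):
    """Greedy pairwise merging of features into metagenes.

    X: array of shape (p features, n samples).
    Returns (features, merges): a dict of all p + (p - 1) feature vectors
    and the list of merges as (left_id, right_id, metagene_id).
    Assumptions (not from the slides): Pearson correlation as similarity,
    simple averaging as the metagene generation rule.
    """
    X = np.asarray(X, dtype=float)
    p = X.shape[0]
    features = {i: X[i] for i in range(p)}
    active = list(range(p))
    merges = []
    next_id = p
    while len(active) > 1:
        best_pair, best_sim = None, -np.inf
        for i, a in enumerate(active):
            for b in active[i + 1:]:
                sim = np.corrcoef(features[a], features[b])[0, 1]
                if np.isnan(sim):  # constant feature: treat as least similar
                    sim = -1.0
                if sim > best_sim:
                    best_pair, best_sim = (a, b), sim
        a, b = best_pair
        # The two most similar features are replaced by one metagene.
        features[next_id] = (features[a] + features[b]) / 2.0
        merges.append((a, b, next_id))
        active = [f for f in active if f not in (a, b)] + [next_id]
        next_id += 1
    return features, merges
```

Starting from p genes, the loop performs p - 1 merges, so the result holds the p original genes plus p - 1 metagenes, one per internal node of the binary tree.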
  • 17. 4.2 FEATURE SELECTION: IFFS How to select the right features to discriminate between classes with an iterative wrapper algorithm
  • 18. IFFS: Find the few best features to classify • “Improved Sequential Floating Forward Selection (IFFS)” [2]: – Sequential, deterministic wrapper algorithm • Flexible method: at each iteration, decide whether to Add, Delete or Substitute a feature • Alternatives are compared by a J(·) score [2] S. Nakariyakul, D. Casasent, An improvement on floating search algorithms for feature subset selection, Pattern Recognition.
  • 19. IFFS: Find the few best features to classify. Deterministic sequential wrapper algorithm • All decisions are determined by a J(·) score • Usually J(·) is an error rate estimate – Ties are frequent due to sample scarcity
  • 20. J(·) score tailored for microarrays. A reliability measure breaks ties in J(·). The J(·) score depends on 2 parameters: 1. Error rate 2. Reliability. Three rules define the score by combining error rate and reliability: 1. Lexicographic sorting 2. Exponential penalization 3. Linear combination
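The first of the three rules on slide 20, lexicographic sorting, can be illustrated with a short sketch. It assumes that a lower error rate is preferred first and that a higher reliability breaks ties; the candidate tuples and the function name `best_candidate` are hypothetical, and the other two rules are not shown because the slides do not give their formulas.

```python
def best_candidate(candidates):
    """Pick the best feature subset under a lexicographic J score.

    candidates: list of (name, error_rate, reliability) tuples.
    Assumption: error rate is compared first (lower is better);
    reliability only matters when error rates tie (higher is better).
    """
    return min(candidates, key=lambda c: (c[1], -c[2]))

candidates = [
    ("subset A", 0.10, 0.30),
    ("subset B", 0.10, 0.80),  # same error as A, but more reliable
    ("subset C", 0.20, 0.90),
]
winner = best_candidate(candidates)
```

With few samples the error rate takes only a handful of distinct values, so ties such as A versus B are common, which is why a secondary criterion is needed at all.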
  • 21. IFFS: Experimental setup • Datasets from MAQC study phase II [4] • 7 datasets with hundreds of samples – 30,000+ models evaluated – Independent validation sets available – Common evaluation procedure [4] L. Shi, et al., The microarray quality control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models, Nature Biotechnology 28 (2010) 827-38.
  • 22. IFFS: experiment objectives • Evaluate whether metagenes are useful • Benchmark against the state of the art • Comparison following the MAQC standard: Matthews Correlation Coefficient $MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
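As a small worked illustration of the Matthews Correlation Coefficient used for the comparisons, the function below computes it from the four confusion-matrix counts. Returning 0 when the denominator is zero is a common convention and an assumption here, not something stated in the slides.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: an undefined MCC is reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (perfect prediction), and unlike raw accuracy it stays informative when the two classes are unbalanced.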
  • 23. Results: Metagenes are useful • Introducing metagenes gives better results
  • 24. The proposed framework improves state of the art results [Bar chart; values 0.423, 0.486, 0.495, 0.490; bar labels not preserved]
  • 25. Observations • The proposed framework works with both its key elements • Metagenes are useful (contrib #1) • IFFS adapted to microarrays improves the state of the art (contrib #2)
  • 26. 4.3 FEATURE SELECTION: ENSEMBLE How to select the right features to discriminate between classes with a novel ensemble learning algorithm
  • 27. Ensemble learning - voting scheme • An ensemble combines experts with a voting scheme • One expert for each available feature – Expert = trained classifier output on the analyzed data – 1 expert = 1 feature • Feature selection becomes an expert subset selection problem
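The "one expert per feature" voting scheme of slide 27 can be sketched as below. The choice of expert is an assumption made only for illustration: each expert is a nearest-class-mean rule on a single feature, and the ensemble output is a simple majority vote over a selected subset of experts. All function names are hypothetical.

```python
import numpy as np

def train_experts(X, y):
    """Fit one single-feature expert per column of X.

    X: (n_samples, p_features), y: binary labels in {0, 1}.
    Assumption: each expert is a nearest-class-mean rule on its feature.
    Returns the per-feature class means (m0, m1).
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def expert_votes(X, m0, m1):
    """Vote matrix (n_samples, p_features): 1 if closer to the class-1 mean."""
    X = np.asarray(X, dtype=float)
    return (np.abs(X - m1) < np.abs(X - m0)).astype(int)

def majority_vote(votes, selected):
    """Ensemble decision using only the selected experts (feature indices)."""
    return (votes[:, selected].mean(axis=1) > 0.5).astype(int)
```

Under this view, choosing which columns go into `selected` is exactly the feature selection problem, restated as expert subset selection.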
  • 28. Accuracy In Diversity [7]: the original algorithm • Starts with p experts: one for each feature • Sequentially removes the expert with the worst error rate on a subset S • In [6], a simpler version is defined: Kun algorithm [6] L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. Wiley, 2004. [7] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, “A new ensemble diversity measure applied to thinning ensembles.” in Multiple Classifier Systems, ser. Lecture Notes in Computer Science, T. Windeatt and F. Roli, Eds., vol. 2709. Springer, 2003, pp. 306–316.
  • 29. Accuracy In Diversity: the original algorithm • PCDM (d) = % of experts correctly classifying sample i • S = set of samples with $lb \le d \le Ub$ • The expert with the worst error rate on S is excluded • Bounds for AID: $lb = \mu \cdot d + \frac{1 - d}{n}$, $Ub = \alpha \cdot d + \mu (1 - d)$ • Bounds for Kun: $lb = 10\%$, $Ub = 90\%$ [Figure: experts × samples vote matrix with per-sample PCDM values 90%, 50%, 80%, 100%, 100%]
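One thinning step with the fixed Kun bounds (lb = 10%, Ub = 90%) might look like the sketch below. It takes a boolean matrix recording which expert classifies which sample correctly. Two details are assumptions rather than statements from the slides: if no sample falls inside the bounds, all samples are used, and ties between equally bad experts go to the lowest index instead of the tree-based tie-break the deck describes.

```python
import numpy as np

def thinning_step(correct, lb=0.10, ub=0.90):
    """Return the index of the expert to remove in one thinning iteration.

    correct: boolean array (n_samples, n_experts), True where the expert
    classifies the sample correctly.
    """
    correct = np.asarray(correct, dtype=bool)
    # PCDM: fraction of experts that classify each sample correctly.
    pcdm = correct.mean(axis=1)
    in_s = (pcdm >= lb) & (pcdm <= ub)
    if not in_s.any():
        # Assumed fallback: use every sample when S would be empty.
        in_s = np.ones(len(pcdm), dtype=bool)
    error_on_s = 1.0 - correct[in_s].mean(axis=0)
    # Ties resolve to the first index here (an assumption).
    return int(np.argmax(error_on_s))
```

Samples that nearly all experts get right, or nearly all get wrong, are left out of S because removing a single expert hardly changes the vote on them.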
  • 30. Adaptations to microarrays • Nonexpert: exclude experts unable to find 2 classes in the training set • Metagenes: included as experts • Tie-break rule: the expert higher in the tree is excluded
  • 31. Ensemble: experiment objectives • Comparison between AID and Kun ensemble algorithms. • Benchmark against the state of the art. • Comparison following the MAQC standard: Matthews Correlation Coefficient $MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
  • 32. Ensemble algorithms improve the state of the art • Both algorithms improve the state of the art • The simpler Kun algorithm is the best option [Bar chart; values 0.230, 0.490, 0.495, 0.514, 0.533; bar labels not preserved]
  • 33. Observations • Ensemble learning feature selection led to encouraging results. • The proposed ensemble learning improves the state of the art. (contrib #3) • Tailoring the algorithm to the data benefits the results.
  • 34. 4.4 KNOWLEDGE INTEGRATION Introducing prior biological knowledge to improve the metagene generation phase. The aim is to obtain more robust performance and more biologically interpretable gene selections
  • 35. Integration of external biological data when producing metagenes [Diagram: Genes and Biological Knowledge (MSigDb...) → Feature set Enhancement → New metagenes → Feature Selection → Classifier → Class Estimations, with Train Data and Validation Data as inputs]
  • 36. Objectives of this section • Measures to quantify biological similarity • Develop ways to integrate both sources of information: numerical correlation & biological similarity • Benchmarking: Predictive power | Results stability | Biological interpretability
  • 37. Distances and merging algorithms • 4 similarity metrics studied: Goodall | Smirnov | NoisyOR | Anderberg • 2 criteria to merge numerical and biological info: Average | pdf equalization
  • 38. Experimental setup • 7 MAQC datasets • 50-run Monte Carlo experiments • Novel scoring system integrating numerical results and biological analysis tools
  • 39. Comparative scoring system • Predictive performance: $d = \frac{\mu}{\epsilon + \sigma}$ from MCC values; rank by decreasing d (largest d = best) • Biological analysis: 4 parallel analysis tools (GSEA | Biograph | Genie | Enrichr) give 4 parallel rankings, which are averaged • Final score = rank average • The best algorithm has the smallest final score
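The comparative scoring of slide 39 can be sketched as follows. Several details are assumptions made for illustration only: μ and σ are taken as the mean and standard deviation of the MCC values over the Monte Carlo runs, `eps` is a small hypothetical constant, the final score is the plain mean of the predictive rank and the averaged biological rank, and tied ranks are not treated specially.

```python
import numpy as np

def predictive_score(mcc_values, eps=1e-3):
    """d = mu / (eps + sigma) over the MCC values of repeated runs."""
    m = np.asarray(mcc_values, dtype=float)
    return m.mean() / (eps + m.std())

def rank_descending(scores):
    """Rank 1 goes to the largest score (ties not handled specially)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def final_scores(mcc_runs, bio_ranks):
    """Combine predictive and biological rankings; smaller is better.

    mcc_runs: {algorithm: list of MCC values}.
    bio_ranks: {algorithm: list of ranks, one per biological tool}.
    """
    names = list(mcc_runs)
    d = [predictive_score(mcc_runs[n]) for n in names]
    pred_rank = rank_descending(d)
    return {n: (pred_rank[i] + np.mean(bio_ranks[n])) / 2.0
            for i, n in enumerate(names)}
```

Dividing the mean by the spread rewards algorithms that are both accurate and stable across runs, which matches the thesis objective of results stability.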
  • 40. Predictive power scoring & ranking shows G_pdf as the best solution. The smallest final score is the best alternative. [Chart: MCC (predictive ranking), BIO (biological analysis) and final score for the pdf_equalization and average merging criteria]
  • 41. Compared with the state of the art, G_pdf is confirmed as the best alternative. The smallest final score is the best alternative. [Chart: MCC, BIO and final score]
  • 42. Observations about knowledge integration • Improved results in terms of stability and interpretability • Goodall similarity with the pdf-equalization scheme is the best way to integrate prior databases • G_pdf performance is also confirmed against state of the art alternatives (contrib #4)
  • 43. 4.5 MULTICLASS CLASSIFICATION Study of a novel algorithm for multiclass classification applying coding theory on multiple binary classifiers
  • 44. Multiclass approach combining multiple binary classifiers • Common methods like One Against All (OAA) or One Against One (OAO) can be improved. • Information coding → good results [119] • Propose a novel approach with ECOC ideas [119] E. Tapia, L. Ornella, P. Bulacio, and L. Angelone. Multiclass classification of microarray data samples with a reduced number of genes. BMC Bioinformatics 2011.
  • 45. Our proposal: OAA+PAA • Choice to combine several experts: – OAA = one classifier per class – PAA = one classifier separating each class pair from the rest • Expert = bit in a codeword • Class estimation by distance to reference words • Codebook for N = 4 classes and M binary classifiers h1 h2 … hM (M = 10 here): c1 = 1 0 0 0 1 1 1 0 0 0; c2 = 0 1 0 0 1 0 0 1 1 0; c3 = 0 0 1 0 0 1 0 1 0 1; c4 = 0 0 0 1 0 0 1 0 1 1
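The OAA+PAA codebook on slide 45 can be reproduced with a short sketch: N one-against-all columns followed by one pair-against-all column per class pair, decoded by nearest reference word. Hamming distance is assumed for decoding (for hard 0/1 bits it coincides with the L1 distance), and the function names are hypothetical.

```python
from itertools import combinations
import numpy as np

def oaa_paa_codebook(n_classes):
    """Reference codewords, shape (n_classes, M), M = N + N(N-1)/2."""
    columns = []
    # OAA: one binary classifier per class.
    for c in range(n_classes):
        columns.append([1 if k == c else 0 for k in range(n_classes)])
    # PAA: one binary classifier per class pair against all the rest.
    for a, b in combinations(range(n_classes), 2):
        columns.append([1 if k in (a, b) else 0 for k in range(n_classes)])
    return np.array(columns).T

def decode(bits, codebook):
    """Index of the class whose reference word is closest (Hamming)."""
    distances = (codebook != np.asarray(bits)).sum(axis=1)
    return int(np.argmin(distances))
```

For four classes this yields the 4 × 10 matrix shown on the slide. Any two reference words differ in 6 positions, so up to two wrong binary classifiers can be tolerated without changing the decoded class.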
  • 46. Experiments on 7 public datasets • Binary classifiers trained with Treelet + IFFS • Compared with OAA, OAO and state of the art alternatives [119] • 50-run Monte Carlo of 4:1 cross validation.
  • 47. Average accuracy: OAA+PAA is better than OAA, OAO and state of the art alternatives [Bar chart; bars: OAA, OAO, [119] LDPC, [119] OAA, OAA+PAA L1; accuracy axis from 70% to 85%]
  • 48. Observations about OAA+PAA • It consistently outperforms OAA and OAO algorithms • Obtains better accuracy than state of the art alternatives from [119] • OAA+PAA is a valid multiclass algorithm (contrib #5)
  • 49. 5- CONCLUSIONS
  • 50. Two-step approach is the main contribution • Feature set enhancement – Addresses lack of structure – Addresses noise • Feature selection & classification – Choose the best variables among thousands available with new algorithms
  • 51. Validated contributions • Metagenes are helpful for classification • Tailored IFFS algorithm → improves state of the art • Ensemble learning algorithm led to interesting results • Knowledge integration framework improves interpretability and robustness • OAA+PAA as a valid multiclass algorithm
  • 52. Publications • Bosio M, Bellot P, Salembier P, Oliveras A. “Gene Expression Data Classification Combining Hierarchical Representation and Efficient Feature Selection”. Journal of Biological Systems. 2012;20:349-375. • Bosio M, Bellot P, Salembier P, Oliveras A. “Feature set enhancement via hierarchical clustering for microarray classification”. IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2011). 2011. pp. 226-229. • Bosio M, Bellot P, Salembier P, Oliveras A. “Microarray classification with hierarchical data representation and novel feature selection criteria”. In: IEEE 12th International Conference on BioInformatics and BioEngineering. Larnaca, Cyprus; 2012. • Bosio M, Bellot P, Salembier P, Oliveras A. “Multiclass cancer microarray classification algorithm with Pair-Against-All redundancy”. In: The 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS’12). Washington, DC, USA; 2012. • Bosio M, Salembier P, Bellot P, Oliveras A. “Hierarchical clustering combining numerical and biological similarities for gene expression data classification”. 35th Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'13). Osaka, Japan; 07/2013. • Bosio M, Salembier P, Oliveras A, Bellot P. “Ensemble feature selection and hierarchical data representation for microarray classification”. In: 13th IEEE International Conference on BioInformatics and BioEngineering (BIBE). Chania, Crete; 2013. [Labels on slide: IFFS, KUN, BIOINFO, MCLASS, METAGENES]
  • 53. Future research directions • Study a better use of the tree structure • Integrate more information sources • Deepen knowledge of ensemble learning • Study applicability to Next Generation Sequencing analysis or other ‘omics’ platforms
  • 54. Hierarchical information representation and efficient classification of gene expression microarray data PhD candidate: Mattia Bosio Advisors: Philippe Salembier Albert Oliveras Vergés

Editor's Notes

  1. Specify the A vs B test with an example (tumor vs. non-tumor). Nice to know this: even if, from a signal processing point of view, microarrays are just a matrix of numbers, it's useful to know where they come from and what they measure. Microarrays are a platform to measure the expression of genes in a sample. They measure thousands of different expressions simultaneously. Each measure, to simplify, quantifies how much a gene is being used by the cell. Being used means activated or expressed; non-expressed means that a gene is not being used by a cell or organism. Now, why are these important? The hope with genomic data is to have a picture of all the genes in a cell: which are more used by a tumor, and which are switched off by a tumor. The idea is that these measurements can help identify relevant genes that change between subgroups (tumor vs. non-tumor, for example).
  2. What microarrays actually measure to know gene expression, and why. Central dogma: DNA – RNA – Protein. Measure RNA = the gene is being copied to RNA a lot because its protein is needed. Gene activity is proportional to the measured RNA quantity. Of course it's not so simple, but that's how it works.
  3. What microarrays actually measure to know gene expression, and why. Central dogma: DNA – RNA – Protein. Measure RNA = the gene is being copied to RNA a lot because its protein is needed. Gene activity is proportional to the measured RNA quantity. Of course it's not so simple, but that's how it works.
  4. 2 classes: say what they are. Example: tumor vs. non-tumor. Start with the problems. What are they: noise; lack of structure -> we don't know who's really a neighbor, no regularity; sample scarcity and high dimensionality.
  5. Opportunities: why signal processing can be used and why our thesis is in this field. Microarrays are a useful and widely used tool for clinical research -> no
  6. Don't say 2x feature selection needed, just needed anyway. Example of multiclass: multiple classes of a lymphoma, for example, or comparing several tumor classes. Two-step approach. Feature set enhancement: addresses lack of structure; addresses noise. The aim is to generate new variables with less noise by grouping genes that behave similarly across the samples. Feature selection & classification: choose the best variables among thousands available with new algorithms.
  7. Two-step approach. Feature set enhancement: addresses lack of structure; addresses noise. The aim is to generate new variables with less noise by grouping genes that behave similarly across the samples. Feature selection & classification: choose the best variables among thousands available with new algorithms.
  8. Say that we want to produce a structure where it doesn't exist. Say that we want metagenes to group similarly behaving genes so that we can reduce noise by averaging out similar ones. Say that the output from this phase will be p genes + p-1 metagenes.
  9. Describe the iterative process quickly. Focus on two key aspects. Similarity metric: decides who gets merged with whom. Metagene generation rule: decides how the merging is done. We studied variants for both key parameters: Haar vs PCA, and Euclidean vs Treelets.
  10. Say how the algorithm itself is iterative, meaning that at each step it actually trains classifiers on the training set and evaluates them on it in terms of a fitness score J. It's flexible.
  11. Say how the algorithm itself is iterative, meaning that at each step it actually trains classifiers on the training set and evaluates them on it in terms of a fitness score J. It's flexible.
  12. Give something more about reliability. It is defined with the sample distance with respect to the boundary. Then have final slides with formulas.
  13. Say why ensemble method can be interesting
  14. Ensemble methods can be interesting because they limit overfitting risk by voting with one-dimensional classifiers. They have been successfully used in other fields of machine learning and also in computational biology and bioinformatics.
  15. Say why the process is this one. The relevant samples are those on which the experts agree the least, so removing one expert has the most impact there.
  16. Justify the tie-break rule by saying that the highest nodes are less reliable since they merge more and more genes.
  17. A slide to say that we also studied several Kun variants and that that number can go up to 0.555? Or do we say nothing about that?
  18. Say why ensemble method can be interesting
  19. Talk here about the MSigDB data sources, which are high quality and reliable. Talk about their form and the challenges it implies for the metagenes (binary to continuous-valued variable).
  20. Say why ensemble method can be interesting
  21. Say why: ECOC is based on redundancy and error-correcting algorithms. They work in communications. The same assumptions cannot be made here and some algorithms don't work the same. Actually none work the same, otherwise we would have perfect classifiers. Experiments have been done with LDPC coding, which gives very good results in communications and some improvements in microarray classification. We want to take the idea of redundancy, but without using random algorithms that have no connection to the nature of the phenomenon. Our bet is that using an easy and reasonable rule to define experts can lead to better results. Our rule is to use the commonly used OAA and add redundancy by grouping class pairs and separating them from the rest. The idea is that class pairs are more likely to exist than bigger groups (this can be questionable, but we wanted to try it).
  22. To drive feature selection and eliminate unreliable alternatives early. Integrate other sources, with natural language processing for example, or import several data