1. Hierarchical information representation
and efficient classification
of gene expression microarray data
PhD candidate:
Mattia Bosio
Advisors:
Philippe Salembier
Albert Oliveras Vergés
27/06/2014 Mattia Bosio PhD thesis defense 1
5. A platform to measure gene expression
• Give a picture of the whole cellular state
• Thousands of parallel measurements
• Measure how much each gene is being used
• Can be used to discriminate between populations
10. Opportunities
• An established research tool, but no optimal classification algorithm yet
• Machine learning has already been used
– Good results that can be improved
• Signal processing has dealt with similar problems
12. Two-step classification framework
[Diagram: Genes → Feature set enhancement (metagenes) → Feature selection → Classifier; train and validation data feed the pipeline, producing class estimations]
Contributions:
1. Metagenes
2. IFFS
3. Ensemble
4. Knowledge integration
5. Multiclass algorithm
13. 4. HOW DID WE GET THERE?
14. 4.1 FEATURE SET ENHANCEMENT
A structure is inferred from the data and new metagenes are created.
15. Feature set enhancement
Addresses noise and lack of structure
• A binary tree is inferred from the data
• Each node is a new feature
• The new features are called metagenes
• Metagenes reduce noise by clustering similar genes
16. Feature set enhancement
The iterative process of metagene generation
• Iterative process based on Treelets [1]
• The two most similar features are replaced by a new metagene
• Two key elements:
– Similarity metric
– Metagene generation algorithm

[1] A. B. Lee, B. Nadler, L. Wasserman, "Treelets - an adaptive multi-scale basis for sparse unordered data", Annals of Applied Statistics 2 (2) (2008) 435-471.
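The iterative merging loop can be sketched in Python. Here the Pearson-correlation similarity and the plain-averaging merge rule are stand-in assumptions (the thesis studies several variants of both key elements), and the function name is illustrative:

```python
import numpy as np

def build_metagenes(X, n_levels=None):
    """Iteratively merge the two most similar features into a metagene.

    X: samples x features expression matrix.
    Returns the augmented matrix (original genes + metagenes) and the
    list of merged pairs (the inferred binary tree).
    Similarity metric (Pearson correlation) and merging rule (plain
    averaging) are simplifying assumptions of this sketch.
    """
    feats = [X[:, j].copy() for j in range(X.shape[1])]
    active = list(range(len(feats)))          # indices still eligible for merging
    tree = []
    n_levels = n_levels or (len(active) - 1)  # full tree: p-1 metagenes
    for _ in range(n_levels):
        if len(active) < 2:
            break
        # find the most similar active pair
        best, best_pair = -2.0, None
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                i, j = active[a], active[b]
                c = np.corrcoef(feats[i], feats[j])[0, 1]
                if c > best:
                    best, best_pair = c, (i, j)
        i, j = best_pair
        metagene = 0.5 * (feats[i] + feats[j])  # merging rule (assumption)
        feats.append(metagene)
        tree.append((i, j))
        # the merged pair is replaced by the new metagene
        active = [k for k in active if k not in (i, j)] + [len(feats) - 1]
    return np.column_stack(feats), tree
```

As the slides note, a full run on p genes outputs p original features plus p-1 metagenes, one per internal tree node.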
17. 4.2 FEATURE SELECTION: IFFS
How to select the right features to discriminate between classes with an iterative wrapper algorithm
18. IFFS: find the few best features to classify
• "Improved Sequential Floating Forward Selection (IFFS)" [2]:
– A sequential, deterministic wrapper algorithm
• Flexible method: at each iteration it decides whether to Add, Delete or Substitute a feature
• Alternatives are compared by a J(·) score

[2] S. Nakariyakul, D. Casasent, "An improvement on floating search algorithms for feature subset selection", Pattern Recognition.
19. IFFS: find the few best features to classify
A deterministic, sequential wrapper algorithm
• All decisions are driven by a J(·) score
• Usually J(·) is an error rate estimate
– Ties are frequent due to sample scarcity
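A minimal sketch of the Add/Delete/Substitute loop follows. The score function J and the stopping rule are simplified assumptions: the actual algorithm scores subsets with a trained classifier's estimated error (plus the reliability tie-break discussed next), not a toy function.

```python
def iffs(features, J, max_size=10):
    """Sketch of Improved Floating Forward Selection (IFFS).

    features: iterable of candidate feature ids.
    J(subset) -> score to maximize.
    Each iteration Adds the best candidate, then "floats": it checks
    whether Deleting or Substituting a selected feature improves J.
    """
    selected, pool = [], set(features)
    while pool and len(selected) < max_size:
        # Add step: the candidate that maximizes the score
        add = max(pool, key=lambda f: J(selected + [f]))
        if selected and J(selected + [add]) <= J(selected):
            break  # no candidate improves the score: stop
        selected.append(add)
        pool.discard(add)
        improved = True
        while improved and len(selected) > 2:
            improved = False
            # Delete step: drop a selected feature if that helps
            for f in list(selected):
                trial = [g for g in selected if g != f]
                if J(trial) > J(selected):
                    selected, improved = trial, True
                    pool.add(f)
                    break
            # Substitute step: swap a selected feature for a pool one
            for f in list(selected):
                for g in list(pool):
                    trial = [g if h == f else h for h in selected]
                    if J(trial) > J(selected):
                        pool.discard(g)
                        pool.add(f)
                        selected, improved = trial, True
                        break
                else:
                    continue
                break
    return selected
```

With a score that rewards informative features and penalizes subset size, the loop converges on the small informative subset rather than greedily growing.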
20. J(·) score tailored for microarrays
A reliability measure breaks ties in J(·).
The J(·) score depends on 2 quantities:
1. Error rate
2. Reliability
Three rules to combine them into the score:
1. Lexicographic sorting
2. Exponential penalization
3. Linear combination
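The three combination rules can be sketched as below. The exact functional forms, weights and names are illustrative assumptions, not the thesis' definitions; only the ordering behavior (error rate dominates, reliability refines) is taken from the slide:

```python
import math

def j_lexicographic(err, rel):
    # Error rate decides first; reliability only breaks ties.
    # Returned as a tuple so that a larger tuple compares as better.
    return (-err, rel)

def j_exponential(err, rel, k=5.0):
    # Low reliability applies an exponential penalty to the accuracy term
    # (k is an assumed penalty strength).
    return (1.0 - err) * math.exp(-k * (1.0 - rel))

def j_linear(err, rel, w=0.8):
    # Weighted linear combination of accuracy and reliability
    # (w is an assumed weight).
    return w * (1.0 - err) + (1.0 - w) * rel
```

In the lexicographic variant, two subsets with identical estimated error, frequent given the sample scarcity, are separated purely by reliability.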
21. IFFS: Experimental setup
• Datasets from the MAQC study, phase II [4]
• 7 datasets with hundreds of samples
– 30,000+ models evaluated
– Independent validation sets available
– Common evaluation procedure

[4] L. Shi, et al., "The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models", Nature Biotechnology 28 (2010) 827-838.
22. IFFS: experiment objectives
• Evaluate if metagenes are useful
• Benchmark with state of the art
• Comparison following the MAQC standard: Matthews Correlation Coefficient (MCC)

MCC = (TP·TN - FP·FN) / √[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
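In code, the MAQC comparison metric looks like this (the square root in the denominator, lost in the slide extraction, is part of the standard MCC definition):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0  # common convention: MCC = 0 when undefined
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which makes it robust to the class imbalance typical of these datasets.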
24. The proposed framework improves state-of-the-art results
[Bar chart: MCC values 0.423, 0.486, 0.495, 0.490; y-axis from 0.25 to 0.55]
25. Observations
• The proposed framework works thanks to both of its key elements
• Metagenes are useful (contribution #1)
• IFFS adapted to microarrays improves the state of the art (contribution #2)
26. 4.3 FEATURE SELECTION: ENSEMBLE
How to select the right features to discriminate between classes with a novel ensemble learning algorithm
27. Ensemble learning - voting scheme
• Ensembles combine experts with a voting scheme
• One expert for each available feature
– Expert = trained classifier output on the analyzed data
– 1 expert = 1 feature
• Feature selection becomes an expert subset selection problem
28. Accuracy In Diversity (AID) [7]: the original algorithm
• Starts with p experts: one for each feature
• Sequentially removes the expert with the worst error rate on a subset S
• In [6], a simpler version is defined: the Kun algorithm

[6] L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. Wiley, 2004.
[7] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, "A new ensemble diversity measure applied to thinning ensembles", in Multiple Classifier Systems, ser. Lecture Notes in Computer Science, T. Windeatt and F. Roli, Eds., vol. 2709. Springer, 2003, pp. 306-316.
29. Accuracy In Diversity: the original algorithm
• PCDM(i) = % of experts correctly classifying sample i
• S is the set of samples with lb ≤ PCDM ≤ Ub
• The expert with the worst error rate on S is excluded
[Figure: experts voting on samples, with example PCDM values 90%, 50%, 80%, 100%, 100%]
AID: lb = μ·d + (1 - d)/n,  Ub = α·d + μ(1 - d)
Kun: lb = 10%,  Ub = 90%
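The removal loop can be sketched as follows. This uses the fixed Kun bounds (lb = 10%, Ub = 90%) as a simplifying choice; the AID variant would instead compute lb and Ub from the diversity value d as in the formulas above:

```python
import numpy as np

def thin_ensemble(correct, n_keep, lb=0.10, ub=0.90):
    """Kun-style ensemble thinning sketch (fixed bounds lb = 10%, Ub = 90%).

    correct: boolean matrix (experts x samples), True where the expert
    classifies the sample correctly.  Experts are removed one at a time
    according to their error rate on S, the 'uncertain' samples whose
    fraction of correct experts (PCDM) lies between lb and ub.
    """
    experts = list(range(correct.shape[0]))
    while len(experts) > n_keep:
        sub = correct[experts]            # votes of the remaining experts
        pcdm = sub.mean(axis=0)           # PCDM: fraction of correct experts per sample
        S = (pcdm >= lb) & (pcdm <= ub)
        if not S.any():                   # degenerate case: every sample is easy or hopeless
            S[:] = True
        err = 1.0 - sub[:, S].mean(axis=1)  # error rate on the uncertain samples
        experts.pop(int(np.argmax(err)))    # drop the worst expert
    return experts
```

Restricting the error computation to the uncertain samples is the point of the method: samples on which the experts disagree are the ones where removing an expert changes the vote.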
30. Adaptations to microarrays
• Non-expert rule: exclude experts unable to find 2 classes in the training set
• Metagenes: included as experts
• Tie-break rule: the expert higher in the tree is excluded
31. Ensemble: experiment objectives
• Comparison between the AID and Kun ensemble algorithms
• Benchmark against the state of the art
• Comparison following the MAQC standard: Matthews Correlation Coefficient

MCC = (TP·TN - FP·FN) / √[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
32. Ensemble algorithms improve the state of the art
• Both algorithms improve the state of the art
• The simpler Kun algorithm is the best option
[Bar chart: MCC values 0.230, 0.490, 0.495, 0.514, 0.533; y-axis from 0.2 to 0.6]
33. Observations
• Ensemble learning feature selection led to encouraging results
• The proposed ensemble learning improves the state of the art (contribution #3)
• Tailoring the algorithm to the data benefits the results
34. 4.4 KNOWLEDGE INTEGRATION
Introducing prior biological knowledge to improve the metagene generation phase. The aim is to obtain more robust performance and more biologically interpretable gene selections.
35. Integration of external biological data when producing metagenes
[Diagram: the two-step pipeline (Genes → Feature set enhancement → Feature selection → Classifier), where biological knowledge (MSigDB, ...) feeds the feature set enhancement step to produce new metagenes]
36. Objectives of this section
• Define measures to quantify biological similarity
• Develop ways to integrate both sources of information: numerical correlation and biological similarity
• Benchmarking: predictive power | results stability | biological interpretability
37. Distances and merging algorithms
• 4 similarity metrics studied:
Goodall | Smirnov | NoisyOR | Anderberg
• 2 criteria to merge numerical and biological information:
Average | pdf equalization
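A sketch of the two merging criteria. The exact pdf-equalization procedure and the equal weighting are assumptions of this illustration; the idea shown is that equalization remaps the biological similarities onto the distribution of the numerical ones so the two become comparable before merging:

```python
import numpy as np

def pdf_equalize(bio, num):
    """Map biological similarity values onto the empirical distribution of
    the numerical similarities (a histogram-equalization sketch), so that
    both sources share the same scale before being merged."""
    bio, num = np.asarray(bio, float), np.asarray(num, float)
    ranks = np.argsort(np.argsort(bio))       # rank of each biological value
    q = (ranks + 0.5) / len(bio)              # empirical quantiles in (0, 1)
    return np.quantile(num, q)                # matching numerical values

def merge_similarity(bio, num, rule="pdf"):
    """Combine numerical and biological similarities with one of the two
    studied criteria: plain averaging or pdf equalization."""
    bio, num = np.asarray(bio, float), np.asarray(num, float)
    if rule == "average":
        return 0.5 * (bio + num)
    return 0.5 * (pdf_equalize(bio, num) + num)
```

Plain averaging is sensitive to the two sources living on different scales (binary-like database hits vs continuous correlations); equalization removes that mismatch while preserving the biological ordering.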
38. Experimental setup
• 7 MAQC datasets
• 50-run Monte Carlo experiments
• A novel scoring system integrating numerical results and biological analysis tools
39. Comparative scoring system
Predictive performance:
• d = μ / (ε + σ), computed from the MCC values
• Rank alternatives by decreasing d (rank 1 = best)
Biological analysis:
• 4 parallel analysis tools: GSEA | Biograph | Genie | Enrichr
• 4 parallel rankings, averaged into one biological ranking
Final score = average of the predictive and biological ranks
The best algorithm has the smallest final score
40. Predictive power scoring & ranking show G_pdf as the best solution
The smallest final score is the best alternative
[Table: MCC, biological analysis rank, predictive rank and final score for the pdf_equalization and average merging schemes]
41. Compared with the state of the art, G_pdf is confirmed as the best alternative
The smallest final score is the best alternative
[Table: MCC, biological score and final score against state-of-the-art alternatives]
42. Observations about knowledge integration
• Improved results in terms of stability and interpretability
• Goodall similarity with the pdf-equalization scheme is the best way to integrate prior databases
• G_pdf performance is confirmed against state-of-the-art alternatives too (contribution #4)
43. 4.5 MULTICLASS CLASSIFICATION
Study of a novel algorithm for multiclass classification applying coding theory to multiple binary classifiers
44. Multiclass approach combining multiple binary classifiers
• Common methods like One-Against-All (OAA) or One-Against-One (OAO) can be improved
• Information coding has shown good results [119]
• We propose a novel approach based on ECOC (error-correcting output codes) ideas

[119] E. Tapia, L. Ornella, P. Bulacio, and L. Angelone, "Multiclass classification of microarray data samples with a reduced number of genes", BMC Bioinformatics, 2011.
45. Our proposal: OAA+PAA
• Chosen combination of experts:
– OAA = one classifier per class
– PAA = one classifier separating each class pair from the rest
• Expert = one bit in a codeword
• Class estimation by distance to the reference codewords
      h1 h2 h3 h4  h5 h6 h7 h8 h9 h10
c1:    1  0  0  0   1  1  1  0  0  0
c2:    0  1  0  0   1  0  0  1  1  0
c3:    0  0  1  0   0  1  0  1  0  1
c4:    0  0  0  1   0  0  1  0  1  1
(N = 4 classes, M = 10 binary classifiers h1 ... hM: 4 OAA columns followed by 6 PAA columns)
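Building this codebook and decoding by minimum Hamming distance can be sketched as follows (the function names are illustrative; the matrix reproduces the 4-class example above):

```python
import numpy as np
from itertools import combinations

def oaa_paa_codebook(n_classes):
    """Build the OAA+PAA code matrix: one OAA column per class plus one
    PAA column per class pair (both classes of the pair are coded 1)."""
    oaa = np.eye(n_classes, dtype=int)
    pairs = list(combinations(range(n_classes), 2))
    paa = np.zeros((n_classes, len(pairs)), dtype=int)
    for k, (a, b) in enumerate(pairs):
        paa[a, k] = paa[b, k] = 1
    return np.hstack([oaa, paa])

def decode(bits, codebook):
    """Estimate the class whose reference codeword is closest in Hamming
    distance to the concatenated binary classifier outputs."""
    d = np.abs(codebook - np.asarray(bits)).sum(axis=1)
    return int(np.argmin(d))
```

The PAA columns add the redundancy: even if one binary classifier flips its bit, the received word usually remains closer to the correct class codeword than to any other.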
46. Experiments on 7 public datasets
• Binary classifiers trained with Treelets + IFFS
• Compared with OAA, OAO and state-of-the-art alternatives [119]
• 50-run Monte Carlo experiment with 4:1 cross-validation
47. Average accuracy
OAA+PAA is better than OAA, OAO and state-of-the-art alternatives
[Bar chart: average accuracy of OAA, OAO, LDPC [119], OAA [119] and OAA+PAA L1; y-axis from 70% to 85%]
48. Observations about OAA+PAA
• It consistently outperforms the OAA and OAO algorithms
• It obtains better accuracy than the state-of-the-art alternatives from [119]
• OAA+PAA is a valid multiclass algorithm (contribution #5)
50. The two-step approach is the main contribution
• Feature set enhancement
– Addresses lack of structure
– Addresses noise
• Feature selection & classification
– Choose the best variables among the thousands available with new algorithms
51. Validated contributions
• Metagenes are helpful for classification
• The tailored IFFS algorithm improves the state of the art
• The ensemble learning algorithm led to interesting results
• The knowledge integration framework improves interpretability and robustness
• OAA+PAA is a valid multiclass algorithm
52. Publications
Bosio M, Bellot P, Salembier P, Oliveras A. "Gene Expression Data Classification Combining Hierarchical Representation and Efficient Feature Selection". Journal of Biological Systems. 2012;20:349-375.
Bosio M, Bellot P, Salembier P, Oliveras A. "Feature set enhancement via hierarchical clustering for microarray classification". IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2011); 2011. pp. 226-229.
Bosio M, Bellot P, Salembier P, Oliveras A. "Microarray classification with hierarchical data representation and novel feature selection criteria". IEEE 12th International Conference on BioInformatics and BioEngineering. Larnaca, Cyprus; 2012.
Bosio M, Bellot P, Salembier P, Oliveras A. "Multiclass cancer microarray classification algorithm with Pair-Against-All redundancy". 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS'12). Washington, DC, USA; 2012.
Bosio M, Salembier P, Bellot P, Oliveras A. "Hierarchical clustering combining numerical and biological similarities for gene expression data classification". 35th Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'13). Osaka, Japan; 07/2013.
Bosio M, Salembier P, Oliveras A, Bellot P. "Ensemble feature selection and hierarchical data representation for microarray classification". 13th IEEE International Conference on BioInformatics and BioEngineering (BIBE). Chania, Crete; 2013.
53. Future research directions
• Study a better use of the tree structure
• Integrate more information sources
• Deepen the knowledge of ensemble learning
• Study applicability to Next Generation Sequencing analysis or other 'omics' platforms
54. Hierarchical information representation and efficient classification of gene expression microarray data
PhD candidate: Mattia Bosio
Advisors: Philippe Salembier, Albert Oliveras Vergés
Editor's Notes
Specify the A vs B test with an example (tumor vs non-tumor).
Nice to know: even if, from a signal processing point of view, they are just a matrix of numbers, it's useful to know where they come from and what they measure!
Microarrays as a platform to measure the expression of genes in a sample.
They measure thousands of different expressions simultaneously.
Each measure, to simplify, quantifies how much a gene is being used by the cell.
"Being used" means activated or expressed.
Non-expressed means that a gene is not being used by a cell or organism.
Now... why are these important?
The hope with genomic data is to get a picture of all the genes in a cell: which are used more by a tumor, or which are switched off by a tumor.
The idea is that these measurements can help identify relevant genes that change between subgroups (tumor vs non-tumor, for example).
What microarrays actually measure to know gene expression, and why.
Central dogma: DNA - RNA - Protein.
Measuring RNA: a gene being copied to RNA a lot means its protein is needed.
Gene activity is proportional to the measured RNA quantity.
Of course it's not so easy, but that's how it works.
2 classes: say what they are.
Example: tumor vs non-tumor.
Start with the problems. What are they:
Noise
Lack of structure -> we don't know who's really a neighbor, no regularity
Sample scarcity and high dimensionality
Opportunities:
Why signal processing can be used, and why our thesis is in this field.
Microarrays are a useful and used tool for clinical research -> no
Don't say twice that feature selection is needed, just that it is needed anyway.
Multiclass example: multiple classes of a lymphoma, for example, or comparing several tumor classes.
Two-step approach:
Feature set enhancement
Addresses lack of structure
Addresses noise
The aim is to generate new variables with less noise by grouping genes that behave similarly across the samples.
Feature selection & classification
Choose the best variables among the thousands available with new algorithms.
Say that we want to produce a structure where none exists.
Say that we want metagenes to group similarly behaving genes so that we can reduce noise by averaging out similar ones.
Say that the output of this phase will be p genes + (p-1) metagenes.
Describe quickly the iterative process.
Focus on two key aspects:
Similarity metric: decides who gets merged with whom.
Metagene generation rule: decides how the merging is done.
We studied variants for both of the key parameters:
Haar vs PCA
Euclidean vs Treelets
Say how the algorithm itself is iterative, meaning that at each step it actually trains classifiers on the training set and evaluates them in terms of a fitness score J.
It's flexible.
Give something more about reliability:
Defined from the sample distance with respect to the decision boundary.
Then have final slides with formulas.
Say why the ensemble method can be interesting.
Ensemble methods can be interesting because they limit overfitting risk by voting with one-dimensional classifiers.
They have been successfully used in other fields of machine learning, and also in computational biology and bioinformatics.
Say why the process is this one:
The relevant samples are those on which the experts agree the least, so removing one expert affects them more.
Justify the tie-break rule by saying that the highest nodes are less reliable, since they merge more and more genes.
A slide to say that we also studied several Kun variants, and that that number can go up to 0.555?
Or don't we say anything about that?
Say why the ensemble method can be interesting.
Talk here about the data sources of MSigDB, which are high-quality and reliable.
Talk about their form and the challenges it implies for the metagenes (binary to continuous-valued variables).
Say why the ensemble method can be interesting.
Say why:
ECOC is based on redundancy and error-correcting algorithms.
They work in communications.
The same assumptions cannot be made here, and some algorithms don't work the same.
Actually none work the same, otherwise we would have perfect classifiers.
Experiments were done with LDPC coding, with very good results in communications and some improvements in microarray classification.
We want to take the idea of redundancy, but not use random algorithms with no connection to the nature of the phenomenon.
Our bet is that using an easy and reasonable rule to define experts can lead to better results.
Our rule is to use the commonly used OAA and add redundancy by grouping class pairs and separating them from the rest.
Well, the idea is that class pairs are more likely to exist than bigger groups (this can be questionable, but we wanted to try it).
To drive feature selection and eliminate unreliable alternatives early.
Integrate from other sources, with natural language processing for example, or import several data sources.