1. Hierarchical information representation
and efficient classification
of gene expression microarray data
PhD candidate:
Mattia Bosio
Advisors:
Philippe Salembier
Albert Oliveras Vergés
27/06/2014 Mattia Bosio PhD thesis defense 1
5. A platform to measure gene expression
• Give a picture of the whole cellular state
• Thousands of parallel measurements
• Measure how much each gene is being used
• Can be used to discriminate between populations
10. Opportunities
• An established research tool, but no optimal classification algorithm yet
• Machine learning has already been used
– Good results that can be improved
• Signal processing has dealt with similar problems
12. Two-step classification framework
[Diagram: Genes → Feature set enhancement (metagenes) → Feature selection → Classifier; train and validation data feed the pipeline, producing class estimations]
Contributions:
1. Metagenes
2. IFFS
3. Ensemble
4. Knowledge integration
5. Multiclass algorithm
13. 4. HOW DID WE GET THERE?
14. 4.1 FEATURE SET ENHANCEMENT
A structure is inferred from the data and new metagenes are created.
15. Feature set enhancement
Addresses noise and lack of structure
• A binary tree is inferred from the data
• Each node is a new feature
• The new features are called metagenes
• Metagenes reduce noise by clustering similar genes
16. Feature set enhancement
The iterative process of metagene generation
• Iterative process based on Treelets [1]
• The two most similar features are replaced by a new metagene
• Two key elements:
– Similarity metric
– Metagene generation algorithm

[1] A. B. Lee, B. Nadler, L. Wasserman, "Treelets - an adaptive multi-scale basis for sparse unordered data", Annals of Applied Statistics 2 (2) (2008) 435-471.
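The iterative merging loop can be sketched in Python. Here the Pearson-correlation similarity and the plain-averaging merge rule are stand-in assumptions (the thesis studies several variants of both key elements), and the function name is illustrative:

```python
import numpy as np

def build_metagenes(X, n_levels=None):
    """Iteratively merge the two most similar features into a metagene.

    X: samples x features expression matrix.
    Returns the augmented matrix (original genes + metagenes) and the
    list of merged pairs (the inferred binary tree).
    Similarity metric (Pearson correlation) and merging rule (plain
    averaging) are simplifying assumptions of this sketch.
    """
    feats = [X[:, j].copy() for j in range(X.shape[1])]
    active = list(range(len(feats)))          # indices still eligible for merging
    tree = []
    n_levels = n_levels or (len(active) - 1)  # full tree: p-1 metagenes
    for _ in range(n_levels):
        if len(active) < 2:
            break
        # find the most similar active pair
        best, best_pair = -2.0, None
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                i, j = active[a], active[b]
                c = np.corrcoef(feats[i], feats[j])[0, 1]
                if c > best:
                    best, best_pair = c, (i, j)
        i, j = best_pair
        metagene = 0.5 * (feats[i] + feats[j])  # merging rule (assumption)
        feats.append(metagene)
        tree.append((i, j))
        # the merged pair is replaced by the new metagene
        active = [k for k in active if k not in (i, j)] + [len(feats) - 1]
    return np.column_stack(feats), tree
```

As the slides note, a full run on p genes outputs p original features plus p-1 metagenes, one per internal tree node.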
17. 4.2 FEATURE SELECTION: IFFS
How to select the right features to discriminate between classes with an iterative wrapper algorithm
18. IFFS: find the few best features to classify
• "Improved Sequential Floating Forward Selection (IFFS)" [2]:
– A sequential, deterministic wrapper algorithm
• Flexible method: at each iteration it decides whether to Add, Delete or Substitute a feature
• Alternatives are compared by a J(·) score

[2] S. Nakariyakul, D. Casasent, "An improvement on floating search algorithms for feature subset selection", Pattern Recognition.
19. IFFS: find the few best features to classify
A deterministic, sequential wrapper algorithm
• All decisions are driven by a J(·) score
• Usually J(·) is an error rate estimate
– Ties are frequent due to sample scarcity
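A minimal sketch of the Add/Delete/Substitute loop follows. The score function J and the stopping rule are simplified assumptions: the actual algorithm scores subsets with a trained classifier's estimated error (plus the reliability tie-break discussed next), not a toy function.

```python
def iffs(features, J, max_size=10):
    """Sketch of Improved Floating Forward Selection (IFFS).

    features: iterable of candidate feature ids.
    J(subset) -> score to maximize.
    Each iteration Adds the best candidate, then "floats": it checks
    whether Deleting or Substituting a selected feature improves J.
    """
    selected, pool = [], set(features)
    while pool and len(selected) < max_size:
        # Add step: the candidate that maximizes the score
        add = max(pool, key=lambda f: J(selected + [f]))
        if selected and J(selected + [add]) <= J(selected):
            break  # no candidate improves the score: stop
        selected.append(add)
        pool.discard(add)
        improved = True
        while improved and len(selected) > 2:
            improved = False
            # Delete step: drop a selected feature if that helps
            for f in list(selected):
                trial = [g for g in selected if g != f]
                if J(trial) > J(selected):
                    selected, improved = trial, True
                    pool.add(f)
                    break
            # Substitute step: swap a selected feature for a pool one
            for f in list(selected):
                for g in list(pool):
                    trial = [g if h == f else h for h in selected]
                    if J(trial) > J(selected):
                        pool.discard(g)
                        pool.add(f)
                        selected, improved = trial, True
                        break
                else:
                    continue
                break
    return selected
```

With a score that rewards informative features and penalizes subset size, the loop converges on the small informative subset rather than greedily growing.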
20. J(·) score tailored for microarrays
A reliability measure breaks ties in J(·).
The J(·) score depends on 2 quantities:
1. Error rate
2. Reliability
Three rules to combine them into the score:
1. Lexicographic sorting
2. Exponential penalization
3. Linear combination
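The three combination rules can be sketched as below. The exact functional forms, weights and names are illustrative assumptions, not the thesis' definitions; only the ordering behavior (error rate dominates, reliability refines) is taken from the slide:

```python
import math

def j_lexicographic(err, rel):
    # Error rate decides first; reliability only breaks ties.
    # Returned as a tuple so that a larger tuple compares as better.
    return (-err, rel)

def j_exponential(err, rel, k=5.0):
    # Low reliability applies an exponential penalty to the accuracy term
    # (k is an assumed penalty strength).
    return (1.0 - err) * math.exp(-k * (1.0 - rel))

def j_linear(err, rel, w=0.8):
    # Weighted linear combination of accuracy and reliability
    # (w is an assumed weight).
    return w * (1.0 - err) + (1.0 - w) * rel
```

In the lexicographic variant, two subsets with identical estimated error, frequent given the sample scarcity, are separated purely by reliability.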
21. IFFS: Experimental setup
• Datasets from the MAQC study, phase II [4]
• 7 datasets with hundreds of samples
– 30,000+ models evaluated
– Independent validation sets available
– Common evaluation procedure

[4] L. Shi, et al., "The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models", Nature Biotechnology 28 (2010) 827-838.
22. IFFS: experiment objectives
• Evaluate if metagenes are useful
• Benchmark with state of the art
• Comparison following the MAQC standard: Matthews Correlation Coefficient (MCC)

MCC = (TP·TN - FP·FN) / √[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
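In code, the MAQC comparison metric looks like this (the square root in the denominator, lost in the slide extraction, is part of the standard MCC definition):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0  # common convention: MCC = 0 when undefined
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which makes it robust to the class imbalance typical of these datasets.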
24. The proposed framework improves state-of-the-art results
[Bar chart: MCC values 0.423, 0.486, 0.495, 0.490; y-axis from 0.25 to 0.55]
25. Observations
• The proposed framework works thanks to both of its key elements
• Metagenes are useful (contribution #1)
• IFFS adapted to microarrays improves the state of the art (contribution #2)
26. 4.3 FEATURE SELECTION: ENSEMBLE
How to select the right features to discriminate between classes with a novel ensemble learning algorithm
27. Ensemble learning - voting scheme
• Ensembles combine experts with a voting scheme
• One expert for each available feature
– Expert = trained classifier output on the analyzed data
– 1 expert = 1 feature
• Feature selection becomes an expert subset selection problem
28. Accuracy In Diversity (AID) [7]: the original algorithm
• Starts with p experts: one for each feature
• Sequentially removes the expert with the worst error rate on a subset S
• In [6], a simpler version is defined: the Kun algorithm

[6] L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. Wiley, 2004.
[7] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, "A new ensemble diversity measure applied to thinning ensembles", in Multiple Classifier Systems, ser. Lecture Notes in Computer Science, T. Windeatt and F. Roli, Eds., vol. 2709. Springer, 2003, pp. 306-316.
29. Accuracy In Diversity: the original algorithm
• PCDM(i) = % of experts correctly classifying sample i
• S is the set of samples with lb ≤ PCDM ≤ Ub
• The expert with the worst error rate on S is excluded
[Figure: experts voting on samples, with example PCDM values 90%, 50%, 80%, 100%, 100%]
AID: lb = μ·d + (1 - d)/n,  Ub = α·d + μ(1 - d)
Kun: lb = 10%,  Ub = 90%
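The removal loop can be sketched as follows. This uses the fixed Kun bounds (lb = 10%, Ub = 90%) as a simplifying choice; the AID variant would instead compute lb and Ub from the diversity value d as in the formulas above:

```python
import numpy as np

def thin_ensemble(correct, n_keep, lb=0.10, ub=0.90):
    """Kun-style ensemble thinning sketch (fixed bounds lb = 10%, Ub = 90%).

    correct: boolean matrix (experts x samples), True where the expert
    classifies the sample correctly.  Experts are removed one at a time
    according to their error rate on S, the 'uncertain' samples whose
    fraction of correct experts (PCDM) lies between lb and ub.
    """
    experts = list(range(correct.shape[0]))
    while len(experts) > n_keep:
        sub = correct[experts]            # votes of the remaining experts
        pcdm = sub.mean(axis=0)           # PCDM: fraction of correct experts per sample
        S = (pcdm >= lb) & (pcdm <= ub)
        if not S.any():                   # degenerate case: every sample is easy or hopeless
            S[:] = True
        err = 1.0 - sub[:, S].mean(axis=1)  # error rate on the uncertain samples
        experts.pop(int(np.argmax(err)))    # drop the worst expert
    return experts
```

Restricting the error computation to the uncertain samples is the point of the method: samples on which the experts disagree are the ones where removing an expert changes the vote.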
30. Adaptations to microarrays
• Non-expert rule: exclude experts unable to find 2 classes in the training set
• Metagenes: included as experts
• Tie-break rule: the expert higher in the tree is excluded
31. Ensemble: experiment objectives
• Comparison between the AID and Kun ensemble algorithms
• Benchmark against the state of the art
• Comparison following the MAQC standard: Matthews Correlation Coefficient

MCC = (TP·TN - FP·FN) / √[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
32. Ensemble algorithms improve the state of the art
• Both algorithms improve the state of the art
• The simpler Kun algorithm is the best option
[Bar chart: MCC values 0.230, 0.490, 0.495, 0.514, 0.533; y-axis from 0.2 to 0.6]
33. Observations
• Ensemble learning feature selection led to encouraging results
• The proposed ensemble learning improves the state of the art (contribution #3)
• Tailoring the algorithm to the data benefits the results
34. 4.4 KNOWLEDGE INTEGRATION
Introducing prior biological knowledge to improve the metagene generation phase. The aim is to obtain more robust performance and more biologically interpretable gene selections.
35. Integration of external biological data when producing metagenes
[Diagram: the two-step pipeline (Genes → Feature set enhancement → Feature selection → Classifier), where biological knowledge (MSigDB, ...) feeds the feature set enhancement step to produce new metagenes]
36. Objectives of this section
• Define measures to quantify biological similarity
• Develop ways to integrate both sources of information: numerical correlation and biological similarity
• Benchmarking: predictive power | results stability | biological interpretability
37. Distances and merging algorithms
• 4 similarity metrics studied:
Goodall | Smirnov | NoisyOR | Anderberg
• 2 criteria to merge numerical and biological information:
Average | pdf equalization
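A sketch of the two merging criteria. The exact pdf-equalization procedure and the equal weighting are assumptions of this illustration; the idea shown is that equalization remaps the biological similarities onto the distribution of the numerical ones so the two become comparable before merging:

```python
import numpy as np

def pdf_equalize(bio, num):
    """Map biological similarity values onto the empirical distribution of
    the numerical similarities (a histogram-equalization sketch), so that
    both sources share the same scale before being merged."""
    bio, num = np.asarray(bio, float), np.asarray(num, float)
    ranks = np.argsort(np.argsort(bio))       # rank of each biological value
    q = (ranks + 0.5) / len(bio)              # empirical quantiles in (0, 1)
    return np.quantile(num, q)                # matching numerical values

def merge_similarity(bio, num, rule="pdf"):
    """Combine numerical and biological similarities with one of the two
    studied criteria: plain averaging or pdf equalization."""
    bio, num = np.asarray(bio, float), np.asarray(num, float)
    if rule == "average":
        return 0.5 * (bio + num)
    return 0.5 * (pdf_equalize(bio, num) + num)
```

Plain averaging is sensitive to the two sources living on different scales (binary-like database hits vs continuous correlations); equalization removes that mismatch while preserving the biological ordering.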
38. Experimental setup
• 7 MAQC datasets
• 50-run Monte Carlo experiments
• A novel scoring system integrating numerical results and biological analysis tools
39. Comparative scoring system
Predictive performance:
• d = μ / (ε + σ), computed from the MCC values
• Rank alternatives by decreasing d (rank 1 = best)
Biological analysis:
• 4 parallel analysis tools: GSEA | Biograph | Genie | Enrichr
• 4 parallel rankings, averaged into one biological ranking
Final score = average of the predictive and biological ranks
The best algorithm has the smallest final score
40. Predictive power scoring & ranking show G_pdf as the best solution
The smallest final score is the best alternative
[Table: MCC, biological analysis rank, predictive rank and final score for the pdf_equalization and average merging schemes]
41. Compared with the state of the art, G_pdf is confirmed as the best alternative
The smallest final score is the best alternative
[Table: MCC, biological score and final score against state-of-the-art alternatives]
42. Observations about knowledge integration
• Improved results in terms of stability and interpretability
• Goodall similarity with the pdf-equalization scheme is the best way to integrate prior databases
• G_pdf performance is confirmed against state-of-the-art alternatives too (contribution #4)
43. 4.5 MULTICLASS CLASSIFICATION
Study of a novel algorithm for multiclass classification applying coding theory to multiple binary classifiers
44. Multiclass approach combining multiple binary classifiers
• Common methods like One-Against-All (OAA) or One-Against-One (OAO) can be improved
• Information coding has shown good results [119]
• We propose a novel approach based on ECOC (error-correcting output codes) ideas

[119] E. Tapia, L. Ornella, P. Bulacio, and L. Angelone, "Multiclass classification of microarray data samples with a reduced number of genes", BMC Bioinformatics, 2011.
45. Our proposal: OAA+PAA
• Chosen combination of experts:
– OAA = one classifier per class
– PAA = one classifier separating each class pair from the rest
• Expert = one bit in a codeword
• Class estimation by distance to the reference codewords
      h1 h2 h3 h4  h5 h6 h7 h8 h9 h10
c1:    1  0  0  0   1  1  1  0  0  0
c2:    0  1  0  0   1  0  0  1  1  0
c3:    0  0  1  0   0  1  0  1  0  1
c4:    0  0  0  1   0  0  1  0  1  1
(N = 4 classes, M = 10 binary classifiers h1 ... hM: 4 OAA columns followed by 6 PAA columns)
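Building this codebook and decoding by minimum Hamming distance can be sketched as follows (the function names are illustrative; the matrix reproduces the 4-class example above):

```python
import numpy as np
from itertools import combinations

def oaa_paa_codebook(n_classes):
    """Build the OAA+PAA code matrix: one OAA column per class plus one
    PAA column per class pair (both classes of the pair are coded 1)."""
    oaa = np.eye(n_classes, dtype=int)
    pairs = list(combinations(range(n_classes), 2))
    paa = np.zeros((n_classes, len(pairs)), dtype=int)
    for k, (a, b) in enumerate(pairs):
        paa[a, k] = paa[b, k] = 1
    return np.hstack([oaa, paa])

def decode(bits, codebook):
    """Estimate the class whose reference codeword is closest in Hamming
    distance to the concatenated binary classifier outputs."""
    d = np.abs(codebook - np.asarray(bits)).sum(axis=1)
    return int(np.argmin(d))
```

The PAA columns add the redundancy: even if one binary classifier flips its bit, the received word usually remains closer to the correct class codeword than to any other.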
46. Experiments on 7 public datasets
• Binary classifiers trained with Treelets + IFFS
• Compared with OAA, OAO and state-of-the-art alternatives [119]
• 50-run Monte Carlo experiment with 4:1 cross-validation
47. Average accuracy
OAA+PAA is better than OAA, OAO and state-of-the-art alternatives
[Bar chart: average accuracy of OAA, OAO, LDPC [119], OAA [119] and OAA+PAA L1; y-axis from 70% to 85%]
48. Observations about OAA+PAA
• It consistently outperforms the OAA and OAO algorithms
• It obtains better accuracy than the state-of-the-art alternatives from [119]
• OAA+PAA is a valid multiclass algorithm (contribution #5)
50. The two-step approach is the main contribution
• Feature set enhancement
– Addresses lack of structure
– Addresses noise
• Feature selection & classification
– Choose the best variables among the thousands available with new algorithms
51. Validated contributions
• Metagenes are helpful for classification
• The tailored IFFS algorithm improves the state of the art
• The ensemble learning algorithm led to interesting results
• The knowledge integration framework improves interpretability and robustness
• OAA+PAA is a valid multiclass algorithm
52. Publications
Bosio M, Bellot P, Salembier P, Oliveras A. "Gene Expression Data Classification Combining Hierarchical Representation and Efficient Feature Selection". Journal of Biological Systems. 2012;20:349-375.
Bosio M, Bellot P, Salembier P, Oliveras A. "Feature set enhancement via hierarchical clustering for microarray classification". IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2011); 2011. pp. 226-229.
Bosio M, Bellot P, Salembier P, Oliveras A. "Microarray classification with hierarchical data representation and novel feature selection criteria". IEEE 12th International Conference on BioInformatics and BioEngineering. Larnaca, Cyprus; 2012.
Bosio M, Bellot P, Salembier P, Oliveras A. "Multiclass cancer microarray classification algorithm with Pair-Against-All redundancy". 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS'12). Washington, DC, USA; 2012.
Bosio M, Salembier P, Bellot P, Oliveras A. "Hierarchical clustering combining numerical and biological similarities for gene expression data classification". 35th Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'13). Osaka, Japan; 07/2013.
Bosio M, Salembier P, Oliveras A, Bellot P. "Ensemble feature selection and hierarchical data representation for microarray classification". 13th IEEE International Conference on BioInformatics and BioEngineering (BIBE). Chania, Crete; 2013.
53. Future research directions
• Study a better use of the tree structure
• Integrate more information sources
• Deepen the knowledge of ensemble learning
• Study applicability to Next Generation Sequencing analysis or other 'omics' platforms
54. Hierarchical information representation and efficient classification of gene expression microarray data
PhD candidate: Mattia Bosio
Advisors: Philippe Salembier, Albert Oliveras Vergés
Editor's Notes
Specify the A vs B test with an example (tumor vs non-tumor).
Nice to know: even if, from a signal processing point of view, they are just a matrix of numbers, it's useful to know where they come from and what they measure!
Microarrays as a platform to measure the expression of genes in a sample.
They measure thousands of different expressions simultaneously.
Each measure, to simplify, quantifies how much a gene is being used by the cell.
"Being used" means activated or expressed.
Non-expressed means that a gene is not being used by a cell or organism.
Now... why are these important?
The hope with genomic data is to get a picture of all the genes in a cell: which are used more by a tumor, or which are switched off by a tumor.
The idea is that these measurements can help identify relevant genes that change between subgroups (tumor vs non-tumor, for example).
What microarrays actually measure to know gene expression, and why.
Central dogma: DNA - RNA - Protein.
Measuring RNA: a gene being copied to RNA a lot means its protein is needed.
Gene activity is proportional to the measured RNA quantity.
Of course it's not so easy, but that's how it works.
2 classes: say what they are.
Example: tumor vs non-tumor.
Start with the problems. What are they:
Noise
Lack of structure -> we don't know who's really a neighbor, no regularity
Sample scarcity and high dimensionality
Opportunities:
Why signal processing can be used, and why our thesis is in this field.
Microarrays are a useful and used tool for clinical research -> no
Don't say twice that feature selection is needed, just that it is needed anyway.
Multiclass example: multiple classes of a lymphoma, for example, or comparing several tumor classes.
Two-step approach:
Feature set enhancement
Addresses lack of structure
Addresses noise
The aim is to generate new variables with less noise by grouping genes that behave similarly across the samples.
Feature selection & classification
Choose the best variables among the thousands available with new algorithms.
Say that we want to produce a structure where none exists.
Say that we want metagenes to group similarly behaving genes so that we can reduce noise by averaging out similar ones.
Say that the output of this phase will be p genes + (p-1) metagenes.
Describe quickly the iterative process.
Focus on two key aspects:
Similarity metric: decides who gets merged with whom.
Metagene generation rule: decides how the merging is done.
We studied variants for both of the key parameters:
Haar vs PCA
Euclidean vs Treelets
Say how the algorithm itself is iterative, meaning that at each step it actually trains classifiers on the training set and evaluates them in terms of a fitness score J.
It's flexible.
Give something more about reliability:
Defined from the sample distance with respect to the decision boundary.
Then have final slides with formulas.
Say why the ensemble method can be interesting.
Ensemble methods can be interesting because they limit overfitting risk by voting with one-dimensional classifiers.
They have been successfully used in other fields of machine learning, and also in computational biology and bioinformatics.
Say why the process is this one:
The relevant samples are those on which the experts agree the least, so removing one expert affects them more.
Justify the tie-break rule by saying that the highest nodes are less reliable, since they merge more and more genes.
A slide to say that we also studied several Kun variants, and that that number can go up to 0.555?
Or don't we say anything about that?
Say why the ensemble method can be interesting.
Talk here about the data sources of MSigDB, which are high-quality and reliable.
Talk about their form and the challenges it implies for the metagenes (binary to continuous-valued variables).
Say why the ensemble method can be interesting.
Say why:
ECOC is based on redundancy and error-correcting algorithms.
They work in communications.
The same assumptions cannot be made here, and some algorithms don't work the same.
Actually none work the same, otherwise we would have perfect classifiers.
Experiments were done with LDPC coding, with very good results in communications and some improvements in microarray classification.
We want to take the idea of redundancy, but not use random algorithms with no connection to the nature of the phenomenon.
Our bet is that using an easy and reasonable rule to define experts can lead to better results.
Our rule is to use the commonly used OAA and add redundancy by grouping class pairs and separating them from the rest.
Well, the idea is that class pairs are more likely to exist than bigger groups (this can be questionable, but we wanted to try it).
To drive feature selection and eliminate unreliable alternatives early.
Integrate from other sources, with natural language processing for example, or import several data sources.