A presentation of several biomedical applications of prototype-based machine learning and relevance learning. Invited talk at the AlCoB conference 2017 in Aveiro/Portugal.
June 2017: Biomedical applications of prototype-based classifiers and relevance learning
1. Michael Biehl Intelligent Systems
Johann Bernoulli Institute for
Mathematics and Computing Science
University of Groningen / NL
Biomedical applications of prototype-based
classifiers and relevance learning
www.cs.rug.nl/~biehl
Introduction: prototype-based classification, relevance learning
Generalized Matrix Relevance LVQ
Illustration: three bio-medical applications
2. AlCoB, June 2017, Aveiro / Portugal
supervised learning
classification / regression / prediction
based on labeled example data
generic workflow:
example data → (training) → model → (working) → apply to novel data
obvious performance measures: overall / class-wise accuracy
ROC, Precision Recall ...
validation
estimate working performance
set parameters of model / training
compare different models
accuracy is not enough - interpretable “white-box” systems
example: prototype-based models, distance-based classifiers
3. AlCoB, June 2017, Aveiro / Portugal
distance-based classifiers
a simple distance-based system: (K) NN classifier
• store a set of labeled examples
• classify a query according to the
label of the Nearest Neighbor
(or the majority of K NN)
• piece-wise linear decision
boundaries according
to (e.g.) Euclidean distance
from all examples
?
N-dim. feature space
+ conceptually simple,
+ no training phase
+ only one parameter (K)
- expensive (storage, computation)
- sensitive to mislabeled data
- overly complex decision boundaries
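A minimal sketch of the (K) NN scheme listed above, in plain Python. The toy data and the name `knn_classify` are made up for illustration; Euclidean distance is assumed, as on the slide.

```python
import math
from collections import Counter

def knn_classify(query, examples, labels, k=1):
    """Label `query` by majority vote among its k nearest stored examples."""
    # Euclidean distance from the query to every stored example
    dists = [math.dist(query, x) for x in examples]
    # indices of the k closest examples
    nearest = sorted(range(len(examples)), key=lambda i: dists[i])[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# toy data (hypothetical): two classes in a 2-dim. feature space
examples = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.2)]
labels = ["a", "a", "b", "b"]
print(knn_classify((0.1, 0.3), examples, labels, k=3))
```

Every query requires a distance to all stored examples, which is the storage and computation cost noted in the list of drawbacks.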
4. AlCoB, June 2017, Aveiro / Portugal
prototype-based classification
• represent the data by one or
several prototypes per class
• classify a query according to the
label of the nearest prototype
(or alternative schemes)
• local decision boundaries acc.
to (e.g.) Euclidean distances
+ parameterization in feature space, interpretability
+ robust, low storage needs,
little computational effort
- model selection: number of prototypes per class, etc.
requires training: placement of prototypes in feature space
N-dim. feature space
Learning Vector Quantization [Kohonen, 1990]
5. AlCoB, June 2017, Aveiro / Portugal
∙ identification of prototype vectors from labeled example data
∙ distance based classification (e.g. Euclidean)
Learning Vector Quantization
N-dimensional data, feature vectors
competitive learning: LVQ1 [Kohonen, 1990]
• initialize prototype vectors
for different classes
• present a single example
• identify the winner
(closest prototype)
• move the winner
- closer towards the data (same class)
- away from the data (different class)
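The LVQ1 steps above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the talk: the learning rate `eta`, the toy data, the initial prototypes and the function names are all hypothetical.

```python
import math

def winner(prototypes, x):
    """Index of the prototype closest to x (Euclidean distance)."""
    return min(range(len(prototypes)), key=lambda i: math.dist(prototypes[i], x))

def lvq1_step(prototypes, proto_labels, x, y, eta=0.1):
    """One LVQ1 update for a single labeled example (x, y)."""
    j = winner(prototypes, x)
    # attract the winner if its label matches, repel it otherwise
    sign = 1.0 if proto_labels[j] == y else -1.0
    prototypes[j] = [w + sign * eta * (xi - w) for w, xi in zip(prototypes[j], x)]
    return j

def classify(prototypes, proto_labels, x):
    """Nearest prototype classification."""
    return proto_labels[winner(prototypes, x)]

# toy data (hypothetical): two well separated classes
data = [((0.0, 0.1), 0), ((0.2, 0.0), 0), ((1.0, 0.9), 1), ((0.8, 1.1), 1)]
prototypes = [[0.4, 0.4], [0.6, 0.6]]   # one prototype per class
proto_labels = [0, 1]
for epoch in range(20):
    for x, y in data:
        lvq1_step(prototypes, proto_labels, x, y)
```

After training, `classify(prototypes, proto_labels, x)` only needs the two prototypes, not the stored examples.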
6. AlCoB, June 2017, Aveiro / Portugal
Learning Vector Quantization
N-dimensional data, feature vectors
∙ tessellation of feature space
[piece-wise linear]
∙ distance-based classification
[here: Euclidean distances]
∙ generalization ability
correct classification of new data
∙ aim: discrimination of classes
( ≠ vector quantization
or density estimation )
7. AlCoB, June 2017, Aveiro / Portugal
cost function based LVQ
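one example: Generalized LVQ (GLVQ) cost function [Sato&Yamada, 1995]
two winning prototypes for each example $x^\mu$:
$w_J$: closest prototype with the correct label, distance $d_J^\mu$
$w_K$: closest prototype with a different label, distance $d_K^\mu$
minimize $E = \sum_\mu \Phi\left( \frac{d_J^\mu - d_K^\mu}{d_J^\mu + d_K^\mu} \right)$
E favors
- small number of misclassifications, e.g. with a sigmoidal $\Phi$
- large margins between classes
- small $d_J$, large $d_K$
- class-typical prototypes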
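A small sketch of how the GLVQ cost can be evaluated for given prototypes. It assumes the standard form of the cost, a sum over examples of Φ((d_J - d_K) / (d_J + d_K)) with d_J the distance to the closest correct prototype and d_K the distance to the closest wrong one, and squared Euclidean distances. The identity is used for Φ by default; function names and data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def glvq_cost(data, prototypes, proto_labels, phi=lambda z: z):
    """Sum over all examples of phi((d_J - d_K) / (d_J + d_K))."""
    total = 0.0
    for x, y in data:
        # d_J: closest prototype carrying the correct label
        d_J = min(sq_dist(x, w) for w, c in zip(prototypes, proto_labels) if c == y)
        # d_K: closest prototype carrying any other label
        d_K = min(sq_dist(x, w) for w, c in zip(prototypes, proto_labels) if c != y)
        # the argument lies in [-1, 1]; negative means correctly classified
        total += phi((d_J - d_K) / (d_J + d_K))
    return total

# toy setting (hypothetical): both examples are classified correctly
data = [((0.0, 0.0), 0), ((1.0, 1.0), 1)]
prototypes = [(0.0, 0.5), (1.0, 0.5)]
proto_labels = [0, 1]
cost = glvq_cost(data, prototypes, proto_labels)
```

Because the cost is differentiable in the prototypes, it can be minimized by gradient-based methods; the gradient step itself is not shown here.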
8. AlCoB, June 2017, Aveiro / Portugal
LVQ distance measures
? key question: appropriate distance / (dis-) similarity measure
fixed, pre-defined distance measures:
(G)LVQ can be formulated for general (differentiable) distances
examples: Minkowski distances (p≠2), correlation based,
statistical divergences, ... not necessarily metrics!
standard work-flow
- consider several distance measures according to prior knowledge
- compare performances in, e.g., cross-validation
elegant approach: Relevance Learning / adaptive distances
- employ parameterized distance measure
- optimize in the data-driven training process (cost function!)
10. AlCoB, June 2017, Aveiro / Portugal
Generalized Matrix Relevance LVQ (GMLVQ):
generalized quadratic distance in LVQ:
$d(w, x) = (w - x)^\top \Lambda \, (w - x) = \left[ \Omega \, (w - x) \right]^2$ with $\Lambda = \Omega^\top \Omega$
[Schneider, Biehl, Hammer, 2009]
training: adaptation of prototypes
and distance measure guided by
GLVQ cost function
variants:
one global, several local, class-wise relevance matrices
rectangular $\Omega$: low-dim. representation / visualization
[Bunte et al., 2012]
diagonal matrices: single feature weights [Hammer et al., 2002]
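A sketch of the generalized quadratic distance, assuming the usual GMLVQ parameterization Λ = ΩᵀΩ so that d(w, x) = (w - x)ᵀ Λ (w - x) = |Ω (w - x)|². The function name and the example matrices are hypothetical; Ω is given as a list of rows.

```python
def gmlvq_distance(x, w, omega):
    """Squared length of Omega (x - w); non-negative by construction."""
    diff = [xi - wi for xi, wi in zip(x, w)]
    # linear map of the difference vector, one output component per row of Omega
    projected = [sum(o * d for o, d in zip(row, diff)) for row in omega]
    return sum(p * p for p in projected)

identity = [[1.0, 0.0], [0.0, 1.0]]   # recovers the squared Euclidean distance
rectangular = [[1.0, 1.0]]            # 1 x 2 matrix: projection to one dimension
d_euclid = gmlvq_distance((1.0, 2.0), (0.0, 0.0), identity)
d_lowdim = gmlvq_distance((1.0, 2.0), (0.0, 0.0), rectangular)
```

A rectangular Ω with fewer rows than features maps the data to a low-dimensional space, which is what the visualization variant exploits.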
11. AlCoB, June 2017, Aveiro / Portugal
interpretation
after training:
prototypes represent typical class properties or subtypes
Relevance Matrix
diagonal element $\Lambda_{ii} = \sum_j \Omega_{ji}^2$ summarizes
• the contribution of a single dimension
• the relevance of original features in the classifier
off-diagonal element $\Lambda_{ij}$ quantifies the contribution of the pair
of features (i,j) to the distance
Note: interpretation assumes implicitly that
features have equal order of magnitude
e.g. after z-score-transformation → $\langle x_i \rangle = 0$, $\langle x_i^2 \rangle = 1$
(averages over data set)
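To make the interpretation concrete, the sketch below builds the relevance matrix Λ = ΩᵀΩ from a small, made-up Ω and ranks the features by the diagonal elements Λ_ii. The parameterization and all names are assumptions for illustration, and the ranking is only meaningful for features of comparable magnitude (e.g. z-scored).

```python
def relevance_matrix(omega):
    """Lambda[i][j] = sum_k Omega[k][i] * Omega[k][j], i.e. Omega^T Omega."""
    n = len(omega[0])
    return [[sum(row[i] * row[j] for row in omega) for j in range(n)]
            for i in range(n)]

def rank_features(lam):
    """Feature indices sorted by decreasing diagonal relevance Lambda_ii."""
    return sorted(range(len(lam)), key=lambda i: lam[i][i], reverse=True)

omega = [[1.0, 0.0, 0.5],
         [0.0, 0.2, 0.5]]          # hypothetical 2 x 3 matrix after training
lam = relevance_matrix(omega)
ranking = rank_features(lam)       # most relevant feature first
pair_02 = lam[0][2]                # joint contribution of features 0 and 2
```

The matrix is symmetric, so `lam[i][j]` and `lam[j][i]` carry the same pairwise information.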
12. AlCoB, June 2017, Aveiro / Portugal
three application examples
I) steroid metabolomics:
- detection of malignancy in adrenocortical tumors
based on urinary steroid metabolite excretion
GMLVQ: ~ 150 samples, 32-dim. feature vectors
II) cytokine expression data:
- diagnosis of (early) rheumatoid arthritis
based on synovial tissue samples
~ 50 samples represented by 117 cytokine expressions
in synovial tissue, PCA+GMLVQ combined
III) gene expression data:
- recurrence risk prediction from tumor samples
~ 400 samples, ~20000 dim. feature space
outlier analysis + GMLVQ on (80) pre-selected genes
13. Steroid metabolomics: detecting
malignancy in adrenocortical tumors
www.ensat.org
W. Arlt, M. Biehl, A. Taylor, S. Hahner, R. Libé, B. Hughes, P. Schneider,
D. Smith, H. Stiekema, N. Krone, E. Porfiri, G. Opocher, J. Bertherat,
F. Mantero, B. Allolio, M. Terzolo, P. Nightingale, C. Shackleton,
X. Bertagna, M.Fassnacht, P. Stewart
Urine Steroid Metabolomics as a Biomarker Tool for Detecting
Malignancy in Patients with Adrenal Tumors
J Clinical Endocrinology & Metabolism 96: 3775-3784 (2011)
14. AlCoB, June 2017, Aveiro / Portugal
www.ensat.org
classification of adrenocortical tumors (adenoma vs. carcinoma)
based on steroid hormone excretion profiles
benign ACA malignant ACC
features: 32 steroid metabolite excretion values
non-invasive measurement (24 hrs. urine samples)
steroid metabolomics
aim: develop a novel biomarker tool for differential diagnosis
idea: identify characteristic steroid profiles (prototypes)
15. AlCoB, June 2017, Aveiro / Portugal
Generalized Matrix LVQ , ACC vs. ACA classification
∙ data divided in 90% training, 10% test set, (z-score transformed)
∙ determine prototypes
typical profiles (1 per class)
∙ apply classifier to test data
evaluate performance (error rates, ROC)
∙ adaptive generalized quadratic distance measure
parameterized by the relevance matrix $\Lambda$
∙ repeat and average over many random splits
[Arlt et al., 2011]
[Biehl et al., 2012]
steroid metabolomics
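The validation protocol above (random 90% / 10% splits, z-score transformation) could be organized along the following lines. This is a sketch only: the slides do not say whether means and standard deviations were taken from the training part or from the full data set, so estimating them on the training rows is an assumption, as are all function names. The classifier itself is omitted.

```python
import random
import statistics

def random_split(n, test_fraction=0.1, rng=random):
    """Shuffle the indices 0..n-1 and return (train_indices, test_indices)."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = max(1, round(test_fraction * n))
    return idx[n_test:], idx[:n_test]

def zscore_fit(rows):
    """Per-feature mean and standard deviation of the given rows."""
    cols = list(zip(*rows))
    return ([statistics.fmean(c) for c in cols],
            [statistics.pstdev(c) for c in cols])

def zscore_apply(rows, means, stds):
    """Transform rows feature-wise; assumes no feature has zero spread."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

# one split of 150 samples; in practice this is repeated and averaged
train_idx, test_idx = random_split(150, 0.1, random.Random(0))
means, stds = zscore_fit([[1.0, 10.0], [3.0, 30.0]])
scaled = zscore_apply([[1.0, 10.0], [3.0, 30.0]], means, stds)
```

Error rates or ROC curves would then be computed on the test part of each split and averaged over the runs.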
16. AlCoB, June 2017, Aveiro / Portugal
prototypes: steroid excretion in ACA/ACC
ACA
ACC
steroid metabolomics
17. AlCoB, June 2017, Aveiro / Portugal
subset of selected steroids ↔ technical realization (patented, UoB)
using 9 markers only, similar ROC
Relevance matrix
relevance of single markers … of pairs of markers
frequency of markers to be among top 9
steroid metabolomics
18. AlCoB, June 2017, Aveiro / Portugal
ROC characteristics
ROC curves (sensitivity vs. 1-specificity), AUC:
Euclidean 0.87
diagonal rel. 0.93
full matrix 0.97
clear improvement due to
adaptive distances
90% / 10% randomized
splits of the data in
training and test set
averages over 1000 runs
steroid metabolomics
19. AlCoB, June 2017, Aveiro / Portugal
Relevance matrix: diagonal elements and off-diagonal elements
discriminative marker (ACA vs. ACC),
e.g. steroid 19 (THS)
steroid metabolomics
20. AlCoB, June 2017, Aveiro / Portugal
highly discriminative
combination of markers!
weakly discriminative markers
5a-THA (8)
TH-Doc (12)
steroid metabolomics
21. AlCoB, June 2017, Aveiro / Portugal
adrenocortical tumors
ROC curves (sensitivity vs. 1-specificity), AUC:
Euclidean 0.87
diagonal rel. (GRLVQ) 0.93
full matrix (GMLVQ) 0.97
22. AlCoB, June 2017, Aveiro / Portugal
visualization of the data set
ACA
ACC
generic property: relevance matrix becomes highly singular
23. AlCoB, June 2017, Aveiro / Portugal
• monitoring of patients after surgery and/or under medication
aim: recurrence detection / prediction
work in progress
• high-throughput LC/MS assay to replace GC/MS
• other disorders affecting / related to steroid metabolism
• identification of tumor subtypes ?
• on-going prospective study w.r.t. ~ 2000 patients
24. Early diagnosis of Rheumatoid Arthritis
Expression of chemokines CXCL4 and CXCL7 by synovial
macrophages defines an early stage of rheumatoid arthritis
Annals of the Rheumatic Diseases 75:763-771 (2016)
L. Yeo, N. Adlard, M. Biehl, M. Juarez, M. Snow
C.D. Buckley, A. Filer, K. Raza, D. Scheel-Toellner
25. AlCoB, June 2017, Aveiro / Portugal
rheumatoid arthritis (RA)
patient groups: uninflamed control, established RA,
early inflammation (resolving / early RA)
cytokine based diagnosis of RA
at earliest possible stage ?
ultimate goals:
understand pathogenesis and
mechanism of progression
27. AlCoB, June 2017, Aveiro / Portugal
GMLVQ analysis
pre-processing:
• log-transformed expression values
• 21 leading principal components explain 95% of the variation
Two two-class problems: (A) established RA vs. uninflamed controls
(B) early RA vs. resolving inflammation
• 1 prototype per class, global relevance matrix, distance measure: $d(w, x) = (w - x)^\top \Lambda \, (w - x)$
• leave-two-out validation (one from each class)
evaluation in terms of Receiver Operating Characteristics
28. AlCoB, June 2017, Aveiro / Portugal
Matrix Relevance LVQ
(A) established RA vs. uninflamed control
(B) early RA vs. resolving inflammation
for each problem: leave-one-out ROC (true positive rate vs. false positive rate)
and diagonal relevances Λii vs. cytokine index i
29. AlCoB, June 2017, Aveiro / Portugal
CXCL4 chemokine (C-X-C motif) ligand 4
CXCL7 chemokine (C-X-C motif) ligand 7
direct study on protein level, staining / imaging of synovial tissue:
macrophages : predominant source of CXCL4/7 expression
protein level studies
• high levels of CXCL4 and
CXCL7 in early RA
• expression on macrophages
outside of blood vessels
discriminates
early RA / resolving cases
30. AlCoB, June 2017, Aveiro / Portugal
relevant cytokines
(A) established RA vs. uninflamed control
(B) early RA vs. resolving inflammation
for each problem: leave-one-out ROC (true positive rate vs. false positive rate)
and diagonal relevances Λii vs. cytokine index i
highlighted: macrophage stimulating 1
31. AlCoB, June 2017, Aveiro / Portugal
work in progress
• more samples (difficult...) needed in order
to obtain a reliable early diagnosis
• integrated analysis of gene expression and other data
from the same / an analogous patient cohort
32. Predicting Recurrence in Clear Cell
Renal Cell Carcinoma
Analysis of TCGA data using Outlier Analysis and GMLVQ
Gargi Mukherjee … Rutgers University, New Jersey
Kevin Raines … Stanford University, California
Srikanth Sastry … JNC, Bengaluru, India
Sebastian Doniach … Stanford University, California
Gyan Bhanot … Rutgers University, New Jersey
Michael Biehl … University of Groningen, The Netherlands
In: Proc. IEEE Congress on Evolutionary Computation CEC 2016
33. AlCoB, June 2017, Aveiro / Portugal
clear cell Renal Cell Carcinoma (ccRCC)
publicly available datasets:
The Cancer Genome Atlas (TCGA) cancergenome.nih.gov
also hosted at Broad Institute gdac.broadinstitute.org
data
34. AlCoB, June 2017, Aveiro / Portugal
data
clear cell renal cell carcinoma
TCGA data @ Broad Institute
mRNA-Seq expression data X, 20532 genes
normalized, log-transformed:
Y=log(1+X)
65 normal samples
65 matched tumor samples
469 tumor samples in total
recurrence data: number of recurrences
vs. days after diagnosis
35. AlCoB, June 2017, Aveiro / Portugal
randomized split:
380 training samples → outlier analysis
89 test samples
fast forward to
machine learning
analysis
36. AlCoB, June 2017, Aveiro / Portugal
380
training
samples
outlier analysis
per gene:
determine
mean μ, standard deviation σ of Y
for each gene: identify outlier samples
Y > μ + σ “high outlier”
Y < μ - σ “low outlier”
restrict the following analysis to genes with
≥ 20 high outlier samples
or ≥ 20 low outlier samples
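The per-gene outlier rule can be sketched as below. The slide does not state whether σ is the population or the sample standard deviation; the population version (`statistics.pstdev`) is an assumption here, and the function names and toy values are hypothetical.

```python
import statistics

def outlier_status(values):
    """Per sample: +1 if Y > mu + sigma, -1 if Y < mu - sigma, else 0."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [1 if y > mu + sigma else -1 if y < mu - sigma else 0
            for y in values]

def keep_gene(values, min_outliers=20):
    """Keep a gene with enough high outliers or enough low outliers."""
    status = outlier_status(values)
    return status.count(1) >= min_outliers or status.count(-1) >= min_outliers

# toy expression values Y of one gene over five samples (made up)
y_gene = [0.0, 0.0, 0.0, 0.0, 10.0]
status = outlier_status(y_gene)
```

Applied to each of the genes over the 380 training samples, `keep_gene` corresponds to the restriction to genes with at least 20 high or at least 20 low outlier samples.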
37. AlCoB, June 2017, Aveiro / Portugal
outlier analysis
Kaplan-Meier (KM) analysis per gene:
test for significant association of outlier status of samples with
recurrence
1546 “high-outlier genes”
with KM log rank p < 0.001
1628 “low-outlier genes”
with KM log rank p < 0.0005
construct two binary outlier matrices
1546 genes × 380 samples: “1” for high-outlier samples, “0” else
1628 genes × 380 samples: “1” for low-outlier samples, “0” else
PCA
38. AlCoB, June 2017, Aveiro / Portugal
outlier analysis
PCA reveals
four clusters of genes
high outlier genes: A (1475 genes), B (71 genes)
low outlier genes: C (1402 genes), D (226 genes)
genes in small clusters (B,D):
outlier status associated
with late recurrence
genes in large clusters (A,C):
outlier status associated
with early recurrence
39. AlCoB, June 2017, Aveiro / Portugal
recurrence risk score
top 20 genes (by KM p-value) from each cluster A,B,C,D
reference set of 80 genes
for each sample:
- determine outlier status w.r.t. the 80 genes (Y > μ + σ or Y < μ - σ)
- add up contributions per gene:
  - 1 if sample is outlier w.r.t. a gene in A or C (early rec.)
    0 if sample is not an outlier w.r.t. the gene
  + 1 if sample is outlier w.r.t. a gene in B or D (late rec.)
recurrence risk score: - 40 ≤ R ≤ + 40
observe: median = 2 over the 380 training samples
crisp classification w.r.t. recurrence risk:
high risk (early recurrence) if R < 2
low risk (late recurrence) if R ≥ 2
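The scoring rule above can be written down in a few lines. The sketch assumes that the outlier status of a sample with respect to each reference gene has already been determined; the dictionaries, gene names and function names are hypothetical, and the threshold 2 is the training-set median quoted on the slide.

```python
def risk_score(is_outlier, cluster):
    """Sum of per-gene contributions: -1 (clusters A, C), +1 (clusters B, D)."""
    score = 0
    for gene, flag in is_outlier.items():
        if not flag:
            continue                      # contribution 0: not an outlier
        score += -1 if cluster[gene] in ("A", "C") else 1
    return score

def risk_group(score, threshold=2):
    """Crisp classification: R < threshold means high risk (early recurrence)."""
    return "high risk" if score < threshold else "low risk"

# toy panel with one gene per cluster instead of 20 (made up)
cluster = {"g1": "A", "g2": "B", "g3": "C", "g4": "D"}
all_outliers = {"g1": True, "g2": True, "g3": True, "g4": True}
late_only = {"g1": False, "g2": True, "g3": False, "g4": True}
r_all = risk_score(all_outliers, cluster)
r_late = risk_score(late_only, cluster)
```

With 20 genes per cluster the score ranges from -40 to +40, matching the bounds given above.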
40. AlCoB, June 2017, Aveiro / Portugal
recurrence risk prediction
KM plots with respect to high / low risk groups:
training set (380 samples): log rank p < 1.e-16
test set (89 samples): log rank p < 1.e-4
• risk score R is predictive of the actual recurrence risk
• the 80 selected genes can serve as a prognostic panel
41. AlCoB, June 2017, Aveiro / Portugal
extreme case analysis
2 classes, defined by the time of recurrence:
class 1, low risk: 107 samples, > 5 years (late or no recurrence)
class 2, high risk: 109 samples, ≤ 2 years (early)
in between: (undefined)
• 80-dim. feature vectors
outlier analysis yields 4 groups (A,B,C,D) of 20
pre-selected genes associated with late/early recurrence
42. AlCoB, June 2017, Aveiro / Portugal
GMLVQ classifier
• one prototype vector per class
• adaptive distance for comparison of samples and prototypes: $d(w, x) = (w - x)^\top \Lambda \, (w - x)$
plots over the gene groups A B C D:
diagonal elements of Λ
components of … (low expression | high expression)
43. AlCoB, June 2017, Aveiro / Portugal
GMLVQ classifier
ROC of GMLVQ classifier (Leave-One-Out of the 216 extreme samples)
KM plot w.r.t. all 469 samples
(L-1-O for 216 samples, plus 253 undefined)
log rank p < 1.e-7
44. AlCoB, June 2017, Aveiro / Portugal
the set of 80 genes is also diagnostic:
• GMLVQ separates normal from tumor cells (close to) perfectly
• PCA of corresponding gene expressions:
65 normal samples
105 low risk samples (late rec.)
109 high risk samples (early rec.)
gradient from normal to high risk:
diagnostics?
45. AlCoB, June 2017, Aveiro / Portugal
12 most relevant genes
from GMLVQ classifier
most relevant genes (GMLVQ)
46. AlCoB, June 2017, Aveiro / Portugal
remarks and open questions
• GMLVQ suggests an even smaller panel of genes (12?)
identify a minimum panel for diagnostics and prognostics
• 80 genes do not necessarily reflect biological mechanisms
compare, e.g., with known pathways / modules of genes
• prospective studies
• more direct, multivariate identification of relevant genes by
dimension reduction + GMLVQ with back-transform
47. AlCoB, June 2017, Aveiro / Portugal
conclusion
prototype- and distance based systems:
- intuitive, transparent, interpretable
- classification, regression, unsupervised learning, visualization ...
- relevance learning: further insight into data and problem
- suitable for a variety of bio-medical problems
a recent review:
M. Biehl, B. Hammer, T. Villmann
Prototype-based models in Machine Learning
Advanced Review in: WIRES Cognitive Science 7(2): 92-111 (2016)
48. AlCoB, June 2017, Aveiro / Portugal
links
Matlab code:
Relevance and Matrix adaptation in Learning Vector
Quantization (GRLVQ, GMLVQ and LiRaM LVQ):
http://matlabserver.cs.rug.nl/gmlvqweb/web/
Pre- and re-prints etc.:
http://www.cs.rug.nl/~biehl/
A no-nonsense beginners’ tool for GMLVQ:
http://www.cs.rug.nl/~biehl/gmlvq
(see also: Tutorial, Thursday 9:30)
49. AlCoB, June 2017, Aveiro / Portugal
thanks
Barbara Hammer, Thomas Villmann, Wiebke Arlt, Dagmar Scheel-Toellner,
Petra Schneider, Kerstin Bunte, Gyan Bhanot