June 2017: Biomedical applications of prototype-based classifiers and relevance learning

Biomedical applications of prototype-based
classifiers and relevance learning
Michael Biehl
Intelligent Systems, Johann Bernoulli Institute for
Mathematics and Computing Science
University of Groningen / NL
www.cs.rug.nl/~biehl
Outline:
• Introduction: prototype-based classification, relevance learning
• Generalized Matrix Relevance LVQ
• Illustration: three bio-medical applications

AlCoB, June 2017, Aveiro / Portugal
supervised learning
classification / regression / prediction
based on labeled example data
generic workflow:
example data → (training) → model → (working) → apply to novel data
obvious performance measures: overall / class-wise accuracy
ROC, Precision Recall ...
validation
estimate working performance
set parameters of model / training
compare different models
accuracy is not enough - interpretable “white-box” systems
example: prototype-based models, distance-based classifiers

distance-based classifiers
a simple distance-based system: (K) NN classifier
• store a set of labeled examples
• classify a query according to the
label of the Nearest Neighbor
(or the majority of K NN)
• piece-wise linear decision
boundaries according
to (e.g.) Euclidean distance
from all examples
[figure: query point “?” in N-dim. feature space]
+ conceptually simple,
+ no training phase
+ only one parameter (K)
- expensive (storage, computation)
- sensitive to mislabeled data
- overly complex decision boundaries
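The (K) NN rule above can be sketched in a few lines. This is a minimal illustration under assumptions, not code from the talk: the function name `knn_classify`, the toy data and the tie-breaking rule are all hypothetical.

```python
import math
from collections import Counter

def knn_classify(query, examples, labels, k=1):
    """Label of the majority among the k stored examples closest to `query`."""
    # Euclidean distance from the query to every stored example
    dists = [math.dist(query, x) for x in examples]
    # indices of the k nearest examples
    nearest = sorted(range(len(examples)), key=dists.__getitem__)[:k]
    # majority vote among their labels (ties go to the label counted first)
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# toy data (hypothetical): two classes in a 2-dim. feature space
examples = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.2)]
labels = ["A", "A", "B", "B"]
print(knn_classify((0.1, 0.0), examples, labels, k=1))
print(knn_classify((0.8, 0.9), examples, labels, k=3))
```

Every query requires a distance to every stored example, which illustrates the storage and computation cost listed above.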

prototype-based classification
• represent the data by one or
several prototypes per class
• classify a query according to the
label of the nearest prototype
(or alternative schemes)
• local decision boundaries acc.
to (e.g.) Euclidean distances
+ robust, low storage needs,
little computational effort
+ parameterization in feature space, interpretability
- model selection: number of prototypes per class, etc.
- requires training: placement of prototypes in feature space
[figure: query point “?” and prototypes in N-dim. feature space]
Learning Vector Quantization [Kohonen, 1990]

∙ identification of prototype vectors from labeled example data
∙ distance based classification (e.g. Euclidean)
Learning Vector Quantization
N-dimensional data, feature vectors
• initialize prototype vectors
for different classes
competitive learning: LVQ1 [Kohonen, 1990]
• present a single example
• identify the winner
(closest prototype)
• move the winner
- closer towards the data (same class)
- away from the data (different class)
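A single LVQ1 step as described above might look as follows. This is a sketch under assumptions: the learning rate `eta`, the function names and the toy prototypes are hypothetical, and Euclidean distance is used to pick the winner.

```python
import math

def nearest_prototype(x, prototypes):
    """Index of the prototype closest to x (Euclidean distance)."""
    return min(range(len(prototypes)), key=lambda j: math.dist(x, prototypes[j]))

def lvq1_step(x, y, prototypes, proto_labels, eta=0.1):
    """One LVQ1 update for a single labeled example (x, y); modifies prototypes in place."""
    j = nearest_prototype(x, prototypes)          # identify the winner
    sign = 1.0 if proto_labels[j] == y else -1.0  # attract if same class, repel otherwise
    prototypes[j] = [w + sign * eta * (xi - w) for w, xi in zip(prototypes[j], x)]
    return j

# toy setting (hypothetical): one prototype per class
prototypes = [[0.0, 0.0], [1.0, 1.0]]
proto_labels = ["A", "B"]
lvq1_step([0.2, 0.0], "A", prototypes, proto_labels, eta=0.5)  # winner 0 moves closer
lvq1_step([0.8, 1.0], "A", prototypes, proto_labels, eta=0.5)  # winner 1 moves away
print(prototypes)
```

After training, a query is classified with `nearest_prototype` and the label of the returned prototype.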

Learning Vector Quantization
∙ tessellation of feature space
[piece-wise linear]
∙ distance-based classification
[here: Euclidean distances]
∙ generalization ability
correct classification of new data
∙ aim: discrimination of classes
( ≠ vector quantization
or density estimation )

cost function based LVQ
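one example: Generalized LVQ (GLVQ) cost function [Sato&Yamada, 1995]
two winning prototypes:
$w_J$ (closest prototype with the correct label), $w_K$ (closest prototype with a different label)
minimize
$$E = \sum_{\mu=1}^{P} \Phi(e^\mu), \qquad e^\mu = \frac{d(w_J, x^\mu) - d(w_K, x^\mu)}{d(w_J, x^\mu) + d(w_K, x^\mu)}$$
E favors
- small number of misclassifications, e.g. with a sigmoidal $\Phi$
- large margins between classes
- small $d(w_J, x^\mu)$, large $d(w_K, x^\mu)$
- class-typical prototypes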
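The contribution of one example to a GLVQ-type cost can be evaluated as sketched below, assuming the standard form Φ((d_J - d_K) / (d_J + d_K)), with d_J the distance to the closest prototype of the correct class and d_K the distance to the closest prototype of any other class. The function names, the squared Euclidean distance and the steepness `gamma` are assumptions for illustration.

```python
import math

def glvq_term(x, y, prototypes, proto_labels, phi=lambda z: z):
    """Contribution of one example to the GLVQ cost: phi((dJ - dK) / (dJ + dK))."""
    def sq_dist(w):
        # squared Euclidean distance between prototype w and example x
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    # closest prototype carrying the correct label
    d_J = min(sq_dist(w) for w, c in zip(prototypes, proto_labels) if c == y)
    # closest prototype carrying any other label
    d_K = min(sq_dist(w) for w, c in zip(prototypes, proto_labels) if c != y)
    return phi((d_J - d_K) / (d_J + d_K))

def sigmoid(z, gamma=5.0):
    """One possible sigmoidal choice for phi (gamma is a hypothetical steepness)."""
    return 1.0 / (1.0 + math.exp(-gamma * z))

prototypes = [[0.0, 0.0], [1.0, 1.0]]
proto_labels = ["A", "B"]
term_correct = glvq_term([0.2, 0.0], "A", prototypes, proto_labels)
term_wrong = glvq_term([0.2, 0.0], "B", prototypes, proto_labels)
print(term_correct, term_wrong)
```

The argument of Φ lies between -1 and +1 and is negative exactly when the example is classified correctly, so its magnitude acts like a margin.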

LVQ distance measures
? key question: appropriate distance / (dis-) similarity measure
fixed, pre-defined distance measures:
(G)LVQ can be formulated for general (differentiable) distances
examples: Minkowski distances (p≠2), correlation based,
statistical divergences, ... not necessarily metrics!
standard work-flow
- consider several distance measures according to prior knowledge
- compare performances in, e.g., cross-validation
elegant approach: Relevance Learning / adaptive distances
- employ parameterized distance measure
- optimize in the data-driven training process (cost function!)

Generalized Matrix Relevance LVQ (GMLVQ)
generalized quadratic distance in LVQ:
[Schneider, Biehl, Hammer, 2009]
$$d_\Lambda(w, x) = (w - x)^\top \Lambda \, (w - x), \qquad \Lambda = \Omega^\top \Omega$$
variants:
- one global, several local, class-wise relevance matrices
- rectangular $\Omega$: low-dim. representation / visualization
[Bunte et al., 2012]
- diagonal matrices: single feature weights [Hammer et al., 2002]
training: adaptation of prototypes
and distance measure guided by
GLVQ cost function
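The generalized quadratic distance of matrix relevance LVQ, d(w, x) = (w - x)^T Λ (w - x) with Λ = Ω^T Ω, can be evaluated without forming Λ explicitly. The sketch below is an illustration only: the helper names and the example matrices are hypothetical, and no normalization of Λ is applied.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def gmlvq_distance(w, x, omega):
    """(w - x)^T Lambda (w - x) with Lambda = Omega^T Omega, computed as |Omega (w - x)|^2."""
    diff = [wi - xi for wi, xi in zip(w, x)]
    return sum(p * p for p in matvec(omega, diff))

def relevance_matrix(omega):
    """Lambda = Omega^T Omega, symmetric and positive semi-definite by construction."""
    n = len(omega[0])
    return [[sum(row[i] * row[j] for row in omega) for j in range(n)] for i in range(n)]

omega = [[1.0, 0.5], [0.0, 1.0]]          # square Omega (hypothetical values)
print(relevance_matrix(omega))
print(gmlvq_distance([0.0, 2.0], [0.0, 0.0], omega))

omega_rect = [[1.0, 1.0]]                 # rectangular Omega: projection onto 1 dimension
print(gmlvq_distance([1.0, -1.0], [0.0, 0.0], omega_rect))
```

Writing Λ as Ω^T Ω guarantees non-negative distances. With a rectangular Ω, differences outside the projected subspace do not contribute, as the last call shows.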

Relevance Matrix: interpretation
after training:
prototypes represent typical class properties or subtypes
diagonal element $\Lambda_{ii} = \sum_j \Omega_{ji}^2$ (for $\Lambda = \Omega^\top \Omega$) summarizes
• the contribution of a single dimension
• the relevance of original features in the classifier
off-diagonal element $\Lambda_{ij}$ quantifies the contribution of the pair
of features (i,j) to the distance
Note: interpretation implicitly assumes that
features have equal order of magnitude,
e.g. after z-score transformation → zero mean, unit variance per feature
(averages over data set)
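Two small helpers illustrate the points above: a z-score transformation that puts all features on a comparable scale, and the diagonal relevances Λ_ii = Σ_j Ω_ji² read off from a matrix Ω. Function names and toy numbers are hypothetical, and the population standard deviation is an assumption.

```python
import statistics

def zscore_columns(data):
    """Rescale every feature (column) to zero mean and unit variance over the data set."""
    cols = list(zip(*data))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]   # population std; constant features would fail
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in data]

def diagonal_relevances(omega):
    """Lambda_ii = sum_j Omega_ji^2, i.e. the diagonal of Omega^T Omega."""
    return [sum(row[i] ** 2 for row in omega) for i in range(len(omega[0]))]

# two features of very different magnitude (hypothetical values)
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
z = zscore_columns(data)
print(z)
print(diagonal_relevances([[1.0, 0.5], [0.0, 1.0]]))
```

Without such a rescaling, a feature measured in large units could receive a small relevance simply because its raw differences are already large.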

three application examples
I) steroid metabolomics:
- detection of malignancy in adrenocortical tumors
based on urinary steroid metabolite excretion
GMLVQ: ~ 150 samples, 32-dim. feature vectors
II) cytokine expression data:
- diagnosis of (early) rheumatoid arthritis
based on synovial tissue samples
~ 50 samples represented by 117 cytokine expressions
in synovial tissue, PCA+GMLVQ combined
III) gene expression data:
- recurrence risk prediction from tumor samples
~ 400 samples, ~20000 dim. feature space
outlier analysis + GMLVQ on (80) pre-selected genes

Steroid metabolomics: detecting
malignancy in adrenocortical tumors
www.ensat.org
W. Arlt, M. Biehl, A. Taylor, S. Hahner, R. Libé, B. Hughes, P. Schneider,
D. Smith, H. Stiekema, N. Krone, E. Porfiri, G. Opocher, J. Bertherat,
F. Mantero, B. Allolio, M. Terzolo, P. Nightingale, C. Shackleton,
X. Bertagna, M.Fassnacht, P. Stewart
Urine Steroid Metabolomics as a Biomarker Tool for Detecting
Malignancy in Patients with Adrenal Tumors
J Clinical Endocrinology & Metabolism 96: 3775-3784 (2011)

www.ensat.org
classification of adrenocortical tumors (adenoma vs. carcinoma)
based on steroid hormone excretion profiles
benign ACA malignant ACC
features: 32 steroid metabolite excretion values
non-invasive measurement (24 hrs. urine samples)
steroid metabolomics
aim: develop a novel biomarker tool for differential diagnosis
idea: identify characteristic steroid profiles (prototypes)

Generalized Matrix LVQ, ACC vs. ACA classification
∙ data divided into 90% training, 10% test set (z-score transformed)
∙ determine prototypes:
typical profiles (1 per class)
∙ adaptive generalized quadratic distance measure
parameterized by the relevance matrix $\Lambda$
∙ apply classifier to test data,
evaluate performance (error rates, ROC)
∙ repeat and average over many random splits
[Arlt et al., 2011]
[Biehl et al., 2012]
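The validation scheme (randomized 90% / 10% splits, performance in terms of ROC) can be sketched with two helpers. This is an illustration under assumptions: the function names are hypothetical, the AUC is computed as the fraction of correctly ordered positive/negative pairs, and the training of the classifier itself is left out.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve: probability that a positive outscores a negative (ties count 1/2)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def random_split(n_samples, test_fraction=0.1, rng=random):
    """Indices of a randomized training / test split."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_test = max(1, round(test_fraction * n_samples))
    return idx[n_test:], idx[:n_test]

# one split of 150 samples; a full validation would repeat this many times and average
train_idx, test_idx = random_split(150, test_fraction=0.1, rng=random.Random(0))
print(len(train_idx), len(test_idx))
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```

In a loop over many splits, one would train on `train_idx`, score the samples in `test_idx` and average the resulting performance measures.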

prototypes: steroid excretion in ACA/ACC
[figure: prototype profiles for ACA and ACC]

Relevance matrix
[figure: relevance of single markers; … of pairs of markers]
frequency of markers to be among top 9
subset of selected steroids ↔ technical realization (patented, UoB)
using 9 markers only, similar ROC

ROC characteristics
[figure: ROC curves, sensitivity vs. (1-specificity)]
90% / 10% randomized splits of the data into
training and test set, averages over 1000 runs
AUC:
Euclidean      0.87
diagonal rel.  0.93
full matrix    0.97
clear improvement due to adaptive distances

Relevance matrix
[figure: diagonal and off-diagonal elements; ACA vs. ACC]
discriminative, e.g. steroid 19 (THS)

weakly discriminative markers:
5a-THA (8)
TH-Doc (12)
highly discriminative
combination of markers!

adrenocortical tumors
[figure: ROC curves, sensitivity vs. (1-specificity)]
AUC:
Euclidean              0.87
GRLVQ (diagonal rel.)  0.93
GMLVQ (full matrix)    0.97

visualization of the data set
[figure: ACA and ACC samples]
generic property: relevance matrix becomes highly singular

work in progress
• monitoring of patients after surgery and/or under medication
aim: recurrence detection / prediction
• high-throughput LC/MS assay to replace GC/MS
• other disorders affecting / related to steroid metabolism
• identification of tumor subtypes ?
• on-going prospective study w.r.t. ~ 2000 patients

Early diagnosis of Rheumatoid Arthritis
Expression of chemokines CXCL4 and CXCL7 by synovial
macrophages defines an early stage of rheumatoid arthritis
Annals of the Rheumatic Diseases 75:763-771 (2016)
L. Yeo, N. Adlard, M. Biehl, M. Juarez, M. Snow
C.D. Buckley, A. Filer, K. Raza, D. Scheel-Toellner

rheumatoid arthritis (RA)
patient groups: uninflamed control, established RA,
early inflammation (resolving / early RA)
cytokine based diagnosis of RA
at earliest possible stage ?
ultimate goals:
understand pathogenesis and
mechanism of progression

synovial tissue cytokine expression
synovium → tissue section → mRNA extraction → real-time PCR
IL1A IL17F FASL CXCL4 CCL15 TGFB1 KITLG
IL1B IL18 CD70 CXCL5 CCL16 TGFB2 MST1
IL1RN IL19 CD30L CXCL6 CCL17 TGFB3 SPP1
IL2 IL20 4-1BB-L CXCL7 CCL18 EGF SFRP1
IL3 IL21 TRAIL CXCL9 CCL19 FGF2 ANXA1
IL4 IL22 RANKL CXCL10 CCL20 TGFA TNFRSF13B
IL5 IL23A TWEAK CXCL11 CCL21 IGF2 IL6R
IL6 IL24 APRIL CXCL12 CCL22 VEGFA NAMPT
IL7 IL25 BAFF CXCL13 CCL23 VEGFB C1QTNF3
IL8 IL26 LIGHT CXCL14 CCL24 MIF VCAM1
IL9 IL27 TL1A CXCL16 CCL25 LIF LGALS1
IL10 IL28A GITRL CCL1 CCL26 OSM LGALS9
IL11 IL29 FASLG CCL2 CCL27 ADIPOQ LGALS3
IL12A IL32 IFNA1 CCL3 CCL28 LEP LGALS12
IL12B IL33 IFNA2 CCL4 XCL1 GHRL
IL13 LTA IFNB1 CCL5 XCL2 RETN
IL14 TNF IFNG CCL7 CX3CL1 CTLA4
IL15 LTB CXCL1 CCL8 CSF1 EPO
IL16 OX40L CXCL2 CCL11 CSF2 TPO
IL17A CD40L CXCL3 CCL13 CSF3 FLT3LG
panel of 117 cytokines
• cell signaling proteins
• regulate immune response
• produced by, e.g.
T-cells, macrophages,
lymphocytes, fibroblasts, etc.

GMLVQ analysis
pre-processing:
• log-transformed expression values
• 21 leading principal components explain 95% of the variation
two two-class problems:
(A) established RA vs. uninflamed controls
(B) early RA vs. resolving inflammation
• 1 prototype per class, global relevance matrix, distance measure:
$d(w, x) = (w - x)^\top \Lambda \, (w - x)$
• leave-two-out validation (one from each class)
evaluation in terms of Receiver Operating Characteristics
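A leave-two-out scheme that holds out one sample from each class can be enumerated as below. This is a sketch under assumptions: the generator name and the toy labels are hypothetical, and whether all pairs or only a random subset of pairs were used in the study is not stated in the slides.

```python
from itertools import product

def leave_two_out_splits(labels):
    """Yield (train_indices, test_indices) with one held-out sample from each of two classes."""
    classes = sorted(set(labels))
    assert len(classes) == 2, "sketch assumes a two-class problem"
    idx_a = [i for i, l in enumerate(labels) if l == classes[0]]
    idx_b = [i for i, l in enumerate(labels) if l == classes[1]]
    # every combination of one sample per class forms a test pair
    for a, b in product(idx_a, idx_b):
        train = [i for i in range(len(labels)) if i not in (a, b)]
        yield train, [a, b]

labels = ["RA", "RA", "control", "control", "control"]   # toy labels (hypothetical)
splits = list(leave_two_out_splits(labels))
print(len(splits))   # 2 * 3 held-out pairs
```

Holding out one sample per class keeps the class balance of the training set the same in every split, which matters for very small data sets.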

Matrix Relevance LVQ
[figure: leave-one-out ROC (true positive rate vs. false positive rate)
and diagonal relevances Λii vs. cytokine index i, for
(A) established RA vs. uninflamed control
(B) early RA vs. resolving inflammation]

protein level studies
CXCL4 chemokine (C-X-C motif) ligand 4
CXCL7 chemokine (C-X-C motif) ligand 7
direct study on protein level, staining / imaging of synovial tissue:
macrophages: predominant source of CXCL4/7 expression
• high levels of CXCL4 and
CXCL7 in early RA
• expression on macrophages
outside of blood vessels
discriminates
early RA / resolving cases

relevant cytokines
[figure: leave-one-out ROC (true positive rate vs. false positive rate)
and diagonal relevances Λii vs. cytokine index i, for
(A) established RA vs. uninflamed control
(B) early RA vs. resolving inflammation;
annotated: macrophage stimulating 1]

work in progress
• more samples (difficult...) needed in order
to obtain a reliable early diagnosis
• integrated analysis of gene expression and other data
from the same / an analogous patient cohort

Predicting Recurrence in Clear Cell
Renal Cell Carcinoma
Analysis of TCGA data using Outlier Analysis and GMLVQ
Gargi Mukherjee … Rutgers University, New Jersey
Kevin Raines … Stanford University, California
Srikanth Sastry … JNC, Bengaluru, India
Sebastian Doniach … Stanford University, California
Gyan Bhanot … Rutgers University, New Jersey
Michael Biehl … University of Groningen, The Netherlands
In: Proc. IEEE Congress on Evolutionary Computation CEC 2016

data
clear cell Renal Cell Carcinoma (ccRCC)
publicly available datasets:
The Cancer Genome Atlas (TCGA) cancergenome.nih.gov
also hosted at Broad Institute gdac.broadinstitute.org

data
clear cell renal cell carcinoma
TCGA data @ Broad Institute
mRNA-Seq expression data X
normalized, log-transformed:
Y=log(1+X)
20532 genes
65 normal samples
65 matched tumor samples (65 + 65 matched)
469 tumor samples in total
recurrence data:
[figure: number of recurrences vs. days after diagnosis]

randomized split:
380 training samples → outlier analysis
89 test samples
(fast forward to machine learning analysis)

outlier analysis (380 training samples)
per gene: determine
mean μ, standard deviation σ of Y
for each gene: identify outlier samples
Y > μ + σ “high outlier”
Y < μ - σ “low outlier”
restrict the following analysis to genes with
≥ 20 high outlier samples
or ≥ 20 low outlier samples
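The per-gene outlier rule can be sketched as follows. This is an illustration under assumptions: the function names and toy expression values are hypothetical, the natural logarithm is assumed for Y = log(1 + X), and the population standard deviation is used for σ.

```python
import math
import statistics

def outlier_status(expression):
    """Per gene: +1 for high outliers (Y > mu + sigma), -1 for low outliers (Y < mu - sigma), else 0."""
    y = [math.log(1.0 + x) for x in expression]        # Y = log(1 + X), natural log assumed
    mu, sigma = statistics.fmean(y), statistics.pstdev(y)
    return [1 if v > mu + sigma else -1 if v < mu - sigma else 0 for v in y]

def keep_gene(status, min_outliers=20):
    """Retain a gene with at least `min_outliers` high or low outlier samples."""
    return status.count(1) >= min_outliers or status.count(-1) >= min_outliers

# expression of one gene across 8 samples (hypothetical values)
status = outlier_status([0, 1, 1, 1, 1, 1, 1, 100])
print(status)
print(keep_gene(status), keep_gene(status, min_outliers=1))
```

Applied to every gene, the +1 and -1 entries give the high-outlier and low-outlier indicators, respectively.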

outlier analysis
construct two binary outlier matrices (genes × 380 samples):
- “1” for high-outlier samples, “0” else
- “1” for low-outlier samples, “0” else
Kaplan-Meier (KM) analysis per gene:
test for significant association of outlier status of samples with
recurrence
- 1546 “high-outlier genes” with KM log rank p < 0.001
- 1628 “low-outlier genes” with KM log rank p < 0.0005
→ PCA of both matrices (1546 genes × 380 samples, 1628 genes × 380 samples)
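For orientation, a Kaplan-Meier (product-limit) estimate of the recurrence-free fraction can be computed as sketched below. This shows only the estimator, not the log rank test used to compare outlier and non-outlier groups. The function name and the toy data are hypothetical; `events` marks observed recurrences (1) versus censored samples (0).

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, S(time)) at each distinct event time."""
    at_risk = len(times)
    surv, curve = 1.0, []
    for t in sorted(set(times)):
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)   # recurrences at time t
        n_t = sum(1 for ti in times if ti == t)                        # samples leaving the risk set
        if d:
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= n_t
    return curve

# toy data (hypothetical): times in arbitrary units, 1 = recurrence, 0 = censored
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print(curve)
```

Computing one such curve for the outlier samples of a gene and one for the remaining samples gives the two groups that a log rank test would compare.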

outlier analysis
PCA reveals
four clusters of genes (A, B, C, D):
high outlier genes: clusters of 71 and 1475 genes
low outlier genes: clusters of 226 and 1402 genes
genes in small clusters (B,D):
outlier status associated
with late recurrence
genes in large clusters (A,C):
outlier status associated
with early recurrence

recurrence risk score
top 20 genes (by KM p-value) from each cluster A,B,C,D
reference set of 80 genes
for each sample:
- determine outlier status w.r.t. the 80 genes (Y > μ + σ or Y < μ - σ)
- add up contributions per gene:
-1 if sample is outlier w.r.t. a gene in A or C (early rec.)
 0 if sample is not an outlier w.r.t. the gene
+1 if sample is outlier w.r.t. a gene in B or D (late rec.)
recurrence risk score: -40 ≤ R ≤ +40
observe: median = 2 over the 380 training samples
crisp classification w.r.t. recurrence risk:
high risk (early recurrence) if R < 2
low risk (late recurrence) if R ≥ 2
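The score and the crisp classification can be written down directly. The sketch below is an illustration under assumptions: gene identifiers, the dictionary layout and the function names are hypothetical, and `status` is taken to flag, per gene, whether the sample is an outlier of the kind relevant for that gene.

```python
def risk_score(status, cluster_of_gene):
    """R = (# outlier genes in B or D) - (# outlier genes in A or C) for one sample."""
    score = 0
    for gene, is_outlier in status.items():
        if not is_outlier:
            continue                       # non-outlier genes contribute 0
        score += -1 if cluster_of_gene[gene] in ("A", "C") else 1
    return score

def risk_class(score, threshold=2):
    """Crisp assignment: high risk (early recurrence) below the threshold, low risk otherwise."""
    return "high" if score < threshold else "low"

# four placeholder genes instead of the 80 reference genes
cluster_of_gene = {"g1": "A", "g2": "B", "g3": "C", "g4": "D"}
status = {"g1": True, "g2": False, "g3": True, "g4": True}
R = risk_score(status, cluster_of_gene)
print(R, risk_class(R))
```

With 40 genes in A or C and 40 in B or D, the score is bounded by -40 and +40. The default threshold 2 is the training-set median quoted above.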

recurrence risk prediction
KM plots with respect to high / low risk groups:
training set (380 samples): log rank p < 1.e-16
test set (89 samples): log rank p < 1.e-4
• risk score R is predictive of the actual recurrence risk
• the 80 selected genes can serve as a prognostic panel

extreme case analysis
2 classes, defined by time to recurrence:
• class 2, high risk: ≤ 2 years (early), 109 samples
• class 1, low risk: > 5 years (late or no recurrence), 107 samples
• in between: (undefined)
[figure: number of recurrences over time]
• 80-dim. feature vectors:
outlier analysis yields 4 groups (A,B,C,D) of 20
pre-selected genes associated with late/early recurrence
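The assignment of extreme cases might be coded as below. This is a sketch under loudly stated assumptions: the slides give only the thresholds (≤ 2 years, > 5 years), so the treatment of censored samples, the function name and the conversion from days to years are hypothetical.

```python
def extreme_class(time_days, recurred, days_per_year=365.25):
    """Class label for the extreme case analysis, or None for undefined samples."""
    years = time_days / days_per_year
    if recurred and years <= 2:
        return 2          # early recurrence: high risk
    if years > 5:
        return 1          # late recurrence, or recurrence-free beyond 5 years: low risk
    return None           # everything else is left undefined

print(extreme_class(400, True), extreme_class(2000, False), extreme_class(1000, True))
```

Only samples with a label of 1 or 2 would enter the two-class training set. The undefined samples can still be scored by the trained classifier afterwards.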

GMLVQ classifier
• one prototype vector per class
• adaptive distance for comparison of samples and prototypes:
$d(w, x) = (w - x)^\top \Lambda \, (w - x)$
[figure: prototype components (low expression | high expression)
and diagonal elements of Λ, grouped by gene clusters A, B, C, D]

GMLVQ classifier
ROC of GMLVQ classifier (Leave-One-Out of the 216 extreme samples)
KM plot w.r.t. all 469 samples
(L-1-O for 216 samples, plus 253 undefined)
log rank p < 1.e-7

the set of 80 genes is also diagnostic:
• GMLVQ separates normal from tumor cells (close to) perfectly
• PCA of corresponding gene expressions:
65 normal samples
105 low risk samples (late rec.)
109 high risk samples (early rec.)
gradient from normal to high risk:
diagnostics?

most relevant genes (GMLVQ)
[figure: 12 most relevant genes from GMLVQ classifier]

remarks and open questions
• GMLVQ suggests an even smaller panel of genes (12?)
identify a minimum panel for diagnostics and prognostics
• 80 genes do not necessarily reflect biological mechanisms
compare, e.g., with known pathways / modules of genes
• prospective studies
• more direct, multivariate identification of relevant genes by
dimension reduction + GMLVQ with back-transform

conclusion
prototype- and distance based systems:
- intuitive, transparent, interpretable
- classification, regression, unsupervised learning, visualization ...
- relevance learning: further insight into data and problem
- suitable for a variety of bio-medical problems
a recent review:
M. Biehl, B. Hammer, T. Villmann
Prototype-based models in Machine Learning
Advanced Review in: WIRES Cognitive Science 7(2): 92-111 (2016)

links
Pre- and re-prints etc.:
http://www.cs.rug.nl/~biehl/
Matlab code:
Relevance and Matrix adaptation in Learning Vector
Quantization (GRLVQ, GMLVQ and LiRaM LVQ):
http://matlabserver.cs.rug.nl/gmlvqweb/web/
A no-nonsense beginners’ tool for GMLVQ:
http://www.cs.rug.nl/~biehl/gmlvq
(see also: Tutorial, Thursday 9:30)

thanks
Barbara Hammer, Thomas Villmann, Wiebke Arlt, Dagmar Scheel-Toellner,
Petra Schneider, Kerstin Bunte, Gyan Bhanot
