The sbv IMPROVER species translation challenge
Sometimes you can trust a rat

Sahand Hormoz, Adel Dayarian (KITP, UC Santa Barbara)
Gyan Bhanot (Rutgers Univ.)
Michael Biehl (University of Groningen, Johann Bernoulli Institute)
www.cs.rug.nl/biehl | m.biehl@rug.nl
Winning the rat race 2
sbv IMPROVER species translation challenge
systems biology verification combined with industrial methodology for process verification in research
IBM Research, Yorktown Heights
Philip Morris International Research and Development
www.sbvimprover.com
protein phosphorylation
reversible protein phosphorylation
addition or removal of a phosphate group
alters shape and function of proteins
protein phosphorylation
chemical stimuli
gene expression
reversible protein phosphorylation
addition or removal of a phosphate group
alters shape and function of proteins
www.sbvimprover.com
chemical stimuli
phosphorylation status (measured)
gene expression
(Δ measured)
complex network (incomplete snapshot)
• normal bronchial epithelial cells, derived from human and rat
• 52 different chemical stimuli (26 (A) + 26 (B)), additional controls
• phosphorylation status after 5 minutes and 25 minutes
• gene expression after 6 hours
challenge data
• rather low noise levels
• subtract control, median of replicates
activation as defined by the challenge organizers: abs(P) > 3 @5 min. or @25 min.
• ~ 10% positive examples
• noisy data (microarray)
• correct for saturation effects
N = 20110 genes (human)
N = 13841 genes (rat)
www.sbvimprover.com
challenge set-up and goals
1 intra-species prediction of phosphorylation
from gene expression
2 predict the response in human using
data available for rat cells
3 predict gene expression response
across species
intra-species phosphorylation prediction
sub-challenge 1
combination of two approaches:
• voter method
gene selection based on mutual information
• machine learning analysis
Principal Components representation +
Linear Discriminant Analysis
• weighted combination
based on Leave-One-Out cross validation
voter method
binarize data by thresholding
gene expression: G = 1 if p < 0.01 (p-value for differential expression)
phosphorylation: P = 1 if abs(P) > 3 (@5 min. or @25 min.)
for all pairs of genes and proteins:
calculate separate and joint entropies
using frequencies over stimuli
mutual information
assumption: high I indicates that a gene is predictive for the
corresponding protein status
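As a concrete sketch of this computation (the function name and toy vectors are illustrative, not from the talk), the mutual information between a binarized gene profile and a protein's activation profile can be estimated directly from empirical frequencies over the stimuli:

```python
from math import log2

def mutual_information(g, p):
    """Mutual information I(G;P) between two binary vectors
    (e.g. binarized gene expression and phosphorylation status),
    estimated from the empirical joint frequencies over stimuli."""
    n = len(g)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for x, y in zip(g, p) if x == a and y == b) / n
            p_a = sum(1 for x in g if x == a) / n
            p_b = sum(1 for y in p if y == b) / n
            if p_ab > 0:
                mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi
```

A gene that perfectly tracks the protein carries one bit, `mutual_information([0,1,0,1], [0,1,0,1]) == 1.0`, while an unrelated gene yields 0.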
example:
SYNPR level predictive of AKT1 activation
green = significant phosphorylation
red = significant gene expression
SYNPR under-expressed → AKT1 phosphorylated
voter method
for each protein:
- determine a set of most predictive genes (varying number ~ 30-70)
- vote according to the presence of significant gene expressions
relative frequency of positive votes determines certainty score in [0,1]
Leave-One-Out (L-1-O) validation:
consider mutual information only over 25 stimuli, predict the 26th
performance estimate with respect to predicting novel data
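The voting step itself reduces to a relative frequency; a minimal sketch (naming is mine, and the sign handling for under- vs. over-expression is omitted for brevity):

```python
def voter_certainty(selected_gene_calls):
    """Voter-method certainty for one (protein, stimulus) pair:
    the relative frequency of positive votes, i.e. the fraction of
    the protein's selected high-MI genes that show significant
    differential expression under this stimulus."""
    return sum(selected_gene_calls) / len(selected_gene_calls)
```

For example, three significant calls out of four selected genes give `voter_certainty([1, 1, 0, 1]) == 0.75`.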
voter method prediction
[heatmap: certainties in [0,1], proteins 1…16 × stimuli 27…52]
• voting schemes obtained from examples in A, applied to the 26 new stimuli of data set B: 416 predictions w.r.t. data set B
• certainties averaged over the 26 L-1-O runs
machine learning approach
low-dimensional representation of gene expression data
• omit all genes with zero variation or only insignificant (p > 0.05) expression values over all 26 training stimuli (13841 → 6033 genes)
• Principal Component Analysis (PCA) (pcascat, www.mloss.org, c/o Marc Strickert)
- error free representation of all data possible by max. 52 PCs
- here: use k ≤ 22 leading PCs only (remove small variations due to noise)
• Linear Discriminant Analysis (LDA) (Matlab, Statistics: classify)
- identifies discriminative directions in k-dim. space
based on within-class and between-class variation
- probabilistic output provided, interpreted as certainty score
- if all training examples negative, score 0 is assigned
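A minimal pure-Python sketch of this PCA + LDA pipeline, reduced to a single leading PC (via power iteration) and a one-dimensional LDA with pooled within-class variance; the actual analysis used up to 22 PCs and Matlab's classify, so treat all names and simplifications here as illustrative:

```python
from math import exp

def pc1_scores(X, iters=200):
    """Project the centered data onto the leading principal component,
    found by power iteration on the covariance matrix (a minimal
    stand-in for the k <= 22 leading PCs used on the slide)."""
    n, d = len(X), len(X[0])
    means = [sum(col) / n for col in zip(*X)]
    Xc = [[x - m for x, m in zip(row, means)] for row in X]
    C = [[sum(Xc[s][i] * Xc[s][j] for s in range(n)) / n
          for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [sum(x * vi for x, vi in zip(row, v)) for row in Xc], v

def lda1d_certainty(z, y, z_new):
    """One-dimensional LDA on the PC scores: class means and pooled
    within-class variance give a linear decision function whose
    logistic output serves as the certainty score in [0, 1]."""
    z0 = [a for a, t in zip(z, y) if t == 0]
    z1 = [a for a, t in zip(z, y) if t == 1]
    m0, m1 = sum(z0) / len(z0), sum(z1) / len(z1)
    var = (sum((a - m0) ** 2 for a in z0) +
           sum((a - m1) ** 2 for a in z1)) / (len(z) - 2)
    w = (m1 - m0) / var
    b = -0.5 * w * (m0 + m1)
    return 1.0 / (1.0 + exp(-(w * z_new + b)))
```

On well-separated toy data the certainty is close to 1 for positive-class points and close to 0 for negative ones.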
machine learning approach
• Leave-One-Out procedure with varying number k of PC projections
for each of the 16 target proteins
for k=1:22
- repeat 26 times: LDA based on 25 stimuli, predict the 26th
yields probabilistic prediction 0 ≤ c(k) ≤ 1 (crisp threshold 0.5)
- compute Matthews Correlation Coefficient (0 ≤ mcc ≤ 1)
- determine the number of false positives (fp), true positives (tp),
false negatives (fn), true negatives (tn)
machine learning approach
• perform protein-specific
weighted average to obtain certainties:
• prediction: apply to test set (B) (binarized)
[heatmaps: certainties, proteins × stimuli 27…52]
machine learning approach
• for fair comparison with voter method:
Nested Leave-One-Out procedure
for each protein, repeat 26 times:
L-1-O using 24 out of 25 stimuli, varying k
mcc-weighted prediction for the 26th stimulus
• averaged certainties as weighted means
(unweighted mean if both mcc=0)
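The weighted averaging can be sketched as follows, assuming weights proportional to each method's Leave-One-Out MCC, which is one natural reading of "mcc-weighted" (the function name is mine):

```python
def combine(c_voter, c_lda, mcc_voter, mcc_lda):
    """Protein-specific weighted average of the two certainty scores,
    with weights proportional to each method's Leave-One-Out MCC;
    plain mean when both MCCs vanish."""
    w = mcc_voter + mcc_lda
    if w == 0:
        return 0.5 * (c_voter + c_lda)
    return (mcc_voter * c_voter + mcc_lda * c_lda) / w
```

If the voter method validated much better for a protein (say mcc 0.8 vs. 0.2), its certainty dominates: `combine(1.0, 0.0, 0.8, 0.2) == 0.8`.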
combined prediction
[heatmap: combined certainties, proteins 1…16 × stimuli 27…52]
© 2013 sbv IMPROVER, PMI and IBM
Scores and ranks of 21 participating teams
[bar chart: sum of ranks over AUPR, Pearson, BAC per team]
Better rank to the left
AUPR: area under the precision-recall curve
Pearson: Pearson correlation between predictions and the binarized Gold Standard
BAC: balanced accuracy
3 teams are separated
from the rest
Team AUPR Pearson BAC
Team_75 0.38 0.72 0.72
Team_49 0.42 0.71 0.69
Team_50 0.38 0.72 0.68
Team_93 0.37 0.70 0.61
Team_111 0.35 0.64 0.67
Team_61 0.35 0.68 0.60
Team_89 0.31 0.65 0.65
Team_112 0.29 0.63 0.66
Team_116 0.27 0.62 0.59
Team_64 0.23 0.59 0.58
Team_90 0.24 0.59 0.56
Team_100 0.23 0.60 0.56
Team_78 0.28 0.56 0.55
Team_72 0.15 0.55 0.58
Team_105 0.19 0.56 0.53
Team_82 0.14 0.55 0.55
Team_106 0.13 0.53 0.55
Team_71 0.14 0.49 0.45
Team_52 0.13 0.49 0.46
Team_84 0.10 0.48 0.49
Team_99 0.07 0.43 0.50
statistically significant (FDR < 0.05)
our individual methods, scored in the same way:
LDA     0.34  0.71  0.67  (rank 2)
voting  0.40  0.67  0.65  (rank 2)
→ the combination improved the performance!
inter-species phosphorylation prediction
sub-challenge 2
sub-challenge 2 set-up
we restrict ourselves to phosphorylation data only
reasoning:
immediate response to stimuli should
be comparable between species
data
[schematic: stimuli 1…26 (data set A) vs. 27…52 (data set B), proteins 1…16;
ratP known for both data sets; humP known for data set A;
|humP| > 3 to be predicted for data set B]
naïve prediction
assume similar activation in both species: “human ≈ rat”
prediction score, corresponding to threshold 3 for activation
- precise (monotonic!) form is irrelevant for ROC, PR etc.
- threshold 0.5 for crisp classification
- here: scaling factor yields values well-spread in [0,1]
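One concrete (and deliberately arbitrary) choice for this monotonic score map is a logistic squashing centered at the activation threshold 3; any other monotonic form would give the same ROC/PR curves. The name and the unit scale factor are illustrative:

```python
from math import exp

def naive_certainty(rat_p, scale=1.0):
    """Monotonic map of the rat phosphorylation value to a certainty
    in [0, 1], constructed so that the activation threshold
    |ratP| = 3 corresponds to certainty 0.5 (the crisp decision
    boundary); the logistic shape is one convenient choice."""
    return 1.0 / (1.0 + exp(-scale * (abs(rat_p) - 3.0)))
```

So a rat value exactly at the threshold scores 0.5, strong responses approach 1, and silent proteins approach 0.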
naïve prediction
[ROC curve (sensitivity vs. 1-specificity), AUC ≈ 0.83, over the full panel of 416 predictions of |humP| > 3]
naïve prediction
[heatmap: color-coded certainty for |humP| > 3, proteins 1…16 × stimuli 27…52, data set B]
machine learning approach
[schematic: as in the data overview, but the binarized human set A (|humP| > 3) now serves as training target;
each stimulus is a 16-dim. vector of rat phosphorylation values;
16 separate binary classification problems, one per protein]
LVQ prediction
LVQ1, one prototype per class
Nearest prototype classification:
here: 16-dim. data
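A compact sketch of LVQ1 with one prototype per class; the initialization by class-conditional means and the learning rate are my choices, not necessarily those used in the challenge:

```python
def lvq1_train(X, y, epochs=30, lr=0.05):
    """LVQ1 with one prototype per class: for each sample the nearest
    prototype is moved toward the sample if the class labels agree,
    and away from it otherwise."""
    protos = {}
    for c in (0, 1):                      # init: class-conditional means
        members = [x for x, t in zip(X, y) if t == c]
        protos[c] = [sum(v) / len(members) for v in zip(*members)]
    for _ in range(epochs):
        for x, t in zip(X, y):
            c = min(protos, key=lambda k: sum((a - b) ** 2
                    for a, b in zip(x, protos[k])))
            sign = lr if c == t else -lr
            protos[c] = [p + sign * (a - p) for a, p in zip(x, protos[c])]
    return protos

def lvq1_predict(protos, x):
    """Nearest prototype classification (squared Euclidean distance)."""
    return min(protos, key=lambda k: sum((a - b) ** 2
               for a, b in zip(x, protos[k])))
```

In the challenge the inputs would be the 16-dim. rat phosphorylation vectors, one LVQ system per target protein.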
LVQ prediction
prediction score / certainty for activation
- precise (monotonic!) form is irrelevant for ROC, PR etc.
- crisp classification for threshold 0.5
- here: scaling factor yields a range of values similar to the naïve prediction
validation: 26 Leave-One-Out training processes:
split data set A into 25 training / 1 test sample
(if the training set is all negative: accept the naïve prediction)
prediction: ensemble average of certainties over the 26 LVQ systems
LVQ prediction
[ROC curve (sensitivity vs. 1-specificity), AUC ≈ 0.88, over the full panel of 416 predictions of |humP| > 3, obtained in the Leave-One-Out validation scheme]
naïve prediction (for comparison)
[ROC curve, AUC ≈ 0.83, same panel of 416 predictions]
combined prediction
[heatmaps: proteins 1…16 × stimuli 27…52]
combined prediction: weighted average according to protein-specific performance (AUROC)
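The protein-specific AUROC used as a combination weight can be computed without explicitly tracing the ROC curve, via the rank-sum (Mann-Whitney) identity; this helper is illustrative:

```python
def auroc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney)
    statistic: the probability that a randomly chosen positive
    outscores a randomly chosen negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The per-protein AUROC values of the naïve and LVQ schemes can then serve as weights in the same weighted-average combination used above.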
combined prediction
[heatmap: color-coded certainty for |humP| > 3, proteins 1…16 × stimuli 27…52, data set B]
             AUPR  Pearson  BAC   rank
naïve (rat)  0.45  0.74     0.79  1
LVQ          0.37  0.69     0.76  3
→ naïve scheme: best individual prediction
• the L-1-O advantage of LVQ was not confirmed in the test set
→ combination improves performance!
→ confirmed in the “wisdom of the crowd” analysis
Classifier Methods for SC2

Team     | Classifier                                                          | Feature Selection                                                     | Rank
Team_50  | Learning Vector Quantization (LVQ1) + naïve approach                | NA                                                                    | 1
Team_111 | neural network: 13489 inputs, 1000 hidden sigmoid units, 32 outputs | —                                                                     | 2
Team_49  | LDA                                                                 | rank proteins by moderated t-test p-values, threshold; cross-validate | 3
Team_61  | linear fit                                                          | PCA                                                                   | 4
Team_52  | least absolute regression model (LBE)                               | NA                                                                    | 5
Team_93  | random forest                                                       | predict activation matrix of 7 proteins, use it for the remaining 9   | 6
Team_89  | SVM with radial basis kernel, and RF                                | Biogrid, STRING                                                       | 7
inter-species
pathway perturbation
prediction
sub-challenge 3
additional data / domain knowledge
246 gene sets from the C2CP collection (Broad Institute)
www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP
1) mapping of rat genes to human orthologs
   HGNC Comparison of Ortholog Predictions (HCOP)
   www.genenames.org/cgi-bin/hcop.pl
2) annotation of gene sets representing known pathways and functions
3) gene set enrichment analysis
   www.broadinstitute.org/gsea/index.jsp
   NES: normalized enrichment score, representing expression
   FDR: false discovery rate, i.e. statistical significance
   threshold: FDR < 0.25
[heatmaps: gene sets with FDR < 0.25 across stimuli (set A), rat vs. human]
frequent observation: negative correlations between significant rat and human gene sets
biology? data (pre-)processing?
machine learning approach
• PCA: dimension and noise reduction
  rat gene set data A and B represented by k (≤ 52) projections
• training data: 26 stimuli in rat data set A, 246-dim. vectors of rat NES
  246 classification problems; targets: binarized human FDR (< 0.25?)
• LDA: linear classifier using k projections as features (probabilistic output)
• Leave-One-Out validation: determine optimal k from data set A
• use k = 8 to make predictions for data set B (averaged over 26 L-1-O runs)
human gene set prediction
[heatmap: final prediction certainties, gene sets × stimuli 27…52]
Team scores and ranks
Team AUPR Pearson BAC rank
Team 50 0.19 0.59 0.54 1
Team 133 0.12 0.54 0.54 2
Team 49 0.12 0.53 0.53 3
Team 52 0.10 0.52 0.54 4
Team 131 0.11 0.50 0.52 5
Team 105 0.11 0.52 0.51 6
Team 111 0.06 0.41 0.43 7
[bar chart: sum of ranks (AUPR, Pearson, BAC) per team; better rank to the left; Team_50 significant at FDR ≤ 0.01]
Aggregation of results: The Wisdom of Crowds
[bar chart: sum of ranks (AUPR, Pearson, BAC) for individual teams and aggregated predictions (top_2 … all_teams); better rank to the left; the best individual team leads]
summary
• sc-1: intra-species prediction of phosphorylation
  gene expression is predictive for phosphorylation status
• sc-3: inter-species prediction of gene sets
  weakly predictive; presence of negative correlations between rat and human genes and gene sets
• sc-2: inter-species prediction of phosphorylation
  rat phosphorylation is predictive for human cell response
outlook
• more sophisticated learning schemes / classifiers
e.g. feature weighting schemes, Matrix Relevance LVQ
• ‘joint’ predictions of protein or gene set tableaus
e.g. predict 1 protein from 16 + 15 values in set A
two-step procedure for set B
• include gene expression in sub-challenge 2
• investigate difficult-to-predict proteins / gene sets
• infer and enhance network models from experimental data
ongoing, new challenge (runs until February 2014):
Network Verification Challenge (NVC)
www.sbvimprover.com
take home messages
• teamwork works (and Skype is great)
• in case of doubt: PCA
• the smaller the data set, the simpler the method
• committees can be useful!
• if you have won the rat race, you might be a rat