The sbv IMPROVER species translation challenge
Sometimes you can trust a rat

Sahand Hormoz, Adel Dayarian (KITP, UC Santa Barbara)
Gyan Bhanot (Rutgers Univ.)
Michael Biehl (University of Groningen, Johann Bernoulli Institute)
www.cs.rug.nl/biehl | m.biehl@rug.nl
Winning the rat race 2
sbv IMPROVER species translation challenge
systems biology verification combined with industrial methodology for process verification in research
IBM Research, Yorktown Heights
Philip Morris International Research and Development
www.sbvimprover.com
protein phosphorylation
reversible protein phosphorylation
addition or removal of a phosphate group
alters shape and function of proteins
protein phosphorylation
chemical stimuli
gene expression
reversible protein phosphorylation
addition or removal of a phosphate group
alters shape and function of proteins
www.sbvimprover.com
chemical stimuli
phosphorylation status (measured)
gene expression
(Δ measured)
complex network (incomplete snapshot)
• normal bronchial epithelial cells, derived from human and rat
• 52 different chemical stimuli (26 (A) + 26 (B)), additional controls
• phosphorylation status after 5 minutes and 25 minutes
• gene expression after 6 hours
challenge data
• rather low noise levels
• subtract control, median of replicates
activation as defined by the challenge organizers: abs(P) > 3 @5 min. or @25 min.
• ~ 10% positive examples
• noisy data (microarray)
• correct for saturation effects
N = 20110 genes (human)
N = 13841 genes (rat)
www.sbvimprover.com
challenge set-up and goals
1 intra-species prediction of phosphorylation
from gene expression
2 predict the response in human using
data available for rat cells
3 predict gene expression response
across species
intra-species phosphorylation prediction
sub-challenge 1
combination of two approaches:
• voter method
gene selection based on mutual information
• machine learning analysis
Principal Components representation +
Linear Discriminant Analysis
• weighted combination
based on Leave-One-Out cross validation
voter method
binarize data by thresholding
gene expression: G = 1 if p < 0.01 (p-value for differential expression)
phosphorylation: P = 1 if abs(P) > 3 (@5 min. or @25 min.)
for all pairs of genes and proteins:
calculate separate and joint entropies
using frequencies over stimuli
mutual information
assumption: high I indicates that a gene is predictive for the
corresponding protein status
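As a concrete sketch of this computation (the function name and toy vectors are illustrative, not from the talk), the mutual information between a binarized gene profile and a protein's activation profile can be estimated directly from empirical frequencies over the stimuli:

```python
from math import log2

def mutual_information(g, p):
    """Mutual information I(G;P) between two binary vectors
    (e.g. binarized gene expression and phosphorylation status),
    estimated from the empirical joint frequencies over stimuli."""
    n = len(g)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for x, y in zip(g, p) if x == a and y == b) / n
            p_a = sum(1 for x in g if x == a) / n
            p_b = sum(1 for y in p if y == b) / n
            if p_ab > 0:
                mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi
```

A gene that perfectly tracks the protein carries one bit, `mutual_information([0,1,0,1], [0,1,0,1]) == 1.0`, while an unrelated gene yields 0.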
example:
SYNPR level predictive of AKT1 activation
green = significant phosphorylation
red = significant gene expression
SYNPR under-expressed → AKT1 phosphorylated
voter method
for each protein:
- determine a set of most predictive genes (varying number ~ 30-70)
- vote according to the presence of significant gene expressions
relative frequency of positive votes determines certainty score in [0,1]
Leave-One-Out (L-1-O) validation:
consider mutual information only over 25 stimuli, predict the 26th
performance estimate with respect to predicting novel data
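The voting step itself reduces to a relative frequency; a minimal sketch (naming is mine, and the sign handling for under- vs. over-expression is omitted for brevity):

```python
def voter_certainty(selected_gene_calls):
    """Voter-method certainty for one (protein, stimulus) pair:
    the relative frequency of positive votes, i.e. the fraction of
    the protein's selected high-MI genes that show significant
    differential expression under this stimulus."""
    return sum(selected_gene_calls) / len(selected_gene_calls)
```

For example, three significant calls out of four selected genes give `voter_certainty([1, 1, 0, 1]) == 0.75`.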
voter method prediction
[heatmap: certainties in [0,1], proteins 1…16 × stimuli 27…52]
• voting schemes obtained from examples in A, applied to the 26 new stimuli of data set B: 416 predictions w.r.t. data set B
• certainties averaged over the 26 L-1-O runs
machine learning approach
low-dimensional representation of gene expression data
• omit all genes with zero variation or only insignificant (p > 0.05) expression values over all 26 training stimuli (13841 → 6033 genes)
• Principal Component Analysis (PCA) (pcascat, www.mloss.org, c/o Marc Strickert)
- error free representation of all data possible by max. 52 PCs
- here: use k ≤ 22 leading PCs only (remove small variations due to noise)
• Linear Discriminant Analysis (LDA) (Matlab, Statistics: classify)
- identifies discriminative directions in k-dim. space
based on within-class and between-class variation
- probabilistic output provided, interpreted as certainty score
- if all training examples negative, score 0 is assigned
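A minimal pure-Python sketch of this PCA + LDA pipeline, reduced to a single leading PC (via power iteration) and a one-dimensional LDA with pooled within-class variance; the actual analysis used up to 22 PCs and Matlab's classify, so treat all names and simplifications here as illustrative:

```python
from math import exp

def pc1_scores(X, iters=200):
    """Project the centered data onto the leading principal component,
    found by power iteration on the covariance matrix (a minimal
    stand-in for the k <= 22 leading PCs used on the slide)."""
    n, d = len(X), len(X[0])
    means = [sum(col) / n for col in zip(*X)]
    Xc = [[x - m for x, m in zip(row, means)] for row in X]
    C = [[sum(Xc[s][i] * Xc[s][j] for s in range(n)) / n
          for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [sum(x * vi for x, vi in zip(row, v)) for row in Xc], v

def lda1d_certainty(z, y, z_new):
    """One-dimensional LDA on the PC scores: class means and pooled
    within-class variance give a linear decision function whose
    logistic output serves as the certainty score in [0, 1]."""
    z0 = [a for a, t in zip(z, y) if t == 0]
    z1 = [a for a, t in zip(z, y) if t == 1]
    m0, m1 = sum(z0) / len(z0), sum(z1) / len(z1)
    var = (sum((a - m0) ** 2 for a in z0) +
           sum((a - m1) ** 2 for a in z1)) / (len(z) - 2)
    w = (m1 - m0) / var
    b = -0.5 * w * (m0 + m1)
    return 1.0 / (1.0 + exp(-(w * z_new + b)))
```

On well-separated toy data the certainty is close to 1 for positive-class points and close to 0 for negative ones.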
machine learning approach
• Leave-One-Out procedure with varying number k of PC projections
for each of the 16 target proteins
for k=1:22
- repeat 26 times: LDA based on 25 stimuli, predict the 26th
yields probabilistic prediction 0 ≤ c(k) ≤ 1 (crisp threshold 0.5)
- compute Matthews Correlation Coefficient (0 ≤ mcc ≤ 1)
- determine the number of false positives (fp), true positives (tp),
false negatives (fn), true negatives (tn)
machine learning approach
• perform protein-specific
weighted average to obtain certainties:
• prediction: apply to test set (B) (binarized)
[heatmaps: certainties, proteins × stimuli 27…52]
machine learning approach
• for fair comparison with voter method:
Nested Leave-One-Out procedure
for each protein, repeat 26 times:
L-1-O using 24 out of 25 stimuli, varying k
mcc-weighted prediction for the 26th stimulus
• averaged certainties as weighted means
(unweighted mean if both mcc=0)
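The weighted averaging can be sketched as follows, assuming weights proportional to each method's Leave-One-Out MCC, which is one natural reading of "mcc-weighted" (the function name is mine):

```python
def combine(c_voter, c_lda, mcc_voter, mcc_lda):
    """Protein-specific weighted average of the two certainty scores,
    with weights proportional to each method's Leave-One-Out MCC;
    plain mean when both MCCs vanish."""
    w = mcc_voter + mcc_lda
    if w == 0:
        return 0.5 * (c_voter + c_lda)
    return (mcc_voter * c_voter + mcc_lda * c_lda) / w
```

If the voter method validated much better for a protein (say mcc 0.8 vs. 0.2), its certainty dominates: `combine(1.0, 0.0, 0.8, 0.2) == 0.8`.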
combined prediction
[heatmap: combined certainties, proteins 1…16 × stimuli 27…52]
© 2013 sbv IMPROVER, PMI and IBM
Scores and ranks of 21 participating teams
[bar chart: sum of ranks over AUPR, Pearson, BAC per team]
Better rank to the left
AUPR: area under the precision-recall curve
Pearson: Pearson correlation between predictions and the binarized Gold Standard
BAC: balanced accuracy
3 teams are separated
from the rest
Team AUPR Pearson BAC
Team_75 0.38 0.72 0.72
Team_49 0.42 0.71 0.69
Team_50 0.38 0.72 0.68
Team_93 0.37 0.70 0.61
Team_111 0.35 0.64 0.67
Team_61 0.35 0.68 0.60
Team_89 0.31 0.65 0.65
Team_112 0.29 0.63 0.66
Team_116 0.27 0.62 0.59
Team_64 0.23 0.59 0.58
Team_90 0.24 0.59 0.56
Team_100 0.23 0.60 0.56
Team_78 0.28 0.56 0.55
Team_72 0.15 0.55 0.58
Team_105 0.19 0.56 0.53
Team_82 0.14 0.55 0.55
Team_106 0.13 0.53 0.55
Team_71 0.14 0.49 0.45
Team_52 0.13 0.49 0.46
Team_84 0.10 0.48 0.49
Team_99 0.07 0.43 0.50
statistically significant (FDR < 0.05)
our individual methods, scored in the same way:
LDA     0.34  0.71  0.67  (rank 2)
voting  0.40  0.67  0.65  (rank 2)
→ the combination improved the performance!
inter-species phosphorylation prediction
sub-challenge 2
sub-challenge 2 set-up
we restrict ourselves to phosphorylation data only
reasoning:
immediate response to stimuli should
be comparable between species
data
[schematic: stimuli 1…26 (data set A) vs. 27…52 (data set B), proteins 1…16;
ratP known for both data sets; humP known for data set A;
|humP| > 3 to be predicted for data set B]
naïve prediction
assume similar activation in both species: “human ≈ rat”
prediction score, corresponding to threshold 3 for activation
- precise (monotonic!) form is irrelevant for ROC, PR etc.
- threshold 0.5 for crisp classification
- here: scaling factor yields values well-spread in [0,1]
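One concrete (and deliberately arbitrary) choice for this monotonic score map is a logistic squashing centered at the activation threshold 3; any other monotonic form would give the same ROC/PR curves. The name and the unit scale factor are illustrative:

```python
from math import exp

def naive_certainty(rat_p, scale=1.0):
    """Monotonic map of the rat phosphorylation value to a certainty
    in [0, 1], constructed so that the activation threshold
    |ratP| = 3 corresponds to certainty 0.5 (the crisp decision
    boundary); the logistic shape is one convenient choice."""
    return 1.0 / (1.0 + exp(-scale * (abs(rat_p) - 3.0)))
```

So a rat value exactly at the threshold scores 0.5, strong responses approach 1, and silent proteins approach 0.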
naïve prediction
[ROC curve (sensitivity vs. 1-specificity), AUC ≈ 0.83, over the full panel of 416 predictions of |humP| > 3]
naïve prediction
[heatmap: color-coded certainty for |humP| > 3, proteins 1…16 × stimuli 27…52, data set B]
machine learning approach
[schematic: as in the data overview, but the binarized human set A (|humP| > 3) now serves as training target;
each stimulus is a 16-dim. vector of rat phosphorylation values;
16 separate binary classification problems, one per protein]
LVQ prediction
LVQ1, one prototype per class
Nearest prototype classification:
here: 16-dim. data
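A compact sketch of LVQ1 with one prototype per class; the initialization by class-conditional means and the learning rate are my choices, not necessarily those used in the challenge:

```python
def lvq1_train(X, y, epochs=30, lr=0.05):
    """LVQ1 with one prototype per class: for each sample the nearest
    prototype is moved toward the sample if the class labels agree,
    and away from it otherwise."""
    protos = {}
    for c in (0, 1):                      # init: class-conditional means
        members = [x for x, t in zip(X, y) if t == c]
        protos[c] = [sum(v) / len(members) for v in zip(*members)]
    for _ in range(epochs):
        for x, t in zip(X, y):
            c = min(protos, key=lambda k: sum((a - b) ** 2
                    for a, b in zip(x, protos[k])))
            sign = lr if c == t else -lr
            protos[c] = [p + sign * (a - p) for a, p in zip(x, protos[c])]
    return protos

def lvq1_predict(protos, x):
    """Nearest prototype classification (squared Euclidean distance)."""
    return min(protos, key=lambda k: sum((a - b) ** 2
               for a, b in zip(x, protos[k])))
```

In the challenge the inputs would be the 16-dim. rat phosphorylation vectors, one LVQ system per target protein.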
LVQ prediction
prediction score / certainty for activation
- precise (monotonic!) form is irrelevant for ROC, PR etc.
- crisp classification for threshold 0.5
- here: scaling factor yields a range of values similar to the naïve prediction
validation: 26 Leave-One-Out training processes:
split data set A into 25 training / 1 test sample
(if the training set is all negative: accept the naïve prediction)
prediction: ensemble average of certainties over the 26 LVQ systems
LVQ prediction
[ROC curve (sensitivity vs. 1-specificity), AUC ≈ 0.88, over the full panel of 416 predictions of |humP| > 3, obtained in the Leave-One-Out validation scheme]
naïve prediction (for comparison)
[ROC curve, AUC ≈ 0.83, same panel of 416 predictions]
combined prediction
[heatmaps: proteins 1…16 × stimuli 27…52]
combined prediction: weighted average according to protein-specific performance (AUROC)
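The protein-specific AUROC used as a combination weight can be computed without explicitly tracing the ROC curve, via the rank-sum (Mann-Whitney) identity; this helper is illustrative:

```python
def auroc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney)
    statistic: the probability that a randomly chosen positive
    outscores a randomly chosen negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The per-protein AUROC values of the naïve and LVQ schemes can then serve as weights in the same weighted-average combination used above.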
combined prediction
[heatmap: color-coded certainty for |humP| > 3, proteins 1…16 × stimuli 27…52, data set B]
             AUPR  Pearson  BAC   rank
naïve (rat)  0.45  0.74     0.79  1
LVQ          0.37  0.69     0.76  3
→ naïve scheme: best individual prediction
• the L-1-O advantage of LVQ was not confirmed in the test set
→ combination improves performance!
→ confirmed in the “wisdom of the crowd” analysis
Classifier Methods for SC2

Team     | Classifier                                                          | Feature Selection                                                     | Rank
Team_50  | Learning Vector Quantization (LVQ1) + naïve approach                | NA                                                                    | 1
Team_111 | neural network: 13489 inputs, 1000 hidden sigmoid units, 32 outputs | —                                                                     | 2
Team_49  | LDA                                                                 | rank proteins by moderated t-test p-values, threshold; cross-validate | 3
Team_61  | linear fit                                                          | PCA                                                                   | 4
Team_52  | least absolute regression model (LBE)                               | NA                                                                    | 5
Team_93  | random forest                                                       | predict activation matrix of 7 proteins, use it for the remaining 9   | 6
Team_89  | SVM with radial basis kernel, and RF                                | Biogrid, STRING                                                       | 7
inter-species
pathway perturbation
prediction
sub-challenge 3
additional data / domain knowledge
246 gene sets from the C2CP collection (Broad Institute)
www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP
1) mapping of rat genes to human orthologs
   HGNC Comparison of Ortholog Predictions (HCOP)
   www.genenames.org/cgi-bin/hcop.pl
2) annotation of gene sets representing known pathways and functions
3) gene set enrichment analysis
   www.broadinstitute.org/gsea/index.jsp
   NES: normalized enrichment score, representing expression
   FDR: false discovery rate, i.e. statistical significance
   threshold: FDR < 0.25
[heatmaps: gene sets with FDR < 0.25 across stimuli (set A), rat vs. human]
frequent observation: negative correlations between significant rat and human gene sets
biology? data (pre-)processing?
machine learning approach
• PCA: dimension and noise reduction
  rat gene set data A and B represented by k (≤ 52) projections
• training data: 26 stimuli in rat data set A, 246-dim. vectors of rat NES
  246 classification problems; targets: binarized human FDR (< 0.25?)
• LDA: linear classifier using k projections as features (probabilistic output)
• Leave-One-Out validation: determine optimal k from data set A
• use k = 8 to make predictions for data set B (averaged over 26 L-1-O runs)
human gene set prediction
[heatmap: final prediction certainties, gene sets × stimuli 27…52]
Team scores and ranks
Team AUPR Pearson BAC rank
Team 50 0.19 0.59 0.54 1
Team 133 0.12 0.54 0.54 2
Team 49 0.12 0.53 0.53 3
Team 52 0.10 0.52 0.54 4
Team 131 0.11 0.50 0.52 5
Team 105 0.11 0.52 0.51 6
Team 111 0.06 0.41 0.43 7
[bar chart: sum of ranks (AUPR, Pearson, BAC) per team; better rank to the left; Team_50 significant at FDR ≤ 0.01]
Aggregation of results: The Wisdom of Crowds
[bar chart: sum of ranks (AUPR, Pearson, BAC) for individual teams and aggregated predictions (top_2 … all_teams); better rank to the left; the best individual team leads]
summary
• sc-1: intra-species prediction of phosphorylation
  gene expression is predictive for phosphorylation status
• sc-3: inter-species prediction of gene sets
  weakly predictive; presence of negative correlations between rat and human genes and gene sets
• sc-2: inter-species prediction of phosphorylation
  rat phosphorylation is predictive for human cell response
outlook
• more sophisticated learning schemes / classifiers
e.g. feature weighting schemes, Matrix Relevance LVQ
• ‘joint’ predictions of protein or gene set tableaus
e.g. predict 1 protein from 16 + 15 values in set A
two-step procedure for set B
• include gene expression in sub-challenge 2
• investigate difficult-to-predict proteins / gene sets
• infer and enhance network models from experimental data
ongoing, new challenge (runs until February 2014):
Network Verification Challenge (NVC)
www.sbvimprover.com
take home messages
• teamwork works (and Skype is great)
• in case of doubt: PCA
• the smaller the data set, the simpler the method
• committees can be useful!
• if you have won the rat race, you might be a rat