Gene Set Enrichment Analysis
Xiaole Shirley Liu
STAT115/STAT215
Gene Set Enrichment Analysis
• In some microarray experiments comparing two conditions, there might be no single gene significantly differentially expressed, but a group of genes slightly differentially expressed
• Check a set of genes with similar annotation (e.g. a GO term) and examine their expression values
– Kolmogorov-Smirnov test
• GSEA at the Broad Institute
Gene Set Enrichment Analysis
• Mootha et al, PNAS 2003
– Kolmogorov-Smirnov test
– Cumulative fraction function: What fraction of genes
are below this fold change?
[Figure: cumulative fraction of genes vs. fold change (FC); original slide labels: T*-test, FC]
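A minimal R sketch of the Kolmogorov-Smirnov idea above on toy data (the gene set, its size, and the 0.3 shift are illustrative assumptions, not the Broad GSEA implementation):

```r
# Minimal sketch (not the Broad GSEA implementation): compare the fold-change
# distribution of an annotated gene set against all other genes with a KS test.
set.seed(1)
fold_change <- rnorm(5000)                 # hypothetical log fold changes, one per gene
in_set      <- sample(c(TRUE, FALSE), 5000, replace = TRUE, prob = c(0.02, 0.98))
fold_change[in_set] <- fold_change[in_set] - 0.3   # the set is slightly down-regulated

# Two-sample Kolmogorov-Smirnov test: does the set's fold-change distribution
# differ from that of the remaining genes?
ks.test(fold_change[in_set], fold_change[!in_set])
```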
Gene Set Enrichment Analysis
• Alternative to KS: a one-sample z-test
– Population of all genes assumed to follow a normal distribution ~ N(μ, σ²)
– For the average score X̄ of the |X| genes with a specific annotation:

z = (X̄ − μ) / (σ / √|X|)
Gene Set Enrichment Analysis
• Set of genes with specific annotation involved in
coordinated down-regulation
• Need to define the set before looking at the data
• Can only see the significance by looking at the
whole set
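Continuing the one-sample z-test alternative above, a hedged R sketch on toy data showing how a small coordinated shift is only visible when the whole set is tested together:

```r
# One-sample z-test sketch: each gene in the set is only slightly shifted, but the
# set average is clearly significant as a whole (toy data, assumed ~ N(mu, sigma^2)).
set.seed(1)
score  <- rnorm(5000)                            # hypothetical per-gene scores (e.g. fold changes)
in_set <- seq_len(5000) %in% sample(5000, 100)   # hypothetical 100-gene annotated set
score[in_set] <- score[in_set] - 0.3             # small coordinated down-regulation

mu      <- mean(score)                  # population mean, estimated from all genes
sigma   <- sd(score)                    # population standard deviation
set_avg <- mean(score[in_set])          # X-bar, average score of the set
n_set   <- sum(in_set)                  # |X|, number of genes in the set

z <- (set_avg - mu) / (sigma / sqrt(n_set))
p <- 2 * pnorm(-abs(z))                 # two-sided p-value
c(z = z, p = p)
```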
Expanded Gene Sets
• Subramanian et al., PNAS 2005
Examples of GSEA
Break
Microarray Classification
Xiaole Shirley Liu
STAT115/STAT215
Microarray Classification
probe set Normal m412a
Normal m414a
Normal m416a
Normal m426a
Normal m430a
MM m282 MM m331a
MM m332a
MM m333a
MM m334a
MM m353a
MM m408a
MM m423a
MM m424a
39089_at 89.31 143.37 111.61 134.78 121.57 104.02 101.11 105.16 121.21 176.72 117.16 137.19 109.5 109.06
35862_at 95.05 107.04 71.06 100.63 117.58 103.96 95.2 114.35 95.03 90.32 93.13 88.61 90.87 112.95
41777_at 22.76 20.05 21.37 25.55 30.8 20.75 21.95 28.82 30.85 28.81 22.65 18.91 22.58 21.65
38250_at 53.55 62.89 29.36 62.74 36.14 60.07 37.46 42.85 27.86 41.48 116.4 46.39 38.9 29.11
656_at 177.69 177.65 167.15 166.04 155.07 180.4 136.47 200.4 201.8 138.38 165.92 176.25 162.85 156.17
332_at 128.5 98.29 130.58 111.49 103.56 115.47 121.01 134.5 118.85 88.71 105.08 93.28 113.18 140.13
39185_at 107.86 114.02 104.08 108.89 112.75 113.61 120.9 120.1 113.82 102.72 109.81 104.86 104.4 95.53
514_at 69.21 51.43 92.43 69.21 55.46 58.43 73.9 74.58 88.07 57.01 79.11 53.63 53.43 69.62
35010_at 65.34 42 48.14 52.85 59.07 49.62 62.59 68.39 55.57 47.92 46.97 49.73 44.7 55.73
34793_s_at 9.95 9.12 10.45 14.65 21.91 13.2 14.02 17.15 9.05 10.66 8.24 13.43 17.17 15.97
33277_at 153.21 120.52 136.7 113.79 110.23 140.96 153.44 149.59 119.14 98.57 156.85 101.86 117.28 104.72
34788_at 167.66 172.86 142.6 199.39 195.34 156.66 173.96 159.16 207.34 154.18 158.59 151.91 171.65 246.11
2053_at 91.76 111.82 99.57 95.58 87.17 123.15 82.24 93.92 97.76 114.66 80.33 107.65 89.78 85.41
33465_at 63.37 45.24 54.72 56.74 58.16 59.55 63.43 71.55 55.76 46.63 49.78 40.49 44.5 69.33
41097_at 145.34 148.08 171.78 151.96 128.26 138.98 148.45 160.25 169.47 133.5 166.24 135.37 159.2 129.96
32394_s_at 449.9 1190.09 429.93 1034.13 196.52 214.51 220.81 331.66 652.66 488.37 699.41 1903.88 843.79 575.16
1969_s_at 30.03 34.58 59.76 32.84 46.98 51.34 40.4 41.75 31.8 36.74 62.42 40.4 36.37 26.06
39225_at 43.19 82.15 97.56 78.3 57.23 65.29 75.14 54.5 58.35 62.47 124.64 56.42 90.55 57.28
36919_r_at 36.45 26.84 37.94 35.79 38.86 33.99 28.94 32.57 39.61 32.08 31.37 36.58 44.33 36.99
33574_at 16.14 12.58 10.93 14.65 29.64 19.38 14.65 15.29 16.14 19.72 11.23 12.6 18.2 24.04
36271_at 41.71 25.8 39.79 49.71 52.64 33.5 48.33 41.15 48.74 45.12 36.5 38.58 55.99 29.73
490_g_at 83.48 103.93 121.57 80.05 73.81 115.47 106.57 96.19 101.49 78.5 86.13 71.87 83.73 93.64
1654_at 78.63 82.7 93.15 73.96 73.82 104.4 100.39 91.78 82.26 63.21 76.23 56.97 76.2 73.04
41207_at 100.27 80.62 84.98 75.44 74.26 95.56 96.83 100.36 85.12 71.34 81.04 75.81 70.77 70.81
40080_at 172.83 106.63 122.03 118.12 131.15 153.53 150.19 161.04 123 101.64 142.03 110.02 113.58 117.18
38699_at 69.1 67.16 62.73 67.46 74.03 61.16 75.27 75.7 63.2 68.12 57.25 65.42 70.71 75.81
698_f_at 21.36 43.88 30.5 65.43 35.73 44.05 32.34 35.17 33.89 62.61 34.72 42.49 32.13 37.51
36036_at 105.59 71.45 88.72 79.84 75.78 95.13 115.07 100.81 84.13 69.87 76.51 71.58 72.16 73.85
40720_at 104.84 175.9 186.87 65.58 64 204.55 89.48 110.87 99 59.84 138.3 59.43 197.43 118.32
32194_at 34.01 165.32 153.91 59.4 43.4 98.5 59.53 43.28 47.98 63.09 217.29 127.38 79.38 82.04
31499_s_at 42.66 36.26 47.61 43.35 48.55 40.87 52.57 53.86 41.41 40.08 44.22 35.6 43.32 41.48
41685_at 25.07 14.68 22.41 22.98 19.79 22.21 21.85 25.12 20.27 18.44 20.37 12.85 22.02 25.91
31788_at 115.87 151.38 103.33 144.45 138.01 125.9 132.74 121.06 113.56 114.21 149.88 199.76 121.17 96.03
1719_at 15.65 18.26 16.74 21.49 15.16 11.49 17.52 21.35 19.36 20.6 15.13 14.3 18.77 18.49
973_at 169.15 142.44 164.57 129 151.38 189.15 171.12 169.57 139.02 140.37 145.62 145.17 130.23 132.35
? (to which class does a new sample belong?)
Classification
• Equivalent to machine learning methods
• Task: assign an object to a class based on measurements on the object
– E.g. is a sample normal or cancer, based on its expression profile?
• Unsupervised learning
– Ignores known class labels
– Sometimes can't separate even the known classes
• Supervised learning:
– Extract useful features based on known class labels to best separate the classes
– Can overfit the data, so need separate training and test sets
Clustering Classification
• Which known samples does the unknown sample
cluster with?
• No guarantee that the known samples of the same class will cluster together
• Try batch removal or different clustering methods (a sketch follows below)
– Change linkage, select a subset of genes (semi-supervised)
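A minimal sketch of this clustering-based classification, assuming hierarchical clustering with hclust on a correlation-based distance (the toy data and sample names are illustrative):

```r
# Cluster known samples together with the unknown one and see which class it
# groups with; re-run with different linkage methods if the classes do not separate.
set.seed(2)
expr <- matrix(rnorm(10 * 200), nrow = 10)          # hypothetical: 10 samples x 200 genes
rownames(expr) <- c(paste0("Normal", 1:5), paste0("MM", 1:4), "Unknown")

d <- as.dist(1 - cor(t(expr)))                      # correlation-based distance between samples
plot(hclust(d, method = "average"))                 # try "complete", "single", ... (change linkage)
```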
K Nearest Neighbor
• Used in missing value estimation
• For observation X with unknown label, find
the K observations in the training data
closest (e.g. correlation) to X
• Predict the label of X based on majority
vote by KNN
• K can be determined by predictability of
known samples, semi-supervised again!
KNN Example
• Can extend KNN by assigning weights to the neighbors by inverse distance from the test sample (a basic KNN sketch follows below)
• Offers little insight into mechanism
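A minimal KNN sketch using the class package (an assumed choice; toy data, Euclidean distance rather than correlation, and k = 3 are illustrative):

```r
# Predict the label of unknown samples by majority vote among the k nearest
# training samples.
library(class)

set.seed(3)
train_x <- matrix(rnorm(20 * 50), nrow = 20)               # hypothetical: 20 training samples x 50 genes
train_y <- factor(rep(c("Normal", "MM"), each = 10))
test_x  <- matrix(rnorm(2 * 50), nrow = 2)                 # two unknown samples

knn(train = train_x, test = test_x, cl = train_y, k = 3)   # k can be tuned on the known samples
```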
MDS
• Multidimensional scaling
• Based on the distances between data points in high-dimensional space (e.g. correlation-based)
• Gives a 2-3D representation approximating the pairwise distance relationships as closely as possible
• Non-linear projection
• Can directly predict a new sample based on its distances (a cmdscale sketch follows below)
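A small sketch of classical MDS with cmdscale, assuming a correlation-based distance between samples (toy data):

```r
# Classical MDS: embed samples in 2-D so that embedded distances approximate
# the original high-dimensional (here correlation-based) distances.
set.seed(4)
expr <- matrix(rnorm(15 * 500), nrow = 15)   # hypothetical: 15 samples x 500 genes
d    <- as.dist(1 - cor(t(expr)))            # pairwise sample distances
mds  <- cmdscale(d, k = 2)                   # 2-D coordinates approximating d
plot(mds, xlab = "MDS 1", ylab = "MDS 2")
```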
Break
Principal Component Analysis
• Linear transformation that projects the data onto a new coordinate system (linear combinations of the original variables) to capture as much of the variation in the data as possible
• The first principal component accounts for the greatest possible variance in the dataset
• The second principal component accounts for the next
highest variance and is uncorrelated with (orthogonal to)
the first principal component.
Finding the Projections
• Looking for a linear combination to transform the original data matrix X into:
Y = aᵀX = a1 X1 + a2 X2 + … + ap Xp
• where a = (a1, a2, …, ap)ᵀ is a column vector of weights with
a1² + a2² + … + ap² = 1
• Maximize the variance of the projection of the observations on the Y variable
Finding the Projections
• The direction of a is given by the first eigenvector of the covariance matrix C, the one corresponding to the largest eigenvalue
• The second vector, orthogonal to (uncorrelated with) the first, is the one with the second-highest variance, which turns out to be the eigenvector corresponding to the second-largest eigenvalue
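A hedged sketch of this eigen-decomposition view on toy data; the SVD/prcomp sketch further below computes the same directions:

```r
# Projection directions from the covariance matrix C: the first eigenvector is
# the weight vector a with the largest projected variance (the first eigenvalue).
set.seed(5)
X <- matrix(rnorm(100 * 4), nrow = 100)   # hypothetical: 100 observations x 4 variables
C <- cov(X)                               # covariance matrix
e <- eigen(C)

e$vectors[, 1]   # a: unit weight vector of the first principal component
e$values         # variances captured by each component, largest first
```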
[Figure: two candidate projection directions, labeled "Good" and "Better"]
Principal Component Analysis
• Achieved by singular value decomposition (SVD):
X = UDVᵀ
• X is the original data
• U (N × N) is the relative
projection of the points
• V contains the projection directions
– v1 is a unit vector, direction of the first projection
– The eigenvector with the largest eigenvalue
– Linear combination (relative importance) of each gene
(if PCA on samples)
PCA
• D contains the scaling factors (eigenvalues)
– Diagonal matrix, d1 ≥ d2 ≥ d3 ≥ … ≥ 0
– dm² measures the variance captured by the mth principal component
• u1d1 is the distance along v1 from the origin (the first principal component)
– Expression values projected onto v1
– u1d1 captures the largest variance of the original X
– v2 is the 2nd projection direction, orthogonal to PC1; u2d2 is the 2nd principal component and captures the 2nd-largest variance of X
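A minimal R sketch of PCA computed both by svd on the centered matrix and by the equivalent prcomp call (the toy matrix dimensions and names are illustrative):

```r
# PCA by SVD of the centered data matrix (samples as rows), and the equivalent prcomp call.
set.seed(6)
X <- matrix(rnorm(14 * 1000), nrow = 14)        # hypothetical: 14 samples x 1000 genes

Xc  <- scale(X, center = TRUE, scale = FALSE)   # center each gene
s   <- svd(Xc)                                  # Xc = U D V^T
pcs <- s$u %*% diag(s$d)                        # principal components (U D), one row per sample
s$d^2 / sum(s$d^2)                              # fraction of variance captured by each component

pr <- prcomp(X, center = TRUE, scale. = FALSE)  # pr$x = components, pr$rotation = directions V
plot(pr$x[, 1], pr$x[, 2], xlab = "PC1", ylab = "PC2")
```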
PCA for Classification
[Figure: PCA of blood transcriptomes from healthy individuals (HI), individuals with cardiovascular risk factors (RF), individuals with asymptomatic left ventricular dysfunction (ALVD), and chronic heart failure patients (CHF).]
New sample predicted to
be CHF.
Interpretation of components
• See the weights of variables in each component
• If Y1= 0.41X1 +0.15X2 -0.38X3+0.03X4+…
• X1 and X3 are more important than X2 and X4 in PC1, which offers some biological insight
• PCA and MDS are both good dimension reduction
methods
• PCA is a good clustering method, and can be conducted
on genes or on samples
• PCA is only powerful if the biological question is
related to the highest variance in the dataset
PCA for Batch Effect Detection
• PCA can identify batch effect
• Obvious batch effect: early PC’s separate
samples by batch
[Figure: PCA plots of un-normalized, quantile-normalized (Qnorm), and COMBAT-corrected data]
Brezina et al, Microarray 2015
Break
Supervised Learning
Performance Assessment
• If the error rate is estimated from the whole learning data set, we could overfit the data (do well now, but poorly on future observations)
• Need cross validation to assess performance (a sketch follows below)
• Leave-one-out cross validation on n data points
– Build the classifier on (n-1) points, test on the one left out
• N-fold cross validation
– Divide the data into N subsets of equal size, build the classifier on (N-1) subsets, compute the error rate on the left-out subset
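A minimal N-fold cross-validation sketch on toy data, with KNN as a placeholder classifier (any classifier fit only on the training folds works the same way):

```r
# N-fold cross validation: hold out one fold at a time, train on the rest,
# and average the held-out error rates.
library(class)

set.seed(7)
x <- matrix(rnorm(40 * 50), nrow = 40)           # hypothetical: 40 samples x 50 genes
y <- factor(rep(c("Normal", "MM"), each = 20))

N     <- 5
folds <- sample(rep(1:N, length.out = nrow(x)))  # assign each sample to one of N folds
err   <- numeric(N)
for (i in 1:N) {
  test   <- folds == i
  pred   <- knn(train = x[!test, ], test = x[test, ], cl = y[!test], k = 3)
  err[i] <- mean(pred != y[test])                # error rate on the left-out fold
}
mean(err)                                        # cross-validated error estimate
```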
Logistic Regression
• Data: (yi, xi), i = 1, …, n
• Dependent variable is binary: 0, 1
• Model:
log[ P(Yi=1|Xi) / (1 - P(Yi=1|Xi)) ] = b0 + b1 Xi
i.e. P(Yi=1|Xi) / (1 - P(Yi=1|Xi)) = exp(b0 + b1 Xi)
• Logit: natural log of the odds ratio, Pb(1) over Pb(0)
• b0 + b1X very large → Y = 1; b0 + b1X very small → Y = 0
• But the change in probability is not linear in X
• b0 → intercept
• b1 → regression slope
• b0 + b1X = 0 → decision boundary, Pb(1) = 0.5
Example (wiki)
• Hours of study vs. the probability (Pb) of passing an exam
• Significant association
– P-value 0.0167
• Pb(pass) = 1/[1 + exp(-b0 - b1*Hours)]
= 1/[1 + exp(4.0777 - 1.5046*Hours)]
• 4.0777/1.5046 = 2.71 hours → Pb(pass) = 0.5
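Plugging the quoted coefficients into the formula above, a small R check of the pass probability at a few study times (the helper function name is made up):

```r
# Probability of passing as a function of hours studied, using the coefficients above.
pass_prob <- function(hours, b0 = -4.0777, b1 = 1.5046) {
  1 / (1 + exp(-(b0 + b1 * hours)))
}
pass_prob(c(1, 2.71, 4))   # roughly 0.07, 0.50, 0.87
```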
Logistic Regression
• Sample classification: Y → Cancer 1, Normal 0
• Find a subset of p genes whose expression values x collectively predict the class of a new sample
• The β's are estimated from training data (R)
• Model:
log[ P(yi=1|xi) / P(yi=0|xi) ] = β0 + β1 xi1 + … + βp xip
• The decision boundary is determined by the linear term, i.e., classify yi = 1 if:
β0 + β1 xi1 + … + βp xip > 0
• More later in the semester
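A hedged sketch of this kind of classifier with glm on toy data (the three hypothetical genes g1-g3 and the sample sizes are illustrative assumptions):

```r
# Logistic regression classifier: fit on labeled training samples, then classify
# a new sample by its predicted probability of being cancer.
set.seed(8)
train <- data.frame(y  = rep(c(1, 0), each = 15),            # 1 = cancer, 0 = normal (hypothetical)
                    g1 = rnorm(30), g2 = rnorm(30), g3 = rnorm(30))
fit <- glm(y ~ g1 + g2 + g3, data = train, family = binomial)

new_sample <- data.frame(g1 = 0.5, g2 = -1.2, g3 = 0.3)
p <- predict(fit, newdata = new_sample, type = "response")   # P(y = 1 | x)
ifelse(p > 0.5, "cancer", "normal")                          # decision boundary at 0.5
```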
Support Vector Machine
• SVM
– Which hyperplane is the best?
Support Vector Machine
• SVM finds the hyperplane that maximizes
the margin
• Margin determined by the support vectors (samples that lie on the class edge); other samples are irrelevant
Support Vector Machine
• SVM finds the hyperplane that maximizes
the margin
• Margin determined by the support vectors; other samples are irrelevant
• Extensions (a sketch follows below):
– Soft margin: support vectors get different weights
– Non-separable case: slack variables ξ > 0; maximize (margin - penalty × # misclassified)
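A minimal linear SVM sketch using the CRAN package e1071 (an assumed choice; the toy two-class data and cost value are illustrative):

```r
# Linear SVM: find the maximum-margin hyperplane; cost controls the soft-margin penalty.
library(e1071)

set.seed(9)
x <- rbind(matrix(rnorm(20 * 2, mean = 0), ncol = 2),
           matrix(rnorm(20 * 2, mean = 3), ncol = 2))   # two hypothetical, roughly separable classes
y <- factor(rep(c("Normal", "MM"), each = 20))

fit <- svm(x, y, kernel = "linear", cost = 1)
fit$index                                      # which training samples are the support vectors
predict(fit, matrix(c(1.5, 1.5), ncol = 2))    # classify a new sample
```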
Nonlinear SVM
• Project the data into a higher-dimensional space with a kernel function, so that the classes can be separated by a hyperplane
• A few kernel functions are implemented in BioConductor packages; the choice is usually trial and error and personal experience
• Example kernel: K(x, y) = (x·y)²
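For the quadratic kernel K(x, y) = (x·y)² quoted above, a hedged e1071 sketch on toy data with a circular class boundary, which this kernel can separate linearly in its feature space (the data and parameter values are illustrative):

```r
# Degree-2 polynomial kernel: gamma = 1, coef0 = 0 gives K(x, y) = (x . y)^2.
library(e1071)

set.seed(10)
x <- matrix(rnorm(80 * 2), ncol = 2)
y <- factor(ifelse(rowSums(x^2) > 1.5, "outer", "inner"))   # circular class boundary

fit <- svm(x, y, kernel = "polynomial", degree = 2, gamma = 1, coef0 = 0)
mean(predict(fit, x) == y)                                  # training accuracy
```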
Outline
• GSEA for the activity of groups of genes
• Dimension reduction techniques
– MDS, PCA
• Unsupervised learning methods
– Clustering, KNN, MDS
– PCA, batch effect
• Supervised learning for classification
– Logistic regression
– SVM
– Cross validation
