Project_702

Hierarchical Cluster Analysis
BINF 702, Final Project, May 4th, 2015
Sreelakshmi Dodderi

Part 1: Cluster analysis on part of the NCI60 data.
Part 2: Cluster analysis on kinase genes of the Golub data using the neighbor-joining method.

Hierarchical Clustering
• A connectivity-based clustering approach.
• A whole family of methods that differ in the way distances are computed.
• The result is represented using a dendrogram.

NCI60 data
• Dataset of gene expression profiles.
• The format is a list containing two elements:
data: a 64x6830 matrix of gene expression values (64 cell lines, 6830 genes).
labs: a vector giving the cancer type of each cell line. Nine cancer types are represented: leukemia, melanoma, lung, colon, CNS, ovarian, renal, breast and prostate cancers.

Computations on NCI60
• PCA
• Cluster analysis using complete, average and single linkage methods on:
• Set 1: Breast cancer and ovarian cancer cell lines (metastasis)
• Set 2: Colon cancer and prostate cancer cell lines (metastasis)
• Set 3: Colon cancer and renal cancer cell lines (no association found)

PCA Computations
The prcomp() function outputs the standard deviation of each principal component.
Squaring these standard deviations gives the variance explained by each component.
Proportion of variance explained (PVE) by each principal component = variance explained by that component / total variance explained by all principal components.
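The PVE arithmetic can be sketched in a few lines of R. This is a minimal illustration on a synthetic matrix, not the project's own script: the matrix x stands in for the expression data, and standardizing the columns (scale. = TRUE) is an assumption, since the slides do not say whether the data were scaled.

```r
# Synthetic stand-in for an expression matrix (rows = samples, columns = genes).
set.seed(1)
x <- matrix(rnorm(20 * 50), nrow = 20, ncol = 50)

pr.out <- prcomp(x, scale. = TRUE)  # PCA on standardized columns (assumption)
pr.var <- pr.out$sdev^2             # squared standard deviations = variances
pve    <- pr.var / sum(pr.var)      # proportion of variance explained per component

round(head(pve), 3)                 # PVE of the first few components
```

By construction the PVE values sum to 1 and are non-increasing, because prcomp() orders the components by decreasing standard deviation.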

Barplot of PCA on a part of NCI60 data
Figure: plot of the principal component analysis on the NCI60 data; y-axis: Variances, 0 to 1000.

Screeplot
It is more informative to plot the PVE and the cumulative PVE of each principal component.
While each of the first 5 principal components explains a substantial amount of variance, there is a marked decrease in the variance explained by the later principal components.
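A scree plot of this kind could be drawn as follows. The sketch uses a synthetic matrix in place of the NCI60 data, and the object names (x, pve, cum.pve) are illustrative rather than taken from the project.

```r
# Synthetic stand-in for the expression matrix (rows = samples).
set.seed(2)
x <- matrix(rnorm(30 * 40), nrow = 30)

pr.out  <- prcomp(x, scale. = TRUE)
pve     <- pr.out$sdev^2 / sum(pr.out$sdev^2)  # PVE per component
cum.pve <- cumsum(pve)                         # cumulative PVE

# Two panels side by side: PVE and cumulative PVE.
op <- par(mfrow = c(1, 2))
plot(pve, type = "o", xlab = "Principal component", ylab = "PVE")
plot(cum.pve, type = "o", xlab = "Principal component",
     ylab = "Cumulative PVE", ylim = c(0, 1))
par(op)
```

An "elbow" in the left panel, where the PVE drops off sharply, is the usual visual cue for how many components carry most of the variance.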

Figure: dendrograms of the 7 breast cancer and 6 ovarian cancer cell lines (distance object DtBO) in three panels: complete linkage, hclust (*, "complete"); average linkage, hclust (*, "average"); single linkage, hclust (*, "single"). y-axis: Height.
In complete linkage, one of the ovarian cancer cell lines is very closely related to a breast cancer cell line, and the two form a separate cluster together.
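The three-linkage comparison can be sketched as below. The data are synthetic: 7 "BREAST" and 6 "OVARIAN" rows mirror the counts in the dendrograms, but the values are random, so the tree shapes will not match the slides. Standardizing the genes before computing distances is an assumption. With the real data (for example the NCI60 list shipped with the ISLR package), x would instead be the rows of the data matrix whose labs entry is "BREAST" or "OVARIAN".

```r
# Hypothetical stand-in: 7 breast and 6 ovarian samples, 100 genes each.
set.seed(3)
labs <- rep(c("BREAST", "OVARIAN"), times = c(7, 6))
x    <- matrix(rnorm(length(labs) * 100), nrow = length(labs))

d <- dist(scale(x))  # Euclidean distances between standardized samples

# Fit one tree per linkage method.
methods <- c("complete", "average", "single")
fits <- lapply(methods, function(m) hclust(d, method = m))
names(fits) <- methods

# One dendrogram per linkage, labelled by cancer type.
op <- par(mfrow = c(1, 3))
for (m in methods) {
  plot(fits[[m]], labels = labs, main = paste(m, "linkage"),
       xlab = "", sub = "", ylab = "Height")
}
par(op)

# Cross-tabulate a two-cluster cut of the complete-linkage tree with the labels.
tab <- table(cutree(fits$complete, k = 2), labs)
tab
```

The cross-tabulation gives a quick numeric check of what the dendrogram shows: a cell line of one type that falls in the cluster dominated by the other type appears as an off-diagonal count.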

Clustering the entire NCI60 data by Euclidean and maximum distance methods
“Maximum” metric: the distance between two samples (x and y) is the largest absolute difference between their components.
As with Euclidean distance, the samples separated by the smallest distance are merged first, so samples joined low in the dendrogram are the closely related ones.
We can use this second metric to check the relatedness of the breast cancer and ovarian cancer cell lines.
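A small worked example shows how the two metrics differ. The vectors x and y are made up for illustration; dist() with method = "maximum" and method = "euclidean" is the base R call that produces the distance matrices passed to hclust().

```r
# Two toy samples measured on three genes.
x <- c(1, 5, 2)
y <- c(4, 1, 2)
m <- rbind(x, y)

# Component-wise absolute differences are 3, 4 and 0.
d.max <- dist(m, method = "maximum")    # largest absolute difference: 4
d.euc <- dist(m, method = "euclidean")  # sqrt(3^2 + 4^2 + 0^2): 5

c(maximum = as.numeric(d.max), euclidean = as.numeric(d.euc))
```

The maximum metric looks only at the single gene with the largest difference, whereas the Euclidean metric accumulates differences over all genes, which is why the two dendrograms can group the same cell lines differently.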

Maximum metric method.
Figure: dendrogram of all 64 NCI60 cell lines, complete linkage, maximum distance; height axis from 3 to 8.

Euclidean metric method.
Figure: dendrogram of all 64 NCI60 cell lines, complete linkage, Euclidean distance; height axis from 40 to 160.

Figure: clustering of 7 colon cancer and 2 prostate cancer cell lines (distance object DtCP) in three panels: complete linkage, average linkage and single linkage. y-axis: Height.

Figure: clustering of 7 colon cancer and 9 renal cancer cell lines (distance object DstCR) in three panels: complete linkage, average linkage and single linkage. y-axis: Height.

Conclusion of part 1
• Complete linkage cluster analysis of the ovarian cancer and breast cancer cell lines shows a close relatedness between the two.
• Cluster analysis on the other two sets (colon cancer and prostate cancer; colon cancer and renal cancer) shows that the methods can clearly cluster cell lines within a single cancer type.
• Single linkage tends to yield trailing clusters, onto which individual samples attach one by one.
• Complete and average linkage tend to yield more balanced clusters.

Part 2: Cluster analysis of kinase genes in the Golub data using the neighbor-joining method
Why kinase genes?
Kinases modify other proteins by chemically adding a phosphate group to them. Phosphorylation can turn a protein on or off.
Kinases regulate the majority of cellular signal transduction pathways. Errors in signaling pathways are responsible for diseases such as cancer and autoimmunity.
Some kinase inhibitors are used in treating cancer.
The aim is to identify closely related kinase genes through cluster analysis.
Such closely related kinases often have similar structure and function.
A kinase inhibitor designed for one of them might therefore also work for a group of closely related kinases or a whole family of kinases.

Why the neighbor-joining method?
• The NJ method is statistically consistent under many models of evolution.
• Unlike UPGMA, neighbor joining does not assume that all lineages evolve at the same rate.
• It is well suited to assigning individuals to groups, which often correspond to self-identified geographical ancestry.

Computations:
• Get only the kinase genes from the Golub data.
> library("multtest"); data(golub)
> o <- grep("kinase", golub.gnames[,2])
> length(o)
[1] 139
There are 139 kinase genes.
• Use a two-sample t-test to select genes with an experimental effect (p-value below 0.01).
> # gol.fac is not defined on the slide; assumed here to be the ALL/AML factor built from golub.cl
> gol.fac <- factor(golub.cl, levels=0:1, labels=c("ALL","AML"))
> pt <- apply(golub, 1, function(x) t.test(x ~ gol.fac)$p.value)
> oo <- o[pt[o] < 0.01]
> kin <- golub[oo,]
> dim(kin)
[1] 28 38
This yields 28 genes.
• Perform NJ clustering on the 28 kinase genes.
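The slides do not show the call used for the NJ step. One way it could be done is with nj() from the ape package, sketched here on a synthetic 28 x 38 matrix standing in for kin. The use of ape, the Euclidean distance between genes, and the names kin.demo and tr are all assumptions made for illustration.

```r
# Hypothetical stand-in for the 28 x 38 matrix of selected kinase genes.
set.seed(4)
kin.demo <- matrix(rnorm(28 * 38), nrow = 28,
                   dimnames = list(paste0("gene", 1:28), NULL))

# ape is an add-on package; the tree is built only if it is installed.
if (requireNamespace("ape", quietly = TRUE)) {
  d  <- dist(kin.demo)  # Euclidean distances between genes (assumption)
  tr <- ape::nj(d)      # neighbor-joining tree, an object of class "phylo"
  plot(tr, cex = 0.7)   # tips are labelled with the row names of kin.demo
}
```

With the real data, setting the row names of kin to the gene descriptions in golub.gnames[oo, 2] before calling dist() would label the tips with the kinase names.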

PRKCD Protein kinase C, delta
PFKP Phosphofructokinase, platelet
Fructose 6-phosphate,2-kinase/fructose 2,6-bisphosphatase
PRKCQ Protein kinase C-theta
Protein tyrosine kinase related mRNA sequence
PRKAR1A CAMP-dependent protein kinase regulatory subunit type I
DCK Deoxycytidine kinase
BLK Protein-tyrosine kinase blk
Protein kinase inhibitor [human, neuroblastoma cell line SH-SY-5Y, mRNA, 2147 nt]
Serine kinase mRNA
CSNK1D Casein kinase 1, delta
Protein kinase C-binding protein RACK7 mRNA, partial cds
Protein kinase ATR mRNA
Hematopoietic progenitor kinase (HPK1) mRNA
CaM kinase II isoform mRNA
ITPKB Inositol 1,4,5-trisphosphate 3-kinase B
DAGK1 Diacylglycerol kinase, alpha (80kD)
RPL7A Neurotrophic tyrosine kinase, receptor, type 1
MST1R Protein-tyrosine kinase RON
mRNA (clone C-2k) mRNA for serine/threonine protein kinase
Nucleoside-diphosphate kinase
Ndr protein kinase
Phosphatidylinositol 3-kinase
DNA-dependent protein kinase catalytic subunit (DNA-PKcs) mRNA
CALM1 Calmodulin 1 (phosphorylase kinase, delta)
DAGK4 Diacylglycerol kinase delta
GB DEF = T-lymphocyte specific protein tyrosine kinase p56lck (lck) abberant mRNA
PRKCB1 Protein kinase C, beta 1
Figure: NJ clustering on the 28 kinase genes (tip labels listed above).

Conclusion:
Two tyrosine kinase genes cluster close to each other: “GB DEF = T-lymphocyte specific protein tyrosine kinase p56lck (lck) abberant mRNA” and “Protein tyrosine kinase related mRNA sequence”.
This grouping could be used to design a kinase inhibitor that might work on all the related kinases and help treat certain types of cancer.

Biochemical techniques to test the above analysis:
Perform automated chain-termination or Maxam-Gilbert DNA sequencing for each of the above closely related genes.
Also obtain the protein from the respective genes and sequence it.
Perform multiple sequence alignment (MSA) with the help of an online tool.
Compare the neighbor-joining tree obtained by the above computation with the phylogenetic tree that the MSA tool produces from the sequences obtained through wet lab analysis.

References:
• Klein RL, Brown AR, Gomez-Castro CM, Chambers SK, Cragun JM, Grasso-LeBeau L, Lang JE. Ovarian Cancer Metastatic to the Breast Presenting as Inflammatory Breast Cancer: A Case Report and Literature Review. J Cancer. 2010;1:27-31. doi:10.7150/jca.1.27. Available from http://www.jcancer.org/v01p0027.htm
• Malumbres M, Barbacid M. Cell cycle kinases in cancer. Curr Opin Genet Dev. 2007 Feb;17(1):60-5. [PMID: 17208431]
• Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987 Jul;4(4):406-25. [PMID: 3447015]
Bibliography:
• Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani (Feb 11th, 2013) An Introduction to Statistical Learning: with Applications in R, pp. 377-419.
• Wim P. Krijnen (2009) Applied Statistics for Bioinformatics using R.
