Hierarchical Cluster Analysis
Binf 702, Final Project, May 4th, 2015
Sreelakshmi Dodderi
Part 1: Cluster analysis on part of the NCI60 data.
Part 2: Cluster analysis on kinases of the Golub data using the neighbor-joining method.
Hierarchical Clustering
Is a connectivity-based clustering.
Is a whole family of methods that differ in how the distance between clusters (the linkage) is computed.
The result is represented using a dendrogram.
NCI60 data
• Dataset of gene expression profiles.
• The format is a list containing two elements:
data: a 64x6830 matrix of gene expression values (64 cell lines, 6830 genes).
labs: a vector giving the cancer type of each cell line. The 9 cancer types are
leukemia, melanoma, lung, colon, CNS, ovarian, renal, breast and prostate.
Computations on NCI60
• PCA
• Cluster analysis using complete, average and single linkage methods on:
• Set 1: Breast cancer and ovarian cancer cell lines (metastasis).
• Set 2: Colon cancer and prostate cancer cell lines (metastasis).
• Set 3: Colon cancer and renal cancer cell lines (no association found).
PCA Computations
The prcomp() function outputs the standard deviation of each principal component.
Squaring these standard deviations gives the variance explained by each component.
Proportion of variance explained (PVE) by each principal component
= variance explained by that component / total variance explained by all principal components.
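The PVE calculation above can be sketched in a few lines of R. This is only an illustration on simulated data: the matrix `expr` and the variable names are assumptions, not the NCI60 subset used in the project.

```r
# Simulated stand-in for an expression matrix: 20 samples x 50 genes
# (hypothetical data, not NCI60).
set.seed(1)
expr <- matrix(rnorm(20 * 50), nrow = 20, ncol = 50)

# prcomp() centers the columns by default; scale. = TRUE also gives each
# gene unit variance before the decomposition.
pr_out <- prcomp(expr, scale. = TRUE)

# Standard deviation of each principal component, squared to get variances.
pc_var <- pr_out$sdev^2

# Proportion of variance explained (PVE) and its running total.
pve <- pc_var / sum(pc_var)
cum_pve <- cumsum(pve)

print(round(head(pve), 3))
print(round(head(cum_pve), 3))
```

Plotting `pve` and `cum_pve` against the component index gives the scree-style view of the same numbers.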
[Figure: Barplot of principal component analysis on part of the NCI60 data; vertical axis: Variances, 0 to 1000.]
Screeplot
It is more informative to plot the PVE and the cumulative PVE of each principal component.
While each of the first 5 principal components explains a substantial amount of variance, there is a marked decrease in the variance explained by the later principal components.
[Figure: Dendrograms of 7 breast and 6 ovarian cancer cell lines (distance object DtBO), three panels: complete linkage, average linkage, single linkage; vertical axis: Height.]
Under complete linkage, one of the ovarian cancer cell lines is very closely related to a breast cancer cell line, and the two form a separate cluster of their own.
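A minimal sketch of how the three linkage methods can be compared with hclust() is shown here. The data are simulated, so the matrix `expr`, the label vector `labs` and the object `d_bo` are hypothetical stand-ins for the breast/ovarian subset, and Euclidean distance is assumed.

```r
# Hypothetical stand-in for the breast/ovarian subset: 13 samples x 100 genes.
set.seed(2)
labs <- c(rep("BREAST", 7), rep("OVARIAN", 6))
expr <- matrix(rnorm(13 * 100), nrow = 13, dimnames = list(labs, NULL))

# Euclidean distances between samples (rows).
d_bo <- dist(expr)

# One tree per linkage method.
hc <- lapply(c(complete = "complete", average = "average", single = "single"),
             function(m) hclust(d_bo, method = m))

# Compare a two-cluster cut of each tree with the known labels.
for (m in names(hc)) {
  cat(m, "linkage\n")
  print(table(cluster = cutree(hc[[m]], k = 2), type = labs))
}
```

Calling `plot(hc$complete)` would draw the corresponding dendrogram with the cancer types as leaf labels.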
Clustering the entire NCI60 data by Euclidean and maximum distance methods.
The "maximum" metric takes the distance between two samples (x and y) to be the largest absolute difference between them across all genes, whereas the Euclidean metric combines the differences over all genes.
With either metric, a smaller distance means more similar expression profiles, so samples that join low in the dendrogram are the closely related ones.
We can use both metrics to examine the relatedness of breast cancer and ovarian cancer cell lines.
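To make the difference between the two metrics concrete, here is a small worked example with dist(). The two profiles are made-up numbers chosen so the arithmetic is easy to follow.

```r
# Two toy profiles measured on three genes (hypothetical values).
x <- c(1, 5, 2)
y <- c(4, 1, 2)
m <- rbind(x, y)

# Euclidean: sqrt((1-4)^2 + (5-1)^2 + (2-2)^2) = 5
d_euclid <- dist(m, method = "euclidean")

# Maximum: max(|1-4|, |5-1|, |2-2|) = 4
d_max <- dist(m, method = "maximum")

print(as.numeric(d_euclid))  # 5
print(as.numeric(d_max))     # 4
```

Passing either distance object to hclust() gives the dendrogram for that metric.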
Maximum metric method.
[Figure: Dendrogram of all 64 NCI60 cell lines, complete linkage, maximum distance method; vertical axis from 3 to 8.]
Euclidean metric method.
[Figure: Dendrogram of all 64 NCI60 cell lines, complete linkage, Euclidean distance method; vertical axis from 40 to 160.]
[Figure: Clustering of 7 colon cancer and 2 prostate cancer cell lines (distance object DtCP), three panels: complete linkage, average linkage, single linkage; vertical axis: Height.]
[Figure: Clustering of 7 colon cancer and 9 renal cancer cell lines (distance object DstCR), three panels: complete linkage, average linkage, single linkage; vertical axis: Height.]
Conclusion of Part 1
• Complete linkage cluster analysis of ovarian cancer and breast cancer cell lines shows a close relatedness between the two.
• Cluster analysis on the other two sets (colon and prostate cancer; colon and renal cancer) shows that the methods group cell lines of a single cancer type together.
• Single linkage tends to yield trailing clusters, onto which individual samples attach one by one.
• Complete and average linkage tend to yield more balanced clusters.
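The trailing behaviour of single linkage can be seen on a toy example. The eight points below are invented for illustration: they lie on a line with slowly growing gaps, which is the kind of layout where single linkage chains samples one by one.

```r
# Eight points on a line with slowly growing gaps (toy data).
x <- c(0, 1, 2.1, 3.3, 4.6, 6.0, 7.5, 9.1)
d <- dist(x)

hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")

# Cut each tree into two clusters and compare the cluster sizes.
cl_single   <- cutree(hc_single, k = 2)
cl_complete <- cutree(hc_complete, k = 2)

print(table(cl_single))    # sizes 7 and 1
print(table(cl_complete))  # sizes 4 and 4
```

Single linkage leaves the last point as its own cluster, while complete linkage splits the same points into two groups of four.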
Part 2: Cluster analysis of kinase genes in Golub
data using Neighbor joining method.
Why kinase genes?
Kinases modify other proteins by chemically adding phosphate groups to them. Phosphorylation can turn a protein on or off.
Kinases regulate the majority of cellular signal transduction pathways. Errors in signaling pathways are responsible for diseases such as cancer and autoimmunity.
Some kinase inhibitors are used in treating cancer.
Goal: identify closely related kinase genes through cluster analysis.
Closely related kinases often have similar structure and function.
A kinase inhibitor designed for one kinase might therefore also work for a group of closely related kinases or a whole family of kinases.
Why Neighbor Joining Method?
• The NJ method is statistically consistent under many models of evolution.
• Unlike UPGMA, neighbor joining does not assume that all lineages evolve at the same rate.
• It is well suited to assigning individuals to groups, which often correspond to self-identified geographical ancestry.
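Computations:
• Get only the kinase genes from the Golub data.
> library("multtest"); data(golub)
> # Class factor for the 38 samples (golub.cl: 0 = ALL, 1 = AML)
> gol.fac <- factor(golub.cl, levels = 0:1, labels = c("ALL", "AML"))
> # Row indices of genes whose description contains "kinase"
> o <- grep("kinase", golub.gnames[, 2])
> length(o)
[1] 139
There are 139 kinase genes.
• Use a two-sample t-test to select genes with an experimental effect.
> # p-value of a Welch t-test (ALL vs. AML) for every gene
> pt <- apply(golub, 1, function(x) t.test(x ~ gol.fac)$p.value)
> # Keep the kinase genes with p < 0.01
> oo <- o[pt[o] < 0.01]
> kin <- golub[oo, ]
> dim(kin)
[1] 28 38
This yields 28 genes.
• Perform NJ clustering on the 28 kinase genes.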
PRKCD Protein kinase C, delta
PFKP Phosphofructokinase, platelet
Fructose 6-phosphate,2-kinase/fructose 2,6-bisphosphatase
PRKCQ Protein kinase C-theta
Protein tyrosine kinase related mRNA sequence
PRKAR1A CAMP-dependent protein kinase regulatory subunit type I
DCK Deoxycytidine kinase
BLK Protein-tyrosine kinase blk
Protein kinase inhibitor [human, neuroblastoma cell line SH-SY-5Y, mRNA, 2147 nt]
Serine kinase mRNA
CSNK1D Casein kinase 1, delta
Protein kinase C-binding protein RACK7 mRNA, partial cds
Protein kinase ATR mRNA
Hematopoietic progenitor kinase (HPK1) mRNA
CaM kinase II isoform mRNA
ITPKB Inositol 1,4,5-trisphosphate 3-kinase B
DAGK1 Diacylglycerol kinase, alpha (80kD)
RPL7A Neurotrophic tyrosine kinase, receptor, type 1
MST1R Protein-tyrosine kinase RON
mRNA (clone C-2k) mRNA for serine/threonine protein kinase
Nucleoside-diphosphate kinase
Ndr protein kinase
Phosphatidylinositol 3-kinase
DNA-dependent protein kinase catalytic subunit (DNA-PKcs) mRNA
CALM1 Calmodulin 1 (phosphorylase kinase, delta)
DAGK4 Diacylglycerol kinase delta
GB DEF = T-lymphocyte specific protein tyrosine kinase p56lck (lck) abberant mRNA
PRKCB1 Protein kinase C, beta 1
NJ clustering on the 28 kinase genes.
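The slides do not show the call used for the NJ step, so the following is only a sketch of one way to do it. It assumes the `ape` package and Euclidean distance between gene profiles, and it uses a simulated matrix `kin_sim` in place of the 28 x 38 matrix `kin`.

```r
# Hypothetical stand-in for the 28 x 38 kinase expression matrix `kin`.
set.seed(3)
kin_sim <- matrix(rnorm(28 * 38), nrow = 28,
                  dimnames = list(paste0("kinase_", 1:28), NULL))

# Pairwise distances between genes (rows). Euclidean distance is an
# assumption; the slides do not state which metric was used.
d_kin <- dist(kin_sim)

# ape::nj() builds a neighbor-joining tree from a distance matrix.
# Using ape is an assumption, and the step is skipped if it is not installed.
if (requireNamespace("ape", quietly = TRUE)) {
  nj_tree <- ape::nj(d_kin)
  print(nj_tree)
  # plot(nj_tree, cex = 0.6) would draw the tree with gene names at the tips.
}
```

With the real data, the row names of `kin` could be set to the gene descriptions so that they appear as tip labels.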
Conclusion:
Two of the tyrosine kinase genes cluster close to each other: "GB DEF = T-lymphocyte specific protein tyrosine kinase p56lck (lck) abberant mRNA" and "Protein tyrosine kinase related mRNA sequence".
This result could be used to design a kinase inhibitor that might work on all the related kinases and help treat certain types of cancer.
Biochemical techniques to test the above analysis:
Perform automated chain-termination or Maxam-Gilbert DNA sequencing for each of the above closely related genes.
Also obtain the protein from the respective genes, and sequence it.
Perform multiple sequence alignment (MSA) with the help of an online tool.
Compare the neighbor-joining tree obtained by the above computation with the phylogenetic tree produced by the MSA tool from the sequences obtained through wet lab analysis.
References:
• Klein RL, Brown AR, Gomez-Castro CM, Chambers SK, Cragun JM, Grasso-LeBeau L, Lang JE. Ovarian Cancer
Metastatic to the Breast Presenting as Inflammatory Breast Cancer: A Case Report and Literature Review. J
Cancer 2010; 1:27-31. doi:10.7150/jca.1.27. Available from http://www.jcancer.org/v01p0027.htm
• Malumbres M, and Barbacid M. 2007 Feb 17th, Cell cycle Curr Opin Genet Dev. 60-5 [PMID: 17208431]
• Saitou N, and Nei M. 1987 July 4th The neighbor-joining method: a new method for reconstructing
phylogenetic trees. Mol Biol Evol. 406-25 [PMID: 3447015]
• Bibilography:
• Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Feb11th, 2013 An Introduction to
Statistical Learning: with Applications in R, pp377-419.
• Wim P. Krijnen (2009) Applied Statistics for Bioinformatics using R.
Thank You !

Project_702

  • 1. Hierarchical Cluster Analysis Binf 702,Final Project May 4th, 2015 Sreelakshmi Dodderi
  • 2. Part1: Cluster analysis on part of NCI60 data. Part2: Cluster analysis on kinases of Golub data using Neighbor joining method.
  • 3. Hierarchical Clustering Is a connectivity based clustering. Is a whole family of methods that differ by the way distances are computed. Represented using a dendrogram.
  • 4. NCI60 data • Dataset of gene expression profiles. • The format is a list containing two elements: data- a 64x6830 matrix of gene expression values. labs- is a vector listing the 9 cancer types. leukemia, melanoma, lung, colon, CNS, ovarian, renal, breast and prostate cancers.
  • 5. Computations on NCI60 • PCA • Cluster analysis using complete, average and single linkage methods on : • Set1:Breast cancer and ovarian cancer cell lines.(metastasis) • Set2:Colon cancer and prostate cancer cell lines(metastasis) • Set3:Colon cancer and renal cancer cell lines(no association found)
  • 6. PCA Computations. prcomp() function outputs the standard deviation of each principal component. Squaring these standard deviations=variance Proportion of variance explained (PVE) by each principal component =variance explained by each principal component /the total variance explained by all principal components.
  • 7. Barplot of PCA on a part of NCI60 data. [Figure: plot of principal component analysis on NCI60 data; y-axis: Variances.]
  • 8. Screeplot. It is more informative to plot the PVE and the cumulative PVE of each principal component. While each of the first 5 principal components explains a substantial amount of variance, there is a marked decrease in the variance explained by the subsequent principal components.
  • 9. [Figure: dendrograms of the breast (7) and ovarian (6) cancer cell lines, distance object DtBO, in three panels: complete linkage, average linkage, single linkage; y-axis: Height.] In complete linkage, one of the ovarian cancer cell lines is very closely related to a breast cancer cell line, and the two form a separate cluster together.
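Dendrograms of this kind come from calling hclust() on a distance object with different `method` values. The sketch below uses simulated data with two artificially well-separated groups, so the labels, group sizes, and the shift of 3 are all hypothetical and only stand in for the NCI60 subset.

```r
set.seed(2)
# Hypothetical stand-in: 6 "BREAST" and 6 "OVARIAN" samples, 100 genes each.
labs <- rep(c("BREAST", "OVARIAN"), each = 6)
x <- matrix(rnorm(12 * 100), nrow = 12)
# Shift the second group so the two groups are clearly separable (assumption).
x[labs == "OVARIAN", ] <- x[labs == "OVARIAN", ] + 3

# Euclidean distance between samples (the default of dist()).
d <- dist(x)

# One tree per linkage method.
hc <- lapply(c(complete = "complete", average = "average", single = "single"),
             function(m) hclust(d, method = m))

# Cut each tree into two clusters and cross-tabulate against the labels.
clusters <- lapply(hc, cutree, k = 2)
print(table(clusters$complete, labs))

# To draw a dendrogram:
# plot(hc$complete, labels = labs, main = "Complete linkage")
```

On real expression data the groups are rarely this clean, which is why the three linkage methods can disagree about individual cell lines.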
  • 10. Clustering the entire NCI60 data by Euclidean and maximum distance methods. With the “maximum” metric, the distance between two samples x and y is the largest absolute difference between their components (the supremum norm), whereas the Euclidean distance combines the differences across all components. In both cases, hierarchical clustering merges the samples with the smallest distance first, so samples that join low in the dendrogram are the closely related ones. We can use the maximum metric as a second view of the relatedness of the breast cancer and ovarian cancer cell lines.
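For reference, in R's dist() the "maximum" method returns the largest absolute coordinate difference between two rows, and hclust() always merges the pair with the smallest distance first, whichever metric is used. A minimal sketch on a made-up 3 x 3 matrix (the values are purely illustrative):

```r
# Three toy samples with three measurements each (made-up values).
x <- rbind(a = c(0, 0, 0),
           b = c(1, 2, 2),
           c = c(0, 5, 0))

d.euc <- dist(x, method = "euclidean")  # sqrt of summed squared differences
d.max <- dist(x, method = "maximum")    # largest absolute difference

print(d.euc)
print(d.max)

# Clustering on the maximum distance: the closest pair is merged first.
hc.max <- hclust(d.max, method = "complete")
print(hc.max$merge)
```

Here samples a and b are the closest pair under both metrics, so they are joined first in either tree.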
  • 13. [Figure: dendrograms of the colon and prostate cancer cell lines, distance object DtCP, in three panels: complete linkage, average linkage, single linkage; y-axis: Height.] Clustering of 7 colon cancer and 2 prostate cancer cell lines.
  • 14. [Figure: dendrograms of the renal and colon cancer cell lines, distance object DstCR, in three panels: complete linkage, average linkage, single linkage; y-axis: Height.] Clustering of 7 colon cancer and 9 renal cancer cell lines.
  • 15. Conclusion of Part 1. • Complete linkage cluster analysis of the ovarian cancer and breast cancer cell lines shows a close relatedness between the two. • Cluster analysis on the other two subsets of the data (colon and prostate cancer; colon and renal cancer) shows that the methods group cell lines clearly within a single cancer type. • Single linkage tends to yield trailing clusters, onto which individual samples attach one by one. • Complete and average linkage tend to yield more balanced clusters.
  • 16. Part 2: Cluster analysis of kinase genes in the Golub data using the neighbor-joining method. Why kinase genes? • Kinases modify other proteins by chemically adding a phosphate group to them. • Phosphorylation can turn a protein on or off. • Kinases regulate the majority of cellular signal transduction pathways. • Errors in signaling pathways are responsible for diseases such as cancer and autoimmunity. • Some kinase inhibitors are used in treating cancer. • The aim is to identify closely related kinase genes through cluster analysis. Such closely related kinases often have similar structure and function, so a kinase inhibitor designed for one of them might also work for a group of closely related kinases or a whole kinase family.
  • 17. Why the neighbor-joining method? • The NJ method is statistically consistent under many models of evolution. • Unlike UPGMA, neighbor joining does not assume that all lineages evolve at the same rate. • It is well suited to assigning individuals to groups that often correspond to self-identified geographical ancestry.
  • 18. Computations:
    • Get only the kinase genes from the Golub data.
        > library("multtest"); data(golub)
        > o <- grep("kinase", golub.gnames[,2])
        > length(o)
        [1] 139
      There are 139 kinase genes.
    • Use a two-sample t-test to select genes with an experimental effect. The factor gol.fac is not defined on the slide; it is assumed here to be the ALL/AML class factor built from golub.cl.
        > gol.fac <- factor(golub.cl, levels=0:1, labels=c("ALL","AML"))
        > pt <- apply(golub, 1, function(x) t.test(x ~ gol.fac)$p.value)
        > oo <- o[pt[o] < 0.01]
        > kin <- golub[oo,]
        > dim(kin)
        [1] 28 38
      This yields 28 genes.
    • Perform NJ clustering on the 28 kinase genes.
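The selection step can be tried without installing multtest by running the same apply() and t.test() pattern on simulated data. In the sketch below everything except that pattern is an assumption: the matrix `expr`, the index vector `o` (standing in for the grep() hits), and the ten genes given an artificial group difference are hypothetical. Only the 27/11 class split mirrors the Golub data.

```r
set.seed(3)
n.genes <- 200
# 38 samples split 27/11, matching the ALL/AML class sizes in golub.
fac <- factor(rep(c("ALL", "AML"), times = c(27, 11)))

# Simulated expression matrix (rows = genes, columns = samples).
expr <- matrix(rnorm(n.genes * length(fac)), nrow = n.genes)
# Hypothetical: give the first 10 genes a true group difference.
expr[1:10, fac == "AML"] <- expr[1:10, fac == "AML"] + 3

# Hypothetical subset of interest (stands in for the grep("kinase", ...) hits).
o <- 1:50

# Welch two-sample t-test per gene, then keep subset genes with p < 0.01.
pt <- apply(expr, 1, function(x) t.test(x ~ fac)$p.value)
oo <- o[pt[o] < 0.01]
sel <- expr[oo, , drop = FALSE]
print(dim(sel))
```

Note that the threshold is applied to unadjusted p-values, as on the slide, so a few genes without a true effect can pass by chance.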
  • 19. [Figure: NJ clustering on the 28 kinase genes.] Tip labels as extracted from the tree: PRKCD Protein kinase C, delta; PFKP Phosphofructokinase, platelet; Fructose 6-phosphate,2-kinase/fructose 2,6-bisphosphatase; PRKCQ Protein kinase C-theta; Protein tyrosine kinase related mRNA sequence; PRKAR1A CAMP-dependent protein kinase regulatory subunit type I; DCK Deoxycytidine kinase; BLK Protein-tyrosine kinase blk; Protein kinase inhibitor [human, neuroblastoma cell line SH-SY-5Y, mRNA, 2147 nt]; Serine kinase mRNA; CSNK1D Casein kinase 1, delta; Protein kinase C-binding protein RACK7 mRNA, partial cds; Protein kinase ATR mRNA; Hematopoietic progenitor kinase (HPK1) mRNA; CaM kinase II isoform mRNA; ITPKB Inositol 1,4,5-trisphosphate 3-kinase B; DAGK1 Diacylglycerol kinase, alpha (80kD); RPL7A Neurotrophic tyrosine kinase, receptor, type 1; MST1R Protein-tyrosine kinase RON mRNA; (clone C-2k) mRNA for serine/threonine protein kinase; Nucleoside-diphosphate kinase; Ndr protein kinase; Phosphatidylinositol 3-kinase; DNA-dependent protein kinase catalytic subunit (DNA-PKcs) mRNA; CALM1 Calmodulin 1 (phosphorylase kinase, delta); DAGK4 Diacylglycerol kinase delta; GB DEF = T-lymphocyte specific protein tyrosine kinase p56lck (lck) abberant mRNA; PRKCB1 Protein kinase C, beta 1.
  • 20. Conclusion: Two tyrosine kinase genes cluster closely together: “GB DEF = T-lymphocyte specific protein tyrosine kinase p56lck (lck) abberant mRNA” and “Protein tyrosine kinase related mRNA sequence”. This result could guide the design of a kinase inhibitor that might work on all the related kinases and help treat certain types of cancer.
  • 21. Biochemical techniques to test the above analysis: • Perform automated chain-termination or Maxam-Gilbert DNA sequencing for each of the above closely related genes. • Also obtain the protein from the respective genes and sequence it. • Perform multiple sequence alignment (MSA) with the help of an online tool. • Compare the neighbor-joining tree obtained by the above computation with the phylogenetic tree produced by the MSA tool from the sequences obtained through wet lab analysis.