DENSITY-BASED SPATIAL CLUSTERING OF
APPLICATIONS WITH NOISES FOR DNA
METHYLATION DATA
Division of Statistics
Northern Illinois University, 2017
Committee:
Dr. Alan Polansky
Dr. Nader Ebrahimi
Dr. Haiming Zhou
Dr. Duchwan Ryu
Mohammed Atef Alghzzy
Contents:
DNA Methylation
Cluster Analysis (K-Means and DBSCAN)
Simulation Study
Clustering for DNA methylation
DNA Methylation
• DNA methylation is a process by which methyl groups are added to the cytosine nucleotide in DNA.
• Methylation can change the activity of a DNA segment without changing its sequence. When located in a gene promoter, it typically acts to repress gene transcription.
Motivation to Study Methylation:
• DNA methylation plays a crucial role in the development and progression of cancer (Kerr et al., 2007).
• Changes in DNA methylation have been associated with many human diseases, especially cancer (Kulis and Esteller, 2010; Spisák et al., 2012).
Difficulties in Analyzing DNA Methylation:
• DNA methylation data are very large (28 million CpG sites in the human genome).
• Methylation values usually follow a non-symmetric distribution at each CpG site, and the samples form non-linear groups.
Methods Considered:
• We use machine learning algorithms, as they are called in computer science: algorithms that give computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959).
• Machine learning algorithms:
1. Unsupervised algorithms (cluster analysis): there is no prior information about the groups in the data.
2. Supervised algorithms (discriminant analysis): there is prior information about the groups in the data.
Cluster Analysis
• Clustering (or cluster analysis) is one of the main data analysis techniques. It deals with organizing a set of objects in a multidimensional space into cohesive groups, called clusters.
• Each cluster contains objects that are very similar to each other and very dissimilar to objects in other clusters (Rasmussen, 1992).
Clustering algorithms fall into two main types:
• Hierarchical algorithms: decompose the n objects into several levels of nested clusters, represented by a dendrogram in which each node of the tree represents a cluster of data.
• Partitioning algorithms: construct a flat (single-level) partition of the n objects into k clusters, such that the objects in a cluster are more similar to each other than to objects in other clusters. Examples are K-Means and DBSCAN.
Cluster analysis steps:
1. Choose a distance function
2. Construct the proximity matrix
3. Choose a clustering algorithm
1. Choose a distance function:
▪ Euclidean distance: $d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}$
or
▪ Manhattan distance: $d(x, y) = \sum_{j=1}^{p} |x_j - y_j|$
2. Calculate the differences between observations with a proximity matrix:
[Matrix: n × n proximity matrix whose (i, j) entry is the distance between observations i and j]
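The first two steps, choosing a distance function and building the proximity matrix, can be sketched in a few lines of Python. This is only an illustration using the standard definitions of Euclidean and Manhattan distance. The function names and the three sample points are assumptions made for the example and do not come from the slides.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def proximity_matrix(points, distance=euclidean):
    """Return the symmetric n x n matrix of pairwise distances."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(points[i], points[j])
            matrix[i][j] = matrix[j][i] = d  # distances are symmetric
    return matrix


# Three illustrative observations (made-up values, not from the slides).
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = proximity_matrix(points)
D_manhattan = proximity_matrix(points, distance=manhattan)
```

Passing the distance function as an argument keeps the choice made in step 1 separate from the matrix construction in step 2.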
3. Choose a clustering algorithm:
1) Hierarchical Clustering
2) K-Means
3) K-Medians
4) Expectation Maximization
5) Fuzzy Clustering
6) Non-Negative Matrix Factorization
7) Latent Dirichlet Allocation (LDA)
8) DBSCAN
K-Means Clustering:
• Each data point belongs to the cluster with the nearest mean. The algorithm was proposed by Stuart Lloyd (1957).
• It requires only the number of clusters (K) as input, which makes it one of the most popular clustering algorithms.
K-Means algorithm:
Input: D = {d1, d2, ..., dn}; k: number of desired clusters (e.g. k = 2)
1. Arbitrarily choose k data items from D as initial centroids.
2. Assign each item di to the cluster with the closest centroid.
3. Calculate the new mean of each cluster.
4. Repeat steps 2 and 3 until the convergence criterion is met.
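The K-Means steps can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the study. It assumes squared Euclidean distance, takes the first k items as the "arbitrary" initial centroids, and treats an unchanged assignment as the convergence criterion. The function name `k_means` and the six sample points are hypothetical.

```python
def k_means(points, k, max_iter=100):
    """Lloyd-style K-Means on a list of equal-length numeric tuples."""
    # Step 1: arbitrary initial centroids (here simply the first k items).
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each item to the cluster with the closest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Step 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Two small, well-separated groups (made-up values, not from the slides).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centroids = k_means(points, k=2)
```

Because both initial centroids fall in the first group here, the first pass produces a poor split, and the mean update in step 3 is what pulls the second centroid toward the other group.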
K-Means Clustering:
Advantages:
1. Simple, easy to implement, and easy to interpret.
2. Fast and efficient in terms of computational cost.
Disadvantages:
1. Often produces clusters of relatively uniform size, even if the true clusters differ in size.
2. Cannot find non-linear clusters or clusters with unusual shapes.
DBSCAN:
• Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester et al. (1996).
• It is based on connecting points that lie within a certain distance threshold of each other.
• It only connects points that satisfy a density criterion given by (Ɛ, MinPts).
DBSCAN algorithm:
Choose Ɛ and MinPts (set by a field expert).
1. Select an arbitrary point p.
2. Label p a core point if its neighborhood of radius Ɛ contains MinPts or more points.
3. Label p a border point if its neighborhood of radius Ɛ contains fewer than MinPts points but p lies within Ɛ of a core point.
4. Otherwise, label p as noise.
5. Continue until all points are covered.
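The DBSCAN labeling steps can be sketched in plain Python as shown here. This is a simplified illustration under stated assumptions, not the code used in the study. It uses Euclidean distance via `math.dist` (Python 3.8 or later), counts a point as part of its own neighborhood, and marks noise with the label -1. The names `region_query` and `dbscan` and the sample points are hypothetical.

```python
import math

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_neighbors)
    return labels


# Two dense groups and one isolated point (made-up values, not from the slides).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The number of clusters is never passed in. It falls out of how many core points turn out to be mutually unreachable, and the isolated point is left as noise rather than forced into a cluster.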
DBSCAN
Advantages:
1. Clusters can have arbitrary shape and size.
2. The number of clusters is determined automatically (unlike K-Means).
3. Can separate clusters from surrounding noise (it defines noise explicitly).
4. The parameters MinPts and Ɛ should be set by the domain expert (not by statisticians!).
Disadvantages:
• The results are very sensitive to MinPts and Ɛ, which are difficult to determine.
Simulation Study
• We generated two non-linear groups of data in Microsoft Excel: 346 points in two dimensions (X, Y), shaped like two overlapping moons.
[Figure: scatter plot of the simulated data; X axis from 0 to 45, Y axis from 0 to 35]
Descriptive Statistics
           X       Y
Mean       20.97   21.9
Median     21      22
SD         10.55   4.84
Range      39      21
Minimum    1       11
Maximum    40      32
Example of K-Means clustering
[Figure: K-Means (K = 2) applied to the simulated data]
Example of DBSCAN
[Figure: DBSCAN (Ɛ = 1, MinPts = 4) applied to the simulated data]
Misclassification of Clustering
True       K-Means        DBSCAN
Cluster    1      2       1      2      3      Total
1          117    31      148    0      0      148
2          56     142     0      195    3      198
Total      173    173     148    195    3      346
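The misclassification counts can be turned into rates with a short calculation. The sketch below copies the counts from the table and assumes that each true cluster is matched to the assigned cluster holding most of its points. That matching rule and the function name are assumptions made for this illustration. The slides report only the counts.

```python
# Rows: true cluster; columns: assigned cluster (counts copied from the table).
kmeans_table = [[117, 31],
                [56, 142]]
dbscan_table = [[148, 0, 0],
                [0, 195, 3]]


def misclassification_rate(table):
    """Share of points that fall outside the dominant cluster of their true group.

    Assumes each true cluster maps to the assigned cluster containing most of it.
    """
    total = sum(sum(row) for row in table)
    correct = sum(max(row) for row in table)
    return (total - correct) / total


kmeans_rate = misclassification_rate(kmeans_table)  # 87 of 346 points
dbscan_rate = misclassification_rate(dbscan_table)  # 3 of 346 points
print(f"K-Means: {kmeans_rate:.1%}, DBSCAN: {dbscan_rate:.1%}")
```

Under this matching, K-Means misplaces 87 of the 346 simulated points, while DBSCAN misplaces only the 3 points it assigned to a third group.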
Clustering for DNA methylation
Usual clustering for DNA methylation is conducted by two-way dendrograms of clusters for samples and CpG sites.
Description of the DNA Methylation Data:
• The data are microarray data from the TCGA analysis of DNA methylation for lung adenocarcinoma, collected with the Illumina Infinium HumanMethylation27 platform.
Methylation Ratios Data: Descriptive Statistics
Status   Count   Min      Max      Ave.
Cancer   65      0.0076   0.9703   0.2683
Normal   24      0.0083   0.9584   0.2562
Total    89      0.0076   0.9703   0.265
Samples:
• We examined two randomly selected CpG sites, 117586918 and 117746793, for linearity of the groups of samples.
• Notice the non-linearity of the samples.
[Figure: scatter plot of the samples at the two CpG sites; horizontal axis from 0 to 0.7, vertical axis from 0 to 1; legend: Cancer, Normal]
CpG sites:
• We checked the samples against each other and found that the first sample and sample 13 form a non-linear shape, which makes us quite sure that a linear classification would be difficult.
• We see the need to use the DBSCAN algorithm!
[Figure: scatter plot of the CpG sites for two samples; both axes from 0 to 1.2]
Logit transformation:
• The CpG sites have non-symmetric distributions, which is the first indicator of non-linearity in the methylation data.
Summary of DNA Methylation Ratios to Analyze (after the logit transformation)
Status   Min       Max      Ave.
Cancer   -4.8628   3.4868   -1.814
Normal   -4.7809   3.1381   -1.9554
Total    -4.862    3.486    -1.852
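The logit transformation itself is short enough to sketch. The slides do not state the formula, so this example assumes the standard natural-log logit, ln(p / (1 - p)), which is consistent with the transformed maxima in the table (for example, 0.9703 maps to roughly 3.49). The function names are assumptions made for the illustration.

```python
import math


def logit(p):
    """Map a methylation ratio p in (0, 1) to the real line: ln(p / (1 - p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is only defined for 0 < p < 1")
    return math.log(p / (1.0 - p))


def inverse_logit(x):
    """Map a logit value back to a ratio in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Ratios taken from the descriptive statistics table (cancer min, average, max).
ratios = [0.0076, 0.2683, 0.9703]
transformed = [logit(p) for p in ratios]
```

The transformation stretches ratios near 0 and 1 over the whole real line. Note that the average of the transformed values is generally not the logit of the average ratio, which is why the "Ave." column cannot be reproduced from the ratio table alone.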
Clustering Samples:
• DBSCAN gives more valuable and useful results, since it separates the cancer samples from the normal samples.
• K-Means, in contrast, split the cancer samples into two clusters, which is not useful.
Comparison between DBSCAN and K-Means for DNA Methylation Ratios
          K-Means                  DBSCAN
Status    Cluster 1   Cluster 2    Cluster 1   Cluster 2    Total
Cancer    30          35           4           61           65
Normal    24          0            23          1            24
Total     54          35           27          62           89
Clustering CpG sites:
DBSCAN and K-Means for the CpG sites
Cluster   DBSCAN   K-Means
1         21       17
2         7        11
Total     28       28
• DBSCAN identified a small number of differentially methylated CpG sites and a large number of non-differentially methylated CpG sites.
• K-Means, in contrast, produced similar numbers of differentially methylated and non-differentially methylated CpG sites!
Necessary work afterwards:
• The gene located after the 7 CpG sites identified as differentially methylated is suspected to play a crucial role in the cancer. According to the Santa Cruz Genome Browser, this gene has the function "Protects DRG2 from proteolytic degradation", which is further motivation to study it in future work.
Thank you

More Related Content

What's hot

Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
Ricardo Wendell Rodrigues da Silveira
 
Block cipher modes of operations
Block cipher modes of operationsBlock cipher modes of operations
Block cipher modes of operations
AkashRanjandas1
 
Independent Component Analysis
Independent Component AnalysisIndependent Component Analysis
Independent Component Analysis
Tatsuya Yokota
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
Kenny Daniel
 
LIVER-SEG-PPT-1.pptx
LIVER-SEG-PPT-1.pptxLIVER-SEG-PPT-1.pptx
LIVER-SEG-PPT-1.pptx
SunilNaik85
 
10 Instruction Sets Characteristics
10  Instruction  Sets Characteristics10  Instruction  Sets Characteristics
10 Instruction Sets Characteristics
Jeanie Delos Arcos
 
K - Nearest neighbor ( KNN )
K - Nearest neighbor  ( KNN )K - Nearest neighbor  ( KNN )
K - Nearest neighbor ( KNN )
Mohammad Junaid Khan
 
Prediction of Diamond Prices Using Multivariate Regression
Prediction of Diamond Prices Using Multivariate RegressionPrediction of Diamond Prices Using Multivariate Regression
Prediction of Diamond Prices Using Multivariate Regression
MohitMhapuskar
 
Unsupervised learning
Unsupervised learningUnsupervised learning
Unsupervised learning
amalalhait
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
Kamal Acharya
 
Dbscan algorithom
Dbscan algorithomDbscan algorithom
Dbscan algorithom
Mahbubur Rahman Shimul
 
Sram technology
Sram technologySram technology
Sram technology
dilipbagadai
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
Azad public school
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learning
butest
 
Decision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceDecision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data science
MaryamRehman6
 
Id3 algorithm
Id3 algorithmId3 algorithm
Id3 algorithm
SreekuttanJayakumar
 
Artificial Intelligence: Knowledge Acquisition
Artificial Intelligence: Knowledge AcquisitionArtificial Intelligence: Knowledge Acquisition
Artificial Intelligence: Knowledge Acquisition
The Integral Worm
 
Decision tree
Decision treeDecision tree
Decision tree
shivani saluja
 
Data, Text and Web Mining
Data, Text and Web Mining Data, Text and Web Mining
Data, Text and Web Mining
Jeremiah Fadugba
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
Arshad Farhad
 

What's hot (20)

Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
 
Block cipher modes of operations
Block cipher modes of operationsBlock cipher modes of operations
Block cipher modes of operations
 
Independent Component Analysis
Independent Component AnalysisIndependent Component Analysis
Independent Component Analysis
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
 
LIVER-SEG-PPT-1.pptx
LIVER-SEG-PPT-1.pptxLIVER-SEG-PPT-1.pptx
LIVER-SEG-PPT-1.pptx
 
10 Instruction Sets Characteristics
10  Instruction  Sets Characteristics10  Instruction  Sets Characteristics
10 Instruction Sets Characteristics
 
K - Nearest neighbor ( KNN )
K - Nearest neighbor  ( KNN )K - Nearest neighbor  ( KNN )
K - Nearest neighbor ( KNN )
 
Prediction of Diamond Prices Using Multivariate Regression
Prediction of Diamond Prices Using Multivariate RegressionPrediction of Diamond Prices Using Multivariate Regression
Prediction of Diamond Prices Using Multivariate Regression
 
Unsupervised learning
Unsupervised learningUnsupervised learning
Unsupervised learning
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
 
Dbscan algorithom
Dbscan algorithomDbscan algorithom
Dbscan algorithom
 
Sram technology
Sram technologySram technology
Sram technology
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learning
 
Decision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceDecision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data science
 
Id3 algorithm
Id3 algorithmId3 algorithm
Id3 algorithm
 
Artificial Intelligence: Knowledge Acquisition
Artificial Intelligence: Knowledge AcquisitionArtificial Intelligence: Knowledge Acquisition
Artificial Intelligence: Knowledge Acquisition
 
Decision tree
Decision treeDecision tree
Decision tree
 
Data, Text and Web Mining
Data, Text and Web Mining Data, Text and Web Mining
Data, Text and Web Mining
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
 

Similar to Density based spatial clustering of applications with noises for dna methylation data

Novel algorithms for Knowledge discovery from neural networks in Classificat...
Novel algorithms for  Knowledge discovery from neural networks in Classificat...Novel algorithms for  Knowledge discovery from neural networks in Classificat...
Novel algorithms for Knowledge discovery from neural networks in Classificat...
Dr.(Mrs).Gethsiyal Augasta
 
A NOVEL DENSITY-BASED CLUSTERING ALGORITHM FOR PREDICTING CARDIOVASCULAR DISEASE
A NOVEL DENSITY-BASED CLUSTERING ALGORITHM FOR PREDICTING CARDIOVASCULAR DISEASEA NOVEL DENSITY-BASED CLUSTERING ALGORITHM FOR PREDICTING CARDIOVASCULAR DISEASE
A NOVEL DENSITY-BASED CLUSTERING ALGORITHM FOR PREDICTING CARDIOVASCULAR DISEASE
indexPub
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
Sudhakar Chavan
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
Vahid Mirjalili
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering concepts
NithyananthSengottai
 
dm_clustering2.ppt
dm_clustering2.pptdm_clustering2.ppt
dm_clustering2.ppt
Bhuvanya Raghunathan
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
ShwetapadmaBabu1
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
Nandhini S
 
Master's Thesis Presentation
Master's Thesis PresentationMaster's Thesis Presentation
Master's Thesis Presentation
●๋•máńíکhá Gőýálツ
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
vikassingh569137
 
Data integration lab_meeting
Data integration lab_meetingData integration lab_meeting
Data integration lab_meeting
Liangqun Lu
 
H0114857
H0114857H0114857
H0114857
IJRES Journal
 
Thesis (presentation)
Thesis (presentation)Thesis (presentation)
Thesis (presentation)
nlt2390
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
SowmyaJyothi3
 
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Seval Çapraz
 
ANN in System Biology
ANN in System Biology ANN in System Biology
ANN in System Biology
Hajra Qayyum
 
Clustering on database systems rkm
Clustering on database systems rkmClustering on database systems rkm
Clustering on database systems rkm
Vahid Mirjalili
 
20 26 jan17 walter latex
20 26 jan17 walter latex20 26 jan17 walter latex
20 26 jan17 walter latex
IAESIJEECS
 
DBSCAN (1) (4).pptx
DBSCAN (1) (4).pptxDBSCAN (1) (4).pptx
DBSCAN (1) (4).pptx
ABINPMATHEW22020
 
Seminar Slides
Seminar SlidesSeminar Slides
Seminar Slides
pannicle
 

Similar to Density based spatial clustering of applications with noises for dna methylation data (20)

Novel algorithms for Knowledge discovery from neural networks in Classificat...
Novel algorithms for  Knowledge discovery from neural networks in Classificat...Novel algorithms for  Knowledge discovery from neural networks in Classificat...
Novel algorithms for Knowledge discovery from neural networks in Classificat...
 
A NOVEL DENSITY-BASED CLUSTERING ALGORITHM FOR PREDICTING CARDIOVASCULAR DISEASE
A NOVEL DENSITY-BASED CLUSTERING ALGORITHM FOR PREDICTING CARDIOVASCULAR DISEASEA NOVEL DENSITY-BASED CLUSTERING ALGORITHM FOR PREDICTING CARDIOVASCULAR DISEASE
A NOVEL DENSITY-BASED CLUSTERING ALGORITHM FOR PREDICTING CARDIOVASCULAR DISEASE
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering concepts
 
dm_clustering2.ppt
dm_clustering2.pptdm_clustering2.ppt
dm_clustering2.ppt
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
Master's Thesis Presentation
Master's Thesis PresentationMaster's Thesis Presentation
Master's Thesis Presentation
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
Data integration lab_meeting
Data integration lab_meetingData integration lab_meeting
Data integration lab_meeting
 
H0114857
H0114857H0114857
H0114857
 
Thesis (presentation)
Thesis (presentation)Thesis (presentation)
Thesis (presentation)
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
 
ANN in System Biology
ANN in System Biology ANN in System Biology
ANN in System Biology
 
Clustering on database systems rkm
Clustering on database systems rkmClustering on database systems rkm
Clustering on database systems rkm
 
20 26 jan17 walter latex
20 26 jan17 walter latex20 26 jan17 walter latex
20 26 jan17 walter latex
 
DBSCAN (1) (4).pptx
DBSCAN (1) (4).pptxDBSCAN (1) (4).pptx
DBSCAN (1) (4).pptx
 
Seminar Slides
Seminar SlidesSeminar Slides
Seminar Slides
 

Recently uploaded

Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 

Recently uploaded (20)

Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 

Density based spatial clustering of applications with noises for dna methylation data

  • 1. DENSITY-BASED SPATIAL CLUSTERING OF APPLICATIONS WITH NOISES FOR DNA METHYLATION DATA Division of Statistics Northern Illinois University,2017 Committee: Dr. Alan Polansky Dr. Nader Ebrahimi Dr. Haiming Zhou Dr. Duchwan Ryu Mohammed Atef Alghzzy
  • 2. Contents: DNA Methylation Cluster Analysis (K-Means and DBSCAN) Simulation Study Clustering for DNA methylation
  • 3. • DNA methylation is a process by which methyl groups are added to the Cytosine nucleotide in DNA. • Methylation can change the activity of a DNA segment without changing the sequence, when located in a gene promoter, and it typically acts to repress gene transcription.  DNA Methylation
  • 4. • DNA methylation has a crucial role in the development and progression of the cancer (Kerr et al.,2007). • DNA methylation changes have been associated with many human diseases, especially cancer (Kulis and Esteller, 2010; Spisák et al.,2012) Motivation to Study Methylation:  DNA Methylation
  • 5. • DNA methylations contain a huge amount of data (28 million CpG sites in the human genome) • DNA methylation usually follows non-symmetric distribution at each CpG site and non-linear groups of samples. Difficulties to Analyze DNA Methylation  DNA Methylation
  • 6. • We use advanced algorithms, called in Computer Science field the Machine Learning Algorithms; that give computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). • Machine learning algorithms : 1. Unsupervised algorithm (Cluster analysis): There is no precedent information about the groups of data. 2. Supervised algorithm (Discrimination Analysis): There is precedent information about the groups of data. Methods Consideration:  DNA Methylation
  • 8. • Clustering (or cluster analysis) is one of the main data analysis techniques and deals with the organization of a set of objects in a multidimensional space into cohesive groups, called clusters. • Each cluster contains objects that are very similar to each other and very dissimilar to objects in other clusters (Rasmussen, 1992). Cluster Analysis
  • 9. Cluster algorithms has two main types: Hierarchical algorithms: Decompose the data of n objects into several levels of nested clusters represented by a dendrogram. So that each node of the tree represents a cluster of data. Partitioning algorithms: Construct a flat (single level) partition of a data of n objects into a set of k clusters such that the objects in a cluster are more similar to each other than to objects in different clusters like K-Means and DBSCAN. Cluster Analysis
  • 10. Cluster analysis steps: Cluster Analysis 1. Choose a Distance Function 2. Construct Proximities Matrix 3. Choose a Clustering Algorithm
  • 11. Cluster analysis steps: ▪ Manhattan distance: Cluster Analysis 1. Choose a distance function: ▪ Euclidean distance: or
  • 12. 2. Calculate differences between observations by proximities matrix: Cluster analysis steps: Cluster Analysis . . . . . . . . .
  • 13. Cluster Analysis: Cluster analysis steps: 3. Choose a clustering algorithm:
        1) Hierarchical Clustering
        2) K-Means
        3) K-Medians
        4) Expectation Maximization
        5) Fuzzy Clustering
        6) Non-Negative Matrix Factorization
        7) Latent Dirichlet Allocation (LDA)
        8) DBSCAN
  • 14. Cluster Analysis: K-Means Clustering
      • Each data point belongs to the cluster with the nearest mean; the algorithm was proposed by Stuart Lloyd (1957).
      • It requires only the number of clusters (K) as input, which makes it the most popular clustering algorithm.
  • 15. Cluster Analysis: K-Means algorithm
      Input: D = {d1, d2, ..., dn}; k: number of desired clusters (e.g., k = 2)
        1. Arbitrarily choose k data items from D as the initial centroids.
        2. Assign each item di to the cluster with the closest centroid.
        3. Calculate the new mean of each cluster.
        4. Repeat steps 2 and 3 until the convergence criterion is met.
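The four steps can be sketched in plain Python as follows. This is an illustrative implementation, not the code used in the study: the function name `kmeans`, the fixed random seed, the use of Euclidean distance, and the convergence criterion (stop when the assignments no longer change) are all assumptions.

```python
import math
import random


def kmeans(data, k, max_iter=100, seed=0):
    """Lloyd-style K-Means on a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # step 1: arbitrary initial centroids
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each item to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in data
        ]
        if new_labels == labels:  # step 4: assignments stopped changing
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


# Two well-separated illustrative groups (not data from the slides).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(data, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initial centroids, so only the grouping itself is meaningful, not the label values.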
  • 16. Cluster Analysis: K-Means Clustering
      Advantages:
        1. Simple and easy to implement, with clustering results that are easy to interpret.
        2. Fast and efficient in terms of computational cost.
      Disadvantages:
        1. Often produces clusters of relatively uniform size even when the true clusters differ in size.
        2. Cannot find non-linear clusters or clusters with unusual shapes.
  • 17. Cluster Analysis: DBSCAN
      • Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester et al. (1996).
      • It is based on connecting points that lie within a certain distance threshold of each other.
      • It connects only points that satisfy a density criterion defined by (Ɛ, MinPts).
  • 18. Cluster Analysis: DBSCAN algorithm
      Choose Ɛ and MinPts (set by a field expert).
        1. Arbitrarily select a point p.
        2. Label p a core point if its neighborhood of radius Ɛ contains MinPts or more points.
        3. Label p a border point if its neighborhood of radius Ɛ contains fewer than MinPts points but p lies within Ɛ of a core point.
        4. Otherwise, p is considered noise.
        5. Continue until all points have been covered.
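The labelling rules above can be sketched in plain Python. This is a simplified illustration rather than the implementation used in the study: the names `dbscan`, `region_query`, and `NOISE` are hypothetical, Euclidean distance is assumed, and a point's neighborhood is taken to include the point itself.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(data, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(data) if math.dist(data[i], q) <= eps]


def dbscan(data, eps, min_pts):
    labels = [None] * len(data)  # None means "not yet visited"
    cluster = -1
    for i in range(len(data)):
        if labels[i] is not None:
            continue
        neighbors = region_query(data, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        frontier = [j for j in neighbors if j != i]
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(data, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point
                frontier.extend(j_neighbors)
    return labels


# Two dense squares and one isolated point (illustrative data only).
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
print(dbscan(data, eps=1.5, min_pts=3))
```

In this example each point of a square has four points within Ɛ = 1.5 (itself included), so all eight are core points forming two clusters, while the isolated point has only itself in its neighborhood and is labelled noise.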
  • 20. Cluster Analysis: DBSCAN
      Advantages:
        1. Clusters can have arbitrary shape and size.
        2. The number of clusters is determined automatically (unlike K-Means).
        3. Can separate clusters from surrounding noise (it defines noise).
        4. The parameters MinPts and Ɛ should be set by the domain expert (not by statisticians!).
      Disadvantages:
      • Selecting MinPts and Ɛ is difficult, and the results are very sensitive to them.
  • 22. Simulation Study
      • We generated two non-linear groups of data in Microsoft Excel: 346 points in two dimensions (X, Y), shaped like two overlapping moons.
      [Figure: scatter plot of the simulated data]
      Descriptive Statistics
                   X       Y
        Mean       20.97   21.9
        Median     21      22
        SD         10.55   4.84
        Range      39      21
        Minimum    1       11
        Maximum    40      32
  • 23. Simulation Study: Example of K-Means clustering (K = 2)
  • 24. Simulation Study: Example of DBSCAN (Ɛ = 1, MinPts = 4)
  • 25. Simulation Study: Misclassification of Clustering
                        K-Means         DBSCAN
        True Cluster    1      2        1      2      3      Total
        1               117    31       148    0      0      148
        2               56     142      0      195    3      198
        Total           173    173      148    195    3      346
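The counts in this table can be turned into misclassification rates with a few lines of Python. The slides do not state how misclassification was defined, so this sketch assumes one simple rule: a point counts as misclassified when it falls outside the cluster that holds the majority of its true group. The function name `misclassified` is hypothetical.

```python
def misclassified(rows):
    """rows[g] holds the counts of true group g in each assigned cluster."""
    # Everything outside a group's largest cluster is counted as an error.
    return sum(sum(row) - max(row) for row in rows)


# Counts copied from the table above (rows: true clusters 1 and 2).
kmeans_table = [[117, 31], [56, 142]]
dbscan_table = [[148, 0, 0], [0, 195, 3]]
n = 346

for name, table in [("K-Means", kmeans_table), ("DBSCAN", dbscan_table)]:
    errors = misclassified(table)
    print(f"{name}: {errors} of {n} misclassified ({errors / n:.1%})")
```

Under this rule K-Means misplaces 31 + 56 = 87 of the 346 points (about 25.1%), while DBSCAN misplaces only the 3 points it assigns to a third cluster (about 0.9%).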
  • 26. Clustering for DNA methylation
  • 27. Clustering for DNA methylation: Dendrograms of Clusters for Samples and CpG Sites. Usual clustering for DNA methylation is conducted by two-way
  • 28. Clustering for DNA methylation: Description of the DNA Methylation Data
      • The data are microarray data from the TCGA analysis of DNA methylation for lung adenocarcinoma, collected with the Illumina Infinium HumanMethylation27 platform.
      Methylation Ratios Data: Descriptive Statistics
        Status   Count   Min      Max      Ave.
        Cancer   65      0.0076   0.9703   0.2683
        Normal   24      0.0083   0.9584   0.2562
        Total    89      0.0076   0.9703   0.265
  • 29. Clustering for DNA methylation: Samples
      • We examined two randomly selected CpG sites, 117586918 and 117746793, for linearity of the groups of samples.
      • Notice the non-linearity of the samples.
      [Figure: scatter plot of the two CpG sites, with cancer and normal samples marked]
  • 30. Clustering for DNA methylation: CpG sites
      • We checked the samples against each other and found that the first sample and sample 13 have a non-linear relationship, which makes us fairly sure that they would be difficult to classify linearly.
      • We see the necessity of using the DBSCAN algorithm!
      [Figure: scatter plot of the two samples]
  • 31. Clustering for DNA methylation
      • The CpG sites have non-symmetric distributions, which is the first indicator of non-linearity in the methylation data.
  • 32. Clustering for DNA methylation: Logit transformation
      Methylation Ratios Data: Descriptive Statistics
        Status   Count   Min      Max      Ave.
        Cancer   65      0.0076   0.9703   0.2683
        Normal   24      0.0083   0.9584   0.2562
        Total    89      0.0076   0.9703   0.265
      Summary of DNA Methylation Ratios to Analyze (after the logit transformation)
        Status   Min       Max      Ave.
        Cancer   -4.8628   3.4868   -1.814
        Normal   -4.7809   3.1381   -1.9554
        Total    -4.862    3.486    -1.852
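As a short illustration of the transformation, the sketch below applies the usual logit, log(p / (1 - p)) with the natural logarithm, to a methylation ratio. The slides do not show the exact formula or software used, so the function names `logit` and `inverse_logit` and the choice of natural logarithm are assumptions.

```python
import math


def logit(p):
    """Map a ratio in (0, 1) to the real line: log(p / (1 - p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is only defined for 0 < p < 1")
    return math.log(p / (1.0 - p))


def inverse_logit(x):
    """Map a logit value back to a ratio in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Largest cancer ratio reported in the table above.
ratio = 0.9703
print(logit(ratio))
print(inverse_logit(logit(ratio)))
```

Values computed this way from the rounded ratios in the first table are close to, but need not exactly match, the transformed summary. In particular, the transformed averages are presumably averages of the transformed values, which differ from the logit of the average ratio.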
  • 33. Clustering for DNA methylation: Clustering Samples
      • DBSCAN gives more useful results, since it separates the cancer samples from the normal samples.
      • K-Means, in contrast, divides the cancer samples into two clusters, which is not useful.
      Comparison between DBSCAN and K-Means for DNA Methylation Ratios
                  K-Means                 DBSCAN
        Status    Cluster 1   Cluster 2   Cluster 1   Cluster 2   Total
        Cancer    30          35          4           61          65
        Normal    24          0           23          1           24
        Total     54          35          27          62          89
  • 34. Clustering for DNA methylation: Clustering CpG sites
      DBSCAN and K-Means for the CpG sites
        Cluster   DBSCAN   K-Means
        1         21       17
        2         7        11
        Total     28       28
      • DBSCAN identified a small number of differentially methylated CpG sites and a large number of non-differentially methylated CpG sites.
      • K-Means, in contrast, led to similar numbers of differentially methylated and non-differentially methylated CpG sites!
  • 35. Clustering for DNA methylation: Necessary work afterwards
      • The gene located after the 7 CpG sites identified as differentially methylated is suspected to have a crucial role in the cancer. According to the Santa Cruz Genome Browser, this gene has the function "Protects DRG2 from proteolytic degradation", which is another motivation to study it further in future work.
      [Figure: Santa Cruz Genome Browser]

Editor's Notes

  1. Why is DNA methylation important in disease and cancer studies?
  2. What are the difficulties in analyzing DNA methylation?
  3. What are you going to do to analyze DNA methylation?
  4. What type of cluster analysis are you considering?
  5. Make 3 slides with this and the next one: Slide1: Cluster analysis steps Slide2: Distance matrix Slide3: Clustering algorithm
  6. Make 3 slides with this and the next one: Slide1: Cluster analysis steps Slide2: Distance matrix Slide3: Clustering algorithm
  7. Make 3 slides with this and the next one: Slide1: Cluster analysis steps Slide2: Distance matrix Slide3: Clustering algorithm
  8. 1)+2) Hierarchical clustering (e.g., single-linkage)
  9. Describe how you generated the simulation data.
  10. Itemize the comments on the left side.
  11. What do you observe from the data and the boxplot? Insert a slide for the summary.
  12. Insert a slide to summarize what you have found from DBSCAN.
  13. Note that this is future work, to be done after identifying the differentially methylated CpG sites.