Density-based spatial clustering of applications with noise (DBSCAN) is one of the most popular methods for clustering nonlinearly grouped data. DNA methylation ratios, in particular, are considered to be skewed differently at each CpG site and to be grouped nonlinearly by cancer status, so DBSCAN is expected to cluster them well. This thesis reviews the DBSCAN algorithm and compares its performance with that of a traditional clustering algorithm, the K-means method. Simulation studies compare the misclassification ratios of DBSCAN and K-means, and the classification of DNA methylation data from patients with lung adenocarcinoma demonstrates an application of DBSCAN.
Density-based spatial clustering of applications with noises for DNA methylation data
1. DENSITY-BASED SPATIAL CLUSTERING OF APPLICATIONS WITH NOISES FOR DNA METHYLATION DATA
Division of Statistics
Northern Illinois University, 2017
Committee:
Dr. Alan Polansky
Dr. Nader Ebrahimi
Dr. Haiming Zhou
Dr. Duchwan Ryu
Mohammed Atef Alghzzy
3. DNA Methylation
• DNA methylation is a process by which methyl groups are added to the cytosine nucleotide in DNA.
• Methylation can change the activity of a DNA segment without changing its sequence. When located in a gene promoter, it typically acts to repress gene transcription.
4. DNA Methylation
Motivation to Study Methylation:
• DNA methylation has a crucial role in the development and progression of cancer (Kerr et al., 2007).
• DNA methylation changes have been associated with many human diseases, especially cancer (Kulis and Esteller, 2010; Spisák et al., 2012).
5. DNA Methylation
Difficulties to Analyze DNA Methylation
• DNA methylation data are very large (28 million CpG sites in the human genome).
• DNA methylation usually follows a non-symmetric distribution at each CpG site, and the samples form non-linear groups.
6. DNA Methylation
Methods Consideration:
• We use advanced algorithms, known in computer science as machine learning algorithms, which give computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959).
• Machine learning algorithms:
1. Unsupervised algorithms (cluster analysis): there is no prior information about the groups in the data.
2. Supervised algorithms (discriminant analysis): there is prior information about the groups in the data.
8. Cluster Analysis
• Clustering (or cluster analysis) is one of the main data analysis techniques and deals with the organization of a set of objects in a multidimensional space into cohesive groups, called clusters.
• Each cluster contains objects that are very similar to each other and very dissimilar to objects in other clusters (Rasmussen, 1992).
9. Cluster Analysis
Clustering algorithms are of two main types:
Hierarchical algorithms: Decompose a data set of n objects into several levels of nested clusters represented by a dendrogram, so that each node of the tree represents a cluster of the data.
Partitioning algorithms: Construct a flat (single-level) partition of a data set of n objects into a set of k clusters, such that the objects in a cluster are more similar to each other than to objects in different clusters. Examples are K-Means and DBSCAN.
10. Cluster Analysis
Cluster analysis steps:
1. Choose a Distance Function
2. Construct the Proximity Matrix
3. Choose a Clustering Algorithm
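11. Cluster Analysis
Cluster analysis steps:
1. Choose a distance function. For two points $x = (x_1, \dots, x_p)$ and $y = (y_1, \dots, y_p)$:
▪ Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$
▪ Manhattan distance: $d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$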
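The first two steps of a cluster analysis, choosing a distance function and building the proximity matrix, can be sketched in a few lines of Python. This is only an illustration using the standard definitions of the two distances; the function names and the toy points are assumptions and do not come from the thesis.

```python
import math

def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def proximity_matrix(points, dist):
    """Symmetric n-by-n matrix of pairwise distances between the points."""
    return [[dist(p, q) for q in points] for p in points]

# Toy data, chosen only for illustration.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
euclid = proximity_matrix(points, euclidean)
city = proximity_matrix(points, manhattan)
```

The two matrices generally differ, so the choice of distance function in step 1 can change which objects a clustering algorithm treats as close.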
14. Cluster Analysis
K-Means Clustering:
• Each data point belongs to the cluster with the nearest mean. The algorithm was proposed by Stuart Lloyd (1957).
• It requires only the number of clusters (K) as input, which makes it the most popular algorithm.
15. Cluster Analysis
K-Means algorithm:
D = {d1, d2, ..., dn}
k: number of desired clusters (e.g. k = 2)
1. Arbitrarily choose k data items from D as initial centroids.
2. Assign each item di to the cluster that has the closest centroid.
3. Calculate the new mean (centroid) of each cluster.
4. Repeat steps 2 and 3 until the convergence criterion is met.
[Figure: panels 1 to 4]
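The four K-means steps can be written as a short Python function. This is a minimal sketch under stated assumptions, not the implementation used in the thesis: the name `kmeans`, the `max_iter` and `seed` parameters, the Euclidean distance, and the stopping rule (assignments no longer change) are all choices made here for illustration.

```python
import math
import random

def kmeans(data, k, max_iter=100, seed=0):
    """Return (labels, centroids) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # step 1: arbitrary initial centroids
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each item to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in data
        ]
        if new_labels == labels:  # step 4: stop once assignments are stable
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Toy data: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(data, k=2)
```

On this toy data the two groups end up in different clusters, although which group receives label 0 depends on the random initial centroids.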
16. Cluster Analysis
K-Means Clustering:
Advantages:
1. Simple and easy to implement, with clustering results that are easy to interpret.
2. Fast and efficient in terms of computational cost.
Disadvantages:
1. Often produces clusters of relatively uniform size even if the data have clusters of different sizes.
2. Cannot find non-linear clusters or clusters with unusual shapes.
17. Cluster Analysis
DBSCAN:
• Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester et al. (1996).
• It is based on connecting points within certain distance thresholds.
• It only connects points that satisfy a density criterion given by (Ɛ, MinPts).
18. Cluster Analysis
DBSCAN algorithm:
Choose Ɛ and MinPts (set by a field expert).
1. Arbitrarily select a point p.
2. Label p a core point if its neighborhood within radius Ɛ contains MinPts or more points.
3. Label p a border point if its neighborhood within radius Ɛ contains fewer than MinPts points but p lies within Ɛ of a core point.
4. Otherwise, label p as noise.
5. Continue until all points have been covered.
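The labeling rules above can be turned into a compact Python function. This is a hedged sketch rather than the thesis implementation: the names `dbscan`, `eps` (for Ɛ), `min_pts` (for MinPts), and `NOISE` are assumptions, the distance is assumed to be Euclidean, a point is counted as part of its own neighborhood, and the brute-force neighbor search is only practical for small data sets.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    n = len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned; border points are not expanded
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels

# Toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Unlike K-means, the function is never told how many clusters to find; the number of clusters follows from `eps` and `min_pts`, and the isolated point is reported as noise.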
20. Cluster Analysis
DBSCAN
Advantages:
1. Clusters can have arbitrary shape and size.
2. The number of clusters is determined automatically (unlike K-Means).
3. Can separate clusters from surrounding noise (it defines noise).
4. The parameters MinPts and Ɛ should be set by the domain expert (not by statisticians!).
Disadvantages:
• The results are very sensitive to the choice of MinPts and Ɛ, which are difficult to determine.
22. Simulation Study
• We generated two non-linear groups of data in Microsoft Excel, shaped like two overlapping moons in two dimensions (X, Y), using 346 points.
[Figure: scatter plot of the simulated data; X axis from 0 to 45, Y axis from 0 to 35]
Descriptive Statistics
           X       Y
Mean       20.97   21.9
Median     21      22
SD         10.55   4.84
Range      39      21
Minimum    1       11
Maximum    40      32
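A data set of this kind, and the summary rows of the table above, could be produced along the following lines. The thesis data were built in Microsoft Excel, so everything here is an assumption made for illustration: the function names, the arc geometry, the noise level, the equal split of 173 points per group, and the use of the sample standard deviation. The numbers it produces will not match the table.

```python
import math
import random
import statistics

def two_moons(n_per_group=173, noise=0.05, seed=0):
    """Return (points, groups) for two interlocking half-circle arcs."""
    rng = random.Random(seed)
    points, groups = [], []
    for i in range(n_per_group):
        t = math.pi * i / (n_per_group - 1)  # angle along the arc, 0 to pi
        # Group 0: upper arc.
        points.append((math.cos(t) + rng.gauss(0, noise),
                       math.sin(t) + rng.gauss(0, noise)))
        groups.append(0)
        # Group 1: lower arc, shifted so that it interlocks with the upper one.
        points.append((1 - math.cos(t) + rng.gauss(0, noise),
                       0.5 - math.sin(t) + rng.gauss(0, noise)))
        groups.append(1)
    return points, groups

def describe(values):
    """Summary statistics with the same rows as the descriptive table."""
    return {
        "Mean": statistics.mean(values),
        "Median": statistics.median(values),
        "SD": statistics.stdev(values),  # sample standard deviation
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
    }

points, groups = two_moons()
x_stats = describe([x for x, _ in points])
y_stats = describe([y for _, y in points])
```

Keeping the true group of every point in `groups` is what makes it possible to count misclassified points after clustering.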
27. Clustering for DNA methylation
Usual clustering for DNA methylation is conducted by two-way dendrograms of clusters for samples and CpG sites.
28. Clustering for DNA methylation
Description of the DNA Methylation Data:
• The data are microarray data from the TCGA analysis of DNA methylation for lung adenocarcinoma, collected using the Illumina Infinium Human Methylation 27 platform.
Methylation Ratios Data - Descriptive Statistics
Status   Count   Min      Max      Ave.
Cancer   65      0.0076   0.9703   0.2683
Normal   24      0.0083   0.9584   0.2562
Total    89      0.0076   0.9703   0.265
29. Clustering for DNA methylation
Samples:
• We examined two randomly selected CpG sites, 117586918 and 117746793, for the linearity of the groups of samples.
• Notice the non-linearity of the samples.
[Figure: scatter plot of the samples at the two CpG sites; legend: Cancer, Normal]
30. Clustering for DNA methylation
CpG sites:
• We checked the samples against each other and found that the first sample and sample number 13 form a non-linear shape, which makes us quite sure that they would be difficult to classify linearly.
• We see the necessity of using the DBSCAN algorithm!
[Figure: scatter plot of the CpG sites for the first sample against sample number 13; both axes from 0 to 1.2]
31. Clustering for DNA methylation
• The CpG sites have non-symmetric distributions, which is the first indicator of non-linearity in the methylation data.
32. Clustering for DNA methylation
Logit transformation:
Methylation Ratios Data - Descriptive Statistics
Status   Count   Min      Max      Ave.
Cancer   65      0.0076   0.9703   0.2683
Normal   24      0.0083   0.9584   0.2562
Total    89      0.0076   0.9703   0.265
Summary of DNA Methylation Ratios to Analyze (after the logit transformation)
Status   Min       Max      Ave.
Cancer   -4.8628   3.4868   -1.814
Normal   -4.7809   3.1381   -1.9554
Total    -4.862    3.486    -1.852
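The logit transformation maps a ratio p in (0, 1) to ln(p / (1 - p)), which spreads values near 0 and 1 over the whole real line. The sketch below uses the standard natural-log definition; the function names are hypothetical, and whether the thesis applied exactly this form is an assumption, although the tabulated maximum for the cancer samples (0.9703 to about 3.49) is consistent with it.

```python
import math

def logit(p):
    """Map a methylation ratio in (0, 1) to the real line."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    """Map a logit value back to a ratio in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

max_cancer = logit(0.9703)  # close to the tabulated maximum for cancer samples
round_trip = inverse_logit(logit(0.2683))
```

Because the transformation is nonlinear, the average of the transformed ratios is not the logit of the average ratio, which is why the two "Ave." columns cannot be converted into each other directly.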
33. Clustering for DNA methylation
Clustering Samples:
• DBSCAN gives more valuable and useful results, since it separates the cancer samples from the normal samples.
• K-means, in contrast, divided the cancer samples into two clusters, which is not useful.
Comparison between DBSCAN and K-means for DNA Methylation Ratios
          K-Means                 DBSCAN
Status    Cluster 1   Cluster 2   Cluster 1   Cluster 2   Total
Cancer    30          35          4           61          65
Normal    24          0           23          1           24
Total     54          35          27          62          89
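One way to summarize the table in a single number per method is a misclassification ratio. The sketch below assumes a particular definition that the slides do not spell out: each cluster is matched one-to-one to a true group in the way that maximizes agreement, and the ratio is the share of samples that fall outside their matched cluster. The function name and the table layout are likewise assumptions.

```python
from itertools import permutations

def misclassification_ratio(table):
    """table[i][j] = number of samples from true group i placed in cluster j."""
    n_groups = len(table)
    total = sum(sum(row) for row in table)
    # Try every one-to-one matching of clusters to groups and keep the best.
    best_correct = max(
        sum(table[i][perm[i]] for i in range(n_groups))
        for perm in permutations(range(n_groups))
    )
    return (total - best_correct) / total

# Rows: Cancer, Normal. Columns: Cluster 1, Cluster 2 (counts from the table).
kmeans_table = [[30, 35], [24, 0]]
dbscan_table = [[4, 61], [23, 1]]

kmeans_ratio = misclassification_ratio(kmeans_table)  # 30 of 89 samples
dbscan_ratio = misclassification_ratio(dbscan_table)  # 5 of 89 samples
```

Under this definition K-means misplaces 30 of the 89 samples (the cancer samples in its first cluster), while DBSCAN misplaces 5 (4 cancer and 1 normal).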
34. Clustering for DNA methylation
Clustering CpG sites:
DBSCAN and K-Means for the CpG sites
Cluster   DBSCAN   K-Means
1         21       17
2         7        11
Total     28       28
• DBSCAN identified a small number of differentially methylated CpG sites and a large number of non-differentially methylated CpG sites.
• K-Means, in contrast, led to similar numbers of differentially methylated and non-differentially methylated CpG sites!
35. Clustering for DNA methylation
Necessary work afterwards:
• The gene located after the 7 CpG sites identified as differentially methylated is suspected to have a crucial role in the cancer. According to the Santa Cruz Genome Browser, this gene has the function "Protects DRG2 from proteolytic degradation", which is another motivation to study it further in future work.
Santa Cruz Genome Browser