Density-Based Spatial Clustering of Applications with Noises for DNA Methylation Data

Mohammed Atef Alghzzy
Division of Statistics
Northern Illinois University, 2017
Committee:
Dr. Alan Polansky
Dr. Nader Ebrahimi
Dr. Haiming Zhou
Dr. Duchwan Ryu

Contents:
DNA Methylation
Cluster Analysis (K-Means and DBSCAN)
Simulation Study
Clustering for DNA methylation

DNA Methylation
• DNA methylation is a process by which methyl groups are added to the cytosine nucleotide in DNA.
• Methylation can change the activity of a DNA segment without changing its sequence. When located in a gene promoter, it typically acts to repress gene transcription.

Motivation to Study Methylation:
• DNA methylation has a crucial role in the development and progression of cancer (Kerr et al., 2007).
• DNA methylation changes have been associated with many human diseases, especially cancer (Kulis and Esteller, 2010; Spisák et al., 2012).

Difficulties in Analyzing DNA Methylation:
• DNA methylation data are very large (28 million CpG sites in the human genome).
• Methylation values usually follow a non-symmetric distribution at each CpG site, and the samples form non-linear groups.

Methods Considered:
• We use machine learning algorithms, which give computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959).
• Machine learning algorithms:
1. Unsupervised algorithms (cluster analysis): there is no prior information about the groups in the data.
2. Supervised algorithms (discriminant analysis): there is prior information about the groups in the data.

Cluster Analysis
• Clustering (or cluster analysis) is one of the main data analysis techniques. It deals with organizing a set of objects in a multidimensional space into cohesive groups, called clusters.
• Each cluster contains objects that are very similar to each other and very dissimilar to objects in other clusters (Rasmussen, 1992).

Clustering algorithms have two main types:
• Hierarchical algorithms: decompose the data of n objects into several levels of nested clusters, represented by a dendrogram in which each node of the tree represents a cluster of data.
• Partitioning algorithms: construct a flat (single-level) partition of the data of n objects into a set of k clusters, such that the objects in a cluster are more similar to each other than to objects in different clusters. Examples include K-Means and DBSCAN.

Cluster analysis steps:
1. Choose a distance function
2. Construct the proximity matrix
3. Choose a clustering algorithm

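1. Choose a distance function:
For two observations x = (x_1, ..., x_p) and y = (y_1, ..., y_p):
▪ Euclidean distance:
$$d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}$$
or
▪ Manhattan distance:
$$d(x, y) = \sum_{j=1}^{p} |x_j - y_j|$$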
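2. Calculate differences between observations with the proximity matrix:
$$D = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1n} \\ d_{21} & d_{22} & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & d_{nn} \end{pmatrix}$$
where d_{ij} is the chosen distance between observations i and j.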
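The first two steps can be sketched in a few lines of Python. This is only an illustration: the function names `euclidean`, `manhattan`, and `proximity_matrix` and the three sample points are assumptions made here, not part of the study.

```python
from math import sqrt


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan (city-block) distance between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def proximity_matrix(data, distance=euclidean):
    """n x n matrix whose (i, j) entry is the distance between items i and j."""
    return [[distance(x, y) for y in data] for x in data]


# Hypothetical toy data: three points in two dimensions.
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D_euclid = proximity_matrix(data)
D_manhattan = proximity_matrix(data, manhattan)
```

The matrix is symmetric with zeros on the diagonal, so in practice only the upper triangle needs to be computed for large data sets.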

3. Choose a clustering algorithm:
1) Hierarchical Clustering
2) K-Means
3) K-Medians
4) Expectation Maximization
5) Fuzzy Clustering
6) Non-Negative Matrix Factorization
7) Latent Dirichlet Allocation (LDA)
8) DBSCAN

K-Means Clustering:
• Each data point belongs to the cluster with the nearest mean. The algorithm was proposed by Stuart Lloyd (1957).
• It requires only the number of clusters (K) as input, which is a main reason for its popularity.

K-Means algorithm:
D = {d1, d2, ..., dn}
k: number of desired clusters (e.g. k = 2)
1. Arbitrarily choose k data items from D as initial centroids.
2. Assign each item di to the cluster that has the closest centroid.
3. Calculate the new mean of each cluster.
4. Repeat steps 2 and 3 until the convergence criterion is met.
[Figure: K-Means steps, panels 1 to 4]
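The four steps can be sketched in plain Python as below. This is a minimal illustration under assumptions made here, not the implementation used in the study: the function name `kmeans`, the fixed random seed, the iteration cap, squared Euclidean distance, and the toy points are all hypothetical choices.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style K-Means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: arbitrary initial centroids
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each item to the cluster with the closest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # step 4: stop when assignments no longer change
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

Which group receives label 0 depends on the random start, so only the grouping itself is meaningful, not the label values.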

K-Means Clustering:
Advantages:
1. Simple, easy to implement, and easy to interpret.
2. Fast and efficient in terms of computational cost.
Disadvantages:
1. Often produces clusters of relatively uniform size, even when the true clusters differ in size.
2. Cannot find non-linear clusters or clusters with unusual shapes.

DBSCAN:
• Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Ester et al. (1996).
• It is based on connecting points that lie within a certain distance threshold of each other.
• It only connects points that satisfy a density criterion (Ɛ, MinPts).

DBSCAN algorithm:
Choose Ɛ and MinPts (set by a field expert).
1. Arbitrarily select a point p.
2. Label p a core point if its neighborhood of radius Ɛ contains MinPts or more points.
3. Label p a border point if its neighborhood of radius Ɛ contains fewer than MinPts points but p lies within Ɛ of a core point.
4. Otherwise, label p as noise.
5. Continue until all points have been covered.
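The labelling steps can be sketched in plain Python as below. This is a minimal illustration, not the implementation used in the study. The function name `dbscan`, the label -1 for noise, Euclidean distance, and counting a point as a member of its own neighborhood are assumptions made here.

```python
from math import dist

NOISE = -1  # assumed label for noise points


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    # Neighborhood of each point within radius eps (the point itself included).
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # provisional: may become a border point later
            continue
        cluster += 1  # i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # only core points expand the cluster
                queue.extend(neighbors[j])
    return labels


# Hypothetical toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The neighborhood search here compares every pair of points, which is fine for a small example. For large data sets a spatial index would normally replace it.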

DBSCAN algorithm:
[Figure: illustration of the DBSCAN algorithm]

DBSCAN
Advantages:
1. Clusters can have arbitrary shape and size.
2. The number of clusters is determined automatically (unlike K-Means).
3. Can separate clusters from surrounding noise (it identifies noise points).
4. The parameters MinPts and Ɛ should be set by the domain expert (not by statisticians!).
Disadvantages:
• The results are very sensitive to MinPts and Ɛ, which are difficult to determine.

Simulation Study
• We generated two non-linear groups of data in Microsoft Excel, shaped like two overlapping moons in two dimensions (X, Y), with 346 points in total.
[Figure: scatter plot of the simulated data, X against Y]
Descriptive Statistics
          X      Y
Mean      20.97  21.9
Median    21     22
SD        10.55  4.84
Range     39     21
Minimum   1      11
Maximum   40     32

Example of K-Means clustering
[Figure: K-Means (K = 2) clustering of the simulated data]

Example of DBSCAN
[Figure: DBSCAN (Ɛ = 1, MinPts = 4) clustering of the simulated data]

Misclassification of Clustering
True      K-means       DBSCAN
Cluster   1     2       1     2     3     Total
1         117   31      148   0     0     148
2         56    142     0     195   3     198
Total     173   173     148   195   3     346
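The misclassification counts implied by the table can be checked with a short calculation. The sketch below assumes that cluster k corresponds to true cluster k and that points placed in DBSCAN cluster 3 count as misassigned. The function name `misclassified` is hypothetical.

```python
def misclassified(table):
    """Count items off the main diagonal of a true-by-assigned table."""
    return sum(count
               for i, row in enumerate(table)
               for j, count in enumerate(row)
               if i != j)


# Rows are true clusters 1 and 2; columns are assigned clusters.
kmeans_table = [[117, 31], [56, 142]]
dbscan_table = [[148, 0, 0], [0, 195, 3]]
n = 346

kmeans_errors = misclassified(kmeans_table)
dbscan_errors = misclassified(dbscan_table)
kmeans_rate = round(100 * kmeans_errors / n, 2)
dbscan_rate = round(100 * dbscan_errors / n, 2)
print(f"K-means: {kmeans_errors} of {n} ({kmeans_rate}%)")
print(f"DBSCAN:  {dbscan_errors} of {n} ({dbscan_rate}%)")
```

Under these assumptions K-means misplaces 87 of the 346 points (about 25%), while DBSCAN misplaces 3 (under 1%).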

• The usual clustering for DNA methylation is two-way: both the samples and the CpG sites are clustered.
[Figure: dendrograms of clusters for samples and CpG sites]

Description of the DNA Methylation Data:
• The data are microarray data from the TCGA analysis of DNA methylation for lung adenocarcinoma, measured on the Illumina Infinium HumanMethylation27 platform.
Methylation Ratios Data: Descriptive Statistics
Status   Count   Min      Max      Ave.
Cancer   65      0.0076   0.9703   0.2683
Normal   24      0.0083   0.9584   0.2562
Total    89      0.0076   0.9703   0.265

Samples:
• We examined two randomly selected CpG sites, 117586918 and 117746793, for linearity of the groups of samples.
• Notice the non-linearity of the samples.
[Figure: scatter plot of the samples at the two CpG sites; legend: Cancer, Normal]

CpG sites:
• We plotted the samples against each other and found that sample 1 and sample 13 show a non-linear pattern, which makes a linear classification unlikely to work.
• This shows the need for the DBSCAN algorithm!
[Figure: scatter plot of sample 1 against sample 13]

• The CpG sites have non-symmetric distributions, which is the first indicator of non-linearity in the methylation data.

Logit transformation:
Methylation Ratios Data: Descriptive Statistics
Status   Count   Min      Max      Ave.
Cancer   65      0.0076   0.9703   0.2683
Normal   24      0.0083   0.9584   0.2562
Total    89      0.0076   0.9703   0.265
Summary of DNA Methylation Ratios to Analyze (after the logit transformation)
Status   Min       Max      Ave.
Cancer   -4.8628   3.4868   -1.814
Normal   -4.7809   3.1381   -1.9554
Total    -4.862    3.486    -1.852
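The logit transformation maps a ratio p in (0, 1) to log(p / (1 - p)), which spreads values near 0 and 1 over the whole real line. The sketch below assumes the natural logarithm. The function names `logit` and `inverse_logit` are illustrative, not taken from the study.

```python
from math import exp, log


def logit(p):
    """Log-odds of a ratio p, defined only for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is only defined for ratios strictly between 0 and 1")
    return log(p / (1.0 - p))


def inverse_logit(x):
    """Map a log-odds value back to a ratio in (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


# The largest reported methylation ratio, 0.9703, maps to roughly 3.49.
max_logit = logit(0.9703)
```

Ratios of exactly 0 or 1 have no finite logit, so such values would need to be clipped or offset before transforming.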

Clustering Samples:
• DBSCAN gives more useful results, since it separates the cancer samples from the normal samples.
• K-means, in contrast, splits the cancer samples into two clusters (30 and 35), which is not useful.
Comparison between DBSCAN and K-means for DNA Methylation Ratios
         K-Means                 DBSCAN
Status   Cluster 1   Cluster 2   Cluster 1   Cluster 2   Total
Cancer   30          35          4           61          65
Normal   24          0           23          1           24
Total    54          35          27          62          89

Clustering CpG sites:
DBSCAN and K-Means for the CpG sites
Cluster   DBSCAN   K-Means
1         21       17
2         7        11
Total     28       28
• DBSCAN identified a small number of differentially methylated CpG sites (7) and a large number of non-differentially methylated CpG sites (21).
• K-Means, in contrast, led to similar numbers of differentially methylated and non-differentially methylated CpG sites!

Necessary work afterwards:
• The gene located after the 7 CpG sites identified as differentially methylated is suspected to play a crucial role in the cancer. According to the Santa Cruz Genome Browser, the function of this gene is to protect DRG2 from proteolytic degradation, which is a further motivation for future studies.
Source: Santa Cruz Genome Browser
