DENSITY-BASED SPATIAL CLUSTERING OF
APPLICATIONS WITH NOISES FOR DNA
METHYLATION DATA
Division of Statistics
Northern Illinois University, 2017
Committee:
Dr. Alan Polansky
Dr. Nader Ebrahimi
Dr. Haiming Zhou
Dr. Duchwan Ryu
Mohammed Atef Alghzzy
Contents:
DNA Methylation
Cluster Analysis (K-Means and DBSCAN)
Simulation Study
Clustering for DNA methylation
DNA Methylation
• DNA methylation is a
process by which methyl
groups are added to the
cytosine nucleotide in DNA.
• Methylation can change the
activity of a DNA segment
without changing its
sequence. When located in a
gene promoter, it typically
acts to repress gene
transcription.
Motivation to Study Methylation:
• DNA methylation has a crucial role in the development and progression of
cancer (Kerr et al., 2007).
• DNA methylation changes have been associated with many human diseases,
especially cancer (Kulis and Esteller, 2010; Spisák et al., 2012).
Difficulties in Analyzing DNA Methylation:
• DNA methylation data sets are very large (28 million CpG sites in the
human genome).
• DNA methylation usually follows a non-symmetric distribution at each CpG site,
and the samples form non-linear groups.
Methods Consideration:
• We use advanced algorithms, known in computer science as machine
learning algorithms, which give computers the ability to learn without being
explicitly programmed (Arthur Samuel, 1959).
• Machine learning algorithms:
1. Unsupervised algorithms (cluster analysis): There is no prior information
about the groups of data.
2. Supervised algorithms (discriminant analysis): There is prior
information about the groups of data.
Cluster Analysis
• Clustering (or cluster analysis) is one of
the main data analysis techniques and
deals with the organization of a set of
objects in a multidimensional space into
cohesive groups, called clusters.
• Each cluster contains objects that are
very similar to each other and very
dissimilar to objects in other clusters
(Rasmussen, 1992).
Clustering algorithms have two main types:
Hierarchical algorithms: Decompose the data of n
objects into several levels of nested clusters
represented by a dendrogram, so that each node
of the tree represents a cluster of data.
Partitioning algorithms: Construct a flat (single
level) partition of the data of n objects into a set of k
clusters such that the objects in a cluster are more
similar to each other than to objects in different
clusters. Examples are K-Means and DBSCAN.
Cluster analysis steps:
1. Choose a Distance Function
2. Construct a Proximity Matrix
3. Choose a Clustering Algorithm
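1. Choose a distance function between two observations $x_i = (x_{i1}, \dots, x_{ip})$ and $x_k = (x_{k1}, \dots, x_{kp})$:
▪ Euclidean distance: $d(x_i, x_k) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{kj})^2}$
or
▪ Manhattan distance: $d(x_i, x_k) = \sum_{j=1}^{p} |x_{ij} - x_{kj}|$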
2. Calculate the differences between observations with a proximity matrix:
the $n \times n$ matrix $D = [d(x_i, x_k)]$, whose entry in row $i$ and column $k$ is the distance between observations $i$ and $k$.
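The first two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the three toy points are hypothetical and do not come from the slides or from the methylation data.

```python
import math

def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def proximity_matrix(points, dist=euclidean):
    """n x n matrix whose (i, k) entry is the distance between points i and k."""
    return [[dist(p, q) for q in points] for p in points]

# Toy observations (hypothetical, for illustration only).
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = proximity_matrix(points)
for row in D:
    print(row)
```

For a proper distance function the matrix is symmetric with zeros on the diagonal, so in practice only the upper triangle needs to be stored.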
3. Choose a clustering algorithm:
1) Hierarchical Clustering
2) K-Means
3) K-Medians
4) Expectation Maximization
5) Fuzzy Clustering
6) Non-Negative Matrix Factorization
7) Latent Dirichlet Allocation (LDA)
8) DBSCAN
K-Means Clustering:
• Each data point belongs to the cluster with the nearest mean. The algorithm
was proposed by Stuart Lloyd (1957).
• Requires only the number of clusters (K) as input, which makes it one of the
most popular algorithms.
K-Means algorithm:
D = {d1, d2, ..., dn}
k: number of desired clusters (e.g. k = 2)
1. Arbitrarily choose k data items from D
as initial centroids.
2. Assign each item di to the cluster
that has the closest centroid.
3. Calculate the new mean (centroid) of each cluster.
4. Repeat steps 2 and 3 until the convergence criterion is met.
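The K-Means steps listed above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the study: the function name, the seed, the iteration cap, the stopping rule (assignments no longer change), and the toy data are all assumptions.

```python
import math
import random

def kmeans(data, k, max_iter=100, seed=0):
    """Minimal Lloyd-style K-Means on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # step 1: arbitrary initial centroids
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each item to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in data
        ]
        # Step 4 (assumed criterion): stop when assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Toy data (hypothetical): two well-separated groups of three points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(data, k=2)
print(labels)
```

Because the initial centroids are chosen arbitrarily, the cluster numbering (which group is called 0 and which 1) depends on the seed, while the grouping itself is stable for data this well separated.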
K-Means Clustering:
Advantages:
1. Simple, easy to implement, and easy to interpret.
2. Fast and efficient in terms of computational cost.
Disadvantages:
1. Often produces clusters of relatively uniform size, even when the true
clusters differ in size.
2. Cannot find non-linear clusters or clusters with unusual shapes.
DBSCAN:
• Density-based spatial clustering of applications with noise (DBSCAN) is a
data clustering algorithm proposed by Martin Ester et al. (1996).
• It is based on connecting points within a certain distance threshold.
• It only connects points that satisfy a density criterion (Ɛ, MinPts).
Cluster Analysis
Choose Ɛ and MinPints (by field Expert).
1. Arbitrary select point p
2. Label Core point: which has a neighborhood with
MinPts or more within the radius Ɛ.
3. Label Border Point which has a neighborhood
that has less than MinPts within the radius Ɛ.
4. Otherwise it will be considered as a noise
5. Continue until it covers all points
DBSCAN algorithm:
Cluster Analysis
DBSCAN algorithm:
Cluster Analysis
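The labeling rules above can be sketched in plain Python as follows. This is an illustrative sketch only: the function name, the convention that a point counts as its own neighbor, the use of -1 for noise, and the toy data are assumptions, not details taken from the slides.

```python
import math

NOISE = -1  # label used here for noise points (an assumed convention)

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point, NOISE for noise."""
    n = len(points)
    # Neighborhood of radius eps for every point (each point includes itself).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # provisional: may become a border point later
            continue
        labels[i] = cluster  # i is a core point and starts a new cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is also core: keep expanding
        cluster += 1
    return labels

# Toy data (hypothetical): two dense 2 x 2 squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (10, 0)]
labels = dbscan(points, eps=1.0, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Precomputing every neighborhood takes on the order of n squared distance evaluations, which is fine for a sketch but would normally be replaced by a spatial index on large data.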
DBSCAN
Advantages:
1. Clusters can have arbitrary shape and size.
2. The number of clusters is determined automatically (unlike K-Means).
3. Can separate clusters from surrounding noise (it explicitly labels noise points).
4. The parameters MinPts and Ɛ should be set by the domain expert (not by statisticians!).
Disadvantages:
• The results are very sensitive to MinPts and Ɛ, which are difficult to determine.
Simulation Study
• We generated two non-linear groups of data in Microsoft Excel, resembling
two overlapping moon shapes in two dimensions (X, Y), with 346 points in total.
[Figure: scatter plot of the simulated data; X axis from 0 to 45, Y axis from 0 to 35]
Descriptive Statistics
Statistic   X       Y
Mean        20.97   21.9
Median      21      22
SD          10.55   4.84
Range       39      21
Minimum     1       11
Maximum     40      32
[Figure: Example of K-Means clustering (K = 2)]
[Figure: Example of DBSCAN (Ɛ = 1, MinPts = 4)]
Misclassification of Clustering
True Cluster   K-Means 1   K-Means 2   DBSCAN 1   DBSCAN 2   DBSCAN 3   Total
1              117         31          148        0          0          148
2              56          142         0          195        3          198
Total          173         173         148        195        3          346
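The counts in the misclassification table can be summarized as error rates with a short calculation. The sketch below assumes that true cluster i corresponds to assigned cluster i and that the three points in DBSCAN cluster 3 count as misclassified. Both are assumptions made for this illustration, and the variable names are hypothetical.

```python
# Rows: true clusters 1 and 2. Columns: assigned clusters (from the table).
kmeans_table = [[117, 31], [56, 142]]
dbscan_table = [[148, 0, 0], [0, 195, 3]]

def misclassified(table):
    """Return (number of points off the diagonal, total number of points)."""
    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(table)))
    return total - correct, total

k_wrong, n = misclassified(kmeans_table)   # 87 of 346
d_wrong, _ = misclassified(dbscan_table)   # 3 of 346

print(f"K-Means: {k_wrong}/{n} = {k_wrong / n:.1%}")
print(f"DBSCAN:  {d_wrong}/{n} = {d_wrong / n:.1%}")
```

Under these assumptions K-Means misplaces 87 of the 346 simulated points (about 25.1%), while DBSCAN misplaces 3 (about 0.9%).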
Clustering for DNA methylation
• Clustering of DNA methylation data is usually conducted two-way: for samples and for CpG sites.
[Figure: Dendrograms of clusters for samples and CpG sites]
Description of the DNA Methylation Data:
• The data are microarray data
from the TCGA analysis of
DNA methylation for lung
adenocarcinoma, collected with the
Illumina Infinium Human
Methylation 27 platform.
Methylation Ratios Data: Descriptive Statistics
Status   Count   Min      Max      Ave.
Cancer   65      0.0076   0.9703   0.2683
Normal   24      0.0083   0.9584   0.2562
Total    89      0.0076   0.9703   0.265
Samples:
• We examined two randomly selected CpG
sites, 117586918 and 117746793, for the
linearity of the groups of samples.
• Notice the non-linearity of the samples.
[Figure: scatter plot of the two CpG sites, with cancer and normal samples marked]
CpG sites:
• We plotted the samples against each other and
found that the first sample and sample
number 13 form a non-linear shape, which makes
us quite sure that they are difficult to
classify linearly.
• We see the necessity of using the DBSCAN algorithm!
[Figure: scatter plot of the first sample against sample number 13]
• The CpG sites have non-symmetric distributions, which is the first indicator of
the non-linearity of the methylation data.
Logit transformation: $\operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right)$, where $p$ is the methylation ratio.
Summary of Logit-Transformed DNA Methylation Ratios to Analyze
Status   Min       Max      Ave.
Cancer   -4.8628   3.4868   -1.814
Normal   -4.7809   3.1381   -1.9554
Total    -4.862    3.486    -1.852
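The logit transformation maps a ratio p in (0, 1) to ln(p / (1 - p)). The sketch below assumes the natural logarithm, which reproduces the reported maxima from the untransformed ratios to within rounding. The function names are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a ratio p strictly between 0 and 1 (natural logarithm)."""
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    """Map a logit value back to a ratio in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Largest methylation ratios reported for the cancer and normal samples.
for ratio in (0.9703, 0.9584):
    print(ratio, round(logit(ratio), 4))
```

The transformation stretches ratios near 0 and 1 onto the whole real line, which reduces the skewness caused by the bounded (0, 1) scale.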
Clustering Samples:
• DBSCAN gives more useful results, since
it largely separates the cancer samples
from the normal samples.
• K-Means, in contrast, divided
the cancer samples into two clusters
that are not useful.
Comparison between DBSCAN and K-Means
for DNA Methylation Ratios
Status   K-Means Cluster 1   K-Means Cluster 2   DBSCAN Cluster 1   DBSCAN Cluster 2   Total
Cancer   30                  35                  4                  61                 65
Normal   24                  0                   23                 1                  24
Total    54                  35                  27                 62                 89
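One way to put a number on this comparison is cluster purity: each cluster is credited with its majority status, and the credited samples are divided by the total. Purity is not a measure used in the slides; it is introduced here only as an illustrative summary of the table, and the names below are hypothetical.

```python
def purity(table):
    """Share of samples that belong to the majority status of their cluster.

    table[status][cluster] holds counts, with rows Cancer and Normal.
    """
    total = sum(sum(row) for row in table)
    majority = sum(max(column) for column in zip(*table))
    return majority / total

# Counts copied from the comparison table (rows: Cancer, Normal).
kmeans_samples = [[30, 35], [24, 0]]
dbscan_samples = [[4, 61], [23, 1]]

print(f"K-Means purity: {purity(kmeans_samples):.3f}")
print(f"DBSCAN purity:  {purity(dbscan_samples):.3f}")
```

With these counts the purity is 65/89 for K-Means and 84/89 for DBSCAN.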
Clustering CpG sites:
DBSCAN and K-Means for
the CpG sites
Cluster   DBSCAN   K-Means
1         21       17
2         7        11
Total     28       28
• DBSCAN identified a small number of
differentially methylated CpG sites and a large
number of non-differentially methylated CpG sites,
• while K-Means led to similar numbers of
differentially methylated and non-differentially
methylated CpG sites!
Necessary work afterwards:
• The gene located after those 7 CpG sites that were identified as differentially
methylated is suspected to play a crucial role in the cancer. According to the
Santa Cruz Genome Browser, this gene's function is "Protects DRG2
from proteolytic degradation", which is further motivation to study it
in future work.
[Figure: Santa Cruz Genome Browser]
Thank you

Density based spatial clustering of applications with noises for dna methylation data

  • 1. DENSITY-BASED SPATIAL CLUSTERING OF APPLICATIONS WITH NOISES FOR DNA METHYLATION DATA Division of Statistics Northern Illinois University,2017 Committee: Dr. Alan Polansky Dr. Nader Ebrahimi Dr. Haiming Zhou Dr. Duchwan Ryu Mohammed Atef Alghzzy
  • 2. Contents: DNA Methylation Cluster Analysis (K-Means and DBSCAN) Simulation Study Clustering for DNA methylation
  • 3. • DNA methylation is a process by which methyl groups are added to the Cytosine nucleotide in DNA. • Methylation can change the activity of a DNA segment without changing the sequence, when located in a gene promoter, and it typically acts to repress gene transcription.  DNA Methylation
  • 4. • DNA methylation has a crucial role in the development and progression of the cancer (Kerr et al.,2007). • DNA methylation changes have been associated with many human diseases, especially cancer (Kulis and Esteller, 2010; Spisák et al.,2012) Motivation to Study Methylation:  DNA Methylation
  • 5. • DNA methylations contain a huge amount of data (28 million CpG sites in the human genome) • DNA methylation usually follows non-symmetric distribution at each CpG site and non-linear groups of samples. Difficulties to Analyze DNA Methylation  DNA Methylation
  • 6. • We use advanced algorithms, called in Computer Science field the Machine Learning Algorithms; that give computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). • Machine learning algorithms : 1. Unsupervised algorithm (Cluster analysis): There is no precedent information about the groups of data. 2. Supervised algorithm (Discrimination Analysis): There is precedent information about the groups of data. Methods Consideration:  DNA Methylation
  • 8. • Clustering (or cluster analysis) is one of the main data analysis techniques and deals with the organization of a set of objects in a multidimensional space into cohesive groups, called clusters. • Each cluster contains objects that are very similar to each other and very dissimilar to objects in other clusters (Rasmussen, 1992). Cluster Analysis
  • 9. Clustering algorithms are of two main types: Hierarchical algorithms: decompose a data set of n objects into several levels of nested clusters, represented by a dendrogram in which each node of the tree represents a cluster. Partitioning algorithms: construct a flat (single-level) partition of a data set of n objects into k clusters, such that the objects in a cluster are more similar to each other than to objects in different clusters; examples are K-Means and DBSCAN. Cluster Analysis
  • 10. Cluster analysis steps: Cluster Analysis 1. Choose a Distance Function 2. Construct Proximities Matrix 3. Choose a Clustering Algorithm
  • 11. Cluster analysis steps: 1. Choose a distance function: ▪ Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$ ▪ Manhattan distance: $d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$ Cluster Analysis
  • 12. Cluster analysis steps: 2. Calculate the differences between observations with a proximity matrix: $D = (d_{ij})$, an $n \times n$ matrix whose entry $d_{ij}$ is the distance between observations $i$ and $j$. Cluster Analysis
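The first two cluster analysis steps (choosing a distance function and building the proximity matrix) can be sketched as follows. This is a minimal illustration in plain Python rather than the code used in the study; the function names and the three sample points are hypothetical.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def proximity_matrix(points, dist):
    """n x n matrix whose (i, j) entry is dist(points[i], points[j])."""
    n = len(points)
    return [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]

# Hypothetical observations, chosen only for illustration.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D_euclid = proximity_matrix(points, euclidean)
D_manhattan = proximity_matrix(points, manhattan)
```

The resulting matrix is symmetric with zeros on the diagonal, so in practice only the upper triangle needs to be stored.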
  • 13. 1)Hierarchical Clustering 2)K-MEANS 3)K-Medians 4)Expectation Maximization 5)Fuzzy Clustering 6)Non Negative Matrix Factorization 7)Latent Dirichlet Allocation (LDA) 8)DBSCAN Cluster analysis steps: 3. Choosing Clustering Algorithms: Cluster Analysis
  • 14. K-Means Clustering: • Each data point belongs to the cluster with the nearest mean; the algorithm was proposed by Stuart Lloyd (1957). • It requires only the number of clusters (K) as input, which makes it the most popular algorithm. Cluster Analysis
  • 15. K-Means algorithm: D = {d1, d2, ..., dn}; k: number of desired clusters (e.g. k = 2). 1. Arbitrarily choose k data items from D as initial centroids. 2. Assign each item di to the cluster with the closest centroid. 3. Calculate the new mean of each cluster. 4. Repeat steps 2 and 3 until the convergence criterion is met. Cluster Analysis
  • 16. Advantages: 1. Simple, easy to implement, and easy to interpret. 2. Fast and efficient in terms of computational cost. Disadvantages: 1. Often produces clusters of relatively uniform size even if the true clusters differ in size. 2. Cannot find non-linear clusters or clusters with unusual shapes. K-Means Clustering: Cluster Analysis
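The K-Means steps on this slide can be sketched in a few lines of plain Python. This is an illustrative version under stated assumptions, not the implementation used in the study: the seed, the iteration cap, the "labels stopped changing" convergence rule, and the toy data are all choices made here for the example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance (enough for nearest-centroid comparisons)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: arbitrarily choose k data items as the initial centroids.
    centroids = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each item to the cluster with the closest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in data]
        # Step 4: stop once the assignment no longer changes (assumed criterion).
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Hypothetical, well-separated toy data.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(data, k=2)
```

Because the centroids are means, the boundary between two clusters is always a straight line (a hyperplane in higher dimensions), which is why the method struggles with moon-shaped groups.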
  • 17. DBSCAN: • Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester et al. (1996). • It is based on connecting points that lie within a certain distance threshold of each other. • It only connects points that satisfy a density criterion (Ɛ, MinPts). Cluster Analysis
  • 18. DBSCAN algorithm: Choose Ɛ and MinPts (set by a field expert). 1. Arbitrarily select a point p. 2. Label p a core point if its neighborhood within radius Ɛ contains MinPts or more points. 3. Label p a border point if its neighborhood within radius Ɛ contains fewer than MinPts points but p lies within Ɛ of a core point. 4. Otherwise, consider p noise. 5. Continue until all points are covered. Cluster Analysis
  • 20. Advantages: 1. Clusters can have arbitrary shape and size. 2. The number of clusters is determined automatically (unlike K-Means). 3. It can separate clusters from surrounding noise (it defines noise). 4. The parameters MinPts and Ɛ should be set by a domain expert (not by statisticians!). Disadvantages: • The results are very sensitive to MinPts and Ɛ, which are difficult to determine. DBSCAN Cluster Analysis
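The core, border, and noise labeling described on this slide can be sketched as below. This is a simplified, quadratic-time illustration and not the code used in the study. It assumes that a point counts as a member of its own Ɛ-neighborhood and uses -1 as a hypothetical noise label; the toy points are made up for the example.

```python
import math

NOISE = -1  # hypothetical label for noise points

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_neighbors)
    return labels

# Hypothetical toy data: two chains of points and one isolated outlier.
points = [(0, 0), (1, 0), (2, 0), (3, 0),
          (10, 10), (11, 10), (12, 10), (13, 10),
          (50, 50)]
labels = dbscan(points, eps=1.0, min_pts=3)
```

In this toy run the end points of each chain have only two points in their neighborhood, so they are border points, and the isolated point remains noise. The number of clusters is never passed in; it follows from Ɛ and MinPts.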
  • 22. Simulation Study • We generated two non-linear groups of data in Microsoft Excel, shaped like two overlapping moons in two dimensions (X, Y), with 346 points in total. [Figure: scatter plot of the simulated data]
      Descriptive Statistics
      Statistic | X     | Y
      Mean      | 20.97 | 21.9
      Median    | 21    | 22
      SD        | 10.55 | 4.84
      Range     | 39    | 21
      Minimum   | 1     | 11
      Maximum   | 40    | 32
  • 23. K-Means (K=2) Example of K-Means clustering Simulation Study
  • 24. DBSCAN (Ɛ = 1, MinPts = 4) Example of DBSCAN Simulation Study
  • 25. Simulation Study: Misclassification of Clustering
      True Cluster | K-means 1 | K-means 2 | DBSCAN 1 | DBSCAN 2 | DBSCAN 3 | Total
      1            | 117       | 31        | 148      | 0        | 0        | 148
      2            | 56        | 142       | 0        | 195      | 3        | 198
      Total        | 173       | 173       | 148      | 195      | 3        | 346
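The counts in this table can be turned into misclassification rates with a short calculation. The sketch below assumes that true cluster 1 corresponds to predicted cluster 1 and true cluster 2 to predicted cluster 2, and that the three points DBSCAN placed in a third cluster count as errors; the slides do not state either convention, and the function name is hypothetical.

```python
# Counts copied from the misclassification table: table[true][predicted].
kmeans_table = {1: {1: 117, 2: 31}, 2: {1: 56, 2: 142}}
dbscan_table = {1: {1: 148, 2: 0, 3: 0}, 2: {1: 0, 2: 195, 3: 3}}

def misclassification_rate(table, mapping):
    """Share of points whose predicted cluster differs from mapping[true]."""
    total = sum(sum(row.values()) for row in table.values())
    correct = sum(table[true][mapping[true]] for true in table)
    return (total - correct) / total

mapping = {1: 1, 2: 2}  # assumed true-to-predicted correspondence
kmeans_rate = misclassification_rate(kmeans_table, mapping)  # 87 of 346 points
dbscan_rate = misclassification_rate(dbscan_table, mapping)  # 3 of 346 points
```

Under these assumptions K-means misplaces 87 of the 346 points (about 25%), while DBSCAN misplaces 3 (under 1%).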
  • 26. Clustering for DNA methylation
  • 27. Dendrograms of Clusters for Samples and CpG Sites Clustering for DNA methylation Usual clustering for DNA methylation is conducted by two-way
  • 28. Clustering for DNA methylation: Description of the DNA Methylation Data • The data are microarray data from the TCGA analysis of DNA methylation for lung adenocarcinoma, collected using the Illumina Infinium Human Methylation 27 platform.
      Methylation Ratios Data – Descriptive Statistics
      Status | Count | Min    | Max    | Ave.
      Cancer | 65    | 0.0076 | 0.9703 | 0.2683
      Normal | 24    | 0.0083 | 0.9584 | 0.2562
      Total  | 89    | 0.0076 | 0.9703 | 0.265
  • 29. Clustering for DNA methylation • We examined two randomly selected CpG sites, 117586918 and 117746793, for the linearity of the groups of samples. • Notice the non-linearity of the samples. Samples: [Figure: scatter plot; legend: Cancer, Normal]
  • 30. Clustering for DNA methylation • We checked the samples against each other and found that the first sample and sample number 13 form a non-linear shape, which makes us quite sure that they are difficult to classify linearly. • We see the necessity of using the DBSCAN algorithm! CpG sites: [Figure: scatter plot]
  • 31. Clustering for DNA methylation • The CpG sites have non-symmetric distributions, which is the first indicator of non-linearity in the methylation data.
  • 32. Clustering for DNA methylation: Logit transformation
      Methylation Ratios Data – Descriptive Statistics
      Status | Count | Min    | Max    | Ave.
      Cancer | 65    | 0.0076 | 0.9703 | 0.2683
      Normal | 24    | 0.0083 | 0.9584 | 0.2562
      Total  | 89    | 0.0076 | 0.9703 | 0.265
      Summary of DNA Methylation Ratios to Analyze (after the logit transformation)
      Status | Min     | Max    | Ave.
      Cancer | -4.8628 | 3.4868 | -1.814
      Normal | -4.7809 | 3.1381 | -1.9554
      Total  | -4.862  | 3.486  | -1.852
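The logit transformation maps a methylation ratio in (0, 1) onto the whole real line, which is why the transformed summary contains negative values. The slide does not spell out the formula, so the sketch below assumes the standard natural-log form, log(p / (1 - p)); the function names are hypothetical.

```python
import math

def logit(ratio):
    """Map a methylation ratio in (0, 1) to the real line (natural log assumed)."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must lie strictly between 0 and 1")
    return math.log(ratio / (1.0 - ratio))

def inv_logit(value):
    """Inverse transformation, back to a ratio in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))

# Largest cancer ratio from the descriptive statistics table.
max_cancer_logit = logit(0.9703)
```

A ratio of 0.5 maps to 0, ratios below 0.5 become negative, and ratios of exactly 0 or 1 are undefined and would need to be adjusted before transforming. Note that the average of the transformed values is not the logit of the average ratio.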
  • 33. Clustering Samples: Clustering for DNA methylation • DBSCAN is giving more valuable and useful results, since it separates the cancer samples • While the K-means has divided the cancer samples into useless two clusters. Comparison between DBSCAN and K-means for DNA Methylation Rations K-Means DBSCAN Total Cluster 1 Cluster 2 Cluster 1 Cluster 2 Cancer 30 35 4 61 65 Normal 24 0 23 1 24 Total 54 35 27 62 89
  • 34. Clustering for DNA methylation: Clustering CpG sites • DBSCAN identified a small number of differentially methylated CpG sites and a large number of non-differentially methylated CpG sites, • while K-Means led to similar numbers of differentially methylated and non-differentially methylated CpG sites!
      DBSCAN and K-Means for the CpG sites
      Cluster | DBSCAN | K-Means
      1       | 21     | 17
      2       | 7      | 11
      Total   | 28     | 28
  • 35. • The gene located after the 7 CpG sites identified as differentially methylated is suspected to have a crucial role in the cancer. According to the Santa Cruz Genome Browser, this gene has the function "Protects DRG2 from proteolytic degradation", which is a further motivation to study it in future work. Clustering for DNA methylation Necessary work afterwards: Santa Cruz Genome Browser

Editor's Notes

  1. Why is DNA methylation important in disease and cancer studies?
  2. What are the difficulties in analyzing DNA methylation?
  3. What are you going to do to analyze DNA methylation?
  4. What type of cluster analysis are you considering?
  5. Make 3 slides with this and the next one: Slide1: Cluster analysis steps Slide2: Distance matrix Slide3: Clustering algorithm
  6. Make 3 slides with this and the next one: Slide1: Cluster analysis steps Slide2: Distance matrix Slide3: Clustering algorithm
  7. Make 3 slides with this and the next one: Slide1: Cluster analysis steps Slide2: Distance matrix Slide3: Clustering algorithm
  8. 1)+2) Hierarchical clustering (e.g., single-linkage)
  9. Describe how you generated the simulation data.
  10. Itemize the comments on the left-side.
  11. What do you observe from the data and the boxplot? Insert a slide for the summary.
  12. Insert a slide to summarize what you have found from DBSCAN.
  13. Note that this is future work to do, after identifying the differentially methylated CpG sites.