Anupama Joshi/Matt Negulescu
Cassandra/Spark solutions for Genomic big data Analysis and Visualization
Introduction
© DataStax, All Rights Reserved. 2
Anupama Joshi – Technology Infrastructure and Execution at
Epinomics
ajoshi@epinomics.co
http://www.linkedin.com/in/anupamajoshi
Matt Negulescu - Product Requirements and User Interaction
mnegulescu@epinomics.co
1 Introduction
2 What is Epigenomics?
3 Genomic Data and Epigenomic Data
4 Why Cassandra?
5 Demo
Epinomics
A platform that drives
personalized medicine by
leveraging big data analytics
and proprietary epigenomics
technology.
What is Epigenomics?
The study of modifications that turn genes on or off, without affecting the DNA sequence.
Genomics
DNA is the hardware of the body: static
and descriptive (i.e. nature).
Epigenomics
Epigenome is the software layer:
dynamically turns genes on or off (i.e.
nature and nurture).
Typical Genomic data
Typical genomic sequencing
data contains the nucleotide letters
A, T, C, and G.
Most research work focuses on
variation from standard
genome sequences.
From: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"
Epigenomic Data
Peaks Data
Regions of the genome where DNA was accessible during the
experiment.
chr1 713701 714600 peak.1 899 +
chr1 804976 805650 peak.2 674 +
Footprinting Data (Transcription factors)
A signal indicating a protein (i.e. a transcription factor) binding
to the DNA.
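The two peak records above look like BED-style lines: chromosome, start, end, name, a numeric column, and strand. A minimal parsing sketch follows, assuming the fifth column is the peak length (it equals end minus start in both sample rows). The names Peak and parse_peak are hypothetical and do not come from the Epinomics code base.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    length: int  # assumption: fifth column is the peak length
    strand: str


def parse_peak(line: str) -> Peak:
    """Parse one whitespace-separated peak record."""
    chrom, start, end, name, length, strand = line.split()
    return Peak(chrom, int(start), int(end), name, int(length), strand)


# The two sample rows shown on the slide.
SAMPLE = """\
chr1 713701 714600 peak.1 899 +
chr1 804976 805650 peak.2 674 +
"""

peaks = [parse_peak(line) for line in SAMPLE.splitlines() if line.strip()]
for p in peaks:
    # Print the name, the computed width, and the stored fifth column.
    print(p.name, p.end - p.start, p.length)
```

Keeping coordinates as integers makes later steps, such as overlap checks, simple comparisons.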
Why Cassandra?
High performance
Fault tolerant
Linear scalability
Dynamic columns
Structured and unstructured data
Flexible data model
Real-time querying
Epinomics Cluster
• 6 nodes
• approximately 600 GB and growing
• 2 datacenters
Epinomics Pipeline
Start with Sequencing data
Find peaks
Find footprints
Do differential analyses
Apply machine learning
Visualize results
Epinomics ETL
A picture is worth a thousand words
Visual inspection of model components is useful for interpretation
From: http://undsci.berkeley.edu/article/0_0_0/howscienceworks_09
TF Analysis
Footprint Detection and Storage
Footprint Detection
Identify genomic binding sites of transcription factors (TFs) at particular genomic locations.
533 transcription factors per sample x 200K rows
Chromosome  start      end        length  strand  pwm       purity    IsBound
chr10       100001379  100001390  501     -       10.95492  -0.96717  FALSE
chr10       100010611  100010622  500     +       11.32268  -0.86117  FALSE
Retrieve by various attributes and by region identifier (transcription factor name):
SELECT * FROM tf_purity_piq_new
WHERE sample_id = 2225 AND tf_name = 'CTCF.known1'
  AND purity >= 0.7 AND purity <= 0.9;
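The slides do not show the schema of tf_purity_piq_new. One way such a range query can be served efficiently is for sample_id and tf_name to form the partition key and for purity to be a clustering column, so rows are stored sorted by purity within each partition. The sketch below models that idea in memory under this assumption. PurityIndex, its methods, and the site labels are made up for illustration and are not a Cassandra API.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class PurityIndex:
    """Toy model: one purity-sorted row list per (sample_id, tf_name) partition."""

    def __init__(self):
        # partition key -> (sorted purities, rows in the same order)
        self._partitions = defaultdict(lambda: ([], []))

    def insert(self, sample_id, tf_name, purity, row):
        purities, rows = self._partitions[(sample_id, tf_name)]
        i = bisect_right(purities, purity)  # keep the partition sorted
        purities.insert(i, purity)
        rows.insert(i, row)

    def purity_between(self, sample_id, tf_name, low, high):
        """Rows with low <= purity <= high, found by two binary searches."""
        purities, rows = self._partitions.get((sample_id, tf_name), ([], []))
        return rows[bisect_left(purities, low):bisect_right(purities, high)]


# Hypothetical rows; only the sample id and TF name come from the slide.
idx = PurityIndex()
for purity, site in [(0.95, "site-a"), (0.75, "site-b"),
                     (0.65, "site-c"), (0.90, "site-d")]:
    idx.insert(2225, "CTCF.known1", purity, site)

hits = idx.purity_between(2225, "CTCF.known1", 0.7, 0.9)
print(hits)
```

Because each partition stays sorted, the threshold filter is a contiguous slice rather than a scan over every row for the sample.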
Retrieve data from Cassandra and process using Spark to
calculate the signal strength of each TF in the sample.
Store the signal data in Cassandra to draw online
visualizations.
Peaks Processing and Storage
Each sample has between 150K and 200K peaks.
A typical biological experiment can have between 10 and 200 samples.
Overlapping peaks are consolidated and processed using Spark GraphX.
A typical experiment has between 300K and 600K overlapping peaks, depending on the dataset and
sequencing depth.
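The consolidation of overlapping peaks can be illustrated on a single machine with a sort-and-sweep merge. This is only a sketch of the idea, not the Spark GraphX job described above. The function name merge_peaks and the sample intervals are hypothetical, and treating touching intervals as overlapping is an assumption.

```python
from collections import defaultdict


def merge_peaks(peaks):
    """Merge overlapping (chrom, start, end) intervals, chromosome by chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlaps or touches the current run: extend it.
                cur_end = max(cur_end, end)
            else:
                # Gap found: close the current run and start a new one.
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged


# Made-up peaks pooled from two samples.
pooled = [
    ("chr1", 100, 200),
    ("chr1", 150, 250),
    ("chr1", 300, 400),
    ("chr2", 100, 150),
]
consolidated = merge_peaks(pooled)
print(consolidated)
```

On a line of genomic coordinates, the connected components of the overlap graph are exactly these merged runs, which is why a graph library is one natural fit for the distributed version.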
Source: http://bedtools.readthedocs.io/
Differential Peaks Data
Use machine learning to identify regions showing significant differences between two sets of data
(i.e. peaks data): approximately 100K to 200K peaks.
CREATE TABLE IF NOT EXISTS project_norm_values_diff (
    project_id int,
    peak_window text,
    pvalue double,
    sample_id_value_map map<int, double>,
    PRIMARY KEY (project_id, pvalue, peak_window)
);
SELECT * FROM project_norm_values_diff WHERE project_id = 333 AND pvalue > 0.9 LIMIT 100;
Differential Peaks Analysis
Differential peaks are further grouped by k-means clustering using Spark MLlib.
Clustered data is stored in Cassandra.
CREATE TABLE IF NOT EXISTS diffpeak_sample_clusterinfo (
    project_id int,
    kvalue int,
    cluster_location int,
    sample_id int,
    avg_peakvalue double,
    num_peaks_in_cluster int,
    PRIMARY KEY (project_id, kvalue, sample_id, cluster_location)
) WITH CLUSTERING ORDER BY (kvalue ASC, sample_id ASC);
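To make the table layout concrete, here is a small pure-Python sketch of k-means (Lloyd's algorithm) that summarizes each cluster into rows shaped like diffpeak_sample_clusterinfo. It stands in for the Spark MLlib job and is not that implementation. It assumes each peak is a vector of per-sample values and that cluster_location is the cluster index. The functions kmeans and cluster_rows, the input values, and sample id 4431 are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def cluster_rows(project_id, sample_ids, points, k):
    """One row per (cluster, sample): average peak value and cluster size."""
    assignment, centroids = kmeans(points, k)
    rows = []
    for c in range(k):
        size = assignment.count(c)
        for j, sample_id in enumerate(sample_ids):
            rows.append({
                "project_id": project_id,
                "kvalue": k,
                "cluster_location": c,  # assumption: the cluster index
                "sample_id": sample_id,
                "avg_peakvalue": centroids[c][j],
                "num_peaks_in_cluster": size,
            })
    return rows


# Made-up normalized peak values: one vector per peak, one entry per sample.
points = [[0.1, 0.2], [5.0, 5.1], [0.2, 0.1], [5.2, 4.9]]
rows = cluster_rows(237, [4430, 4431], points, 2)
for row in rows:
    print(row)
```

Storing one row per cluster and sample means a front end can read the averages and cluster sizes directly instead of re-aggregating all peaks at request time.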
More machine learning and analysis
1. Dimensionality Reduction
(Principal Component Analysis)
rows = Projects x Samples x Samples
project_id | sample_id | pc_name | pc_value
------------+-----------+---------+-----------
237 | 4430 | 0 | -0.179271
237 | 4430 | 1 | 0.340772
237 | 4430 | 2 | 0.308466
Correlation Analysis
Pearson correlation of peaks among all samples.
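As a reminder of what is computed for each pair of samples, the sketch below implements the Pearson correlation coefficient in plain Python and builds a pairwise matrix. The sample names and values are made up, and the helper names pearson and correlation_matrix are hypothetical. A production job would more likely rely on a library routine.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: undefined (division by zero) if either list is constant.
    return cov / sqrt(var_x * var_y)


def correlation_matrix(samples):
    """Map every ordered pair of sample ids to its correlation."""
    ids = sorted(samples)
    return {(a, b): pearson(samples[a], samples[b]) for a in ids for b in ids}


# Hypothetical peak values over the same four peak windows.
samples = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [2.0, 4.0, 6.0, 8.0],
    "s3": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(samples)
for pair, r in sorted(matrix.items()):
    print(pair, round(r, 3))
```

The values must be aligned on the same peak windows across samples, which is one reason the peaks are consolidated before this step.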
Reproducibility Analysis
Shows variance and similarity among replicates


Editor's Notes

  1. Hi all, I am Anupama and this is Matt. We are from Epinomics, and we want to present how we use Cassandra and Spark to find solutions for genomic data analysis and visualization.
  2. Matt will give you an overview of Epinomics and epigenomics (slides 1 and 2).
  3. At Epinomics we have a typical big data pipeline: collect data, analyze data, interpret results.
  4. At Epinomics we have a typical big data pipeline: collect data, analyze data, interpret results.
  5. At Epinomics we have a typical big data pipeline: collect data, analyze data, interpret results. For interpreting genomic data analysis, visualization is the most effective way.
  6. Evaluating an idea in light of the evidence should be simple, right? Either the results match the expectations generated by the idea (thus supporting it) or they don't (thus refuting it). Data become evidence only when they have been interpreted in a way that reflects on the accuracy or inaccuracy of a scientific idea. For interpreting genomic data analysis, visualization is the most effective way.
  7. Let's start with the visualizations we show. This is transcription factor analysis: identify genomic binding sites of transcription factors (TFs) at particular genomic locations, with 533 transcription factors per sample. This shows the TFs in order of their bound sites: the order is the number of bound sites and the color is the percentage of bound sites. You can change it, and you can compare two samples. You can also click on a TF to get the details.
  8. Let's look at how we store the data for the picture. We identify whether a TF is bound at the particular genomic location (chr/start/end). We store the data and then retrieve it dynamically using the desired thresholds.
  9. We also store data for the signal strength at each location and draw a plot to indicate the signal strength at bound and unbound locations around the TF location.
  10. Next, let's look at the peaks visualization and data. Each sample has between 150K and 200K peaks. A typical biological experiment can have between 10 and 200 samples. We consolidate and process overlapping peaks.
  11. Use machine learning to identify regions showing significant differences between two sets of data (i.e. peaks data), approximately 100K to 200K peaks. This visualization indicates the patterns of significant differences between the user-defined sample groupings.
  12. The data to power this visualization is stored in Cassandra and retrieved dynamically based on a p-value limit.
  13. Those were the top differential peaks. But we also want to see the matching patterns across all the significant peaks, so we apply machine learning to the data. We do k-means clustering and then hierarchical clustering on all peaks. The height of the cluster in the visualization represents the number of peaks in that cluster, and the color is the normalized average value for all the peaks.
  14. This is how we store the clustered data, which is retrieved dynamically. Hierarchical clustering is done in the front end.
  15. PCA indicates clear differences consistent with the previous visualizations. Scientists are more likely to trust ideas that more closely explain the actual observations.