Epinomics cassandra summit-submit

Anupama Joshi/Matt Negulescu
Cassandra/Spark Solutions for Genomic Big Data Analysis and Visualization

Introduction
© DataStax, All Rights Reserved.
Anupama Joshi – Technology Infrastructure and Execution at Epinomics
ajoshi@epinomics.co
http://www.linkedin.com/in/anupamajoshi
Matt Negulescu – Product Requirements and User Interaction
mnegulescu@epinomics.co

1 Introduction
2 What is Epigenomics?
3 Genomic Data and Epigenomic Data
4 Why Cassandra?
5 Demo

Epinomics
A platform that drives personalized medicine by leveraging big data analytics and proprietary epigenomics technology.

What is Epigenomics?
The study of modifications that turn genes on or off, without affecting the DNA sequence.
Genomics
DNA is the hardware of the body: static and descriptive (i.e. nature).
Epigenomics
The epigenome is the software layer: it dynamically turns genes on or off (i.e. nature and nurture).

Typical Genomic data
Typical genomic sequencing data consists of the nucleotide letters A, T, C, and G.
Most research work focuses on variation from standard (reference) genome sequences.
From: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Epigenomic Data
Peaks Data
Regions of the genome where DNA was accessible during the experiment.
chr1 713701 714600 peak.1 899 +
chr1 804976 805650 peak.2 674 +
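The two peak rows above can be read into typed records with a few lines of Python. This is a minimal sketch, not the Epinomics loader: the field names are assumptions, and the fifth column is treated as the peak length only because it equals end minus start in both rows shown.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    """One row of peaks data. Field names are assumed, not taken from the talk."""
    chrom: str
    start: int
    end: int
    name: str
    length: int
    strand: str


def parse_peak(line: str) -> Peak:
    """Split a whitespace-separated peak row into typed fields."""
    chrom, start, end, name, length, strand = line.split()
    return Peak(chrom, int(start), int(end), name, int(length), strand)


# The two rows shown on the slide.
lines = [
    "chr1 713701 714600 peak.1 899 +",
    "chr1 804976 805650 peak.2 674 +",
]
peaks = [parse_peak(line) for line in lines]

for peak in peaks:
    # Print each peak's name with its recomputed length (end - start).
    print(peak.name, peak.end - peak.start)
```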
Footprinting Data (Transcription factors)
A signal indicating a protein (i.e. a transcription factor) binding to the DNA.

Why Cassandra?
High performance
Fault tolerant
Linear scalability
Dynamic columns
Structured and unstructured data
Flexible data model
Real-time querying

Epinomics Cluster
• 6 nodes
• ~600 GB and growing
• 2 datacenters

Epinomics Pipeline
Start with Sequencing data
Find peaks
Find footprints
Do differential analyses
Apply machine learning
Visualize results

Epinomics ETL

A picture is worth a thousand words
Visual inspection of model components is useful for interpretation
From: http://undsci.berkeley.edu/article/0_0_0/howscienceworks_09

TF Analysis

Footprint Detection and Storage
Footprint Detection
Identify genomic binding sites of transcription factors (TFs) at particular genomic locations.
533 transcription factors per sample x 200K rows
Chromosome  start      end        length  strand  pwm       purity    IsBound
chr10       100001379  100001390  501     -       10.95492  -0.96717  FALSE
chr10       100010611  100010622  500     +       11.32268  -0.86117  FALSE
Retrieve by various attributes and by region identifier (transcription factor name):
SELECT * FROM tf_purity_piq_new WHERE sample_id = 2225 AND tf_name = 'CTCF.known1'
AND purity >= 0.7 AND purity <= 0.9;

Footprint Detection and Storage
Retrieve data from Cassandra and process it using Spark to calculate the signal strength of each TF in the sample.
Store the signal data in Cassandra to draw online visualizations.

Peaks Processing and Storage
Each sample will have between 150K and 200K peaks.
A typical biological experiment can have between 10 and 200 samples.
Consolidate and process overlapping peaks.
Processed using Spark GraphX.
A typical experiment will have between 300K and 600K overlapping peaks (depending on dataset and sequencing depth).
Source: http://bedtools.readthedocs.io/
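Consolidating overlapping peaks across samples can be illustrated on a single machine with a sort-and-sweep interval merge. This is a sketch of the general idea only: the talk processes peaks with Spark GraphX, which this plain-Python version does not reproduce. The function name and the second sample's coordinates are hypothetical.

```python
from collections import defaultdict


def merge_peaks(peaks):
    """Merge overlapping (chrom, start, end) intervals, chromosome by chromosome.

    Intervals that touch (start == previous end) are merged as well.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps the running interval: extend it.
                current_end = max(current_end, end)
            else:
                # Gap found: emit the running interval and start a new one.
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


# Sample A reuses the coordinates shown earlier; sample B is made up for illustration.
sample_a = [("chr1", 713701, 714600), ("chr1", 804976, 805650)]
sample_b = [("chr1", 714000, 714900)]

consolidated = merge_peaks(sample_a + sample_b)
print(consolidated)
```

Sorting by start position lets a single pass decide whether each peak extends the current region or opens a new one.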

Differential Peaks Data
Use machine learning to identify regions showing significant differences between two sets of data (i.e. peaks data).
Approximately 100K to 200K peaks.
CREATE TABLE IF NOT EXISTS project_norm_values_diff (
  project_id int,
  peak_window text,
  pvalue double,
  sample_id_value_map map<int, double>,
  PRIMARY KEY (project_id, pvalue, peak_window)
);
SELECT * FROM project_norm_values_diff WHERE project_id = 333 AND pvalue > 0.9 LIMIT 100;
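The primary key of project_norm_values_diff makes project_id the partition key and (pvalue, peak_window) the clustering columns, so rows within a project are stored sorted by pvalue. The sketch below models one partition in plain Python to show why a range filter on pvalue with a LIMIT is a cheap slice of sorted data. It does not talk to Cassandra, and all row values and the function name are hypothetical.

```python
import bisect

# Hypothetical rows for project_id = 333: (pvalue, peak_window, sample_id_value_map).
rows = [
    (0.97, "chr1:804976-805650", {2225: 1.8, 2226: 0.4}),
    (0.42, "chr1:713701-714600", {2225: 0.9, 2226: 1.0}),
    (0.93, "chr2:1000-1500", {2225: 2.1, 2226: 0.3}),
    (0.90, "chr3:200-900", {2225: 1.2, 2226: 0.7}),
]

# Within a partition, rows are ordered by the clustering columns (pvalue, peak_window).
partition = sorted(rows, key=lambda row: (row[0], row[1]))


def query_pvalue_gt(sorted_rows, threshold, limit):
    """Emulate `pvalue > threshold LIMIT limit` on one sorted partition."""
    pvalues = [row[0] for row in sorted_rows]
    first = bisect.bisect_right(pvalues, threshold)  # first row with pvalue > threshold
    return sorted_rows[first:first + limit]


result = query_pvalue_gt(partition, 0.9, 100)
for pvalue, peak_window, values in result:
    print(pvalue, peak_window, values)
```

Because the default clustering order is ascending, a LIMIT on this query returns the rows with the smallest pvalues above the threshold first.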

Differential Peaks Analysis
Differential peaks are further grouped with k-means clustering using Spark MLlib.
Clustered data is stored in Cassandra.
CREATE TABLE IF NOT EXISTS diffpeak_sample_clusterinfo (
  project_id int, kvalue int, cluster_location int, sample_id int,
  avg_peakvalue double, num_peaks_in_cluster int,
  PRIMARY KEY (project_id, kvalue, sample_id, cluster_location)
) WITH CLUSTERING ORDER BY (kvalue ASC, sample_id ASC);
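To make the clustering step concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm). It is not the Spark MLlib implementation used in the talk: the initialization (first k points) is a simplification chosen to keep the example deterministic, and the peak values are made up. Each point is assumed to hold one peak's values across two samples.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm. Centroids start at the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, point in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Hypothetical peak values: one row per peak, one column per sample.
peak_values = [(0.1, 0.2), (5.0, 5.1), (0.2, 0.1), (5.2, 4.9)]

centroids, assignments = kmeans(peak_values, k=2)
for c, centroid in enumerate(centroids):
    print("cluster", c, "size", assignments.count(c), "centroid", centroid)
```

The per-cluster size and per-sample centroid value resemble the num_peaks_in_cluster and avg_peakvalue columns of the table above, though the talk does not state how those columns are derived.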

More machine learning and analysis
1. Dimensionality Reduction (Principal Component Analysis)
rows = Projects x Samples x Samples
project_id | sample_id | pc_name | pc_value
------------+-----------+---------+-----------
237 | 4430 | 0 | -0.179271
237 | 4430 | 1 | 0.340772
237 | 4430 | 2 | 0.308466
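The pc_value column holds each sample's coordinate along a principal component. The sketch below shows how such a score can be computed for the first component only (pc_name 0), using power iteration on the covariance matrix. It is a small pure-Python illustration under stated assumptions, not the pipeline from the talk: the input matrix is made up, with one row per sample and one column per peak, and the input must have nonzero variance.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (direction, scores) for the first principal component.

    Uses power iteration from an all-ones start vector, which is assumed
    not to be orthogonal to the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered row onto the component to get its score.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical matrix: 3 samples x 2 peaks, all points on the line y = 2x.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, scores = first_principal_component(samples)
print(direction, scores)
```

Further components would require deflating the covariance matrix or a full eigendecomposition, which this sketch omits.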

Correlation Analysis
Pearson correlation of peaks among all samples.
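As a reference for the calculation, the Pearson correlation between every pair of samples can be sketched in plain Python as follows. The sample IDs and peak values are hypothetical, the function names are illustrative, and each sample is assumed to be a vector of values over the same ordered set of peaks with nonzero variance.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def correlation_matrix(samples):
    """Correlate every pair of samples; keys are (sample_id, sample_id) tuples."""
    ids = sorted(samples)
    return {(a, b): pearson(samples[a], samples[b]) for a in ids for b in ids}


# Hypothetical peak values per sample, over the same three peaks.
peak_values_by_sample = {
    4430: [1.0, 2.0, 3.0],
    4431: [2.0, 4.0, 6.0],
    4432: [3.0, 2.0, 1.0],
}

matrix = correlation_matrix(peak_values_by_sample)
for pair, r in matrix.items():
    print(pair, round(r, 3))
```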

Reproducibility Analysis
Shows variance and similarity among replicates
