Hi all,
I am Anupama and this is Matt. We are from Epinomics, and we want to present how we use Cassandra and Spark to build solutions for genomic data analysis and visualization.
Matt will give you an overview of Epinomics and epigenomics. (slides 1 and 2)
At Epinomics we have a typical big data pipeline:
Collect data
Analyze data
Interpret results
Evaluating an idea in light of the evidence should be simple, right? Either the results match the expectations generated by the idea (thus, supporting it) or they don't (thus, refuting it).
Data become evidence only when they have been interpreted in a way that reflects on the accuracy or inaccuracy of a scientific idea.
For interpreting genomic data analysis, visualization is the most effective way.
Let's start with the visualizations we show.
This is transcription factor analysis.
We identify the genomic binding sites of transcription factors (TFs) at particular genomic locations.
533 transcription factors per sample
This shows the TFs in order of their bound sites.
Order is the number of bound sites, and
color is the % of bound sites.
You can change this, and you can compare two samples.
And you can click on a TF to get the details.
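As a rough illustration of the ordering and coloring just described, here is a minimal Python sketch. The TFSummary class, the rank_tfs function, and the counts in the sample list are hypothetical and made up for illustration; they are not our production code or real results. It assumes each TF has a count of bound sites and a count of total candidate sites.

```python
from dataclasses import dataclass


@dataclass
class TFSummary:
    """Per-TF counts for one sample (hypothetical structure)."""
    name: str
    bound_sites: int
    total_sites: int

    @property
    def pct_bound(self) -> float:
        # Percentage of candidate sites that are bound; 0 if there are no sites.
        if self.total_sites == 0:
            return 0.0
        return 100.0 * self.bound_sites / self.total_sites


def rank_tfs(summaries, key="bound_sites"):
    """Sort TFs in descending order by 'bound_sites' or 'pct_bound'."""
    if key not in ("bound_sites", "pct_bound"):
        raise ValueError(f"unsupported sort key: {key}")
    return sorted(summaries, key=lambda s: getattr(s, key), reverse=True)


# Illustrative numbers only, not measured data.
sample = [
    TFSummary("CTCF", 4200, 6000),
    TFSummary("GATA1", 900, 1000),
    TFSummary("SP1", 2500, 10000),
]

by_count = [s.name for s in rank_tfs(sample)]
by_percent = [s.name for s in rank_tfs(sample, key="pct_bound")]
```

Sorting by count drives the order in the picture, while pct_bound is the value that would be mapped to color.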
Let's look at how we store the data behind this picture.
We identify whether a TF is bound at each particular genomic location (chr/start/end).
We store the data and then retrieve it dynamically using the desired thresholds.
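To make the idea of threshold-based retrieval concrete, here is a small in-memory Python sketch. It is only an assumption about the shape of the data: the SiteCall record, the score field, the (sample_id, tf_name) grouping key, and the example rows are all hypothetical, and a plain dict stands in for the Cassandra table.

```python
from typing import NamedTuple


class SiteCall(NamedTuple):
    """One candidate TF site (hypothetical record layout)."""
    chrom: str
    start: int
    end: int
    score: float  # hypothetical binding score; higher means more likely bound


# A dict keyed by (sample_id, tf_name) stands in for one table partition.
store = {
    ("sample_1", "CTCF"): [
        SiteCall("chr1", 1000, 1020, 0.95),
        SiteCall("chr1", 5000, 5020, 0.40),
        SiteCall("chr2", 300, 320, 0.80),
    ],
}


def bound_sites(store, sample_id, tf_name, min_score):
    """Return the sites for one sample and TF whose score meets the threshold."""
    rows = store.get((sample_id, tf_name), [])
    return [row for row in rows if row.score >= min_score]


strict = bound_sites(store, "sample_1", "CTCF", 0.9)
relaxed = bound_sites(store, "sample_1", "CTCF", 0.5)
```

Because the threshold is applied at read time, the same stored rows can serve any cutoff the user picks in the front end.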
We also store data for the signal strength at each location and draw a plot to indicate the signal strength at bound and unbound locations around the TF location.
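The signal plot can be thought of as a position-wise average over many sites. The following Python sketch shows that idea under stated assumptions: each site carries an equal-length vector of signal values centered on the TF location, plus a bound/unbound flag. The function names and the toy numbers are hypothetical.

```python
def mean_profile(profiles):
    """Position-wise mean of equal-length signal vectors."""
    if not profiles:
        return []
    width = len(profiles[0])
    if any(len(p) != width for p in profiles):
        raise ValueError("all signal vectors must have the same length")
    n = len(profiles)
    return [sum(p[i] for p in profiles) / n for i in range(width)]


def bound_vs_unbound(sites):
    """sites: list of (is_bound, signal_vector) pairs.

    Returns (mean profile of bound sites, mean profile of unbound sites).
    """
    bound = [signal for is_bound, signal in sites if is_bound]
    unbound = [signal for is_bound, signal in sites if not is_bound]
    return mean_profile(bound), mean_profile(unbound)


# Toy data: three positions around the TF location, made up for illustration.
sites = [
    (True, [0.0, 4.0, 0.0]),
    (True, [2.0, 6.0, 2.0]),
    (False, [1.0, 1.0, 1.0]),
]
bound_curve, unbound_curve = bound_vs_unbound(sites)
```

The two returned curves are what would be drawn side by side to compare bound and unbound locations.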
Next, let's look at the peaks visualization and data.
Each sample will have between 150K and 200K peaks.
A typical biological experiment can have between 10 and 200 samples.
Consolidate and process overlapping peaks.
Use machine learning to identify regions showing significant differences between two sets of data (i.e. peaks data), at 100K to 200K peaks.
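The consolidation of overlapping peaks mentioned above can be sketched as a standard interval merge. This is a generic Python illustration, not necessarily the exact procedure in our pipeline; it assumes peaks are (chrom, start, end) tuples, and it also merges peaks that touch end to start. The function name is hypothetical.

```python
def merge_peaks(peaks):
    """Merge overlapping or touching peaks on the same chromosome.

    peaks: iterable of (chrom, start, end) tuples.
    Returns a list of merged tuples sorted by chromosome name and start.
    """
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous peak instead of starting a new one.
            last_chrom, last_start, last_end = merged[-1]
            merged[-1] = (last_chrom, last_start, max(last_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Toy coordinates, made up for illustration.
consolidated = merge_peaks([
    ("chr1", 150, 300),
    ("chr1", 100, 200),
    ("chr2", 100, 200),
    ("chr1", 400, 500),
])
```

Sorting first means each peak only needs to be compared with the most recently merged one, so the pass after the sort is linear in the number of peaks.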
This visualization indicates the patterns of significant differences between the user-defined sample groupings.
The data that powers this visualization is stored in Cassandra and retrieved dynamically based on a p-value limit.
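One way to picture retrieval by p-value limit is a list kept sorted by p-value, so that a limit becomes a simple range read. The Python sketch below is an in-memory stand-in under that assumption; the DifferentialPeaks class, its method names, and the example rows are hypothetical and do not describe our actual table layout.

```python
import bisect


class DifferentialPeaks:
    """Rows kept sorted by p-value so a limit maps to a prefix of the list."""

    def __init__(self, rows):
        # rows: iterable of (p_value, peak_id) pairs.
        self._rows = sorted(rows)
        self._pvalues = [p for p, _ in self._rows]

    def below(self, p_limit):
        """Return peak ids with p-value <= p_limit, most significant first."""
        cut = bisect.bisect_right(self._pvalues, p_limit)
        return [peak_id for _, peak_id in self._rows[:cut]]


# Made-up rows for illustration.
table = DifferentialPeaks([
    (0.2, "peak_c"),
    (0.001, "peak_a"),
    (0.04, "peak_b"),
])
significant = table.below(0.05)
```

Changing the limit in the front end then only changes how far into the sorted rows the read goes.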
Those were the top differential peaks. But we also want to see the matching patterns across all the significant peaks, so we apply machine learning to the data.
We do k-means clustering and then hierarchical clustering on all peaks.
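For readers who have not seen k-means before, here is a minimal pure-Python sketch of the textbook algorithm (Lloyd's iterations). It is only an illustration of the technique on toy points; it is not our production job, and the function names, the seed parameter, and the data are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=50, seed=0):
    """Cluster points (a list of equal-length tuples) into k groups.

    Returns (labels, centroids). Initial centroids are k points sampled
    with a fixed seed so that runs are repeatable.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two well-separated toy groups, made up for illustration.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

On real peak data each point would be a peak's vector of values across samples rather than a 2-D coordinate.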
The height of each cluster in the visualization represents the number of peaks in that cluster, and the color is the normalized average value for all the peaks.
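The two quantities behind each cluster block, its size and its average value, can be computed in a single pass. The Python sketch below assumes each peak already has a cluster label and an already-normalized value; the function name, the output field names, and the numbers are hypothetical.

```python
from collections import defaultdict


def summarize_clusters(labels, values):
    """Return {cluster: {"n_peaks": count, "mean_value": average}}.

    labels[i] is the cluster of peak i and values[i] is its normalized
    value (the normalization itself is assumed to have been done upstream).
    """
    groups = defaultdict(list)
    for label, value in zip(labels, values):
        groups[label].append(value)
    return {
        label: {"n_peaks": len(vals), "mean_value": sum(vals) / len(vals)}
        for label, vals in groups.items()
    }


# Made-up labels and values for illustration.
summary = summarize_clusters([0, 0, 1], [0.2, 0.4, 0.9])
```

Here n_peaks would set the height of the block and mean_value would be mapped to its color.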
This is how we store the clustered data, which is retrieved dynamically. Hierarchical clustering is done in the front end.
PCA indicates clear differences consistent with the previous visualizations.
Scientists are more likely to trust ideas that more closely explain the actual observations. or contradict