"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age", Dev Lakhani, Founder of Batch Insights

Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age
Dev Lakhani

Overview of Apache Spark
Applications to Sequencing
Applications to Protein Interactions using Deep Learning
Protein Folding
Using GraphX for Ontology Mining
Dev Lakhani
https://uk.linkedin.com/in/devlakhani
www.batchinsights.com
dl@batchinsights.com

About Me
• I am founder of Batch Insights, a Big Data consultancy that has
worked on numerous Big Data architectures and data science
projects in Tier 1 banking, global telecoms, retail, media and
fashion.
• I have been actively working with the Spark infrastructure since
its inception and am currently working on various Spark
deployments.
• My background is in Software Engineering with Computational
Biology & Statistics from Oxford.

Life: The Original Big Data Problem
564,831 Protein Interactions
569,832 Curated Proteins
54,540,851 Computationally Hypothesised Proteins
28,645 miRNA entries
source: EBI, miRBase

Spark - Data Processing at Scale
• Distributed Processing Platform
• In-memory processing
• Built-in frameworks for Machine Learning
• External frameworks for Deep Learning
• Broadcast Capability
• Graph Libraries
• Text Processing
• Resilient Distributed Datasets (RDDs)

Spark - Parallel Read Mapping
Advancing RNA-Seq analysis
Brian J. Haas and Michael C. Zody
Nature Biotechnology 28(5), 421-423 (May 2010)
Nature Publishing Group, ISSN 1087-0156

Spark - Parallel Read Mapping
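// On the driver: ship the reference genome to every worker once
val broadcastGenome = sc.broadcast(referenceGenome)
// On each worker node: map every read against the local broadcast copy
// (matchWith is a placeholder for the actual alignment routine)
val mappedReads = textFile.map(read => matchWith(read, broadcastGenome.value))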
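The broadcast pattern on this slide can be sketched as a small runnable program. This is a minimal illustration under stated assumptions, not the speaker's implementation: `matchWith` here is a hypothetical stand-in that does exact substring search (a real aligner tolerates mismatches and gaps), and the reference and reads are toy strings.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReadMappingSketch {
  // Hypothetical stand-in for a real aligner: exact substring search,
  // returning the 0-based offset of the read in the reference, if any.
  def matchWith(read: String, reference: String): Option[Int] = {
    val pos = reference.indexOf(read)
    if (pos >= 0) Some(pos) else None
  }

  // Broadcast the reference once, then match every read against the
  // worker-local copy. Reads with no match are dropped.
  def mapReads(sc: SparkContext, reference: String, reads: RDD[String]): RDD[(String, Int)] = {
    val broadcastGenome = sc.broadcast(reference)
    reads.flatMap { read =>
      matchWith(read, broadcastGenome.value).map(pos => (read, pos)).toList
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("read-mapping-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val reference = "ACGTTGCAAGGCTA" // toy reference, not real data
      val reads = sc.parallelize(Seq("GTTG", "AGGC", "TTTT"))
      mapReads(sc, reference, reads).collect().foreach(println)
    } finally sc.stop()
  }
}
```

Broadcasting suits this workload because the reference is read-only and shared by every task, so it is shipped to each executor once rather than serialized with every closure.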

Spark - Project Tungsten
Java is not the best for String Matching
String objects are "large"
GC pauses and tuning.
Make use of Unsafe - UTF8String
C-style direct memory access - off heap
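To make the off-heap idea concrete, here is a minimal sketch that stores sequences as one byte per base in a direct buffer and searches at the byte level. It is an assumption-laden illustration: it uses the standard `java.nio.ByteBuffer.allocateDirect` rather than `sun.misc.Unsafe` or Spark's internal UTF8String, and `OffHeapSequence`, `store` and `indexOf` are hypothetical names.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object OffHeapSequence {
  // Copy a sequence into a direct (off-heap) buffer, one byte per base.
  def store(seq: String): ByteBuffer = {
    val bytes = seq.getBytes(StandardCharsets.US_ASCII)
    val buf = ByteBuffer.allocateDirect(bytes.length)
    buf.put(bytes)
    buf.flip() // switch from writing to reading; limit is now the sequence length
    buf
  }

  // Naive byte-level search using absolute gets, so the scan itself
  // allocates no String or char[] objects. Returns -1 when not found.
  def indexOf(reference: ByteBuffer, read: ByteBuffer): Int = {
    val n = reference.limit()
    val m = read.limit()
    var start = 0
    while (start + m <= n) {
      var j = 0
      while (j < m && reference.get(start + j) == read.get(j)) j += 1
      if (j == m) return start
      start += 1
    }
    -1
  }
}
```

The buffer contents live outside the Java heap, so a large reference stored this way adds little to the object graph the garbage collector has to trace.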

Spark - Protein/miRNA Interactions
?
© 2015 David Goodsell & RCSB Protein Data Bank

Spark - Representation of Interactions
Interactor A RPKM    Interactor B RPKM    Target A RPKM
10,010,100           18272                129192
9,000,000            20219                1210
2019                 122                  11
328289               83232                232323222
34243                2333                 ?????
(RPKM stands for Reads Per Kilobase of transcript per Million mapped reads.)
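For reference, the RPKM values in the table follow the standard normalisation: reads mapped to a transcript, scaled by transcript length in kilobases and by total mapped reads in millions. A minimal sketch, with hypothetical names and made-up input numbers:

```scala
object Rpkm {
  // RPKM = readsMapped * 1e9 / (transcriptLengthBp * totalMappedReads)
  // The factor 1e9 combines "per kilobase" (1e3) and "per million reads" (1e6).
  def rpkm(readsMapped: Long, transcriptLengthBp: Long, totalMappedReads: Long): Double = {
    require(transcriptLengthBp > 0 && totalMappedReads > 0,
      "transcript length and total mapped reads must be positive")
    readsMapped.toDouble * 1e9 / (transcriptLengthBp.toDouble * totalMappedReads.toDouble)
  }
}
```

For example, 500 reads on a 2,000 bp transcript in a library of 10 million mapped reads gives `Rpkm.rpkm(500, 2000, 10000000)`, which is 25.0.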

Apply Deep Learning
http://tensorflow.org/tutorials/mnist/beginners/index.md

Spark - Protein Folding
http://www.chemistry.emory.edu/faculty/dyer/protein.php
http://www.lanl.gov/bmsi/Individual%20Research/Werner/foldwide4by7.png

Spark - Parallel Landscape
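// Monte Carlo estimate of Pi: sample random points in the unit square
// (spark is the SparkContext)
val count = spark.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)

// Same pattern for folding: sample random conformations and keep the lowest energy
val minEnergy = spark.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  val z = Math.random()
  evaluateMFE(x, y, z) // placeholder: evaluate MFE (Minimum Free Energy) at this point
}.reduce((a, b) => Math.min(a, b))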
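The brute-force landscape search can be sketched end to end as follows. This is a toy under stated assumptions: `toyEnergy` is a hypothetical quadratic bowl standing in for a real free-energy evaluation, and each partition uses its own seeded generator so that runs are repeatable.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object LandscapeSketch {
  // Hypothetical toy "energy" with a single minimum of 0.0 at (0.5, 0.5, 0.5).
  // A real MFE evaluation would replace this function.
  def toyEnergy(x: Double, y: Double, z: Double): Double =
    math.pow(x - 0.5, 2) + math.pow(y - 0.5, 2) + math.pow(z - 0.5, 2)

  // Sample the unit cube in parallel and keep the lowest-energy point found.
  def minimumEnergy(sc: SparkContext, numSamples: Int, numPartitions: Int,
                    seed: Long): (Double, (Double, Double, Double)) =
    sc.parallelize(1 to numSamples, numPartitions)
      .mapPartitionsWithIndex { (idx, samples) =>
        val rng = new Random(seed + idx) // one seeded generator per partition
        samples.map { _ =>
          val (x, y, z) = (rng.nextDouble(), rng.nextDouble(), rng.nextDouble())
          (toyEnergy(x, y, z), (x, y, z))
        }
      }
      .reduce((a, b) => if (a._1 <= b._1) a else b) // keep the lower energy

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("landscape-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (energy, point) = minimumEnergy(sc, 100000, 4, 42L)
      println(s"Lowest sampled energy $energy at $point")
    } finally sc.stop()
  }
}
```

The reduce keeps the minimum rather than a sum, because the quantity of interest is the lowest point on the landscape, not an average over samples.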

Spark - Ontology Mining
source: EBI, miRBase

Mining the ontology
Over-represented terms in experiments
Neighborhood analysis - which terms appear together
Phenotype inference using triangle counting
Fast lookup of relationships between terms
Use scale-free network analysis to find hub nodes as candidate drug targets
Error and attack tolerance of complex networks
Réka Albert, Hawoong Jeong and Albert-László Barabási
Nature 406, 378-382 (27 July 2000)
doi:10.1038/35019019
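Two of the analyses listed above, triangle counting and hub detection by degree, can be sketched with GraphX. This is a minimal illustration on a made-up five-node graph: the vertex IDs are hypothetical stand-ins for ontology terms, and the edges are given with srcId < dstId and the graph is partitioned explicitly, which older GraphX releases require before `triangleCount`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, PartitionStrategy, VertexId}

object OntologyGraphSketch {
  // Returns (triangles per vertex, degree per vertex) for an undirected edge list.
  def analyse(sc: SparkContext,
              edges: Seq[(VertexId, VertexId)]): (Map[VertexId, Int], Map[VertexId, Int]) = {
    val graph = Graph
      .fromEdgeTuples(sc.parallelize(edges), defaultValue = 1)
      .partitionBy(PartitionStrategy.RandomVertexCut)
    val triangles = graph.triangleCount().vertices.collect().toMap
    val degrees = graph.degrees.collect().toMap
    (triangles, degrees)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ontology-graph-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Toy graph: terms 1, 2, 3 form a triangle; term 3 also links to 4 and 5.
      val edges = Seq((1L, 2L), (1L, 3L), (2L, 3L), (3L, 4L), (3L, 5L))
      val (triangles, degrees) = analyse(sc, edges)
      val hub = degrees.maxBy(_._2)._1 // highest-degree vertex as a crude hub measure
      println(s"triangles=$triangles degrees=$degrees hub=$hub")
    } finally sc.stop()
  }
}
```

Ranking by degree is only the simplest hub measure; it is used here because hubs in scale-free networks are by definition the few nodes with very high degree.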

Spark - Ontology Mining
http://geneontology.org/

Spark - Other Use Cases
Virtual Drug Screening
Genome Wide Association Studies
Disease Modeling
Environmental Phenotype analysis.
And so on ...
https://homes.cs.washington.edu/~suinlee/research.html

Summary
Intro to Apache Spark
Use scale-out computing and broadcast variables to distribute
DNA/RNA sequence analysis
Use Machine and Deep Learning to infer network structure
Use brute force landscape modeling for protein folding
Leverage graph libraries for ontology mining
Questions?
Icon made by Appzgear from "http://www.flaticon.com" is licensed under CC BY 3.0
Icon made by Freepik from "http://www.flaticon.com" is licensed under CC BY 3.0
"TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc."
All trademarks acknowledged