Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age

Dev Lakhani

Overview of Apache Spark
Applications to Sequencing
Applications to Protein Interactions using Deep Learning
Protein Folding
Using GraphX for Ontology Mining
Dev Lakhani
https://uk.linkedin.com/in/devlakhani
www.batchinsights.com
bl@batchinsights.com

About Me
• I am the founder of Batch Insights, a Big Data consultancy that has
worked on numerous Big Data architectures and data science
projects in Tier 1 banking, global telecoms, retail, media and
fashion.
• I have been actively working with the Spark infrastructure since
its inception and am currently working on various Spark
deployments.
• My background is in Software Engineering with Computational
Biology & Statistics from Oxford.

Life: The Original Big Data Problem
564,831 Protein Interactions
569,832 Curated Proteins
54,540,851 Computationally Hypothesised Proteins
28,645 miRNA entries
source: EBI, miRBase

Spark - Data Processing at Scale
• Distributed Processing Platform
• In-memory processing
• Built-in frameworks for Machine Learning
• External frameworks for Deep Learning
• Broadcast Capability
• Graph Libraries
• Text Processing
• Resilient Distributed Datasets (RDDs)

Spark - Parallel Read Mapping
Haas, Brian J. and Zody, Michael C. Advancing RNA-Seq analysis. Nat Biotech 28(5), 421-423 (May 2010). Nature Publishing Group. ISSN 1087-0156

Spark - Parallel Read Mapping
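// On the driver: ship one read-only copy of the reference genome to every worker
val broadcastGenome = sc.broadcast(referenceGenome)
// On each worker node: parse the input lines into reads, then match each read
val mappedReads = textFile
  .flatMap(parseReads) // parseReads: placeholder, the slide does not show it
  .map(read => matchWith(read, broadcastGenome.value)) // matchWith: placeholder aligner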
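
The broadcast pattern above can be sketched as a small local Spark job. This is only an illustration under stated assumptions: the object name BroadcastReadMapping, the helper matchWith (an exact substring search standing in for a real aligner) and the toy sequences are all invented for the example and do not come from the talk. SparkSession is used for setup, which assumes Spark 2.0 or later.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastReadMapping {
  // Hypothetical stand-in for an aligner: offset of the first exact match, or -1.
  def matchWith(read: String, genome: String): Int = genome.indexOf(read)

  // Broadcast the genome once, then map every read against the local copy.
  def mapReads(spark: SparkSession, genome: String, reads: Seq[String]): Array[(String, Int)] = {
    val sc = spark.sparkContext
    val broadcastGenome = sc.broadcast(genome)
    try {
      sc.parallelize(reads)
        .map(read => (read, matchWith(read, broadcastGenome.value)))
        .collect()
    } finally broadcastGenome.destroy() // release the broadcast copies on the executors
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("read-mapping-sketch")
      .getOrCreate()
    try {
      // Toy data, made up for the example.
      mapReads(spark, "ACGTACGTTAGC", Seq("GTTA", "ACGT", "GGGG")).foreach(println)
    } finally spark.stop()
  }
}
```

The point of the broadcast is that the genome is sent to each executor once rather than being serialized with every task. A production pipeline would replace matchWith with an index-based aligner.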

Spark - Project Tungsten
Java is not the best for string matching
String objects are "large"
GC pauses and tuning
Make use of Unsafe and UTF8String
C-style direct memory access, off heap
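
The idea of keeping sequence data as raw bytes outside the garbage-collected heap can be illustrated with the standard java.nio API. This is not how Tungsten is implemented internally. It is a minimal sketch, and the object OffHeapMatch and its two helpers are invented for the example.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object OffHeapMatch {
  // Copy a sequence into a direct (off-heap) buffer, one byte per base.
  def toOffHeap(seq: String): ByteBuffer = {
    val bytes = seq.getBytes(StandardCharsets.US_ASCII)
    val buf = ByteBuffer.allocateDirect(bytes.length)
    buf.put(bytes)
    buf.flip() // switch the buffer from writing to reading
    buf
  }

  // Naive exact match on raw bytes: first offset of `read` in `genome`, or -1.
  def indexOf(genome: ByteBuffer, read: Array[Byte]): Int = {
    val n = genome.limit()
    val m = read.length
    var i = 0
    while (i <= n - m) {
      var j = 0
      while (j < m && genome.get(i + j) == read(j)) j += 1
      if (j == m) return i
      i += 1
    }
    -1
  }
}
```

Usage: OffHeapMatch.indexOf(OffHeapMatch.toOffHeap(genome), read.getBytes(StandardCharsets.US_ASCII)). The genome bytes live outside the Java heap, so the collector never has to scan or move them.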

Spark - Protein/miRNA Interactions
?
© 2015 David Goodsell & RCSB Protein Data Bank

Spark - Representation of Interactions
Interactor A RPKM    Interactor B RPKM    Target A RPKM
10,010,100           18272                129192
9,000,000            20219                1210
2019                 122                  11
328289               83232                232323222
34243                2333                 ?????
(RPKM stands for Reads Per Kilobase of transcript per Million mapped reads.)
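
As a worked example of the definition above, RPKM is commonly computed as the read count scaled by 10^9 and divided by the product of transcript length in bases and total mapped reads. The object and parameter names below are chosen for illustration and are not from the talk.

```scala
object Rpkm {
  // RPKM = reads * 1e9 / (transcript length in bp * total mapped reads)
  def rpkm(readCount: Long, transcriptLengthBp: Long, totalMappedReads: Long): Double =
    readCount.toDouble * 1e9 / (transcriptLengthBp.toDouble * totalMappedReads.toDouble)
}
```

For instance, 100 reads on a 2,000 bp transcript in a library of 10,000,000 mapped reads gives Rpkm.rpkm(100, 2000, 10000000L), which works out to 100 * 10^9 / (2,000 * 10^7) = 5.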

Apply Deep Learning
http://tensorflow.org/tutorials/mnist/beginners/index.md

Spark - Protein Folding
http://www.chemistry.emory.edu/faculty/dyer/protein.php
http://www.lanl.gov/bmsi/Individual%20Research/Werner/foldwide4by7.png

Spark - Parallel Landscape
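// Monte Carlo estimate of Pi (sc is the SparkContext)
val count = sc.parallelize(1 to NUM_SAMPLES).map { _ =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)

// Same pattern for folding: sample points on the landscape and keep the
// Minimum Free Energy (MFE) found
val minEnergy = sc.parallelize(1 to NUM_SAMPLES).map { _ =>
  val x = Math.random()
  val y = Math.random()
  val z = Math.random()
  evaluateMFE(x, y, z) // placeholder: the energy function is not given on the slide
}.reduce((a, b) => math.min(a, b))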
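
A runnable version of the sampling pattern above might look like the following. It is a sketch under loud assumptions: the energy function is a toy quadratic bowl invented for the example, not a real free-energy model, and the object name LandscapeSampling, the seed handling and the sample count are all illustrative.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

object LandscapeSampling {
  // Toy energy surface (assumption, for illustration): minimum 0.0 at (0.5, 0.5, 0.5).
  def energy(x: Double, y: Double, z: Double): Double =
    math.pow(x - 0.5, 2) + math.pow(y - 0.5, 2) + math.pow(z - 0.5, 2)

  // Sample numSamples random points in the unit cube in parallel and
  // return the lowest energy seen.
  def minEnergy(spark: SparkSession, numSamples: Int, seed: Long): Double =
    spark.sparkContext
      .parallelize(1 to numSamples)
      .mapPartitionsWithIndex { (idx, it) =>
        val rng = new Random(seed + idx) // one generator per partition
        it.map(_ => energy(rng.nextDouble(), rng.nextDouble(), rng.nextDouble()))
      }
      .reduce((a, b) => math.min(a, b))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("landscape-sketch")
      .getOrCreate()
    try println("Lowest sampled energy: " + minEnergy(spark, 100000, seed = 42L))
    finally spark.stop()
  }
}
```

Seeding one generator per partition keeps a run reproducible for a fixed partition count, which Math.random() does not. Taking the minimum with reduce is associative and commutative, so Spark can combine partial results in any order.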

Spark - Ontology Mining
source: EBI, miRBase

Mining the ontology
Over-represented terms in experiments
Neighborhood analysis - which terms appear together
Phenotype inference using triangle counting
Fast lookup of relationships between terms
Use scale-free network analysis to find hub nodes as candidate drug targets
Error and attack tolerance of complex networks. Réka Albert, Hawoong Jeong and Albert-László Barabási. Nature 406, 378-382 (27 July 2000). doi:10.1038/35019019
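
Two of the analyses listed on this slide, triangle counting and hub finding, map directly onto GraphX calls. The sketch below assumes a tiny made-up term graph: the vertex IDs are not real ontology identifiers, and the object OntologyGraphSketch and the helper canonical are invented for the example.

```scala
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.sql.SparkSession

object OntologyGraphSketch {
  // Order the endpoints so that src <= dst, which older GraphX versions
  // expected before triangle counting.
  def canonical(e: (Long, Long)): (Long, Long) = if (e._1 <= e._2) e else e.swap

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ontology-sketch")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      // Hypothetical term-to-term links; term 3 is linked to every other term.
      val rawEdges = Seq((2L, 1L), (2L, 3L), (1L, 3L), (3L, 4L), (5L, 3L))
      val edges = sc.parallelize(rawEdges.map(canonical).map {
        case (src, dst) => (src: VertexId, dst: VertexId)
      })
      val graph = Graph.fromEdgeTuples(edges, defaultValue = 1)

      // Number of triangles each term takes part in.
      graph.triangleCount().vertices.collect().sortBy(_._1).foreach(println)

      // Highest-degree terms first: a crude way to surface hubs.
      graph.degrees.sortBy(_._2, ascending = false).take(2).foreach(println)
    } finally spark.stop()
  }
}
```

On a real ontology the edge list would be loaded from the ontology's relationship file rather than typed in, and degree alone is only a first approximation of a hub.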

Spark - Ontology Mining
http://geneontology.org/

Spark - Other Use Cases
Virtual Drug Screening
Genome Wide Association Studies
Disease Modeling
Environmental Phenotype analysis.
And so on ...
https://homes.cs.washington.edu/~suinlee/research.html

Summary
Intro to Apache Spark
Use scale-out computing and broadcast variables to distribute DNA/RNA sequence analysis
Use Machine and Deep Learning to infer network structure
Use brute force landscape modeling for protein folding
Leverage graph libraries for ontology mining
Questions?
Icon made by Appzgear from "http://www.flaticon.com" is licensed under CC BY 3.0
Icon made by Freepik from "http://www.flaticon.com" is licensed under CC BY 3.0
"TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc."
All trademarks acknowledged
