Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age
Dev Lakhani
Overview of Apache Spark
Applications to Sequencing
Applications to Protein Interactions using Deep Learning
Protein Folding
Using GraphX for Ontology Mining
Dev Lakhani
https://uk.linkedin.com/in/devlakhani
www.batchinsights.com
bl@batchinsights.com
About Me
• I am founder of Batch Insights, a Big Data consultancy that has
worked on numerous Big Data architectures and data science
projects in Tier 1 banking, global telecoms, retail, media and
fashion.
• I have been actively working with the Spark infrastructure since
its inception and am currently working on various Spark
deployments.
• My background is in Software Engineering with Computational
Biology & Statistics from Oxford.
Life: The Original Big Data Problem
564,831 Protein Interactions
569,832 Curated Proteins
54,540,851 Computationally Hypothesised Proteins
28,645 miRNA entries
source: EBI, miRBase
Spark - Data Processing at Scale
• Distributed Processing Platform
• In-memory processing
• Built-in frameworks for Machine Learning
• Externally built frameworks for Deep Learning
• Broadcast Capability
• Graph Libraries
• Text Processing
• Resilient Distributed Datasets (RDDs)
Spark - Parallel Read Mapping
Haas, Brian J. and Zody, Michael C. Advancing RNA-Seq analysis.
Nature Biotechnology 28(5), 421-423 (May 2010).
Nature Publishing Group, ISSN 1087-0156
Spark - Parallel Read Mapping
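// On the driver: ship the reference genome to every worker once
val broadcastGenome = sc.broadcast(referenceGenome)
// On the worker nodes: match each read against the broadcast copy
// (referenceGenome and matchWith are placeholders, not Spark APIs)
val mappedReads = textFile
  .map(read => matchWith(read, broadcastGenome.value))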
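A minimal sketch of the per-read step, written in plain Scala so it runs without a cluster. The names `ReadMapping`, `matchWith` and `mapReads` are hypothetical, and exact substring search stands in for a real aligner. On Spark, `reads` would be an RDD of read strings and `genome` would be read through `broadcastGenome.value` inside the closure.

```scala
object ReadMapping {
  // Hypothetical matcher: every start offset at which `read` occurs exactly in `genome`.
  // Exact substring search is an assumption; a real aligner tolerates mismatches.
  def matchWith(read: String, genome: String): List[Int] = {
    @annotation.tailrec
    def loop(from: Int, acc: List[Int]): List[Int] = {
      val i = genome.indexOf(read, from)
      if (i < 0) acc.reverse else loop(i + 1, i :: acc)
    }
    if (read.isEmpty) Nil else loop(0, Nil)
  }

  // Local stand-in for the Spark job. On a cluster this would read roughly:
  //   textFile.map(read => read -> matchWith(read, broadcastGenome.value))
  def mapReads(reads: Seq[String], genome: String): Seq[(String, List[Int])] =
    reads.map(read => read -> matchWith(read, genome))
}
```

Each read is matched independently of the others, which is why the map step parallelises cleanly once the genome is available on every worker.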
Spark - Project Tungsten
Java is not the best for String Matching
String objects are "large"
GC pauses and tuning.
Make use of Unsafe and UTF8String
C-style direct memory access, off heap
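As a rough illustration of the off-heap idea, the sketch below stores a sequence in a direct `java.nio.ByteBuffer`, one byte per base, and compares bytes in place. This is not Tungsten's internal `UTF8String` or `Unsafe` code; it uses the JDK's public direct buffers as an analogy, and the names `OffHeapSequence`, `store` and `matchesAt` are hypothetical.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object OffHeapSequence {
  // Copy a nucleotide string into a direct (off-heap) buffer, one byte per base.
  def store(sequence: String): ByteBuffer = {
    val bytes = sequence.getBytes(StandardCharsets.US_ASCII)
    val buffer = ByteBuffer.allocateDirect(bytes.length)
    buffer.put(bytes)
    buffer.flip() // limit = number of bases written, position = 0
    buffer
  }

  // True if `read` occurs in `genome` at `offset`, compared byte by byte
  // using absolute gets, so no String objects are created during matching.
  def matchesAt(genome: ByteBuffer, read: Array[Byte], offset: Int): Boolean =
    offset >= 0 && offset + read.length <= genome.limit() &&
      read.indices.forall(i => genome.get(offset + i) == read(i))
}
```

The buffer's contents live outside the garbage-collected heap, so a large reference sequence held this way adds little GC pressure.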
Spark - Protein/miRNA Interactions
?
© 2015 David Goodsell & RCSB Protein Data Bank
Spark - Representation of Interactions
Interactor A RPKM    Interactor B RPKM    Target A RPKM
10,010,100           18272                129192
9,000,000            20219                1210
2019                 122                  11
328289               83232                232323222
34243                2333                 ?????
(RPKM stands for Reads Per Kilobase of transcript per Million mapped reads.)
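One way the table above could be held in code, as a sketch: each row becomes a case class with an optional target, so the row marked "?????" is the one to predict. The names `Interaction`, `log2p1` and `features` are hypothetical, and the log2(x + 1) transform is an assumed preprocessing step, not something stated on the slide.

```scala
object Interactions {
  // One row of the table; target is None where the slide shows "?????".
  final case class Interaction(interactorA: Double, interactorB: Double, target: Option[Double])

  val rows: Seq[Interaction] = Seq(
    Interaction(10010100.0, 18272.0, Some(129192.0)),
    Interaction(9000000.0, 20219.0, Some(1210.0)),
    Interaction(2019.0, 122.0, Some(11.0)),
    Interaction(328289.0, 83232.0, Some(232323222.0)),
    Interaction(34243.0, 2333.0, None)
  )

  // log2(x + 1) compresses the wide dynamic range of RPKM values (assumed preprocessing).
  def log2p1(x: Double): Double = math.log(x + 1.0) / math.log(2.0)

  // Feature vector for a learner: the two interactor expression levels.
  def features(r: Interaction): Array[Double] =
    Array(log2p1(r.interactorA), log2p1(r.interactorB))

  // Rows with a known target form the training set; the rest are to be predicted.
  val (training, toPredict) = rows.partition(_.target.isDefined)
}
```

The training rows would feed whichever learner is chosen, and the learner's job is to fill in the target for the rows in `toPredict`.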
Apply Deep Learning
http://tensorflow.org/tutorials/mnist/beginners/index.md
Apply Deep Learning - At Scale
Spark - Protein Folding
http://www.chemistry.emory.edu/faculty/dyer/protein.php
http://www.lanl.gov/bmsi/Individual%20Research/Werner/foldwide4by7.png
Spark - Parallel Landscape
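// Monte Carlo estimate of Pi (here spark is the SparkContext)
val count = spark.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)

// Same pattern for folding: sample the landscape, keep the lowest energy
val minEnergy = spark.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  val z = Math.random()
  evaluateMFE(x, y, z) // placeholder: free energy of the sampled conformation
}.reduce((a, b) => math.min(a, b)) // a minimum needs min, not a sum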
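A runnable sketch of the brute-force landscape idea in plain Scala. The `energy` function is a toy bowl with its minimum at (0.5, 0.5, 0.5), an assumption standing in for a real free-energy model, and `LandscapeSampling` and `minimumFreeEnergy` are hypothetical names. The reduction keeps the smallest sampled value, which is what a minimum free energy search needs.

```scala
import scala.util.Random

object LandscapeSampling {
  // Toy stand-in for a free-energy function (NOT a physical force field):
  // a bowl whose minimum value 0 sits at (0.5, 0.5, 0.5).
  def energy(x: Double, y: Double, z: Double): Double = {
    val (dx, dy, dz) = (x - 0.5, y - 0.5, z - 0.5)
    dx * dx + dy * dy + dz * dz
  }

  // Draw numSamples random points in the unit cube and keep the lowest energy seen.
  // On Spark the range would be parallelised and each partition would need its own
  // random generator; a single seeded generator is used here for repeatability.
  def minimumFreeEnergy(numSamples: Int, seed: Long): Double = {
    val rng = new Random(seed)
    (1 to numSamples)
      .map(_ => energy(rng.nextDouble(), rng.nextDouble(), rng.nextDouble()))
      .reduce((a, b) => math.min(a, b))
  }
}
```

Because `min` is associative and commutative, the reduction can be split across partitions in any order, exactly as the sum is in the Pi example.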
Spark - Ontology Mining
source: EBI, miRBase
Mining the ontology
Over-represented terms in experiments
Neighborhood analysis - which terms appear together
Phenotype inference using triangle counting
Fast lookup of relationships between terms
Use scale-free network analysis to find hub nodes as candidate drug targets
Albert, Réka, Jeong, Hawoong and Barabási, Albert-László.
Error and attack tolerance of complex networks.
Nature 406, 378-382 (27 July 2000). doi:10.1038/35019019
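A small plain-Scala sketch of two of the analyses listed on this slide: ranking nodes by degree to find hubs, and counting the triangles through each node. The names `OntologyGraph`, `adjacency`, `hubs` and `triangles` are hypothetical. GraphX offers comparable operations on a distributed graph (`degrees` and `triangleCount()`); this local version only shows the logic.

```scala
object OntologyGraph {
  type Edge = (String, String)

  // Undirected adjacency sets built from an edge list (self-loops dropped).
  def adjacency(edges: Seq[Edge]): Map[String, Set[String]] =
    edges
      .filter { case (a, b) => a != b }
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (node, pairs) => node -> pairs.map(_._2).toSet }

  // The `top` highest-degree nodes; ties are broken alphabetically.
  def hubs(adj: Map[String, Set[String]], top: Int): Seq[(String, Int)] =
    adj.toSeq
      .map { case (n, ns) => n -> ns.size }
      .sortBy { case (n, d) => (-d, n) }
      .take(top)

  // Triangles through each node: pairs of its neighbours that are themselves connected.
  def triangles(adj: Map[String, Set[String]]): Map[String, Int] =
    adj.map { case (n, ns) =>
      val closed = for {
        a <- ns.toSeq
        b <- ns.toSeq
        if a < b && adj(a).contains(b)
      } yield (a, b)
      n -> closed.size
    }
}
```

In a scale-free network a few nodes carry most of the edges, so the head of the `hubs` ranking is where candidate targets would be looked for first.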
Spark - Ontology Mining
http://geneontology.org/
Spark - Other Use Cases
Virtual Drug Screening
Genome Wide Association Studies
Disease Modeling
Environmental Phenotype analysis.
And so on ...
https://homes.cs.washington.edu/~suinlee/research.html
Summary
Intro to Apache Spark
Use scale-out computing and broadcast variables to distribute DNA/RNA Sequence Analysis
Use Machine and Deep Learning to infer network structure
Use brute force landscape modeling for protein folding
Leverage graph libraries for ontology mining
Questions?
Icon made by Appzgear from "http://www.flaticon.com" is licensed under CC BY 3.0
Icon made by Freepik from "http://www.flaticon.com" is licensed under CC BY 3.0
"TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc."
All trademarks acknowledged
