"Spark, DeepLearning and Life Sciences, Systems Biology in the Big Data age" Dev Lakhani, Founder of Batch Insights
YouTube Link: https://www.youtube.com/watch?v=z6aTv0ZKndQ
Watch more from Data Natives 2015 here: http://bit.ly/1OVkK2J
Visit the conference website to learn more: www.datanatives.io
About the author:
Dev Lakhani has a background in Software Engineering and Computational Statistics and is the founder of Batch Insights, a Big Data consultancy that has worked on numerous Big Data architectures and data science projects in Tier 1 banking, global telecoms, retail, media and fashion. Dev has worked actively with the Hadoop infrastructure since its inception and is currently researching and contributing to the Apache Spark and Tachyon communities.
2. Overview of Apache Spark
Applications to Sequencing
Applications to Protein Interactions using Deep Learning
Protein Folding
Using GraphX for Ontology Mining
Dev Lakhani
https://uk.linkedin.com/in/devlakhani
www.batchinsights.com
dl@batchinsights.com
3. About Me
• I am the founder of Batch Insights, a Big Data consultancy that has worked on numerous Big Data architectures and data science projects in Tier 1 banking, global telecoms, retail, media and fashion.
• I have been actively working with the Spark infrastructure since its inception and am currently working on various Spark deployments.
• My background is in Software Engineering with Computational Biology & Statistics from Oxford.
4. Life: The Original Big Data Problem
564,831 Protein Interactions
569,832 Curated Proteins
54,540,851 Computationally Hypothesised proteins
28,645 miRNA entries
source: EBI, miRBase
5. Spark - Data Processing at Scale
• Distributed processing platform
• In-memory processing
• Built-in frameworks for Machine Learning
• External frameworks for Deep Learning
• Broadcast capability
• Graph libraries
• Text processing
• Resilient Distributed Datasets (RDDs)
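As a rough illustration of the RDD and in-memory points above, the sketch below counts 3-mers across a couple of toy sequences. It assumes Spark core is on the classpath and runs with a local master; the object name, the `kmers` helper and the sequences are made up for the example and do not come from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasicsSketch {
  // All overlapping substrings of length k (inputs shorter than k yield nothing).
  def kmers(sequence: String, k: Int): Iterator[String] =
    sequence.sliding(k).filter(_.length == k)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-basics-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Toy input; cache() keeps the RDD in memory so both actions below reuse it.
      val sequences = sc.parallelize(Seq("ACGTAC", "GTACGT")).cache()

      val total = sequences.count()
      val counts = sequences
        .flatMap(seq => kmers(seq, 3))   // split every sequence into 3-mers
        .map(kmer => (kmer, 1))
        .reduceByKey(_ + _)              // sum occurrences per 3-mer across the cluster
        .collect()

      println(s"$total sequences")
      counts.sortBy(_._1).foreach { case (kmer, n) => println(s"$kmer: $n") }
    } finally {
      sc.stop()
    }
  }
}
```

The same shape should carry over to a real cluster by changing the master and reading the sequences from distributed storage instead of `parallelize`.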
6. Spark - Parallel Read Mapping
Brian J. Haas and Michael C. Zody. Advancing RNA-Seq analysis. Nature Biotechnology 28(5), 421-423 (May 2010). Nature Publishing Group. ISSN 1087-0156
7. Spark - Parallel Read Mapping
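// On the driver: ship the reference genome to every worker once
val broadcastGenome = sc.broadcast(referenceGenome)
// On the worker nodes: match each read against the local broadcast copy
// (matchWith is a placeholder for the alignment routine)
val mappedReads = textFile
  .map(read => matchWith(read, broadcastGenome.value))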
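A minimal runnable sketch of the broadcast pattern above, under some loud assumptions: Spark core is on the classpath, the master is local, the reference and reads are toy strings, and `exactMatches` is a naive exact-substring search standing in for a real aligner (which would tolerate mismatches and gaps). None of these names come from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReadMappingSketch {
  // Naive exact matching: every offset at which `read` occurs in `genome`.
  // Hypothetical stand-in for a real alignment routine.
  def exactMatches(read: String, genome: String): List[Int] =
    if (read.isEmpty) Nil
    else Iterator
      .iterate(genome.indexOf(read))(i => genome.indexOf(read, i + 1))
      .takeWhile(_ >= 0)
      .toList

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("read-mapping-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val referenceGenome = "ACGTACGTTAGC"                     // toy reference
      val reads = sc.parallelize(Seq("ACGT", "TAGC", "GGGG"))  // toy reads

      // Sent to each executor once, rather than once per task.
      val broadcastGenome = sc.broadcast(referenceGenome)

      val mappedReads = reads
        .map(read => (read, exactMatches(read, broadcastGenome.value)))
        .collect()

      mappedReads.foreach { case (read, positions) =>
        println(s"$read -> ${positions.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Broadcasting only pays off while the reference fits in each executor's memory; for larger references an indexed or partitioned representation would likely be needed.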
8. Spark - Project Tungsten
Java is not the best for String Matching
String objects are "large"
GC pauses and tuning.
Make use of Unsafe - UTF8Strings
C-style direct memory access - off heap
10. Spark - Representation of Interactions
Interactor A RPKM | Interactor B RPKM | Target A RPKM
10,010,100        | 18272             | 129192
9,000,000         | 20219             | 1210
2019              | 122               | 11
328289            | 83232             | 232323222
34243             | 2333              | ?????
(RPKM stands for Reads Per Kilobase of transcript per Million mapped reads.)
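For reference, the usual definition of RPKM can be written as below. The symbols C, N and L are chosen here for illustration and do not appear on the slide.

```latex
\mathrm{RPKM} = \frac{10^{9} \cdot C}{N \cdot L}
% C: number of reads mapped to the transcript
% N: total number of mapped reads in the sample
% L: transcript length in bases
%
% Worked example: C = 500, N = 10^{7}, L = 2000
% RPKM = (10^{9} \cdot 500) / (10^{7} \cdot 2000) = 25
```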
13. Spark - Protein Folding
http://www.chemistry.emory.edu/faculty/dyer/protein.php
http://www.lanl.gov/bmsi/Individual%20Research/Werner/foldwide4by7.png
14. Spark - Parallel Landscape
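// Monte Carlo estimate of Pi: count random points that fall inside the unit circle
val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)

// Same pattern for a folding landscape: sample random conformations in parallel.
// evaluateMFE is a placeholder for a free energy function and is not defined here.
val minEnergy = sc.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  val z = Math.random()
  evaluateMFE(x, y, z)
}.reduce((a, b) => math.min(a, b)) // keep the lowest energy found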
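The slide leaves the energy evaluation undefined. The sketch below runs the same brute-force sampling end to end with a made-up `toyEnergy` function (a simple bowl with its minimum at (0.5, 0.5, 0.5)); it is not a real free energy model. It assumes Spark core on the classpath and a local master, and the partition count and seeds are arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object LandscapeSketch {
  // Hypothetical toy "energy": squared distance from (0.5, 0.5, 0.5).
  def toyEnergy(x: Double, y: Double, z: Double): Double = {
    val (dx, dy, dz) = (x - 0.5, y - 0.5, z - 0.5)
    dx * dx + dy * dy + dz * dz
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("landscape-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val numSamples = 100000
      val best = sc.parallelize(1 to numSamples, numSlices = 4)
        .mapPartitionsWithIndex { (partition, samples) =>
          // One generator per partition, so partitions draw different points.
          val rng = new Random(42L + partition)
          samples.map { _ =>
            val (x, y, z) = (rng.nextDouble(), rng.nextDouble(), rng.nextDouble())
            (toyEnergy(x, y, z), (x, y, z))
          }
        }
        // Keep the sample with the lowest energy instead of summing.
        .reduce((a, b) => if (a._1 <= b._1) a else b)

      println(s"Lowest sampled energy ${best._1} at ${best._2}")
    } finally {
      sc.stop()
    }
  }
}
```

Carrying the coordinates along with the energy means the reduce returns where the minimum was found, not only its value.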
16. Mining the ontology
Over-represented terms in experiments
Neighborhood analysis - which terms appear together
Phenotype inference using triangle counting
Fast lookup of relationships between terms
Use scale-free network analysis to find hub nodes as candidate drug targets
Réka Albert, Hawoong Jeong and Albert-László Barabási. Error and attack tolerance of complex networks. Nature 406, 378-382 (27 July 2000). doi:10.1038/35019019
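Two of the ideas above, triangle counting and hub finding, can be sketched with GraphX as follows. This assumes Spark core and GraphX (Spark 2.x or later) on the classpath and a local master. The five "terms" and their links are invented placeholders, not real ontology entries, and ranking by degree is only a crude proxy for a full scale-free analysis.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object OntologyGraphSketch {
  // Hypothetical toy ontology: an edge means two terms were annotated together.
  def buildGraph(sc: SparkContext): Graph[String, String] = {
    val terms = sc.parallelize(Seq(
      (1L, "term A"), (2L, "term B"), (3L, "term C"), (4L, "term D"), (5L, "term E")))
    val links = sc.parallelize(Seq(
      Edge(1L, 2L, "co-occurs"), Edge(1L, 3L, "co-occurs"), Edge(2L, 3L, "co-occurs"),
      Edge(1L, 4L, "co-occurs"), Edge(1L, 5L, "co-occurs")))
    Graph(terms, links)
  }

  // Term names ranked by degree, highest first.
  def hubs(graph: Graph[String, String], k: Int): Array[(String, Int)] =
    graph.vertices
      .innerJoin(graph.degrees) { (_, name, degree) => (name, degree) }
      .map(_._2)
      .sortBy(_._2, ascending = false)
      .take(k)

  // Number of triangles each term takes part in.
  def triangles(graph: Graph[String, String]): Map[String, Int] =
    graph.triangleCount().vertices
      .innerJoin(graph.vertices) { (_, count, name) => (name, count) }
      .map(_._2)
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ontology-graph-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val graph = buildGraph(sc)
      hubs(graph, 2).foreach { case (name, degree) => println(s"hub candidate: $name ($degree links)") }
      triangles(graph).foreach { case (name, n) => println(s"$name is in $n triangle(s)") }
    } finally {
      sc.stop()
    }
  }
}
```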
18. Spark - Other Use Cases
Virtual Drug Screening
Genome Wide Association Studies
Disease Modeling
Environmental Phenotype analysis.
And so on ...
https://homes.cs.washington.edu/~suinlee/research.html
19. Summary
Intro to Apache Spark
Use scale-out computing and broadcast variables to distribute DNA/RNA sequence analysis
Use Machine and Deep Learning to infer network structure
Use brute force landscape modeling for protein folding
Leverage graph libraries for ontology mining
Questions?
Icon made by Appzgear from "http://www.flaticon.com" is licensed under CC BY 3.0
Icon made by Freepik (http://www.freepik.com) from www.flaticon.com is licensed under CC BY 3.0 (http://creativecommons.org/licenses/by/3.0/)
"TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc."
All trademarks acknowledged