2. Overview
● Food shortage: new problems, new solutions
● Intermezzo: how DNA works
● Tach’les (bottom line): what we do with Apache Spark
3. The planet has gotten very populous
And it’s the only one we got
4. World Population
Annual Growth Rate:
Peak - 2.1% (1962)
Current - 1.1% (2009)
https://en.wikipedia.org/wiki/World_population#/media/File:World-Population-1800-2100.svg
7. Plant breeding
● An ancient art
● Incremental changes
● Slow, but the cumulative gains are considerable
source: https://en.wikipedia.org/wiki/Zea_%28genus%29#/media/File:Maize-teosinte.jpg
8. How long does it take today?
Maize: 10-15 years
source: http://www.cropj.com/shimelis_6_11_2012_1542_1549.pdf
10. Computational genomics
⬇ Prices of DNA sequencing
⬆ Number of samples per crop sequenced and analyzed
⬆ Amount and quality of genomic data
⬇ Prices of computation
⬇ Prices of storage
We’re entering a new era:
Big Data genomics
11. Food security - a computational problem?
● The plant’s potential lies in its DNA.
● We analyze and compare sequences from many plants.
● Resulting in better predictions for breeding.
● Faster rate of crop improvement.
12. Intermezzo: DNA - how does it work?
● Four “letters”:
cytosine (C), guanine (G),
adenine (A), thymine (T)
● Three-letter codons encode 20 amino acids
● Amino acids combine to make
over 100K proteins
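A back-of-the-envelope sketch in plain Scala of why four letters suffice: bases group into three-letter codons, giving 4³ = 64 combinations, comfortably more than the 20 amino acids (which is why the genetic code is redundant). Names below are our own:

```scala
// The four DNA "letters" (bases).
val bases = Seq('A', 'C', 'G', 'T')

// A codon is a run of three bases; enumerate all of them.
val codons = for (a <- bases; b <- bases; c <- bases) yield s"$a$b$c"

// 4^3 = 64 codons -- more than enough to cover 20 amino acids
// plus start/stop signals, so several codons map to one acid.
println(codons.size) // 64
assert(codons.size > 20)
```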
14. DNA as storage
● Durable
● Supports random access
● Efficient sequential reads
● Easily replicated
● Contains error correction mechanisms
● Maximally “data local”
15. Part 2: What we do with Spark
● Analyze lots of genome sequences.
● Apply similarity algorithms, find where they match.
● Finally, assist the breeding program.
16. Input data is “noisy”
● Contains errors and gaps.
● Is fragmented.
● All due to sequencing technology.
17. Our setup
● Hadoop clusters on both private cloud and AWS
● Data as textual files and Parquet.
● MapR 5 Hadoop distro
● Spark 1.4.1
● SparkSQL and Hive (JDBC)
● Instances: ~150GB RAM, 40 cores.
● Provisioning: Ansible
18. Our data
● A dozen or so different crops, aiming for hundreds.
● Each crop: potentially ~1K fully sequenced samples
● ~100K “markers”.
● Each sequence: 1Gbp - 10Gbp (giga base-pairs = characters) long
● Current: several terabytes, aiming at petabytes
19. Working with Spark and Scala
● Scala’s type system is your friend
● Thinking functional takes time - and can be “overdone”
● Remember to add @tailrec when needed
● Scala case classes - great
● Nested structure: keeps you DRY, but sluggish.
● Scala has its pitfalls - profile.
● Spark as the “ultimate scala collection” - Martin Odersky.
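Two of the points above in a minimal sketch (toy names of our own): a case class gives you structural equality, pattern matching and `copy` for free, and `@tailrec` makes the compiler verify that a recursive scan over a long sequence cannot blow the stack:

```scala
import scala.annotation.tailrec

// Case class: immutable, with equals/hashCode/copy generated for us.
case class Marker(chromosome: String, position: Long)

// Tail-recursive count of a base in a sequence; @tailrec makes the
// compiler reject this method if it is not actually tail-recursive.
@tailrec
def countBase(seq: List[Char], base: Char, acc: Int = 0): Int = seq match {
  case Nil    => acc
  case h :: t => countBase(t, base, if (h == base) acc + 1 else acc)
}

println(countBase("GATTACA".toList, 'A')) // 3
assert(Marker("chr1", 42L) == Marker("chr1", 42L)) // structural equality
```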
20. Integrations with Spark
● Complex unmanaged framework - the usual 20/80 rule:
20% fun algorithmic stuff,
80% integration/devops/tuning/black-voodoo
● Integration with Hive - doable but cumbersome
● DataFrames API - very clean
● Parquet in Spark 1.4 - seamless; Parquet with SparkSQL < 1.3 - rather sucks.
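The DataFrames and Parquet points can be sketched like this against the Spark 1.4 API (the path and column names are made up for illustration; this needs a running Spark deployment):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("markers"))
val sqlContext = new SQLContext(sc)

// Reading Parquet in Spark 1.4 is a one-liner...
val markers = sqlContext.read.parquet("/data/crops/maize/markers.parquet")

// ...and the DataFrame API stays declarative end to end.
markers
  .filter(markers("quality") > 0.9)
  .groupBy("chromosome")
  .count()
  .show()
```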
21. Performance tuning with Spark
● If RDD objects need high RAM → memory gets tricky.
● Spark UI in 1.4.1 - very nice
● PairRDD - need to be your own “query optimizer”
● repartition / coalesce - very useful, but gets tricky if data variability is high (a dynamic real-time optimizer would be great).
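The PairRDD point in a sketch (paths and numbers are illustrative, and this needs a Spark deployment): with pair RDDs you play query optimizer yourself, e.g. choosing `reduceByKey`, which aggregates map-side before the shuffle, where `groupByKey` would ship every record; `coalesce` then shrinks the partition count without a full shuffle:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("tuning"))

// (markerId, hitCount) pairs, e.g. from a similarity scan.
val hits = sc.textFile("/data/hits.tsv").map { line =>
  val Array(marker, count) = line.split("\t")
  (marker, count.toLong)
}

// reduceByKey combines per-partition before shuffling -- usually far
// cheaper than groupByKey(...).mapValues(_.sum) on large or skewed data.
val perMarker = hits.reduceByKey(_ + _)

// coalesce lowers the partition count without a shuffle; use
// repartition instead when sizes vary wildly and need rebalancing.
perMarker.coalesce(32).saveAsTextFile("/out/per-marker")
```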
22. Testing, packaging and extending Spark
● Testing: the “local” master is great, but means no unit tests :-(
● sbt-pack - good alternative to sbt-assembly.
● Spark packages: spark-csv, spark-notebook and more.
● Speaking of open-source packages...
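The “local” point refers to Spark’s local master, which runs the whole pipeline in threads inside the test JVM with no cluster, good for integration-style tests even though fine-grained unit testing of closures stays awkward. A minimal sketch (needs spark-core on the test classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// master = "local[2]": two worker threads in-process, no cluster needed.
val conf = new SparkConf().setAppName("test").setMaster("local[2]")
val sc = new SparkContext(conf)
try {
  val evens = sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
  assert(evens == 50)
} finally {
  sc.stop() // always stop, or creating the next test's context fails
}
```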
23. ADAM Project - Genomics using Spark
● Fully open sourced
● Similarity algorithms
● Population clustering
● Predictive analysis using Deep Learning
● And more