Spark Meetup, December 2015
Noam Barkai
noamb@nrgene.com
Overview
● Food shortage: new problems, new solutions
● Intermezzo: how DNA works
● Tach’les: what we do with Apache Spark
The planet has gotten very populous
And it’s the only one we’ve got
World Population
Annual Growth Rate:
Peak - 2.1% (1962)
Current - 1.1% (2009)
https://en.wikipedia.org/wiki/World_population#/media/File:World-Population-1800-2100.svg
Food intake
source: http://www.coolgeography.co.uk/A-level/AQA/Year%2012/Food%20supply/Patterns%20and%20intro/Food_consumption.gif
Upscale: Same area, more crops
Plant breeding
● An ancient art
● Incremental changes
● Slow but considerable
source: https://en.wikipedia.org/wiki/Zea_%28genus%29#/media/File:Maize-teosinte.jpg
How long does it take today?
Maize: 10-15 years
source: http://www.cropj.com/shimelis_6_11_2012_1542_1549.pdf
How breeding works
[Figure: breeding diagram with steps numbered 1-5]
Computational genomics
⬇ Prices of DNA sequencing
⬆ Number of samples per crop sequenced and analyzed
⬆ Amount and quality of genomic data
⬇ Prices of computation
⬇ Prices of storage
We’re entering a new era: BIG DATA Genomics
Food security - a computational problem?
● The plant’s potential lies in its DNA.
● We analyze and compare sequences from many plants.
● Resulting in better predictions for breeding.
● Faster rate of crop improvement.
Intermezzo: DNA - how does it work?
● Four “letters”: cytosine (C), guanine (G), adenine (A), thymine (T)
● Triplets of letters (codons) encode 20 amino acids
● Amino acids combine to make 100K+ proteins
Conceptually we can think of this as a “pipeline”: “The Central Dogma”
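The "pipeline" view (DNA to RNA to protein) can be sketched in a few lines of Scala. This is a simplified illustration, not code from the talk: the codon table is deliberately partial, transcription is reduced to replacing T with U on the coding strand, and the object and method names are made up for the example.

```scala
object CentralDogmaSketch {
  // Partial codon table for illustration only; the full genetic code has 64 entries.
  val codonTable: Map[String, String] = Map(
    "AUG" -> "Met", // also the usual start codon
    "UUU" -> "Phe",
    "GGC" -> "Gly",
    "AAA" -> "Lys",
    "UGG" -> "Trp",
    "UAA" -> "Stop", "UAG" -> "Stop", "UGA" -> "Stop"
  )

  // Stage 1, transcription (simplified): copy the coding strand, replacing T with U.
  def transcribe(dna: String): String = dna.toUpperCase.replace('T', 'U')

  // Stage 2, translation: read non-overlapping triplets and stop at the first
  // stop codon. Codons missing from the partial table are shown as "?".
  def translate(mrna: String): List[String] =
    mrna.grouped(3).toList
      .filter(_.length == 3)
      .map(codon => codonTable.getOrElse(codon, "?"))
      .takeWhile(_ != "Stop")

  // Four letters read three at a time give 4^3 = 64 possible codons,
  // more than enough to encode 20 amino acids plus stop signals.
  val bases: Seq[Char] = Seq('A', 'C', 'G', 'U')
  val allCodons: Seq[String] =
    for (a <- bases; b <- bases; c <- bases) yield s"$a$b$c"

  def main(args: Array[String]): Unit = {
    val dna = "ATGTTTGGCAAATGGTAA"
    println(translate(transcribe(dna)).mkString("-"))
    println(allCodons.size)
  }
}
```

Each stage is a pure function, so the whole pipeline is just function composition: `translate(transcribe(dna))`.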
DNA as storage
● Durable
● Supports random access
● Efficient sequential reads
● Easily replicated
● Contains error correction mechanisms
● Maximally “data local”
Part 2: What we do with Apache Spark
● Analyze lots of genome sequences.
● Apply similarity algorithms, find where they match.
● Finally, assist the breeding program.
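To make "apply similarity algorithms, find where they match" concrete, here is a minimal sketch of one common approach: comparing the sets of k-mers (substrings of length k) of two sequences. The talk does not say which algorithms are actually used, so the Jaccard measure and all names below are assumptions chosen for illustration.

```scala
object KmerSimilaritySketch {
  // All distinct substrings of length k (k-mers) in a sequence.
  def kmers(seq: String, k: Int): Set[String] =
    if (seq.length < k) Set.empty[String] else seq.sliding(k).toSet

  // Jaccard similarity: shared k-mers divided by all distinct k-mers in either sequence.
  def jaccard(a: String, b: String, k: Int): Double = {
    val ka = kmers(a, k)
    val kb = kmers(b, k)
    val union = ka union kb
    if (union.isEmpty) 0.0 else (ka intersect kb).size.toDouble / union.size
  }

  // Start positions in `b` whose k-mer also occurs somewhere in `a`.
  def matchPositions(a: String, b: String, k: Int): Seq[Int] = {
    val ka = kmers(a, k)
    (0 to b.length - k).filter(i => ka.contains(b.substring(i, i + k)))
  }

  def main(args: Array[String]): Unit = {
    val a = "ACGTACGT"
    val b = "ACGTTCGT"
    println(jaccard(a, b, 4))
    println(matchPositions(a, b, 4))
  }
}
```

On real genomes the k-mer sets would not fit in one JVM; the same idea maps onto Spark by emitting (k-mer, sampleId) pairs and joining or grouping by key.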
Input data is “noisy”
● Contains errors and gaps.
● Is fragmented.
● All due to sequencing technology.
Our setup
● Hadoop clusters on both private cloud and AWS
● Textual files, using Parquet.
● MapR 5 Hadoop distro
● Spark 1.4.1
● SparkSQL and Hive (JDBC)
● Instances: ~150GB RAM, 40 cores.
● Provisioning: Ansible
Our data
● A dozen or so different crops, going for hundreds.
● Each crop: potentially ~1K fully sequenced samples
● ~100K “markers”.
● Each sequence: 1-10 Gbp long (giga base-pairs = characters)
● Current: several terabytes, aiming at petabytes
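A back-of-the-envelope calculation shows how these figures lead to petabytes. The sketch assumes one byte per base (plain ASCII, uncompressed) and takes 100 crops as a stand-in for "hundreds"; both are assumptions for illustration, not numbers from the talk.

```scala
object DataVolumeSketch {
  val bytesPerBase: Long = 1L        // assumption: one ASCII character per base, no compression
  val samplesPerCrop: Long = 1000L   // "~1K fully sequenced samples"
  val Gbp: Long = 1000000000L        // one giga base-pair

  // Raw size in bytes for a number of crops with a given genome length per sample.
  def rawBytes(crops: Long, basesPerSample: Long): Long =
    crops * samplesPerCrop * basesPerSample * bytesPerBase

  def toTB(bytes: Long): Double = bytes / 1e12

  def main(args: Array[String]): Unit = {
    println(toTB(rawBytes(1, 1 * Gbp)))     // one crop, 1 Gbp genomes, in TB
    println(toTB(rawBytes(1, 10 * Gbp)))    // one crop, 10 Gbp genomes, in TB
    println(toTB(rawBytes(100, 10 * Gbp)))  // 100 crops, 10 Gbp genomes, in TB
  }
}
```

Under these assumptions a single fully sampled crop is already 1 to 10 TB of raw sequence, and a hundred such crops reach the petabyte range before any derived data is stored.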
Working with Spark and Scala
● Scala’s type system is your friend
● Thinking functional takes time - and can be “overdone”
● Remember to add @tailrec when needed
● Scala case classes - great
● Nested structure: keeps you DRY, but sluggish.
● Scala has its pitfalls - profile.
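The @tailrec and case-class points can be illustrated with a short sketch. The `Marker` type and the `gcCount` function are hypothetical examples, not code from the talk.

```scala
import scala.annotation.tailrec

// Hypothetical record type. A case class gives equals, hashCode, toString
// and copy for free, which suits immutable records passed around in Spark jobs.
final case class Marker(chromosome: String, position: Long, allele: Char)

object ScalaTipsSketch {
  // Counts G/C bases using an accumulator. With @tailrec the compiler rejects
  // the method if the recursive call is not in tail position, so a refactoring
  // cannot silently turn it into a stack-consuming recursion.
  @tailrec
  def gcCount(seq: String, index: Int = 0, acc: Long = 0L): Long =
    if (index >= seq.length) acc
    else {
      val c = seq.charAt(index)
      gcCount(seq, index + 1, if (c == 'G' || c == 'C') acc + 1 else acc)
    }

  def main(args: Array[String]): Unit = {
    println(gcCount("GATTACA"))
    val m = Marker("chr1", 100L, 'A')
    println(m.copy(position = 200L))
  }
}
```

Without the tail-call optimization, a recursion over a sequence millions of bases long would overflow the stack; with it, the method compiles down to a loop.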
Integrations with Spark
● Spark as the “ultimate Scala collection” - Martin Odersky.
● Complex unmanaged framework - the usual 20/80 rule: 20% fun algorithmic stuff, 80% integration/devops/tuning/black-voodoo
● Integration with Hive - doable but cumbersome
● DataFrames API - very clean
● Parquet in Spark 1.4 - seamless; Parquet with SparkSQL < 1.3 - rather sucks.
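The DataFrames and Parquet bullets can be illustrated with a minimal round trip, written against the Spark 1.4-style API (`SQLContext`, `read.parquet`, `write.parquet`). The `MarkerRow` schema, the sample rows and the temp-directory path are hypothetical; the talk does not show its tables. Running it needs the spark-core and spark-sql artifacts on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical schema for illustration only.
case class MarkerRow(crop: String, chromosome: String, position: Long)

object ParquetDataFrameSketch {
  // Writes a small DataFrame as Parquet, reads it back and counts rows per chromosome.
  def roundTrip(sc: SparkContext): Map[String, Long] = {
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // A fresh path inside a temp directory, so the write never hits an existing file.
    val path = Files.createTempDirectory("markers").resolve("markers.parquet").toString
    val rows = Seq(
      MarkerRow("maize", "chr1", 1200L),
      MarkerRow("maize", "chr1", 5400L),
      MarkerRow("maize", "chr2", 300L)
    )
    sc.parallelize(rows).toDF().write.parquet(path)

    // The schema travels with the Parquet file, so no column definitions are needed here.
    sqlContext.read.parquet(path)
      .groupBy("chromosome").count()
      .collect()
      .map(r => r.getString(0) -> r.getLong(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("parquet-sketch").setMaster("local[2]"))
    try println(roundTrip(sc))
    finally sc.stop()
  }
}
```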
Performance tuning with Spark
● If RDD objects need high RAM → memory gets tricky.
● Spark UI in 1.4.1 - very nice
● PairRDD - need to be your own “query optimizer”
● repartition / coalesce - very useful, but gets tricky if data variability is high (a dynamic real-time optimizer would be great).
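Two of the tuning points, being your own "query optimizer" with pair RDDs and using repartition / coalesce, can be sketched as below. The data and names are made up for illustration, and the code needs spark-core on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PairRddTuningSketch {
  // With pair RDDs the operator choice is up to you: reduceByKey combines values
  // inside each partition before the shuffle, so less data crosses the network
  // than with groupByKey followed by a sum.
  def hitsPerChromosome(hits: RDD[(String, Int)]): RDD[(String, Int)] =
    hits.reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[2]"))
    try {
      // Hypothetical (chromosome, 1) pairs spread over 8 partitions.
      val hits = sc.parallelize(
        Seq("chr1" -> 1, "chr2" -> 1, "chr1" -> 1, "chr3" -> 1, "chr1" -> 1), 8)
      val counts = hitsPerChromosome(hits)

      // After aggregation the data is tiny. coalesce lowers the partition count
      // without a full shuffle, whereas repartition always shuffles.
      val compact = counts.coalesce(2)
      println(compact.partitions.length)
      println(compact.collect().sorted.mkString(", "))
    } finally sc.stop()
  }
}
```

The partition count passed to coalesce is fixed here; with highly variable data sizes the right value changes from run to run, which is the difficulty the slide points at.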
Testing, packaging and extending Spark
● Testing: “local” mode is great, but means no unit-test :-(
● sbt-pack - good alternative to sbt-assembly.
● Spark packages: spark-csv, spark-notebook and more.
● Speaking of open-source packages...
ADAM Project - Genomics using Spark
● Fully open sourced from
● Similarity algorithms
● Population clustering
● Predictive analysis using Deep Learning
● And more
Thank you

Using Apache Spark to fight world hunger - Israel Spark meetup at Taboola

  • 1. Spark Meetup, December 2015 Noam Barkai noamb@nrgene.com
  • 2. Overview ● Food shortage: new problems, new solutions ● Intermezzo: how DNA works ● Tach’les: what we do with Apache Spark
  • 3. The planet has gotten very populous And it’s the only one we got
  • 4. World Population Annual Growth Rate: Peak - 2.1% (1962) Current - 1.1% (2009) https://en.wikipedia.org/wiki/World_population#/media/File:World-Population-1800-2100.svg
  • 6. Upscale: Same area, more crops
  • 7. Plant breeding ● An ancient art ● Incremental changes ● Slow but considerable source: https://en.wikipedia.org/wiki/Zea_%28genus%29#/media/File:Maize-teosinte.jpg
  • 8. How long does it take today? Maize: 10-15 years source: http://www.cropj.com/shimelis_6_11_2012_1542_1549.pdf
  • 10. Computational genomics ⬇ Prices of DNA sequencing ⬆ Number of samples per crop sequenced and analyzed ⬆ Amount and quality of genomic data ⬇ Prices of computation ⬇ Prices of storage We’re entering a new era BIG DATA Genomics
  • 11. Food security - a computational problem? ● The plant’s potential lies in its DNA. ● We analyze and compare sequences from many plants. ● Resulting in better predictions for breeding. ● Faster rate of crop improvement.
  • 12. Intermezzo: DNA - how does it work? ● Four “letters”: cytosine(C), guanine(G), adenine(A), thymine(T) ● Encode 20 amino acids ● Combine to make: +100K proteins
  • 13. Conceptually we can think of this as a “pipeline”:“The Central Dogma”
  • 14. DNA as storage ● Durable ● Supports random access ● Efficient sequential reads ● Easily replicated ● Contains error correction mechanisms ● Maximally “data local”
  • 15. Part 2: What we do with ● Analyze lots of genome sequences. ● Apply similarity algorithms, find where they match. ● Finally, assist the breeding program.
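Slide 15 mentions similarity algorithms without naming one. As a purely illustrative sketch, and not the method used in the talk, one simple way to score how alike two sequences are is the Jaccard index over their k-mers (substrings of length k). The names `KmerSimilarity`, `kmers` and `jaccard` are hypothetical, and the code uses plain Scala collections rather than Spark.

```scala
object KmerSimilarity {
  // All distinct overlapping substrings of length k (k-mers) in a sequence.
  // Sequences shorter than k have no k-mers.
  def kmers(seq: String, k: Int): Set[String] = {
    require(k > 0, "k must be positive")
    if (seq.length < k) Set.empty[String]
    else seq.sliding(k).toSet
  }

  // Jaccard index: shared k-mers divided by all distinct k-mers of both sequences.
  // Returns 0.0 when neither sequence has any k-mers.
  def jaccard(a: String, b: String, k: Int): Double = {
    val ka = kmers(a, k)
    val kb = kmers(b, k)
    val union = ka union kb
    if (union.isEmpty) 0.0
    else (ka intersect kb).size.toDouble / union.size
  }
}
```

For example, `KmerSimilarity.jaccard("ACGT", "ACGA", 2)` compares {AC, CG, GT} with {AC, CG, GA} and yields 0.5. Real genomic data is noisy and far larger, so a production approach would look quite different.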
  • 16. Input data is “noisy” ● Contains errors and gaps. ● Is fragmented. ● All due to sequencing technology.
  • 17. Our setup ● Hadoop clusters on both private cloud and AWS ● Textual files, using Parquet. ● MapR 5 Hadoop distro ● Spark 1.4.1 ● SparkSQL and Hive (JDBC) ● Instances: ~150GB RAM, 40 cores. ● Provisioning: Ansible
  • 18. Our data ● A dozen or so different crops, going for hundreds. ● Each crop: potentially ~1K fully sequenced samples ● ~100K “markers”. ● Each sequence: 1Gbp - 10Gbp (giga base-pairs = characters) long ● Current: several terabytes, aiming at petabytes
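The figures on slide 18 can be turned into a rough back-of-the-envelope estimate of raw data volume. The sketch below assumes one byte per base pair (the slide equates base pairs with characters), no compression, and no metadata. The names `GenomeVolume`, `rawBytes` and `toTerabytes` are hypothetical.

```scala
object GenomeVolume {
  // Assumption: one ASCII character (one byte) per base pair, uncompressed.
  val BytesPerBase: Long = 1L

  // Raw bytes for `crops` crops, each with `samplesPerCrop` samples
  // of `basesPerSample` base pairs.
  def rawBytes(crops: Long, samplesPerCrop: Long, basesPerSample: Long): Long =
    crops * samplesPerCrop * basesPerSample * BytesPerBase

  // Decimal terabytes (1 TB = 10^12 bytes).
  def toTerabytes(bytes: Long): Double = bytes / 1e12
}
```

Under these assumptions, one crop with 1K samples of 1 Gbp each comes to `GenomeVolume.toTerabytes(GenomeVolume.rawBytes(1, 1000, 1000000000L))`, which is 1.0 TB, and 100 crops with 1K samples of 10 Gbp each come to 1000 TB, that is, one petabyte. That is consistent with the slide's "several terabytes, aiming at petabytes".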
  • 19. Working with Spark and Scala ● Scala’s type system is your friend ● Thinking functional takes time - and can be “overdone” ● Remember to add @tailrec when needed ● Scala case classes - great ● Nested structure: keeps you DRY, but sluggish. ● Scala has its pitfalls - profile. ● Spark as the “ultimate scala collection” - Martin Odersky.
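Two of the tips on slide 19, case classes and `@tailrec`, can be shown together in a small sketch. The example counts bases in a sequence. `BaseCounting`, `BaseCounts` and `count` are hypothetical names chosen for illustration, not code from the talk. The `@tailrec` annotation does not change how the code runs; it makes the compiler reject the method if the recursive call is not in tail position, so a refactoring cannot silently introduce stack growth.

```scala
import scala.annotation.tailrec

object BaseCounting {
  // Immutable record of how often each base occurs.
  final case class BaseCounts(a: Long = 0, c: Long = 0, g: Long = 0, t: Long = 0)

  def count(seq: String): BaseCounts = {
    // Compilation fails here if `loop` is ever changed so that it is
    // no longer tail-recursive.
    @tailrec
    def loop(i: Int, acc: BaseCounts): BaseCounts =
      if (i >= seq.length) acc
      else {
        val next = seq.charAt(i) match {
          case 'A' => acc.copy(a = acc.a + 1)
          case 'C' => acc.copy(c = acc.c + 1)
          case 'G' => acc.copy(g = acc.g + 1)
          case 'T' => acc.copy(t = acc.t + 1)
          case _   => acc // skip anything else, e.g. 'N' for an unknown base
        }
        loop(i + 1, next)
      }

    loop(0, BaseCounts())
  }
}
```

`BaseCounting.count("ACGTNNA")` returns `BaseCounts(2, 1, 1, 1)`. Because the case class provides structural equality and `copy`, the result can be compared and updated without any hand-written boilerplate.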
  • 20. Integrations with Spark ● Complex unmanaged framework - the usual 20/80 rule: 20% fun algorithmic stuff, 80% integration/devops/tuning/black-voodoo ● Integration with Hive - doable but cumbersome ● DataFrames API - very clean ● Parquet in Spark 1.4 - seamless, Parquet with SparkSQL < 1.3 - rather sucks.
  • 21. Performance tuning with Spark ● If RDD objects need high RAM → memory gets tricky. ● Spark UI in 1.4.1 - very nice ● PairRDD - need to be your own “query optimizer” ● repartition / coalesce - very useful, but gets tricky if data variability is high (a dynamic real-time optimizer would be great).
  • 22. Testing, packaging and extending Spark ● Testing: “local” is great, but means no unit-test :-( ● sbt-pack - good alternative to sbt-assembly. ● Spark packages: spark-csv, spark-notebook and more. ● Speaking of open-source packages...
  • 23. ADAM Project - Genomics using Spark ● Fully open sourced from ● Similarity algorithms ● Population clustering ● Predictive analysis using Deep Learning ● And more
  • 24. Spark Meetup, December 2015 Noam Barkai noamb@nrgene.com Thank you