SlideShare a Scribd company logo
Spark tutorial, developing
locally and deploying on EMR
Use cases (my biased opinion)
• Interactive and Expressive Data Analysis
• If you feel limited when trying to express yourself in “group by”, “join” and
“where”
• Only if it is not possible to work with datasets locally
• Entering Danger Zone:
• Spark SQL engine, like Impala/Hive
• Speed up ETLs if your data can fit in memory (speculation)
• Machine learning
• Graph analytics
• Streaming (not mature yet)
Possible working styles
• Develop in IDE
• Develop as you go in Spark shell
IDE Spark-shell
Easier to manipulate with objects,
inheritance, package management
Easier to debug code with production
scale data
Requires some hacking to get programs
run on both Windows and Prod
environments
Will only run on Windows if you have
correct line endings in spark-shell
launcher scripts or use Cygwin
IntelliJ IDEA
• Basic set up https://gitz.adform.com/dspr/audience-
extension/tree/38b4b0588902457677f985caf6eb356e037a668c/spar
k-skeleton
Hacks
• 99% chance that on Windows you won’t be able to use function
`saveAsTextFile()`
• Download exe file from
http://stackoverflow.com/questions/19620642/failed-to-locate-the-
winutils-binary-in-the-hadoop-binary-path
• Place it somewhere on your PC in bin folder
(C:somewherebinwinutils.exe) and set in your code before using
save function
System.setProperty("hadoop.home.dir", "C:somewhere")
When you are done with your code…
• It is time to package everything to fat jar with sbt assembly
• Add “provided” to library dependencies, since spark libs are already in
the classpath if you run job on emr with spark already set-up
• Find more info in Audience Extension project Spark branch build.sbt
file.
libraryDependencies += "org.apache.spark" %% "spark-core" %
"1.2.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib" %
"1.2.0" % "provided"
Running on EMR
• build.sbt can be configured (S3 package) to upload fat jar to s3 when
it is done with assembly, if you don’t have that just upload it manually
• Run bootstrap action s3://support.elasticmapreduce/spark/install-
spark with arguments -v 1.2.0.a -x –g (some documentation in
https://github.com/awslabs/emr-bootstrap-
actions/tree/master/spark)
• Also install ganglia for monitoring cluster load (run this before spark
bootstrap step)
• If you don’t install ganglia ssh tunnels to spark UI won’t work.
Start with local mode first
Use only one instance in cluster, submit your jar with this:
/home/hadoop/spark/bin/spark-submit 
--class com.adform.dspr.SimilarityJob 
--master local[16] 
--driver-memory 4G 
--conf spark.default.parallelism=112
SimilarityJob.jar 
--remote 
--input s3://adform-dsp-warehouse/data/facts/impressions/dt=20150109/* 
--output s3://dev-adform-data-engineers/tmp/spark/2days 
--similarity-threshold 300
Run on multiple machines with yarn master
/home/hadoop/spark/bin/spark-submit 
--class com.adform.dspr.SimilarityJob 
--master yarn 
--deploy-mode client  #or cluster
--num-executors 7 
--executor-memory 116736 M 
--executor-cores 16 
--conf spark.default.parallelism=112 
--conf spark.task.maxFailures=4 
SimilarityJob.jar 
--remote 
… … …
Executor parameters are optional, bootstrap
script will automatically try to maximize spark
configuration options. Note that scripts are
not aware of tasks that you are doing, they
only read emr cluster specifications.
Spark UI
• Need to set up ssh tunnel to use access it from your PC
• Alternative is to use command line browser lynx
• When you submit app with local master UI will be in ip:4040
• When you submit with Yarn master, go to Hadoop UI on port 9026, it
will have Spark task running, click on ApplicationMaster in Tracking UI
column, or get UI url from command line when you submit task
Spark UI
For spark 1.2.0 Executors tab is wrong, storage is always empty, only useful tabs
are Jobs, Stages and Environment.
Some useful settings
• spark.hadoop.validateOutputSpecs useful when developing, set to
false, then you can overwrite output files
• spark.default.parallelism (number of output files / number of cores),
automatically configured when you run bootstrap actions with -x
option
• spark.shuffle.consolidateFiles (default false)
• spark.rdd.compress (default false)
• spark.akka.timeout, spark.akka.frameSize, spark.speculation, …
• http://spark.apache.org/docs/1.2.0/configuration.html
Spark shell
/home/hadoop/spark/bin/spark-shell 
--master <yarn|local[*]> 
--deploy-mode client 
--num-executors 7 
--executor-memory 4G 
--executor-cores 16 
--driver-memory 4G
--conf spark.default.parallelism=112 
--conf spark.task.maxFailures=4
Spark shell
• In spark shell you don’t need to instantiate spark context, it is already
intantiated, but you can create another one if you like
• Type scala expressions and see what is happening
• Note the lazy evaluation, to force expression evaluation fore
debugging use action functions like [expression].take(n) or
[expression].count to see if your statements are OK
Summary
• Spark is better suited for developing in Linux
• Don’t trust Amazon bootstrap scripts, check if your application is
utilizing resources with Ganglia
• Try to write scala code in a way that it is possible to run parts of it in
spark-shell, otherwise it is hard to debug problems which occur only
at production dataset scale.

More Related Content

What's hot

とりあえず使うScalaz
とりあえず使うScalazとりあえず使うScalaz
とりあえず使うScalaz
Shuya Tsukamoto
 
Akka Actor presentation
Akka Actor presentationAkka Actor presentation
Akka Actor presentation
Gene Chang
 
Full Stack Scala
Full Stack ScalaFull Stack Scala
Full Stack Scala
Ramnivas Laddad
 
"Walk in a distributed systems park with Orleans" Евгений Бобров
"Walk in a distributed systems park with Orleans" Евгений Бобров"Walk in a distributed systems park with Orleans" Евгений Бобров
"Walk in a distributed systems park with Orleans" Евгений Бобров
Fwdays
 
Extending ansible
Extending ansibleExtending ansible
Extending ansible
Yan Kurniawan
 
Scala.js - yet another what..?
Scala.js - yet another what..?Scala.js - yet another what..?
Scala.js - yet another what..?
Artur Skowroński
 
First glance at Akka 2.0
First glance at Akka 2.0First glance at Akka 2.0
First glance at Akka 2.0
Vasil Remeniuk
 
Curator intro
Curator introCurator intro
Curator intro
Jordan Zimmerman
 
Akka in Practice: Designing Actor-based Applications
Akka in Practice: Designing Actor-based ApplicationsAkka in Practice: Designing Actor-based Applications
Akka in Practice: Designing Actor-based ApplicationsNLJUG
 
Real-time search in Drupal. Meet Elasticsearch
Real-time search in Drupal. Meet ElasticsearchReal-time search in Drupal. Meet Elasticsearch
Real-time search in Drupal. Meet ElasticsearchAlexei Gorobets
 
ChefConf 2014 - AWS OpsWorks Under The Hood
ChefConf 2014 - AWS OpsWorks Under The HoodChefConf 2014 - AWS OpsWorks Under The Hood
ChefConf 2014 - AWS OpsWorks Under The Hood
Jonathan Weiss
 
Go database/sql
Go database/sqlGo database/sql
Go database/sql
Artem Kovardin
 
20150627 bigdatala
20150627 bigdatala20150627 bigdatala
20150627 bigdatala
gethue
 
Introduction to Akka - Atlanta Java Users Group
Introduction to Akka - Atlanta Java Users GroupIntroduction to Akka - Atlanta Java Users Group
Introduction to Akka - Atlanta Java Users Group
Roy Russo
 
Terraform day02
Terraform day02Terraform day02
Terraform day02
Gourav Varma
 
Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesUnderstanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
Lightbend
 
What's New in Apache Solr 4.10
What's New in Apache Solr 4.10What's New in Apache Solr 4.10
What's New in Apache Solr 4.10
Anshum Gupta
 
Scala.js for large and complex frontend apps
Scala.js for large and complex frontend appsScala.js for large and complex frontend apps
Scala.js for large and complex frontend apps
Otto Chrons
 
Building an aws sdk for Perl - Granada Perl Workshop 2014
Building an aws sdk for Perl - Granada Perl Workshop 2014Building an aws sdk for Perl - Granada Perl Workshop 2014
Building an aws sdk for Perl - Granada Perl Workshop 2014
Jose Luis Martínez
 

What's hot (20)

とりあえず使うScalaz
とりあえず使うScalazとりあえず使うScalaz
とりあえず使うScalaz
 
Akka Actor presentation
Akka Actor presentationAkka Actor presentation
Akka Actor presentation
 
Full Stack Scala
Full Stack ScalaFull Stack Scala
Full Stack Scala
 
"Walk in a distributed systems park with Orleans" Евгений Бобров
"Walk in a distributed systems park with Orleans" Евгений Бобров"Walk in a distributed systems park with Orleans" Евгений Бобров
"Walk in a distributed systems park with Orleans" Евгений Бобров
 
Extending ansible
Extending ansibleExtending ansible
Extending ansible
 
Scala.js - yet another what..?
Scala.js - yet another what..?Scala.js - yet another what..?
Scala.js - yet another what..?
 
First glance at Akka 2.0
First glance at Akka 2.0First glance at Akka 2.0
First glance at Akka 2.0
 
Curator intro
Curator introCurator intro
Curator intro
 
Akka in Practice: Designing Actor-based Applications
Akka in Practice: Designing Actor-based ApplicationsAkka in Practice: Designing Actor-based Applications
Akka in Practice: Designing Actor-based Applications
 
Real-time search in Drupal. Meet Elasticsearch
Real-time search in Drupal. Meet ElasticsearchReal-time search in Drupal. Meet Elasticsearch
Real-time search in Drupal. Meet Elasticsearch
 
ChefConf 2014 - AWS OpsWorks Under The Hood
ChefConf 2014 - AWS OpsWorks Under The HoodChefConf 2014 - AWS OpsWorks Under The Hood
ChefConf 2014 - AWS OpsWorks Under The Hood
 
Go database/sql
Go database/sqlGo database/sql
Go database/sql
 
20150627 bigdatala
20150627 bigdatala20150627 bigdatala
20150627 bigdatala
 
Introduction to Akka - Atlanta Java Users Group
Introduction to Akka - Atlanta Java Users GroupIntroduction to Akka - Atlanta Java Users Group
Introduction to Akka - Atlanta Java Users Group
 
Wider than rails
Wider than railsWider than rails
Wider than rails
 
Terraform day02
Terraform day02Terraform day02
Terraform day02
 
Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesUnderstanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
 
What's New in Apache Solr 4.10
What's New in Apache Solr 4.10What's New in Apache Solr 4.10
What's New in Apache Solr 4.10
 
Scala.js for large and complex frontend apps
Scala.js for large and complex frontend appsScala.js for large and complex frontend apps
Scala.js for large and complex frontend apps
 
Building an aws sdk for Perl - Granada Perl Workshop 2014
Building an aws sdk for Perl - Granada Perl Workshop 2014Building an aws sdk for Perl - Granada Perl Workshop 2014
Building an aws sdk for Perl - Granada Perl Workshop 2014
 

Similar to Spark Intro by Adform Research

Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Olalekan Fuad Elesin
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
Evan Chan
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
Spark Summit
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
Adarsh Pannu
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in spark
Peng Cheng
 
Spark 101
Spark 101Spark 101
Spark 101
Mohit Garg
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
IBM
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
Amazon Web Services
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computing
Demi Ben-Ari
 
Scalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovScalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovVasil Remeniuk
 
Incorta spark integration
Incorta spark integrationIncorta spark integration
Incorta spark integration
Dylan Wan
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Debugging Apache Spark - Scala & Python super happy fun times 2017
Debugging Apache Spark -   Scala & Python super happy fun times 2017Debugging Apache Spark -   Scala & Python super happy fun times 2017
Debugging Apache Spark - Scala & Python super happy fun times 2017
Holden Karau
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Spark Summit
 
Hadoop spark online demo
Hadoop spark online demoHadoop spark online demo
Hadoop spark online demo
Tripti Jha
 
Spark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scaleSpark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scale
Mateusz Dymczyk
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
Rafal Kwasny
 
Spark Working Environment in Windows OS
Spark Working Environment in Windows OSSpark Working Environment in Windows OS
Spark Working Environment in Windows OS
Universiti Technologi Malaysia (UTM)
 

Similar to Spark Intro by Adform Research (20)

Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in spark
 
Spark 101
Spark 101Spark 101
Spark 101
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computing
 
Scalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovScalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex Gryzlov
 
Incorta spark integration
Incorta spark integrationIncorta spark integration
Incorta spark integration
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Debugging Apache Spark - Scala & Python super happy fun times 2017
Debugging Apache Spark -   Scala & Python super happy fun times 2017Debugging Apache Spark -   Scala & Python super happy fun times 2017
Debugging Apache Spark - Scala & Python super happy fun times 2017
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
 
Hadoop spark online demo
Hadoop spark online demoHadoop spark online demo
Hadoop spark online demo
 
Spark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scaleSpark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scale
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Spark Working Environment in Windows OS
Spark Working Environment in Windows OSSpark Working Environment in Windows OS
Spark Working Environment in Windows OS
 

More from Vasil Remeniuk

Product Minsk - РТБ и Программатик
Product Minsk - РТБ и ПрограмматикProduct Minsk - РТБ и Программатик
Product Minsk - РТБ и Программатик
Vasil Remeniuk
 
Работа с Akka Сluster, @afiskon, scalaby#14
Работа с Akka Сluster, @afiskon, scalaby#14Работа с Akka Сluster, @afiskon, scalaby#14
Работа с Akka Сluster, @afiskon, scalaby#14
Vasil Remeniuk
 
Cake pattern. Presentation by Alex Famin at scalaby#14
Cake pattern. Presentation by Alex Famin at scalaby#14Cake pattern. Presentation by Alex Famin at scalaby#14
Cake pattern. Presentation by Alex Famin at scalaby#14
Vasil Remeniuk
 
Scala laboratory: Globus. iteration #3
Scala laboratory: Globus. iteration #3Scala laboratory: Globus. iteration #3
Scala laboratory: Globus. iteration #3
Vasil Remeniuk
 
Testing in Scala by Adform research
Testing in Scala by Adform researchTesting in Scala by Adform research
Testing in Scala by Adform research
Vasil Remeniuk
 
Types by Adform Research, Saulius Valatka
Types by Adform Research, Saulius ValatkaTypes by Adform Research, Saulius Valatka
Types by Adform Research, Saulius ValatkaVasil Remeniuk
 
Types by Adform Research
Types by Adform ResearchTypes by Adform Research
Types by Adform ResearchVasil Remeniuk
 
Scalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovScalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovVasil Remeniuk
 
Spark by Adform Research, Paulius
Spark by Adform Research, PauliusSpark by Adform Research, Paulius
Spark by Adform Research, PauliusVasil Remeniuk
 
Scala Style by Adform Research (Saulius Valatka)
Scala Style by Adform Research (Saulius Valatka)Scala Style by Adform Research (Saulius Valatka)
Scala Style by Adform Research (Saulius Valatka)Vasil Remeniuk
 
SBT by Aform Research, Saulius Valatka
SBT by Aform Research, Saulius ValatkaSBT by Aform Research, Saulius Valatka
SBT by Aform Research, Saulius ValatkaVasil Remeniuk
 
Scala laboratory: Globus. iteration #2
Scala laboratory: Globus. iteration #2Scala laboratory: Globus. iteration #2
Scala laboratory: Globus. iteration #2
Vasil Remeniuk
 
Scala laboratory. Globus. iteration #1
Scala laboratory. Globus. iteration #1Scala laboratory. Globus. iteration #1
Scala laboratory. Globus. iteration #1
Vasil Remeniuk
 
Cassandra + Spark + Elk
Cassandra + Spark + ElkCassandra + Spark + Elk
Cassandra + Spark + Elk
Vasil Remeniuk
 
Опыт использования Spark, Основано на реальных событиях
Опыт использования Spark, Основано на реальных событияхОпыт использования Spark, Основано на реальных событиях
Опыт использования Spark, Основано на реальных событиях
Vasil Remeniuk
 
ETL со Spark
ETL со SparkETL со Spark
ETL со Spark
Vasil Remeniuk
 
Funtional Reactive Programming with Examples in Scala + GWT
Funtional Reactive Programming with Examples in Scala + GWTFuntional Reactive Programming with Examples in Scala + GWT
Funtional Reactive Programming with Examples in Scala + GWT
Vasil Remeniuk
 
Vaadin+Scala
Vaadin+ScalaVaadin+Scala
Vaadin+Scala
Vasil Remeniuk
 
[Не]практичные типы
[Не]практичные типы[Не]практичные типы
[Не]практичные типы
Vasil Remeniuk
 
Зачем нужна Scala?
Зачем нужна Scala?Зачем нужна Scala?
Зачем нужна Scala?
Vasil Remeniuk
 

More from Vasil Remeniuk (20)

Product Minsk - РТБ и Программатик
Product Minsk - РТБ и ПрограмматикProduct Minsk - РТБ и Программатик
Product Minsk - РТБ и Программатик
 
Работа с Akka Сluster, @afiskon, scalaby#14
Работа с Akka Сluster, @afiskon, scalaby#14Работа с Akka Сluster, @afiskon, scalaby#14
Работа с Akka Сluster, @afiskon, scalaby#14
 
Cake pattern. Presentation by Alex Famin at scalaby#14
Cake pattern. Presentation by Alex Famin at scalaby#14Cake pattern. Presentation by Alex Famin at scalaby#14
Cake pattern. Presentation by Alex Famin at scalaby#14
 
Scala laboratory: Globus. iteration #3
Scala laboratory: Globus. iteration #3Scala laboratory: Globus. iteration #3
Scala laboratory: Globus. iteration #3
 
Testing in Scala by Adform research
Testing in Scala by Adform researchTesting in Scala by Adform research
Testing in Scala by Adform research
 
Types by Adform Research, Saulius Valatka
Types by Adform Research, Saulius ValatkaTypes by Adform Research, Saulius Valatka
Types by Adform Research, Saulius Valatka
 
Types by Adform Research
Types by Adform ResearchTypes by Adform Research
Types by Adform Research
 
Scalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex GryzlovScalding by Adform Research, Alex Gryzlov
Scalding by Adform Research, Alex Gryzlov
 
Spark by Adform Research, Paulius
Spark by Adform Research, PauliusSpark by Adform Research, Paulius
Spark by Adform Research, Paulius
 
Scala Style by Adform Research (Saulius Valatka)
Scala Style by Adform Research (Saulius Valatka)Scala Style by Adform Research (Saulius Valatka)
Scala Style by Adform Research (Saulius Valatka)
 
SBT by Aform Research, Saulius Valatka
SBT by Aform Research, Saulius ValatkaSBT by Aform Research, Saulius Valatka
SBT by Aform Research, Saulius Valatka
 
Scala laboratory: Globus. iteration #2
Scala laboratory: Globus. iteration #2Scala laboratory: Globus. iteration #2
Scala laboratory: Globus. iteration #2
 
Scala laboratory. Globus. iteration #1
Scala laboratory. Globus. iteration #1Scala laboratory. Globus. iteration #1
Scala laboratory. Globus. iteration #1
 
Cassandra + Spark + Elk
Cassandra + Spark + ElkCassandra + Spark + Elk
Cassandra + Spark + Elk
 
Опыт использования Spark, Основано на реальных событиях
Опыт использования Spark, Основано на реальных событияхОпыт использования Spark, Основано на реальных событиях
Опыт использования Spark, Основано на реальных событиях
 
ETL со Spark
ETL со SparkETL со Spark
ETL со Spark
 
Funtional Reactive Programming with Examples in Scala + GWT
Funtional Reactive Programming with Examples in Scala + GWTFuntional Reactive Programming with Examples in Scala + GWT
Funtional Reactive Programming with Examples in Scala + GWT
 
Vaadin+Scala
Vaadin+ScalaVaadin+Scala
Vaadin+Scala
 
[Не]практичные типы
[Не]практичные типы[Не]практичные типы
[Не]практичные типы
 
Зачем нужна Scala?
Зачем нужна Scala?Зачем нужна Scala?
Зачем нужна Scala?
 

Recently uploaded

State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 

Recently uploaded (20)

State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 

Spark Intro by Adform Research

  • 1. Spark tutorial, developing locally and deploying on EMR
  • 2. Use cases (my biased opinion) • Interactive and Expressive Data Analysis • If you feel limited when trying to express yourself in “group by”, “join” and “where” • Only if it is not possible to work with datasets locally • Entering Danger Zone: • Spark SQL engine, like Impala/Hive • Speed up ETLs if your data can fit in memory (speculation) • Machine learning • Graph analytics • Streaming (not mature yet)
  • 3. Possible working styles • Develop in IDE • Develop as you go in Spark shell IDE Spark-shell Easier to manipulate with objects, inheritance, package management Easier to debug code with production scale data Requires some hacking to get programs run on both Windows and Prod environments Will only run on Windows if you have correct line endings in spark-shell launcher scripts or use Cygwin
  • 4. IntelliJ IDEA • Basic set up https://gitz.adform.com/dspr/audience- extension/tree/38b4b0588902457677f985caf6eb356e037a668c/spar k-skeleton
  • 5. Hacks • 99% chance that on Windows you won’t be able to use function `saveAsTextFile()` • Download exe file from http://stackoverflow.com/questions/19620642/failed-to-locate-the- winutils-binary-in-the-hadoop-binary-path • Place it somewhere on your PC in bin folder (C:somewherebinwinutils.exe) and set in your code before using save function System.setProperty("hadoop.home.dir", "C:somewhere")
  • 6. When you are done with your code… • It is time to package everything to fat jar with sbt assembly • Add “provided” to library dependencies, since spark libs are already in the classpath if you run job on emr with spark already set-up • Find more info in Audience Extension project Spark branch build.sbt file. libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided" libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.2.0" % "provided"
  • 7. Running on EMR • build.sbt can be configured (S3 package) to upload fat jar to s3 when it is done with assembly, if you don’t have that just upload it manually • Run bootstrap action s3://support.elasticmapreduce/spark/install- spark with arguments -v 1.2.0.a -x –g (some documentation in https://github.com/awslabs/emr-bootstrap- actions/tree/master/spark) • Also install ganglia for monitoring cluster load (run this before spark bootstrap step) • If you don’t install ganglia ssh tunnels to spark UI won’t work.
  • 8. Start with local mode first Use only one instance in cluster, submit your jar with this: /home/hadoop/spark/bin/spark-submit --class com.adform.dspr.SimilarityJob --master local[16] --driver-memory 4G --conf spark.default.parallelism=112 SimilarityJob.jar --remote --input s3://adform-dsp-warehouse/data/facts/impressions/dt=20150109/* --output s3://dev-adform-data-engineers/tmp/spark/2days --similarity-threshold 300
  • 9. Run on multiple machines with yarn master /home/hadoop/spark/bin/spark-submit --class com.adform.dspr.SimilarityJob --master yarn --deploy-mode client #or cluster --num-executors 7 --executor-memory 116736 M --executor-cores 16 --conf spark.default.parallelism=112 --conf spark.task.maxFailures=4 SimilarityJob.jar --remote … … … Executor parameters are optional, bootstrap script will automatically try to maximize spark configuration options. Note that scripts are not aware of tasks that you are doing, they only read emr cluster specifications.
  • 10. Spark UI • Need to set up ssh tunnel to use access it from your PC • Alternative is to use command line browser lynx • When you submit app with local master UI will be in ip:4040 • When you submit with Yarn master, go to Hadoop UI on port 9026, it will have Spark task running, click on ApplicationMaster in Tracking UI column, or get UI url from command line when you submit task
  • 11. Spark UI For spark 1.2.0 Executors tab is wrong, storage is always empty, only useful tabs are Jobs, Stages and Environment.
  • 12. Some useful settings • spark.hadoop.validateOutputSpecs useful when developing, set to false, then you can overwrite output files • spark.default.parallelism (number of output files / number of cores), automatically configured when you run bootstrap actions with -x option • spark.shuffle.consolidateFiles (default false) • spark.rdd.compress (default false) • spark.akka.timeout, spark.akka.frameSize, spark.speculation, … • http://spark.apache.org/docs/1.2.0/configuration.html
  • 13. Spark shell /home/hadoop/spark/bin/spark-shell --master <yarn|local[*]> --deploy-mode client --num-executors 7 --executor-memory 4G --executor-cores 16 --driver-memory 4G --conf spark.default.parallelism=112 --conf spark.task.maxFailures=4
  • 14. Spark shell • In spark shell you don’t need to instantiate spark context, it is already intantiated, but you can create another one if you like • Type scala expressions and see what is happening • Note the lazy evaluation, to force expression evaluation fore debugging use action functions like [expression].take(n) or [expression].count to see if your statements are OK
  • 15. Summary • Spark is better suited for developing in Linux • Don’t trust Amazon bootstrap scripts, check if your application is utilizing resources with Ganglia • Try to write scala code in a way that it is possible to run parts of it in spark-shell, otherwise it is hard to debug problems which occur only at production dataset scale.