2. Use cases (my biased opinion)
• Interactive and Expressive Data Analysis
• If you feel limited when trying to express yourself in “group by”, “join” and “where” (see the sketch after this list)
• Only if it is not possible to work with datasets locally
• Entering Danger Zone:
• Spark SQL engine, like Impala/Hive
• Speed up ETLs if your data can fit in memory (speculation)
• Machine learning
• Graph analytics
• Streaming (not mature yet)
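As a hypothetical illustration of the expressiveness point, here is a minimal Scala sketch (the click log, field layout and 300-second session gap are all made up) of an analysis that is awkward in plain group by/join/where but natural with RDD operations:

import org.apache.spark.{SparkConf, SparkContext}

object ExpressiveAnalysisSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))

    // Hypothetical (userId, timestamp) click log.
    val clicks = sc.parallelize(Seq(("u1", 100L), ("u1", 160L), ("u1", 900L), ("u2", 50L)))

    // Per-user sessionization: sort each user's timestamps and start a new
    // session whenever the gap exceeds 300 seconds. Easy with groupByKey plus
    // arbitrary Scala code, painful in pure SQL.
    val sessionCounts = clicks.groupByKey().mapValues { ts =>
      val sorted = ts.toSeq.sorted
      1 + sorted.zip(sorted.drop(1)).count { case (a, b) => b - a > 300L }
    }

    sessionCounts.collect().foreach(println) // (u1,2), (u2,1)
    sc.stop()
  }
}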
3. Possible working styles
• Develop in IDE
• Develop as you go in Spark shell
IDE:
• Easier to work with objects, inheritance and package management
• Requires some hacking to get programs running on both Windows and prod environments

Spark-shell:
• Easier to debug code with production-scale data
• Will only run on Windows if you have correct line endings in the spark-shell launcher scripts, or use Cygwin
4. IntelliJ IDEA
• Basic set-up: https://gitz.adform.com/dspr/audience-extension/tree/38b4b0588902457677f985caf6eb356e037a668c/spark-skeleton
5. Hacks
• There is a 99% chance that on Windows you won’t be able to use the function `saveAsTextFile()` out of the box
• Download the winutils.exe file from http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path
• Place it in a bin folder somewhere on your PC (C:\somewhere\bin\winutils.exe) and set the property in your code before using the save function:
System.setProperty("hadoop.home.dir", "C:\\somewhere")
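A minimal sketch of how the pieces fit together (paths and the job body are hypothetical; the property must be set before the SparkContext touches Hadoop’s filesystem code):

import org.apache.spark.{SparkConf, SparkContext}

object WindowsSaveSketch {
  def main(args: Array[String]): Unit = {
    // Point at the folder that *contains* bin\winutils.exe, not at the exe itself.
    System.setProperty("hadoop.home.dir", "C:\\somewhere")

    val sc = new SparkContext(new SparkConf().setAppName("save-test").setMaster("local[2]"))
    sc.parallelize(1 to 100)
      .map(i => "line " + i)
      .saveAsTextFile("C:\\tmp\\spark-output") // fails on Windows without winutils.exe
    sc.stop()
  }
}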
6. When you are done with your code…
• It is time to package everything into a fat JAR with sbt assembly (see the sketch at the end of this slide)
• Add “provided” to the Spark library dependencies, since the Spark libs are already on the classpath when you run the job on EMR with Spark already set up
• Find more info in the Audience Extension project’s Spark branch build.sbt file.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.2.0" % "provided"
7. Running on EMR
• build.sbt can be configured (S3 package) to upload the fat jar to S3 when assembly is done; if you don’t have that, just upload it manually
• Run the bootstrap action s3://support.elasticmapreduce/spark/install-spark with the arguments -v 1.2.0.a -x -g (some documentation at https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark)
• Also install Ganglia for monitoring cluster load (run this before the Spark bootstrap step)
• If you don’t install Ganglia, SSH tunnels to the Spark UI won’t work.
8. Start with local mode first
Use only one instance in the cluster and submit your jar with this:
/home/hadoop/spark/bin/spark-submit \
  --class com.adform.dspr.SimilarityJob \
  --master local[16] \
  --driver-memory 4G \
  --conf spark.default.parallelism=112 \
  SimilarityJob.jar \
  --remote \
  --input s3://adform-dsp-warehouse/data/facts/impressions/dt=20150109/* \
  --output s3://dev-adform-data-engineers/tmp/spark/2days \
  --similarity-threshold 300
9. Run on multiple machines with yarn master
/home/hadoop/spark/bin/spark-submit \
  --class com.adform.dspr.SimilarityJob \
  --master yarn \
  --deploy-mode client \
  --num-executors 7 \
  --executor-memory 116736M \
  --executor-cores 16 \
  --conf spark.default.parallelism=112 \
  --conf spark.task.maxFailures=4 \
  SimilarityJob.jar \
  --remote \
  … … …
--deploy-mode can be either client or cluster. The executor parameters are optional; the bootstrap script will automatically try to maximize the Spark configuration options. Note that the scripts are not aware of the tasks you are running; they only read the EMR cluster specifications.
10. Spark UI
• You need to set up an SSH tunnel to access it from your PC
• An alternative is to use the command-line browser lynx
• When you submit an app with the local master, the UI will be at ip:4040
• When you submit with the YARN master, go to the Hadoop UI on port 9026; it will show the Spark task running. Click on ApplicationMaster in the Tracking UI column, or get the UI URL from the command-line output when you submit the task
11. Spark UI
For Spark 1.2.0 the Executors tab is wrong and Storage is always empty; the only useful tabs are Jobs, Stages and Environment.
12. Some useful settings
• spark.hadoop.validateOutputSpecs: useful when developing; set it to false so that you can overwrite output files (see the sketch after this list)
• spark.default.parallelism (number of output files / number of cores); automatically configured when you run the bootstrap action with the -x option
• spark.shuffle.consolidateFiles (default false)
• spark.rdd.compress (default false)
• spark.akka.timeout, spark.akka.frameSize, spark.speculation, …
• http://spark.apache.org/docs/1.2.0/configuration.html
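A minimal sketch (app name and values are illustrative) of setting these options programmatically on a SparkConf instead of on the spark-submit command line:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("settings-sketch")
  .set("spark.hadoop.validateOutputSpecs", "false") // allow overwriting output dirs while developing
  .set("spark.default.parallelism", "112")          // tune to cores / desired output file count
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.rdd.compress", "true")

val sc = new SparkContext(conf)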
14. Spark shell
• In the Spark shell you don’t need to instantiate a Spark context; it is already instantiated as sc, but you can create another one if you like
• Type Scala expressions and see what happens
• Note the lazy evaluation: to force expression evaluation for debugging, use action functions like [expression].take(n) or [expression].count to see if your statements are OK (see the sketch below)
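For example, a short spark-shell session illustrating the lazy/action split (the numbers are just toy data):

// Nothing is computed here: map and filter are lazy transformations.
val squares = sc.parallelize(1 to 1000).map(n => n * n)
val big = squares.filter(_ > 500000)

// Actions force evaluation, so mistakes in the pipeline only surface here.
big.take(5) // Array(501264, 502681, 504100, 505521, 506944)
big.count() // 293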
15. Summary
• Spark is better suited to development on Linux
• Don’t trust the Amazon bootstrap scripts; use Ganglia to check whether your application is actually utilizing the cluster’s resources
• Try to write your Scala code in a way that makes it possible to run parts of it in spark-shell; otherwise it is hard to debug problems which occur only at production dataset scale.