3.
About me
• Sysadmin/DevOps background
• Worked as DevOps @Visualdna
• Now building a game analytics platform
@Sony Computer Entertainment Europe
4.
Outline
• What is ETL
• How do we do it in the standard Hadoop stack
• How can we supercharge it with Spark
• Real-life use cases
• How to deploy Spark
• Lessons learned
10.
Hadoop
• Industry standard
• Have you ever looked at Hadoop code and
tried to fix something?
11.
How simple is simple?
”Simple YARN application to run n copies of a unix command -
deliberately kept simple (with minimal error handling etc.)”
➜ $ git clone https://github.com/hortonworks/simple-yarn-app.git
(…)
➜ $ find simple-yarn-app -name "*.java" | xargs cat | wc -l
232
12.
ETL Workflow
• Get some data from S3/HDFS
• Map
• Shuffle
• Reduce
• Save to S3/HDFS
13.
ETL Workflow
• Get some data from S3/HDFS
• Map
• Shuffle
• Reduce
• Save to S3/HDFS
Repeat 10 times
14.
Issue: Test run time
• Job startup time: ~20s to run a job that does nothing
• Hard to test the code without a cluster (Cascading
simulation mode != real life)
15.
Issue: new applications
MapReduce is awkward for key big data workloads:
• Low latency dispatch (e.g. quick queries)
• Iterative algorithms (e.g. ML, graph…)
• Streaming data ingest
16.
Issue: hardware is moving on
Hardware has advanced since Hadoop started:
• Very large RAM, faster networks (10Gb+)
• Bandwidth to disk not keeping up
• 1 GB of RAM ~ $0.75/month *
* based on the spot price of an AWS r3.8xlarge instance
18.
Use Spark
• Fast and Expressive Cluster Computing Engine
• Compatible with Apache Hadoop
• In-memory storage
• Rich APIs in Java, Scala, Python
19.
Why Spark?
• Up to 40x faster than Hadoop MapReduce
(for some use cases, see: https://amplab.cs.berkeley.edu/benchmark/)
• Jobs can be scheduled and run in <1s
• Typically less code (2-5x)
• Seamless Hadoop/HDFS integration
• REPL (see the snippet below)
• Accessible source code in terms of LOC and modularity
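As a hedged illustration of "less code" and the REPL: a word count typed straight into spark-shell (bucket and paths are placeholders, not from the deck):

val lines = sc.textFile("s3://bucket/input/") // read text from S3 or HDFS
val counts = lines.flatMap(_.split(" ")) // split lines into words
  .map(word => (word, 1)) // emit (word, 1) pairs
  .reduceByKey(_ + _) // sum the counts per word
counts.saveAsTextFile("s3://bucket/output/") // write the result back

The equivalent plain MapReduce job usually needs a mapper class, a reducer class and driver boilerplate.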
20.
Why Spark?
• Berkeley Data Analytics Stack ecosystem:
• Spark, Spark Streaming, Shark, BlinkDB, MLlib
• Deep integration into Hadoop ecosystem
• Read/write Hadoop formats
• Interoperability with other ecosystem components
• Runs on Mesos & YARN, also MR1
• EC2, EMR
• HDFS, S3
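To make the "read/write Hadoop formats" and HDFS/S3 bullets concrete, a minimal sketch, assuming the implicits from org.apache.spark.SparkContext._ are in scope (spark-shell imports them automatically); bucket names and paths are invented:

import org.apache.spark.SparkContext._ // implicit conversions for Hadoop formats and pair RDDs

val events = sc.textFile("hdfs:///data/events/2014-05-21/") // plain text on HDFS
val counts = sc.sequenceFile[String, Long]("s3://bucket/daily-counts/") // Hadoop SequenceFile on S3
events.saveAsTextFile("s3://bucket/events-archive/") // write text to S3
counts.saveAsSequenceFile("hdfs:///data/daily-counts-copy/") // write a Hadoop format to HDFS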
26.
Spark use-cases
• Next-generation ETL platform
• No more “multiple chained MapReduce jobs”
architecture (see the sketch below)
• Fewer jobs to worry about
• Better sleep for your DevOps team
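A hedged sketch of how the earlier get → map → shuffle → reduce → save workflow becomes a single Spark job instead of a chain of MapReduce jobs (the field layout and paths are invented for illustration):

import org.apache.spark.SparkContext._ // pair-RDD implicits (automatic in spark-shell)

// one job: read, two shuffle stages, write - no intermediate HDFS/S3 round-trips
val raw = sc.textFile("s3://bucket/raw-events/")
val perUser = raw.map(line => (line.split("\t")(0), 1L))
  .reduceByKey(_ + _) // first reduce: events per user
val histogram = perUser.map { case (_, n) => (n, 1L) }
  .reduceByKey(_ + _) // second reduce: how many users had n events
histogram.saveAsTextFile("s3://bucket/event-histogram/")

In classic MapReduce this would typically be two chained jobs, each materialising its intermediate output to S3/HDFS.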
37.
v1.0 - running on EC2
• Start with an EC2 script
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves>
--instance-type=c3.xlarge launch <cluster-name>
If it does not work for you, modify it; it’s just a simple
Python + boto script
38.
v2.0 - Autoscaling on spot instances
1x Master - on-demand (c3.large)
XX Slaves - spot instances depending on usage patterns (r3.*)
• No HDFS
• Persistence in memory + S3 (see the sketch below)
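A minimal sketch of the "persistence in memory + S3" pattern, assuming the working set fits in the slaves' RAM; the storage level choice and paths are illustrative, not from the deck:

import org.apache.spark.SparkContext._ // pair-RDD implicits (automatic in spark-shell)
import org.apache.spark.storage.StorageLevel

val sessions = sc.textFile("s3://bucket/sessions/2014-05-21/")
  .persist(StorageLevel.MEMORY_ONLY_SER) // keep a serialized copy in executor memory, no HDFS
val perTitle = sessions.map(line => (line.split("\t")(1), 1L))
  .reduceByKey(_ + _) // one of several passes over the cached data
perTitle.saveAsTextFile("s3://bucket/reports/sessions-per-title/") // the durable copy lives on S3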
41.
JVM issues
• java.lang.OutOfMemoryError: GC overhead limit exceeded
• Add more memory?
val sparkConf = new SparkConf()
.set("spark.executor.memory", "120g") // bigger executor heap
.set("spark.storage.memoryFraction", "0.3") // cap the RDD cache share of the heap
.set("spark.shuffle.memoryFraction", "0.3") // cap the shuffle share of the heap
• Increase parallelism:
sc.textFile("s3://..path", 10000) // ask for more input partitions
groupByKey(10000) // and pass a partition count to wide (shuffle) operations
42.
Full GC
2014-05-21T10:15:23.203+0000: 200.710: [Full GC 109G->45G(110G), 79.3771030 secs]
2014-05-21T10:16:42.580+0000: 280.087: Total time for which application threads were stopped: 79.3773830 seconds
We want to avoid this:
• Use G1GC + Java 8
• Store data serialized
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "scee.SceeKryoRegistrator")
43.
Bugs
• For example: CDH5 does not work with Amazon S3 out of the
box (thanks to Sean, it will be fixed in the next release)
• If in doubt use the provided ec2/spark-ec2 script
• ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves>
--instance-type=c3.xlarge launch <cluster-name>
44.
Tips & Tricks
• You do not need to package the whole of Spark with your app; just
mark the dependencies as "provided" in sbt (see the build.sbt sketch below)
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-cdh5.0.1" % "provided"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-cdh5.0.1" % "provided"
Assembly jar size goes from 120MB -> 5MB
• Always ensure you are compiling against the same version of
artifacts; if not, "bad things will happen"™
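As a hedged sketch, roughly how those lines sit in a complete build.sbt; the project name, version, Scala version and the use of an assembly plugin (e.g. sbt-assembly) are assumptions, not from the deck:

name := "etl-job" // hypothetical project name

version := "0.1"

scalaVersion := "2.10.4" // Scala line used by Spark 0.9.x

// "provided" keeps Spark and Hadoop out of the fat jar built by the assembly plugin;
// the cluster already ships them, so only your own code and remaining deps get packaged
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-cdh5.0.1" % "provided"

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-cdh5.0.1" % "provided"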
45.
Future - Spark 1.0
• Voting in progress to release Spark 1.0.0 RC11
• Spark SQL
• History server
• Job Submission Tool
• Java 8 support
46.
Spark - Hadoop done right
• Faster to run, less code to write
• Deploying Spark can be easy and cost-effective
• Still rough around the edges, but improving quickly