Lessons from Running Large Scale Spark Workloads
Reynold Xin, Matei Zaharia
Feb 19, 2015 @ Strata
About Databricks
Founded by the creators of Spark in 2013
Largest organization contributing to Spark
End-to-end hosted service, Databricks Cloud
A slide from 2013 …
Does Spark scale?
Does Spark scale? Yes!
On-Disk Sort Record: Time to sort 100TB
2013 Record (Hadoop): 2100 machines, 72 minutes
2014 Record (Spark): 207 machines, 23 minutes
Also sorted 1PB in 4 hours
Source: Daytona GraySort benchmark, sortbenchmark.org
Agenda
Spark “hall of fame”
Architectural improvements for scalability
Q&A
Spark “Hall of Fame”
LARGEST CLUSTER: Tencent (8000+ nodes)
LARGEST SINGLE-DAY INTAKE: Tencent (1PB+/day)
LONGEST-RUNNING JOB: Alibaba (1 week on 1PB+ data)
LARGEST SHUFFLE: Databricks PB Sort (1PB)
MOST INTERESTING APP: Jeremy Freeman, Mapping the Brain at Scale (with lasers!)
Based on Reynold’s personal knowledge
Largest Cluster & Daily Intake
800 million+ active users
8000+ nodes
150 PB+
1 PB+/day
Spark at Tencent
Images courtesy of Lianhui Wang
Ads CTR prediction
Similarity metrics on billions of nodes
Spark SQL for BI and ETL
…
[Diagram: Spark — SQL, streaming, machine learning, graph, ETL — running over HDFS, Hive, HBase, MySQL, PostgreSQL, and others]
See Lianhui’s session this afternoon
Longest Running Spark Job
Alibaba Taobao uses Spark to perform image feature extraction for product recommendation.
1 week runtime on petabytes of images!
Largest e-commerce site (800m products, 48k sold/min)
ETL, machine learning, graph computation, streaming
Alibaba Taobao: Extensive Use of GraphX
clustering (community detection)
belief propagation (influence & credibility)
collaborative filtering (recommendation)
Images courtesy of Andy Huang, Joseph Gonzales
See upcoming talk at Spark Summit on streaming graphs
Mapping the brain at scale
Images courtesy of Jeremy Freeman
Architectural Improvements
Elastic Scaling: improve resource utilization (1.2)
Revamped Shuffle: shuffle at scale (1.2)
DataFrames: easier high performance at scale (1.3)
Static Resource Allocation
[Chart: resources (CPU / mem) allocated vs. actually used over time, as new jobs arrive and stragglers finish]
More resources allocated than are used!
Resources wasted!
Elastic Scaling
[Chart: with elastic scaling, allocated resources (CPU / mem) track actual usage over time]
Elastic Scaling
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 3
spark.dynamicAllocation.maxExecutors 20
- Optional -
spark.dynamicAllocation.executorIdleTimeout
spark.dynamicAllocation.schedulerBacklogTimeout
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout
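The same settings can also be applied programmatically. A minimal Scala sketch, assuming a cluster where the external shuffle service is running (dynamic allocation needs it so shuffle files outlive released executors); the application name is made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}

// Enable elastic scaling (dynamic allocation) on the SparkConf instead of
// spark-defaults.conf. Values mirror the slide above.
val conf = new SparkConf()
  .setAppName("elastic-scaling-example")            // hypothetical app name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "3")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  // Required so shuffle data survives executors being released.
  .set("spark.shuffle.service.enabled", "true")

val sc = new SparkContext(conf)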
Shuffle at Scale
Sort-based shuffle
Netty-based transport
Sort-based Shuffle
Old hash-based shuffle: requires R (# reduce tasks) concurrent streams with buffers; limits R.
New sort-based shuffle: sort records by partition first, and then write them; one active stream at a time.
Went up to 250,000 reduce tasks in PB sort!
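To make the idea concrete, here is a toy Scala sketch of the sort-based approach (not Spark's actual ExternalSorter): each map task tags records with their reduce partition (plain hash partitioning stands in for Spark's partitioner), sorts by partition id, and writes a single file plus an index of per-partition offsets, so only one output stream is open at a time regardless of R.

import java.io.{DataOutputStream, FileOutputStream}

// Toy sketch: write one map task's shuffle output as a single sorted file.
// offsets(i) .. offsets(i + 1) is the byte range that reduce task i reads.
def writeSortedShuffleFile(
    records: Iterator[(String, Int)],
    numPartitions: Int,
    path: String): Array[Long] = {
  // Tag each record with its reduce partition and sort by partition id.
  val sorted = records.map { case (k, v) =>
    val pid = ((k.hashCode % numPartitions) + numPartitions) % numPartitions
    (pid, (k, v))
  }.toArray.sortBy(_._1)

  val offsets = new Array[Long](numPartitions + 1)
  val out = new DataOutputStream(new FileOutputStream(path))
  var written = 0L
  var next = 0                       // next partition whose start offset is unset
  for ((pid, (k, v)) <- sorted) {
    while (next <= pid) { offsets(next) = written; next += 1 }
    val bytes = s"$k\t$v\n".getBytes("UTF-8")
    out.write(bytes)
    written += bytes.length
  }
  while (next <= numPartitions) { offsets(next) = written; next += 1 }
  out.close()
  offsets
}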
Netty-based Network Transport
Zero-copy network send
Explicitly managed memory buffers for shuffle (lowering GC pressure)
Auto retries on transient network failures

Network Transport
Sustaining 1.1 GB/s/node in shuffle
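For context on what "zero-copy network send" means, the sketch below uses the JVM facility that transports like Netty build on: FileChannel.transferTo hands file bytes to a socket inside the kernel (sendfile), instead of copying them through a user-space buffer. This is a conceptual illustration, not Spark's transport code; the path, host, and port are placeholders.

import java.io.FileInputStream
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

// Send a file over a socket without copying it through user space.
def sendFileZeroCopy(path: String, host: String, port: Int): Unit = {
  val fileChannel = new FileInputStream(path).getChannel
  val socket = SocketChannel.open(new InetSocketAddress(host, port))
  try {
    val size = fileChannel.size()
    var position = 0L
    while (position < size) {
      // transferTo may send fewer bytes than requested, so loop until done.
      position += fileChannel.transferTo(position, size - position, socket)
    }
  } finally {
    socket.close()
    fileChannel.close()
  }
}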
DataFrames in Spark
Current Spark API is based on Java / Python objects
–  Hard for engine to store compactly
–  Cannot understand semantics of user functions
DataFrames make it easy to leverage structured data
–  Compact, column-oriented storage
–  Rich optimization via Spark SQL’s Catalyst optimizer
DataFrames in Spark
Distributed collection of data with known schema
Domain-specific functions for common tasks (see the sketch below)
–  Metadata
–  Sampling
–  Project, filter, aggregation, join, …
–  UDFs
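A minimal Scala sketch of those operations using the Spark 1.3-era DataFrame API; it assumes an existing SQLContext named sqlContext, and the file names and column names are made up for illustration:

import org.apache.spark.sql.functions.udf

// Metadata: load JSON with an inferred schema and inspect it.
val users = sqlContext.jsonFile("users.json")
users.printSchema()

// Sampling: take a 10% sample without replacement.
val sampled = users.sample(withReplacement = false, fraction = 0.1)

// Project, filter, aggregation.
val adults = users.filter(users("age") >= 18)
val names  = adults.select("id", "name")
val byAge  = users.groupBy("age").count()

// Join against another DataFrame.
val events = sqlContext.jsonFile("events.json")
val joined = events.join(users, events("uid") === users("id"))

// UDF: flag long event descriptions.
val isLong  = udf((s: String) => s.length > 140)
val flagged = events.select(events("uid"), isLong(events("text")).as("is_long"))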
DataFrames
Similar API to data frames in R and Pandas
Optimized and compiled to bytecode by Spark SQL
Read/write to Hive, JSON, Parquet, ORC, Pandas

df = jsonFile("tweets.json")
df[df.user == "matei"]
  .groupBy("date")
  .sum("retweets")

[Chart: running time of Python, Scala, and DataFrame versions of the same computation]
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2015-01-01")

[Plan diagrams: the logical plan — a filter on top of a join of scan(users) and scan(events) — and the corresponding physical plan. This join is expensive →]
30
logical plan
filter
join
scan
(users)
scan
(events)
optimized plan
join
scan
(users)
filter
scan
(events)
optimized plan
with intelligent data sources
join
scan
(users)
filter scan
(events)
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= “2015-01-01”)
DataFrames arrive in Spark 1.3!
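One way to watch the optimizer do this is DataFrame.explain, which prints the plans Catalyst produced (passing true also shows the logical and optimized logical plans). A small sketch, assuming a SQLContext named sqlContext and hypothetical Parquet files with id, uid, and date columns:

val users  = sqlContext.parquetFile("users.parquet")
val events = sqlContext.parquetFile("events.parquet")

val joined   = users.join(events, users("id") === events("uid"))
val filtered = joined.filter(events("date") >= "2015-01-01")

// Prints the parsed, analyzed, and optimized logical plans plus the physical
// plan, showing where the date filter ends up relative to the join and scans.
filtered.explain(true)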
From MapReduce to Spark
// Mapper: emit (word, 1) for every token in the input line.
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

// Reducer: sum the counts emitted for each word.
public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

// The same word count in Spark (spark is a SparkContext):
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Thank you! Questions?
