Spark Performance
Past, Future, and Present
Kay Ousterhout
Joint work with Christopher Canel, Ryan Rasti,
Sylvia Ratnasamy, Scott Shenker, Byung-Gon
Chun
About Me
Apache Spark PMC Member
Recent PhD graduate from UC Berkeley
Thesis work on performance of large-scale data analytics
Co-founder at Kelda (kelda.io)
How can I make
this faster?
How can I make
this faster?
Should I use a
different cloud
instance type?
Should I trade
more CPU for less
I/O by using
better
compression?
How can I make
this faster?
???
How can I make
this faster?
???
How can I make
this faster?
???
Major performance improvements
possible via tuning, configuration
…if only you knew which knobs to turn
Past: Performance instrumentation in Spark
Future: New architecture that provides performance clarity
Present: Improving Spark’s performance instrumentation
This talk
spark.textFile("hdfs://…")
.flatMap(lambda l: l.split(" "))
.map(lambda w: (w, 1))
.reduceByKey(lambda a, b: a + b)
.saveAsTextFile("hdfs://…")
Example Spark Job
Split input file into words
and emit count of 1 for each
Word Count:
Example Spark Job
Split input file into words
and emit count of 1 for each
Word Count:
For each word, combine the
counts, and save the output
spark.textFile("hdfs://…")
.flatMap(lambda l: l.split(" "))
.map(lambda w: (w, 1))
.reduceByKey(lambda a, b: a + b)
.saveAsTextFile("hdfs://…")
spark.textFile("hdfs://…")
.flatMap(lambda l: l.split(" "))
.map(lambda w: (w, 1))
Map Stage: Split input file into words
and emit count of 1 for each
Reduce Stage: For each word, combine
the counts, and save the output
Spark Word Count Job:
.reduceByKey(lambda a, b: a + b)
.saveAsTextFile("hdfs://…")
…	
Worker 1
Worker n
Tasks
…	
Worker 1
Worker n
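The two stages above can be emulated in a few lines of pure Python (no Spark needed); `lines` here is a made-up stand-in for the HDFS input file:

```python
from collections import defaultdict

# Stand-in for the input file read from HDFS
lines = ["the quick brown fox", "the lazy dog"]

# Map stage: flatMap splits each line into words, map emits (word, 1)
pairs = [(w, 1) for line in lines for w in line.split(" ")]

# Reduce stage: reduceByKey sums the counts for each word
# (in Spark this happens after the shuffle moves pairs between workers)
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n
```

In the real job, the map stage's `pairs` are partitioned by word across workers, so each reduce task sees all counts for its words.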
Spark Word Count Job:
Reduce Stage: For each word, combine
the counts, and save the output
.reduceByKey(lambda a, b: a + b)
.saveAsTextFile(“hdfs://…”)
…	
Worker 1
Worker n
compute
network
time
(1) Request a few
shuffle blocks
disk
(5) Continue fetching
remote data
: time to handle one shuffle block
(2) Process local
data
What happens in a reduce task?
(4) Process data fetched remotely
(3) Write output to disk
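The overlapping of these steps can be sketched with threads: remote fetches run in the background while local data is processed, so network and compute are used at the same time. `fetch_block` and `process_block` are hypothetical stand-ins for the real network and CPU work:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_block(block_id):
    return f"data-{block_id}"   # stands in for a network fetch

def process_block(data):
    return len(data)            # stands in for CPU work on one block

def run_reduce_task(local_blocks, remote_blocks):
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        # (1) request remote shuffle blocks up front
        futures = [pool.submit(fetch_block, b) for b in remote_blocks]
        # (2) process local data while fetches proceed in the background
        results += [process_block(d) for d in local_blocks]
        # (4) process remote data as it arrives
        # ((5) any remaining fetches continued in the background meanwhile)
        results += [process_block(f.result()) for f in futures]
    return results

out = run_reduce_task(["aa"], [1, 2])
```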
compute
network
time
disk
: time to handle one shuffle block
What happens in a reduce task?
Bottlenecked on
network and disk
Bottlenecked on network
Bottlenecked on
CPU
compute
network
time
disk
: time to handle one shuffle block
What happens in a reduce task?
Bottlenecked on
network and disk
Bottlenecked on network
Bottlenecked on
CPU
compute
network
time
disk
What instrumentation exists today?
Instrumentation centered on single, main task thread
: shuffle read blocked time
: executor
computing time (!)
actual
What instrumentation exists today?
timeline version
What instrumentation exists today?
Instrumentation centered on single, main task thread
Shuffle read and shuffle write blocked time
Input read and output write blocked time not instrumented
Possible to add!
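One way this could be added, sketched below: wrap the input stream and accumulate time spent blocked in read(), the same way shuffle-read blocked time is already tracked. This is a hypothetical Python illustration only; Spark's real instrumentation lives in its Scala internals.

```python
import io
import time

class InstrumentedStream:
    """Hypothetical sketch: wrap a file-like object and accumulate
    the time the caller spends blocked inside read()."""
    def __init__(self, stream):
        self._stream = stream
        self.read_blocked_nanos = 0

    def read(self, n=-1):
        start = time.monotonic_ns()
        data = self._stream.read(n)   # the potentially blocking call
        self.read_blocked_nanos += time.monotonic_ns() - start
        return data

s = InstrumentedStream(io.BytesIO(b"input bytes"))
data = s.read()
```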
compute
disk
Instrumenting read and write time
Process shuffle block
(this is a lie)
compute
Reality:
Spark processes and then writes one record at a time
Most writes get buffered
Occasionally the buffer is flushed
compute
Spark processes and then writes one record at a time
Most writes get buffered
Occasionally the buffer is flushed
Challenges with reality:
Record-level instrumentation is too high overhead
Spark doesn’t know when buffers get flushed
(HDFS does!)
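The key observation can be sketched in Python: timing every record would add overhead to every write, but flushes are rare, so timing only the flushes is cheap. The writer below is an illustrative assumption, not Spark's real code; it also shows why only the layer that owns the buffer (e.g., HDFS) can do this.

```python
import io
import time

class FlushTimingWriter:
    """Buffers records in memory and times only flush() calls,
    where the actual (slow) I/O happens. Illustrative sketch."""
    def __init__(self, sink, buffer_size=4096):
        self._sink = sink
        self._buffer = bytearray()
        self._buffer_size = buffer_size
        self.flush_nanos = 0
        self.flush_count = 0

    def write(self, record):
        self._buffer += record              # usually a cheap in-memory append
        if len(self._buffer) >= self._buffer_size:
            self.flush()

    def flush(self):
        start = time.monotonic_ns()
        self._sink.write(bytes(self._buffer))  # the real I/O
        self._buffer.clear()
        self.flush_nanos += time.monotonic_ns() - start
        self.flush_count += 1

sink = io.BytesIO()
w = FlushTimingWriter(sink)
for _ in range(1000):
    w.write(b"x" * 10)   # 10,000 bytes total, but only a few flushes
w.flush()
```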
Tasks use fine-grained pipelining to parallelize
resources
Instrumented times are blocked times only
(task is doing other things in background)
Opportunities to improve instrumentation
Past: Performance instrumentation in Spark
Future: New architecture that provides performance clarity
Present: Improving Spark’s performance instrumentation
This talk
Task 1
Task 2
Task 5
Task 3
Task 4
Task 7
Task 6
Task 8
time
4 concurrent tasks
on a worker
Task 1
Task 2
Task 5
Task 3
Task 4
Task 7
Task 6
Task 8
time
Concurrent tasks may
contend for
the same resource
(e.g., network)
What’s the bottleneck?
Task 1
Task 2
Task 5
Task 3
Task 4
Task 7
Task 6
Task 8
Time t: different
tasks may be
bottlenecked on
different resources
Single task may be
bottlenecked on
different resources
at different times
Task 1
Task 2
Task 5
Task 3
Task 4
Task 7
Task 6
Task 8
How much faster
would my job be with
2x disk throughput?
How would runtimes for these
disk writes change?
How would that change timing of
(and contention for) other resources?
Today: tasks use pipelining to parallelize
multiple resources
Proposal: build systems using monotasks
that each consume just one resource
Monotasks: Each task uses one resource
Network
monotask Disk monotask
Compute
monotask
Today’s task:
Monotasks don’t start until all dependencies complete
Task 1
Network read
CPU
Disk write
Dedicated schedulers control contention
Network
scheduler
CPU scheduler:
1 monotask / core
Disk drive scheduler:
1 monotask / disk
Monotasks for one of today’s tasks:
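A minimal sketch of such per-resource schedulers: each resource gets its own concurrency limit, and a monotask runs only while holding a slot on its single resource. The slot counts below are illustrative assumptions for a 4-core, 1-disk worker, not values from the monotasks implementation.

```python
import threading

# One scheduler (modeled as a concurrency limit) per resource
RESOURCE_SLOTS = {
    "cpu": threading.Semaphore(4),      # CPU scheduler: 1 monotask / core
    "disk": threading.Semaphore(1),     # disk scheduler: 1 monotask / disk
    "network": threading.Semaphore(8),  # network scheduler's own limit
}

def run_monotask(resource, fn, *args):
    """Run fn only while holding a slot on its single resource, so
    contention is controlled by the scheduler, not by lower layers."""
    with RESOURCE_SLOTS[resource]:
        return fn(*args)

out = run_monotask("cpu", sum, [1, 2, 3])
```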
Spark today:
Tasks have non-uniform resource use
4 multi-resource
tasks run
concurrently
Single-resource
monotasks
scheduled by
per-resource
schedulers
Monotasks:
API-compatible, performance
parity with Spark
Performance telemetry trivial!
How much faster would the job run if...
4x more machines
Input stored in-memory
No disk read
No CPU time to deserialize
Flash drives instead of disks
Faster shuffle read/write time
10x improvement predicted with at most 23% error
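The idea behind these predictions can be sketched as follows: once per-resource times are known, the ideal runtime is bounded by the busiest resource, so a hypothetical hardware change is modeled by scaling that resource's time. The numbers below are made up for illustration; the real monotasks model is more detailed.

```python
def predicted_runtime(resource_seconds, speedups=None):
    """resource_seconds: total seconds of work per resource.
    speedups: hypothetical factor by which each resource gets faster."""
    speedups = speedups or {}
    # The job can finish no sooner than its most-loaded resource allows
    return max(t / speedups.get(r, 1.0) for r, t in resource_seconds.items())

times = {"cpu": 40.0, "network": 25.0, "disk": 60.0}   # illustrative
baseline = predicted_runtime(times)                      # disk-bound
with_fast_disk = predicted_runtime(times, {"disk": 2.0}) # now cpu-bound
```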
Monotasks: Break jobs into single-resource tasks
Using single-resource monotasks provides clarity
without sacrificing performance
Massive change to Spark internals (>20K lines of code)
Past: Performance instrumentation in Spark
Future: New architecture that provides performance clarity
Present: Improving Spark’s performance instrumentation
This talk
Spark today:
Task resource use
changes at fine
time granularity
4 multi-resource
tasks run
concurrently
Monotasks:
Single-resource tasks lead
to complete, trivial
performance metrics
Can we get monotask-like per-resource metrics for each
task today?
compute
network
time
disk
Can we get per-task resource use?
Measure machine
resource utilization?
Task 1
Task 2
Task 5
Task 3
Task 4
Task 7
Task 6
Task 8
time
4 concurrent tasks
on a worker
Task 1
Task 2
Task 5
Task 3
Task 4
Task 7
Task 6
Task 8
time
Concurrent tasks may
contend for
the same resource
(e.g., network)
Contention controlled by
lower layers (e.g.,
operating system)
compute
network
time
disk
Can we get per-task resource use?
Machine utilization
includes other tasks
Can’t directly measure
per-task I/O:
in background, mixed
with other tasks
compute
network
time
disk
Can we get per-task resource use?
Existing metrics: total
data read
How long did it take?
Use machine utilization
metrics to get
bandwidth!
Existing per-task I/O counters (e.g., shuffle bytes read)
+
Machine-level utilization (and bandwidth) metrics
=
Complete metrics about time spent using each resource
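A sketch of that arithmetic, with illustrative numbers: divide a per-task byte counter by the bandwidth observed at the machine level to estimate the per-task time that can't be measured directly.

```python
def estimate_io_seconds(task_bytes, machine_bytes_per_sec):
    """Approximate time a task spent on I/O that happens in the
    background, mixed with other tasks' I/O."""
    return task_bytes / machine_bytes_per_sec

shuffle_bytes_read = 512 * 1024 * 1024  # from Spark's per-task counters
measured_disk_bw = 128 * 1024 * 1024    # from machine utilization, bytes/sec
disk_seconds = estimate_io_seconds(shuffle_bytes_read, measured_disk_bw)
```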
Goal: provide performance clarity
Only way to improve performance is to know what to speed up
Why do we care about performance clarity?
Typical performance eval: group of experts
Practical performance: 1 novice
Goal: provide performance clarity
Only way to improve performance is to know what to speed up
Some instrumentation exists already
Focuses on blocked times in the main task thread
Many opportunities to improve instrumentation
(1) Add read/write instrumentation to lower level (e.g., HDFS)
(2) Add machine-level utilization info
(3) Calculate per-resource time
More details at kayousterhout.org
