Spark makes it easy to build and deploy complex data processing applications onto shared compute platforms, but tuning them is a skill in itself and can get overlooked. Uncontrolled, this leads to over-specified resource requirements and unnecessary platform load, and increases the chances of resource contention, degrading overall performance. By identifying inefficient jobs, development teams and platform administrators can wrestle back control of system resources, improve efficiency and lessen the effect of contention across the cluster.
Sparklint uses the Spark metrics stream and a custom event listener to analyze individual Spark jobs for over-specified or unbalanced resources, incorrect partitioning and sub-optimal worker locality. It is easily attached to any Spark job and can also run standalone against historical event logs, presenting data for analysis through a web UI and providing a unique resource-focused view of the application runtime.
3. Why Sparklint?
• A successful Spark cluster grows rapidly
• Capacity and capability mismatches arise
• Leads to resource contention
• Tuning process is non-trivial
• Current Spark UI is operations-focused
We wanted to understand application efficiency
4. Sparklint provides:
• Live view of batch & streaming application stats
or
• Event-by-event analysis of historical event logs
• Stats and graphs for:
– Idle time
– Core usage
– Task locality
8. Demo…
• Simulated workload analyzing site access logs:
– read text file as JSON
– convert to Record(ip, verb, status, time)
– countByIp, countByStatus, countByVerb
9. Job took 10m7s to finish
Already pretty good distribution; low idle time indicates good worker usage and minimal driver-node interaction in the job.
But overall utilization is low, which is reflected in the common occurrence of the IDLE state (unused cores).
10. Job took 15m14s to finish
Core usage increased and the job is more efficient; execution time increased, but the app is not CPU bound.
11. Job took 9m24s to finish
Core utilization decreased proportionally, trading efficiency for execution time.
Lots of IDLE state shows we are over-allocating resources.
12. Job took 11m34s to finish
Core utilization remains low; the config settings are not right for this workload.
Dynamic allocation is only effective at app start due to the long executorIdleTimeout setting.
13. Job took 33m5s to finish
Core utilization is up, but execution time is up dramatically due to reclaiming resources before each short-running task.
The IDLE state is reduced to a minimum and looks efficient, but execution is much slower due to dynamic allocation overhead.
14. Job took 7m34s to finish
Core utilization is way up, with lower execution time.
Parallel execution is clearly visible in overlapping stages.
Flat tops show we are becoming CPU bound.
15. Job took 5m6s to finish
Core utilization decreases; again we trade efficiency for execution time.
16. Thanks to dynamic allocation, utilization is high despite this being a bi-modal application.
Data loading and mapping require a large core count to get throughput.
Aggregation and IO of results are optimized for the end file size and therefore require fewer cores.
Spark cluster success
Platform rolls out with a maximum supported load.
Early projects ramp up, usage is fine
Early successes feed back into recommendations to use the platform
New users start loading up the platform just as initial successes are being scaled
Platform limits hit, scaling requirements now begin to be understood and planned for
Rough times while platform operations learns to stay ahead of application usage
◦ the Spark UI provides masses of info, but by default only for recent jobs/stages/tasks while the job is alive
◦ when serving the Spark UI from the history server, there is still little summary information to debug the job config: have I used the right magic numbers (locality wait, cores, numPartitions, job scheduling mode, etc.)?
◦ difficult to compare different executions of the same job due to this missing level of summary (execution time is almost the only metric to compare)
◦ A mechanism to listen to the Spark event log stream and accumulate lifetime stats without losing (too many) details, using constant memory in live mode thanks to the gauges we use
◦ The mechanism also provides convenient replay when serving from a file
◦ A set of stats and graphs to describe job performance uniformly:
1. idle time (the duration when all calculation happens on the driver node; something to avoid)
2. max core usage and core usage percentage (should be neither too high nor too low; we are thinking about supplementing it with the average number of tasks in wait)
3. task execution time for a given stage by locality (which honestly describes the opportunity cost of lower locality and indicates the ideal locality wait config)
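The constant-memory idea above can be sketched as a simple gauge that consumes task start/end events and keeps only running aggregates, so memory does not grow with the number of events. This is an illustrative sketch, not Sparklint's actual internals; the class and field names are hypothetical:

```scala
// Hypothetical constant-memory gauge: folds each event interval into running
// totals instead of storing the event history.
class CoreUsageGauge(totalCores: Int, startTime: Long) {
  private var running = 0               // tasks currently executing
  private var lastEventTime = startTime // timestamp of the previous event, ms
  private var busyCoreMillis = 0L       // accumulated core-milliseconds of work
  private var idleMillis = 0L           // wall time with zero running tasks

  // Fold the interval since the previous event into the aggregates.
  private def advance(now: Long): Unit = {
    val elapsed = now - lastEventTime
    busyCoreMillis += elapsed * math.min(running, totalCores)
    if (running == 0) idleMillis += elapsed
    lastEventTime = now
  }

  def onTaskStart(time: Long): Unit = { advance(time); running += 1 }
  def onTaskEnd(time: Long): Unit = { advance(time); running -= 1 }

  // Average fraction of allocated cores doing work between startTime and now.
  def coreUsagePercent(now: Long): Double = {
    advance(now)
    busyCoreMillis.toDouble / (totalCores.toDouble * (now - startTime))
  }

  def idleTimeMillis: Long = idleMillis
}
```

In live mode such a gauge is fed from the listener callbacks; in replay mode it is fed the same events read back from the event log file.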
We use ReduceByKey.scala from the repo as a sample to demo a series of attempts at optimizing a Spark application. The logs are included as well. The highlights of each run are annotated in the screenshots in the attachment.
The application basically reads a text file, parses each line as JSON and converts it to "Record(ip: String, verb: String, status: Int, time: Long)", then does countByIp, countByStatus and countByVerb, repeated 10 times.
These are three independent map-reduce jobs, each with one map stage (parsing) and one reduce stage (countByXXX).
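The shape of the workload can be sketched with plain Scala collections; in the real ReduceByKey.scala these are RDD operations on a cluster (sc.textFile(...).map(parse) followed by reduceByKey-style counts), and the line format below is a simplification of the JSON input:

```scala
// Sketch of the demo workload's data model and its three independent
// aggregations, using local collections instead of RDDs. The "ip verb status
// time" line format is an assumption; the real job parses JSON.
case class Record(ip: String, verb: String, status: Int, time: Long)

object LogCounts {
  // Parse one simplified log line into a Record.
  def parse(line: String): Record = {
    val Array(ip, verb, status, time) = line.split(" ")
    Record(ip, verb, status.toInt, time.toLong)
  }

  // Each count is an independent map-reduce: map to a key, reduce by counting.
  def countByIp(rs: Seq[Record]): Map[String, Long] =
    rs.groupBy(_.ip).map { case (k, v) => k -> v.size.toLong }
  def countByStatus(rs: Seq[Record]): Map[Int, Long] =
    rs.groupBy(_.status).map { case (k, v) => k -> v.size.toLong }
  def countByVerb(rs: Seq[Record]): Map[String, Long] =
    rs.groupBy(_.verb).map { case (k, v) => k -> v.size.toLong }
}
```

Because the three counts share nothing but the parsed input, Spark is free to schedule them independently, which matters later when we try FAIR scheduling.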
Algorithm-level optimization is out of scope here. The app needs a constant number of CPU seconds, plus a floating but bounded amount of network I/O time (determined by job locality), to finish execution.
We use 16 cores as the baseline standard.
The job takes 10 min to finish.
The annotations in the pic describe what we are running here and how to read the Sparklint graph.
After reading the chart, we decided to decrease the core count to see whether the execution time doubles, to figure out if we are bound by CPU.
Using 8 cores, the job took 15 min to finish, shorter than the expected 20 min, proving that we are not CPU bound.
Actually this saw-tooth pattern alone indicates we are not CPU bound, and can serve as a classic example; an example of a CPU-bound application can be found in the last demo slide.
This leads to another angle of optimization: job scheduling tweaking.
Using 32 cores, the job took 9 min to finish, proving again that throwing more cores at the job doesn't provide commensurate performance gains.
The graph is a classic example of over-allocating resources.
We can assume we need no more than 24 cores to do the work effectively, so now we can look into other ways of tuning the job: dynamic allocation and increased parallelism.
We try to optimize the resource requirement by using dynamic allocation, initially with the default executorIdleTimeout of 1 min.
This also led us to try 1 core per executor.
Since we don't usually have any task longer than 1 minute, this showed that dynamic allocation is not the key to optimizing this kind of app with short tasks.
We then reduced executorIdleTimeout to 10s, which decreased the resource footprint and increased utilization.
However this is a false saving for this job, because job throughput is reduced by the low core supply and the overhead of acquiring executors.
This example proved again that dynamic allocation doesn't solve the optimization challenge when tasks are short.
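For reference, the knobs exercised in these dynamic allocation runs map to standard Spark configuration keys. A submission along these lines would reproduce the setup (the class name and jar path are placeholders):

```shell
# Illustrative spark-submit invocation for the dynamic allocation experiments.
spark-submit \
  --class ReduceByKey \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.executorIdleTimeout=10s \
  --conf spark.executor.cores=1 \
  path/to/demo.jar
```

Note that dynamic allocation requires the external shuffle service, and executorIdleTimeout defaults to 60s, which is why the first attempt only released executors at app start.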
So, let’s try parallelism inside the job using FAIR scheduling.
Using 16 cores and the FAIR scheduler, this simple tweak cut the execution time from 10 min to 7.5 min, and our job now becomes CPU bound (see annotation).
Running the three count stages in parallel under FAIR scheduling increases efficiency and reduces runtime, allowing us to become CPU bound.
Using 32 cores and the FAIR scheduler, the execution time becomes 5 min (compared to 9 min in pic 3 using the same resources).
We reduce efficiency in order to gain execution time; this is a trade-off for the team to decide: if there is a hard SLA to hit, it may be worth running with lower utilization.
We can now call the job scheduling optimization done.
This demos the correct scenario for using dynamic allocation, and shows that throwing more CPU at the job helps when it is CPU bound (the flat tops in the usage graph are the clear proof).
In this case the partition count is chosen to optimize file size on HDFS, so the team is comfortable with the runtime.
Sparklint can easily distinguish CPU-bound from job-scheduling-bound applications. (We are working on automating this judgment using the average number of pending tasks.)
It is really easy to spot when a job is bound not by CPU but by job scheduling (leading to low core usage) or driver node operations (leading to idle time). In theory your app will be 2x faster if you throw 2x cores at it, but this is not always true.
The point of Spark-level optimization is to make your job CPU bound, at which point you can decide freely between the $ gained from a faster application and the $ spent on providing more cores.
If your job is CPU bound, simply add cores.
If your job has a lot of idle time, try to decrease it by removing unwanted/unintended driver node operations (it could be something as simple as doing a map on a large array instead of an RDD that someone forgot about).
If your job is job-scheduling bound, you can both reduce waste by using dynamic allocation (which in turn provides high throughput when needed) and submit independent jobs in parallel using Futures and the FAIR scheduler: http://spark.apache.org/docs/latest/configuration.html#scheduling
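The "independent jobs in parallel" pattern boils down to wrapping each action in a Future and awaiting them all. The sketch below uses plain Scala Futures over arbitrary job bodies; in a real Spark app each body would be an RDD action (e.g. one of the three countByXXX jobs) with spark.scheduler.mode=FAIR set in the config:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

object ParallelJobs {
  // Submit every job at once, then wait for all results. With FAIR scheduling
  // enabled, concurrently submitted Spark jobs share the cluster's cores.
  def runInParallel[A](jobs: Seq[() => A]): Seq[A] = {
    val futures = jobs.map(job => Future(job()))
    Await.result(Future.sequence(futures), Duration.Inf)
  }
}
```

With the default FIFO scheduler the concurrently submitted jobs would still largely queue one after another; FAIR scheduling is what lets their stages interleave on the cluster.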