WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Rose Toomey, Coatue Management
Spark At Scale In the
Cloud
#UnifiedDataAnalytics #SparkAISummit
About me
NYC. Finance. Technology. Code.
• At each job I wrote code, but found that the data challenges just kept growing
– Lead API Developer at Gemini Trust
– Director at Novus Partners
• Now: coding and working with data full time
– Software Engineer at Coatue Management
How do you process this…
Numbers are approximate.
• Dataset is 35+ TiB raw
• Input is 80k+ files in an unsplittable, compressed, row-based format with heavy skew and a deeply nested directory structure
• Processing results in 275+ billion rows cached to disk
• Lots of data written back out to S3
– Including stages ending in sustained writes of tens of TiB
4
On a very big Spark cluster…
Sometimes you just need to bring the entire
dataset into memory.
The more nodes a Spark cluster has, the more
important configuration tuning becomes.
Even more so in the cloud, where you will
regularly experience I/O variance and
unreliable nodes.
In the cloud?
• Infrastructure management is hard
– Scaling resources and bandwidth in a datacenter
is not instant
– Spark/Hadoop clusters are not islands – you’re
managing an entire ecosystem of supporting
players
• Optimizing Spark jobs is hard
Let’s limit the number of hard things we’re going to tackle
at once.
Things going wrong at scale
Everything is relative. In smaller clusters, these
configurations worked fine.
• Everything is waiting on everything else because Netty
doesn't have enough firepower to shuffle faster
• Speculation meets skew and relaunches the very
slowest parts of a join, leaving most of the cluster idle
• An external service rate limits, which causes blacklisting
to sideline most of a perfectly good cluster
7
Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Networking
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
Putting together a big
Spark cluster
• What kind of nodes should the
cluster have? Big? Small?
Medium?
• What's your resource limitation for
the number of executors?
– Just memory (standalone)
– Both memory and vCPUs (YARN)
• Individual executors should have
how much memory and how many
virtual CPUs?
Galactic Wreckage in Stephan's Quintet
9
One Very Big Standalone Node
One mega instance configured with many
"just right" executors, each provisioned with
• < 32 GiB heap (sweet spot for GC)
• 5 cores (for good throughput)
• Minimizes shuffle overhead
• Like the pony, not offered by your cloud
provider. Also, poor fault tolerance.
10
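As a rough illustration, here is a hypothetical sizing sketch in Scala; the node dimensions and overhead factor are assumptions, not a recommendation for any particular instance type.

val nodeMemoryGiB      = 488   // assumed memory on one very big node
val nodeCores          = 64    // assumed vCPUs on that node
val heapPerExecutorGiB = 28.0  // stays under the ~32 GiB compressed-OOPs sweet spot
val coresPerExecutor   = 5     // the "just right" core count above
val overheadFactor     = 1.1   // leave ~10% per executor for off-heap overhead

val byMemory = (nodeMemoryGiB / (heapPerExecutorGiB * overheadFactor)).toInt // 15
val byCores  = nodeCores / coresPerExecutor                                  // 12
val executorsPerNode = math.min(byMemory, byCores)                           // 12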
Multiple Medium-sized Nodes
When looking at medium-sized nodes, we
have a choice:
• Just one executor
• Multiple executors
But a single executor might not be the best
resource usage:
• More cores on a single executor is not
necessarily better
• When using a cluster manager like
YARN, more executors could be a more
efficient use of CPU and memory
11
Many Small Nodes
12
• 500+ small nodes
• Each node over-provisioned
relative to multiple executor per
node configurations
• Single executor per node
• Most fault tolerant but big
communications overhead
“Desperate affairs require
desperate measures.”
Vice Admiral Horatio Nelson
Why ever choose the worst solution?
Single executor per small (or medium) node is the worst
configuration for cost, provisioning, and resource usage. Why not
recommend against it?
• Resilient to node degradation and loss
• Quick transition to production: relative over-provisioning of
resources to each executor behaves more like a notebook
• Awkward instance sizes may provision more quickly than larger
instances
13
Onward!
Now that you have your cluster composition in mind, you’ll need to scale
up your base infrastructure to support the number of nodes:
• Memory and garbage collection
• Tune RPC for cluster communications
• Where do you put very large datasets?
• How do you get them off the cluster?
• No task left behind: scheduling in difficult times
14
Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Networking
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
Spark memory management
SPARK-10000: Consolidate storage and execution memory management
• NewRatio controls the Young/Old generation proportion
• spark.memory.fraction sets unified storage and execution space to ~60% of the heap, which must fit inside tenured space
16
Heap layout (defaults):
• Young Generation: 1/3 of the heap
• Old Generation: 2/3 of the heap
– 300m reserved
– spark.memory.fraction (~60%): unified execution and storage
· 50% execution (dynamic – will take more)
· 50% storage (spark.memory.storageFraction)
– remaining ~40%: Spark metadata, user data structures, OOM safety
17
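A worked example of the layout above, assuming a 28 GiB executor heap and default settings:

unified memory    = (28 GiB - 300 MiB reserved) * 0.6 (spark.memory.fraction) ≈ 16.6 GiB
protected storage = 16.6 GiB * 0.5 (spark.memory.storageFraction) ≈ 8.3 GiB
remaining ~11 GiB = Spark metadata, user data structures, OOM safety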
Field guide to Spark GC tuning
• Lots of minor GC - easy fix
– Increase Eden space (high allocation rate)
• Lots of major GC - need to diagnose the trigger
– Triggered by promotion - increase Eden space
– Triggered by Old Generation filling up - increase Old Generation
space or decrease spark.memory.fraction
• Full GC before stage completes
– Trigger minor GC earlier and more often
18
Full GC tailspin
Balance sizing up against tuning code
• Switch to bigger and/or more nodes
• Look for slow running stages caused by avoidable shuffle, tune
joins and aggregation operations
• Checkpoint both to preserve work at strategic points and to truncate DAG lineage
• Cache to disk only
• Trade CPU for memory by compressing data in memory using
spark.rdd.compress
19
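A minimal Scala sketch of the last three points, assuming a SparkSession can be built here, a DataFrame df already exists on it, and the checkpoint path exists on the cluster:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .config("spark.rdd.compress", "true")           // trade CPU for memory
  .getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") // assumed path

val cached = df.persist(StorageLevel.DISK_ONLY)   // cache to disk only
val stable = cached.checkpoint()                  // preserve work, truncate DAG lineage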
Which garbage collector?
Throughput or latency?
• ParallelGC favors throughput
• G1GC is low latency
– Shiny new things like string deduplication
– Vulnerable to wide rows
Whichever you choose, collect early and often.
20
Where to cache big datasets
• To disk. Which is slow.
• But this frees up as much tenured space as possible for execution, and for storing things which must be in memory
– internal metadata
– user data structures
– broadcasting the skew side of joins
21
22
Perils of caching to disk
19/04/13 01:27:33 WARN BlockManagerMasterEndpoint: No more replicas
available for rdd_48_27005 !
When you lose an executor, you lose all the cached blocks stored by that
executor even if the node is still running.
• If lineage is gone, the entire job will fail
• If lineage is present, RDD#getOrCompute tries to compensate for the missing
blocks by re-ingesting the source data. While it keeps your job from failing,
this could introduce enormous slowdowns if the source data is skewed, your
ingestion process is complex, etc.
23
Self healing block management
// use this with replication >= 2 when caching to disk in a non-distributed filesystem
spark.storage.replication.proactive = true
Pro-active block replenishment in case of node/executor failures
https://issues.apache.org/jira/browse/SPARK-15355
https://github.com/apache/spark/pull/14412
24
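A hedged companion sketch: pair proactive replication with a replicated disk storage level (DISK_ONLY_2 keeps two copies of each cached block; df is an assumed DataFrame):

// in the cluster conf: spark.storage.replication.proactive = true
import org.apache.spark.storage.StorageLevel
val cached = df.persist(StorageLevel.DISK_ONLY_2)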
Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Networking
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
Tune RPC for cluster
communications
The Netty server processing RPC requests is the backbone of both the authentication and shuffle services.
Insufficient RPC resources cause slow
speed mayhem: clients disassociate,
operations time out.
org.apache.spark.network.util.TransportConf is the shared config for both shuffle and authentication services.
Ruth Teitelbum and Marlyn Meltzer
reprogramming ENIAC, 1946
26
Scaling RPC
// used for auth
spark.rpc.io.serverThreads = coresPerDriver * rpcThreadMultiplier
// used for shuffle
spark.shuffle.io.serverThreads = coresPerDriver * rpcThreadMultiplier
Where "RPC thread multiplier" is a scaling factor to increase the service's thread pool.
• 8 is aggressive, might cause issues
• 4 is moderately aggressive
• 2 is recommended (start here, benchmark, then increase)
• 1 (number of vCPU cores) is default but is too small for a large cluster
27
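For example, with an assumed 16 vCPUs available to the driver and the recommended starting multiplier of 2:

spark.rpc.io.serverThreads = 32     // 16 cores * 2
spark.shuffle.io.serverThreads = 32 // 16 cores * 2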
Shuffle
The definitive presentation on shuffle tuning:
Tuning Apache Spark for Large-Scale Workloads (Gaoxiang Liu
and Sital Kedia)
So this section focuses on
• Some differences from the configurations presented in Liu and Kedia's presentation, as well as
• Configurations that weren't covered in their presentation
28
Strategy for lots of shuffle clients
1. Scale the server way up
// mentioned in Liu/Kedia presentation but now deprecated
// spark.shuffle.service.index.cache.entries = 2048
// default: 100 MiB
spark.shuffle.service.index.cache.size = 256m
// length of accept queue. default: 64
spark.shuffle.io.backLog = 8192
// default (not increased by spark.network.timeout)
spark.rpc.lookupTimeout = 120s
29
Strategy for lots of shuffle clients
2. Make clients more patient and more fault tolerant, with fewer simultaneous requests in flight
spark.reducer.maxReqsInFlight = 5 // default: Int.MaxValue
spark.shuffle.io.maxRetries = 10 // default: 3
spark.shuffle.io.retryWait = 60s // default 5s
30
Strategy for lots of shuffle clients
spark.shuffle.io.numConnectionsPerPeer = 1
Scaling this up conservatively for multiple executor per node
configurations can be helpful.
Not recommended to change the default for single executor per
node.
31
Shuffle partitions
spark.sql.shuffle.partitions = max(1, nodes - 1) *
coresPerExecutor * parallelismPerCore
where parallelism per core is some hyperthreading factor, let's say 2.
This starting point is not the best for large shuffles, although it can be adjusted.
Apache Spark Core—Deep Dive—Proper Optimization (Daniel Tomes)
recommends setting this value to max(cluster executor cores,
shuffle stage input / 200 MB). That translates to 5242 partitions
per TB. Highly aggressive shuffle optimization is required for a large
dataset on a cluster with a large number of executors.
32
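A worked example using assumed numbers (1000 executors with 4 cores each, a 20 TiB shuffle stage input):

spark.sql.shuffle.partitions = max(1000 * 4, 20 TiB / 200 MB)
                             ≈ max(4000, 104857)
                             ≈ 105000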
Kill Spill
spark.shuffle.spill.numElementsForceSpillThreshold = 25000000
spark.sql.windowExec.buffer.spill.threshold = 25000000
spark.sql.sortMergeJoinExec.buffer.spill.threshold = 25000000
• Spill is the number one cause of poor performance on very large
Spark clusters. These settings control when Spark spills data from
memory to disk – the defaults are a bad choice!
• Set these to a big Integer value – start with 25000000 and
increase if you can. More is more.
• SPARK-21595: Separate thresholds for buffering and spilling in
ExternalAppendOnlyUnsafeRowArray
Scaling AWS S3 Writes
Hadoop AWS S3 support in 3.2.0 is
amazing
• Especially the new S3A committers
https://hadoop.apache.org/docs/r3.2.0/hado
op-aws/tools/hadoop-aws/index.html
EMR: write to HDFS and copy off using
s3DistCp (limit reducers if necessary)
Databricks: writing directly to S3 just works
First NASA ISINGLASS rocket launch
34
Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Services
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
Task Scheduling
Spark's powerful task scheduling
settings can interact in unexpected
ways at scale.
• Dynamic resource allocation
• External shuffle
• Speculative Execution
• Blacklisting
• Task reaper
Apollo 13 Mailbox at Mission Control
36
Dynamic resource allocation
Dynamic resource allocation benefits a multi-tenant cluster where
multiple applications can share resources.
If you have an ETL pipeline running on a large transient Spark
cluster, dynamic allocation is not useful to your single application.
Note that even in the first case, when your application no longer
needs some executors, those cluster nodes don't get spun down:
• Dynamic allocation requires an external shuffle service
• The node stays live and shuffle blocks continue to be served from it
37
External shuffle service
spark.shuffle.service.enabled = true
spark.shuffle.registration.timeout = 60000 // default: 5000ms
spark.shuffle.registration.maxAttempts = 5 // default: 3
Even without dynamic allocation, an external shuffle service may be a good idea.
• If you lose executors (through dynamic allocation or failure), the external shuffle process still serves up their blocks.
• The external shuffle service could be more responsive than the executor itself
However, the default registration values are insufficient for a large, busy cluster:
SPARK-20640 Make rpc timeout and retry for shuffle registration configurable
38
Speculative execution
When speculative execution works as intended, tasks running slowly
due to transient node issues don't bog down that stage indefinitely.
• Spark calculates the median execution time of all tasks in the stage
• spark.speculation.quantile - don't start speculating until this
percentage of tasks are complete (default 0.75)
• spark.speculation.multiplier - expressed as a multiple of the
median execution time, this is how slow a task must be to be
considered for speculation
• Whichever task is still running when the first finishes gets killed
39
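A worked example with hypothetical numbers: with the defaults (quantile 0.75, multiplier 1.5), once 75% of the stage's tasks have finished and the median task time is 10 minutes, any task still running past 15 minutes becomes a candidate for a speculative copy - and whichever attempt finishes second is killed.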
One size does not fit all
spark.speculation = true
spark.speculation.quantile = 0.8 //default: 0.75
spark.speculation.multiplier = 4 // default: 1.5
These were our standard speculative execution settings. They
worked "fine" in most of our pipelines. But they worked fine
because the median size of the tasks at 80% was OK.
What happens when reasonable settings meet unreasonable
data?
40
21.2 TB shuffle, 20% of tasks killed
41
Speculation: unintended consequences
The median task length is based on the fast 80% - but due to heavy skew, this estimate is bad!
This causes the scheduler to take the worst part of the job and launch more copies of the worst, longest-running tasks ... one of which then gets killed.
spark.speculation = true
// start later (might get a better estimate)
spark.speculation.quantile = 0.90
// default 1.5 - require a task to be really bad
spark.speculation.multiplier = 6
The solution was two-fold:
• Start speculative execution later (increase the quantile) and require a greater slowness
multiplier
• Do something about the skew
42
Benefits of speculative execution
• Speculation can be very helpful when the application is interacting
with an external service. Example: writing to S3
• When speculation kills a task that was going to fail anyway, it
doesn't count against the failed tasks for that
stage/executor/node/job
• Clusters are not tuned in a day! Speculation can help pave over
slowdowns caused by scaling issues
• Useful canary: when you see tasks being intentionally killed in any
quantity, it's worth investigating why
43
Blacklisting
spark.blacklist.enabled = true
spark.blacklist.task.maxTaskAttemptsPerExecutor = 1 // task blacklisted from executor
spark.blacklist.stage.maxFailedTasksPerExecutor = 2 // executor blacklisted from stage
// how many different tasks must fail in successful task sets before the executor
// is blacklisted from the application
spark.blacklist.application.maxFailedTasksPerExecutor = 2
spark.blacklist.timeout = 1h // executor removed from blacklist, takes new tasks
Blacklisting prevents Spark from scheduling tasks on executors/nodes which have failed too many
times in the current stage.
The default number of failures is too conservative when using flaky external services. Let's see how quickly it can add up...
44
45
Blacklisting gone wrong
• While writing three very large datasets to S3, something went
wrong about 17 TiB in
• 8600+ errors trying to write to S3 in the space of eight minutes, distributed across 1000 nodes
– Some executors back off, retry, and succeed
– Speculative execution kicks in, softening the blow
– But all the nodes quickly accumulate at least two failed tasks,
many have more and get blacklisted
• Eventually translating to four failed tasks, killing the job
46
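Rough arithmetic with the approximate figures above: 8600 failures spread across 1000 nodes is eight to nine failed tasks per node within eight minutes. With spark.blacklist.stage.maxFailedTasksPerExecutor = 2, nearly every executor crosses the threshold, the stage runs out of places to schedule work, and individual tasks quickly accumulate enough failures to kill the job.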
47
Don't blacklist too soon
• We enabled blacklisting but didn't adjust the defaults because we never "needed" to before
• The post-mortem showed the cluster's blocks were too large for our s3a settings
spark.blacklist.enabled = true
spark.blacklist.stage.maxFailedTasksPerExecutor = 8 // default: 2
spark.blacklist.application.maxFailedTasksPerExecutor = 24 // default: 2
spark.blacklist.timeout = 15m // default: 1h
Solution was to
• Make blacklisting a lot more tolerant of failure
• Repartition data on write for better block size
• Adjust s3a settings to raise multipart upload size
48
Don't fear the reaper
spark.task.reaper.enabled = true
// default: -1 (prevents executor from self-destructing)
spark.task.reaper.killTimeout = 180s
The task reaper monitors tasks that get interrupted or killed to make sure they actually shut down.
On a large job, give a little extra time before killing the JVM
• If you've increased timeouts, the task may need more time to shut down cleanly
• If the task reaper kills the JVM abruptly, you could lose cached blocks
SPARK-18761 Uncancellable / unkillable tasks may starve jobs of resources
49
Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Services
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
Increase tolerance
• If you find a timeout or number of retries, raise it
• If you find a buffer, backlog, queue, or threshold, increase it
• If you have an MR task with a number of reducers trying to use
a service concurrently in a large cluster
– Either limit the number of active tasks per reducer, or
– Limit the number of reducers active at the same time
51
Be more patient
// default - might be too low for a large cluster under load
spark.network.timeout = 120s
Spark has a lot of different networking timeouts. This is the
biggest knob to turn: increasing this increases many settings at
once.
(This setting does not increase the spark.rpc.timeout used by
shuffle and authentication services.)
52
Executor heartbeat timeouts
spark.executor.heartbeatInterval = 10s // default
spark.executor.heartbeatInterval should be significantly
less than spark.network.timeout.
Executors missing heartbeats usually signify a memory issue, not
a network problem.
• Increase the number of partitions in the dataset
• Remediate skew causing some partition(s) to be much larger
than the others
53
Be resilient to failure
spark.stage.maxConsecutiveAttempts = 10 // default: 4
// default: 4 (would go higher for cloud storage misbehavior)
spark.task.maxFailures = 12
spark.max.fetch.failures.per.stage = 10 // default: 4 (helps shuffle)
Increase the number of failures your application can accept at the task and stage level.
Use blacklisting and speculation to your advantage. It's better to concede some extra resources to a
stage which eventually succeeds than to fail the entire job:
• Note that tasks killed through speculation - which might otherwise have failed - don't count against
you here.
• Blacklisting - which in the best case removes from a stage or job a host which can't participate
anyway - also helps proactively keep this count down. Just be sure to raise the number of failures
there too!
54
Koan
A Spark job that is broken
is only a special case of a
Spark job that is working.
Koan Mu calligraphy by Brigitte D'Ortschy
is licensed under CC BY 3.0
55
Interested?
• What we do: data engineering @ Coatue
‒ Terabyte scale, billions of rows
‒ Lambda architecture
‒ Functional programming
• Stack
‒ Scala (cats, shapeless, fs2, http4s)
‒ Spark / Hadoop / EMR / Databricks
‒ Data warehouses
‒ Python / R / Tableau
‒ Chat with me or email: rtoomey@coatue.com
‒ Twitter: @prasinous
56
Digestifs
Resources, links, configurations
Useful things for later
Desirable heap size for executors
spark.executor.memory = ???
JVM flag -XX:+UseCompressedOops allows you to use 4-byte pointers instead
of 8 (on by default in JDK 7+).
< 32 GB: good for prompt GC, supports compressed OOPs.
32-48 GB: "dead zone." Without compressed OOPs over 32 GB, you need almost 48 GB to hold the same number of objects.
49-64+ GB: very large joins, or special cases with wide rows and G1GC.
58
How many concurrent tasks per executor?
spark.executor.cores = ???
Defaults to number of physical cores, but represents the maximum number of
concurrent tasks that can run on a single executor.
< 2: too few cores. Doesn't make good use of parallelism.
2-4: recommended size for "most" Spark apps.
5: HDFS client performance tops out.
> 8: too many cores. Overhead from context switching outweighs the benefit.
59
Memory
• Spark docs: Garbage Collection Tuning
• Distribution of Executors, Cores and Memory for a Spark Application
running in Yarn (spoddutur.github.io/spark-notes)
• How-to: Tune Your Apache Spark Jobs (Part 2) - (Sandy Ryza)
• Why Your Spark Applications Are Slow or Failing, Part 1: Memory
Management (Rishitesh Mishra)
• Why 35GB Heap is Less Than 32GB – Java JVM Memory Oddities
(Fabian Lange)
• Everything by Aleksey Shipilëv at https://shipilev.net/, @shipilev, or
anywhere else
60
GC debug logging
Restart your cluster with these options in
spark.executor.extraJavaOptions and
spark.driver.extraJavaOptions
-verbose:gc -XX:+PrintGC -XX:+PrintGCDateStamps 
-XX:+PrintGCTimeStamps -XX:+PrintGCDetails 
-XX:+PrintGCCause -XX:+PrintTenuringDistribution 
-XX:+PrintFlagsFinal
61
Parallel GC: throughput friendly
-XX:+UseParallelGC -XX:ParallelGCThreads=NUM_THREADS
• The heap size set using spark.driver.memory and
spark.executor.memory
• Defaults to one third Young Generation and two thirds Old Generation
• Number of threads does not scale 1:1 with number of cores
– Start with 8
– After 8 cores, use 5/8 of the remaining cores
– After 32 cores, use 5/16 of the remaining cores
62
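A small Scala sketch of the tiering described above (the 5/8 step matches the JVM's own default heuristic; the 5/16 tier past 32 cores follows this slide):

def parallelGCThreads(cores: Int): Int =
  if (cores <= 8) cores                        // one thread per core up to 8
  else if (cores <= 32) 8 + ((cores - 8) * 5) / 8
  else 23 + ((cores - 32) * 5) / 16            // 23 = thread count at 32 cores

// parallelGCThreads(16) == 13, parallelGCThreads(32) == 23, parallelGCThreads(64) == 33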
Parallel GC: sizing Young Generation
• Eden is 3/4 of young generation
• Each of the two survivor spaces is 1/8 of young generation
By default, -XX:NewRatio=2, meaning that Old Generation occupies 2/3
of the heap
• Increase NewRatio to give Old Generation more space (3 for
3/4 of the heap)
• Decrease NewRatio to give Young Generation more space (1
for 1/2 of the heap)
63
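One hedged way to apply this - the flag values below are illustrative, not a recommendation:

// give Young Generation half the heap on each executor
spark.executor.extraJavaOptions = -XX:+UseParallelGC -XX:NewRatio=1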
Parallel GC: sizing Old Generation
By default, spark.memory.fraction allows cached internal data to occupy 0.6 * (heap size - 300M). Old Generation needs to be bigger than the space claimed by spark.memory.fraction.
• Decrease spark.memory.storageFraction (default 0.5) to free
up more space for execution
• Increase Old Generation space to combat spilling to disk,
cache eviction
64
G1 GC: latency friendly
-XX:+UseG1GC -XX:ParallelGCThreads=X 
-XX:ConcGCThreads=(2*X)
Parallel GC threads are the "stop the world" worker threads. Defaults to the same
calculation as parallel GC; some articles recommend 8 + max(0, cores - 8) * 0.625.
Concurrent GC threads mark in parallel with the running application. The default of a quarter as many threads as used for parallel GC may be conservative for a large Spark application. Several articles recommend scaling this number of threads up in conjunction with a lower initiating heap occupancy.
Garbage First Garbage Collector Tuning (Monica Beckwith)
65
G1 GC logging
Same as shown for parallel GC, but also
-XX:+UnlockDiagnosticVMOptions 
-XX:+PrintAdaptiveSizePolicy 
-XX:+G1SummarizeConcMark
G1 offers a range of GC logging information on top of the
standard parallel GC logging options.
Collecting and reading G1 garbage collector logs - part 2 (Matt
Robson)
66
G1 Initiating heap occupancy
-XX:InitiatingHeapOccupancyPercent=35
By default, G1 GC will initiate garbage collection when the heap is 45 percent full. This can lead to
a situation where full GC is necessary before the less costly concurrent phase has run or
completed.
By triggering concurrent GC sooner and scaling up the number of threads available to perform the
concurrent work, the more aggressive concurrent phase can forestall full collections.
Best practices for successfully managing memory for Apache Spark applications on Amazon EMR
(Karunanithi Shanmugam)
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing (Eric Kaczmarek and
Liqi Yi, Intel)
67
G1 Region size
-XX:G1HeapRegionSize=16m
The heap defaults to region size between 1 and 32 MiB. For example, a heap with <= 32 GiB has a region size
of 8 MiB; one with <= 16 GiB has 4 MiB.
If you see Humongous Allocation in your GC logs, indicating an object which occupies > 50% of your current
region size, then consider increasing G1HeapRegionSize. Changing this setting is not recommended for most
cases because
• Increasing region size reduces the number of available regions, plus
• The additional cost of copying/cleaning up the larger regions may reduce throughput or increase latency
Most commonly caused by a dataset with very wide rows. If you can't improve G1 performance, switch back to
parallel GC.
Plumbr.io handbook: GC Tuning: In Practice: Other Examples: Humongous Allocations
68
G1 string deduplication
-XX:+UseStringDeduplication 
-XX:+PrintStringDeduplicationStatistics
May decrease your memory usage if you have a significant
number of duplicate String instances in memory.
JEP 192: String Deduplication in G1
69
Shuffle
• Scaling Apache Spark at Facebook (Ankit Agarwal and Sameer Agarwal)
• Spark Shuffle Deep Dive (Bo Yang)
These older presentations sometimes pertain to previous versions of Spark
but still have substantial value.
• Optimal Strategies for Large Scale Batch ETL Jobs (Emma Tang) - 2017
• Apache Spark @Scale: A 60 TB+ production use case from Facebook
(Sital Kedia, Shuojie Wang and Avery Ching) - 2016
• Apache Spark the fastest open source engine for sorting a petabyte
(Reynold Xin) - 2014
70
S3
• Best Practices Design Patterns: Optimizing Amazon S3
Performance (Mai-Lan Tomsen Bukovec, Andy Warfield, and
Tim Harris)
• Seven Tips for Using S3DistCp on Amazon EMR to Move
Data Efficiently Between HDFS and Amazon S3 (Illya
Yalovyy)
• Cost optimization through performance improvement of
S3DistCp (Sarang Anajwala)
71
S3: EMR
Write your data to HDFS and then create a separate step using s3DistCp to
copy the files to S3.
This utility is problematic for large clusters and large datasets:
• Primitive error handling
– Deals with being rate limited by S3 by.... trying harder, choking, failing
– No way to increase the number of failures allowed
– No way to distinguish between being rate limited and getting fatal backend
errors
• If any s3DistCp step fails, EMR job fails even if a later s3DistCp step
succeeds
72
Using s3DistCp on a large cluster
-D mapreduce.job.reduces=(numExecutors / 2)
The default number of reducers is one per executor - the Hadoop documentation says the "right" number is probably 0.95 or 1.75 times the cluster's reduce capacity. All three choices are bad for s3DistCp, where the
reduce phase of the job writes to S3. Experiment to figure out how much to scale down
the number of reducers so the data is copied off in a timely manner without too much
rate limiting.
On large jobs, we recommend running the s3DistCp step as many times as necessary to ensure all your data makes it off HDFS to S3 before the cluster shuts down.
Hadoop Map Reduce Tutorial: Map-Reduce User Interfaces
73
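A hedged example of such a step on EMR - the reducer count, paths, and bucket are placeholders:

s3-dist-cp -D mapreduce.job.reduces=200 --src hdfs:///output/run-01 --dest s3://your-bucket/output/run-01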
Databricks
fs.s3a.multipart.threshold = 2147483647 // default (in bytes)
fs.s3a.multipart.size = 104857600
fs.s3a.connection.maximum = min(clusterNodes, 500)
fs.s3a.connection.timeout = 60000 // default: 20000ms
fs.s3a.block.size = 134217728 // default: 32M - used for reading
fs.s3a.fast.upload = true // disable if writes are failing
// spark.stage.maxConsecutiveAttempts = 10 // default 4 - increase if writes are failing
The Databricks Runtime uses its own S3 committer code, which provides reliable performance writing directly to S3.
74
Hadoop 3.2.0
// https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/committers.html
fs.s3a.committer.name = directory
fs.s3a.committer.staging.conflict-mode = replace // replace == overwrite
fs.s3a.attempts.maximum = 20 // How many times we should retry commands on transient errors
fs.s3a.retry.throttle.limit = 20 // number of times to retry throttled request
fs.s3a.retry.throttle.interval = 1000ms
// Controls the maximum number of simultaneous connections to S3
fs.s3a.connection.maximum = ???
// Number of (part)uploads allowed to the queue before blocking additional uploads.
fs.s3a.max.total.tasks = ???
If you're lucky enough to have access to Hadoop 3.2.0, here are some highlights pertinent to large clusters.
75
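One hedged way to feed these to Spark is the spark.hadoop.* prefix, which passes settings through to the Hadoop configuration (values repeat the ones above; the committer itself also needs the Spark-side bindings described in the linked documentation):

spark.hadoop.fs.s3a.committer.name = directory
spark.hadoop.fs.s3a.committer.staging.conflict-mode = replace
spark.hadoop.fs.s3a.attempts.maximum = 20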
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
More Related Content

What's hot

What's New and Upcoming in HDFS - the Hadoop Distributed File System
What's New and Upcoming in HDFS - the Hadoop Distributed File SystemWhat's New and Upcoming in HDFS - the Hadoop Distributed File System
What's New and Upcoming in HDFS - the Hadoop Distributed File System
Cloudera, Inc.
 
Nn ha hadoop world.final
Nn ha hadoop world.finalNn ha hadoop world.final
Nn ha hadoop world.final
Hortonworks
 
Hw09 Monitoring Best Practices
Hw09   Monitoring Best PracticesHw09   Monitoring Best Practices
Hw09 Monitoring Best PracticesCloudera, Inc.
 
Strata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaStrata London 2019 Scaling Impala
Strata London 2019 Scaling Impala
Manish Maheshwari
 
5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance
Command Prompt., Inc
 
Apache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesApache Hadoop on Virtual Machines
Apache Hadoop on Virtual Machines
DataWorks Summit
 
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsBest Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Jignesh Shah
 
Deployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersDeployment and Management of Hadoop Clusters
Deployment and Management of Hadoop Clusters
Amal G Jose
 
Ambari Meetup: NameNode HA
Ambari Meetup: NameNode HAAmbari Meetup: NameNode HA
Ambari Meetup: NameNode HAHortonworks
 
Introduction to hazelcast
Introduction to hazelcastIntroduction to hazelcast
Introduction to hazelcast
Emin Demirci
 
Ensuring performance for real time packet processing in open stack white paper
Ensuring performance for real time packet processing in open stack white paperEnsuring performance for real time packet processing in open stack white paper
Ensuring performance for real time packet processing in open stack white paper
hptoga
 
DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...
DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...
DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...
DataStax
 
Postgres & Red Hat Cluster Suite
Postgres & Red Hat Cluster SuitePostgres & Red Hat Cluster Suite
Postgres & Red Hat Cluster Suite
EDB
 
Apache kafka configuration-guide
Apache kafka configuration-guideApache kafka configuration-guide
Apache kafka configuration-guide
Chetan Khatri
 
Hadoop on VMware
Hadoop on VMwareHadoop on VMware
Hadoop on VMware
Richard McDougall
 
Upgrading hadoop
Upgrading hadoopUpgrading hadoop
Upgrading hadoop
Shashwat Shriparv
 
Postgres on OpenStack
Postgres on OpenStackPostgres on OpenStack
Postgres on OpenStack
EDB
 
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDSAccelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Ceph Community
 

What's hot (20)

What's New and Upcoming in HDFS - the Hadoop Distributed File System
What's New and Upcoming in HDFS - the Hadoop Distributed File SystemWhat's New and Upcoming in HDFS - the Hadoop Distributed File System
What's New and Upcoming in HDFS - the Hadoop Distributed File System
 
Nn ha hadoop world.final
Nn ha hadoop world.finalNn ha hadoop world.final
Nn ha hadoop world.final
 
Hw09 Monitoring Best Practices
Hw09   Monitoring Best PracticesHw09   Monitoring Best Practices
Hw09 Monitoring Best Practices
 
Strata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaStrata London 2019 Scaling Impala
Strata London 2019 Scaling Impala
 
5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance
 
Apache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesApache Hadoop on Virtual Machines
Apache Hadoop on Virtual Machines
 
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsBest Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
 
Deployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersDeployment and Management of Hadoop Clusters
Deployment and Management of Hadoop Clusters
 
Ambari Meetup: NameNode HA
Ambari Meetup: NameNode HAAmbari Meetup: NameNode HA
Ambari Meetup: NameNode HA
 
Introduction to hazelcast
Introduction to hazelcastIntroduction to hazelcast
Introduction to hazelcast
 
Ensuring performance for real time packet processing in open stack white paper
Ensuring performance for real time packet processing in open stack white paperEnsuring performance for real time packet processing in open stack white paper
Ensuring performance for real time packet processing in open stack white paper
 
DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...
DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...
DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)...
 
Postgres & Red Hat Cluster Suite
Postgres & Red Hat Cluster SuitePostgres & Red Hat Cluster Suite
Postgres & Red Hat Cluster Suite
 
Concurrency
ConcurrencyConcurrency
Concurrency
 
Five steps perform_2013
Five steps perform_2013Five steps perform_2013
Five steps perform_2013
 
Apache kafka configuration-guide
Apache kafka configuration-guideApache kafka configuration-guide
Apache kafka configuration-guide
 
Hadoop on VMware
Hadoop on VMwareHadoop on VMware
Hadoop on VMware
 
Upgrading hadoop
Upgrading hadoopUpgrading hadoop
Upgrading hadoop
 
Postgres on OpenStack
Postgres on OpenStackPostgres on OpenStack
Postgres on OpenStack
 
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDSAccelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
 

Similar to Apache Spark At Scale in the Cloud

Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Databricks
 
Spark Tips & Tricks
Spark Tips & TricksSpark Tips & Tricks
Spark Tips & Tricks
Jason Hubbard
 
Lc3 beijing-june262018-sahdev zala-guangya
Lc3 beijing-june262018-sahdev zala-guangyaLc3 beijing-june262018-sahdev zala-guangya
Lc3 beijing-june262018-sahdev zala-guangya
Sahdev Zala
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
Splunk
 
Chicago spark meetup-april2017-public
Chicago spark meetup-april2017-publicChicago spark meetup-april2017-public
Chicago spark meetup-april2017-public
Guru Dharmateja Medasani
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
thelabdude
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
DataStax Academy
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
MongoDB
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
Splunk
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Julien Anguenot
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
DataStax Academy
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
Databricks
 
Network support for resource disaggregation in next-generation datacenters
Network support for resource disaggregation in next-generation datacentersNetwork support for resource disaggregation in next-generation datacenters
Network support for resource disaggregation in next-generation datacenters
Sangjin Han
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
Adarsh Pannu
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Jen Aman
 
Where Django Caching Bust at the Seams
Where Django Caching Bust at the SeamsWhere Django Caching Bust at the Seams
Where Django Caching Bust at the Seams
Concentric Sky
 
Data has a better idea the in-memory data grid
Data has a better idea   the in-memory data gridData has a better idea   the in-memory data grid
Data has a better idea the in-memory data grid
Bogdan Dina
 
Architecture at Scale
Architecture at ScaleArchitecture at Scale
Architecture at Scale
Elasticsearch
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
Databricks
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated Environment
Jean-François Gagné
 

Similar to Apache Spark At Scale in the Cloud (20)

Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
 
Spark Tips & Tricks
Spark Tips & TricksSpark Tips & Tricks
Spark Tips & Tricks
 
Lc3 beijing-june262018-sahdev zala-guangya
Lc3 beijing-june262018-sahdev zala-guangyaLc3 beijing-june262018-sahdev zala-guangya
Lc3 beijing-june262018-sahdev zala-guangya
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 
Chicago spark meetup-april2017-public
Chicago spark meetup-april2017-publicChicago spark meetup-april2017-public
Chicago spark meetup-april2017-public
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
 
Network support for resource disaggregation in next-generation datacenters
Network support for resource disaggregation in next-generation datacentersNetwork support for resource disaggregation in next-generation datacenters
Network support for resource disaggregation in next-generation datacenters
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Where Django Caching Bust at the Seams
Where Django Caching Bust at the SeamsWhere Django Caching Bust at the Seams
Where Django Caching Bust at the Seams
 
Data has a better idea the in-memory data grid
Data has a better idea   the in-memory data gridData has a better idea   the in-memory data grid
Data has a better idea the in-memory data grid
 
Architecture at Scale
Architecture at ScaleArchitecture at Scale
Architecture at Scale
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated Environment
 

Recently uploaded

Vaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdfVaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdf
Kamal Acharya
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
Event Management System Vb Net Project Report.pdf
Event Management System Vb Net  Project Report.pdfEvent Management System Vb Net  Project Report.pdf
Event Management System Vb Net Project Report.pdf
Kamal Acharya
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
ankuprajapati0525
 
LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
ssuser9bd3ba
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
Robbie Edward Sayers
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
Forklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella PartsForklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella Parts
Intella Parts
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
abh.arya
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 

Recently uploaded (20)

Vaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdfVaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdf
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
Event Management System Vb Net Project Report.pdf
Event Management System Vb Net  Project Report.pdfEvent Management System Vb Net  Project Report.pdf
Event Management System Vb Net Project Report.pdf
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
 
LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
Forklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella PartsForklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella Parts
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 

Apache Spark At Scale in the Cloud

  • 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  • 2. Rose Toomey, Coatue Management Spark At Scale In the Cloud #UnifiedDataAnalytics #SparkAISummit
  • 3. About me NYC. Finance. Technology. Code. • Each job I wrote code but found that the data challenges just kept growing – Lead API Developer at Gemini Trust – Director at Novus Partners • Now: coding and working with data full time – Software Engineer at Coatue Management
  • 4. How do you process this… Numbers are approximate. • Dataset is 35+ TiB raw • Input files are 80k+ unsplittable compressed row-based format with heavy skew, deeply nested directory structure • Processing results in 275+ billion rows cached to disk • Lots of data written back out to S3 – Including stages ending in sustained writes of tens of TiB 4
  • 5. On a very big Spark cluster… Sometimes you just need to bring the entire dataset into memory. The more nodes a Spark cluster has, the more important configuration tuning becomes. Even more so in the cloud, where you will regularly experience I/O variance and unreliable nodes.
  • 6. In the cloud? • Infrastructure management is hard – Scaling resources and bandwidth in a datacenter is not instant – Spark/Hadoop clusters are not islands – you’re managing an entire ecosystem of supporting players • Optimizing Spark jobs is hard Let’s limit the number of hard things we’re going to tackle at once.
  • 7. Things going wrong at scale Everything is relative. In smaller clusters, these configurations worked fine. • Everything is waiting on everything else because Netty doesn't have enough firepower to shuffle faster • Speculation meets skew and relaunches the very slowest parts of a join, leaving most of the cluster idle • An external service rate limits, which causes blacklisting to sideline most of a perfectly good cluster 7
  • 8. Spark at scale in the cloud Building • Composition • Structure Scaling • Memory • Networking • S3 Scheduling • Speculation • Blacklisting Tuning Patience Tolerance Acceptance
  • 9. Putting together a big Spark cluster • What kind of nodes should the cluster have? Big? Small? Medium? • What's your resource limitation for the number of executors? – Just memory (standalone) – Both memory and vCPUs (YARN) • Individual executors should have how much memory and how many virtual CPUs?Galactic Wreckage in Stephan's Quintet 9
  • 10. One Very Big Standalone Node One mega instance configured with many "just right" executors, each provisioned with • < 32 GiB heap (sweet spot for GC) • 5 cores (for good throughput) • Minimizes shuffle overhead • Like the pony, not offered by your cloud provider. Also, poor fault tolerance. 10
  • 11. Multiple Medium-sized Nodes When looking at medium sized nodes, we have a choice: • Just one executor • Multiple executors But a single executor might not be the best resource usage: • More cores on a single executor is not necessarily better • When using a cluster manager like YARN, more executors could be a more efficient use of CPU and memory 11
  • 12. Many Small Nodes 12 • 500+ small nodes • Each node over-provisioned relative to multiple executor per node configurations • Single executor per node • Most fault tolerant but big communications overhead “Desperate affairs require desperate measures.” Vice Admiral Horatio Nelson
  • 13. Why ever choose the worst solution? Single executor per small (or medium) node is the worst configuration for cost, provisioning, and resource usage. Why not recommend against it? • Resilient to node degradation and loss • Quick transition to production: relative over-provisioning of resources to each executor behaves more like a notebook • Awkward instance sizes may provision more quickly than larger instances 13
  • 14. Onward! Now you have your cluster composition in mind, you’ll need to scale up your base infrastructure to support the number of nodes: • Memory and garbage collection • Tune RPC for cluster communications • Where do you put very large datasets? • How do you get them off the cluster? • No task left behind: scheduling in difficult times 14
  • 15. Spark at scale in the cloud Building • Composition • Structure Scaling • Memory • Networking • S3 Scheduling • Speculation • Blacklisting Tuning Patience Tolerance Acceptance
  • 16. Spark memory management SPARK-1000: Consolidate storage and execution memory management • NewRatio controls Young/Old proportion • spark.memory.fraction sets storage and execution space to ~60% tenured space 16 Young Generation 1/3 Old Generation 2/3 300m reserved spark.memory.fraction ~60% 50% execution dynamic – will take more 50% storage spark.memory.storageFraction ~40% Spark metadata, user data structures, OOM safety
  • 17. 17
  • 18. Field guide to Spark GC tuning • Lots of minor GC - easy fix – Increase Eden space (high allocation rate) • Lots of major GC - need to diagnose the trigger – Triggered by promotion - increase Eden space – Triggered by Old Generation filling up - increase Old Generation space or decrease spark.memory.fraction • Full GC before stage completes – Trigger minor GC earlier and more often 18
  • 19. Full GC tailspin Balance sizing up against tuning code • Switch to bigger and/or more nodes • Look for slow running stages caused by avoidable shuffle, tune joins and aggregation operations • Checkpoint both to preserve work at strategic points but also to truncate DAG lineage • Cache to disk only • Trade CPU for memory by compressing data in memory using spark.rdd.compress 19
  • 20. Which garbage collector? Throughput or latency? • ParallelGC favors throughput • G1GC is low latency – Shiny new things like string deduplication – vulnerable to wide rows Whichever you choose, collect early and often. 20
  • 21. Where to cache big datasets • To disk. Which is slow. • But frees up as much tenured space as possible for execution, and storing things which must be in memory – internal metadata – user data structures – broadcasting the skew side of joins 21
  • 23. Perils of caching to disk 19/04/13 01:27:33 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_48_27005 ! When you lose an executor, you lose all the cached blocks stored by that executor even if the node is still running. • If lineage is gone, the entire job will fail • If lineage is present, RDD#getOrCompute tries to compensate for the missing blocks by re-ingesting the source data. While it keeps your job from failing, this could introduce enormous slowdowns if the source data is skewed, your ingestion process is complex, etc. 23
  • 24. Self healing block management // use this with replication >= 2 when caching to disk in non-distributed filesystem spark.storage.replication.proactive = true Pro-active block replenishment in case of node/executor failures https://issues.apache.org/jira/browse/SPARK-15355 https://github.com/apache/spark/pull/14412 24
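A minimal sketch of pairing proactive block replenishment with a replicated disk-only storage level (the S3 path is hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder
  .config("spark.storage.replication.proactive", "true")       // re-replicate blocks lost with an executor
  .getOrCreate()

val cached = spark.read.parquet("s3a://bucket/huge-dataset")    // hypothetical source
  .persist(StorageLevel.DISK_ONLY_2)                            // 2 replicas, so lost blocks can be rebuilt from the survivor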
  • 25. Spark at scale in the cloud Building • Composition • Structure Scaling • Memory • Networking • S3 Scheduling • Speculation • Blacklisting Tuning Patience Tolerance Acceptance
  • 26. Tune RPC for cluster communications The Netty server processing RPC requests is the backbone of both the authentication and shuffle services. Insufficient RPC resources cause slow-motion mayhem: clients disassociate, operations time out. org.apache.spark.network.util.TransportConf is the shared config for both the shuffle and authentication services. Ruth Teitelbaum and Marlyn Meltzer reprogramming ENIAC, 1946 26
  • 27. Scaling RPC // used for auth spark.rpc.io.serverThreads = coresPerDriver * rpcThreadMultiplier // used for shuffle spark.shuffle.io.serverThreads = coresPerDriver * rpcThreadMultiplier Where "RPC thread multiplier" is a scaling factor to increase the service's thread pool. • 8 is aggressive, might cause issues • 4 is moderately aggressive • 2 is recommended (start here, benchmark, then increase) • 1 (number of vCPU cores) is default but is too small for a large cluster 27
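A sketch of wiring the multiplier in programmatically (the core count and multiplier are illustrative assumptions):

import org.apache.spark.SparkConf

val coresPerDriver      = 16   // assumed size of the host running the driver / shuffle service
val rpcThreadMultiplier = 2    // start here, benchmark, then increase

val serverThreads = (coresPerDriver * rpcThreadMultiplier).toString
val conf = new SparkConf()
  .set("spark.rpc.io.serverThreads",     serverThreads)   // auth service
  .set("spark.shuffle.io.serverThreads", serverThreads)   // shuffle service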
  • 28. Shuffle The definitive presentation on shuffle tuning: Tuning Apache Spark for Large-Scale Workloads (Gaoxiang Liu and Sital Kedia) So this section focuses on • Some differences to configurations presented in Liu and Kedia's presentation, as well as • Configurations that weren't shown in this presentation 28
  • 29. Strategy for lots of shuffle clients 1. Scale the server way up // mentioned in Liu/Kedia presentation but now deprecated // spark.shuffle.service.index.cache.entries = 2048 // default: 100 MiB spark.shuffle.service.index.cache.size = 256m // length of accept queue. default: 64 spark.shuffle.io.backLog = 8192 // default (not increased by spark.network.timeout) spark.rpc.lookupTimeout = 120s 29
  • 30. Strategy for lots of shuffle clients 2. Make clients more patient and more fault tolerant, with fewer simultaneous requests in flight spark.reducer.maxReqsInFlight = 5 // default: Int.MaxValue spark.shuffle.io.maxRetries = 10 // default: 3 spark.shuffle.io.retryWait = 60s // default: 5s 30
  • 31. Strategy for lots of shuffle clients spark.shuffle.io.numConnectionsPerPeer = 1 Scaling this up conservatively for multiple executor per node configurations can be helpful. Not recommended to change the default for single executor per node. 31
  • 32. Shuffle partitions spark.sql.shuffle.partitions = max(1, nodes - 1) * coresPerExecutor * parallelismPerCore where parallelism per core is some hyperthreading factor, let's say 2. This formula isn't ideal for large shuffles, although it can be adjusted. Apache Spark Core—Deep Dive—Proper Optimization (Daniel Tomes) recommends setting this value to max(cluster executor cores, shuffle stage input / 200 MB). That translates to roughly 5242 partitions per TiB. Highly aggressive shuffle optimization is required for a large dataset on a cluster with a large number of executors. 32
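A sketch comparing the two heuristics and taking the larger (the cluster and shuffle sizes are illustrative assumptions):

val nodes              = 500
val coresPerExecutor   = 5
val parallelismPerCore = 2
val shuffleInputMiB    = 4L * 1024 * 1024          // ~4 TiB of shuffle stage input

val byCores = math.max(1, nodes - 1) * coresPerExecutor * parallelismPerCore
val bySize  = (shuffleInputMiB / 200).toInt        // ~200 MiB per partition, i.e. ~5242 partitions per TiB
val shufflePartitions = math.max(byCores, bySize)
// spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions)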
  • 33. Kill Spill spark.shuffle.spill.numElementsForceSpillThreshold = 25000000 spark.sql.windowExec.buffer.spill.threshold = 25000000 spark.sql.sortMergeJoinExec.buffer.spill.threshold = 25000000 • Spill is the number one cause of poor performance on very large Spark clusters. These settings control when Spark spills data from memory to disk – the defaults are a bad choice! • Set these to a big Integer value – start with 25000000 and increase if you can. More is more. • SPARK-21595: Separate thresholds for buffering and spilling in ExternalAppendOnlyUnsafeRowArray
  • 34. Scaling AWS S3 Writes Hadoop AWS S3 support in 3.2.0 is amazing • Especially the new S3A committers https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/index.html EMR: write to HDFS and copy off using s3DistCp (limit reducers if necessary) Databricks: writing directly to S3 just works First NASA ISINGLASS rocket launch 34
  • 35. Spark at scale in the cloud Building • Composition • Structure Scaling • Memory • Services • S3 Scheduling • Speculation • Blacklisting Tuning Patience Tolerance Acceptance
  • 36. Task Scheduling Spark's powerful task scheduling settings can interact in unexpected ways at scale. • Dynamic resource allocation • External shuffle • Speculative Execution • Blacklisting • Task reaper Apollo 13 Mailbox at Mission Control 36
  • 37. Dynamic resource allocation Dynamic resource allocation benefits a multi-tenant cluster where multiple applications can share resources. If you have an ETL pipeline running on a large transient Spark cluster, dynamic allocation is not useful to your single application. Note that even in the first case, when your application no longer needs some executors, those cluster nodes don't get spun down: • Dynamic allocation requires an external shuffle service • The node stays live and shuffle blocks continue to be served from it 37
  • 38. External shuffle service spark.shuffle.service.enabled = true spark.shuffle.registration.timeout = 60000 // default: 5000ms spark.shuffle.registration.maxAttempts = 5 // default: 3 Even without dynamic allocation, an external shuffle service may be a good idea. • If you lose executors through dynamic allocation, the external shuffle process still serves up their blocks. • The external shuffle service could be more responsive than the executor itself However, the default registration values are insufficient for a large, busy cluster: SPARK-20640 Make rpc timeout and retry for shuffle registration configurable 38
  • 39. Speculative execution When speculative execution works as intended, tasks running slowly due to transient node issues don't bog down that stage indefinitely. • Spark calculates the median execution time of all tasks in the stage • spark.speculation.quantile - don't start speculating until this percentage of tasks are complete (default 0.75) • spark.speculation.multiplier - expressed as a multiple of the median execution time, this is how slow a task must be to be considered for speculation • Whichever copy is still running when the first one finishes gets killed 39
  • 40. One size does not fit all spark.speculation = true spark.speculation.quantile = 0.8 //default: 0.75 spark.speculation.multiplier = 4 // default: 1.5 These were our standard speculative execution settings. They worked "fine" in most of our pipelines. But they worked fine because the median size of the tasks at 80% was OK. What happens when reasonable settings meet unreasonable data? 40
  • 41. 21.2 TB shuffle, 20% of tasks killed 41
  • 42. Speculation: unintended consequences The median task length is based on the fast 80% - but due to heavy skew, this estimate is bad! It causes the scheduler to take the worst part of the job and … launch more copies of the worst, longest-running tasks ... one of which then gets killed. spark.speculation = true spark.speculation.quantile = 0.90 // default: 0.75 - start later (might get a better estimate) spark.speculation.multiplier = 6 // default: 1.5 - require a task to be really bad The solution was two-fold: • Start speculative execution later (increase the quantile) and require a greater slowness multiplier • Do something about the skew (see the salting sketch below) 42
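"Do something about the skew" can mean salting the join key. A hedged sketch (the table shapes and bucket count are assumptions for illustration, not our production code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("salted-join-sketch").getOrCreate()
import spark.implicits._

val large = spark.range(0, 100000000L).withColumn("key", lit(1))   // hypothetical heavily skewed side
val small = Seq((1, "a"), (2, "b")).toDF("key", "value")           // hypothetical dimension side

val saltBuckets = 32
val saltedLarge = large.withColumn("salt", (rand() * saltBuckets).cast("int"))
val saltedSmall = small.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

val joined = saltedLarge.join(saltedSmall, Seq("key", "salt"))     // the skewed key is now spread over 32 partitions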
  • 43. Benefits of speculative execution • Speculation can be very helpful when the application is interacting with an external service. Example: writing to S3 • When speculation kills a task that was going to fail anyway, it doesn't count against the failed tasks for that stage/executor/node/job • Clusters are not tuned in a day! Speculation can help pave over slowdowns caused by scaling issues • Useful canary: when you see tasks being intentionally killed in any quantity, it's worth investigating why 43
  • 44. Blacklisting spark.blacklist.enabled = true spark.blacklist.task.maxTaskAttemptsPerExecutor = 1 // task blacklisted from executor spark.blacklist.stage.maxFailedTasksPerExecutor = 2 // executor blacklisted from stage // how many different tasks must fail in successful task sets before executor // blacklisted from application spark.blacklist.application.maxFailedTasksPerExecutor = 2 spark.blacklist.timeout = 1h // executor removed from blacklist, takes new tasks Blacklisting prevents Spark from scheduling tasks on executors/nodes which have failed too many times in the current stage. The default failure counts are too conservative when using flaky external services. Let's see how quickly it can add up... 44
  • 46. Blacklisting gone wrong • While writing three very large datasets to S3, something went wrong about 17 TiB in • 8600+ errors trying to write to S3 in the space of eight minutes, distributed across 1000 nodes – Some executors back off and retry, succeed – Speculative execution kicks in, softening the blow – But all the nodes quickly accumulate at least two failed tasks; many have more and get blacklisted • Eventually translating to four failed tasks, killing the job 46
  • 48. Don't blacklist too soon • We enabled blacklisting but didn't adjust the defaults because - we never "needed" to before • Post mortem showed cluster blocks were too large for our s3a settings spark.blacklist.enabled = true spark.blacklist.stage.maxFailedTasksPerExecutor = 8 // default: 2 spark.blacklist.application.maxFailedTasksPerExecutor = 24 // default: 2 spark.blacklist.timeout = 15m // default: 1h Solution was to • Make blacklisting a lot more tolerant of failure • Repartition data on write for better block size • Adjust s3a settings to raise multipart upload size 48
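A hedged sketch of the repartition-on-write and multipart-size remediation (bucket, sizes, and partition count are illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("s3-write-sketch").getOrCreate()
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.multipart.size", "104857600")            // 100 MiB parts - fewer, larger multipart uploads
hc.set("fs.s3a.multipart.threshold", "2147483647")

val results = spark.read.parquet("hdfs:///data/results") // hypothetical dataset to publish
results
  .repartition(20000)                                    // aim for evenly sized output blocks
  .write
  .mode("overwrite")
  .parquet("s3a://bucket/output/results")                // hypothetical destination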
  • 49. Don't fear the reaper spark.task.reaper.enabled = true spark.task.reaper.killTimeout = 180s // default: -1 (never kills the JVM, so the executor can't self-destruct) The task reaper monitors to make sure tasks that get interrupted or killed actually shut down. On a large job, give a little extra time before killing the JVM • If you've increased timeouts, the task may need more time to shut down cleanly • If the task reaper kills the JVM abruptly, you could lose cached blocks SPARK-18761 Uncancellable / unkillable tasks may starve jobs of resources 49
  • 50. Spark at scale in the cloud Building • Composition • Structure Scaling • Memory • Services • S3 Scheduling • Speculation • Blacklisting Tuning Patience Tolerance Acceptance
  • 51. Increase tolerance • If you find a timeout or number of retries, raise it • If you find a buffer, backlog, queue, or threshold, increase it • If you have an MR job with a number of reducers trying to use a service concurrently in a large cluster – Either limit the number of active tasks per reducer, or – Limit the number of reducers active at the same time 51
  • 52. Be more patient // default - might be too low for a large cluster under load spark.network.timeout = 120s Spark has a lot of different networking timeouts. This is the biggest knob to turn: increasing this increases many settings at once. (This setting does not increase the spark.rpc.timeout used by shuffle and authentication services.) 52
  • 53. Executor heartbeat timeouts spark.executor.heartbeatInterval = 10s // default spark.executor.heartbeatInterval should be significantly less than spark.network.timeout. Executors missing heartbeats usually signify a memory issue, not a network problem. • Increase the number of partitions in the dataset • Remediate skew causing some partition(s) to be much larger than the others 53
  • 54. Be resilient to failure spark.stage.maxConsecutiveAttempts = 10 // default: 4 // default: 4 (would go higher for cloud storage misbehavior) spark.task.maxFailures = 12 spark.max.fetch.failures.per.stage = 10 // default: 4 (helps shuffle) Increase the number of failures your application can accept at the task and stage level. Use blacklisting and speculation to your advantage. It's better to concede some extra resources to a stage which eventually succeeds than to fail the entire job: • Note that tasks killed through speculation - which might otherwise have failed - don't count against you here. • Blacklisting - which in the best case removes from a stage or job a host which can't participate anyway - also helps proactively keep this count down. Just be sure to raise the number of failures there too! 54
  • 55. Koan A Spark job that is broken is only a special case of a Spark job that is working. Koan Mu calligraphy by Brigitte D'Ortschy is licensed under CC BY 3.0 55
  • 56. Interested? • What we do: data engineering @ Coatue ‒ Terabyte scale, billions of rows ‒ Lambda architecture ‒ Functional programming • Stack ‒ Scala (cats, shapeless, fs2, http4s) ‒ Spark / Hadoop / EMR / Databricks ‒ Data warehouses ‒ Python / R / Tableau ‒ Chat with me or email: rtoomey@coatue.com ‒ Twitter: @prasinous 56
  • 58. Desirable heap size for executors spark.executor.memory = ??? The JVM flag -XX:+UseCompressedOops allows you to use 4-byte pointers instead of 8 (on by default in JDK 7+). • < 32 GiB: good for prompt GC, supports compressed OOPs • 32-48 GiB: the "dead zone" - without compressed OOPs above 32 GiB, you need almost 48 GiB to hold the same number of objects • 49-64+ GiB: very large joins, or the special case of wide rows with G1GC 58
  • 59. How many concurrent tasks per executor? spark.executor.cores = ??? Defaults to the number of physical cores, but represents the maximum number of concurrent tasks that can run on a single executor. • < 2: too few cores - doesn't make good use of parallelism • 2-4: recommended size for "most" Spark apps • 5: HDFS client performance tops out • > 8: too many cores - overhead from context switching outweighs the benefit 59
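A hedged sketch combining both sizing tables, for a hypothetical node with 16 vCPUs and 128 GiB of RAM (all numbers are assumptions to adjust for your instance type):

import org.apache.spark.SparkConf

// 3 executors per node: 3 * 5 = 15 cores leaves one core for the OS and daemons,
// and 3 * 30 GiB of heap leaves headroom for memory overhead and the page cache.
val conf = new SparkConf()
  .set("spark.executor.cores",  "5")    // HDFS client throughput sweet spot
  .set("spark.executor.memory", "30g")  // stays under the 32 GiB compressed-OOPs ceiling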
  • 60. Memory • Spark docs: Garbage Collection Tuning • Distribution of Executors, Cores and Memory for a Spark Application running in Yarn (spoddutur.github.io/spark-notes) • How-to: Tune Your Apache Spark Jobs (Part 2) - (Sandy Ryza) • Why Your Spark Applications Are Slow or Failing, Part 1: Memory Management (Rishitesh Mishra) • Why 35GB Heap is Less Than 32GB – Java JVM Memory Oddities (Fabian Lange) • Everything by Aleksey Shipilëv at https://shipilev.net/, @shipilev, or anywhere else 60
  • 61. GC debug logging Restart your cluster with these options in spark.executor.extraJavaOptions and spark.driver.extraJavaOptions -verbose:gc -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintGCCause -XX:+PrintTenuringDistribution -XX:+PrintFlagsFinal 61
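A sketch of applying those flags programmatically before restarting (same flag list as above):

import org.apache.spark.SparkConf

val gcLogging = Seq(
  "-verbose:gc", "-XX:+PrintGC", "-XX:+PrintGCDateStamps", "-XX:+PrintGCTimeStamps",
  "-XX:+PrintGCDetails", "-XX:+PrintGCCause", "-XX:+PrintTenuringDistribution",
  "-XX:+PrintFlagsFinal"
).mkString(" ")

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", gcLogging)
  .set("spark.driver.extraJavaOptions",   gcLogging)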
  • 62. Parallel GC: throughput friendly -XX:+UseParallelGC -XX:ParallelGCThreads=NUM_THREADS • The heap size is set using spark.driver.memory and spark.executor.memory • Defaults to one third Young Generation and two thirds Old Generation • The number of threads does not scale 1:1 with the number of cores – Start with 8 – After 8 cores, use 5/8 of the remaining cores – After 32 cores, use 5/16 of the remaining cores 62
  • 63. Parallel GC: sizing Young Generation • Eden is 3/4 of young generation • Each of the two survivor spaces is 1/8 of young generation By default, -XX:NewRatio=2, meaning that Old Generation occupies 2/3 of the heap • Increase NewRatio to give Old Generation more space (3 for 3/4 of the heap) • Decrease NewRatio to give Young Generation more space (1 for 1/2 of the heap) 63
  • 64. Parallel GC: sizing Old Generation By default, spark.memory.fraction allows cached internal data to occupy 0.6 * (heap size - 300M). Old Generation needs to be bigger than spark.memory.fraction. • Decrease spark.memory.storageFraction (default 0.5) to free up more space for execution • Increase Old Generation space to combat spilling to disk, cache eviction 64
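A hedged example pulling the parallel GC settings together for a 16-core executor (the thread count follows the 8 + 5/8 rule above; NewRatio and the fractions are assumptions to validate against your GC logs):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseParallelGC -XX:ParallelGCThreads=13 -XX:NewRatio=2")   // 8 + (16 - 8) * 5/8 = 13 threads
  .set("spark.memory.fraction", "0.6")           // keep the unified region inside Old Generation
  .set("spark.memory.storageFraction", "0.4")    // example: free up more space for execution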
  • 65. G1 GC: latency friendly -XX:+UseG1GC -XX:ParallelGCThreads=X -XX:ConcGCThreads=(2*X) Parallel GC threads are the "stop the world" worker threads. Defaults to the same calculation as parallel GC; some articles recommend 8 + max(0, cores - 8) * 0.625. Concurrent GC threads mark in parallel with the running application. The default of a quarter as many threads as used for parallel GC may be conservative for a large Spark application. Several articles recommend scaling this number of threads up in conjunction with a lower initiating heap occupancy. Garbage First Garbage Collector Tuning (Monica Beckwith) 65
  • 66. G1 GC logging Same as shown for parallel GC, but also -XX:+UnlockDiagnosticVMOptions -XX:+PrintAdaptiveSizePolicy -XX:+G1SummarizeConcMark G1 offers a range of GC logging information on top of the standard parallel GC logging options. Collecting and reading G1 garbage collector logs - part 2 (Matt Robson) 66
  • 67. G1 Initiating heap occupancy -XX:InitiatingHeapOccupancyPercent=35 By default, G1 GC will initiate garbage collection when the heap is 45 percent full. This can lead to a situation where full GC is necessary before the less costly concurrent phase has run or completed. By triggering concurrent GC sooner and scaling up the number of threads available to perform the concurrent work, the more aggressive concurrent phase can forestall full collections. Best practices for successfully managing memory for Apache Spark applications on Amazon EMR (Karunanithi Shanmugam) Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing (Eric Kaczmarek and Liqi Yi, Intel) 67
  • 68. G1 Region size -XX:G1HeapRegionSize=16m The heap defaults to a region size between 1 and 32 MiB. For example, a heap with <= 32 GiB has a region size of 8 MiB; one with <= 16 GiB has 4 MiB. If you see Humongous Allocation in your GC logs, indicating an object which occupies > 50% of your current region size, then consider increasing G1HeapRegionSize. Humongous allocations are most commonly caused by a dataset with very wide rows. Changing this setting is not recommended for most cases because • Increasing region size reduces the number of available regions, plus • The additional cost of copying/cleaning up the larger regions may reduce throughput or increase latency If you can't improve G1 performance, switch back to parallel GC. Plumbr.io handbook: GC Tuning: In Practice: Other Examples: Humongous Allocations 68
  • 69. G1 string deduplication -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics May decrease your memory usage if you have a significant number of duplicate String instances in memory. JEP 192: String Deduplication in G1 69
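A hedged G1 example for the same 16-core executor (every value here is an assumption to validate against GC logs; only add G1HeapRegionSize if you actually see Humongous Allocations):

import org.apache.spark.SparkConf

val g1Flags = Seq(
  "-XX:+UseG1GC",
  "-XX:ParallelGCThreads=13",                  // stop-the-world workers, same count as parallel GC
  "-XX:ConcGCThreads=8",                       // scaled up from the ParallelGCThreads / 4 default
  "-XX:InitiatingHeapOccupancyPercent=35",     // start concurrent marking before the 45% default
  "-XX:G1HeapRegionSize=16m",                  // only if GC logs show Humongous Allocations
  "-XX:+UseStringDeduplication"
).mkString(" ")

val conf = new SparkConf().set("spark.executor.extraJavaOptions", g1Flags)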
  • 70. Shuffle • Scaling Apache Spark at Facebook (Ankit Agarwal and Sameer Agarwal) • Spark Shuffle Deep Dive (Bo Yang) These older presentations sometimes pertain to previous versions of Spark but still have substantial value. • Optimal Strategies for Large Scale Batch ETL Jobs (Emma Tang) - 2017 • Apache Spark @Scale: A 60 TB+ production use case from Facebook (Sital Kedia, Shuojie Wang and Avery Ching) - 2016 • Apache Spark the fastest open source engine for sorting a petabyte (Reynold Xin) - 2014 70
  • 71. S3 • Best Practices Design Patterns: Optimizing Amazon S3 Performance (Mai-Lan Tomsen Bukovec, Andy Warfield, and Tim Harris) • Seven Tips for Using S3DistCp on Amazon EMR to Move Data Efficiently Between HDFS and Amazon S3 (Illya Yalovyy) • Cost optimization through performance improvement of S3DistCp (Sarang Anajwala) 71
  • 72. S3: EMR Write your data to HDFS and then create a separate step using s3DistCp to copy the files to S3. This utility is problematic for large clusters and large datasets: • Primitive error handling – Deals with being rate limited by S3 by.... trying harder, choking, and failing – No way to increase the number of failures allowed – No way to distinguish between being rate limited and getting fatal backend errors • If any s3DistCp step fails, the EMR job fails even if a later s3DistCp step succeeds 72
  • 73. Using s3DistCp on a large cluster -D mapreduce.job.reduces=(numExecutors / 2) The default number of reducers is one per executor - the Hadoop documentation says the "right" number is probably 0.95 or 1.75 times the cluster's reduce capacity. All three choices are bad for s3DistCp, where the reduce phase of the job writes to S3. Experiment to figure out how much to scale down the number of reducers so the data is copied off in a timely manner without too much rate limiting. On large jobs, run the s3DistCp step as many times as necessary to ensure all your data makes it off HDFS to S3 before the cluster shuts down. Hadoop Map Reduce Tutorial: Map-Reduce User Interfaces 73
  • 74. Databricks fs.s3a.multipart.threshold = 2147483647 // default (in bytes) fs.s3a.multipart.size = 104857600 fs.s3a.connection.maximum = min(clusterNodes, 500) fs.s3a.connection.timeout = 60000 // default: 20000ms fs.s3a.block.size = 134217728 // default: 32M - used for reading fs.s3a.fast.upload = true // disable if writes are failing // spark.stage.maxConsecutiveAttempts = 10 // default: 4 - increase if writes are failing The Databricks Runtime uses its own S3 committer code, which provides reliable performance writing directly to S3. 74
  • 75. Hadoop 3.2.0 // https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/committers.html fs.s3a.committer.name = directory fs.s3a.committer.staging.conflict-mode = replace // replace == overwrite fs.s3a.attempts.maximum = 20 // how many times to retry commands on transient errors fs.s3a.retry.throttle.limit = 20 // number of times to retry a throttled request fs.s3a.retry.throttle.interval = 1000ms // controls the maximum number of simultaneous connections to S3 fs.s3a.connection.maximum = ??? // number of (part) uploads allowed in the queue before blocking additional uploads fs.s3a.max.total.tasks = ??? If you're lucky enough to have access to Hadoop 3.2.0, here are some highlights pertinent to large clusters. 75