Emma Tang, Neustar
Optimal Strategies for Large-Scale Batch ETL Jobs
#EUdev3 October 2017
2#EUdev3
https://www.neustar.biz/marketing
Neustar
• Help the world’s most valuable brands
understand and target their consumers both
online and offline
• Maximize ROI on Ad spend
• Billions of user events per day, petabytes of data
3#EUdev3
Architecture (simplified view)
4#EUdev3
Batch ETL
• Runs on a schedule or is triggered programmatically
• Aim for complete utilization of cluster resources,
esp. memory and CPU
5#EUdev3
Why Batch?
• We care about historical state
• We don’t have SLAs other than 1-3x daily delivery
• Efficient, tuned use of resources for cost efficiency
6#EUdev3
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
7#EUdev3
The attribution problem
• At Neustar, we process large quantities of ad
events
• Familiar events like: impressions, clicks,
conversions
• Which impression/click contributed to the conversion?
8#EUdev3
Example attribution
• Alice goes to her favorite news site, and sees 3
ads – impressions
• She clicks on one of them that leads to Macy’s –
click
• She buys something on Macy’s – conversion
• Her purchase can be attributed to the click and
impression events
9#EUdev3
The approach
• Join conversions with impressions and clicks on userId (see the sketch below)
• Go through each user and attribute conversions to the correct target event (impression/click)
• The latest target events are valued more, so timestamp matters
10#EUdev3
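For orientation, here is a minimal sketch of this approach in the Spark Java API used later in the deck. The Conversion, TargetEvent, and Attribution types and the attribute(...) helper are hypothetical stand-ins, not the production code.

import org.apache.spark.api.java.JavaPairRDD;

// Hypothetical sketch: join conversions with target events on userId, then
// attribute each (conversion, target) pair, favoring the latest target by timestamp.
static JavaPairRDD<String, Attribution> naiveAttribution(
    JavaPairRDD<String, Conversion> conversionsByUser,   // userId -> conversion
    JavaPairRDD<String, TargetEvent> targetsByUser) {    // userId -> impression or click
  return conversionsByUser
      .join(targetsByUser)                               // userId -> (conversion, target)
      .mapValues(pair -> attribute(pair._1(), pair._2()));
}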
The scale
• Impressions: 250 billion
• Clicks: 20 billion
• Conversions: 50 billion
• Join 50 billion x 250 billion
11#EUdev3
[Diagram: relative volumes of impressions, clicks, and conversions]
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
12#EUdev3
Driver OOM
Exception in thread "map-output-dispatcher-12" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
at java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:615)
at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:614)
at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:614)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287)
at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:617)
13#EUdev3
Driver OOM
• An array of mapStatuses of size m; each status contains info about how its map output block is consumed by each of the n reducers
• m (mappers) x n (reducers)
14#EUdev3
[Diagram: Status1 and Status2, each tracking reducer1 and reducer2]
Driver OOM
• 2 types of MapStatus: HighlyCompressedMapStatus vs CompressedMapStatus
• HighlyCompressedMapStatus tracks the average reduce partition size, with a bitmap tracking which blocks are empty for each reducer
15#EUdev3
Driver OOM
• Reduce number of partitions on either side
• 300k x 75k -> 100k x 75k
16#EUdev3
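A hedged sketch of what that looks like in code: shrink m by coalescing the larger input before the shuffle, and keep n explicit on the join. The RDD names and element types are illustrative; the partition counts mirror the numbers above.

import org.apache.spark.api.java.JavaPairRDD;

// m x n shrinks from 300k x 75k to 100k x 75k:
JavaPairRDD<String, Event> fewerMapPartitions =
    eventsByUser.coalesce(100_000);                        // fewer map-side partitions (m)

JavaPairRDD<String, scala.Tuple2<Event, Conversion>> joined =
    fewerMapPartitions.join(conversionsByUser, 75_000);    // explicit reducer count (n)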
Disable unnecessary GC
• spark.cleaner.periodicGC.interval
• GC cycles “stop the world”
• Large heaps mean longer GC pauses
• Set to a long period (e.g. twice the length of your
job)
17#EUdev3
Disable unnecessary GC
• ContextCleaner uses weak references to keep
track of every RDD, ShuffleDependency, and
Broadcast, and registers when the objects go
out of scope
• periodicGCService is a single-thread executor
service that calls the JVM garbage collector
periodically
18#EUdev3
Allow extra time
• spark.rpc.askTimeout
• spark.network.timeout
• When GC does occur, our heaps are so large that the pause can exceed the default timeouts
19#EUdev3
Spurious failures
• Reading from S3 can be flaky, especially when
reading millions of files
• Set spark.task.maxFailures higher than default
of 3
• We keep it at 10 or below so true errors still propagate out quickly (config sketch below)
20#EUdev3
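Pulling the settings from the last few slides together, a minimal sketch of how they could be applied programmatically. The keys are standard Spark configuration properties and the values mirror the final configuration table at the end of the deck; the application name is illustrative.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
    .setAppName("large-batch-etl")                        // illustrative app name
    .set("spark.cleaner.periodicGC.interval", "600min")   // effectively disable periodic GC
    .set("spark.rpc.askTimeout", "300")                   // be patient during long GC pauses
    .set("spark.network.timeout", "300s")
    .set("spark.task.maxFailures", "10");                 // tolerate flaky S3 reads, but not forever

JavaSparkContext jsc = new JavaSparkContext(conf);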
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
21#EUdev3
The skew
• Extreme skew in the data
• A few users have 100k+ events over 90 days; the average user has < 50
• Executors were dying due to a handful of extremely large partitions
22#EUdev3
The skew
• Out of 20.5B users, 20.2B have < 50 events
23#EUdev3
[Chart: count of users with number of events bucketed by 1,000s; x-axis: # of events (0 to 504,000), y-axis: # of users (0 to 2.5E+10)]
The skew: zoom
24#EUdev3
[Chart: # of users with # of events bucketed by 100s (> 75k); x-axis: # of events (75,000 to 504,000), y-axis: # of users (0 to 35)]
Strategy: increase # of partitions
• First line of defense - increase number of
partitions so skewed data is more spread out
25#EUdev3
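For instance (illustrative count; note that repartition itself introduces a shuffle, unlike coalesce):

// More, smaller partitions so any single skewed partition holds less data.
JavaPairRDD<String, Event> spreadOut = eventsByUser.repartition(300_000);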
Strategy: Nest
• Group conversions by userId, group target events by userId, then join the lists (sketch below)
• Avoid Cartesian joins
26#EUdev3
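A hedged sketch of the nesting idea using cogroup: one shuffle groups both sides by userId and hands each user two lists, instead of a flat join that emits one row per (conversion, target) combination. Conversion, TargetEvent, Attribution, and the attributeUser(...) helper (returning a List) are hypothetical placeholders.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

static JavaRDD<Attribution> nestedAttribution(
    JavaPairRDD<String, Conversion> conversionsByUser,
    JavaPairRDD<String, TargetEvent> targetsByUser) {
  return conversionsByUser
      .cogroup(targetsByUser)        // userId -> (Iterable<Conversion>, Iterable<TargetEvent>)
      .flatMap(entry -> attributeUser(
          entry._2()._1(),           // this user's conversions
          entry._2()._2())           // this user's impressions/clicks
          .iterator());              // attributeUser returns a List<Attribution>
}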
Long tail: ganglia
27#EUdev3
Long tail: Spark UI
• 50 min long tail, median 24 s
28#EUdev3
Long tail: what else to do?
• If you have domain specific knowledge of your
data, use it to filter “bad” data out
• Salt your data, and shuffle twice (but shuffling is
expensive)
• Use bloom filter if one side of your join is much
smaller than the other
29#EUdev3
Bloom Filter
• Space-efficient probabilistic data structure for testing whether an element is a member of a set
• Size is mainly determined by the number of items in the filter and the false positive probability
• No false negatives!
• Broadcast the filter out to the executors (sketch below)
30#EUdev3
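A hedged sketch of how this could look, using Guava's BloomFilter as one concrete implementation (the slides don't name one). conversionUserIds, impressions, getUserId(), expectedConversions, and jsc are assumed names, not the production code.

import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;

// Build partial filters per partition of the smaller side's keys, merge them on the
// driver, then broadcast the result and pre-filter the larger side before the join.
BloomFilter<CharSequence> empty = BloomFilter.create(
    Funnels.stringFunnel(StandardCharsets.UTF_8), expectedConversions, 0.05);  // 5% FP rate

BloomFilter<CharSequence> filter = conversionUserIds.treeAggregate(
    empty,
    (bf, userId) -> { bf.put(userId); return bf; },   // add keys within each partition
    (a, b) -> { a.putAll(b); return a; });            // merge partial filters

Broadcast<BloomFilter<CharSequence>> bcFilter = jsc.broadcast(filter);

JavaRDD<Event> prefiltered =
    impressions.filter(e -> bcFilter.value().mightContain(e.getUserId()));  // no false negatives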
Bloom Filter
• Even with a high false positive rate, it is still a very effective filter
• P = 5% -> 80% of the data filtered out
• Subsequent join is much faster
31#EUdev3
Bloom Filter
• Tradeoff between accuracy & size
• We’ve had great success with Bloom filters sized < 5 GB
• Experiment with Bloom filters
32#EUdev3
Bloom Filter Applied
• For 50 billion conversions at a 0.1% false positive rate, the filter size is 80 GB
• At a 5% false positive rate, the filter size is 35 GB
• Still too big
33#EUdev3
Long tail: what else to do?
• If you have domain specific knowledge of your
data, use it to filter “bad” data out
• Salt your data, and shuffle twice (but shuffling is
expensive)
• Use bloom filter if one side of your join is much
smaller than the other
34#EUdev3
Long tail: ganglia
35#EUdev3
Long tail: what is it doing?
• Look at executor threads during long tail
com.esotericsoftware.kryo.util.IdentityObjectIntMap.clear(IdentityObjectIntMap.java:382)
com.esotericsoftware.kryo.util.MapReferenceResolver.reset(MapReferenceResolver.java:65)
com.esotericsoftware.kryo.Kryo.reset(Kryo.java:865)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:630)
org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:209)
org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:134)
org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:239)
org.apache.spark.util.collection.WritablePartitionedPairCollection$$anon$1.writeNext(WritablePartitionedPairCollection.scala:56)
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:699)
36#EUdev3
Long tail: what is it doing?
• Mappers were taking a long time writing to shuffle space
• Need to reduce the data size before it goes into the shuffle
37#EUdev3
Long tail: what is it doing?
• Events in the long tail had almost identical
information, spread over time.
• For each user, if we retain just 1 event per hour, 90 days is at most ~2,160 events
• However, this means we need to group by user
first, which requires a shuffle, which defeats the
whole purpose of this exercise, right?
38#EUdev3
Strategy: Filter during map side combine
• Use combineByKey and maximize map-side combine
• Thin the collection out during map-side combine -> less is written to shuffle space (sketch below)
39#EUdev3
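A hedged sketch of the map-side thinning with combineByKey. The Event type, its getTimestamp() accessor (epoch millis), and the one-event-per-hour rule mirror the description above, but the code itself is illustrative.

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;

// Per user, keep at most one event per hour bucket. mergeValue runs during
// map-side combine, so the thinning happens before anything hits shuffle space.
static JavaPairRDD<String, Map<Long, Event>> thinPerUser(JavaPairRDD<String, Event> events) {
  return events.combineByKey(
      e -> {                                                   // createCombiner
        Map<Long, Event> byHour = new HashMap<>();
        byHour.put(e.getTimestamp() / 3_600_000L, e);          // bucket: epoch millis -> hour
        return byHour;
      },
      (byHour, e) -> {                                         // mergeValue (map side)
        byHour.putIfAbsent(e.getTimestamp() / 3_600_000L, e);  // first event wins per hour
        return byHour;
      },
      (a, b) -> {                                              // mergeCombiners (reduce side)
        b.forEach(a::putIfAbsent);
        return a;
      });
}

At 90 days this caps each user at roughly 2,160 hourly buckets, in line with the estimate on the previous slide.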
40#EUdev3
Still slow…
• What else can I do?
41#EUdev3
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
42#EUdev3
Avoid shuffles
• Reuse the same partitioner instance
43#EUdev3
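A hedged sketch of what reusing one partitioner instance buys you; the names and the partition count are illustrative.

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

HashPartitioner partitioner = new HashPartitioner(75_000);

// partitionBy shuffles once up front; everything downstream that shares the same
// partitioner instance is co-partitioned and can skip further shuffles.
JavaPairRDD<String, Conversion> conversionsPart = conversionsByUser.partitionBy(partitioner);
JavaPairRDD<String, TargetEvent> targetsPart = targetsByUser.partitionBy(partitioner);

// Same partitioner on both sides: this join is a narrow dependency, no extra shuffle.
JavaPairRDD<String, scala.Tuple2<Conversion, TargetEvent>> joined =
    conversionsPart.join(targetsPart);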
The DAG
44#EUdev3
Avoid shuffles
• Denormalize or union data to minimize shuffles
• Rely on the fact that we will reduce into a highly compressed key space
• For example, we want the count of events by campaign and also the count of events by site (sketch below)
45#EUdev3
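A hedged sketch of the union idea for the campaign/site example; events, getCampaignId(), and getSiteId() are assumed names.

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Tag each aggregation with its own composite key, union, and reduce once.
// Map-side combine collapses everything into the small key space before the single shuffle.
JavaPairRDD<String, Long> byCampaign =
    events.mapToPair(e -> new Tuple2<>("campaign:" + e.getCampaignId(), 1L));
JavaPairRDD<String, Long> bySite =
    events.mapToPair(e -> new Tuple2<>("site:" + e.getSiteId(), 1L));

JavaPairRDD<String, Long> counts =
    byCampaign.union(bySite).reduceByKey(Long::sum);   // one shuffle for both counts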
Avoid shuffles
46#EUdev3
Coalesce partitions when loading
• Loading many small files – coalesce down # of
partitions
• No shuffle
• Reduce task overhead, greatly improve speed
• Going from 300k partitions to 60k cut the time in half
47#EUdev3
Coalesce partitions when loading
final JavaRDD<Event> eventRDD = loadDataFromS3();           // load data
final int loadingPartitions = eventRDD.getNumPartitions();  // inspect how many partitions
final int coalescePartitions = loadingPartitions / 5;       // use algorithm to calculate new #
final JavaRDD<Event> coalesced = eventRDD
    .coalesce(coalescePartitions)                           // coalesce to smaller #
    .map(e -> transform(e));                                 // faster subsequent operations
48#EUdev3
Materialize data
• Large chunk of data persisted in memory
• Large RDD used to calculate small RDD
• Use an Action to materialize the smaller
calculated result so larger data can be
unpersisted
49#EUdev3
Materialize data
parent.cache() // persist large parent PairRDD to memory
child1 = parent.reduceByKey(a).cache() // calculate child1 from parent
child2 = parent.reduceByKey(b).cache() // calculate child2 from parent
child1.count() // perform an Action
child2.count() // perform an Action
parent.unpersist() // safe to mark parent as unpersisted
// rest of the code can use memory
50#EUdev3
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
51#EUdev3
Ganglia
• Ganglia is an extremely useful tool to
understand performance bottlenecks, and to
tune for highest cluster utilization
52#EUdev3
Ganglia: CPU wave
53#EUdev3
Ganglia: CPU wave
• Executors are going into GC multiple times in
the same stage
• Running out of execution memory
• Persist to StorageLevel.DISK_ONLY()
54#EUdev3
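For reference, persisting a cached dataset to disk instead of the heap is a one-liner (the RDD name is illustrative):

import org.apache.spark.storage.StorageLevel;

eventsByUser.persist(StorageLevel.DISK_ONLY());   // keep execution memory free for the stage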
Ganglia: inefficient use
55#EUdev3
Ganglia: inefficient use
• Decrease # of partitions of RDDs used in this
stage
56#EUdev3
Ganglia: much better
57#EUdev3
Final Configuration
• Master 1 r3.4xl
• Executors 110 r3.4xl
• Configurations:
58#EUdev3
spark maximizeResourceAllocation TRUE
spark-defaults spark.executor.cores 16
spark-defaults spark.dynamicAllocation.enabled FALSE
spark-defaults spark.driver.maxResultSize 8g
spark-defaults spark.rpc.message.maxSize 2047
spark-defaults spark.rpc.askTimeout 300
spark-defaults spark.network.timeout 300s
spark-defaults spark.executor.heartbeatInterval 20s
spark-defaults spark.executor.memory 92540m
spark-defaults spark.yarn.executor.memoryOverhead 23300
spark-defaults spark.task.maxFailures 10
spark-defaults spark.executor.extraJavaOptions -XX:+UseG1GC
spark-defaults spark.cleaner.periodicGC.interval 600min
Summary
• Large jobs are special, use special settings
• Outsmart the skew
• Use Ganglia!
59#EUdev3
Thank you
60#EUdev3
Emma Tang
@emmayolotang
@Neustar
Editor's Notes

  1. Today I’m going to share with you some tips and tricks to help you get started with processing large data in batch processes.
  2. First, let me tell you a little bit about Neustar. Neustar provides a range of cutting edge marketing solutions, which you can check out online at the above link.
  3. On our team, we focus on helping the world’s most valuable brands understand and target their consumers both online and offline.
  4. A quick word here on our stack, Our jobs run on AWS EMR, and read and output data to S3. we’ve built infrastructure to support our data pipeline, so that we have a fast, highly fault tolerant, cloud based system. All of our Spark jobs shown in the middle blue box are batch ETL jobs. Which is our focus today.
  5. Our focus today is on Batch ETL jobs. Which differentiates itself from other use cases of Spark such as ad hoc data science uses, or streaming in the following ways. Runs on schedule/ programmatically triggered. So it needs to be reliable and robust, humans will not be manually monitoring the job, or be able to tweak it during runs. Secondly, we Aim for complete utilization of cluster resources, esp. memory and CPU. In streaming, depending on the workload, you might not care as much about using all of your machine resources at every moment. But in batch, the goals is to squeeze every bit of juice out of our machines, so we can arrive at our result with minimal cost.
  6. We’re getting late arriving data all the time, hence historical state.
  7. We’re going to try to touch upon everything I promised in the talk description, and if we run out of time, the slides will be available online soon. By skew we mean data that is distributed in a way where outliers really affect performance.
  8. Let’s use a specific problem to get started. This example is relevant to our business, and we will learn about spark in the process. At Neustar, we process large quantities of ad events. Some examples include impressions, clicks, conversions. The problem we’re going to solve is, which impression or click events contributed to the conversion happening? A user usually has many events, so multiple conversions and multiple target events. We need to correctly attributed each of the conversions events for each user. Let’s use a concrete example:
  9. Let’s use a concrete example. How would we solve this in code?
  10. Attribute correctly based on timestamp, the target event has been occur before the conversion. We care about the latest target events. Also based on metadata to ensure they are for the same advertiser etc.
  11. For 90 days of events: we are loading around 100 T of data. For the maximum join, we are joining 50 billion conversions with 250 billion impressions
  12. Let’s talk about issues you will encounter only at scale
  13. When working at this scale, you will see errors that other people won’t see. For example, driver OOM. That stack trace does not actually mean the JVM is out of heap; your metrics will show plenty of heap left. The limit is Java’s maximum array size (about 2 billion elements): we had more than 2 GB of map status output, so the ByteArrayOutputStream buffer could not grow any further. You can see that it overflowed the buffer in the ByteArrayOutputStream. What is contained in this output stream?
  14. We are serializing statuses, which is an array of MapStatuses. The size of the array is the number of map tasks. Each status contains information about how the map block is used by each reducer. We can already see, this is a m times n problem.
  15. Spark has 2 types of MapStatus: if the number of partitions exceeds 2000, the HighlyCompressedMapStatus is used. If the data also happens to be distributed in a way that would prevent the bitmap from being highly compressible, for example when each map output goes to a random set of reducers, with some reducers getting nothing and some getting output, you can create a situation where your buffer will not be big enough for the compressed statuses. The easiest solution is to reduce the number of partitions.
  16. Luckily the solution is also transparent.
  17. For batch jobs we don’t have a long-running application. Long GC cycles are a waste of resources on our clusters, so we tweak our jobs and cluster settings to avoid large amounts of GC. We’re working with large data, which means longer GC when it happens. The setting to tweak here is spark.cleaner.periodicGC.interval. The cleaning thread will block on cleanup tasks, for example when the driver performs a GC and cleans up all broadcasts. For larger jobs, we increase the periodic GC interval on the driver so that it is effectively disabled; for this job we set it to 600 minutes. Java’s newer G1 GC completely changes the traditional approach: the heap is partitioned into a set of equal-sized regions, each a contiguous range of virtual memory. Certain region sets are assigned the same roles (Eden, survivor, old) as in the older collectors, but they do not have a fixed size. When a minor GC occurs, G1 copies live objects from one or more regions of the heap to a single region, and a full GC occurs only when all regions hold live objects and no fully empty region can be found. This greatly improves the heap occupancy rate when full GC is triggered, and also makes the minor GC pause times more controllable.
  18. G1GC – low latency high throughput.
  19. We need Spark to be a little bit more patient when dealing with larger jobs, so we ask it to wait longer for communication between machines: when one of them is in GC, our heap size is so large that we would otherwise exceed the timeout.
  20. Protection against spurious failures which can occur. Spark default maxFailures per task is 3, not enough in large jobs. On the other hand, we don’t want to set it too high, or else, true errors will get retried forever, preventing timely response, and is costly.
  21. We’re going to try to touch upon everything I promised in the talk description, and if we run out of time, the slides will be available online soon.
  22. At this scale, you’re more likely than not to encounter skew. We had extreme skew in our data.
  23. Overwhelming majority of our users have fewer than 1000 events. In fact, most have fewer than 50 events.
  24. Upon closer inspection, our data had very few extreme outliers: out of 20B users, only 879 had more than 75k events. These very few bad partitions were causing our executors to die.
  25. Executors weren’t as likely to die anymore by grouping in lists.
  26. We increased partitions and nested our data, and our executors weren’t dying anymore! But there were still inefficiencies caused by skew. Notice the lag in CPU use in ganglia. All of those resources were wasted because there were a few bad partitions.
27. The max task was taking 50 minutes versus the median of 24 seconds! We need to do something about this.
28. If you have domain-specific knowledge of your data, use it to filter "bad" data out as early as you can. We couldn't do this in the naive way because we had no idea which users were going to be bad. You can salt your data and join twice, but in our empirical studies this was never worth it, since the extra shuffle is so expensive. You can use a Bloom filter if one side of your join is much smaller than the other, but Bloom filters quickly become large after a certain size. For example, if we had xxx conversions, the Bloom filter for a 0.1% false-positive rate would have been xxx; broadcasting that out and filtering would have used more resources than joining directly.
29. The size of a Bloom filter is mainly determined by the number of items in it and the desired false-positive probability: the more items you have, the bigger the filter, and the more accurate you want it, the bigger the filter. The filter can be broadcast out to the executors and used inside tasks; Spark's broadcast uses a BitTorrent-like algorithm that splits the value into slices (a couple hundred pieces), so it can handle a couple of gigabytes. A Bloom filter itself is just a bitset plus a set of hash functions (the more hash functions, the better the accuracy), and membership is tested as hash1(x) && hash2(x) && hash3(x).
30. One thing to note: don't be afraid of using a high false-positive rate. If the goal is to shrink the joinable set, a 5% false-positive rate still results in 80% of the data being filtered out, which makes subsequent joins much faster.
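A minimal sketch of this broadcast-and-filter pattern, assuming Guava's BloomFilter, a spark-shell style SparkContext named sc, and hypothetical pair RDDs conversions and impressions keyed by userId. Collecting the smaller side's keys to the driver is a simplification; at real scale you would build the filter distributively (e.g. with treeAggregate):

    import java.nio.charset.StandardCharsets
    import com.google.common.hash.{BloomFilter, Funnels}

    // Build the filter from the smaller side's keys (simplified: collected to the driver).
    val convUserIds = conversions.keys.distinct().collect()
    val filter = BloomFilter.create(
      Funnels.stringFunnel(StandardCharsets.UTF_8),
      convUserIds.length.toLong,
      0.05)                                        // a high false-positive rate is fine here
    convUserIds.foreach(id => filter.put(id))

    // Broadcast the filter and drop impressions whose userId cannot be in conversions.
    val filterBc = sc.broadcast(filter)
    val candidateImps = impressions.filter { case (userId, _) => filterBc.value.mightContain(userId) }
    val joined = conversions.join(candidateImps)   // the real join, now over a much smaller side

A false positive only costs a little extra join work, which is why the loose 5% rate above is acceptable.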
  31. However, we’ve had great success in many other jobs where the join was even more lopsided. Please experiment and see if it could work for you.
32. Now going back to our original problem. You can find Bloom filter size calculators online to figure out whether a Bloom filter is a good solution for you. Unfortunately, both sides of our join were sizable, and the Bloom filter was not the right solution for this job.
  33. None of these strategies worked directly for us, so we were still left with skew.
  34. Let’s look at the ugly long tail graph again.
35. Let's go back to our problem. We have a very long tail. It is always a good idea to understand what your threads are spending their time on, and for the long tail we are especially interested.
36. On our r3.4xls with 16 cores, we saw that most threads were stuck on writing. They were spending a lot of time writing to shuffle space, so let's try to reduce the amount of data written.
37. Let's go back to the data. Inspecting the event data revealed that in the long tail, the events carried almost identical information, spread out over time. If we retain just one event per hour, then over 90 days that is at most 2160 events per user. However, this seems to mean we need to group by userId first, which requires a shuffle, which defeats the whole purpose of this exercise, right? This is where map-side combine comes in.
38. If we use combineByKey, we can specify the operation to perform on the map side, and thin out the collection there, before it is ever shuffled.
39. The basic structure is shown here (see the sketch below). Notice that in the second lambda we rate-limit / thin out the collection once the list reaches a certain maximum size. Feel free to check out the slides; they will be available online.
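A minimal sketch of that structure, assuming a spark-shell style sc, a hypothetical Event type, and an existing events: RDD[Event]. The real job thinned by hour boundaries; here the second lambda just applies a size cap to keep the sketch short:

    case class Event(userId: String, timestamp: Long)      // hypothetical event type
    val eventsByUser = events.map(e => (e.userId, e))       // events: RDD[Event], assumed to exist

    val maxEventsPerUser = 2160                             // ~1 event per hour over 90 days

    val thinned = eventsByUser.combineByKey(
      // createCombiner: start the per-user list with the first event
      (e: Event) => List(e),
      // mergeValue: runs map-side; stop growing the list once it reaches the cap
      (acc: List[Event], e: Event) =>
        if (acc.size >= maxEventsPerUser) acc else e :: acc,
      // mergeCombiners: merge partial lists from different map outputs, re-applying the cap
      (a: List[Event], b: List[Event]) => (a ++ b).take(maxEventsPerUser)
    )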
40. We got rid of the long tail, but the job is still generally slow. Can I squeeze more performance out of it by tweaking the job? The answer is yes.
41. We're going to try to touch upon everything I promised in the talk description, and if we run out of time, the slides will be available online soon.
42. Reusing the same partitioner allows all the data to be partitioned the same way, ready for joins and reduceByKeys. This way we can turn multiple wide transformations into a single narrow transformation.
43. Please don't try to read this; what I want to point out is this long narrow stage. It contains 3 combineByKeys and 2 left outer joins in the same narrow stage. If we had used different partitioners, each one would have induced a shuffle.
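A minimal sketch of the pattern, assuming a spark-shell style sc and hypothetical pair RDDs impressions, clicks, and conversions keyed by userId; the partition count is illustrative:

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(4000)   // illustrative partition count

    val imps  = impressions.partitionBy(partitioner)
    val clks  = clicks.partitionBy(partitioner)
    val convs = conversions.partitionBy(partitioner)

    // Because every RDD shares the same partitioner, these joins are
    // co-partitioned and run as narrow dependencies: no extra shuffle.
    val attributed = convs
      .leftOuterJoin(imps)
      .leftOuterJoin(clks)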
44. But we can go further than that to reduce the number of shuffles, if we are willing to pay for it in memory.
45. Let's say we want a count of users by campaign, and we also want a count of users by site. The straightforward way to do this is to take the RDD, cache it, and reduceByKey on it twice. But if we duplicate the rows as shown below, we are able to reduceByKey once and then filter for the two desired results.
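A minimal sketch of the row-duplication trick, assuming a spark-shell style sc and hypothetical (userId, campaignId, siteId) records. The Set-based distinct count is only to keep the example tiny; at real scale you would use a sketch such as HyperLogLog:

    val rows = sc.parallelize(Seq(
      ("u1", "campA", "site1"),
      ("u2", "campA", "site2"),
      ("u3", "campB", "site1")))

    // Duplicate each row under two composite keys instead of caching and reducing twice.
    val duplicated = rows.flatMap { case (user, campaign, site) =>
      Seq((("campaign", campaign), user), (("site", site), user)) }

    // One reduceByKey over the duplicated rows...
    val userCounts = duplicated
      .mapValues(u => Set(u))
      .reduceByKey(_ ++ _)
      .mapValues(_.size)

    // ...then filter out the two desired results.
    val usersByCampaign = userCounts.filter { case ((dim, _), _) => dim == "campaign" }
    val usersBySite     = userCounts.filter { case ((dim, _), _) => dim == "site" }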
46. Another strategy that greatly improves performance is coalescing the number of partitions down.
47. A simplistic example would be this: load the data from S3, look at the original partition count, use an algorithm that makes sense for your data to pick the new partition count, then coalesce down to the new number of partitions. Subsequent operations will be more efficient.
  48. Do a count, or a reduce.
49. Spark is lazy, so you have to materialize the data first, or else Spark will load it from the source again.
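A minimal sketch, assuming a spark-shell style sc, a hypothetical S3 path, and a deliberately naive sizing heuristic; the cache-and-count is the materialization step from the previous two slides:

    val raw = sc.textFile("s3://my-bucket/events/2017-10-01/")     // hypothetical path
    raw.cache()
    raw.count()                                    // materialize so the data is not re-read from S3

    val originalPartitions = raw.getNumPartitions
    val targetPartitions   = math.max(1, originalPartitions / 10)  // illustrative heuristic only

    val packed = raw.coalesce(targetPartitions)    // no shuffle; later stages run fewer, fatter tasks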
50. We're going to try to touch upon everything I promised in the talk description, and if we run out of time, the slides will be available online soon.
51. Let's go through two Ganglia use cases. Ganglia is available on AWS EMR.
52. What happens when you see waves occurring within a single stage? Sometimes the waves get longer, taking more than 10 minutes from crest to trough.
53. DISK_ONLY caching, though it does depend on the machine class: on r4 it's not that great, since on-board SSD is good but EBS is not. Tune and test.
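The change itself is one line; a sketch, assuming a hypothetical RDD named joined:

    import org.apache.spark.storage.StorageLevel

    // Keep the cached data on local disk instead of the heap; how well this works
    // depends heavily on the instance's disk (on-board SSD vs EBS).
    joined.persist(StorageLevel.DISK_ONLY)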
  54. Notice the last stage is only using 70% of CPU for the entire process.
55. Decrease the number of partitions of the RDD used in this stage. There is less total overhead, and fewer tasks means each task computes more data, which helps with CPU utilization.
  56. After tweaking the above 2 properties, we get much better cluster usage!
57. We've gone through most of these configurations today; for the ones I didn't mention, Spark does a good job of telling users to set them when needed. Feel free to reach out to me with any questions after this session. spark.dynamicAllocation.enabled = false, since the cluster is not multi-tenant. spark.driver.maxResultSize = 8g, set really high, since our driver is large and can handle lots of results coming back to it. spark.rpc.message.maxSize = 2047: if you have many map and reduce tasks, use the max; it is used to send out the map statuses, which we know is m x n. Memory overhead protects against executor OOM from JVM overhead: what goes into it is everything off the JVM heap, such as the intern pool, thread stacks, NIO buffers, and shared native libraries. The YARN coordination process takes 896 MB (you can find it on the YARN application page).
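Put together as a sketch, using the values mentioned in the talk where they were given and illustrative ones otherwise (spark.yarn.executor.memoryOverhead is the pre-Spark-2.3 name of the overhead setting):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "false")    // cluster is not multi-tenant
      .set("spark.driver.maxResultSize", "8g")             // large driver; allow lots of results back
      .set("spark.rpc.message.maxSize", "2047")            // max; carries the m x n map statuses
      .set("spark.yarn.executor.memoryOverhead", "4096")   // illustrative value; off-heap room for thread stacks, NIO buffers, native libs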
58. Why not DataFrames? Complicated logic is difficult to express in SQL, for example finding the latest events that fit specific criteria after a join, or rate-limiting data by hour boundaries. We also need control over partitions, partitioners, combineByKey, and custom code: really large data requires specific partition settings, map-side combine optimizations, and Bloom filters at each type of stage, which is more difficult to do with DataFrames. DataFrames are great, but RDDs give us the flexibility to exploit patterns that exist in the data. In R&O we were able to go from multiple reduces with data skew down to a single pass with RDDs. DataFrames are a great general data-analysis tool; RDDs are fantastic for power users. We saw a 6x difference in size between the DataFrame and RDD versions.