GlobalLogic Webinar: Massive aggregations with Spark and Hadoop

Confidential
Massive aggregations with Spark on Hadoop
Big Data Community

About the speaker
Danylo Stepanchuk
Big Data Architect, member of Big Data Practice

Agenda
Main considerations on massive data processing
What are your environment and processing pipelines?
How can we handle late arrivals?
Multi-action / multi-stage job pitfalls
How can we optimize joins?
Is it always possible to count the exact number of unique items?
How many shuffles should a job have?
Q&A session

What’s the challenge?
Daily
● 150+ billion raw events
● 130+ TB of raw input data
○ hourly volume varies from 4 TB up to 8 TB
○ common spikes on holidays, which can double the volume
● over 110 dimensions in the analytic data store

Main considerations on massive data processing
● Volume
○ Retention policy
○ Valuable records ratio
○ Sampling techniques
● Processing complexity
○ Average input and spikes
○ Logical complexity
○ Late arrivals
○ Accuracy
○ Data access strategy
● Analytics:
○ Granularity
○ Dimensions number and their cardinality
● Consumer level SLA
● Computational resources and multi-tenancy environment

Real-world pipeline example
Source Systems → Ingest Layer → HDFS → Spark (batch processing) → Druid → Visualization and Reporting Tools

What is your environment?
prod
dev / qa

How can we handle late arrivals?
Late arrivals SLA is 3 hours
● Wait until the SLA elapses (aka Watermark)
● Enhance the above with one more trigger right after the reporting period closes (aka Window)
● Run mini-batches during the reporting period and a final one when the SLA elapses
Why reinvent the wheel when there is at least the Lambda architecture?
AWS
600 instances: 32 CPUs, 256 GB RAM, 500 GB storage each
Amazon Elastic Block Store (EBS) pricing (monthly): 30,000.00 USD
Amazon EC2 Reserved Instances (monthly): 499,758.00 USD
Total monthly cost: 529,758.00 USD
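
A minimal sketch of the trigger logic behind the three late-arrival options above. It assumes hourly reporting periods, which the slide does not state; the names `WINDOW`, `window_status`, and the status strings are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timedelta

LATE_ARRIVAL_SLA = timedelta(hours=3)  # from the slide
WINDOW = timedelta(hours=1)            # assumption: hourly reporting periods


def window_status(window_start, now):
    """Classify a reporting period relative to the current time.

    open                       -> mini-batches may publish provisional results
    closed, awaiting late data -> extra trigger right after the period closes
    final                      -> SLA elapsed, run the final aggregation
    """
    window_end = window_start + WINDOW
    if now < window_end:
        return "open"
    if now < window_end + LATE_ARRIVAL_SLA:
        return "closed, awaiting late data"
    return "final"


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 10)
    for hour in (10, 11, 13, 14):
        print(hour, window_status(start, datetime(2024, 1, 1, hour, 30)))
```

A scheduler could call such a function per reporting period and launch the final batch job only for periods that report "final".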

Multi actions / stages job pitfalls
● Data buffering
○ You will need to increase memory per executor
○ Storage and execution memory compete with each other
● Intermediate files
○ Spark local drives are most likely to run out of space
○ Forcing garbage collection after an action can help, but the behaviour is not officially documented
Tip: if your shuffle is big, organize your pipeline jobs to be as simple as possible

What does your typical processing logic per job look like?
Data Source 1, Data Source 2, Data Source 3, … , Data Source X → Data Set
SELECT
  UID, d1, d2, … , dN, aggr1, aggr2, … , aggrM
FROM
  DS1, ... , DSX
WHERE
  DS1.UID = DS2.UID AND ... AND DS1.UID = DSX.UID
GROUP BY
  UID, d1, d2, … , dN
Yes, just a single aggregation query!

How can we optimize joins?
Data Source 1, Data Source 2, Data Source 3 → Data Set
Sizes: 70M - 160M, 1G - 2.2G, 2T - 4T
SELECT ...fields...
FROM DS1
LEFT JOIN DS2 ON DS2.UID = DS1.UID
LEFT JOIN DS3 ON DS3.UID = DS1.UID
WHERE
  DS2.UID IS NOT NULL OR DS3.UID IS NOT NULL
● Shuffle size is equal to the total input size
● Most of the Data Source 1 records don’t pass the filter

How can we optimize joins? Broadcast?
Data Source 1, Data Source 2, Data Source 3
Sizes: 70M - 160M, 1G - 2.2G, 2T - 4T
Pros
● Spark-native solution
● drastically reduces shuffle size
● easy to implement (the only consideration may be getting distinct keys)
Cons
● the broadcast size
○ the broadcast data needs to be filtered
○ extract and broadcast only the keys
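
A minimal pure-Python sketch of the "extract and broadcast keys" idea, with no Spark dependency. The record layout (dicts with a `uid` field) and the helper names are hypothetical. In a Spark job the key set would typically be shipped to executors as a broadcast variable and applied before the join, so that only matching Data Source 1 records reach the shuffle.

```python
def distinct_keys(*small_sources):
    """Collect the distinct UIDs of the smaller data sources."""
    return {rec["uid"] for source in small_sources for rec in source}


def prefilter(big_source, keys):
    """Keep only the records of the big source whose UID can match at all.

    This mirrors the slide's filter
    `DS2.UID IS NOT NULL OR DS3.UID IS NOT NULL` after the LEFT JOINs.
    """
    return [rec for rec in big_source if rec["uid"] in keys]


if __name__ == "__main__":
    ds1 = [{"uid": u, "payload": f"event-{u}"} for u in range(10)]
    ds2 = [{"uid": 2}, {"uid": 3}, {"uid": 3}]
    ds3 = [{"uid": 7}]
    keys = distinct_keys(ds2, ds3)
    print(sorted(keys))
    print([rec["uid"] for rec in prefilter(ds1, keys)])
```

The key set is smaller than the full records of the small sources, which is what keeps the broadcast size manageable.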

How can we optimize joins? The Bloom Filter
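
The slide body did not survive extraction, so the following is only a hedged illustration of how a Bloom filter can pre-filter the large source when the exact key set is too big to broadcast. The class, its sizing formulas (the standard ones for a target false-positive rate), and the double-hashing scheme are assumptions, not the speaker's implementation. False positives only let a few extra records through; the real join removes them afterwards.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, fp_rate=0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2
        self.size = max(1, int(-expected_items * math.log(fp_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from two 64-bit hashes (double hashing).
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


if __name__ == "__main__":
    bloom = BloomFilter(expected_items=1000)
    for uid in range(1000):
        bloom.add(uid)
    survivors = [uid for uid in range(1000, 11000) if bloom.might_contain(uid)]
    print(f"bits: {bloom.size}, hashes: {bloom.num_hashes}, false positives: {len(survivors)}")
```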

How can we solve this?
If a visitor is a human being the
cardinality virtually can be
It depends on the visitor cardinality! If it is relatively small, a HashSet is good enough

HyperLogLog
The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory
It has been available since Apache Spark 2.0 for Python, R, and Scala with the DataFrame and Dataset APIs
● approx_count_distinct
● approxCountDistinct (before 2.1.0)
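
To make the memory/accuracy trade-off concrete, here is a simplified pure-Python HyperLogLog sketch. It is not Spark's implementation; the class name, the SHA-1 based hash, and the choice of 2^11 = 2048 registers (standard error of about 1.04 / sqrt(2048), roughly 2.3%) are assumptions for illustration. The `merge` method is what makes the structure usable in a distributed aggregation: partial sketches are combined by taking the per-register maximum.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=11):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")         # 64-bit hash
        idx = h >> (64 - self.p)                       # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Combining two sketches is a per-register maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100_000):
        hll.add(f"visitor-{i}")
    print(f"estimated distinct visitors: {hll.estimate():.0f}")
```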

HyperLogLog
It is easy to add your own custom version of the algorithm.
You can use it with the RDD API as well

Data Source → Data Set
1.3T - 2.3T
SELECT
  d1, … , dN, exploded_field, aggr1, … , aggrM
FROM
  (SELECT *, EXPLODE(list_field) AS exploded_field FROM sds) AS eds
GROUP BY
  d1, … , dN, exploded_field
● list_field may contain dozens of items
● We can do the aggregation with a single shuffle

Job | Input | Shuffle Write Size | Job Running Time (Total Uptime) | Stages Running Time
Original Job | 1595.7 GB | 1590.7 GB | 12 min | 7.5 min + 2.5 min
Optimized Job | 1595.7 GB | 996.0 GB + 84.8 GB | 12 min | 6.2 min + 2.3 min + 1.8 min

How many shuffles should a job have? The Double Shuffle
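
The diagram for this slide was lost, so the sketch below shows only one plausible reading of a "double shuffle" for the explode-and-aggregate job: pre-aggregate on the unexploded list first, then explode the much smaller intermediate result and aggregate again. This interpretation, the function names, and the toy data are assumptions. The point is that both variants return identical totals while the second one explodes far fewer records.

```python
from collections import defaultdict


def single_shuffle(rows):
    """Explode every input row, then aggregate once."""
    totals = defaultdict(int)
    exploded = 0
    for dims, items, value in rows:
        for item in items:
            exploded += 1
            totals[dims + (item,)] += value
    return dict(totals), exploded


def double_shuffle(rows):
    """Aggregate on (dims, whole list) first, then explode and aggregate again."""
    pre = defaultdict(int)
    for dims, items, value in rows:          # first aggregation (first shuffle)
        pre[(dims, tuple(items))] += value
    totals = defaultdict(int)
    exploded = 0
    for (dims, items), value in pre.items():  # second aggregation (second shuffle)
        for item in items:
            exploded += 1
            totals[dims + (item,)] += value
    return dict(totals), exploded


if __name__ == "__main__":
    rows = [
        (("us", "web"), ("a", "b", "c"), 1),
        (("us", "web"), ("a", "b", "c"), 2),
        (("us", "web"), ("a", "b", "c"), 3),
        (("de", "app"), ("a", "d"), 5),
    ]
    print(single_shuffle(rows))
    print(double_shuffle(rows))
```

Whether the extra stage pays off depends on how often identical (dimensions, list) combinations repeat in the input.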

A couple of other tips
● Use dynamic allocation
○ spark.dynamicAllocation.enabled = true (false by default)
● Control the number of shuffle partitions
○ spark.sql.shuffle.partitions (200 by default)
● Use heap memory if the shuffle is big but processing is lightweight
○ spark.shuffle.io.preferDirectBufs = false (true by default)
● Speculation can help in most cases
○ spark.speculation = true (false by default)
● Wisely tune executor resources
○ spark.executor.memory
○ spark.executor.cores
○ spark.dynamicAllocation.maxExecutors
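
A small sketch of how the settings above could be collected in one place and turned into `spark-submit --conf` arguments. The property names come from the slide; every numeric or size value marked "hypothetical" is a placeholder and not a recommendation, and the helper name is made up for this example.

```python
def to_spark_submit_args(conf):
    """Render a settings dict as a flat list of --conf key=value arguments."""
    args = []
    for key in sorted(conf):
        value = conf[key]
        if isinstance(value, bool):
            value = str(value).lower()  # Spark expects true/false
        args += ["--conf", f"{key}={value}"]
    return args


tuning = {
    "spark.dynamicAllocation.enabled": True,
    "spark.dynamicAllocation.maxExecutors": 200,   # hypothetical value
    "spark.sql.shuffle.partitions": 2000,          # hypothetical value
    "spark.shuffle.io.preferDirectBufs": False,
    "spark.speculation": True,
    "spark.executor.memory": "8g",                 # hypothetical value
    "spark.executor.cores": 4,                     # hypothetical value
}

if __name__ == "__main__":
    print(" ".join(["spark-submit"] + to_spark_submit_args(tuning)))
```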

Q&A session