SlideShare a Scribd company logo
1
Confidential
Massive aggregations with
Spark on Hadoop
Big Data Community
2
Confidential
About the speaker
Danylo Stepanchuk
Big Data Architect, member of Big Data Practice
3
Confidential
Main considerations on massive data processing
What is you environment and processing pipelines?
How can we handle late arrivals?
Multi actions/stages job pitfalls
How can we optimize joins?
Is it always possible to count the exact number of unique items?
How many shuffles should a job have?
Q&A session
Agenda
4
Confidential
Daily
● 150+B of raw events
● 130+TB of raw input data
○ varies hourly from 4TB up to 8TB
○ common spikes on holidays which can double the volume
● over 110 dimensions in the analytic data store
What’s the challenge?
5
Confidential
Main considerations on massive data processing
● Volume
○ Retention policy
○ Valuable records ratio
○ Sampling techniques
● Processing complexity
○ Average input and spikes
○ Logical complexity
○ Late arrivals
○ Accuracy
○ Data access strategy
● Analytics:
○ Granularity
○ Dimensions number and their cardinality
● Consumer level SLA
● Computational resources and multi-tenancy environment
6
Confidential
Source
Systems
Ingest
Layer
HDFS DruidSpark
Visualization
and
Reporting
Tools
batch processing
Real word pipeline example
7
Confidential
What is your environment?
prod
dev / qa
8
Confidential
What is your environment?
9
Confidential
How can we handle late arrivals?
Late arrivals SLA is 3 hours
● Wait until SLA comes (aka Watermark)
● Enhance the above with one more trigger after closing a reporting period (aka Window)
● Mini-batches during reporting period and final one when SLA comes
Why are you reinventing the wheel if there is at least Lamda architecture?
AWS
600 instances: 32 cpu, 256gb RAM, 500 gb storage
Amazon Elastic Block Storage (EBS) pricing (monthly) 30,000.00 USD
Amazon EC2 Reserved instances (monthly) 499,758.00 USD
Total monthly cost: 529,758.00 USD
10
Confidential
Multi actions / stages job pitfalls
● Data buffering
○ You will increase memory per executor
○ Storage & Execution memory competition
● Intermediate files
○ Spark local drives are most likely to run out of space
○ Forcing garbage collection after an action can help, but the behaviour is not officially
documented
Tip: if your shuffle is big organize your pipeline jobs to be as simple as possible
11
Confidential
What is your pretty common processing logic per job?
Data Source 1
Data Source 2
Data Source 3
Data Source X
SELECT
UID, d1, d2, … , dN, aggr1, aggr2, … ,
aggrM
FROM
DS1, ... , DSX
WHERE
DS1.UID=DS2.UID AND ... AND
DS1.UID=DSX.UID
GROUP BY
UID, d1, d2, … , dN
Data Set
Yes, the ONLY aggregation query!!!
12
Confidential
How can we optimize joins?
Data Source 1
Data Source 2
Data Source 3
SELECT ...fields...
FROM DS1
LEFT JOIN DS2 ON DS2.UID = DS1.UID
LEFT JOIN DS3 ON DS3.UID = DS1.UID
WHERE
DS2.UID IS NOT NULL OR DS3.UID IS NOT NULL
Data Set70M - 160M
1G - 2.2G
2T - 4T
- Shuffle is equal to
the total input size
- Most of the Data
Source 1 records don’t
pass the filter
13
Confidential
How can we optimize joins? Broadcast?
Data Source 1
Data Source 2
Data Source 3
70M - 160M
1G - 2.2G
2T - 4T
Pros
● spark-native solution
● drastically reduces shuffle size
● easy to implement (the only consideration maybe is to get distinct keys)
Cons
● the broadcast size
- needs to be filtered
- extract and broadcast keys
14
Confidential
How can we optimize joins? The Bloom Filter
15
Confidential
Is it always possible to count the exact number of unique items?
How can we solve this?
If a visitor is a human being the
cardinality virtually can be
It depends on the visitor cardinality! If it relatively small we are good to use HashSet
16
Confidential
Is it always possible to count the exact number of unique items?
HyperLogLog
The HyperLogLog algorithm is able to
estimate cardinalities of > 109 with a
typical accuracy (standard error) of
2%, using 1.5 kB of memory
It has been available since Apache Spark 2.0 for Python, R, and Scala with
DataFrames and Datasets api
● approx_count_distinct
● approxCountDistinct (before 2.1.0)
17
Confidential
Is it always possible to count the exact number of unique items?
HyperLogLog
It is easy to add your custom version of the algorithm.
You can use it with RDD api as well
18
Confidential
How many shuffles should a job have?
Data Source
SELECT
d1, … , dN, exploded_field, aggr1, … , aggrM
(SELECT * FROM sds EXPLODE(list_field) AS exploded_field) as eds
FROM eds
GROUP BY
d1, … , dN, exploded_field
Data Set
1.3T - 2.3T
This list_field may have dozens of items
We can do the aggregation with the only shuffle
19
Confidential
How many shuffles should a job have?
Input Shuffle Write Size Job Running
Time (Total
Uptime)
Stages Running Time
Original Job 1595.7
GB
1590.7 GB 12 min 7.5 min + 2.5 min
Optimized Job 1595.7
GB
996.0 GB + 84.8 GB 12 min 6.2 min + 2.3 min + 1.8 min
20
Confidential
How many shuffles should a job have? The Double Shuffle
1
2
3
21
Confidential
A couple of other tips
● Use dynamic allocation
○ spark.dynamicAllocation.enabled true
● Control number of shuffle partitions
○ spark.sql.shuffle.partitions = 200 by default
● Use heap memory if shuffle is big, but processing is lightweight
○ spark.shuffle.io.preferDirectBufs = false (true by default)
● Speculation can help in most of the cases
○ spark.speculation = true (false by default)
● Wisely tune executor resources
○ spark.executor.memory
○ spark.executor.cores
○ spark.dynamicAllocation.maxExecutors
22
Confidential
22
Q&A session
23
Confidential
Thank you!

More Related Content

What's hot

HPCC Systems vs Hadoop
HPCC Systems vs HadoopHPCC Systems vs Hadoop
HPCC Systems vs Hadoop
Fujio Turner
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
Julien Le Dem
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Julien Le Dem
 
Redis - for duplicate detection on real time stream
Redis - for duplicate detection on real time streamRedis - for duplicate detection on real time stream
Redis - for duplicate detection on real time stream
Codemotion
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
NoSQLmatters
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
Dat Tran
 
LocationTech Projects
LocationTech ProjectsLocationTech Projects
LocationTech Projects
Jody Garnett
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series database
Florian Lautenschlager
 
Data correlation using PySpark and HDFS
Data correlation using PySpark and HDFSData correlation using PySpark and HDFS
Data correlation using PySpark and HDFS
John Conley
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
Frens Jan Rumph
 
Implementing HDF5 in MATLAB
Implementing HDF5 in MATLABImplementing HDF5 in MATLAB
Implementing HDF5 in MATLAB
The HDF-EOS Tools and Information Center
 
Querying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonQuerying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with python
Daniel Rodriguez
 
Finding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache HadoopFinding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache HadoopNushrat
 
Streaming kafka search utility for Mozilla's Bagheera
Streaming kafka search utility for Mozilla's BagheeraStreaming kafka search utility for Mozilla's Bagheera
Streaming kafka search utility for Mozilla's BagheeraVarunkumar Manohar
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdKevin Weil
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Daniel Rodriguez
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
wangzhonnew
 
Onyx data processing the clojure way
Onyx   data processing  the clojure wayOnyx   data processing  the clojure way
Onyx data processing the clojure way
Bahadir Cambel
 

What's hot (20)

HPCC Systems vs Hadoop
HPCC Systems vs HadoopHPCC Systems vs Hadoop
HPCC Systems vs Hadoop
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Redis - for duplicate detection on real time stream
Redis - for duplicate detection on real time streamRedis - for duplicate detection on real time stream
Redis - for duplicate detection on real time stream
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
LocationTech Projects
LocationTech ProjectsLocationTech Projects
LocationTech Projects
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series database
 
Data correlation using PySpark and HDFS
Data correlation using PySpark and HDFSData correlation using PySpark and HDFS
Data correlation using PySpark and HDFS
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Implementing HDF5 in MATLAB
Implementing HDF5 in MATLABImplementing HDF5 in MATLAB
Implementing HDF5 in MATLAB
 
Querying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with pythonQuerying 1.8 billion reddit comments with python
Querying 1.8 billion reddit comments with python
 
Finding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache HadoopFinding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache Hadoop
 
Streaming kafka search utility for Mozilla's Bagheera
Streaming kafka search utility for Mozilla's BagheeraStreaming kafka search utility for Mozilla's Bagheera
Streaming kafka search utility for Mozilla's Bagheera
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant bird
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark Ecosystem
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
 
Onyx data processing the clojure way
Onyx   data processing  the clojure wayOnyx   data processing  the clojure way
Onyx data processing the clojure way
 

Similar to GlobalLogic Webinar: Massive aggregations with Spark and Hadoop

Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
Linaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
Linaro
 
[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark
Naukri.com
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
Travis Oliphant
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
Hektor Jacynycz García
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
Mukesh Singh
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
Lucian Neghina
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB
 
[214]유연하고 확장성 있는 빅데이터 처리
[214]유연하고 확장성 있는 빅데이터 처리[214]유연하고 확장성 있는 빅데이터 처리
[214]유연하고 확장성 있는 빅데이터 처리
NAVER D2
 
Large Data Analyze With PyTables
Large Data Analyze With PyTablesLarge Data Analyze With PyTables
Large Data Analyze With PyTables
Innfinision Cloud and BigData Solutions
 
Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
Ferdinand Jamitzky
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
tsliwowicz
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Datio Big Data
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dan Lynn
 

Similar to GlobalLogic Webinar: Massive aggregations with Spark and Hadoop (20)

Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
[214]유연하고 확장성 있는 빅데이터 처리
[214]유연하고 확장성 있는 빅데이터 처리[214]유연하고 확장성 있는 빅데이터 처리
[214]유연하고 확장성 있는 빅데이터 처리
 
Py tables
Py tablesPy tables
Py tables
 
Large Data Analyze With PyTables
Large Data Analyze With PyTablesLarge Data Analyze With PyTables
Large Data Analyze With PyTables
 
PyTables
PyTablesPyTables
PyTables
 
Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Handout3o
Handout3oHandout3o
Handout3o
 

More from GlobalLogic Ukraine

GlobalLogic Embedded Community x ROS Ukraine Webinar "Surgical Robots"
GlobalLogic Embedded Community x ROS Ukraine Webinar "Surgical Robots"GlobalLogic Embedded Community x ROS Ukraine Webinar "Surgical Robots"
GlobalLogic Embedded Community x ROS Ukraine Webinar "Surgical Robots"
GlobalLogic Ukraine
 
GlobalLogic Java Community Webinar #17 “SpringJDBC vs JDBC. Is Spring a Hero?”
GlobalLogic Java Community Webinar #17 “SpringJDBC vs JDBC. Is Spring a Hero?”GlobalLogic Java Community Webinar #17 “SpringJDBC vs JDBC. Is Spring a Hero?”
GlobalLogic Java Community Webinar #17 “SpringJDBC vs JDBC. Is Spring a Hero?”
GlobalLogic Ukraine
 
GlobalLogic JavaScript Community Webinar #18 “Long Story Short: OSI Model”
GlobalLogic JavaScript Community Webinar #18 “Long Story Short: OSI Model”GlobalLogic JavaScript Community Webinar #18 “Long Story Short: OSI Model”
GlobalLogic JavaScript Community Webinar #18 “Long Story Short: OSI Model”
GlobalLogic Ukraine
 
Штучний інтелект як допомога в навчанні, а не замінник.pptx
Штучний інтелект як допомога в навчанні, а не замінник.pptxШтучний інтелект як допомога в навчанні, а не замінник.pptx
Штучний інтелект як допомога в навчанні, а не замінник.pptx
GlobalLogic Ukraine
 
Задачі AI-розробника як застосовується штучний інтелект.pptx
Задачі AI-розробника як застосовується штучний інтелект.pptxЗадачі AI-розробника як застосовується штучний інтелект.pptx
Задачі AI-розробника як застосовується штучний інтелект.pptx
GlobalLogic Ukraine
 
Що треба вивчати, щоб стати розробником штучного інтелекту та нейромереж.pptx
Що треба вивчати, щоб стати розробником штучного інтелекту та нейромереж.pptxЩо треба вивчати, щоб стати розробником штучного інтелекту та нейромереж.pptx
Що треба вивчати, щоб стати розробником штучного інтелекту та нейромереж.pptx
GlobalLogic Ukraine
 
GlobalLogic Java Community Webinar #16 “Zaloni’s Architecture for Data-Driven...
GlobalLogic Java Community Webinar #16 “Zaloni’s Architecture for Data-Driven...GlobalLogic Java Community Webinar #16 “Zaloni’s Architecture for Data-Driven...
GlobalLogic Java Community Webinar #16 “Zaloni’s Architecture for Data-Driven...
GlobalLogic Ukraine
 
JavaScript Community Webinar #14 "Why Is Git Rebase?"
JavaScript Community Webinar #14 "Why Is Git Rebase?"JavaScript Community Webinar #14 "Why Is Git Rebase?"
JavaScript Community Webinar #14 "Why Is Git Rebase?"
GlobalLogic Ukraine
 
GlobalLogic .NET Community Webinar #3 "Exploring Serverless with Azure Functi...
GlobalLogic .NET Community Webinar #3 "Exploring Serverless with Azure Functi...GlobalLogic .NET Community Webinar #3 "Exploring Serverless with Azure Functi...
GlobalLogic .NET Community Webinar #3 "Exploring Serverless with Azure Functi...
GlobalLogic Ukraine
 
Страх і сила помилок - IT Inside від GlobalLogic Education
Страх і сила помилок - IT Inside від GlobalLogic EducationСтрах і сила помилок - IT Inside від GlobalLogic Education
Страх і сила помилок - IT Inside від GlobalLogic Education
GlobalLogic Ukraine
 
GlobalLogic .NET Webinar #2 “Azure RBAC and Managed Identity”
GlobalLogic .NET Webinar #2 “Azure RBAC and Managed Identity”GlobalLogic .NET Webinar #2 “Azure RBAC and Managed Identity”
GlobalLogic .NET Webinar #2 “Azure RBAC and Managed Identity”
GlobalLogic Ukraine
 
GlobalLogic QA Webinar “What does it take to become a Test Engineer”
GlobalLogic QA Webinar “What does it take to become a Test Engineer”GlobalLogic QA Webinar “What does it take to become a Test Engineer”
GlobalLogic QA Webinar “What does it take to become a Test Engineer”
GlobalLogic Ukraine
 
“How to Secure Your Applications With a Keycloak?
“How to Secure Your Applications With a Keycloak?“How to Secure Your Applications With a Keycloak?
“How to Secure Your Applications With a Keycloak?
GlobalLogic Ukraine
 
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Ukraine
 
GlobalLogic Machine Learning Webinar “Statistical learning of linear regressi...
GlobalLogic Machine Learning Webinar “Statistical learning of linear regressi...GlobalLogic Machine Learning Webinar “Statistical learning of linear regressi...
GlobalLogic Machine Learning Webinar “Statistical learning of linear regressi...
GlobalLogic Ukraine
 
GlobalLogic C++ Webinar “The Minimum Knowledge to Become a C++ Developer”
GlobalLogic C++ Webinar “The Minimum Knowledge to Become a C++ Developer”GlobalLogic C++ Webinar “The Minimum Knowledge to Become a C++ Developer”
GlobalLogic C++ Webinar “The Minimum Knowledge to Become a C++ Developer”
GlobalLogic Ukraine
 
Embedded Webinar #17 "Low-level Network Testing in Embedded Devices Development"
Embedded Webinar #17 "Low-level Network Testing in Embedded Devices Development"Embedded Webinar #17 "Low-level Network Testing in Embedded Devices Development"
Embedded Webinar #17 "Low-level Network Testing in Embedded Devices Development"
GlobalLogic Ukraine
 
GlobalLogic Webinar "Introduction to Embedded QA"
GlobalLogic Webinar "Introduction to Embedded QA"GlobalLogic Webinar "Introduction to Embedded QA"
GlobalLogic Webinar "Introduction to Embedded QA"
GlobalLogic Ukraine
 
C++ Webinar "Why Should You Learn C++ in 2021-22?"
C++ Webinar "Why Should You Learn C++ in 2021-22?"C++ Webinar "Why Should You Learn C++ in 2021-22?"
C++ Webinar "Why Should You Learn C++ in 2021-22?"
GlobalLogic Ukraine
 
GlobalLogic Test Automation Live Testing Session “Android Behind UI — Testing...
GlobalLogic Test Automation Live Testing Session “Android Behind UI — Testing...GlobalLogic Test Automation Live Testing Session “Android Behind UI — Testing...
GlobalLogic Test Automation Live Testing Session “Android Behind UI — Testing...
GlobalLogic Ukraine
 

More from GlobalLogic Ukraine (20)

GlobalLogic Embedded Community x ROS Ukraine Webinar "Surgical Robots"
GlobalLogic Embedded Community x ROS Ukraine Webinar "Surgical Robots"GlobalLogic Embedded Community x ROS Ukraine Webinar "Surgical Robots"
GlobalLogic Embedded Community x ROS Ukraine Webinar "Surgical Robots"
 
GlobalLogic Java Community Webinar #17 “SpringJDBC vs JDBC. Is Spring a Hero?”
GlobalLogic Java Community Webinar #17 “SpringJDBC vs JDBC. Is Spring a Hero?”GlobalLogic Java Community Webinar #17 “SpringJDBC vs JDBC. Is Spring a Hero?”
GlobalLogic Java Community Webinar #17 “SpringJDBC vs JDBC. Is Spring a Hero?”
 
GlobalLogic JavaScript Community Webinar #18 “Long Story Short: OSI Model”
GlobalLogic JavaScript Community Webinar #18 “Long Story Short: OSI Model”GlobalLogic JavaScript Community Webinar #18 “Long Story Short: OSI Model”
GlobalLogic JavaScript Community Webinar #18 “Long Story Short: OSI Model”
 
Штучний інтелект як допомога в навчанні, а не замінник.pptx
Штучний інтелект як допомога в навчанні, а не замінник.pptxШтучний інтелект як допомога в навчанні, а не замінник.pptx
Штучний інтелект як допомога в навчанні, а не замінник.pptx
 
Задачі AI-розробника як застосовується штучний інтелект.pptx
Задачі AI-розробника як застосовується штучний інтелект.pptxЗадачі AI-розробника як застосовується штучний інтелект.pptx
Задачі AI-розробника як застосовується штучний інтелект.pptx
 
Що треба вивчати, щоб стати розробником штучного інтелекту та нейромереж.pptx
Що треба вивчати, щоб стати розробником штучного інтелекту та нейромереж.pptxЩо треба вивчати, щоб стати розробником штучного інтелекту та нейромереж.pptx
Що треба вивчати, щоб стати розробником штучного інтелекту та нейромереж.pptx
 
GlobalLogic Java Community Webinar #16 “Zaloni’s Architecture for Data-Driven...
GlobalLogic Java Community Webinar #16 “Zaloni’s Architecture for Data-Driven...GlobalLogic Java Community Webinar #16 “Zaloni’s Architecture for Data-Driven...
GlobalLogic Java Community Webinar #16 “Zaloni’s Architecture for Data-Driven...
 
JavaScript Community Webinar #14 "Why Is Git Rebase?"
JavaScript Community Webinar #14 "Why Is Git Rebase?"JavaScript Community Webinar #14 "Why Is Git Rebase?"
JavaScript Community Webinar #14 "Why Is Git Rebase?"
 
GlobalLogic .NET Community Webinar #3 "Exploring Serverless with Azure Functi...
GlobalLogic .NET Community Webinar #3 "Exploring Serverless with Azure Functi...GlobalLogic .NET Community Webinar #3 "Exploring Serverless with Azure Functi...
GlobalLogic .NET Community Webinar #3 "Exploring Serverless with Azure Functi...
 
Страх і сила помилок - IT Inside від GlobalLogic Education
Страх і сила помилок - IT Inside від GlobalLogic EducationСтрах і сила помилок - IT Inside від GlobalLogic Education
Страх і сила помилок - IT Inside від GlobalLogic Education
 
GlobalLogic .NET Webinar #2 “Azure RBAC and Managed Identity”
GlobalLogic .NET Webinar #2 “Azure RBAC and Managed Identity”GlobalLogic .NET Webinar #2 “Azure RBAC and Managed Identity”
GlobalLogic .NET Webinar #2 “Azure RBAC and Managed Identity”
 
GlobalLogic QA Webinar “What does it take to become a Test Engineer”
GlobalLogic QA Webinar “What does it take to become a Test Engineer”GlobalLogic QA Webinar “What does it take to become a Test Engineer”
GlobalLogic QA Webinar “What does it take to become a Test Engineer”
 
“How to Secure Your Applications With a Keycloak?
“How to Secure Your Applications With a Keycloak?“How to Secure Your Applications With a Keycloak?
“How to Secure Your Applications With a Keycloak?
 
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear...
 
GlobalLogic Machine Learning Webinar “Statistical learning of linear regressi...
GlobalLogic Machine Learning Webinar “Statistical learning of linear regressi...GlobalLogic Machine Learning Webinar “Statistical learning of linear regressi...
GlobalLogic Machine Learning Webinar “Statistical learning of linear regressi...
 
GlobalLogic C++ Webinar “The Minimum Knowledge to Become a C++ Developer”
GlobalLogic C++ Webinar “The Minimum Knowledge to Become a C++ Developer”GlobalLogic C++ Webinar “The Minimum Knowledge to Become a C++ Developer”
GlobalLogic C++ Webinar “The Minimum Knowledge to Become a C++ Developer”
 
Embedded Webinar #17 "Low-level Network Testing in Embedded Devices Development"
Embedded Webinar #17 "Low-level Network Testing in Embedded Devices Development"Embedded Webinar #17 "Low-level Network Testing in Embedded Devices Development"
Embedded Webinar #17 "Low-level Network Testing in Embedded Devices Development"
 
GlobalLogic Webinar "Introduction to Embedded QA"
GlobalLogic Webinar "Introduction to Embedded QA"GlobalLogic Webinar "Introduction to Embedded QA"
GlobalLogic Webinar "Introduction to Embedded QA"
 
C++ Webinar "Why Should You Learn C++ in 2021-22?"
C++ Webinar "Why Should You Learn C++ in 2021-22?"C++ Webinar "Why Should You Learn C++ in 2021-22?"
C++ Webinar "Why Should You Learn C++ in 2021-22?"
 
GlobalLogic Test Automation Live Testing Session “Android Behind UI — Testing...
GlobalLogic Test Automation Live Testing Session “Android Behind UI — Testing...GlobalLogic Test Automation Live Testing Session “Android Behind UI — Testing...
GlobalLogic Test Automation Live Testing Session “Android Behind UI — Testing...
 

Recently uploaded

PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
ViralQR
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 

Recently uploaded (20)

PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 

GlobalLogic Webinar: Massive aggregations with Spark and Hadoop

  • 1. 1 Confidential Massive aggregations with Spark on Hadoop Big Data Community
  • 2. 2 Confidential About the speaker Danylo Stepanchuk Big Data Architect, member of Big Data Practice
  • 3. 3 Confidential Main considerations on massive data processing What is you environment and processing pipelines? How can we handle late arrivals? Multi actions/stages job pitfalls How can we optimize joins? Is it always possible to count the exact number of unique items? How many shuffles should a job have? Q&A session Agenda
  • 4. 4 Confidential Daily ● 150+B of raw events ● 130+TB of raw input data ○ varies hourly from 4TB up to 8TB ○ common spikes on holidays which can double the volume ● over 110 dimensions in the analytic data store What’s the challenge?
  • 5. 5 Confidential Main considerations on massive data processing ● Volume ○ Retention policy ○ Valuable records ratio ○ Sampling techniques ● Processing complexity ○ Average input and spikes ○ Logical complexity ○ Late arrivals ○ Accuracy ○ Data access strategy ● Analytics: ○ Granularity ○ Dimensions number and their cardinality ● Consumer level SLA ● Computational resources and multi-tenancy environment
  • 7. 7 Confidential What is your environment? prod dev / qa
  • 9. 9 Confidential How can we handle late arrivals? Late arrivals SLA is 3 hours ● Wait until SLA comes (aka Watermark) ● Enhance the above with one more trigger after closing a reporting period (aka Window) ● Mini-batches during reporting period and final one when SLA comes Why are you reinventing the wheel if there is at least Lamda architecture? AWS 600 instances: 32 cpu, 256gb RAM, 500 gb storage Amazon Elastic Block Storage (EBS) pricing (monthly) 30,000.00 USD Amazon EC2 Reserved instances (monthly) 499,758.00 USD Total monthly cost: 529,758.00 USD
  • 10. 10 Confidential Multi actions / stages job pitfalls ● Data buffering ○ You will increase memory per executor ○ Storage & Execution memory competition ● Intermediate files ○ Spark local drives are most likely to run out of space ○ Forcing garbage collection after an action can help, but the behaviour is not officially documented Tip: if your shuffle is big organize your pipeline jobs to be as simple as possible
  • 11. 11 Confidential What is your pretty common processing logic per job? Data Source 1 Data Source 2 Data Source 3 Data Source X SELECT UID, d1, d2, … , dN, aggr1, aggr2, … , aggrM FROM DS1, ... , DSX WHERE DS1.UID=DS2.UID AND ... AND DS1.UID=DSX.UID GROUP BY UID, d1, d2, … , dN Data Set Yes, the ONLY aggregation query!!!
  • 12. 12 Confidential How can we optimize joins? Data Source 1 Data Source 2 Data Source 3 SELECT ...fields... FROM DS1 LEFT JOIN DS2 ON DS2.UID = DS1.UID LEFT JOIN DS3 ON DS3.UID = DS1.UID WHERE DS2.UID IS NOT NULL OR DS3.UID IS NOT NULL Data Set70M - 160M 1G - 2.2G 2T - 4T - Shuffle is equal to the total input size - Most of the Data Source 1 records don’t pass the filter
  • 13. 13 Confidential How can we optimize joins? Broadcast? Data Source 1 Data Source 2 Data Source 3 70M - 160M 1G - 2.2G 2T - 4T Pros ● spark-native solution ● drastically reduces shuffle size ● easy to implement (the only consideration maybe is to get distinct keys) Cons ● the broadcast size - needs to be filtered - extract and broadcast keys
  • 14. 14 Confidential How can we optimize joins? The Bloom Filter
  • 15. 15 Confidential Is it always possible to count the exact number of unique items? How can we solve this? If a visitor is a human being the cardinality virtually can be It depends on the visitor cardinality! If it relatively small we are good to use HashSet
  • 16. 16 Confidential Is it always possible to count the exact number of unique items? HyperLogLog The HyperLogLog algorithm is able to estimate cardinalities of > 109 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory It has been available since Apache Spark 2.0 for Python, R, and Scala with DataFrames and Datasets api ● approx_count_distinct ● approxCountDistinct (before 2.1.0)
  • 17. 17 Confidential Is it always possible to count the exact number of unique items? HyperLogLog It is easy to add your custom version of the algorithm. You can use it with RDD api as well
  • 18. 18 Confidential How many shuffles should a job have? Data Source SELECT d1, … , dN, exploded_field, aggr1, … , aggrM (SELECT * FROM sds EXPLODE(list_field) AS exploded_field) as eds FROM eds GROUP BY d1, … , dN, exploded_field Data Set 1.3T - 2.3T This list_field may have dozens of items We can do the aggregation with the only shuffle
  • 19. 19 Confidential How many shuffles should a job have? Input Shuffle Write Size Job Running Time (Total Uptime) Stages Running Time Original Job 1595.7 GB 1590.7 GB 12 min 7.5 min + 2.5 min Optimized Job 1595.7 GB 996.0 GB + 84.8 GB 12 min 6.2 min + 2.3 min + 1.8 min
  • 20. 20 Confidential How many shuffles should a job have? The Double Shuffle 1 2 3
  • 21. 21 Confidential A couple of other tips ● Use dynamic allocation ○ spark.dynamicAllocation.enabled true ● Control number of shuffle partitions ○ spark.sql.shuffle.partitions = 200 by default ● Use heap memory if shuffle is big, but processing is lightweight ○ spark.shuffle.io.preferDirectBufs = false (true by default) ● Speculation can help in most of the cases ○ spark.speculation = true (false by default) ● Wisely tune executor resources ○ spark.executor.memory ○ spark.executor.cores ○ spark.dynamicAllocation.maxExecutors