SlideShare a Scribd company logo
1 of 42
Download to read offline
Spark+AI Summit
April 24, 2019
Smart join algorithms
for fighting skew at scale
Andrew Clegg, Applied Machine Learning Group @ Yelp
@andrew_clegg
Data skew, outliers, power laws, and their symptomsOverview
Joining skewed data more efficiently
Extensions and tips
How do joins work in Spark, and why skewed data hurts
Data skew
Classic skewed distributions
DATA SKEW
Image by Rodolfo Hermans, CC BY-SA 3.0, via Wikiversity
Outliers
DATA SKEW
Image from Hedges & Shah, “Comparison of mode estimation methods and application in molecular clock analysis”, CC BY 4.0
Power laws
DATA SKEW
Word rank vs. word frequency in an English text corpus
Electrostatic and gravitational forces (inverse square law)Power laws are
everywhere
(approximately)
‘80/20 rule’ in distribution of income (Pareto principle)
Relationship between body size and metabolism
(Kleiber’s law)
Distribution of earthquake magnitudes
Word frequencies in natural language corpora (Zipf’s law)Power laws are
everywhere
(approximately)
Participation inequality on wikis and forum sites (‘1% rule’)
Popularity of websites and their content
Degree distribution in social networks (‘Bieber problem’)
Why is this a problem?
Hot shards in databases — salt keys, change schemaPopularity hotspots:
symptoms and fixes
Hot mappers in map-only tasks — repartition randomly
Hot reducers during joins and aggregations … ?
Slow load times for certain users — look for O(n2) mistakes
Diagnosing hot executors
POPULARITY HOTSPOTS
!?
Spark joins
under the hood
Shuffled hash join
SPARK JOINS
DataFrame 1
DataFrame 2
Partitions
Shuffle Shuffle
Shuffled hash join with very skewed data
SPARK JOINS
DataFrame 1
DataFrame 2
Partitions
Shuffle Shuffle
Broadcast join can help
SPARK JOINS
DataFrame 2
Broadcast
DataFrame 1
Map
Joined
Broadcast join can help sometimes
SPARK JOINS
DataFrame 2
Broadcast
DataFrame 1
Map
Joined
Must fit in RAM on
each executor
Joining skewed
data faster
Append random int in [0, R) to each key in skewed dataSplitting a single key
across multiple tasks
Append replica ID to original key in non-skewed data
Join on this newly-generated key
Replicate each row in non-skewed data, R times
R = replication factor
Replicated join
FASTER JOINS
DataFrame 1
DataFrame 2_0
Shuffle Shuffle
DataFrame 2_1
DataFrame 2_2
0
0
0
1
2
1
2
1
2
Logical Partitions
Replica IDs
replication_factor = 10
# spark.range creates an 'id' column: 0 <= id < replication_factor
replication_ids = F.broadcast(
spark.range(replication_factor).withColumnRenamed('id', 'replica_id')
)
# Replicate uniform data, one copy of each row per bucket
# composite_key looks like: 12345@3 (original_id@replica_id)
uniform_data_replicated = (
uniform_data
.crossJoin(replication_ids)
.withColumn(
'composite_key',
F.concat('original_id', F.lit('@'), 'replica_id')
)
)
def randint(limit):
return F.least(
F.floor(F.rand() * limit),
F.lit(limit - 1), # just to avoid unlikely edge case
)
# Randomly assign rows in skewed data to buckets
# composite_key has same format as in uniform data
skewed_data_tagged = (
skewed_data
.withColumn(
'composite_key',
F.concat(
'original_id',
F.lit('@'),
randint(replication_factor),
)
)
)
# Join them together on the composite key
joined = skewed_data_tagged.join(
uniform_data_replicated,
on='composite_key',
how='inner',
)
# Join them together on the composite key
joined = skewed_data_tagged.join(
uniform_data_replicated,
on='composite_key',
how='inner',
) WARNING
Inner and left outer
joins only
Remember duplicates…
FASTER JOINS
All different rows
Same row
Same row
Same row
100 million rows of data with uniformly-distributed keysExperiments with
synthetic data
Standard inner join ran for 7+ hours then I killed it!
10x replicated join completed in 1h16m
100 billion rows of data with Zipf-distributed keys
Can we do better?
Very common keys should be replicated many timesDifferential replication
Identify frequent keys before replication
Use different replication policy for those
Rare keys don’t need to be replicated as much (or at all?)
replication_factor_high = 50
replication_high = F.broadcast(
spark
.range(replication_factor_high)
.withColumnRenamed('id', 'replica_id')
)
replication_factor_low = 10
replication_low = … # as above
# Determine which keys are highly over-represented
top_keys = F.broadcast(
skewed_data
.freqItems(['original_id'], 0.0001) # return keys with frequency > this
.select(
F.explode('id_freqItems').alias('id_freqItems')
)
)
uniform_data_top_keys = (
uniform_data
.join(
top_keys,
uniform_data.original_id == top_keys.id_freqItems,
how='inner',
)
.crossJoin(replication_high)
.withColumn(
'composite_key',
F.concat('original_id', F.lit('@'), 'replica_id')
)
)
uniform_data_rest = (
uniform_data
.join(
top_keys,
uniform_data.original_id == top_keys.id_freqItems,
how='leftanti',
)
.crossJoin(replication_low)
.withColumn(
'composite_key',
F.concat('original_id', F.lit('@'), 'replica_id')
)
)
uniform_data_replicated = uniform_data_top_keys.union(uniform_data_rest)
FASTER JOINS
DataFrame 1
DataFrame 2_0
freqItems
DataFrame 2_1
Frequently
occurring keys
Broadcast
DataFrame 2_2
Replicate very frequent keys more often
skewed_data_tagged = (
skewed_data
.join(
top_keys,
skewed_data.id == top_keys.id_freqItems,
how='left',
)
.withColumn(
'replica_id',
F.when(
F.isnull(F.col('id_freqItems')), randint(replication_factor_low),
)
.otherwise(randint(replication_factor_high))
)
.withColumn(
'composite_key',
F.concat('original_id', F.lit('@'), 'replica_id')
)
)
100 million rows of data with uniformly-distributed keysExperiments with
synthetic data
10x replicated join completed in 1h16m
10x/50x differential replication completed in under 1h
100 billion rows of data with Zipf-distributed keys
Other ideas worth mentioning
Identify very common keys in skewed dataPartial broadcasting
Rare keys are joined the traditional way (without replication)
Union the resulting joined DataFrames
Select these rows from uniform data and broadcast join
Get approximate frequency for every key in skewed dataDynamic replication
Intuition: Sliding scale between rarest and most common
Can be hard to make this work in practice!
Replicate uniform data proportional to key frequency
Append random replica_id_that for other side, in [0, Rthat)Double-sided skew
Composite key on left: id, replica_id_that, replica_id_this
Composite key on right: id, replica_id_this, replica_id_that
Replicate each row Rthis times, and append replica_id_this
Summing up
Is the problem just outliers? Can you safely ignore them?Checklist
Look at your data to get an idea of the distribution
Start simple (fixed replication factor) then iterate if necessary
Try broadcast join if possible
Warren & Karau, High Performance Spark (O’Reilly 2017)Credits
Scalding developers: github.com/twitter/scalding
Spark, ML, and distributed systems community @ Yelp
Blog by Maria Restre: datarus.wordpress.com
Thanks!
@andrew_clegg

More Related Content

What's hot

What's hot (20)

Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Indexing with MongoDB
Indexing with MongoDBIndexing with MongoDB
Indexing with MongoDB
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Spark
SparkSpark
Spark
 
Care and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerCare and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst Optimizer
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
 

Similar to Smart Join Algorithms for Fighting Skew at Scale

CJ 560 Module Seven Short Pa
CJ 560 Module Seven Short PaCJ 560 Module Seven Short Pa
CJ 560 Module Seven Short Pa
VinaOconner450
 

Similar to Smart Join Algorithms for Fighting Skew at Scale (20)

Why Our Code Smells
Why Our Code SmellsWhy Our Code Smells
Why Our Code Smells
 
Advanced SQL Webinar
Advanced SQL WebinarAdvanced SQL Webinar
Advanced SQL Webinar
 
Python postgre sql a wonderful wedding
Python postgre sql   a wonderful weddingPython postgre sql   a wonderful wedding
Python postgre sql a wonderful wedding
 
Sql alchemy bpstyle_4
Sql alchemy bpstyle_4Sql alchemy bpstyle_4
Sql alchemy bpstyle_4
 
CJ 560 Module Seven Short Pa
CJ 560 Module Seven Short PaCJ 560 Module Seven Short Pa
CJ 560 Module Seven Short Pa
 
Django - sql alchemy - jquery
Django - sql alchemy - jqueryDjango - sql alchemy - jquery
Django - sql alchemy - jquery
 
Taking Perl to Eleven with Higher-Order Functions
Taking Perl to Eleven with Higher-Order FunctionsTaking Perl to Eleven with Higher-Order Functions
Taking Perl to Eleven with Higher-Order Functions
 
Bc0041
Bc0041Bc0041
Bc0041
 
Php & my sql
Php & my sqlPhp & my sql
Php & my sql
 
Error Reporting in ZF2: form messages, custom error pages, logging
Error Reporting in ZF2: form messages, custom error pages, loggingError Reporting in ZF2: form messages, custom error pages, logging
Error Reporting in ZF2: form messages, custom error pages, logging
 
Code smells (1).pptx
Code smells (1).pptxCode smells (1).pptx
Code smells (1).pptx
 
Python_Cheat_Sheet_Keywords_1664634397.pdf
Python_Cheat_Sheet_Keywords_1664634397.pdfPython_Cheat_Sheet_Keywords_1664634397.pdf
Python_Cheat_Sheet_Keywords_1664634397.pdf
 
Python_Cheat_Sheet_Keywords_1664634397.pdf
Python_Cheat_Sheet_Keywords_1664634397.pdfPython_Cheat_Sheet_Keywords_1664634397.pdf
Python_Cheat_Sheet_Keywords_1664634397.pdf
 
Drupal7 dbtng
Drupal7  dbtngDrupal7  dbtng
Drupal7 dbtng
 
Java 8 Lambda Expressions
Java 8 Lambda ExpressionsJava 8 Lambda Expressions
Java 8 Lambda Expressions
 
Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Data Analysis with R (combined slides)
Data Analysis with R (combined slides)
 
Everything is composable
Everything is composableEverything is composable
Everything is composable
 
Fnt Software Solutions Pvt Ltd Placement Papers - PHP Technology
Fnt Software Solutions Pvt Ltd Placement Papers - PHP TechnologyFnt Software Solutions Pvt Ltd Placement Papers - PHP Technology
Fnt Software Solutions Pvt Ltd Placement Papers - PHP Technology
 
A "M"ind Bending Experience. Power Query for Power BI and Beyond.
A "M"ind Bending Experience. Power Query for Power BI and Beyond.A "M"ind Bending Experience. Power Query for Power BI and Beyond.
A "M"ind Bending Experience. Power Query for Power BI and Beyond.
 
Contracts in-clojure-pete
Contracts in-clojure-peteContracts in-clojure-pete
Contracts in-clojure-pete
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 

Recently uploaded (20)

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 

Smart Join Algorithms for Fighting Skew at Scale

  • 2. Smart join algorithms for fighting skew at scale Andrew Clegg, Applied Machine Learning Group @ Yelp @andrew_clegg
  • 3. Data skew, outliers, power laws, and their symptomsOverview Joining skewed data more efficiently Extensions and tips How do joins work in Spark, and why skewed data hurts
  • 5. Classic skewed distributions DATA SKEW Image by Rodolfo Hermans, CC BY-SA 3.0, via Wikiversity
  • 6. Outliers DATA SKEW Image from Hedges & Shah, “Comparison of mode estimation methods and application in molecular clock analysis”, CC BY 4.0
  • 7. Power laws DATA SKEW Word rank vs. word frequency in an English text corpus
  • 8. Electrostatic and gravitational forces (inverse square law)Power laws are everywhere (approximately) ‘80/20 rule’ in distribution of income (Pareto principle) Relationship between body size and metabolism (Kleiber’s law) Distribution of earthquake magnitudes
  • 9. Word frequencies in natural language corpora (Zipf’s law)Power laws are everywhere (approximately) Participation inequality on wikis and forum sites (‘1% rule’) Popularity of websites and their content Degree distribution in social networks (‘Bieber problem’)
  • 10. Why is this a problem?
  • 11. Hot shards in databases — salt keys, change schemaPopularity hotspots: symptoms and fixes Hot mappers in map-only tasks — repartition randomly Hot reducers during joins and aggregations … ? Slow load times for certain users — look for O(n2) mistakes
  • 14. Shuffled hash join SPARK JOINS DataFrame 1 DataFrame 2 Partitions Shuffle Shuffle
  • 15. Shuffled hash join with very skewed data SPARK JOINS DataFrame 1 DataFrame 2 Partitions Shuffle Shuffle
  • 16. Broadcast join can help SPARK JOINS DataFrame 2 Broadcast DataFrame 1 Map Joined
  • 17. Broadcast join can help sometimes SPARK JOINS DataFrame 2 Broadcast DataFrame 1 Map Joined Must fit in RAM on each executor
  • 19. Append random int in [0, R) to each key in skewed dataSplitting a single key across multiple tasks Append replica ID to original key in non-skewed data Join on this newly-generated key Replicate each row in non-skewed data, R times R = replication factor
  • 20. Replicated join FASTER JOINS DataFrame 1 DataFrame 2_0 Shuffle Shuffle DataFrame 2_1 DataFrame 2_2 0 0 0 1 2 1 2 1 2 Logical Partitions Replica IDs
  • 21. replication_factor = 10 # spark.range creates an 'id' column: 0 <= id < replication_factor replication_ids = F.broadcast( spark.range(replication_factor).withColumnRenamed('id', 'replica_id') ) # Replicate uniform data, one copy of each row per bucket # composite_key looks like: 12345@3 (original_id@replica_id) uniform_data_replicated = ( uniform_data .crossJoin(replication_ids) .withColumn( 'composite_key', F.concat('original_id', F.lit('@'), 'replica_id') ) )
  • 22. def randint(limit): return F.least( F.floor(F.rand() * limit), F.lit(limit - 1), # just to avoid unlikely edge case ) # Randomly assign rows in skewed data to buckets # composite_key has same format as in uniform data skewed_data_tagged = ( skewed_data .withColumn( 'composite_key', F.concat( 'original_id', F.lit('@'), randint(replication_factor), ) ) )
  • 23. # Join them together on the composite key joined = skewed_data_tagged.join( uniform_data_replicated, on='composite_key', how='inner', )
  • 24. # Join them together on the composite key joined = skewed_data_tagged.join( uniform_data_replicated, on='composite_key', how='inner', ) WARNING Inner and left outer joins only
  • 25. Remember duplicates… FASTER JOINS All different rows Same row Same row Same row
  • 26. 100 million rows of data with uniformly-distributed keysExperiments with synthetic data Standard inner join ran for 7+ hours then I killed it! 10x replicated join completed in 1h16m 100 billion rows of data with Zipf-distributed keys
  • 27. Can we do better?
  • 28. Very common keys should be replicated many timesDifferential replication Identify frequent keys before replication Use different replication policy for those Rare keys don’t need to be replicated as much (or at all?)
  • 29. replication_factor_high = 50 replication_high = F.broadcast( spark .range(replication_factor_high) .withColumnRenamed('id', 'replica_id') ) replication_factor_low = 10 replication_low = … # as above # Determine which keys are highly over-represented top_keys = F.broadcast( skewed_data .freqItems(['original_id'], 0.0001) # return keys with frequency > this .select( F.explode('id_freqItems').alias('id_freqItems') ) )
  • 30. uniform_data_top_keys = ( uniform_data .join( top_keys, uniform_data.original_id == top_keys.id_freqItems, how='inner', ) .crossJoin(replication_high) .withColumn( 'composite_key', F.concat('original_id', F.lit('@'), 'replica_id') ) )
  • 31. uniform_data_rest = ( uniform_data .join( top_keys, uniform_data.original_id == top_keys.id_freqItems, how='leftanti', ) .crossJoin(replication_low) .withColumn( 'composite_key', F.concat('original_id', F.lit('@'), 'replica_id') ) ) uniform_data_replicated = uniform_data_top_keys.union(uniform_data_rest)
  • 32. FASTER JOINS DataFrame 1 DataFrame 2_0 freqItems DataFrame 2_1 Frequently occurring keys Broadcast DataFrame 2_2 Replicate very frequent keys more often
  • 33. skewed_data_tagged = ( skewed_data .join( top_keys, skewed_data.id == top_keys.id_freqItems, how='left', ) .withColumn( 'replica_id', F.when( F.isnull(F.col('id_freqItems')), randint(replication_factor_low), ) .otherwise(randint(replication_factor_high)) ) .withColumn( 'composite_key', F.concat('original_id', F.lit('@'), 'replica_id') ) )
  • 34. 100 million rows of data with uniformly-distributed keysExperiments with synthetic data 10x replicated join completed in 1h16m 10x/50x differential replication completed in under 1h 100 billion rows of data with Zipf-distributed keys
  • 35. Other ideas worth mentioning
  • 36. Identify very common keys in skewed dataPartial broadcasting Rare keys are joined the traditional way (without replication) Union the resulting joined DataFrames Select these rows from uniform data and broadcast join
  • 37. Get approximate frequency for every key in skewed dataDynamic replication Intuition: Sliding scale between rarest and most common Can be hard to make this work in practice! Replicate uniform data proportional to key frequency
  • 38. Append random replica_id_that for other side, in [0, Rthat)Double-sided skew Composite key on left: id, replica_id_that, replica_id_this Composite key on right: id, replica_id_this, replica_id_that Replicate each row Rthis times, and append replica_id_this
  • 40. Is the problem just outliers? Can you safely ignore them?Checklist Look at your data to get an idea of the distribution Start simple (fixed replication factor) then iterate if necessary Try broadcast join if possible
  • 41. Warren & Karau, High Performance Spark (O’Reilly 2017)Credits Scalding developers: github.com/twitter/scalding Spark, ML, and distributed systems community @ Yelp Blog by Maria Restre: datarus.wordpress.com