SlideShare a Scribd company logo
Rob	Keevil	- KnowIT
Fokko Driesprong- GoDataDriven
Working	with	Skewed	Data:
The	Iterative	Broadcast
#EUde11
#EUde11
About	Rob	Keevil
• Philosophy,	Politics	&	Economics
• Big	Data	Architect
• Financial	and	Anti-Fraud	Domains
• Scala	Programmer
#EUde11
About	Fokko Driesprong
• Software	Engineering	and	Distributed	Systems
• Data	Engineer	at	GoDataDriven
• Committer	&	PPMC	Member,	Apache	Airflow
• Open	source	enthusiast
https://godatadriven.com/players/fokko-driesprong
#EUde11
Overview
• The	Skew	Problem	at	ING
• Join	Strategies	in	Spark
• Our	Solution
• Performance	 Analysis
• Considerations
• Future	Work
#EUde11
The	Skew	Problem	at
• Billions	of	transactions…
• Billions	of	accounts…
• Millions	of	companies…
• …with	a	very uneven	distribution!
#EUde11
Familiar?
#EUde11
Join	Types	Native	to	Spark
• Sort Merge Join
• Broadcast Hash Join
• Shuffled Hash Join
• Broadcast Nested Loop Join
• Cartesian Product
Key Column A
1 HELLO
The	Sort	Merge	Join
8#EUde11
Stage	1:
Determine	
partitions	
by	hashing	
the	key(s)
3 GOOD
Key Column A
Key Column BKey Column B
1 WORLD!
2 SUMMIT
1 DUBLIN
2 STREAMING
1 EVERYBODY
2 SQL
3 MORNING
3 EVENING
3 NIGHT
4 SPAM
4 SPAM
4 SPAM
2 SPARK
4 SPAM
Executor	1
Key Column B
1 WORLD!
1 DUBLIN
1 EVERYBODY
3 MORNING
3 EVENING
3 NIGHT
Key Column A Column B
1 WORLD!
1 DUBLIN
1 EVERYBODY
3 MORNING
3 EVENING
3 NIGHT
3 GOOD3 GOOD3 GOOD3 GOOD
Key Column A
Executor	2
9#EUde11
Stage	2:
Merge
Key Column B
2 SUMMIT
2 STREAMING
2 SQL
4 SPAM
4 SPAM
4 SPAM
Key Column AKey Column A
1 HELLO1 HELLO1 HELLO1 HELLO
Key Column A Column B
2 SUMMIT
2 STREAMING
2 SQL
4 SPAM
4 SPAM
4 SPAM
2 SPARK2 SPARK2 SPARK
4 SPAM4 SPAM4 SPAM
2 SPARK
4 SPAM
Executor	1 Executor	2
10#EUde11
Key Column A Column B
1 HELLO WORLD!
1 HELLO DUBLIN
1 HELLO EVERYBODY
3 GOOD MORNING
3 GOOD EVENING
3 GOOD NIGHT
Key Column A Column B
2 SPARK SUMMIT
2 SPARK STREAMING
2 SPARK SQL
4 SPAM SPAM
4 SPAM SPAM
4 SPAM SPAM
11
Key Column A
1 HELLO
12#EUde11
Stage	1:
Determine	
partitions	
by	hashing	
the	key(s)
3 GOOD
Key Column A
2 SPARK
4 SPAM
Key Column BKey Column B
4 SPAM
2 SUMMIT
3 MORNING
4 SPAM
1 WORLD!
4 SPAM
4 SPAM
4 SPAM
4 SPAM
4 SPAM
4 SPAM
4 SPAM
Regular Table Skewed Table
Executor	2Executor	1
Key Column B
1 WORLD!
3 MORNING
Key Column A Column B
1 WORLD!
3 MORNING
3 GOOD
Key Column A
13#EUde11
Stage	2:
Merge
Key Column B
2 SUMMIT
4 SPAM
4 SPAM
4 SPAM
4 SPAM
4 SPAM
4 SPAM
4 SPAM
4 SPAM
4 SPAM
Key Column AKey Column A
1 HELLO
Key Column A Column B
2 SUMMIT
4 SPAM
4 SPAM
4 SPAM
4 SPAM
4 SPAM
4 SPAM
4 SPAM
4 SPAM
4 SPAM
2 SPARK
4 SPAM
1 HELLO
3 GOOD 4 SPAM4 SPAM4 SPAM4 SPAM4 SPAM4 SPAM4 SPAM4 SPAM
2 SPARK
4 SPAM
Executor	1 Executor	2
14#EUde11
Key Column A Column B
1 HELLO WORLD!
3 GOOD MORNING
Key Column A Column B
2 SPARK SUMMIT
4 SPAM SPAM
4 SPAM SPAM
4 SPAM SPAM
4 SPAM SPAM
4 SPAM SPAM
4 SPAM SPAM
4 SPAM SPAM
4 SPAM SPAM
4 SPAM SPAM
Potential	Solutions
• Give	it	more	resources?
• Repartition	the	data?
15#EUde11
What	about	Broadcast	Hash	Join?
• Leaves	the	partitions	larger	table	
untouched
• Copies	the	entire	smaller	table	to	
every	executor
• Iterate	over	the	large	table,	and	
join	using	a	HashMap
16#EUde11
Key Column B
4 SPAM
4 SPAM
4 SPAM
2 SUMMIT
3 MORNING
4 SPAM
Key Column A
1 HELLO
4 SPAM
2 SPARK
3 GOOD
Key Column A
1 HELLO
4 SPAM
2 SPARK
3 GOOD
Key Column A
1 HELLO
4 SPAM
2 SPARK
3 GOOD
Key Column B
4 SPAM
4 SPAM
1 WORLD
4 SPAM
4 SPAM
4 SPAM
17#EUde11
Smaller Table Larger Table
Stage	1:
Broadcast	
smaller	
dataset	to	
all	
executors
Key Column B
4 SPAM
4 SPAM
4 SPAM
2 SUMMIT
3 MORNING
4 SPAM
4 SPAM
4 SPAM
1 WORLD!
4 SPAM
4 SPAM
4 SPAM
Executor	2
3 GOOD
2 SPARK
Executor	1
Key Column B
4 SPAM
4 SPAM
4 SPAM
2 SUMMIT
3 MORNING
4 SPAM
Key Column A Column B
4 SPAM
4 SPAM
4 SPAM
2 SUMMIT
3 MORNING
4 SPAM
3 GOOD
Key Column A
18#EUde11
Stage	2:
Left	join
Key Column B
4 SPAM
4 SPAM
1 WORLD!
4 SPAM
4 SPAM
4 SPAM
Key Column A
Key Column A Column B
4 SPAM
4 SPAM
1 WORLD!
4 SPAM
4 SPAM
4 SPAM
2 SPARK
4 SPAM4 SPAM4 SPAM4 SPAM2 SPARK
4 SPAM
3 GOOD
Key Column AKey Column A
1 HELLO
4 SPAM4 SPAM4 SPAM4 SPAM4 SPAM
1 HELLO
4 SPAM
1 HELLO
Executor	1 Executor	2
19#EUde11
Key Column A Column B
4 SPAM SPAM
4 SPAM SPAM
4 SPAM SPAM
2 SPARK SUMMIT
3 GOOD MORNING
4 SPAM SPAM
Key Column A Column B
4 SPAM SPAM
4 SPAM SPAM
1 HELLO WORLD!
4 SPAM SPAM
4 SPAM SPAM
4 SPAM SPAM
But…
• …What	if	your	“smaller”	table	
isn’t	so	small?
• The	whole	small	table	needs	
to	fit	in	driver	memory	and	
every	executor
20#EUde11
Introducing:	The	Iterative	Broadcast!
• Divide	the	smaller	table	into	“passes”
• Broadcast	a	pass	&	left	join	to	the	larger	dataset
• Clear	the	broadcast	partition	from	memory
• Repeat	until	all	passes	are	processed
21#EUde11
22#EUde11
Key Column A
1 HELLO
4 SPAM
2 SPARK
3 GOOD
Stage	1:
Assign	Pass	Number	
to	Smaller	Table
Key Column A Pass
1 HELLO 1
4 SPAM 2
2 SPARK 2
3 GOOD 1
Executor	2Executor	1
23#EUde11
Stage	2:
Broadcast	
first	pass
Key Column B
4 SPAM
4 SPAM
1 WORLD!
4 SPAM
4 SPAM
4 SPAM
Key Column B
4 SPAM
4 SPAM
4 SPAM
2 SUMMIT
3 MORNING
4 SPAM
3 GOOD
1 HELLO
Key Column A Pass
1 HELLO 1
3 GOOD 1
Key Column A Pass
1 HELLO 1
3 GOOD 1
Executor	2Executor	1
24#EUde11
Stage	3:
Left	join	
first	pass
Key Column B
4 SPAM
4 SPAM
1 WORLD!
4 SPAM
4 SPAM
4 SPAM
Key Column B
4 SPAM
4 SPAM
4 SPAM
2 SUMMIT
3 MORNING
4 SPAM
Key Column A Column B
4 SPAM
4 SPAM
4 SPAM
2 SUMMIT
3 MORNING
4 SPAM
Key Column A Column B
4 SPAM
4 SPAM
1 WORLD!
4 SPAM
4 SPAM
4 SPAM
3 GOOD
1 HELLO
Key Column A Pass
1 HELLO 1
3 GOOD 1
Key Column A Pass
1 HELLO 1
3 GOOD 1
Executor	2Executor	1
25#EUde11
Key Column B
4 SPAM
4 SPAM
1 WORLD!
4 SPAM
4 SPAM
4 SPAM
Key Column B
4 SPAM
4 SPAM
4 SPAM
2 SUMMIT
3 MORNING
4 SPAM
Key Column A Column B
4 SPAM
4 SPAM
4 SPAM
2 SUMMIT
3 GOOD MORNING
4 SPAM
Key Column A Column B
4 SPAM
4 SPAM
1 HELLO WORLD!
4 SPAM
4 SPAM
4 SPAM
1 HELLO
Stage	4:
Broadcast	
second	
pass
Key Column A Pass
1 HELLO 1
3 GOOD 1
Key Column A Pass
1 HELLO 1
3 GOOD 1
Key Column A Pass
2 SPARK 2
4 SPAM 2
Key Column A Pass
2 SPARK 2
4 SPAM 2
Executor	2Executor	1
26#EUde11
Key Column B
4 SPAM
4 SPAM
1 WORLD!
4 SPAM
4 SPAM
4 SPAM
Key Column B
4 SPAM
4 SPAM
4 SPAM
2 SUMMIT
3 MORNING
4 SPAM
Key Column A Column B
4 SPAM
4 SPAM
4 SPAM
2 SUMMIT
3 GOOD MORNING
4 SPAM
Key Column A Column B
4 SPAM
4 SPAM
1 HELLO WORLD!
4 SPAM
4 SPAM
4 SPAM
3 GOOD
1 HELLO
Stage	5:
Left	join	
second	
pass
Key Column A Pass
1 HELLO 1
3 GOOD 1
3 SPARK
3 GOOD3 GOOD3 GOOD3 GOOD 3 GOOD3 GOOD3 GOOD3 GOOD3 GOOD3 GOOD
Key Column A Pass
2 SPARK 2
4 SPAM 2
Key Column A Pass
2 SPARK 2
4 SPAM 2
27#EUde11
Performance	Analysis
• Setup	notes
– Tests	on	Amazon	EMR	using	4x	m4.4xlarge	workers
– 5	cores	and	18gb	memory	 per	executor
– Spark	2.1
– Benchmark	on	GitHub
– Skewed	 dataset,	10000	keys
• Largest	key	10000	rows,	2nd	key	5000	rows,	3rd	key	2500	rows	…	10000th	key	1	row
– Uniform	dataset,	evenly	 spread
28#EUde11
Performance	Analysis	– Skewed	Data
29#EUde11
Performance	Analysis	– Uniform	Data
Considerations
• Hyper	parameters
– Size	of	the	broadcast	variable
– The	number	of	passes
• Additional	complexity
30#EUde11
Future	Work
31#EUde11
• In-memory iterations
• Native to Spark
Recap
32#EUde11
• Skew hurts Spark parallelism and stability
• Default join types have issues with skewed data
• Iterative Broadcast Join can be used to process
skewed data while maintaining parallelism
That’s	All	Folks!
https://github.com/godatadriven/iterative-broadcast-join
#EUde11

More Related Content

What's hot

Alpha beta pruning
Alpha beta pruningAlpha beta pruning
Alpha beta pruning
Nilesh Bandekar
 
44CON London 2015 - Windows 10: 2 Steps Forward, 1 Step Back
44CON London 2015 - Windows 10: 2 Steps Forward, 1 Step Back44CON London 2015 - Windows 10: 2 Steps Forward, 1 Step Back
44CON London 2015 - Windows 10: 2 Steps Forward, 1 Step Back
44CON
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
Flink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
Flink Forward
 
POST’s CORRESPONDENCE PROBLEM
POST’s CORRESPONDENCE PROBLEMPOST’s CORRESPONDENCE PROBLEM
POST’s CORRESPONDENCE PROBLEM
Rajendran
 
Maxim Fateev - Beyond the Watermark- On-Demand Backfilling in Flink
Maxim Fateev - Beyond the Watermark- On-Demand Backfilling in FlinkMaxim Fateev - Beyond the Watermark- On-Demand Backfilling in Flink
Maxim Fateev - Beyond the Watermark- On-Demand Backfilling in Flink
Flink Forward
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
 
3b. LMD & RMD.pdf
3b. LMD & RMD.pdf3b. LMD & RMD.pdf
3b. LMD & RMD.pdf
TANZINTANZINA
 
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...
Flink Forward
 
Theory of Computation Unit 2
Theory of Computation Unit 2Theory of Computation Unit 2
Theory of Computation Unit 2
Jena Catherine Bel D
 
Create Your Own Language
Create Your Own LanguageCreate Your Own Language
Create Your Own Language
Hamidreza Soleimani
 
Time complexity
Time complexityTime complexity
Time complexity
Katang Isip
 
Business objects web intelligence training tasks
Business objects web intelligence training tasksBusiness objects web intelligence training tasks
Business objects web intelligence training tasks
Dmitry Anoshin
 
BlueHat v18 || Record now, decrypt later - future quantum computers are a pre...
BlueHat v18 || Record now, decrypt later - future quantum computers are a pre...BlueHat v18 || Record now, decrypt later - future quantum computers are a pre...
BlueHat v18 || Record now, decrypt later - future quantum computers are a pre...
BlueHat Security Conference
 
Compiler Design Unit 4
Compiler Design Unit 4Compiler Design Unit 4
Compiler Design Unit 4
Jena Catherine Bel D
 
Radix Sort
Radix SortRadix Sort
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
Caserta
 
How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache Spark
Databricks
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
Flink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
Flink Forward
 

What's hot (20)

Alpha beta pruning
Alpha beta pruningAlpha beta pruning
Alpha beta pruning
 
44CON London 2015 - Windows 10: 2 Steps Forward, 1 Step Back
44CON London 2015 - Windows 10: 2 Steps Forward, 1 Step Back44CON London 2015 - Windows 10: 2 Steps Forward, 1 Step Back
44CON London 2015 - Windows 10: 2 Steps Forward, 1 Step Back
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
POST’s CORRESPONDENCE PROBLEM
POST’s CORRESPONDENCE PROBLEMPOST’s CORRESPONDENCE PROBLEM
POST’s CORRESPONDENCE PROBLEM
 
Maxim Fateev - Beyond the Watermark- On-Demand Backfilling in Flink
Maxim Fateev - Beyond the Watermark- On-Demand Backfilling in FlinkMaxim Fateev - Beyond the Watermark- On-Demand Backfilling in Flink
Maxim Fateev - Beyond the Watermark- On-Demand Backfilling in Flink
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
3b. LMD & RMD.pdf
3b. LMD & RMD.pdf3b. LMD & RMD.pdf
3b. LMD & RMD.pdf
 
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...
 
Theory of Computation Unit 2
Theory of Computation Unit 2Theory of Computation Unit 2
Theory of Computation Unit 2
 
Create Your Own Language
Create Your Own LanguageCreate Your Own Language
Create Your Own Language
 
Time complexity
Time complexityTime complexity
Time complexity
 
Business objects web intelligence training tasks
Business objects web intelligence training tasksBusiness objects web intelligence training tasks
Business objects web intelligence training tasks
 
BlueHat v18 || Record now, decrypt later - future quantum computers are a pre...
BlueHat v18 || Record now, decrypt later - future quantum computers are a pre...BlueHat v18 || Record now, decrypt later - future quantum computers are a pre...
BlueHat v18 || Record now, decrypt later - future quantum computers are a pre...
 
Compiler Design Unit 4
Compiler Design Unit 4Compiler Design Unit 4
Compiler Design Unit 4
 
Radix Sort
Radix SortRadix Sort
Radix Sort
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache Spark
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 

Viewers also liked

Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Spark Summit
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
Spark Summit
 
Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud with...
Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud with...Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud with...
Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud with...
Spark Summit
 
Saving Energy with Apache Spark and Toon with Stephen Galsworthy
Saving Energy with Apache Spark and Toon with Stephen GalsworthySaving Energy with Apache Spark and Toon with Stephen Galsworthy
Saving Energy with Apache Spark and Toon with Stephen Galsworthy
Spark Summit
 
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma TangOptimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Databricks
 
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai YuAn Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
Databricks
 

Viewers also liked (6)

Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
 
Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud with...
Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud with...Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud with...
Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud with...
 
Saving Energy with Apache Spark and Toon with Stephen Galsworthy
Saving Energy with Apache Spark and Toon with Stephen GalsworthySaving Energy with Apache Spark and Toon with Stephen Galsworthy
Saving Energy with Apache Spark and Toon with Stephen Galsworthy
 
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma TangOptimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
 
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai YuAn Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 

Recently uploaded (20)

The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 

Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob Keevil