Optimizing Large-Scale Graph Applications Using Apache Spark, with 4-5x Performance Improvements
Agenda
© 2020 PayPal Inc. Confidential and proprietary.
Challenges
Our Lessons Learned
• Improving the scalability of large-scale graph computation
• Optimization & enhancement in the production environment
Learning Summary
Challenges
The main challenges we are facing:
• 2+ billion vertices
• 100+ billion edges
• Degrees: avg 110, max 2+ million
• A large graph with inherent data skew
• Strict SLAs, but various limitations in production:
  limited resources, various production guidelines, and a dedicated pool that still shares common services (e.g., the NameNode)
Our Lessons Learned
Use Case #1: Community detection
• Using connected components to group the communities
• Reference: the paper "Connected Components in MapReduce and Beyond"
SCALABILITY
[Figure: sample undirected graph with nodes 1-6. Finding the connected component yields the pairs (1,2), (1,3), (1,4), (1,5), (1,6): community (1).]
Inherent data skew caused a "bucket effect" and OOM
SCALABILITY
Sample illustration:
Each iteration:
1. Make the edge list directed (emit both (u,v) and (v,u))
2. Group by starting node
3. Find the smallest node in each group
4. Generate new pairs by linking every node to the smallest node in its group
5. Dedup
Connected components are found in the Reducer, and a unique representative vertex is identified within each community.
[Figure: iterations #1 and #2 on the sample graph; the pair list converges toward (2,1), (3,1), (4,1), (5,1), (6,1), with intermediate graphs 1 and 2 shown.]
Inherent data skew caused a "bucket effect" and OOM
Sample illustration, continued:
SCALABILITY
[Figure: continuation of the iteration. Grouping by starting node and linking to the smallest node converges the pair list to (2,1), (3,1), (4,1), (5,1), (6,1); the Reducer finds the connected components and identifies vertex 1 as the unique representative of the community.]
• The size of the connected components grows significantly in each iteration
• This caused a "bucket effect" (slow Reduce tasks)
• Keeping the connected components in memory caused OOM in some Reducers
For example: 50,000,000+ connected nodes in one component
[Figure: iteration #3 finds one connected component; its id is 1 and its members are [1,2,3,4,5,6].]
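The iterative scheme above can be sketched as a single-machine simulation in plain Python (a minimal sketch: the production version runs as Spark map/reduce stages, and the function and variable names here are illustrative):

```python
from collections import defaultdict

def connected_components(edges):
    """Min-label propagation: every node repeatedly adopts the smallest
    id among itself and its neighbors until nothing changes, mirroring
    the iterate-until-converged scheme in the slides."""
    nbrs = defaultdict(set)
    for u, v in edges:              # make it directed: link both ways
        nbrs[u].add(v)
        nbrs[v].add(u)
    label = {u: u for u in nbrs}    # each node starts as its own label
    changed = True
    while changed:                  # one pass ~= one map/reduce iteration
        changed = False
        for u in nbrs:              # "find smallest node in each group"
            m = min([label[u]] + [label[v] for v in nbrs[u]])
            if m < label[u]:
                label[u] = m
                changed = True
    comps = defaultdict(set)        # group members by representative
    for u, lbl in label.items():
        comps[lbl].add(u)
    return dict(comps)
```

On the sample graph with edges (1,6), (2,6), (2,5), (2,4), (2,3) this returns one community with representative 1 and members {1, ..., 6}.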
Our approach to resolving the "bucket effect" & OOM
SCALABILITY
1. Separate huge and normal keys
2. Split huge keys: find the min for each huge key, then divide the key by adding a random number as a prefix (e.g., huge key 1 becomes salted keys 01 and 11, carrying value lists [6,2] and [3,4,5])
3. Spill to disk: group by key; if the value list of a single key exceeds a threshold, spill it to an mmap file, keep the remaining list in memory, and keep the min value of the original key in each group (e.g., (11, ([file1], [5], 1)) and (01, ([file2], [], 1)) after spilling [3,4] into file1 and [2,6] into file2; the min value of the original key is 1). Then read the list of files plus the in-memory list, generate new pairs, and merge & dedup.
Normal keys are processed as introduced: group by starting node, find the smallest node in each group, explode the map to rows, dedup.
[Figure: both paths converge to the pairs (2,1), (3,1), (4,1), (5,1), (6,1).]
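Step 2 (splitting huge keys) can be illustrated with a small sketch (plain Python; the names are illustrative, and in Spark the salt would be added in a map stage before the shuffle):

```python
import random
from collections import defaultdict

def min_per_key_with_salting(pairs, huge_keys, num_splits=2):
    """Find the min value per key while splitting each huge key into
    `num_splits` salted sub-keys, so no single reducer has to hold a
    huge key's whole value list at once."""
    groups = defaultdict(list)
    for k, v in pairs:
        # prefix huge keys with a random salt; normal keys get salt 0
        salt = random.randrange(num_splits) if k in huge_keys else 0
        groups[(salt, k)].append(v)
    # first reduce: min per salted sub-group (spread over many reducers)
    # second reduce: combine the sub-group minima per original key
    mins = defaultdict(list)
    for (_, k), vs in groups.items():
        mins[k].append(min(vs))
    return {k: min(ms) for k, ms in mins.items()}
```

The final result is the same as grouping by the original key; only the intermediate group sizes shrink.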
Our Lesson & Learn of the scalability
v Don’t blame Spark when you see OOM
v Elegant memory usage is the KING
v Inevitable data skew, but scalability can be achieved
v Split huge key
v Spill to disk when necessary
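Step 3 (spilling oversized value lists to disk) can be sketched like this (plain Python, with temp files standing in for the mmap files; the names and the threshold are illustrative):

```python
import os
import tempfile
from collections import defaultdict

def group_with_spill(pairs, threshold=2):
    """Group values by key; whenever a key's in-memory buffer reaches
    `threshold` items, spill the buffer to a temp file and clear it.
    Returns {key: (spill_file_paths, in_memory_remainder)}."""
    state = defaultdict(lambda: ([], []))   # key -> (files, in-memory)
    for k, v in pairs:
        files, mem = state[k]
        mem.append(v)
        if len(mem) >= threshold:           # buffer full: spill to disk
            fd, path = tempfile.mkstemp()
            with os.fdopen(fd, "w") as f:
                f.write("\n".join(map(str, mem)))
            files.append(path)
            mem.clear()
    return dict(state)

def read_back(files, mem):
    """Stream the spilled values back and append the in-memory rest."""
    vals = []
    for path in files:
        with open(path) as f:
            vals += [int(x) for x in f.read().split()]
    return vals + mem
```

Only a bounded buffer per key ever lives in memory; the full list is reconstructed by streaming the spill files back.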
© 2020 PayPal Inc. Confidential and proprietary.
Our Lessons Learned
Use Case #2: Preparing the graph data using Hive
How do we choose the proper join solution in Spark?
PERFORMANCE
Note:
• Spark 2.3.0
• Joins without joining keys are not covered here
Join selection logic:
1. canBroadcastByHints? → Yes: BroadcastJoin
2. canBroadcastBySizes? → Yes: BroadcastJoin
3. preferSortMergeJoin? → Yes: SortMergeJoin
4. canBuildLocalHashMap? → Yes: ShuffleHashJoin; No: SortMergeJoin
--Quiz: Broadcast? LocalHashMap? SortMergeJoin?
select * from A inner join B on A.id = B.id where B.dt = '2020-06-25'
• Both Table A and Table B are extra-large tables
• Table B has a partition on the date (dt) column; the partition size is around 1 MB
• Inner join between the small partition of Table B and the extra-large Table A
Comparison of the join solutions:
• Broadcast: the smaller table is broadcast; no shuffle
• LocalHashMap (ShuffleHashJoin): shuffle needed; builds a hash map for the smaller side in each reducer
• SortMergeJoin: shuffle needed; sorts each partition of both sides before merging
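The decision flow above can be written out as a small sketch (a simplification of Spark 2.3's JoinSelection; the real strategy also checks the join type, build side, and partition counts):

```python
def select_join(broadcast_hint, small_side_bytes, prefer_sort_merge,
                can_build_local_hash_map,
                broadcast_threshold=10 * 1024 * 1024):
    """Simplified Spark 2.3 JoinSelection order for equi-joins."""
    if broadcast_hint:                           # canBroadcastByHints?
        return "BroadcastJoin"
    if small_side_bytes <= broadcast_threshold:  # canBroadcastBySizes?
        return "BroadcastJoin"
    if not prefer_sort_merge and can_build_local_hash_map:
        return "ShuffleHashJoin"                 # hash map built per reducer
    return "SortMergeJoin"
```

The default threshold mirrors spark.sql.autoBroadcastJoinThreshold (10 MB by default).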
Use Case #2: Preparing the graph data using Hive (cont.)
PERFORMANCE
Expectation vs. execution: given the ~1 MB filtered partition of Table B, a broadcast join was expected, but SortMergeJoin was executed.
Our approach: enabling broadcast join, with a 3x performance improvement
PERFORMANCE
Before:
select *
from A inner join B on A.id = B.id
where B.dt = '2020-06-25'
Parser → 'Project (*) over 'Filter (dt='2020-06-25') over 'Join (A.id=B.id) of 'UnresolvedRelation A and 'UnresolvedRelation B
Analyzer → Project (*) / Filter (dt='2020-06-25') / Join (A.id=B.id) of HiveTableRelation A (sizeInBytes=1GB) and HiveTableRelation B (sizeInBytes=1GB)
Optimizer → pushes the Filter (dt='2020-06-25') down onto B; sizeInBytes of B remains 1GB
Spark Strategies (including JoinSelection) → ProjectExec (*) over SortMergeJoinExec (A.id=B.id) of HiveTableScanExec A and FilterExec (dt='2020-06-25') over HiveTableScanExec B
After:
The same query goes through the same Parser and Analyzer. The Optimizer now includes the rule PruneHiveTablePartitions, which prunes the partitions of B and updates its sizeInBytes from 1GB to 1MB.
Spark Strategies (including JoinSelection) → ProjectExec (*) over BroadcastHashJoinExec (A.id=B.id): with sizeInBytes updated, the broadcast join is selected.
See PR #26805, merged in Spark 3.0.
PERFORMANCE
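Why the fix flips the plan can be shown with a toy model (assumption: the planner broadcasts a side whose size estimate fits under spark.sql.autoBroadcastJoinThreshold, 10 MB by default):

```python
BROADCAST_THRESHOLD = 10 * 1024 * 1024  # default autoBroadcastJoinThreshold

def chosen_join(size_in_bytes):
    """Toy model: broadcast iff one side's size estimate fits under
    the threshold; otherwise fall back to SortMergeJoin."""
    if size_in_bytes <= BROADCAST_THRESHOLD:
        return "BroadcastHashJoin"
    return "SortMergeJoin"

before = chosen_join(1 * 1024**3)  # B estimated at 1GB before pruning
after = chosen_join(1 * 1024**2)   # B estimated at 1MB after PruneHiveTablePartitions
```

The join operator changes purely because the size estimate of B changes; the data itself is identical.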
Use Case #3: Persisting the graph data into Hive tables
Step 1. DDL auditing process; Step 2. Manipulate the data in a DataFrame
Mis-matched partition column(s) overloaded the HDFS NameNode in production
PERFORMANCE
DDL query example (reviewed):
create table default.emp (
dept_id int, --1
emp_id int, --2
age int, --3
gender string, --4
address string --5
) partitioned by
(
country string, --6
city string --7
)
DML query example (reviewed):
// create a DataFrame df1 from other logic
df1.registerTempTable("tmpTable")
val df2 = sparkSession.sql(
  "select
     department_id as dept_id, --1
     employee_id as emp_id, --2
     emp_age as age, --3
     emp_gender as gender, --4
     cnty as country, --5
     addr as address, --6
     city_name as city --7
   from tmpTable")
df2.write.insertInto("default.emp")
• The address column was mis-matched to the country partition column (columns are matched by position, not by name)
• country has 200+ distinct values, while address has 10+ million distinct values
• Tons of new partition folders and files were created
• Platform alerts fired due to continuously overloading the NameNode
Before:
Our approach: refine the interface to be explicit
This avoids column and partition-column mismatches in your code.
Step 1. DDL auditing process
Step 2. Manipulate the data in a DataFrame
DML query example (reviewed):
// create a DataFrame df1 from other logic
df1.registerTempTable("tmpTable")
val df2 = sparkSession.sql(
  "select
     department_id as dept_id, --1
     employee_id as emp_id, --2
     emp_age as age, --3
     emp_gender as gender, --4
     cnty as country, --5
     addr as address, --6
     city_name as city --7
   from tmpTable")
df2.write.insertInto("default.emp", true)

def insertInto(tableName: String, byName: Boolean): Unit
If byName is true, Spark will:
1. Match the columns between the DataFrame and the target table by name
2. Throw an exception if a column name in the DataFrame does not exist in the target table
PERFORMANCE
After: the DDL of default.emp is unchanged; only the DML changes, calling insertInto("default.emp", true).
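The difference between positional and by-name insertion can be sketched as follows (plain Python; the helper names are illustrative, not Spark APIs):

```python
def match_by_position(df_cols, table_cols):
    """Default insertInto behavior: the i-th DataFrame column lands in
    the i-th table column, regardless of names."""
    return dict(zip(table_cols, df_cols))

def match_by_name(df_cols, table_cols):
    """byName=true behavior: match columns by name and fail fast if a
    DataFrame column does not exist in the target table."""
    missing = [c for c in df_cols if c not in table_cols]
    if missing:
        raise ValueError(f"columns not found in target table: {missing}")
    return {c: c for c in df_cols}

# Column orders from the example: the DataFrame emits country before
# address, while the table declares address before country.
table_cols = ["dept_id", "emp_id", "age", "gender", "address", "country", "city"]
df_cols = ["dept_id", "emp_id", "age", "gender", "country", "address", "city"]
```

With positional matching, the DataFrame's country values land in the table's address column (the bug above); with by-name matching, each column lands where it belongs.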
Our lessons learned on optimization & enhancement in production:
• Nothing is too tiny to optimize for performance
• A deep understanding of Spark internals is helpful
• Misuse may lead to serious impact on shared services
• An explicit interface helps avoid misuse
• Overall, performance has been improved by 4-5x
Learning Summary
Our learning summary, from our practice on real production cases:
• Use memory elegantly in user code to improve scalability
• Understanding Spark deeply is helpful for optimization
• Achieved a performance improvement from 2 days to around 10 hours
We are open to a new learning journey by connecting with you all.
Q & A
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationBoston Institute of Analytics
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Servicejennyeacort
 

Recently uploaded (20)

04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health Classification
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
 

Spark Graph 4-5x Performance

  • 1. Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x Performance Improvements
  • 2. Agenda
     © 2020 PayPal Inc. Confidential and proprietary.
     - Challenges
     - Our Lessons Learned
       • Improve the scalability of large graph computation
       • Optimization & enhancement in the production environment
     - Learning Summary
  • 4. The main challenges we are facing
     © 2020 PayPal Inc. Confidential and proprietary.
     - A large graph with data skew in nature: 2+ billion vertices, 100+ billion edges; degrees — avg: 110, max: 2+ million
     - Strict SLA but various limitations in production: limited resources; various production guidelines; a dedicated pool but shared common services (e.g., NameNode)
  • 5. Our Lessons Learned
  • 6. Use Case #1: Community detection (SCALABILITY)
     © 2020 PayPal Inc. Confidential and proprietary.
     - Using Connected Components to group the communities
     - Reference: the paper "Connected Components in MapReduce and Beyond"
     Sample undirected graph over nodes 1-6; finding the connected component emits the pairs (1,2) (1,3) (1,4) (1,5) (1,6), i.e., community (1).
  • 7. The data skew in nature caused the "bucket effect" & OOM (SCALABILITY)
     © 2020 PayPal Inc. Confidential and proprietary.
     Sample illustration: each iteration (a) makes the edges directed, (b) groups by starting node, (c) finds the smallest node in each group, (d) generates new pairs by linking every node to the smallest node in its group, and (e) dedups, until a unique representative vertex is identified within the community. Connected components are found in the Reducer.
     - Iteration #1: input edges (1,6) (2,6) (2,5) (2,4) (2,3); grouping gives (1,[6]) (6,[1,2]) (2,[3,4,5,6]) (5,[2]) (4,[2]) (3,[2]); linking each node to its group's smallest node and deduping yields (6,1) (6,2) (2,1) (3,2) (4,2) (5,2) — intermediate graph 1
     - Iteration #2: the same steps on intermediate graph 1, now with groups (1,[6,2]) (2,[1,3,4,5,6]) (3,[2]) (4,[2]) (5,[2]) (6,[1,2]), yield (6,1) (2,1) (3,1) (4,1) (5,1) (3,2) (4,2) (5,2) — intermediate graph 2
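The iteration sketched on this slide can be simulated in a few lines of plain Python. This is a hedged, single-machine sketch only — the production job runs each step as distributed Spark transformations, and the function names here are ours, not from the talk:

```python
# Single-machine sketch of the slide's iteration: group each node with its
# neighbors, link every member of the group to the group's smallest node,
# dedup, and repeat until the pair set stops changing.
from collections import defaultdict

def cc_iteration(pairs):
    """One iteration: group by node (over both edge directions), then emit
    (member, smallest-node-in-group) pairs, deduped via a set."""
    adj = defaultdict(set)
    for a, b in pairs:          # build undirected adjacency ("group by node")
        adj[a].add(b)
        adj[b].add(a)
    new_pairs = set()
    for node, nbrs in adj.items():
        group = nbrs | {node}
        smallest = min(group)   # find the smallest node in each group
        for m in group - {smallest}:
            new_pairs.add((m, smallest))  # link node -> smallest
    return new_pairs

def connected_components(pairs):
    """Iterate to a fixpoint; the second element of each remaining pair is the
    component's representative vertex."""
    pairs = set(pairs)
    while True:
        nxt = cc_iteration(pairs)
        if nxt == pairs:
            return nxt
        pairs = nxt
```

Running it on the slide's sample edges reproduces the illustrated intermediate graphs and converges with node 1 as the single representative.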
  • 8. The data skew in nature caused the "bucket effect" & OOM — cont. (SCALABILITY)
     © 2020 PayPal Inc. Confidential and proprietary.
     - Iteration #3: groups (1,[2,3,4,5,6]) (2,[1,3,4,5]) (3,[1,2]) (4,[1,2]) (5,[1,2]) (6,[1]) yield (6,1) (5,1) (4,1) (3,1) (2,1) — one connected component found, id 1, members [1,2,3,4,5,6]
     - The size of the connected components increases significantly in each iteration
     - This caused the "bucket effect" (slow Reduce tasks)
     - Keeping the connected components in memory caused OOM in some Reducers
     - For example: 50,000,000+ connected nodes
  • 9. Our approach to resolve the "bucket effect" & OOM (SCALABILITY)
     © 2020 PayPal Inc. Confidential and proprietary.
     1. Separate huge and normal keys. Normal keys are processed as introduced: group by starting node, find the smallest node in each group, explode the map to rows, dedup.
     2. Split the huge keys: find the min for each huge key, then divide the key by adding a random number as a prefix — e.g., (1,6) (1,2) become (01,6) (01,2) while (1,3) (1,4) (1,5) become (11,3) (11,4) (11,5).
     3. Spill to disk: group by key; if the list length of a single key exceeds a threshold, spill it to an mmap file (e.g., spill [3,4] into file1 and [2,6] into file2) and keep only the remaining list in memory, along with the min value of the original key in each group; then read the list of files plus the in-memory list to generate the new pairs; finally merge & dedup into (2,1) (3,1) (4,1) (5,1) (6,1).
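The key-splitting step can be illustrated with a small salting sketch. This is a hedged toy model, assuming the common two-phase salting pattern (partial min per salted bucket, then combine per original key); the threshold, bucket count, and function name are illustrative, not from the talk's code:

```python
# Toy model of "split huge keys": keys whose value lists exceed a threshold get
# a random salt prefix so no single reduce group becomes a straggler, then the
# partial mins per bucket are merged back per original key.
import random
from collections import defaultdict

HUGE_THRESHOLD = 2   # tiny on purpose for illustration; real jobs use far larger
SALT_BUCKETS = 4

def salted_min_per_key(pairs):
    """Phase 1: count key sizes and salt the huge keys; each (salt, key) bucket
    computes a partial min. Phase 2: combine the partial mins per original key."""
    counts = defaultdict(int)
    for k, _ in pairs:
        counts[k] += 1
    buckets = defaultdict(list)
    for k, v in pairs:
        if counts[k] > HUGE_THRESHOLD:          # huge key: add a random salt prefix
            buckets[(random.randrange(SALT_BUCKETS), k)].append(v)
        else:                                   # normal key: single bucket
            buckets[(0, k)].append(v)
    partial = defaultdict(list)
    for (_, k), vs in buckets.items():          # phase 2: merge buckets per key
        partial[k].append(min(vs))
    return {k: min(vs) for k, vs in partial.items()}
```

The result is the same min per key as the unsalted computation, but no bucket ever holds the full value list of a huge key.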
  • 10. Our Lessons Learned on scalability
     - Don't blame Spark when you see OOM
     - Elegant memory usage is KING
     - Data skew is inevitable in nature, but scalability can still be achieved:
       • split huge keys
       • spill to disk when necessary
     © 2020 PayPal Inc. Confidential and proprietary.
  • 11. Our Lessons Learned
  • 12. Use Case #2: Prepare the graph data by using Hive (PERFORMANCE)
     © 2020 PayPal Inc. Confidential and proprietary.
     How do we choose the proper join strategy in Spark?
     Note: Spark 2.3.0; joins without joining keys are not covered here.
     Selection flow: canBroadcastByHints? → BroadcastJoin; else canBroadcastBySizes? → BroadcastJoin; else, if not preferSortMergeJoin and canBuildLocalHashMap? → ShuffleHashJoin; else → SortMergeJoin.
     Quiz — Broadcast? LocalHashMap? SortMergeJoin?
       select * from A inner join B on A.id = B.id where B.dt = '2020-06-25'
     - Both Table A and Table B are extra-large tables
     - Table B is partitioned on the date (dt) column; the partition size is around 1 MB
     - The query inner-joins a small partition of Table B with the extra-large Table A
     Comparison among the join strategies:
     - Broadcast: the smaller table is broadcast; no shuffle
     - LocalHashMap (ShuffleHashJoin): shuffle needed; a hash map is built for the smaller side in each reducer
     - SortMergeJoin: shuffle needed; each partition of both sides is sorted before merging
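The selection flow above can be written down as a tiny decision function. This is a simplified model of the slide's flowchart only — the real logic lives in Spark's JoinSelection strategy and covers more cases (per-side hints, build-side choice, equi-join checks); the flag names mirror the slide's questions:

```python
# Simplified model of the Spark 2.3 join-selection flowchart from the slide.
def choose_join(can_broadcast_by_hints, can_broadcast_by_sizes,
                prefer_sort_merge_join, can_build_local_hash_map):
    if can_broadcast_by_hints:          # explicit broadcast hint wins first
        return "BroadcastJoin"
    if can_broadcast_by_sizes:          # size estimate under the threshold
        return "BroadcastJoin"
    if not prefer_sort_merge_join and can_build_local_hash_map:
        return "ShuffleHashJoin"        # "LocalHashMap" on the slide
    return "SortMergeJoin"              # the default fallback
```

For the quiz: the planner only sees the un-pruned size estimate of Table B, so canBroadcastBySizes is false and SortMergeJoin is chosen, which is what the "Before" plan on slide 14 shows.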
  • 13. (Same content as slide 12, adding the reveal:) Expectation vs. execution … — the following slides show that SortMergeJoin, not the expected broadcast join, was actually selected.
  • 14. Our approach to enable broadcast join, with 3x performance improvement (PERFORMANCE)
     © 2020 PayPal Inc. Confidential and proprietary.
     Before:
       select * from A inner join B on A.id = B.id where B.dt = '2020-06-25'
     - Parser: 'Project (*) → 'Filter (dt='2020-06-25') → 'Join (A.id=B.id) → 'UnresolvedRelation A / 'UnresolvedRelation B
     - Analyzer: Project (*) → Filter (dt='2020-06-25') → Join (A.id=B.id) → HiveTableRelation A (sizeInBytes=1GB) / HiveTableRelation B (sizeInBytes=1GB)
     - Optimizer: pushes the Filter below the Join, but both relations keep sizeInBytes=1GB
     - Spark Strategies (including JoinSelection): ProjectExec (*) → SortMergeJoinExec (A.id=B.id) → HiveTableScanExec A / FilterExec (dt='2020-06-25') + HiveTableScanExec B
  • 15. Our approach to enable broadcast join, with 3x performance improvement — cont. (PERFORMANCE)
     © 2020 PayPal Inc. Confidential and proprietary.
     After: the Optimizer now runs the rule PruneHiveTablePartitions, which prunes partitions and updates sizeInBytes, so HiveTableRelation B reports sizeInBytes=1MB and JoinSelection picks BroadcastHashJoinExec instead of SortMergeJoinExec.
     See PR #26805, merged in Spark 3.0.
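Why pruning flips the join choice can be shown with a toy size comparison. This is an illustration only, assuming Spark's documented default broadcast threshold of 10 MB (spark.sql.autoBroadcastJoinThreshold); the real planner compares richer statistics:

```python
# Toy illustration: the broadcast decision compares a relation's estimated
# sizeInBytes against the autoBroadcastJoinThreshold. Pruning partitions first
# shrinks the estimate from the whole table to just the matching partition.
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024   # 10 MB, Spark's documented default

def join_strategy(size_in_bytes):
    if size_in_bytes <= AUTO_BROADCAST_THRESHOLD:
        return "BroadcastHashJoin"
    return "SortMergeJoin"

whole_table = 1 * 1024**3     # ~1 GB estimate before pruning (slide 14)
one_partition = 1 * 1024**2   # ~1 MB estimate after PruneHiveTablePartitions (slide 15)
```

With the un-pruned 1 GB estimate the planner falls back to SortMergeJoin; with the pruned 1 MB estimate it broadcasts.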
  • 16. Use Case #3: Persist the graph data into Hive tables (PERFORMANCE)
     © 2020 PayPal Inc. Confidential and proprietary.
     Mis-partitioning the column(s) overloaded the HDFS NameNode in production.
     Step 1. DDL auditing process — reviewed DDL:
       create table default.emp (
         dept_id int,    --1
         emp_id int,     --2
         age int,        --3
         gender string,  --4
         address string  --5
       ) partitioned by (
         country string, --6
         city string     --7
       )
     Step 2. Manipulate the data in a DataFrame — reviewed DML:
       // a new DataFrame df1 from the other logic
       df1.registerTempTable("tmpTable")
       val df2 = sparkSession.sql(
         "select department_id as dept_id, --1
                 employee_id as emp_id,    --2
                 emp_age as age,           --3
                 emp_gender as gender,     --4
                 cnty as country,          --5
                 addr as address,          --6
                 city_name as city         --7
          from tmpTable")
       df2.write.insertInto("default.emp")
     Before:
     - insertInto matches columns by position, so the address column was mis-matched to the country column
     - country has 200+ distinct values while address has 10+ million distinct values
     - Tons of new folders and files were created
     - Platform alerts were generated because the NameNode was continuously overloaded
  • 17. Our approach: refine the interface explicitly (PERFORMANCE)
     © 2020 PayPal Inc. Confidential and proprietary.
     This avoids the column or partition-column mismatch when compiling your code.
     Step 1. DDL auditing process — the same reviewed DDL as before (create table default.emp … partitioned by (country, city)).
     Step 2. Manipulate the data in a DataFrame — the DML now passes byName = true:
       df2.write.insertInto("default.emp", true)
     New signature:
       def insertInto(tableName: String, byName: Boolean): Unit
     If byName is true, Spark will:
     1. Match the columns between the DataFrame and the target table by name
     2. Throw an exception if a column name in the DataFrame does not exist in the target table
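Conceptually, byName resolution reorders the DataFrame's columns to the table's schema and fails fast on unknown names. A minimal pure-Python stand-in for that behavior (the function name and error message are ours; the real change lives inside Spark's insertInto path):

```python
# Sketch of by-name column resolution: return the DataFrame column positions
# arranged in the target table's column order, rejecting any DataFrame column
# that the target table does not have.
def align_by_name(df_columns, table_columns):
    table_set = set(table_columns)
    unknown = [c for c in df_columns if c not in table_set]
    if unknown:                                   # fail fast instead of silently
        raise ValueError(f"column(s) not found in target table: {unknown}")
    index = {name: i for i, name in enumerate(df_columns)}
    return [index[c] for c in table_columns]      # table order -> df positions
```

On the slide's example, the DataFrame emits country before address, so positional insertion swaps them; by-name alignment puts address and country back in the table's order regardless of the select list's order.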
  • 18. Our Lessons Learned on optimization & enhancement in production
     - Nothing is too tiny to optimize for performance
     - A deep understanding of Spark internals is helpful
     - Misusage may lead to serious impact on shared services
     - Explicit interfaces help avoid misusage
     - Overall, the performance has been improved by 4-5x
     © 2020 PayPal Inc. Confidential and proprietary.
  • 20. Our learning summary — from our practices on real cases in production
     - Use memory elegantly in user code to improve scalability
     - Understanding Spark deeply is helpful for optimization
     - Achieved a performance improvement from 2 days to around 10 hours
     We are open to a new learning journey by connecting with you all.
     © 2020 PayPal Inc. Confidential and proprietary.
  • 21. Q & A