Graph Processing at Scale using
Spark & GraphFrames
Ron Barabash, Yotpo
Yotpo helps 70K+ online e-commerce brands
collect and leverage User Generated Content (UGC)
[Diagram: User Generated Content, Collect & Leverage. Labels: Reviews, Photos, Q&A, On-Site Conversion, Search & Social, Consumer Insights]
Reviews are essential for social proof.
■ According to studies, more than 88% of shoppers
incorporate reviews in their purchasing decision
But also, reviews are a valuable source of feedback:
“I manually read through as many as 5,000 reviews
each month to extract customer insights, run different
analyses, and sending reports to the relevant internal
stakeholders.”
– Sandra Negrea, Customer Engagement Analyst
Yotpo Insights
Analyze topics
■ Overall sentiment score
■ Breakdown by products
■ Top mentioned opinions
Explore related reviews
■ See what customers
actually say on each topic
The Technology
■ Natural Language Processing (NLP): automatically analyzes the grammatical structure of the reviews to identify all topics and opinions mentioned.
■ Sentiment Analysis: calculates and assigns a sentiment score per opinion.
■ Semantic Grouping: groups related topics into one to improve data significance and ease of use.
Yotpo Insights
Top success stories
■ A production team searched the reviews for quality issues and discovered, through customer feedback, a malfunction in one of their products.
■ A company looked at opinions on shipment, broken down by country, and discovered a problem with a shipping warehouse that serves certain countries.
■ A fashion company found that “husband” came up very positively for certain products. They changed their promotions for these products based on the “couple experience” and saw great results.
How?
Algorithm Overview
Step 1: extract topics & opinions
■ Each opinion is a substring of the review with uniform sentiment.
STEP 1
Review:
First jeans I bought from your site
Love these jeans! Excellent quality and very comfortable material. Only gave 4 stars because the shipping took so long...
Topic: Opinion
■ jeans: Love these jeans!
■ quality: Excellent quality
■ material: very comfortable material
■ shipping: shipping took so long...
Step 2: opinion sentiment analysis
■ Classify opinions as Positive, Negative or Neutral.
Step 3: group topics & opinions
■ Group similar topics & similar opinions
■ Determine representatives for each group
STEP 3
Topic Grouping
■ shipment, shipping, delivery
■ material
■ jeans
Opinion Grouping
■ great, excellent, amazing
■ bad, not good, terrible, horrible
■ stinky, smelly
Topic Grouping
Group words with similar contextual meaning
Step 1: Semantic Grouping
■ Use NLP to group words with similar semantic meaning
■ Build edges and vertices - Create a graph!
■ Calculate connected components - Graph Algorithms!
Step 2: Contextual Grouping
■ Group word clusters into groups with contextual meaning - Word2Vec
■ Create a graph
■ Find paths - Graph Queries!
■ Avoid transitivity by relying on path length
[Diagram: example word clusters "shipping cost"; "delivery", "deliver"; "cost", "costly"; "shipment", "shipping", "ship"]
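The effect of relying on path length can be sketched in plain Python. This is a minimal illustration, not the actual pipeline: the similarity edges between word clusters are made up for the example, and `build_adjacency` and `within_hops` are hypothetical helper names. Taking the whole connected component merges everything reachable, while bounding the number of hops keeps distant clusters apart.

```python
from collections import deque

# Hypothetical similarity edges between word clusters (illustrative only).
SIMILAR = [
    ("shipping", "delivery"),
    ("delivery", "shipping cost"),
    ("shipping cost", "cost"),
]

def build_adjacency(edges):
    """Undirected adjacency map: node -> set of neighbouring nodes."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def within_hops(adj, start, max_hops):
    """Breadth-first search that stops expanding after max_hops edges."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return set(hops)

adj = build_adjacency(SIMILAR)
# One hop from "shipping": only directly similar clusters are grouped.
close_group = within_hops(adj, "shipping", 1)
# Unbounded hops: the whole connected component, where transitivity
# pulls "cost" into the same group as "shipping".
whole_component = within_hops(adj, "shipping", len(adj))
```

With a hop limit of 1, "shipping" is grouped only with "delivery"; without a limit, "cost" ends up in the same group through a chain of pairwise similarities.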
Let's Talk Business
Graph Processing in Spark
● GraphX
Graph Algorithms vs. Graph Queries
GraphFrames
● Graph Query Translation
● GraphFrames API
● Connected Components
○ GraphX Implementation
○ GraphFrames Implementation
○ Performance
Takeaways
Graph Processing in Spark: GraphX
What?
● General-purpose graph processing library
● Built into Spark
● Optimized for fast distributed computing
● Library of algorithms: PageRank, Connected Components, etc.
The Bad
● Scala only: no Java or Python APIs, and no graph queries
● Lower-level RDD-based API (vs. DataFrames)
● Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten
memory management
Separated Systems
Graph algorithms vs. Graph Queries
GraphFrames
● Spark evolved from RDDs to DataFrames, so GraphFrames enjoys the benefits and optimizations of the DataFrames API
● Provides powerful tools for running queries and standard graph algorithms, using the native GraphX implementation if needed
● Unifies the graph algorithm and graph query APIs; available in Scala, Java and Python
GraphFrames
Unified API
GraphFrames API, built on Spark SQL, with a Pattern Query Optimizer
Graph algorithms:
● PageRank
● Connected Components
● BFS
Graph queries:
● Wikipedia collaborators
● Counting mutual friends
● Finding path existence and patterns
GraphFrames
Under The Hood
[Diagram: Query String → Parsed Pattern → Logical Plan → Optimized LP → DataFrame Result. Other labels: Graph Algorithms, Materialized Views, Relational plan translations, View Selection, Join Elimination and Reordering]
graph.find("(root)-[]->(layer1)").filter("root.is_root = true")
graph.find("(root)-[]->(layer1); (layer1)-[]->(layer2)").filter("root.is_root = true")
Relational plan translations
● Edges and vertices are represented as
DataFrames
● Starts building the result DataFrame
● For each new vertex in the query we
generate a join
○ With the edges table - to get the src and
dst of the edge
○ With the vertices table - to get the
property of the vertex
graph.find("(v0)-[]->(v1); (v1)-[]->(v2)").filter("v2.attr = true")
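Example graph: a → b → c
Edges table:
src dst
a b
b c
Vertices table:
id attr
a 1
b 2
c 3
Join steps:
■ Start with (v0, v1) = (a, b)
■ Join with the edges table on src = b to get (v0, v1, v2) = (a, b, c)
■ Join with the vertices table on id = c to get the attr of v2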
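The join-per-vertex translation above can be imitated with plain Python lists standing in for DataFrames. This is only a sketch of the idea under that assumption, not GraphFrames' planner; `find_two_hop` is a hypothetical name, and the tables are the three-vertex example from the slide.

```python
# Edges and vertices tables from the example graph a -> b -> c.
EDGES = [{"src": "a", "dst": "b"}, {"src": "b", "dst": "c"}]
VERTICES = [{"id": "a", "attr": 1}, {"id": "b", "attr": 2}, {"id": "c", "attr": 3}]

def find_two_hop(edges, vertices):
    """Evaluate the pattern (v0)-[]->(v1); (v1)-[]->(v2) with nested-loop joins."""
    # First edge of the pattern: every edge binds v0 and v1.
    partial = [{"v0": e["src"], "v1": e["dst"]} for e in edges]
    # New vertex v2: join with the edges table on edges.src = v1.
    joined = [
        dict(row, v2=e["dst"])
        for row in partial
        for e in edges
        if e["src"] == row["v1"]
    ]
    # Join with the vertices table on vertices.id = v2 to fetch v2's attr.
    return [
        dict(row, v2_attr=v["attr"])
        for row in joined
        for v in vertices
        if v["id"] == row["v2"]
    ]

matches = find_two_hop(EDGES, VERTICES)
# A filter on the new column plays the role of .filter(...) on the result.
filtered = [row for row in matches if row["v2_attr"] == 3]
```

Each additional vertex in the pattern adds one more pair of joins, which is why join elimination and reordering matter for longer patterns.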
The GraphFrames API
class GraphFrame {
def vertices: DataFrame
def edges: DataFrame
def find(pattern: String): DataFrame
def registerView(pattern: String, df: DataFrame): Unit
def degrees(): DataFrame
def pageRank(): GraphFrame
def connectedComponents(): GraphFrame
...
}
Connected Components
Goal:
Assign each vertex a component ID such that vertices receive the same component ID iff they are
connected.
Problem:
What about really large graphs?
In Distributed Systems we really care about
communication and data skew (partitions)
Naive Implementation in GraphX
1. Assign each vertex a unique component ID.
2. Iterate until convergence:
a. For each vertex v, update:
i. Component ID of v ← smallest component ID in the neighborhood of v
Pro: Easy to implement
Con: Slow convergence on large-diameter graphs
*diameter is the greatest distance between any pair of vertices
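The naive scheme above fits in a few lines of single-machine Python. This is a sketch for illustration only, not GraphX's distributed implementation; `naive_connected_components` is a hypothetical name. It also counts rounds, which shows why the diameter matters: the smallest ID travels one hop per round.

```python
def naive_connected_components(vertices, edges):
    """Synchronous min-label propagation; returns (labels, rounds that changed a label)."""
    nbrs = {v: set() for v in vertices}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    # Step 1: every vertex starts with its own ID as component ID.
    label = {v: v for v in vertices}
    rounds = 0
    while True:
        # Step 2: each vertex takes the smallest ID in its neighbourhood.
        updated = {
            v: min([label[v]] + [label[u] for u in nbrs[v]])
            for v in vertices
        }
        if updated == label:
            return label, rounds
        label = updated
        rounds += 1

# A path 0-1-2-3-4-5 has diameter 5.
path_vertices = list(range(6))
path_edges = [(i, i + 1) for i in range(5)]
path_labels, path_rounds = naive_connected_components(path_vertices, path_edges)
```

On this path, label 0 needs five rounds to reach vertex 5, one round per hop.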
Small/Big star algorithm - In GraphFrames
Kiveris et al. "Connected Components in MapReduce and Beyond."
1. Assign each vertex a unique component ID.
2. Iterate until convergence:
a. For each vertex v:
i. Connect smaller neighbors to smallest neighbor - Small Star
b. For each vertex v:
i. Connect bigger neighbors to smallest neighbor (or itself) - Big Star
*Motivation - each operation rewires the graph without breaking connectivity, until the graph becomes a union of star graphs
Small-Star Operations
smallStar(v) - Connect all smaller neighbours, and v itself, to the minimum neighbour.
*Happens in parallel on every single node to build a new graph
[Figure: example graph on vertices 1, 5, 7, 8, 9, before and after smallStar]
Big-Star Operations
bigStar(v) - Connect all strictly larger neighbours to the minimum of v and its neighbours.
*Happens in parallel on every single node to build a new graph
[Figure: example graph on vertices 1, 5, 7, 8, 9, before and after bigStar]
Small/Big star algorithm
Small/big star operations maintain graph connectivity.
Extra edges are pruned during iterations, which reduces message passing.
Each connected component converges to a star graph.
Converges in O(log²(#nodes)) iterations.
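The small-star and big-star operations described above can be tried out on a single machine. The following is a minimal sketch that follows the slide's definitions and alternates the two operations until the edge set stops changing; it is not the GraphFrames implementation, which expresses the same steps as DataFrame joins and aggregations. All function names and the example edges are hypothetical.

```python
from collections import defaultdict

def norm(u, v):
    """Store an undirected edge as (smaller, larger)."""
    return (u, v) if u < v else (v, u)

def neighbours(edges):
    nbrs = defaultdict(set)
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return nbrs

def small_star(edges):
    """Connect each vertex and its smaller neighbours to its minimum neighbour."""
    out = set()
    for v, ns in neighbours(edges).items():
        smaller = {u for u in ns if u < v}
        if smaller:
            m = min(smaller)
            for u in smaller | {v}:
                if u != m:
                    out.add(norm(u, m))
    return out

def big_star(edges):
    """Connect strictly larger neighbours to the minimum of v and its neighbours."""
    out = set()
    for v, ns in neighbours(edges).items():
        m = min(ns | {v})
        for u in ns:
            if u > v:
                out.add(norm(u, m))
    return out

def connected_components(vertices, edges):
    current = {norm(a, b) for a, b in edges if a != b}
    while True:
        updated = big_star(small_star(current))
        if updated == current:
            break
        current = updated
    # Each component is now a star whose centre is its smallest vertex.
    component = {v: v for v in vertices}
    for v, ns in neighbours(current).items():
        component[v] = min(ns | {v})
    return component

example_vertices = [1, 5, 7, 8, 9, 20, 30, 40]
example_edges = [(1, 5), (5, 7), (7, 9), (8, 9), (20, 30)]
components = connected_components(example_vertices, example_edges)
```

In this example the chain 1-5-7-9-8 collapses into a star centred on vertex 1, the pair 20-30 keeps 20 as its centre, and the isolated vertex 40 stays in its own component.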
Let’s Talk about Performance
● All datasets are taken from WebGraph Datasets
Twitter
42 million vertices, 1.5 billion edges (small diameter)
running on 16 r3.4xlarge workers on Databricks
● GraphX: 4 minutes
● GraphFrames: 6 minutes
UK Web Graph
105 million vertices, 3.7 billion edges
running on 16 r3.4xlarge workers on Databricks
● GraphX: 25 minutes (slow convergence)
● GraphFrames: 4.5 minutes
Grid
grid 32,000 x 32,000 (large diameter)
1 billion nodes, 4 billion edges
32 r3.8xlarge workers on Databricks
● GraphX: failed
● GraphFrames: 1 hour
How about some numbers?
● # of Reviews: ~31M
● # of Opinions: ~124M
● # of Topics: ~7.5M
● # of Semantic Clusters: ~11M
● # of Machines: 50 r3.xlarge
● Pipeline time: ~2 hours
Key Takeaways
● Graph Queries + Graph Algorithms = GraphFrames ❤️
● Simple
○ Easy and convenient API in the language of your choosing
○ Lives alongside other Spark components
● Flexible - choose between the GraphX and GraphFrames implementations
● Watch out for Performance!
○ The GraphFrames implementation of Connected Components is actually worse than GraphX in some cases
■ No silver bullet - it depends on the actual graph (size, diameter, sparseness)
○ Most distributed graph algorithms use iterative message passing between nodes - Shuffle hell.
Key Takeaways
● Monitoring - Hard to understand the execution plan
● Checkpointing is Important! - by default happens every 2 iterations
○ Handle unexpected node failures
○ Query plan explosion
○ Optimizer slowdown
○ Disks running out of shuffle space
Future work
● Performance Optimizations
○ Using different checkpointing parameters
○ Test GraphFrames native Connected Components
● Algorithm Evaluation and AI based Clustering
○ Measure the correctness of current algorithm
○ Research the use of Unsupervised Clustering
● Support additional languages
○ Insights currently supports English.
Thank you!

More Related Content

Similar to Graph processing at scale using spark & graph frames

20181123 dn2018 graph_analytics_k_patenge
20181123 dn2018 graph_analytics_k_patenge20181123 dn2018 graph_analytics_k_patenge
20181123 dn2018 graph_analytics_k_patenge
Karin Patenge
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
datamantra
 
Evolve with laravel
Evolve with laravelEvolve with laravel
Evolve with laravel
Gayan Sanjeewa
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 
Overcome the Reign of Chaos
Overcome the Reign of ChaosOvercome the Reign of Chaos
Overcome the Reign of Chaos
Michael Stockerl
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku
 
Write Generic Code with the Tooling API
Write Generic Code with the Tooling APIWrite Generic Code with the Tooling API
Write Generic Code with the Tooling API
Adam Olshansky
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
Ido Shilon
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
Adam Doyle
 
GraphQL Advanced
GraphQL AdvancedGraphQL Advanced
GraphQL Advanced
LeanIX GmbH
 
JS Fest 2018. Anna Herlihy. How to Write a Compass Plugin
JS Fest 2018. Anna Herlihy. How to Write a Compass PluginJS Fest 2018. Anna Herlihy. How to Write a Compass Plugin
JS Fest 2018. Anna Herlihy. How to Write a Compass Plugin
JSFestUA
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
Ramiro Aduviri Velasco
 
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Alexey Zinoviev
 
Agile_SDLC_Node.js@Paypal_ppt
Agile_SDLC_Node.js@Paypal_pptAgile_SDLC_Node.js@Paypal_ppt
Agile_SDLC_Node.js@Paypal_ppt
Hitesh Kumar
 
Learn Business Analytics with R at edureka!
Learn Business Analytics with R at edureka!Learn Business Analytics with R at edureka!
Learn Business Analytics with R at edureka!
Edureka!
 
Learning Web Development with Ruby on Rails Launch
Learning Web Development with Ruby on Rails LaunchLearning Web Development with Ruby on Rails Launch
Learning Web Development with Ruby on Rails Launch
Thiam Hock Ng
 
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Databricks
 
Lessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at NetflixLessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at Netflix
Justin Basilico
 
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixData Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Stefan Krawczyk
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document useful
ssuser3c3f88
 

Similar to Graph processing at scale using spark & graph frames (20)

20181123 dn2018 graph_analytics_k_patenge
20181123 dn2018 graph_analytics_k_patenge20181123 dn2018 graph_analytics_k_patenge
20181123 dn2018 graph_analytics_k_patenge
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
Evolve with laravel
Evolve with laravelEvolve with laravel
Evolve with laravel
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Overcome the Reign of Chaos
Overcome the Reign of ChaosOvercome the Reign of Chaos
Overcome the Reign of Chaos
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
 
Write Generic Code with the Tooling API
Write Generic Code with the Tooling APIWrite Generic Code with the Tooling API
Write Generic Code with the Tooling API
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
 
GraphQL Advanced
GraphQL AdvancedGraphQL Advanced
GraphQL Advanced
 
JS Fest 2018. Anna Herlihy. How to Write a Compass Plugin
JS Fest 2018. Anna Herlihy. How to Write a Compass PluginJS Fest 2018. Anna Herlihy. How to Write a Compass Plugin
JS Fest 2018. Anna Herlihy. How to Write a Compass Plugin
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
 
Agile_SDLC_Node.js@Paypal_ppt
Agile_SDLC_Node.js@Paypal_pptAgile_SDLC_Node.js@Paypal_ppt
Agile_SDLC_Node.js@Paypal_ppt
 
Learn Business Analytics with R at edureka!
Learn Business Analytics with R at edureka!Learn Business Analytics with R at edureka!
Learn Business Analytics with R at edureka!
 
Learning Web Development with Ruby on Rails Launch
Learning Web Development with Ruby on Rails LaunchLearning Web Development with Ruby on Rails Launch
Learning Web Development with Ruby on Rails Launch
 
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
 
Lessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at NetflixLessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at Netflix
 
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixData Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document useful
 

Recently uploaded

Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 

Recently uploaded (20)

Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 

Graph processing at scale using spark & graph frames

  • 1.
  • 2. Graph Processing at Scale using Spark & GraphFrames Ron Barabash, Yotpo
  • 3. helps 70K+ online e-commerce brands collect and leverage User Generated Content (UGC)
  • 5. Reviews are essential for social proof. ■ According to studies, more than 88% of shoppers incorporate reviews in their purchasing decision But also, reviews are a valuable source of feedback: “I manually read through as many as 5,000 reviews each month to extract customer insights, run different analyses, and sending reports to the relevant internal stakeholders.” – Sandra Negrea, Customer Engagement Analyst
  • 7. Analyze topics ■ Overall sentiment score ■ Breakdown by products ■ Top mentioned opinions Explore related reviews ■ See what customers actually say on each topic
  • 8. The Technology Automatically analyzes the grammatical structure of the reviews to identify all topics and opinions mentioned. Natural Language Processing (NLP) Calculates and assigns a sentiment score per opinion. Sentiment Analysis Groups related topics into one to improve data significance and ease of use. Semantic Grouping
  • 9. Yotpo Insights Top success stories Production team searched for quality issues in the reviews. Discovered through feedback a malfunction in one of their products A company noticed opinions on shipment, broken down by country. Discovered a problem with a shipping warehouse that serves certain countries. A fashion company found that husband came up very positively for certain products. Changed their promotions for these products based on the “couple experience”, saw great results.
  • 10. How?
  • 11. Algorithm Overview Step 1: extract topics & opinions ■ Each opinion is a substring of the review of uniform sentiment. quality shipping material jeans Topic shipping took so long... very comfortable material Excellent quality Love these jeans! Opinion STEP 1 First jeans I bought from your site Love these jeans! Excellent quality and very comfortable material. Only gave 4 stars because the shipping took so long...
  • 12. Algorithm Overview Step 1: extract topics & opinions ■ Each opinion is a substring of the review of uniform sentiment. Step 2: opinion sentiment analysis ■ Classify opinions as Positive, Negative or Neutral. STEP 2 quality shipping material jeans Topic shipping took so long... very comfortable material Excellent quality Love these jeans! Opinion First jeans I bought from your site Love these jeans! Excellent quality and very comfortable material. Only gave 4 stars because the shipping took so long...
  • 13. Algorithm Overview Step 1: extract topics & opinions ■ Each opinion is a substring of the review of uniform sentiment. Step 2: opinion sentiment analysis ■ Classify opinions as Positive, Negative or Neutral. Step 3: group topics & opinions ■ Group similar topics & similar opinions ■ Determine representatives for each group STEP 3 shipment shipping delivery material jeans Topic Grouping Opinion Grouping great excellent amazing bad not good terrible horrible stinky smelly
  • 14. Algorithm Overview Step 1: extract topics & opinions ■ Each opinion is a substring of the review of uniform sentiment. Step 2: opinion sentiment analysis ■ Classify opinions as Positive, Negative or Neutral. Step 3: group topics & opinions ■ Group similar topics & similar opinions ■ Determine representatives for each group STEP 3 shipment shipping delivery material jeans Topic Grouping Opinion Grouping great excellent amazing bad not good terrible horrible stinky smelly YOU ARE HERE
  • 15. Topic Grouping Group words with similar contextual meaning Step 1: Semantic Grouping ■ Use NLP to group words with similar semantic meaning ■ Build edges and vertices - Create a graph! ■ Calc. connected components - Graph Algorithms! shipping cost delivery deliver Step 2: Contextual Grouping ■ Group word clusters to groups with contextuel meaning - Word2Vec ■ Create a graph ■ Finding paths - Graph Queries! ■ Avoid transitivity by relying on path length cost costly shipment shipping ship shipping cost delivery deliver cost costly shipment shipping ship
  • 16. Let's Talk Business ● Graph Processing in Spark ○ GraphX ○ GraphFrames ● Graph Algorithms vs. Graph Queries ● Graph Query Translation ● GraphFrames API ● Connected Components ○ GraphX Implementation ○ GraphFrames Implementation ○ Performance ● Takeaways
  • 18. GraphX: What? ● General-purpose graph processing library ● Built into Spark ● Optimized for fast distributed computing ● Library of algorithms: PageRank, Connected Components, etc. The Bad ● Scala only: no Java or Python APIs ● No graph queries ● Lower-level RDD-based API (vs. DataFrames) ● Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten memory management
  • 20. GraphFrames ● Spark evolved from RDDs to DataFrames - GraphFrames enjoys the benefits and optimizations of the DataFrames API ● Provides powerful tools for running queries and standard graph algorithms - using the native GraphX implementation if needed ● Unifies the graph algorithm and graph query APIs ● Available in Scala, Java and Python
  • 21. GraphFrames Unified API. Graph algorithms: ● PageRank ● Connected Components ● BFS. Graph queries: ● Wikipedia collaborators ● Counting mutual friends ● Finding path existence and patterns. Layers: GraphFrames API, Pattern Query Optimizer, Spark SQL
  • 22. GraphFrames Under The Hood. Pipeline: Query String → Parsed Pattern → Logical Plan → Optimized Logical Plan → DataFrame Result. Components: Graph Algorithms, Materialized Views, Relational plan translations, View Selection, Join Elimination and Reordering. Example queries: graph.find("(root)-[]->(layer1)").filter("root.is_root = true") and graph.find("(root)-[]->(layer1); (layer1)-[]->(layer2)").filter("root.is_root = true")
  • 23. Relational plan translations ● Edges and vertices are represented as DataFrames ● The result DataFrame is built up step by step ● For each new vertex in the query we generate a join ○ With the edges table - to get the src and dst of the edge ○ With the vertices table - to get the property of the vertex. Query: graph.find("(v0)-[]->(v1); (v1)-[]->(v2)").filter("v2.attr = true"). Example: edges (src, dst): (a, b), (b, c); vertices (id, attr): (a, 1), (b, 2), (c, 3). The partial result (v0, v1) = (a, b) is joined with the edges table on src = b to give (v0, v1, v2) = (a, b, c), then with the vertices table on id = c to read the attr of v2
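The join-per-vertex translation can be mimicked in plain Python. This is only a sketch of the idea using lists and dicts in place of DataFrames; the function name `two_hop_matches` and the predicate argument are hypothetical, and the real GraphFrames planner is considerably more involved.

```python
# Toy tables matching the slide: edges (src, dst) and vertices (id -> attr).
EDGES = [("a", "b"), ("b", "c")]
VERTICES = {"a": 1, "b": 2, "c": 3}


def two_hop_matches(edges, vertices, v2_predicate):
    """Evaluate (v0)-[]->(v1); (v1)-[]->(v2) with a filter on v2's attribute."""
    # The first pattern edge seeds the result with bindings for v0 and v1.
    rows = [{"v0": src, "v1": dst} for src, dst in edges]
    # New vertex v2: join with the edges table on src = v1.
    rows = [
        {**row, "v2": dst}
        for row in rows
        for src, dst in edges
        if src == row["v1"]
    ]
    # Join with the vertices table on id = v2 to fetch v2's attribute.
    rows = [
        {**row, "v2_attr": vertices[row["v2"]]}
        for row in rows
        if row["v2"] in vertices
    ]
    # Apply the filter on the joined attribute.
    return [row for row in rows if v2_predicate(row["v2_attr"])]


MATCHES = two_hop_matches(EDGES, VERTICES, lambda attr: attr == 3)
```

On the toy tables the only two-hop path is a → b → c, so a single row binding v0, v1 and v2 survives both joins.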
  • 24. The GraphFrames API class GraphFrame { def vertices: DataFrame def edges: DataFrame def find(pattern: String): DataFrame def registerView(pattern: String, df: DataFrame): Unit def degrees(): DataFrame def pageRank(): GraphFrame def connectedComponents(): GraphFrame ... }
  • 25. Connected Components Goal: Assign each vertex a component ID such that vertices receive the same component ID iff they are connected. Problem: What about really large graphs? In Distributed Systems we really care about communication and data skew (partitions)
  • 26. Naive Implementation in GraphX 1. Assign each vertex a unique component ID. 2. Iterate until convergence: a. For each vertex v, update: i. Component ID of v ← smallest component ID in the neighborhood of v. Pro: Easy to implement. Con: Slow convergence on large-diameter graphs. *The diameter is the greatest distance between any pair of vertices
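The naive scheme above can be sketched in a few lines of single-machine Python. This is an illustration of the idea, not the GraphX code; the function name and the synchronous update loop are assumptions. Running it on a path graph shows why the number of rounds grows with the diameter.

```python
def naive_connected_components(vertices, edges):
    """Synchronous min-label propagation; returns (labels, rounds)."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    labels = {v: v for v in vertices}  # step 1: unique component ID per vertex
    rounds = 0
    while True:
        rounds += 1
        # Step 2: each vertex takes the smallest ID in its neighbourhood.
        updated = {
            v: min([labels[v]] + [labels[n] for n in neighbours[v]])
            for v in vertices
        }
        if updated == labels:
            return labels, rounds  # converged: nothing changed this round
        labels = updated


# A path 0-1-2-3-4-5 has diameter 5, so the label 0 needs 5 rounds to
# reach vertex 5, plus one more round to detect that nothing changed.
PATH_VERTICES = list(range(6))
PATH_EDGES = [(i, i + 1) for i in range(5)]
LABELS, ROUNDS = naive_connected_components(PATH_VERTICES, PATH_EDGES)
```

The smallest ID travels only one hop per round, which is the slow convergence the slide warns about for large-diameter graphs.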
  • 27. Small/Big star algorithm - In GraphFrames (Kiveris et al., "Connected Components in MapReduce and Beyond") 1. Assign each vertex a unique component ID. 2. Iterate until convergence: a. For each vertex v: i. Connect smaller neighbors to smallest neighbor - Small Star b. For each vertex v: i. Connect bigger neighbors to smallest neighbor (or itself) - Big Star *Motivation - transform the graph, without breaking connectivity, into a union of star graphs
  • 28. Small-Star Operations smallStar(v) - Connect all smaller neighbours and v itself to the minimum neighbour. *Happens in parallel on every single node to build a new graph. (Diagram: example graph on nodes 1, 5, 7, 8, 9, before and after the operation.)
  • 29. Big-Star Operations bigStar(v) - Connect all strictly larger neighbours to the minimum of v's neighbours and v itself. *Happens in parallel on every single node to build a new graph. (Diagram: example graph on nodes 1, 5, 7, 8, 9, before and after the operation.)
  • 30. Small/Big star algorithm ● Small/big star operations maintain graph connectivity. ● Extra edges are pruned during iterations, which reduces message passing. ● Each connected component converges to a star graph. ● Converges in O(log²(#nodes)) iterations. (Diagram: example graph on nodes 1, 5, 7, 8, 9.)
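A single-machine Python sketch of the small-star and big-star operations may make the rewiring concrete. It follows the slide definitions rather than the GraphFrames source; the function names and the example edges (the slides show the vertices 1, 5, 7, 8, 9 but not their edges) are assumptions.

```python
def _neighbours(edges):
    """Undirected adjacency map built from a set of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def small_star(edges):
    """Connect each vertex and its smaller neighbours to their minimum."""
    out = set()
    for v, nbrs in _neighbours(edges).items():
        group = {n for n in nbrs if n < v} | {v}
        m = min(group)
        out.update((n, m) for n in group if n != m)
    return out


def big_star(edges):
    """Connect strictly larger neighbours to the min of neighbours and self."""
    out = set()
    for v, nbrs in _neighbours(edges).items():
        m = min(nbrs | {v})
        out.update((n, m) for n in nbrs if n > v)
    return out


def star_connected_components(edges):
    """Alternate small-star and big-star until the edge set stops changing."""
    current = {(max(u, v), min(u, v)) for u, v in edges if u != v}
    rounds = 0
    while True:
        rounds += 1
        updated = big_star(small_star(current))
        if updated == current:
            break
        current = updated
    # Every component is now a star whose centre is its smallest vertex.
    labels = {}
    for leaf, centre in current:
        labels[leaf] = centre
        labels[centre] = centre
    return labels, rounds


# Assumed edges over the slide's example vertices, plus a second component.
EXAMPLE_EDGES = [(1, 5), (5, 7), (7, 9), (9, 8), (2, 3), (3, 4)]
LABELS, ROUNDS = star_connected_components(EXAMPLE_EDGES)
```

Each round rewrites the edge set instead of only passing labels along existing edges, so far-apart vertices get linked to small IDs quickly. In the distributed setting each per-vertex step corresponds to a group-by on the edge table, which is where the shuffle cost comes from.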
  • 31. Let’s Talk about Performance ● All datasets are taken from WebGraph Datasets. Twitter: 42 million vertices, 1.5 billion edges (small diameter), running on 16 r3.4xlarge workers on Databricks ● GraphX: 4 minutes ● GraphFrames: 6 minutes. UK Web Graph: 105 million vertices, 3.7 billion edges, running on 16 r3.4xlarge workers on Databricks ● GraphX: 25 minutes (slow convergence) ● GraphFrames: 4.5 minutes. Grid: 32,000 x 32,000 grid (large diameter), 1 billion nodes, 4 billion edges, 32 r3.8xlarge workers on Databricks ● GraphX: failed ● GraphFrames: 1 hour
  • 32. How about some numbers? ● # of Reviews: ~31M ● # of Opinions: ~124M ● # of Topics: ~7.5M ● # of Semantic Clusters: ~11M ● # of Machines: 50 r3.xlarge ● Pipeline time: ~2 hours
  • 33. Key Takeaways ● Graph Queries + Graph Algorithms = GraphFrames ❤️ ● Simple ○ Easy and convenient API in the language of your choosing ○ Lives alongside other Spark components ● Flexible - using different implementations GraphX/GraphFrames ● Watch out for Performance! ○ The GraphFrames implementation of CC is actually worse than GraphX in some cases ■ No silver bullet - it depends on the actual graph (size, diameter, sparseness) ○ Most distributed graph algorithms use iterative message passing between nodes - Shuffle hell.
  • 34. Key Takeaways ● Monitoring - it is hard to understand the execution plan ● Checkpointing is important! By default it happens every 2 iterations, and it helps with: ○ Unexpected node failures ○ Query plan explosion ○ Optimizer slowdown ○ Running out of disk space for shuffle
  • 35. Future work ● Performance Optimizations ○ Using different checkpointing parameters ○ Test GraphFrames native Connected Components ● Algorithm Evaluation and AI based Clustering ○ Measure the correctness of current algorithm ○ Research the use of Unsupervised Clustering ● Support additional languages ○ Insights currently supports English.