Graph Processing at Scale using
Spark & GraphFrames
Ron Barabash, Yotpo
Yotpo helps 70K+ online e-commerce brands
collect and leverage User Generated Content (UGC)
[Diagram: User Generated Content, Collect & Leverage. Items shown: Reviews, Photos, Q&A, On-site Conversion, Search & Social, Consumer Insights (you are here).]
Reviews are essential for social proof.
■ According to studies, more than 88% of shoppers
incorporate reviews in their purchasing decisions.
But also, reviews are a valuable source of feedback:
“I manually read through as many as 5,000 reviews
each month to extract customer insights, run different
analyses, and sending reports to the relevant internal
stakeholders.”
– Sandra Negrea, Customer Engagement Analyst
Yotpo Insights
Analyze topics
■ Overall sentiment score
■ Breakdown by products
■ Top mentioned opinions
Explore related reviews
■ See what customers
actually say on each topic
The Technology
Natural Language Processing (NLP)
■ Automatically analyzes the grammatical structure of the reviews to identify all topics and opinions mentioned.
Sentiment Analysis
■ Calculates and assigns a sentiment score per opinion.
Semantic Grouping
■ Groups related topics into one to improve data significance and ease of use.
Yotpo Insights
Top success stories
■ A production team searched the reviews for quality issues.
Through the feedback, they discovered a malfunction in one of their products.
■ A company looked at opinions on shipment, broken down by country.
They discovered a problem with a shipping warehouse that serves certain countries.
■ A fashion company found that “husband” came up very positively for certain products.
They changed their promotions for these products based on the “couple experience” and saw great results.
How?
Algorithm Overview
Step 1: extract topics & opinions
■ Each opinion is a substring of the review of
uniform sentiment.
STEP 1
Example review:
“First jeans I bought from your site
Love these jeans! Excellent quality and
very comfortable material. Only gave 4
stars because the shipping took so long...”
Topic → Opinion
■ jeans → Love these jeans!
■ quality → Excellent quality
■ material → very comfortable material
■ shipping → shipping took so long...
Step 2: opinion sentiment analysis
■ Classify opinions as Positive, Negative or Neutral.
Step 3: group topics & opinions
■ Group similar topics & similar opinions
■ Determine representatives for each group
STEP 3
Topic Grouping
■ shipment, shipping, delivery
■ material
■ jeans
Opinion Grouping
■ great, excellent, amazing
■ bad, not good, terrible, horrible
■ stinky, smelly
Topic Grouping
Group words with similar contextual meaning
Step 1: Semantic Grouping
■ Use NLP to group words with similar
semantic meaning
■ Build edges and vertices - Create a graph!
■ Calculate connected components - Graph
Algorithms!
Step 2: Contextual Grouping
■ Merge word clusters into groups with
contextual meaning - Word2Vec
■ Create a graph
■ Find paths - Graph Queries!
■ Avoid transitivity by relying on path length
[Figure: example word clusters built from: shipping cost, delivery, deliver, cost, costly, shipment, shipping, ship]
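The two grouping steps can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the word pairs, the contextual edges, and the helper names `semantic_clusters` and `within_hops` are hypothetical and not taken from the Yotpo pipeline. Step 1 is modelled as connected components over "similar word" edges; step 2 as a bounded-length path search, so that a chain of loosely related clusters is not merged transitively.

```python
from collections import deque


def semantic_clusters(words, similar_pairs):
    """Group words into connected components with a small union-find."""
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the root, compressing the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller word as the root so results are deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), set()).add(w)
    return sorted(sorted(g) for g in groups.values())


def within_hops(adjacency, start, max_hops):
    """Return every node reachable from start in at most max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the allowed path length
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)


# Step 1: hypothetical "semantically similar" pairs.
words = ["shipment", "shipping", "ship", "delivery", "deliver", "cost", "costly"]
similar_pairs = [
    ("shipment", "shipping"),
    ("shipping", "ship"),
    ("delivery", "deliver"),
    ("cost", "costly"),
]
clusters = semantic_clusters(words, similar_pairs)

# Step 2: hypothetical contextual edges between cluster representatives.
context_edges = {
    "shipping": ["delivery"],
    "delivery": ["shipping", "cost"],
    "cost": ["delivery"],
}
group = within_hops(context_edges, "shipping", max_hops=1)
```

With `max_hops=1`, "shipping" is grouped with "delivery" but not with "cost", even though a longer path connects them. That is the sense in which limiting path length avoids transitivity.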
Let's Talk Business
● Graph Processing in Spark
○ GraphX
● Graph Algorithms vs. Graph Queries
● GraphFrames
○ Graph Query Translation
○ GraphFrames API
○ Connected Components
■ GraphX Implementation
■ GraphFrames Implementation
■ Performance
● Takeaways
Graph Processing in Spark: GraphX
What?
● General-purpose graph processing library
● Built into Spark
● Optimized for fast distributed computing
● Library of algorithms: PageRank, Connected Components, etc.
The Bad
● Scala only: no Java or Python APIs, and no graph queries
● Lower-level RDD-based API (vs. DataFrames)
● Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten
memory management
Separated Systems
Graph algorithms vs. Graph Queries
GraphFrames
● Spark evolved from RDDs to DataFrames - GraphFrames enjoys the benefits and optimizations of the DataFrames API
● Provides powerful tools for running queries and standard graph algorithms - using the GraphX native implementation (if needed)
● Unifies the graph algorithm and graph query APIs - available in Scala, Java, and Python
GraphFrames: Unified API
[Diagram: GraphFrames API, Pattern Query Optimizer, Spark SQL]
Graph algorithms
● PageRank
● Connected Components
● BFS
Graph queries
● Wikipedia collaborators
● Counting mutual friends
● Finding path existence and patterns
GraphFrames Under The Hood
[Diagram: Query String → Parsed Pattern → Logical Plan → Optimized Logical Plan → DataFrame Result. Stages shown: relational plan translation, view selection (materialized views), join elimination and reordering. Also shown: graph algorithms.]
graph.find("(root)-[]->(layer1)").filter("root.is_root = true")
graph.find("(root)-[]->(layer1); (layer1)-[]->(layer2)").filter("root.is_root = true")
Relational plan translations
● Edges and vertices are represented as
DataFrames
● The query starts building the result DataFrame
● For each new vertex in the query we
generate a join
○ With the edges table - to get the src and
dst of the edge
○ With the vertices table - to get the
property of the vertex
graph.find("(v0)-[]->(v1); (v1)-[]->(v2)").filter("v2.attr = true")
Example graph: a → b → c
Partial result after matching (v0)-[]->(v1):
v0 v1 v2
a  b
Join with the edges table on src = b:
src dst
a   b
b   c
Result after the join:
v0 v1 v2
a  b  c
Join with the vertices table on id = c:
id attr
a  1
b  2
c  3
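The join-per-vertex idea above can be illustrated with plain Python lists standing in for the vertices and edges DataFrames. This is a hedged sketch of the translation idea, not GraphFrames' actual planner: the function name `find_two_hops` and the output column `v2_attr` are hypothetical, and the data is the a → b → c example from the slide.

```python
# Stand-ins for the vertices and edges DataFrames from the slide.
vertices = [
    {"id": "a", "attr": 1},
    {"id": "b", "attr": 2},
    {"id": "c", "attr": 3},
]
edges = [
    {"src": "a", "dst": "b"},
    {"src": "b", "dst": "c"},
]


def find_two_hops(vertices, edges):
    """Evaluate the pattern (v0)-[]->(v1); (v1)-[]->(v2) as a chain of joins."""
    # (v0)-[]->(v1): every edge is a candidate partial match.
    partial = [{"v0": e["src"], "v1": e["dst"]} for e in edges]

    # (v1)-[]->(v2): join the partial result with the edges table on src = v1.
    joined = [
        dict(row, v2=e["dst"])
        for row in partial
        for e in edges
        if e["src"] == row["v1"]
    ]

    # Join with the vertices table on id = v2 to fetch the vertex property.
    by_id = {v["id"]: v for v in vertices}
    return [dict(row, v2_attr=by_id[row["v2"]]["attr"]) for row in joined]


rows = find_two_hops(vertices, edges)
```

Each new vertex in the pattern adds one join against the edges table and, when a property is needed, one against the vertices table. In this sketch that is why longer patterns translate into longer join chains, which the optimizer stages named on the slide (view selection, join elimination and reordering) are there to reduce.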
The GraphFrames API
class GraphFrame {
def vertices: DataFrame
def edges: DataFrame
def find(pattern: String): DataFrame
def registerView(pattern: String, df: DataFrame): Unit
def degrees(): DataFrame
def pageRank(): GraphFrame
def connectedComponents(): GraphFrame
...
}
Connected Components
Goal:
Assign each vertex a component ID such that vertices receive the same component ID iff they are
connected.
Problem:
What about really large graphs?
In Distributed Systems we really care about
communication and data skew (partitions)
Naive Implementation in GraphX
1. Assign each vertex a unique component ID.
2. Iterate until convergence:
a. For each vertex v, update:
i. Component ID of v ← smallest component ID in the neighborhood of v
Pro: Easy to implement
Con: Slow convergence on large-diameter graphs
*diameter is the greatest distance between any pair of vertices
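A single-machine sketch of the naive approach shows why diameter matters. It is a plain Python illustration, not the GraphX implementation, which runs this as distributed message passing; the function name `naive_components` and the two test graphs are assumptions made for the example.

```python
def naive_components(vertices, edges):
    """Label propagation: each vertex repeatedly adopts the smallest ID nearby.

    Returns the final component IDs and the number of rounds that changed
    at least one label.
    """
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    component = {v: v for v in vertices}  # step 1: unique component IDs
    rounds = 0
    while True:
        # Step 2: every vertex looks at itself and its direct neighbours only.
        updated = {
            v: min([component[v]] + [component[n] for n in neighbours[v]])
            for v in vertices
        }
        if updated == component:
            return component, rounds
        component = updated
        rounds += 1


# A path 0-1-2-...-9 has a large diameter; a star around 0 has a small one.
path_vertices = list(range(10))
path_edges = [(i, i + 1) for i in range(9)]
star_edges = [(0, i) for i in range(1, 10)]

path_labels, path_rounds = naive_components(path_vertices, path_edges)
star_labels, star_rounds = naive_components(path_vertices, star_edges)
```

Both graphs have ten vertices and nine edges, but the smallest ID can only travel one hop per round, so the path needs far more rounds than the star. In a distributed setting each round is a shuffle, which is where the slow convergence comes from.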
Small/Big star algorithm - In GraphFrames
Kiveris et al. "Connected Components in MapReduce and Beyond."
1. Assign each vertex a unique component ID.
2. Iterate until convergence:
a. For each vertex v:
i. Connect smaller neighbors to smallest neighbor - Small Star
b. For each vertex v:
i. Connect bigger neighbors to smallest neighbor (or itself) - Big Star
*Motivation: we mutate the graph into a union of star graphs without breaking connectivity
Small-Star Operations
smallStar(v) - Connect all smaller neighbours and self to the min neighbour.
*Happens in parallel on every single node to build a new graph
[Figure: example graph on vertices 1, 5, 7, 8, 9, before and after the small-star operation]
Big-Star Operations
bigStar(v) - Connect all strictly larger neighbours to the min neighbour including self.
*Happens in parallel on every single node to build a new graph
[Figure: example graph on vertices 1, 5, 7, 8, 9, before and after the big-star operation]
Small/Big star algorithm
[Figure: example graph on vertices 1, 5, 7, 8, 9]
Small/big star operations maintain graph connectivity.
Extra edges are pruned during iterations, which reduces message
passing.
Each connected component converges to a star graph.
Converges in O(log²(#nodes)) iterations.
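The two operations can be sketched on a single machine in plain Python. This is an illustration of the definitions on the slides, not the GraphFrames implementation: the function names and the sample edges are assumptions, the loop simply alternates the two operations until the edge set stops changing, and a real distributed version would express each operation as joins and aggregations.

```python
def adjacency(edges):
    """Build an undirected neighbour map from a set of (larger, smaller) edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def big_star(edges):
    """Connect all strictly larger neighbours to the min neighbour (incl. self)."""
    out = set()
    for u, nbrs in adjacency(edges).items():
        m = min(nbrs | {u})
        for v in nbrs:
            if v > u:
                out.add((v, m))
    return out


def small_star(edges):
    """Connect all smaller neighbours and self to the min neighbour."""
    out = set()
    for u, nbrs in adjacency(edges).items():
        smaller = {v for v in nbrs if v < u}
        if not smaller:
            continue
        m = min(smaller)
        for v in smaller | {u}:
            if v != m:
                out.add((v, m))
    return out


def star_components(vertices, edges):
    """Alternate the two operations until the edge set stops changing."""
    current = {(max(a, b), min(a, b)) for a, b in edges if a != b}
    while True:
        updated = small_star(big_star(current))
        if updated == current:
            break
        current = updated
    # Every component is now a star, so the min neighbour is the component ID.
    adj = adjacency(current)
    return {v: min(adj.get(v, set()) | {v}) for v in vertices}


# Hypothetical example: a path 1-5-7-9-8 plus a second component 20-21-22.
sample_vertices = [1, 5, 7, 8, 9, 20, 21, 22]
sample_edges = [(1, 5), (5, 7), (7, 9), (9, 8), (20, 21), (21, 22)]
labels = star_components(sample_vertices, sample_edges)
```

On the path 1-5-7-9-8 the edge set reaches the star around vertex 1 after two rounds, whereas one-hop label propagation would need four. The smallest vertex ID in each component ends up as its component ID.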
Let’s Talk about Performance
● All datasets are taken from WebGraph Datasets
Twitter
42 million vertices, 1.5 billion edges (small diameter)
running on 16 r3.4xlarge workers on Databricks
● GraphX: 4 minutes
● GraphFrames: 6 minutes
UK Web Graph
105 million vertices, 3.7 billion edges
running on 16 r3.4xlarge workers on Databricks
● GraphX: 25 minutes (slow convergence)
● GraphFrames: 4.5 minutes
Grid
grid 32,000 x 32,000 (large diameter)
1 billion nodes, 4 billion edges
32 r3.8xlarge workers on Databricks
● GraphX: failed
● GraphFrames: 1 hour
How about some numbers?
● # of Reviews: ~31M
● # of Opinions: ~124M
● # of Topics: ~7.5M
● # of Semantic Clusters: ~11M
● # of Machines: 50 r3 xLarge
● Pipeline time: ~2 Hours
Key Takeaways
● Graph Queries + Graph Algorithms = GraphFrames ❤️
● Simple
○ Easy and convenient API in the language of your choosing
○ Lives alongside other Spark components
● Flexible - using different implementations GraphX/GraphFrames
● Watch out for Performance!
○ The GraphFrames implementation of CC is actually worse than GraphX in some
cases
■ No silver bullet - it depends on the actual graph (size, diameter, sparseness)
○ Most distributed graph algorithms use iterative message passing between
nodes - Shuffle hell.
Key Takeaways
● Monitoring - Hard to understand the execution plan
● Checkpointing is Important! - by default happens every 2 iterations
○ Recovers from unexpected node failures
○ Avoids query plan explosion
○ Avoids optimizer slowdown
○ Avoids running out of disk space for shuffle files
Future work
● Performance Optimizations
○ Using different checkpointing parameters
○ Test GraphFrames native Connected Components
● Algorithm Evaluation and AI based Clustering
○ Measure the correctness of current algorithm
○ Research the use of Unsupervised Clustering
● Support additional languages
○ Insights currently supports English.
Thank you!
