Graph Processing at Scale Using Spark & GraphFrames
Ron Barabash, Yotpo

Yotpo helps 70K+ online e-commerce brands
collect and leverage User Generated Content (UGC)

User Generated Content: Collect & Leverage
■ Collect: reviews, photos, Q&A
■ Leverage: on-site conversion, search & social, consumer insights

Reviews are essential for social proof.
■ According to studies, more than 88% of shoppers
incorporate reviews in their purchasing decision
But also, reviews are a valuable source of feedback:
“I manually read through as many as 5,000 reviews
each month to extract customer insights, run different
analyses, and sending reports to the relevant internal
stakeholders.”
– Sandra Negrea, Customer Engagement Analyst

Analyze topics
■ Overall sentiment score
■ Breakdown by products
■ Top mentioned opinions
Explore related reviews
■ See what customers
actually say on each topic

The Technology
Natural Language Processing (NLP)
■ Automatically analyzes the grammatical structure of the reviews to identify all topics and opinions mentioned.
Sentiment Analysis
■ Calculates and assigns a sentiment score per opinion.
Semantic Grouping
■ Groups related topics into one to improve data significance and ease of use.

Yotpo Insights
Top success stories
■ A production team searched the reviews for quality issues and discovered, through feedback, a malfunction in one of their products.
■ A company looked at opinions on shipment, broken down by country, and discovered a problem with a shipping warehouse that serves certain countries.
■ A fashion company found that "husband" came up very positively for certain products. They changed their promotions for these products based on the "couple experience" and saw great results.

Algorithm Overview
Step 1: extract topics & opinions
■ Each opinion is a substring of the review that carries a uniform sentiment.
Example review:
First jeans I bought from your site
Love these jeans! Excellent quality and
very comfortable material. Only gave 4
stars because the shipping took so long...
Extracted topics and opinions (Topic: Opinion):
jeans: Love these jeans!
quality: Excellent quality
material: very comfortable material
shipping: shipping took so long...
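The output of Step 1 can be pictured as a small record per opinion. The sketch below is only an illustration of the "opinion is a substring of the review" property; the Opinion class and the locate helper are hypothetical names, and the actual extraction is done by the NLP stage, which is not shown here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Opinion:
    topic: str  # e.g. "shipping"
    text: str   # opinion text, expected to be a substring of the review


def locate(review: str, opinion: Opinion) -> Tuple[int, int]:
    """Return the (start, end) character span of the opinion inside the review."""
    start = review.find(opinion.text)
    if start < 0:
        raise ValueError(f"{opinion.text!r} is not a substring of the review")
    return start, start + len(opinion.text)


# The example review and the four topic/opinion pairs from the slide.
REVIEW = (
    "First jeans I bought from your site "
    "Love these jeans! Excellent quality and "
    "very comfortable material. Only gave 4 "
    "stars because the shipping took so long..."
)
OPINIONS = [
    Opinion("jeans", "Love these jeans!"),
    Opinion("quality", "Excellent quality"),
    Opinion("material", "very comfortable material"),
    Opinion("shipping", "shipping took so long..."),
]

# Map each topic to the span of its opinion in the review.
SPANS = {o.topic: locate(REVIEW, o) for o in OPINIONS}
```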

Algorithm Overview
Step 2: opinion sentiment analysis
■ Classify opinions as Positive, Negative or Neutral.

Algorithm Overview
Step 3: group topics & opinions
■ Group similar topics & similar opinions
■ Determine representatives for each group
Topic grouping (example groups):
■ shipment, shipping, delivery
■ material
■ jeans
Opinion grouping (example groups):
■ great, excellent, amazing
■ bad, not good, terrible, horrible
■ stinky, smelly
Topic Grouping
Group words with similar contextual meaning
Step 1: Semantic Grouping
■ Use NLP to group words with similar semantic meaning
■ Build edges and vertices - create a graph!
■ Calculate connected components - graph algorithms!
Example clusters: (shipping cost), (delivery, deliver), (cost, costly), (shipment, shipping, ship)
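A minimal single-machine sketch of the semantic grouping step: word pairs judged similar become edges, and each connected component becomes a cluster. The pair list is a hypothetical stand-in for the NLP output, and the union-find structure replaces the distributed connected components run in Spark.

```python
from collections import defaultdict


def cluster_words(similar_pairs):
    """Group words into clusters: each connected component of the pair graph."""
    parent = {}

    def find(word):
        # Find the representative of the word's set, compressing the path.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the alphabetically smaller word as the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = defaultdict(set)
    for word in list(parent):
        groups[find(word)].add(word)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical edges: pairs of words with similar semantic meaning.
PAIRS = [
    ("shipment", "shipping"),
    ("shipping", "ship"),
    ("delivery", "deliver"),
    ("cost", "costly"),
]
CLUSTERS = cluster_words(PAIRS)
print(CLUSTERS)
```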
Step 2: Contextual Grouping
■ Group word clusters into groups with contextual meaning - Word2Vec
■ Create a graph
■ Find paths - graph queries!
■ Avoid transitivity by relying on path length
Example: the clusters (shipping cost), (delivery, deliver), (cost, costly) and (shipment, shipping, ship) are the vertices of the contextual graph
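One way to read "avoid transitivity by relying on path length" is a hop-limited search: two clusters are grouped only if a short path connects them. The sketch below assumes that reading; the adjacency map is hypothetical (in the pipeline the edges would come from Word2Vec similarity), and the hop limit is an illustrative parameter.

```python
from collections import deque


def within_hops(adjacency, start, max_hops):
    """Return every vertex reachable from start in at most max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the allowed path length
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)


# Hypothetical contextual-similarity edges between word clusters.
ADJACENCY = {
    "shipping": ["delivery", "shipping cost"],
    "delivery": ["shipping"],
    "shipping cost": ["shipping", "cost"],
    "cost": ["shipping cost"],
}

# With a limit of one hop, "cost" is not pulled into the "shipping" group,
# even though it is connected to it through "shipping cost".
ONE_HOP = within_hops(ADJACENCY, "shipping", 1)
TWO_HOPS = within_hops(ADJACENCY, "shipping", 2)
```

Plain connected components would merge all four clusters here; bounding the path length keeps "shipping" and "cost" apart.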

Graph Processing in Spark
● GraphX
Graph Algorithms VS. Graph Queries
GraphFrames
● Graph Query Translation
● GraphFrames API
● Connected Components
○ GraphX Implementation
○ GraphFrames Implementation
○ Performance
Takeaways
Let's Talk Business

Graph Processing in Spark

What is GraphX?
● General-purpose graph processing library
● Built into Spark
● Optimized for fast distributed computing
● Library of algorithms: PageRank, Connected Components, etc.
The Bad
● Scala only: no Java or Python APIs, and no graph queries
● Lower-level RDD-based API (vs. DataFrames)
● Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten
memory management

Separated Systems
Graph algorithms vs. Graph Queries

GraphFrames
● Spark evolved from RDDs to DataFrames, so GraphFrames enjoys the benefits and optimizations of the DataFrames API
● Provides powerful tools for running queries and standard graph algorithms, using the native GraphX implementation if needed
● Unifies the graph algorithm and graph query APIs - available in Scala, Java and Python

GraphFrames
Unified API
Graph algorithms:
● PageRank
● Connected Components
● BFS
Graph queries:
● Wikipedia collaborators
● Counting mutual friends
● Finding path existence and patterns
[Diagram: GraphFrames API, Pattern Query Optimizer, Spark SQL]

GraphFrames
Under The Hood
[Diagram: query string -> parsed pattern -> logical plan -> optimized logical plan -> DataFrame result. Optimization steps: relational plan translations, view selection, join elimination and reordering. Also shown: graph algorithms, materialized views.]
graph.find("(root)-[]->(layer1)").filter("root.is_root = true")
graph.find("(root)-[]->(layer1); (layer1)-[]->(layer2)").filter("root.is_root = true")

Relational plan translations
● Edges and vertices are represented as
DataFrames
● Starts building the result DataFrame
● For each new vertex in the query we
generate a join
○ With the edges table - to get the src and
dst of the edge
○ With the vertices table - to get the
property of the vertex
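graph.find("(v0)-[]->(v1); (v1)-[]->(v2)").filter("v2.attr = true")
Example graph: a -> b -> c
Edges table (src, dst): (a, b), (b, c)
Vertices table (id, attr): (a, 1), (b, 2), (c, 3)
Partial result after matching (v0)-[]->(v1): v0 = a, v1 = b
Join with the edges table on src = b, then with the vertices table on id = c: v0 = a, v1 = b, v2 = c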
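The translation above can be sketched in plain Python, with lists of dicts standing in for DataFrames. This is only an illustration of "one join per new vertex", not the GraphFrames planner; the function name two_hop and the predicate on attr are hypothetical.

```python
from collections import defaultdict


def two_hop(edges, vertices, predicate):
    """Evaluate (v0)-[]->(v1); (v1)-[]->(v2) as a sequence of joins."""
    # Matching (v0)-[]->(v1): every edge row seeds a partial result row.
    partial = [{"v0": e["src"], "v1": e["dst"]} for e in edges]

    # New vertex v2: join with the edges table on src = v1.
    dst_by_src = defaultdict(list)
    for e in edges:
        dst_by_src[e["src"]].append(e["dst"])
    joined = [
        dict(row, v2=dst)
        for row in partial
        for dst in dst_by_src[row["v1"]]
    ]

    # Join with the vertices table on id = v2 to read its attribute, then filter.
    attr_by_id = {v["id"]: v["attr"] for v in vertices}
    return [row for row in joined if predicate(attr_by_id[row["v2"]])]


# The tables from the slide.
EDGES = [{"src": "a", "dst": "b"}, {"src": "b", "dst": "c"}]
VERTICES = [
    {"id": "a", "attr": 1},
    {"id": "b", "attr": 2},
    {"id": "c", "attr": 3},
]

# Hypothetical filter on the attribute of v2.
RESULT = two_hop(EDGES, VERTICES, lambda attr: attr == 3)
print(RESULT)
```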

The GraphFrames API
class GraphFrame {
  def vertices: DataFrame
  def edges: DataFrame
  def find(pattern: String): DataFrame
  def registerView(pattern: String, df: DataFrame): Unit
  def degrees(): DataFrame
  def pageRank(): GraphFrame
  def connectedComponents(): GraphFrame
  ...
}

Connected Components
Goal:
Assign each vertex a component ID such that vertices receive the same component ID iff they are
connected.
Problem:
What about really large graphs?
In Distributed Systems we really care about
communication and data skew (partitions)

Naive Implementation in GraphX
1. Assign each vertex a unique component ID.
2. Iterate until convergence:
a. For each vertex v, update:
i. Component ID of v ← smallest component ID in the neighborhood of v
Pro: Easy to implement
Con: Slow convergence on large-diameter graphs
*diameter is the greatest distance between any pair of vertices
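The naive algorithm can be sketched on a single machine as follows. This is plain Python for illustration, not the GraphX code; the round counter is added only to show how the number of iterations grows with the diameter.

```python
def naive_connected_components(vertices, edges):
    """Propagate the smallest component ID between neighbours until stable."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    # 1. Assign each vertex a unique component ID (its own ID).
    component = {v: v for v in vertices}
    rounds = 0
    # 2. Iterate until convergence.
    while True:
        updated = {
            v: min([component[v]] + [component[u] for u in adjacency[v]])
            for v in vertices
        }
        rounds += 1
        if updated == component:
            return component, rounds
        component = updated


# A path graph 0 - 1 - ... - 7 has the largest possible diameter for 8 vertices,
# so the smallest ID needs many rounds to travel from one end to the other.
PATH_VERTICES = list(range(8))
PATH_EDGES = [(i, i + 1) for i in range(7)]
PATH_COMPONENTS, PATH_ROUNDS = naive_connected_components(PATH_VERTICES, PATH_EDGES)
print(PATH_ROUNDS)
```

Each round moves the minimum ID by only one hop, which is why convergence is slow on large-diameter graphs.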

Small/Big star algorithm - In GraphFrames
Kiveris et al. "Connected Components in MapReduce and Beyond."
1. Assign each vertex a unique component ID.
2. Iterate until convergence:
a. For each vertex v:
i. Connect smaller neighbors to smallest neighbor - Small Star
b. For each vertex v:
i. Connect bigger neighbors to smallest neighbor (or itself) - Big Star
*Motivation - we mutate the graph, without breaking connectivity, into a union of star graphs

Small-Star Operations
smallStar(v) - Connect all smaller neighbours and v itself to the min neighbour.
*Happens in parallel on every single node to build a new graph
[Figure: example graph on vertices 1, 5, 7, 8, 9, before and after the small-star operation]

Big-Star Operations
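bigStar(v) - Connect all strictly larger neighbours to the min neighbour, where the minimum is taken over the neighbours and v itself.
*Happens in parallel on every single node to build a new graph
[Figure: example graph on vertices 1, 5, 7, 8, 9, before and after the big-star operation]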

Small/Big star algorithm
● Small/big star operations maintain graph connectivity.
● Extra edges are pruned during iterations, which reduces message passing.
● Each connected component converges to a star graph.
● Converges in O(log²(#nodes)) iterations.
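The two operations can be sketched on a single machine as below. This is a plain-Python illustration of the small-star and big-star definitions given above, not the GraphFrames implementation; the edge list in the example is hypothetical, since the slides only show the vertex IDs 1, 5, 7, 8, 9.

```python
from collections import defaultdict


def _neighbours(edges):
    """Build an undirected adjacency map from a set of (u, v) edges."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def small_star(edges):
    """Connect all smaller neighbours of v, and v itself, to the min neighbour."""
    result = set()
    for v, nbrs in _neighbours(edges).items():
        group = {u for u in nbrs if u < v} | {v}
        smallest = min(group)
        result.update((u, smallest) for u in group if u != smallest)
    return result


def big_star(edges):
    """Connect all strictly larger neighbours of v to the min of v and its neighbours."""
    result = set()
    for v, nbrs in _neighbours(edges).items():
        smallest = min(nbrs | {v})
        result.update((u, smallest) for u in nbrs if u > v)
    return result


def star_connected_components(vertices, edges):
    """Alternate small-star and big-star until the edge set stops changing."""
    # Store every edge as (larger, smaller) so edge sets can be compared.
    edges = {(max(u, v), min(u, v)) for u, v in edges if u != v}
    while True:
        new_edges = big_star(small_star(edges))
        if new_edges == edges:
            break
        edges = new_edges
    # Each component is now a star: (vertex, smallest vertex of its component).
    component = {v: v for v in vertices}
    for vertex, centre in edges:
        component[vertex] = centre
    return component


# Hypothetical example: a path 1 - 5 - 7 - 9 - 8 plus a separate edge 20 - 21.
COMPONENTS = star_connected_components(
    [1, 5, 7, 8, 9, 20, 21],
    [(1, 5), (5, 7), (7, 9), (9, 8), (20, 21)],
)
print(COMPONENTS)
```

In Spark, each operation would be expressed as joins and aggregations over the edges DataFrame, so every iteration involves a shuffle.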

Let’s Talk about Performance
● All datasets are taken from WebGraph Datasets
Twitter
42 million vertices, 1.5 billion edges (small diameter)
running on 16 r3.4xlarge workers on Databricks
● GraphX: 4 minutes
● GraphFrames: 6 minutes
UK Web Graph
105 million vertices, 3.7 billion edges
running on 16 r3.4xlarge workers on Databricks
● GraphX: 25 minutes (slow convergence)
● GraphFrames: 4.5 minutes
Grid
grid 32,000 x 32,000 (large diameter)
1 billion nodes, 4 billion edges
running on 32 r3.8xlarge workers on Databricks
● GraphX: failed
● GraphFrames: 1 hour

How about some numbers?
● # of Reviews: ~31M
● # of Opinions: ~124M
● # of Topics: ~7.5M
● # of Semantic Clusters: ~11M
● # of Machines: 50 r3.xlarge
● Pipeline time: ~2 hours

Key Takeaways
● Graph Queries + Graph Algorithms = GraphFrames ❤️
● Simple
○ Easy and convenient API in the language of your choosing
○ Lives alongside other Spark components
● Flexible - using different implementations GraphX/GraphFrames
● Watch out for Performance!
○ The GraphFrames implementation of CC is actually worse than GraphX in some
cases
■ No silver bullet - it depends on the actual graph (size, diameter, sparseness)
○ Most distributed graph algorithms use iterative message passing between
nodes - shuffle hell.

Key Takeaways
● Monitoring - Hard to understand the execution plan
● Checkpointing is important! - by default it happens every 2 iterations
○ Handles unexpected node failures
○ Prevents query plan explosion
○ Prevents optimizer slowdown
○ Prevents disks from running out of shuffle space

Future work
● Performance Optimizations
○ Using different checkpointing parameters
○ Test GraphFrames native Connected Components
● Algorithm Evaluation and AI based Clustering
○ Measure the correctness of current algorithm
○ Research the use of Unsupervised Clustering
● Support additional languages
○ Insights currently supports English.
