In today’s world, customers and service providers (e.g., social networks, ad targeting, retail) interact across a variety of modes and channels such as browsers, apps, and devices. In each such interaction, the user is identified by a token, possibly a different token for each mode or channel. Examples of such identity tokens include cookies and app IDs. As the user engages more with these services, linkages are generated between tokens belonging to the same user; linkages connect multiple identity tokens together.
4. Agenda
▪ Identities at Scale
▪ Why Graph
▪ How we Built and Scaled it
▪ Why an In-house Solution
▪ Challenges and Considerations
▪ A Peek into Real-time Graph
7. Identity Resolution
Aims to provide a coherent view of a customer and/or a household by unifying all customer identities across channels and subsidiaries.
Provides a single notion of a customer.
9. Identities, Linkages & Metadata – An Example
[Figure: a user's identities (Device id, App id, Login id, Cookie id) drawn as nodes, connected by linkages (edges), and annotated with metadata such as Last login: 4/28/2020, App: YouTube, and Country: Canada. Legend: Identity = node, Linkage = edge, Metadata = annotation.]
10. Graph – An Example
[Figure: the same identities (Login id, App id, Device id, Cookie id) joined into one graph, with metadata such as Last login: 4/28/2020, App: YouTube, and Country: Canada attached to nodes and edges.]
Connect all Linkages to create a single connected component per user/household
11. Graph Traversal
▪ A graph is an efficient data structure relative to table joins
▪ Why table joins don't work:
▪ Linkages are on the order of millions of rows spanning hundreds of tables
▪ Table joins are index-based and computationally very expensive
▪ Table joins result in lower coverage
Scalable and offers better coverage
12. Build once – Query multiple times
▪ Graph enables dynamic traversal logic. One graph offers infinite traversal possibilities:
▪ Get all tokens linked to an entity
▪ Get all tokens linked to the entity's household
▪ Get all tokens linked to an entity that were created after Jan 2020
▪ Get all tokens linked to the entity's household that interacted using App 1
▪ ...
Graph comes with flexibility in traversal
14. Scale and Performance Objectives
▪ 25+ billion linkages and identities
▪ New linkages created 24x7
▪ Node and edge metadata updated for 60% of existing linkages
▪ Freshness – graph updated with linkages and metadata once a day
▪ Could be a few hours in future goals
▪ Ability to run on general-purpose Hadoop infrastructure
15. Components of Identity Graph
▪ Data Analysis – understand your data and check for anomalies
▪ Handling Heterogeneous Data Sources – extract only new and modified linkages in the format needed by the next stage
▪ Core Processing
▪ Stage I – Dedup & Eliminate Outliers: add edge metadata, filter outliers, and populate tables needed by the next stage
▪ Stage II – Create Connected Components: merge linkages to form an identity graph for each customer
▪ Stage III – Prepare for Traversal: resolve the linkages within a cluster and append metadata information to enable graph traversal
▪ Traversal – traverse across the cluster as per defined rules to pick only the qualified nodes
16. Data Analysis
▪ Understanding the data that feeds into the Graph pipeline is paramount to building a usable Graph framework
▪ Feeding in poor-quality linkages results in connected components spanning millions of nodes, taking a toll on computing resources and business value
▪ Some questions to analyze:
▪ Does the linkage relationship make business sense?
▪ What is an acceptable threshold for poor-quality linkages?
▪ Do we need to apply any filters?
▪ Nature of the data – snapshot vs. incremental
17. Handling Heterogeneous Data Sources
▪ Data sources grow rapidly in volume and variety
▪ From a handful of manageable data streams to an intimidatingly magnificent Niagara Falls!
▪ A dedicated framework ingests data in parallel from heterogeneous sources
▪ Serves only new and modified linkages – important for incremental processing
▪ Pulls only the desired attributes for further processing – linkages and their metadata in a standard schema
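The "new and modified only" extraction above can be sketched with a watermark on a per-row modification timestamp. This is a minimal single-machine sketch, not the production framework; the field names (`src_token`, `dst_token`, `modified_ts`) are illustrative assumptions:

```python
def extract_incremental(rows, last_watermark):
    """Keep only linkages created or modified since the previous run and
    project them onto a standard schema. Field names are illustrative."""
    out, new_watermark = [], last_watermark
    for row in rows:
        if row["modified_ts"] > last_watermark:
            out.append({
                "src": row["src_token"],           # one end of the linkage
                "dst": row["dst_token"],           # the other end
                "metadata": row.get("metadata", {}),
            })
            new_watermark = max(new_watermark, row["modified_ts"])
    return out, new_watermark
```

Persisting the returned watermark between runs is what makes the next pull incremental.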
18. Core Processing – Stage I
▪ Feeds good-quality linkages to further processing. It handles:
▪ Deduplication – if a linkage is repeated, we consume only the latest record
▪ Outlier elimination – filters anomalous linkages based on a chosen threshold derived from data analysis
▪ Edge metadata population – attributes of the linkage; they help traverse the graph to get the desired linkages
Dedup & Eliminate outliers
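The deduplication and outlier-elimination responsibilities can be sketched together as below. This is a minimal sketch; the record fields (`entity`, `token`, `ts`) and the cardinality threshold are illustrative assumptions, not the production schema:

```python
from collections import defaultdict

def stage_one(records, cardinality_threshold=100):
    """Stage I sketch: dedup repeated linkages (keep only the latest record)
    and drop linkages whose token maps to too many entities (outliers)."""
    # Deduplication: for each (entity, token) pair keep the latest record.
    latest = {}
    for rec in records:
        key = (rec["entity"], rec["token"])
        if key not in latest or rec["ts"] > latest[key]["ts"]:
            latest[key] = rec
    # Outlier elimination: count distinct entities per token and filter
    # tokens whose cardinality exceeds the threshold from data analysis.
    entities_per_token = defaultdict(set)
    for entity, token in latest:
        entities_per_token[token].add(entity)
    return [rec for (entity, token), rec in latest.items()
            if len(entities_per_token[token]) <= cardinality_threshold]
```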
19. Core Processing – Stage II
▪ Merges all related linkages of a customer to create a Connected Component
Create Connected Components
20. Core Processing – Stage III
▪ This stage enriches the connected component with the linkages between nodes and edge metadata to enable graph traversal.
Prepare for Traversal
[Figure: a bare connected component of Login id, App id, Device id, and Cookie id nodes is enriched with its linkages and metadata such as Last login: 4/28/2020, App: YouTube, and Country: Canada.]
21. [Figure: a worked example. Stage I – Dedup & Outlier Elimination yields the edge lists (A, B), (B, C), (B, D), (D, E) and (P, N), (N, M). Stage II – Create Connected Components merges them into components G1 = {A, B, C, D, E} and G2 = {M, N, P}. Stage III – Prepare for Traversal attaches edge metadata (m1–m4 on G1; m1, m2 on G2).]
23. Weighted Union Find with Path Compression
[Figure: a worked example on nodes 1, 2, 5, 7, 8, and 9. Without weighted union, linking clusters can produce a tree of height 2; weighted union instead makes the cluster with fewer nodes a child of the top-level parent of the larger cluster, keeping the height at 1. Successive weighted unions (tracking each cluster's top-level parent and size) merge all the nodes into a single cluster rooted at 2, and path compression then links every node directly to that top-level parent.]
24. • Find() – finds the top-level parent. If a is the child of b and b is the child of c, then find() determines that c is the top-level parent:
a -> b; b -> c => c is the top-level parent
• Path compression – reduces the height of connected components by linking all children directly to the top-level parent:
a -> b; b -> c becomes a -> c; b -> c
• Weighted union() – unifies top-level parents. The parent with fewer children is made the child of the parent with more children. This also helps reduce the height of connected components.
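The three operations above fit together as a small union-find structure. This is a minimal single-machine sketch (class and field names are our own); union is weighted by cluster size as described, and find() compresses the path it walks:

```python
class WeightedUnionFind:
    """Union-find with union by size (weighted) and path compression."""

    def __init__(self):
        self.parent = {}   # node -> parent node (roots point to themselves)
        self.size = {}     # root -> number of nodes in its cluster

    def find(self, x):
        """Return the top-level parent of x, compressing the path walked."""
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:          # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        """Weighted union: the smaller cluster hangs under the larger root."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra
```

Running the linkages (A,B), (B,C), (B,D), (D,E) through `union` produces one 5-node cluster, matching the worked example.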
25. Distributed UFS with Path Compression
▪ Divide – divide the data into partitions
▪ Run UF – run Weighted Union Find with Path Compression on each partition
▪ Shuffle – merge locally processed partitions with a global shuffle, iteratively, until all connected components are resolved
▪ Path Compress – iteratively perform path compression on connected components until all connected components are path-compressed
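The control flow of the four steps can be mimicked on a single machine. This is a toy sketch, not the production Spark job: stride partitioning, choosing the smallest id as root, and min-label propagation for the shuffle step are our own simplifications of the idea:

```python
def union_find_shuffle(edges, num_partitions=4):
    """Toy single-machine sketch of Divide / Run UF / Shuffle / Path Compress."""
    # Divide: split the edge list into partitions (stride partitioning).
    parts = [edges[i::num_partitions] for i in range(num_partitions)]

    # Run UF: union-find with path halving per partition; each partition
    # emits (node, local_root) pairs, with the smallest id as the root.
    def local_uf(part):
        parent = {}
        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent.get(parent[x], parent[x])  # path halving
                x = parent[x]
            return x
        for a, b in part:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)
        return [(n, find(n)) for n in parent]

    partial = [pair for part in parts for pair in local_uf(part)]

    # Shuffle: iteratively propagate the minimum label across the partial
    # (node, local_root) pairs until nothing changes -- the global merge.
    label = {n: n for pair in partial for n in pair}
    changed = True
    while changed:
        changed = False
        for a, b in partial:
            m = min(label[a], label[b])
            if label[a] != m or label[b] != m:
                label[a] = label[b] = m
                changed = True
    # Path Compress: label now maps every node straight to its component root.
    return label
```

Components split across partitions (e.g., A–B in one partition, B–C in another) are stitched together only during the iterative shuffle phase, which is why that phase must repeat until no label changes.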
27. Union Find Shuffle using Spark
▪ Sheer scale of data at hand (25+ billion vertices & 30+ billion edges)
▪ Iterative processing with caching and intermittent checkpointing
▪ Limitations with other alternatives
28. How do we scale?
▪ The input to Union Find Shuffle is bucketed to create 1000 part files of similar size
▪ 10 instances of Union Find execute over the 1000 part files holding ~30 billion nodes; each instance of UF is applied to 100 part files
▪ At any given time, 5 instances of UF run in parallel
31. An Example of Anomalous Linkage Data
▪ For some linkages, we have millions of entities mapping to the same token (id)
▪ In the data distribution, we see a majority of tokens mapped to 1–5 entities
▪ We also see a few tokens (potential outliers) mapped to millions of entities!
[Figure: data distribution of entities per token]
33. Removal of Anomalous Linkages
▪ Extensive analysis to identify anomalous linkage patterns
▪ A Gaussian Anomaly detection model (Statistical Analysis)
▪ Identify thresholds of linkage cardinality to filter linkages
▪ A lenient threshold will improve coverage at the cost of precision.
Hit the balance between Coverage and Precision
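One way such a cardinality threshold could be derived is sketched below. The deck specifies only "a Gaussian anomaly detection model", so the exact fit and the `k` parameter here are illustrative assumptions:

```python
import math

def gaussian_threshold(cardinalities, k=3.0):
    """Fit a mean and standard deviation to per-token linkage cardinalities
    and return the cutoff: tokens more than k standard deviations above the
    mean are treated as anomalous. A larger k is more lenient (coverage up,
    precision down); a smaller k is stricter."""
    n = len(cardinalities)
    mean = sum(cardinalities) / n
    std = math.sqrt(sum((c - mean) ** 2 for c in cardinalities) / n)
    return mean + k * std
```

Tuning `k` is exactly the coverage-vs-precision dial the slide describes.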
34. Threshold Trade-off – # of Big Clusters, % Match of Facets, 1 / # of Distinct Entities
▪ Threshold 10 – High Precision, Low Coverage (1 big cluster)
▪ The majority of connected components are not big clusters, so the % of distinct entities outside the big cluster(s) will be high
▪ More linkages would have been filtered out as dirty linkages, so the % match of facets will suffer
▪ Threshold 1000 – Low Precision, High Coverage (4 big clusters)
▪ More connected components form big clusters, so the % of distinct entities outside the big clusters will be lower
▪ Only a few linkages would have been filtered out as dirty linkages, so the % match of facets will be high
35. Large Connected Components (LCC)
▪ Size ranging from 10K to 100M+ nodes
▪ A combination of hubs and long chains
▪ Caused by token collisions, noise in the data, and bot traffic
▪ Legitimate users also belong to LCCs
▪ Cause a large number of shuffles in UFS
A result of a lenient threshold
36. Traversing LCC
▪ Business demands both Precision and Coverage, hence LCCs need traversal
▪ An iterative Spark BFS implementation is used to traverse LCCs
▪ Traversal is supported up to a certain pre-defined depth
▪ Going beyond that depth not only strains the system but also adds no business value
▪ Traversal is optimized to run using Spark in 20–30 minutes over all connected components
Solution to get both Precision and Coverage
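The depth-capped, rule-filtered traversal can be sketched as an iterative BFS. This is a single-machine sketch rather than the Spark implementation; the adjacency-list shape and the `edge_ok` predicate are illustrative:

```python
from collections import deque

def bounded_bfs(adj, start, max_depth, edge_ok=lambda meta: True):
    """Depth-capped BFS over an adjacency list of (neighbor, metadata)
    pairs. Only edges whose metadata passes edge_ok are followed,
    mirroring rule-based traversal; the depth cap keeps the walk cheap."""
    visited = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:          # pre-defined depth limit
            continue
        for neighbor, meta in adj.get(node, []):
            if neighbor not in visited and edge_ok(meta):
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return visited
```

Passing a predicate such as `lambda m: m in {"m1", "m2"}` restricts the walk to qualifying linkages, which is how one graph can serve many traversal rules.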
38. Graph Pipeline
[Figure: end-to-end pipeline diagram. Heterogeneous-linkage handling pulls 25B+ raw linkages and 30B+ nodes from 15 upstream tables into (tid1, tid2, linkage metadata) records (table extraction & transformation: ~30 mins). Stage I dedups and filters them and maps tokens to long ids (tid1 -> tid1_long; 1–2 hrs). Stage II runs UnionFindShuffle (8–10 hrs) to assign each token a connected-component id (tid1_long -> tgid1, tid6_long -> tgid120). Stage III performs subgraph creation (4–5 hrs), storing each component as adjacency lists with edge metadata, e.g., tgid 3: A1 -> [C1:m1, B1:m2, B2:m3], and splits small connected components (SCC) from large ones (LCC, > 5K nodes), which are dumped to a separate table. A traversal request such as "give all A–B linkages where criteria = m1, m2" (startnode=A, endnode=B) filters tids on m1, m2 and runs one BFS (unidirectional or bidirectional) map task per tgid over the filtered tgid partitions, returning linkage pairs such as A1–B1 and A2–B2; traversals finish in 20–30 mins.]
40. Real-time Graph: Challenges
▪ Linkages within streaming datasets
▪ New linkages require updating the graph in real time
▪ Concurrency – concurrent updates to the graph need to be handled to avoid deadlocks, starvation, etc.
▪ Scale
▪ High volume – e.g., clickstream data: as users browse the webpage/app, new events are generated
▪ Replication and Consistency – making sure that the data is properly replicated for fault tolerance and is consistent for queries
▪ Real-time Querying and Traversals
▪ High-throughput traversal and querying capability on tokens belonging to the same customer