Building Identity Graphs over
Heterogeneous Data
Sudha Viswanathan
Saigopal Thota
Primary Contributors
Agenda
▪ Identities at Scale
▪ Why Graph
▪ How we Built and Scaled it
▪ Why an In-house Solution
▪ Challenges and Considerations
▪ A Peek into Real-time Graph
Identities at Scale
[Figure: identity tokens flowing in from many channels — Account Ids, Cookies, Online Ids, Device Ids — across apps (App 1, App 2, App 3), online properties (Online 1, Online 2, ...), and partner apps]
Identity Resolution
Aims to provide a coherent view of a customer and/or a household by unifying all customer identities across channels and subsidiaries. Provides a single notion of a customer.
Why Graph
Identities, Linkages & Metadata – An Example
[Figure: identities (Device id, App id, Login id, Cookie id) connected by linkages. An Identity is a node, a Linkage is an edge, and Metadata is an attribute on either — e.g. "Last login: 4/28/2020", "App: YouTube", "Country: Canada".]
Graph – An Example
[Figure: the same identities (Login id, App id, Device id, Cookie id) joined into one graph, carrying metadata such as "Last login: 4/28/2020", "App: YouTube", "Country: Canada"]
Connect all linkages to create a single connected component per user/household.
Graph Traversal
▪ A graph is an efficient data structure relative to table joins
▪ Why table joins don't work:
▪ Linkages are in the order of millions of rows spanning hundreds of tables
▪ Table joins are index-based and computationally very expensive
▪ Table joins result in lower coverage
Scalable and offers better coverage
Build once – Query multiple times
▪ Graph enables dynamic traversal logic. One Graph offers infinite
traversal possibilities
▪ Get all tokens liked to an entity
▪ Get all tokens linked to the entity's household
▪ Get all tokens linked to an entity that created after Jan 2020
▪ Get all tokens linked to the entity's household that interacted using App 1
▪ ...
Graph comes with flexibility in traversal
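The "build once, query many times" idea can be sketched in plain Python: one in-memory adjacency list, many traversal rules expressed as edge predicates. The graph shape, token names, and `app` metadata key below are hypothetical illustrations, not the production schema.

```python
from collections import deque

def traverse(adj, start, edge_pred=lambda meta: True):
    """Collect all tokens reachable from `start`, following only edges
    whose metadata satisfies `edge_pred`. `adj` maps a token to a list
    of (neighbor, edge_metadata) pairs."""
    seen, out, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nbr, meta in adj.get(node, []):
            if nbr not in seen and edge_pred(meta):
                seen.add(nbr)
                out.append(nbr)
                queue.append(nbr)
    return out

# One graph, two traversal rules: all tokens vs. only App-1 linkages.
adj = {
    "login1": [("cookie1", {"app": "App 1"}), ("device1", {"app": "App 2"})],
    "cookie1": [("login1", {"app": "App 1"})],
    "device1": [("login1", {"app": "App 2"})],
}
all_tokens = traverse(adj, "login1")
app1_only = traverse(adj, "login1", edge_pred=lambda m: m["app"] == "App 1")
```

The same graph answers both queries; only the predicate changes, which is the flexibility the slide refers to.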
How we Built and Scaled
Scale and Performance Objectives
▪ More than 25 billion linkages and identities
▪ New linkages created 24x7
▪ Node and edge metadata updated for 60% of existing linkages
▪ Freshness – graph updated with linkages and metadata once a day
▪ Future goal: within a few hours
▪ Ability to run on general-purpose Hadoop infrastructure
Components of Identity Graph
▪ Data Analysis – understand your data and check for anomalies
▪ Handling Heterogeneous Data Sources – extract only new and modified linkages in the format needed by the next stage
▪ Core Processing
▪ Stage I – Dedup & Eliminate Outliers: add edge metadata, filter outliers, and populate the tables needed by the next stage
▪ Stage II – Create Connected Components: merge linkages to form an identity graph for each customer
▪ Stage III – Prepare for Traversal: enriches linkages within a cluster and appends metadata to enable graph traversal
▪ Traversal – traverse the cluster per defined rules to pick only the qualified nodes
Data Analysis
▪ Understanding the data that feeds into Graph pipeline is paramount to
building a usable Graph framework.
▪ Feeding poor quality linkage results in connected components spanning across millions of nodes, taking a toll on computing
resources and business value
▪ Some questions to analyze,
▪ Does the linkage relationship makes business sense?
▪ What is acceptable threshold for poor quality linkages
▪ Do we need to apply any filter
▪ Nature of data – Snapshot vs Incremental
Handling Heterogeneous Data Sources
▪ Data sources grow rapidly in volume and variety
▪ From a handful of manageable data streams to an intimidatingly magnificent Niagara Falls!
▪ Dedicated framework to ingest data in parallel from heterogeneous sources
▪ Serves only new and modified linkages – important for incremental processing
▪ Pulls only the desired attributes for further processing – linkages and their metadata in a standard schema
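Serving only new and modified linkages can be sketched as a change-data filter against the last pipeline run. The record layout, `modified_ts` field, and the (src, dst, metadata) output schema are assumptions for illustration, not the actual ingestion framework.

```python
def extract_new_linkages(records, last_run_ts):
    """Keep only linkages created or modified since the last pipeline run,
    projected down to a standard (src, dst, metadata) schema so every
    heterogeneous source feeds Stage I the same shape."""
    return [
        (r["src"], r["dst"], r["metadata"])
        for r in records
        if r["modified_ts"] > last_run_ts
    ]

records = [
    {"src": "login1", "dst": "cookie1", "metadata": {"app": "A"}, "modified_ts": 90},
    {"src": "login1", "dst": "device1", "metadata": {"app": "B"}, "modified_ts": 110},
]
# Only the linkage modified after the last run (ts=100) is served downstream.
fresh = extract_new_linkages(records, last_run_ts=100)
```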
Core Processing – Stage I: Dedup & Eliminate Outliers
▪ Feeds good-quality linkages to further processing. It handles:
▪ Deduplication – if a linkage is repeated, we consume only the latest record
▪ Outlier elimination – filters anomalous linkages based on a threshold derived from data analysis
▪ Edge metadata population – attributes of the linkage that help traverse the graph to get the desired linkages
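A minimal sketch of the "keep only the latest record" deduplication rule, assuming each linkage carries a timestamp; the tuple layout and token names are hypothetical.

```python
def dedup_latest(linkages):
    """If a (src, dst) linkage appears multiple times, keep only the
    record with the newest timestamp."""
    latest = {}
    for src, dst, ts, meta in linkages:
        key = (src, dst)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, meta)
    return sorted((s, d, ts, m) for (s, d), (ts, m) in latest.items())

deduped = dedup_latest([
    ("login1", "cookie1", 100, "m_old"),
    ("login1", "cookie1", 200, "m_new"),   # repeated linkage: newer wins
    ("login1", "device1", 150, "m_dev"),
])
```

At scale the same rule is typically expressed as a grouped max-by-timestamp over the linkage key rather than an in-memory dict.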
Core Processing – Stage II: Create Connected Components
▪ Merges all related linkages of a customer to create a connected component
Core Processing – Stage III: Prepare for Traversal
▪ This stage enriches the connected component with linkages between nodes and edge metadata to enable graph traversal.
[Figure: worked example across stages. Stage I dedups the raw pairs and eliminates outliers. Stage II merges the edge list A–B, B–C, B–D, D–E into a connected component G1 = {A, B, C, D, E}, with a second component G2 over nodes M, N, P. Stage III prepares each component for traversal by restoring its internal linkages and attaching edge metadata m1–m4.]
Union Find Shuffle (UFS): Building Connected Components at Scale
Weighted Union Find with Path Compression
[Figure: Weighted Union Find example on nodes 1, 2, 5, 7, 8, 9. Plain (unweighted) union of 2–5 and then 5–9 produces a chain of height 2; weighted union instead attaches the smaller cluster (size 1) under the top-level parent of the larger one (size 2), keeping the height at 1. Successive weighted unions merge {2, 5, 9} with {7, 8} and then with {1} under top-level parent 2; path compression then re-links every node directly to 2.]
• Find() – finds the top-level parent. If a is the child of b and b is the child of c, then find() determines that c is the top-level parent: a → b → c ⇒ c is the top parent.
• Path compression – reduces the height of connected components by linking all children directly to the top-level parent: a → b → c becomes a → c and b → c.
• Weighted union – unifies top-level parents. The parent with fewer children is made the child of the parent with more children, which also keeps the height of connected components low.
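The three operations above fit in a short class. This is a textbook single-machine sketch (union-by-size with path compression), not the distributed implementation; the example unions mirror the node labels used in the figure.

```python
class WeightedUnionFind:
    """Union-Find with weighted union (by size) and path compression,
    mirroring the find()/path-compression/weighted-union steps above."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        if x not in self.parent:
            self.parent[x], self.size[x] = x, 1
            return x
        root = x
        while self.parent[root] != root:       # walk up to top-level parent
            root = self.parent[root]
        while self.parent[x] != root:          # path compression: re-link
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:      # weighted union: smaller
            ra, rb = rb, ra                    # tree goes under larger
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

uf = WeightedUnionFind()
for a, b in [(2, 5), (2, 9), (7, 8), (5, 7)]:
    uf.union(a, b)
# all five nodes now share one top-level parent
```

Both optimizations together make each operation effectively constant time (inverse-Ackermann amortized), which is what makes the approach viable at billions of edges.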
Distributed UFS with Path Compression
▪ Divide – divide the data into partitions
▪ Run UF – run Weighted Union Find with Path Compression on each partition
▪ Shuffle – merge locally processed partitions with a global shuffle, iterating until all connected components are resolved
▪ Path Compress – iteratively perform path compression on the connected components until all of them are path-compressed
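A toy, single-machine sketch of the divide / run-UF / shuffle flow: each partition resolves its edges independently, then the per-partition (node, root) pairs are merged with a further Union-Find pass. On a cluster that merge is the iterative global shuffle; here one pass suffices, so this only illustrates the data flow, not the distributed mechanics.

```python
def local_union_find(edges):
    """Plain Union-Find (with path halving) over one partition's edges;
    returns a node -> top-level-parent map."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    return {n: find(n) for n in parent}

def union_find_shuffle(edges, num_partitions=3):
    """Divide edges into partitions, run UF locally on each, then merge
    the per-partition parent pointers (the 'shuffle' step)."""
    parts = [edges[i::num_partitions] for i in range(num_partitions)]
    merged = []
    for part in parts:
        merged.extend(local_union_find(part).items())
    return local_union_find(merged)

roots = union_find_shuffle([(1, 2), (2, 3), (4, 5), (3, 4), (6, 7)])
# {1,2,3,4,5} collapse into one component, {6,7} into another
```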
Shuffle in UFS
9 4 5 8 3
Reached Termination conditionProceeds to next iteration
3 74 9
8 6
6 3
32
97
57
9
4 3
6 8 34 32
2 3
3 7
5
9 7 5
6 37 3 5
Union Find Shuffle using Spark
▪ Sheer scale of data at hand (25+ billion vertices & 30+ billion edges)
▪ Iterative processing with caching and intermittent checkpointing
▪ Limitations with other alternatives
How do we scale?
▪ The input to Union Find Shuffle is bucketed into 1,000 part files of similar size
▪ 10 instances of Union Find execute over the 1,000 part files (~30 billion nodes); each instance of UF is applied to 100 part files
▪ At any given time, 5 instances of UF run in parallel
Data Quality: Challenges and Considerations
Noisy Data
[Figure: two anomalous linkage patterns — a single account (Acc 1) linked to cookie tokens Coo 1 through Coo 100, and a single cookie token (Coo 1) linked to accounts Acc 1 through Acc 100.]
Graph exposes noise, opportunities, and fragmentation in data
An Example of Anomalous Linkage Data
▪ For some linkages, we have millions of entities mapping to the same token (id)
▪ In the data distribution, we see the majority of tokens mapped to 1–5 entities
▪ We also see a few tokens (potential outliers) mapped to millions of entities!
[Figure: data distribution of entities per token]
Removal of Anomalous Linkages
▪ Extensive analysis to identify anomalous linkage patterns
▪ A Gaussian anomaly-detection model (statistical analysis)
▪ Identify thresholds of linkage cardinality to filter linkages
▪ A lenient threshold improves coverage at the cost of precision
Striking the balance between coverage and precision (metrics compared per threshold: number of big clusters, % match of facets, number of distinct entities):
▪ Threshold 10 – high precision, low coverage; 1 big cluster
▪ The majority of connected components are not big clusters, so the % of distinct entities outside the big cluster(s) is high
▪ More linkages get filtered out as dirty linkages, so the % match of facets suffers
▪ Threshold 1000 – low precision, high coverage; 4 big clusters
▪ More connected components form big clusters, so the % of distinct entities outside the big clusters is lower
▪ Only a few linkages get filtered out as dirty linkages, so the % match of facets is high
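The cardinality-threshold trade-off can be sketched as a simple fan-out filter: count how many entities map to each token and drop linkages whose token exceeds the threshold. The token names and counts are hypothetical; the real pipeline derives the threshold from the Gaussian model rather than hard-coding it.

```python
from collections import Counter

def filter_by_cardinality(linkages, threshold):
    """Drop linkages whose token maps to more entities than `threshold`.
    A low threshold favors precision; a high one favors coverage."""
    fanout = Counter(token for _, token in linkages)
    return [(e, t) for e, t in linkages if fanout[t] <= threshold]

linkages = [("e1", "tok_a"), ("e2", "tok_a"), ("e3", "tok_a")] + \
           [(f"e{i}", "tok_hot") for i in range(1000)]
precise = filter_by_cardinality(linkages, threshold=10)    # drops tok_hot
broad = filter_by_cardinality(linkages, threshold=5000)    # keeps everything
```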
Large Connected Components (LCC)
▪ Sizes ranging from 10K to 100M+ nodes
▪ A combination of hubs and long chains
▪ Caused by token collisions, noise in the data, and bot traffic
▪ Legitimate users also belong to LCCs
▪ Cause a large number of shuffles in UFS
A result of a lenient threshold
Traversing LCC
▪ Business demands both precision and coverage, hence LCCs need traversal
▪ An iterative Spark BFS implementation is used to traverse LCCs
▪ Traversal is supported up to a certain pre-defined depth
▪ Going beyond a certain depth not only strains the system but also adds no business value
▪ Traversal is optimized to run in Spark in 20–30 minutes over all connected components
Solution to get both precision and coverage
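A depth-capped BFS of this kind can be sketched as follows. This is a single-machine stand-in for the iterative Spark implementation; the adjacency list, metadata values m1/m2, and the id-prefix convention for node types are illustrative assumptions.

```python
from collections import deque

def bfs_with_depth_cap(adj, start, end_type, max_depth, edge_pred):
    """Iterative BFS from `start`, following only edges whose metadata
    passes `edge_pred`, expanding no deeper than `max_depth`, and
    collecting nodes whose id starts with `end_type` (a toy stand-in
    for a real node-type attribute)."""
    results, seen = [], {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # depth cap: deeper hops add no business value
        for nbr, meta in adj.get(node, []):
            if nbr in seen or not edge_pred(meta):
                continue
            seen.add(nbr)
            if nbr.startswith(end_type):
                results.append(nbr)
            frontier.append((nbr, depth + 1))
    return results

# "Give all A–B linkages where criteria = m1, m2", capped at depth 2
adj = {"A1": [("C1", "m1"), ("B1", "m2")], "C1": [("B2", "m1")]}
reachable = bfs_with_depth_cap(adj, "A1", "B", 2, lambda m: m in {"m1", "m2"})
```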
Data Volume & Runtime
[Figure: end-to-end graph pipeline with data volumes and per-stage runtimes.
▪ Handling heterogeneous linkages: 15 upstream tables; 25B+ raw linkages and 30B+ nodes; table extraction & transformation, ~30 mins
▪ Stage I: emits (tid1, tid2, linkage metadata) rows; ~1–2 hrs
▪ Stage II: tids are mapped to long ids and Union Find Shuffle assigns each node a component id (tgid), e.g. tid1_long → tgid1; ~8–10 hrs
▪ Stage III (SCC): subgraph creation, producing per-tgid records such as tgid1 → {tgid: 1, tid: [aid, bid], edges: [srcid, destid, metadata]}; ~4–5 hrs
▪ Stage III (LCC): adjacency-list tables keyed by tgid and tid, e.g. (tgid 3, tid A1) → [C1:m1, B1:m2, B2:m3]; ~2.5 hrs
▪ Traversal request on LCC (e.g. startnode=A, endnode=B, criteria=m1,m2 — "give all A–B linkages where criteria = m1, m2"): filter tids on m1, m2, count by tgid, dump the LCC table (components > 5K nodes) into filtered tgid partitions, then run one map per tgid performing a BFS (uni-directed/bi-directed) for each tid; ~20–30 mins]
A Peek into Real-time Graph
Real-time Graph: Challenges
▪ Linkages arrive within streaming datasets
▪ New linkages require updating the graph in real time
▪ Concurrency – concurrent updates to the graph need to be handled to avoid deadlocks, starvation, etc.
▪ Scale
▪ High volume – e.g., clickstream data: as users browse the webpage/app, new events are generated
▪ Replication and consistency – making sure the data is properly replicated for fault tolerance and consistent for queries
▪ Real-time querying and traversals
▪ High-throughput traversal and querying capability on tokens belonging to the same customer
Questions?
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 

Recently uploaded (20)

RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 

Building Identity Graphs over Heterogeneous Data

  • 2. Building Identity Graphs over Heterogeneous Data Sudha Viswanathan Saigopal Thota
  • 4. Agenda ▪ Identities at Scale ▪ Why Graph ▪ How we Built and Scaled it ▪ Why an In-house solution ▪ Challenges and Considerations ▪ A peek into Real time Graph
  • 6. Identity Tokens [Diagram: identity tokens — Account Ids, Cookies, Online Ids, Device Ids — spanning channels such as App 1, App 2, App 3, Online 1, Online 2, and Partner Apps]
  • 7. Identity Resolution Aims to provide a coherent view of a customer and/or a household by unifying all customer identities across channels and subsidiaries. Provides a single notion of a customer.
  • 9. Identities, Linkages & Metadata – An Example [Diagram: identity nodes (Device id, App id, Login id, Cookie id) connected by linkage edges and annotated with metadata such as Last login: 4/28/2020, App: YouTube, Country: Canada]
  • 10. Graph – An Example [Diagram: the same identities (Login id, App id, Device id, Cookie id) and their metadata merged into one graph] Connect all linkages to create a single connected component per user/household
  • 11. Graph Traversal ▪ A graph is an efficient data structure relative to table joins ▪ Why table joins don't work: ▪ Linkages are in the order of millions of rows spanning hundreds of tables ▪ Table joins are index-based and computationally very expensive ▪ Table joins result in lower coverage Scalable and offers better coverage
  • 12. Build once – Query multiple times ▪ Graph enables dynamic traversal logic. One graph offers infinite traversal possibilities ▪ Get all tokens linked to an entity ▪ Get all tokens linked to the entity's household ▪ Get all tokens linked to an entity created after Jan 2020 ▪ Get all tokens linked to the entity's household that interacted using App 1 ▪ ... Graph comes with flexibility in traversal
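The "build once, query many times" idea can be sketched as a materialized component queried with arbitrary predicates. The token schema below (`id`, `created`, `app` fields) is illustrative, not the talk's actual data model:

```python
# A toy connected component: token records with per-token metadata.
# Field names are stand-ins for the production schema.
component = [
    {"id": "cookie_1", "created": "2019-11-02", "app": "App 1"},
    {"id": "device_7", "created": "2020-03-15", "app": "App 1"},
    {"id": "login_3",  "created": "2020-06-01", "app": "App 2"},
]

def tokens_where(component, predicate=lambda token: True):
    """One materialized component answers arbitrarily many traversal rules."""
    return [t["id"] for t in component if predicate(t)]

all_tokens = tokens_where(component)
recent = tokens_where(component, lambda t: t["created"] > "2020-01")
via_app1 = tokens_where(component, lambda t: t["app"] == "App 1")
```

The graph is built once; each business question is just a new predicate, with no re-joining of upstream tables.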
  • 13. How we Built and Scaled
  • 14. Scale and Performance Objectives ▪ 25+ billion linkages and identities ▪ New linkages created 24x7 ▪ Node and edge metadata updated for 60% of existing linkages ▪ Freshness – graph updated with linkages and metadata once a day ▪ Could be a few hours, in future goals ▪ Ability to run on general-purpose Hadoop infrastructure
  • 15. Components of Identity Graph Data Analysis – Understand your data and check for anomalies Handling Heterogeneous Data Sources – Extract only new and modified linkages in the format needed by the next stage Stage I – Dedup & Eliminate Outliers – Add edge metadata, filter outliers, and populate the tables needed by the next stage Stage II – Create Connected Components – Merge linkages to form an identity graph for each customer Stage III – Prepare for Traversal – Demystifies linkages within a cluster and appends metadata to enable graph traversal Traversal – Traverse the cluster per defined rules to pick only the qualified nodes Core Processing
  • 16. Data Analysis ▪ Understanding the data that feeds into the graph pipeline is paramount to building a usable graph framework ▪ Feeding poor-quality linkages results in connected components spanning millions of nodes, taking a toll on computing resources and business value ▪ Some questions to analyze: ▪ Does the linkage relationship make business sense? ▪ What is an acceptable threshold for poor-quality linkages? ▪ Do we need to apply any filters? ▪ Nature of data – snapshot vs. incremental
  • 17. Handling Heterogeneous Data Sources ▪ Data sources grow rapidly in volume and variety ▪ From a handful of manageable data streams to an intimidatingly magnificent Niagara Falls! ▪ Dedicated framework to ingest data in parallel from heterogeneous sources ▪ Serves only new and modified linkages – this is important for incremental processing ▪ Pulls only the desired attributes for further processing – linkages and their metadata in a standard schema
  • 18. Core Processing – Stage I ▪ Feeds good-quality linkages to further processing. It handles: ▪ Deduplication ▪ If a linkage is repeated, we consume only the latest record ▪ Outlier elimination ▪ Filters anomalous linkages based on a threshold derived from data analysis ▪ Edge metadata population ▪ Attributes of the linkage that help traverse the graph to get the desired linkages. Dedup & eliminate outliers
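The deduplication rule ("if a linkage is repeated, consume only the latest record") can be sketched as follows. The `(src, dst, timestamp, metadata)` tuple layout is an assumed stand-in for the talk's linkage tables:

```python
def dedup_linkages(records):
    """Keep only the most recent record per linkage.

    `records` are (src, dst, timestamp, metadata) tuples; this schema is
    illustrative, not the production one.
    """
    latest = {}
    for src, dst, ts, meta in records:
        key = tuple(sorted((src, dst)))   # treat linkages as undirected
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, src, dst, meta)
    # Emit one row per linkage, carrying the freshest edge metadata.
    return sorted((s, d, t, m) for t, s, d, m in latest.values())
```

In production this runs as a distributed job (e.g., a window over the linkage key ordered by timestamp), but the keep-latest semantics are the same.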
  • 19. Core Processing – Stage II ▪ Merges all related linkages of a customer to create a Connected Component Create Connected Components
  • 20. Core Processing – Stage III ▪ This stage enriches the connected component with linkages between nodes and edge metadata to enable graph traversal. Prepare for traversal [Diagram: a bare connected component (Login id, App id, Device id, Cookie id) alongside the same component enriched with edge metadata such as Last login: 4/28/2020, App: YouTube, Country: Canada]
  • 21. [Diagram: worked example — raw linkages (A–B, B–C, B–D, D–E; M–N, N–P) pass through Stage I (dedup & outlier elimination), Stage II merges them into connected components G1 = {A, B, C, D, E} and G2 = {M, N, P}, and Stage III attaches edge metadata m1–m4 to prepare for traversal]
  • 22. Union Find Shuffle (UFS): Building Connected Components at Scale
  • 23. Weighted Union Find with Path Compression [Diagram: worked example with unions 2–5, 2–9, 7–8, 5–7, then 1 — weighted union attaches the root of the smaller tree under the root of the larger tree (keeping height at 1, unlike unweighted union which reaches height 2), and path compression re-links all children directly to the top-level parent, ending with a 5-node cluster whose top-level parent is 2]
  • 24. • find() – returns the top-level parent. If a is the child of b and b is the child of c, find(a) determines that c is the top-level parent: a -> b -> c => c is the top parent • Path compression – reduces the height of connected components by linking all children directly to the top-level parent: a -> b -> c becomes a -> c, b -> c • Weighted union – unifies top-level parents. The parent with fewer children is made a child of the parent with more children, which also helps reduce the height of connected components.
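The three operations above can be sketched as a minimal single-machine structure (a sketch only — the talk's version runs distributed over Spark partitions):

```python
class WeightedUnionFind:
    """Union-find with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register unseen nodes as their own top-level parent.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        # Walk up to the top-level parent.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: re-link every node on the walk directly to the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Weighted union: the smaller tree hangs under the larger tree's root.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra
```

Replaying the slide's example (unions 2–5, 2–9, 7–8, 5–7, 2–1) leaves all six nodes under top-level parent 2 with cluster size 6.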
  • 25. Distributed UFS with Path Compression Path Compress Iteratively perform Path Compression for connected components until all connected components are path compressed. Shuffle Merge locally processed partitions with a global shuffle iteratively until all connected components are resolved Run UF Run Weighted Union Find with Path Compression on each partition Divide Divide the data into partitions
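The four steps above (Divide, Run UF, Shuffle, Path Compress) can be simulated on one machine. This is a toy sketch: partitioning is round-robin, "path compression" is path halving, and the shuffle is a plain fixed-point loop, whereas the real pipeline runs each step as an iterative Spark job:

```python
def union_find_shuffle(edges, num_partitions=4):
    """Toy single-machine simulation of Union Find Shuffle."""

    def local_uf(pairs):
        # Local union-find: parent to the smaller id, path halving while finding.
        parent = {}
        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for a, b in pairs:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)
        return sorted((x, find(x)) for x in parent)

    # Divide: split the edge list into partitions.
    parts = [[] for _ in range(num_partitions)]
    for i, e in enumerate(edges):
        parts[i % num_partitions].append(e)
    # Run UF: each partition resolves its own (node -> local root) linkages.
    linkages = [p for part in parts for p in local_uf(part)]
    # Shuffle: merge partial results iteratively until nothing changes.
    while True:
        merged = local_uf(linkages)
        if merged == linkages:
            return dict(merged)
        linkages = merged
```

The key property the shuffle relies on is that `(node, local_root)` pairs are themselves edges, so re-running union-find over them stitches partition-local components into global ones.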
  • 26. Shuffle in UFS [Diagram: partition-local (node, parent) pairs are re-shuffled and merged across iterations; components that stop changing reach the termination condition while the remainder proceed to the next iteration]
  • 27. Union Find Shuffle using Spark ▪ Sheer scale of data at hand (25+ billion vertices & 30+ billion edges) ▪ Iterative processing with caching and intermittent checkpointing ▪ Limitations with other alternatives
  • 28. How do we scale? ▪ The input to Union Find Shuffle is bucketed to create 1,000 part files of similar size ▪ 10 instances of Union Find execute over the 1,000 part files with ~30 billion nodes; each instance of UF is applied to 100 part files ▪ At any given time, 5 instances of UF run in parallel
  • 30. Noisy Data [Diagram: one account (Acc 1) linked to cookies Coo 1 through Coo 100, and one cookie (Coo 1) linked to accounts Acc 1 through Acc 100] The cookie–token graph exposes noise, opportunities, and fragmentation in data
  • 31. An Example of Anomalous Linkage Data ▪ For some linkages, millions of entities map to the same token (id) ▪ In the data distribution, the majority of tokens map to 1–5 entities ▪ We also see a few tokens (potential outliers) mapped to millions of entities! Data distribution
  • 33. Removal of Anomalous Linkages ▪ Extensive analysis to identify anomalous linkage patterns ▪ A Gaussian anomaly detection model (statistical analysis) ▪ Identify thresholds of linkage cardinality to filter linkages ▪ A lenient threshold improves coverage at the cost of precision. Strike the balance between coverage and precision
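A minimal sketch of the cardinality-threshold idea, assuming a simple Gaussian model over the entities-per-token distribution (the `k` sigma multiplier is a hypothetical tuning knob, not the talk's actual threshold):

```python
import math

def cardinality_threshold(counts, k=3.0):
    """Cutoff = mean + k * stddev of the entities-per-token counts.

    A larger k is a more lenient threshold: better coverage, worse precision.
    """
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n
    return mean + k * math.sqrt(var)

def filter_linkages(token_to_entities, k=3.0):
    """Drop tokens whose entity cardinality exceeds the Gaussian cutoff."""
    counts = [len(v) for v in token_to_entities.values()]
    cutoff = cardinality_threshold(counts, k)
    return {t: e for t, e in token_to_entities.items() if len(e) <= cutoff}
```

With most tokens mapping to a handful of entities, a token mapped to thousands of entities falls far outside the fitted distribution and is filtered as a dirty linkage.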
  • 34. Threshold vs. # of big clusters vs. % match of facets / # of distinct entities Threshold 10 – high precision, low coverage (1 big cluster) • The majority of connected components are not big clusters, so the % of distinct entities outside the big cluster(s) will be high • More linkages are filtered out as dirty linkages, so the % match of facets will suffer Threshold 1000 – low precision, high coverage (4 big clusters) • More connected components form big clusters, so the % of distinct entities outside the big clusters will be lower • Only a few linkages are filtered out as dirty linkages, so the % match of facets will be high
  • 35. Large Connected Components (LCC) ▪ Sizes ranging from 10k to 100M+ ▪ A combination of hubs and long chains ▪ Caused by token collisions, noise in data, and bot traffic ▪ Legitimate users also belong to LCCs ▪ Cause a large number of shuffles in UFS A result of a lenient threshold
  • 36. Traversing LCC ▪ Business demands both precision and coverage, hence LCCs need traversal ▪ An iterative Spark BFS implementation is used to traverse LCCs ▪ Traversal is supported up to a pre-defined depth ▪ Going beyond a certain depth not only strains the system but also adds no business value ▪ Traversal is optimized to run in Spark in 20–30 minutes over all connected components Solution to get both precision and coverage
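The depth-limited BFS can be sketched single-machine as below. The adjacency format (token -> list of `(neighbor, edge_metadata)` pairs) and the `edge_filter` predicate are illustrative assumptions; the production version expands frontiers iteratively in Spark:

```python
from collections import deque

def traverse(adj, start, max_depth=4, edge_filter=None):
    """Depth-limited BFS over one connected component.

    Returns the tokens reachable from `start` within `max_depth` hops,
    optionally keeping only edges whose metadata passes `edge_filter`.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    reached = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # deeper hops strain the system for little business value
        for nbr, meta in adj.get(node, []):
            if nbr in seen or (edge_filter and not edge_filter(meta)):
                continue
            seen.add(nbr)
            reached.append(nbr)
            frontier.append((nbr, depth + 1))
    return reached
```

Passing a different `edge_filter` or `max_depth` answers a different business question over the same component, without rebuilding the graph.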
  • 37. Data Volume & Runtime
  • 38. Graph Pipeline [Diagram: 15 upstream tables feed 25B+ raw linkages and 30B+ nodes through table extraction & transformation (30 mins), Stage I handling heterogeneous linkages (1–2 hrs), Union Find Shuffle (8–10 hrs), subgraph creation (4–5 hrs), and Stage III for both small connected components (SCC) and large connected components (LCC); linkages with metadata are stored as adjacency lists keyed by tgid; traversal requests (e.g., "give all A–B linkages where criteria = m1, m2") run as per-tgid map jobs doing a BFS over the filtered adjacency lists, completing in 20–30 minutes]
  • 39. A peek into Real time Graph
  • 40. Real-time Graph: Challenges ▪ Linkages within streaming datasets ▪ New linkages require updating the graph in real time ▪ Concurrency – concurrent updates to the graph need to be handled to avoid deadlocks, starvation, etc. ▪ Scale ▪ High volume – e.g., clickstream data: as users browse the web page/app, new events are generated ▪ Replication and consistency – ensuring the data is properly replicated for fault tolerance and consistent for queries ▪ Real-time querying and traversals ▪ High-throughput traversal and querying capability on tokens belonging to the same customer