
Building Identity Graphs over Heterogeneous Data


In today's world, customers and service providers (e.g., social networks, ad targeting, retail) interact across a variety of modes and channels such as browsers, apps, and devices. In each such interaction, users are identified by a token (possibly a different token for each mode or channel). Examples of such identity tokens include cookies and app IDs. As a user engages more with these services, linkages are generated between tokens belonging to the same user; linkages connect multiple identity tokens together.

Published in: Data & Analytics

  1. 1. Building Identity Graphs over Heterogeneous Data – Sudha Viswanathan, Saigopal Thota
  2. 2. Primary Contributors
  3. 3. Agenda ▪ Identities at Scale ▪ Why Graph ▪ How we Built and Scaled it ▪ Why an In-house solution ▪ Challenges and Considerations ▪ A peek into Real time Graph
  4. 4. Identities at Scale
  5. 5. Identity Tokens [Diagram: token types across channels – Account IDs, Cookies, Online IDs, and Device IDs – spanning Online 1, Online 2, Partner Apps, and Apps 1-3]
  6. 6. Identity Resolution Aims to provide a coherent view of a customer and/or a household by unifying all customer identities across channels and subsidiaries. Provides a single notion of a customer.
  7. 7. Why Graph
  8. 8. Identities, Linkages & Metadata – An Example [Diagram: identity tokens (Device id, App id, Login id, Cookie id) connected by linkages and annotated with metadata such as Last login: 4/28/2020, App: YouTube, Country: Canada]
  9. 9. Graph – An Example [Diagram: the same tokens and metadata drawn as a graph] Connect all linkages to create a single connected component per user/household
  10. 10. Graph Traversal ▪ A graph is an efficient data structure relative to table joins ▪ Why don't table joins work? ▪ Linkages are in the order of millions of rows spanning hundreds of tables ▪ Table joins are index-based and computationally very expensive ▪ Table joins result in lower coverage Scalable and offers better coverage
  11. 11. Build once – Query multiple times ▪ Graph enables dynamic traversal logic. One graph offers infinite traversal possibilities ▪ Get all tokens linked to an entity ▪ Get all tokens linked to the entity's household ▪ Get all tokens linked to an entity that were created after Jan 2020 ▪ Get all tokens linked to the entity's household that interacted using App 1 ▪ ... Graph comes with flexibility in traversal
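The build-once, query-many idea above can be sketched with a small in-memory adjacency list: one graph, many traversal rules expressed as predicates. The token names, metadata keys, and predicate shapes below are illustrative assumptions, not the deck's actual schema.

```python
from collections import deque

# Toy connected component: adjacency list with per-edge metadata.
# Token ids and the "created" metadata key are hypothetical.
graph = {
    "login_1":  [("app_1", {"created": "2020-02-10"}), ("device_1", {"created": "2019-11-03"})],
    "app_1":    [("login_1", {"created": "2020-02-10"})],
    "device_1": [("login_1", {"created": "2019-11-03"}), ("cookie_1", {"created": "2020-03-15"})],
    "cookie_1": [("device_1", {"created": "2020-03-15"})],
}

def traverse(graph, start, edge_predicate=lambda meta: True):
    """Collect all tokens reachable from `start` over edges passing the predicate."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, meta in graph.get(node, []):
            if neighbor not in seen and edge_predicate(meta):
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}

# Same graph, two different traversal rules:
all_tokens = traverse(graph, "login_1")
recent = traverse(graph, "login_1", lambda m: m["created"] >= "2020-01-01")
```

The second call reuses the same graph but prunes edges created before 2020, illustrating how one build supports arbitrarily many traversal policies.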
  12. 12. How we Built and Scaled
  13. 13. Scale and Performance Objectives ▪ More than 25 billion linkages and identities ▪ New linkages created 24x7 ▪ Node and edge metadata updated for 60% of existing linkages ▪ Freshness – graph updated with linkages and metadata once a day ▪ Could be a few hours, per future goals ▪ Ability to run on general-purpose Hadoop infrastructure
  14. 14. Components of Identity Graph ▪ Data Analysis – understand your data and check for anomalies ▪ Handling Heterogeneous Data Sources – extract only new and modified linkages in the format needed by the next stage ▪ Core Processing ▪ Stage I – Dedup & Eliminate Outliers: add edge metadata, filter outliers, and populate tables needed by the next stage ▪ Stage II – Create Connected Components: merge linkages to form an identity graph for each customer ▪ Stage III – Prepare for Traversal: resolves linkages within a cluster and appends metadata to enable graph traversal ▪ Traversal – traverse the cluster per defined rules to pick only the qualified nodes
  15. 15. Data Analysis ▪ Understanding the data that feeds into the graph pipeline is paramount to building a usable graph framework ▪ Feeding in poor-quality linkages results in connected components spanning millions of nodes, taking a toll on computing resources and business value ▪ Some questions to analyze: ▪ Does the linkage relationship make business sense? ▪ What is an acceptable threshold for poor-quality linkages? ▪ Do we need to apply any filters? ▪ Nature of data – snapshot vs. incremental
  16. 16. Handling Heterogeneous Data Sources ▪ Data sources grow rapidly in volume and variety ▪ From a handful of manageable data streams to an intimidatingly magnificent Niagara Falls! ▪ Dedicated framework to ingest data in parallel from heterogeneous sources ▪ Serves only new and modified linkages – important for incremental processing ▪ Pulls only the desired attributes for further processing – linkages and their metadata in a standard schema
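A minimal sketch of the "only new and modified linkages" behavior, assuming a per-source watermark and hypothetical field names (`src`, `dst`, `updated_at`); the deck does not specify the actual mechanism or schema.

```python
def extract_incremental(records, watermark):
    """Return (normalized_linkages, new_watermark) for records changed after `watermark`."""
    fresh = [r for r in records if r["updated_at"] > watermark]
    # Normalize to one standard linkage schema regardless of source shape.
    normalized = [
        {"token_a": r["src"], "token_b": r["dst"], "meta": {"updated_at": r["updated_at"]}}
        for r in fresh
    ]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return normalized, new_watermark

records = [
    {"src": "cookie_9", "dst": "acct_3", "updated_at": 5},   # already processed
    {"src": "app_2",    "dst": "acct_3", "updated_at": 12},  # new since watermark
]
linkages, wm = extract_incremental(records, watermark=10)
```

Only the record newer than the watermark flows downstream, which is what makes the later stages incremental rather than full-snapshot.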
  17. 17. Core Processing – Stage I ▪ Feeds good-quality linkages to further processing ▪ It handles: ▪ Deduplication – if a linkage is repeated, we consume only the latest record ▪ Outlier elimination – filters anomalous linkages based on a threshold derived from data analysis ▪ Edge metadata population – attributes of the linkage, which help traverse the graph to get the desired linkages Dedup & Eliminate Outliers
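The dedup and outlier steps of Stage I can be sketched in a few lines of plain Python; at production scale this runs on Spark, and the field names and cardinality threshold here are illustrative assumptions.

```python
from collections import Counter

def stage_one(linkages, max_cardinality):
    # Deduplicate: for a repeated (token_a, token_b) pair, keep only the latest record.
    latest = {}
    for l in linkages:
        key = (l["token_a"], l["token_b"])
        if key not in latest or l["ts"] > latest[key]["ts"]:
            latest[key] = l
    deduped = list(latest.values())
    # Outlier elimination: drop linkages whose token is linked to too many partners.
    degree = Counter()
    for l in deduped:
        degree[l["token_a"]] += 1
        degree[l["token_b"]] += 1
    return [l for l in deduped
            if degree[l["token_a"]] <= max_cardinality
            and degree[l["token_b"]] <= max_cardinality]

data = [
    {"token_a": "c1",  "token_b": "a1", "ts": 1},
    {"token_a": "c1",  "token_b": "a1", "ts": 2},   # duplicate pair, newer wins
    {"token_a": "bot", "token_b": "e1", "ts": 1},   # "bot" exceeds the threshold below
    {"token_a": "bot", "token_b": "e2", "ts": 1},
    {"token_a": "bot", "token_b": "e3", "ts": 1},
    {"token_a": "bot", "token_b": "e4", "ts": 1},
]
clean = stage_one(data, max_cardinality=3)
```

The hub token "bot" and all its linkages are filtered out, while the duplicated c1-a1 pair survives as a single, latest record.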
  18. 18. Core Processing – Stage II ▪ Merges all related linkages of a customer to create a Connected Component Create Connected Components
  19. 19. Core Processing – Stage III ▪ This stage enriches the connected component with linkages between nodes and edge metadata to enable graph traversal. Prepare for Traversal [Diagram: a bare connected component (Login id, App id, Device id, Cookie id) next to the same component enriched with edge metadata such as Last login: 4/28/2020, App: YouTube, Country: Canada]
  20. 20. [Diagram: worked example – raw linkages (A-B, B-C, B-D, D-E, P-N, N-M) pass through Stage I (dedup & outlier elimination); Stage II groups them into connected components G1 (A, B, C, D, E) and G2 (M, N, P); Stage III attaches edge metadata m1-m4 to prepare for traversal]
  21. 21. Union Find Shuffle (UFS): Building Connected Components at Scale
  22. 22. Weighted Union Find with Path Compression [Diagram: worked example – unioning 2-5 and then 2-9 with weighted union keeps the tree height at 1 (top-level parent 2), whereas a non-weighted union would give height 2; after merging in 7-8, 5-7, and 1, path compression links all children directly to the top-level parent 2, giving one tree of cluster size 5]
  23. 23. • Find() – finds the top-level parent. If a is the child of b and b is the child of c, then find() determines that c is the top-level parent: a -> b -> c => c is the top parent • Path compression – reduces the height of a connected component by linking all children directly to the top-level parent: a -> b -> c becomes a -> c; b -> c • Weighted union – unifies top-level parents. The parent with fewer children is made a child of the parent with more children. This also helps reduce the height of connected components.
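The three operations above combine into a minimal weighted (union-by-size) union-find with path compression. This is the standard textbook structure, not the deck's exact code.

```python
class UnionFind:
    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        """Find the top-level parent of x, compressing the path along the way."""
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:      # path compression: link children to root
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        """Weighted union: attach the smaller cluster under the larger one."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

# Replaying the slide's example edges: 2-5, 2-9, 7-8, then 5-7.
uf = UnionFind()
for a, b in [(2, 5), (2, 9), (7, 8), (5, 7)]:
    uf.union(a, b)
```

After the last union, both clusters share top-level parent 2, matching the slide's merged tree of size 5.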
  24. 24. Distributed UFS with Path Compression ▪ Divide – divide the data into partitions ▪ Run UF – run weighted union find with path compression on each partition ▪ Shuffle – merge locally processed partitions with a global shuffle, iteratively, until all connected components are resolved ▪ Path Compress – iteratively perform path compression on connected components until all are path compressed
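The four steps can be simulated in a single process, with plain Python dicts standing in for Spark partitions (an assumption for illustration; the real pipeline shuffles at cluster scale). Node labels must be comparable, since the smaller label is used as the root.

```python
def local_union_find(edges):
    """Run union-find on one partition's edges; return a parent map."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)   # smaller label becomes the root
    return parent

def shuffle_merge(partition_results):
    """Merge per-partition parent maps, then fully path-compress the result."""
    merged = local_union_find(
        (node, root) for parent in partition_results for node, root in parent.items()
    )
    def find(x):
        while merged[x] != x:
            merged[x] = merged[merged[x]]
            x = merged[x]
        return x
    return {node: find(node) for node in merged}

# Divide: two partitions that each see only part of one connected component.
p1 = local_union_find([(1, 2), (3, 4)])   # Run UF on partition 1
p2 = local_union_find([(2, 3), (5, 6)])   # Run UF on partition 2
components = shuffle_merge([p1, p2])      # Shuffle + Path Compress
```

Neither partition alone can tell that nodes 1-4 form one component; the shuffle step resolves it by re-running union-find over the partial parent chains.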
  25. 25. Shuffle in UFS [Diagram: partial parent chains are shuffled and merged across iterations; processing proceeds to the next iteration until the termination condition is reached]
  26. 26. Union Find Shuffle using Spark ▪ Sheer scale of data at hand (25+ billion vertices & 30+ billion edges) ▪ Iterative processing with caching and intermittent checkpointing ▪ Limitations of other alternatives
  27. 27. How do we scale? ▪ The input to Union Find Shuffle is bucketed into 1000 part files of similar size ▪ 10 instances of Union Find execute over the 1000 part files (~30 billion nodes), each instance handling 100 part files ▪ At any given time, 5 instances of UF run in parallel
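One common way to get similarly sized part files is stable hash bucketing. The deck does not specify its bucketing key; this sketch assumes the linkage's (order-independent) token pair is hashed into one of 1000 buckets.

```python
import hashlib

def bucket_of(token_a, token_b, num_buckets=1000):
    """Stable bucket for a linkage: same pair -> same bucket, in either order."""
    key = f"{min(token_a, token_b)}|{max(token_a, token_b)}".encode()
    # md5 gives a uniform, machine-independent hash (unlike Python's built-in hash()).
    return int(hashlib.md5(key).hexdigest(), 16) % num_buckets
```

Because the assignment is deterministic, re-runs and incremental loads route a given linkage to the same part file, which keeps each union-find instance's input stable.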
  28. 28. Data Quality: Challenges and Considerations
  29. 29. Noisy Data [Diagram: one account token linked to ~100 cookie tokens, and one cookie token linked to ~100 account tokens] Graph exposes noise, opportunities, and fragmentation in data
  30. 30. An Example of Anomalous Linkage Data ▪ For some linkages, we have millions of entities mapping to the same token (id) ▪ In the data distribution, we see the majority of tokens mapped to 1-5 entities ▪ We also see a few tokens (potential outliers) mapped to millions of entities! [Chart: distribution of entities per token]
  31. 31. Removal of Anomalous Linkages ▪ Extensive analysis to identify anomalous linkage patterns ▪ A Gaussian anomaly detection model (statistical analysis) ▪ Identify thresholds of linkage cardinality to filter linkages ▪ A lenient threshold improves coverage at the cost of precision. Strike the balance between coverage and precision
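A minimal sketch of a Gaussian threshold on linkage cardinality (entities per token): fit a mean and standard deviation, then flag tokens far above the mean. The 2-sigma cutoff and the toy distribution are illustrative assumptions, not the deck's actual parameters.

```python
from statistics import mean, stdev

def cardinality_threshold(cardinalities, n_sigma=2):
    """Cutoff above which a token's cardinality is treated as anomalous."""
    mu, sigma = mean(cardinalities), stdev(cardinalities)
    return mu + n_sigma * sigma

# Most tokens map to 1-5 entities; one hub token maps to a million.
cards = [1, 2, 1, 3, 2, 1, 2, 1, 2, 1_000_000]
cutoff = cardinality_threshold(cards)
outliers = [c for c in cards if c > cutoff]
```

Note that an extreme hub inflates both the mean and the standard deviation, so in practice the sigma multiplier has to be tuned against the data, which is exactly the coverage-vs-precision trade-off on the next slide.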
  32. 32. Choosing the Threshold [Chart: # of big clusters and % match of facets vs. 1 / # of distinct entities, as the threshold varies] ▪ Threshold 10 – high precision, low coverage (1 big cluster) ▪ The majority of connected components are not big clusters, so the % of distinct entities outside the big cluster(s) will be high ▪ More linkages are filtered out as dirty linkages, so the % match of facets will suffer ▪ Threshold 1000 – low precision, high coverage (4 big clusters) ▪ More connected components form big clusters, so the % of distinct entities outside the big clusters will be lower ▪ Only a few linkages are filtered out as dirty linkages, so the % match of facets will be high
  33. 33. Large Connected Components (LCC) ▪ Size ranging from 10K to 100M+ ▪ A combination of hubs and long chains ▪ Caused by token collisions, noise in data, and bot traffic ▪ Legitimate users also belong to LCCs ▪ Cause a large number of shuffles in UFS A result of a lenient threshold
  34. 34. Traversing LCC ▪ Business demands both precision and coverage, hence LCCs need traversal ▪ An iterative Spark BFS implementation is used to traverse LCCs ▪ Traversal is supported up to a certain pre-defined depth ▪ Going beyond a certain depth not only strains the system but also adds no business value ▪ Traversal is optimized to run in Spark in 20-30 minutes over all connected components Solution to get both precision and coverage
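The depth cutoff can be shown on a single machine; the deck's implementation is an iterative Spark BFS, and this sketch only demonstrates the pre-defined-depth logic on a toy chain.

```python
from collections import deque

def bfs_limited(adj, start, max_depth):
    """Return {token: depth} for tokens within `max_depth` hops of `start`."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue                      # stop expanding at the depth cutoff
        for neighbor in adj.get(node, []):
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths

# A long chain: a - b - c - d; with max_depth=2, token d is never reached.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
reachable = bfs_limited(chain, "a", max_depth=2)
```

On a 100M-node LCC, this cutoff is what bounds the work per traversal request regardless of component size.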
  35. 35. Data Volume & Runtime
  36. 36. Graph Pipeline [Diagram: end-to-end pipeline – 15 upstream tables provide 25B+ raw linkages and 30B+ nodes; table extraction & transformation (~30 mins) feeds Stage I (1-2 hrs), Union Find Shuffle (8-10 hrs), and subgraph creation (4-5 hrs); Stage III outputs per-tgid linkages as adjacency lists with edge metadata, e.g., tgid1 {tgid: 1, tid: [aid, bid], edges: [srcid, destid, metadata]}; traversal requests (e.g., "give all A-B linkages where criteria = m1, m2", or "all aid-bid linkages which go via cid") run as one map per tgid, doing a BFS (undirected/bidirected) per tid; LCC traversal filters tids on the criteria and runs MR on the filtered tgid partitions, 20-30 mins overall]
  37. 37. A peek into Real time Graph
  38. 38. Real-time Graph: Challenges ▪ Linkages arrive within streaming datasets ▪ New linkages require updating the graph in real time ▪ Concurrency – concurrent updates to the graph need to be handled to avoid deadlocks, starvation, etc. ▪ Scale – high volume, e.g., clickstream data: as users browse the webpage/app, new events are generated ▪ Replication and consistency – making sure the data is properly replicated for fault tolerance and is consistent for queries ▪ Real-time querying and traversals – high-throughput traversal and querying capability over tokens belonging to the same customer
  39. 39. Questions?