SlideShare a Scribd company logo
SEARCH, NETWORK AND ANALYTICS
Using Set Cover to Optimize a Large-
Scale Low Latency Distributed Graph
LinkedIn Presentation Template
©2013 LinkedIn Corporation. All Rights Reserved. 1
Rui Wang
Staff Software Engineer
SEARCH, NETWORK AND ANALYTICS
Outline
 LinkedIn’s Distributed Graph Infrastructure
 The Scaling Problem
 Set Cover by Example
 Evaluation
 Related Work
 Conclusions and Future Work
©2013 LinkedIn Corporation. All Rights Reserved. 2
SEARCH, NETWORK AND ANALYTICS
Outline
 LinkedIn’s Distributed Graph Infrastructure
 The Scaling Problem
 Set Cover by Example
 Evaluation
 Related Work
 Conclusions and Future Work
©2013 LinkedIn Corporation. All Rights Reserved. 3
SEARCH, NETWORK AND ANALYTICS
LinkedIn Products Backed By Social Graph
©2013 LinkedIn Corporation. All Rights Reserved. 4
SEARCH, NETWORK AND ANALYTICS
LinkedIn’s Distributed Graph APIs
 Graph APIs
– Get Connections
 “Who does member X know?”
– Get Shared Connections
 “Who do I know in common with member Y”?
– Get Graph Distances
 “What’s the graph distances for each member returned from a search
query?”
 “Is member Z within my second degree network so that I can view her
profile?”
 Over 50% queries is to get graph distances
©2013 LinkedIn Corporation. All Rights Reserved. 5
SEARCH, NETWORK AND ANALYTICS
Distributed Graph Architecture Diagram
©2013 LinkedIn Corporation. All Rights Reserved. 6
SEARCH, NETWORK AND ANALYTICS
LinkedIn’s Distributed Graph Infrastructure
 Graph Database (Graph DB)
– Partitioned and replicated graph database
– Distributed adjacency list
 Distributed Network Cache (NCS)
– LRU cache stores second degree network for active members
– Graph traversals are converted to set intersections
 Graph APIs
– Get Connections
– Get Shared Connections
– Get Graph Distances
©2013 LinkedIn Corporation. All Rights Reserved. 7
SEARCH, NETWORK AND ANALYTICS
Outline
 LinkedIn’s Distributed Graph Infrastructure
 The Scaling Problem
 Set Cover by Example
 Evaluation
 Related Work
 Conclusions and Future Work
©2013 LinkedIn Corporation. All Rights Reserved. 8
SEARCH, NETWORK AND ANALYTICS
Graph Distance Queries and Second Degree Creation
©2013 LinkedIn Corporation. All Rights Reserved. 9
get graph distances
cache lookup
[exists=false] retrieve
1st degree connections
[exists=true] set
intersections
Partition IDs
partial merges
K-Way
merges
return
return
SEARCH, NETWORK AND ANALYTICS
The Scaling Problem Illustrated
©2013 LinkedIn Corporation. All Rights Reserved. 10
Graph DBNCS
• Increased Query Distribution
• Merging and De-duping shift to single NCS node
SEARCH, NETWORK AND ANALYTICS
Outline
 LinkedIn’s Distributed Graph Infrastructure
 The Scaling Problem
 Set Cover and Example
 Evaluation
 Related Work
 Conclusions and Future Work
©2013 LinkedIn Corporation. All Rights Reserved. 11
SEARCH, NETWORK AND ANALYTICS
Set Cover Problem
 Definition
– Given a set of elements U = {1, 2, …, m } (called the universe) and a
family S of n sets whose union equals the universe, the set cover
problem is to identify the smallest subset of S the union of which
contains all elements in the universe.
 Greedy Algorithm
– Rule to select sets
 At each stage, choose the set that contains the largest number of
uncovered elements.
– Upper bound
 O(log s), where s is the size of the largest set from S
©2013 LinkedIn Corporation. All Rights Reserved. 12
SEARCH, NETWORK AND ANALYTICS
Modified Set Cover Algorithm Example Cont.
©2013 LinkedIn Corporation. All Rights Reserved. 13
 Build a map of partition ID -> Graph DB nodes for the input graph
SEARCH, NETWORK AND ANALYTICS
Modified Set Cover Algorithm Example Cont.
©2013 LinkedIn Corporation. All Rights Reserved. 14
 Randomly pick an element
from U = { 1,2,5,6 }
– e = 5
 Retrieve from map
– nodes = { R21, R13 }
 Intersect
– R21 = {1,5 } with U,
– R13 = { 5,6 } with U
 Select R21
– U = {2,6}, C = { R21}
SEARCH, NETWORK AND ANALYTICS
Modified Set Cover Algorithm Example Cont.
©2013 LinkedIn Corporation. All Rights Reserved. 15
 Randomly pick an element
from U = { 2,6 }
– e = 2
 Retrieve from map
– nodes = { R11, R22 }
 Intersect
– R11 = {1,2 } with U,
– R22 = { 2,4 } with U
 Select R22
– U = {6}, C = { R21, R22 }
SEARCH, NETWORK AND ANALYTICS
Modified Set Cover Algorithm Example Cont.
©2013 LinkedIn Corporation. All Rights Reserved. 16
 Pick the final element from
U = { 6 }
– e = 6
 Retrieve from map
– nodes = { R23, R13 }
 Intersect
– R23 = { 3,6 } with U,
– R13 = { 5,6 } with U
 Select R23
– U = {}, C = {R21, R22, R23 }
SEARCH, NETWORK AND ANALYTICS
Modified Set Cover Algorithm Example Solution
 Solution Compared to Optimal Result for U , where U = {1,2,5,6}
©2013 LinkedIn Corporation. All Rights Reserved. 17
SEARCH, NETWORK AND ANALYTICS
Outline
 LinkedIn’s Distributed Graph Infrastructure
 The Scaling Problem
 Set Cover by Example
 Evaluation
 Related Work
 Conclusions and Future Work
©2013 LinkedIn Corporation. All Rights Reserved. 18
SEARCH, NETWORK AND ANALYTICS
Evaluation
 Production Results
– Second degree cache creation drops 38% on 99th percentile
– Graph distance calculation drops 25% on 99th percentile
– Outbound traffic drops 40%; Inbound traffic drops 10%
©2013 LinkedIn Corporation. All Rights Reserved. 19
95th quantile 99th quantile
0
1
2
3
control setcover control setcover
latency(normalized)
network cache building
95th quantile 99th quantile
0
1
2
3
control setcover control setcover
latency(normalized)
getdistances
SEARCH, NETWORK AND ANALYTICS
Outline
 LinkedIn’s Distributed Graph Infrastructure
 The Scaling Problem
 Set Cover by Example
 Evaluation
 Related Work
 Conclusions and Future Work
©2013 LinkedIn Corporation. All Rights Reserved. 20
SEARCH, NETWORK AND ANALYTICS
Related Work
 Scaling through Replications
– Collocating neighbors [Pujol2010]
– Replication based on read/write frequencies [Mondal2012]
– Replication based on locality [Carrasco2011]
 Multi-Core Implementations
– Parallel graph exploration [Hong2011]
 Offline Graph Systems
– Google’s Pregel [Malewicz2010]
– Distributed GraphLab [Low2012]
©2013 LinkedIn Corporation. All Rights Reserved. 21
SEARCH, NETWORK AND ANALYTICS
Outline
 LinkedIn’s Distributed Graph Infrastructure
 The Scaling Problem
 Set Cover by Example
 Evaluation
 Related Work
 Conclusions and Future Work
©2013 LinkedIn Corporation. All Rights Reserved. 22
SEARCH, NETWORK AND ANALYTICS
Conclusions and Future Work
 Future Work
– Incremental cache updates
– Replication on GraphDB nodes through LRU caching
– New graph traversal algorithms
 Conclusions
– Key challenges tackled
 Work distribution balancing
 Communication Bandwidth
– Set cover optimized latency for distributed query by
 Identifying a much smaller set of GraphDB nodes serving queries
 Shifting workload to GraphDB to utilize parallel powers
 Alleviating the K-way merging costs on NCS by reducing K
– Available at
https://github.com/linkedin-
sna/norbert/blob/master/network/src/main/scala/com/linkedin/norbert/network/partitioned
/loadbalancer/DefaultPartitionedLoadBalancerFactory.scala
©2013 LinkedIn Corporation. All Rights Reserved. 23
SEARCH, NETWORK AND ANALYTICS©2013 LinkedIn Corporation. All Rights Reserved. 24

More Related Content

Similar to Using Set Cover to Optimize a Large-Scale Low Latency Distributed Graph

Optimizing Your Supply Chain with Neo4j
Optimizing Your Supply Chain with Neo4jOptimizing Your Supply Chain with Neo4j
Optimizing Your Supply Chain with Neo4j
Neo4j
 
Unit 3- Software Design.pptx
Unit 3- Software Design.pptxUnit 3- Software Design.pptx
Unit 3- Software Design.pptx
LSURYAPRAKASHREDDY
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
Ted Dunning
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUG
MapR Technologies
 
K anonymity for crowdsourcing database
K anonymity for crowdsourcing databaseK anonymity for crowdsourcing database
K anonymity for crowdsourcing database
LeMeniz Infotech
 
Using Linkurious in your Enterprise Architecture projects
Using Linkurious in your Enterprise Architecture projectsUsing Linkurious in your Enterprise Architecture projects
Using Linkurious in your Enterprise Architecture projects
Linkurious
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataEUCLID project
 
GraphTour 2020 - Graphs & AI: A Path for Data Science
GraphTour 2020 - Graphs & AI: A Path for Data ScienceGraphTour 2020 - Graphs & AI: A Path for Data Science
GraphTour 2020 - Graphs & AI: A Path for Data Science
Neo4j
 
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET Journal
 
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
Neo4j
 
The Neo4j Data Platform for Today & Tomorrow.pdf
The Neo4j Data Platform for Today & Tomorrow.pdfThe Neo4j Data Platform for Today & Tomorrow.pdf
The Neo4j Data Platform for Today & Tomorrow.pdf
Neo4j
 
El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...
El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...
El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...
Neo4j
 
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduceA simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
IRJET Journal
 
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
Keiichiro Ono
 
Metric Recovery from Unweighted k-NN Graphs
Metric Recovery from Unweighted k-NN GraphsMetric Recovery from Unweighted k-NN Graphs
Metric Recovery from Unweighted k-NN Graphs
joisino
 
A Literature Survey on Image Linguistic Visual Question Answering
A Literature Survey on Image Linguistic Visual Question AnsweringA Literature Survey on Image Linguistic Visual Question Answering
A Literature Survey on Image Linguistic Visual Question Answering
IRJET Journal
 
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
Neo4j
 
Partial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsPartial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather Conditions
IRJET Journal
 
Optimizing Your Supply Chain with the Neo4j Graph
Optimizing Your Supply Chain with the Neo4j GraphOptimizing Your Supply Chain with the Neo4j Graph
Optimizing Your Supply Chain with the Neo4j Graph
Neo4j
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
Nesreen K. Ahmed
 

Similar to Using Set Cover to Optimize a Large-Scale Low Latency Distributed Graph (20)

Optimizing Your Supply Chain with Neo4j
Optimizing Your Supply Chain with Neo4jOptimizing Your Supply Chain with Neo4j
Optimizing Your Supply Chain with Neo4j
 
Unit 3- Software Design.pptx
Unit 3- Software Design.pptxUnit 3- Software Design.pptx
Unit 3- Software Design.pptx
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUG
 
K anonymity for crowdsourcing database
K anonymity for crowdsourcing databaseK anonymity for crowdsourcing database
K anonymity for crowdsourcing database
 
Using Linkurious in your Enterprise Architecture projects
Using Linkurious in your Enterprise Architecture projectsUsing Linkurious in your Enterprise Architecture projects
Using Linkurious in your Enterprise Architecture projects
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked Data
 
GraphTour 2020 - Graphs & AI: A Path for Data Science
GraphTour 2020 - Graphs & AI: A Path for Data ScienceGraphTour 2020 - Graphs & AI: A Path for Data Science
GraphTour 2020 - Graphs & AI: A Path for Data Science
 
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
 
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
 
The Neo4j Data Platform for Today & Tomorrow.pdf
The Neo4j Data Platform for Today & Tomorrow.pdfThe Neo4j Data Platform for Today & Tomorrow.pdf
The Neo4j Data Platform for Today & Tomorrow.pdf
 
El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...
El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...
El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...
 
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduceA simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
 
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
 
Metric Recovery from Unweighted k-NN Graphs
Metric Recovery from Unweighted k-NN GraphsMetric Recovery from Unweighted k-NN Graphs
Metric Recovery from Unweighted k-NN Graphs
 
A Literature Survey on Image Linguistic Visual Question Answering
A Literature Survey on Image Linguistic Visual Question AnsweringA Literature Survey on Image Linguistic Visual Question Answering
A Literature Survey on Image Linguistic Visual Question Answering
 
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
 
Partial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsPartial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather Conditions
 
Optimizing Your Supply Chain with the Neo4j Graph
Optimizing Your Supply Chain with the Neo4j GraphOptimizing Your Supply Chain with the Neo4j Graph
Optimizing Your Supply Chain with the Neo4j Graph
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
 

Recently uploaded

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 

Using Set Cover to Optimize a Large-Scale Low Latency Distributed Graph

  • 1. SEARCH, NETWORK AND ANALYTICS Using Set Cover to Optimize a Large- Scale Low Latency Distributed Graph LinkedIn Presentation Template ©2013 LinkedIn Corporation. All Rights Reserved. 1 Rui Wang Staff Software Engineer
  • 2. SEARCH, NETWORK AND ANALYTICS Outline  LinkedIn’s Distributed Graph Infrastructure  The Scaling Problem  Set Cover by Example  Evaluation  Related Work  Conclusions and Future Work ©2013 LinkedIn Corporation. All Rights Reserved. 2
  • 3. SEARCH, NETWORK AND ANALYTICS Outline  LinkedIn’s Distributed Graph Infrastructure  The Scaling Problem  Set Cover by Example  Evaluation  Related Work  Conclusions and Future Work ©2013 LinkedIn Corporation. All Rights Reserved. 3
  • 4. SEARCH, NETWORK AND ANALYTICS LinkedIn Products Backed By Social Graph ©2013 LinkedIn Corporation. All Rights Reserved. 4
  • 5. SEARCH, NETWORK AND ANALYTICS LinkedIn’s Distributed Graph APIs  Graph APIs – Get Connections  “Who does member X know?” – Get Shared Connections  “Who do I know in common with member Y”? – Get Graph Distances  “What’s the graph distances for each member returned from a search query?”  “Is member Z within my second degree network so that I can view her profile?”  Over 50% queries is to get graph distances ©2013 LinkedIn Corporation. All Rights Reserved. 5
  • 6. SEARCH, NETWORK AND ANALYTICS Distributed Graph Architecture Diagram ©2013 LinkedIn Corporation. All Rights Reserved. 6
  • 7. SEARCH, NETWORK AND ANALYTICS LinkedIn’s Distributed Graph Infrastructure  Graph Database (Graph DB) – Partitioned and replicated graph database – Distributed adjacency list  Distributed Network Cache (NCS) – LRU cache stores second degree network for active members – Graph traversals are converted to set intersections  Graph APIs – Get Connections – Get Shared Connections – Get Graph Distances ©2013 LinkedIn Corporation. All Rights Reserved. 7
  • 8. SEARCH, NETWORK AND ANALYTICS Outline  LinkedIn’s Distributed Graph Infrastructure  The Scaling Problem  Set Cover by Example  Evaluation  Related Work  Conclusions and Future Work ©2013 LinkedIn Corporation. All Rights Reserved. 8
  • 9. SEARCH, NETWORK AND ANALYTICS Graph Distance Queries and Second Degree Creation ©2013 LinkedIn Corporation. All Rights Reserved. 9 get graph distances cache lookup [exists=false] retrieve 1st degree connections [exists=true] set intersections Partition IDs partial merges K-Way merges return return
  • 10. SEARCH, NETWORK AND ANALYTICS The Scaling Problem Illustrated ©2013 LinkedIn Corporation. All Rights Reserved. 10 Graph DBNCS • Increased Query Distribution • Merging and De-duping shift to single NCS node
  • 11. SEARCH, NETWORK AND ANALYTICS Outline  LinkedIn’s Distributed Graph Infrastructure  The Scaling Problem  Set Cover and Example  Evaluation  Related Work  Conclusions and Future Work ©2013 LinkedIn Corporation. All Rights Reserved. 11
  • 12. SEARCH, NETWORK AND ANALYTICS Set Cover Problem  Definition – Given a set of elements U = {1, 2, …, m } (called the universe) and a family S of n sets whose union equals the universe, the set cover problem is to identify the smallest subset of S the union of which contains all elements in the universe.  Greedy Algorithm – Rule to select sets  At each stage, choose the set that contains the largest number of uncovered elements. – Upper bound  O(log s), where s is the size of the largest set from S ©2013 LinkedIn Corporation. All Rights Reserved. 12
  • 13. SEARCH, NETWORK AND ANALYTICS Modified Set Cover Algorithm Example Cont. ©2013 LinkedIn Corporation. All Rights Reserved. 13  Build a map of partition ID -> Graph DB nodes for the input graph
  • 14. SEARCH, NETWORK AND ANALYTICS Modified Set Cover Algorithm Example Cont. ©2013 LinkedIn Corporation. All Rights Reserved. 14  Randomly pick an element from U = { 1,2,5,6 } – e = 5  Retrieve from map – nodes = { R21, R13 }  Intersect – R21 = {1,5 } with U, – R13 = { 5,6 } with U  Select R21 – U = {2,6}, C = { R21}
  • 15. SEARCH, NETWORK AND ANALYTICS Modified Set Cover Algorithm Example Cont. ©2013 LinkedIn Corporation. All Rights Reserved. 15  Randomly pick an element from U = { 2,6 } – e = 2  Retrieve from map – nodes = { R11, R22 }  Intersect – R11 = {1,2 } with U, – R22 = { 2,4 } with U  Select R22 – U = {6}, C = { R21, R22 }
  • 16. SEARCH, NETWORK AND ANALYTICS Modified Set Cover Algorithm Example Cont. ©2013 LinkedIn Corporation. All Rights Reserved. 16  Pick the final element from U = { 6 } – e = 6  Retrieve from map – nodes = { R23, R13 }  Intersect – R23 = { 3,6 } with U, – R13 = { 5,6 } with U  Select R23 – U = {}, C = {R21, R22, R23 }
  • 17. SEARCH, NETWORK AND ANALYTICS Modified Set Cover Algorithm Example Solution  Solution Compared to Optimal Result for U , where U = {1,2,5,6} ©2013 LinkedIn Corporation. All Rights Reserved. 17
  • 18. SEARCH, NETWORK AND ANALYTICS Outline  LinkedIn’s Distributed Graph Infrastructure  The Scaling Problem  Set Cover by Example  Evaluation  Related Work  Conclusions and Future Work ©2013 LinkedIn Corporation. All Rights Reserved. 18
  • 19. SEARCH, NETWORK AND ANALYTICS Evaluation  Production Results – Second degree cache creation drops 38% on 99th percentile – Graph distance calculation drops 25% on 99th percentile – Outbound traffic drops 40%; Inbound traffic drops 10% ©2013 LinkedIn Corporation. All Rights Reserved. 19 95th quantile 99th quantile 0 1 2 3 control setcover control setcover latency(normalized) network cache building 95th quantile 99th quantile 0 1 2 3 control setcover control setcover latency(normalized) getdistances
  • 20. SEARCH, NETWORK AND ANALYTICS Outline  LinkedIn’s Distributed Graph Infrastructure  The Scaling Problem  Set Cover by Example  Evaluation  Related Work  Conclusions and Future Work ©2013 LinkedIn Corporation. All Rights Reserved. 20
  • 21. SEARCH, NETWORK AND ANALYTICS Related Work  Scaling through Replications – Collocating neighbors [Pujol2010] – Replication based on read/write frequencies [Mondal2012] – Replication based on locality [Carrasco2011]  Multi-Core Implementations – Parallel graph exploration [Hong2011]  Offline Graph Systems – Google’s Pregel [Malewicz2010] – Distributed GraphLab [Low2012] ©2013 LinkedIn Corporation. All Rights Reserved. 21
  • 22. SEARCH, NETWORK AND ANALYTICS Outline  LinkedIn’s Distributed Graph Infrastructure  The Scaling Problem  Set Cover by Example  Evaluation  Related Work  Conclusions and Future Work ©2013 LinkedIn Corporation. All Rights Reserved. 22
  • 23. SEARCH, NETWORK AND ANALYTICS Conclusions and Future Work  Future Work – Incremental cache updates – Replication on GraphDB nodes through LRU caching – New graph traversal algorithms  Conclusions – Key challenges tackled  Work distribution balancing  Communication Bandwidth – Set cover optimized latency for distributed query by  Identifying a much smaller set of GraphDB nodes serving queries  Shifting workload to GraphDB to utilize parallel powers  Alleviating the K-way merging costs on NCS by reducing K – Available at https://github.com/linkedin- sna/norbert/blob/master/network/src/main/scala/com/linkedin/norbert/network/partitioned /loadbalancer/DefaultPartitionedLoadBalancerFactory.scala ©2013 LinkedIn Corporation. All Rights Reserved. 23
  • 24. SEARCH, NETWORK AND ANALYTICS©2013 LinkedIn Corporation. All Rights Reserved. 24