The document proposes a 3-phase algorithm to compute the metric backbone of a weighted graph to improve the performance of graph algorithms and queries. Phase 1 finds 1st-order semi-metric edges by only examining triangles. Phase 2 identifies metric edges in 2-hop paths. Phase 3 runs BFS to label remaining edges. The algorithm removes up to 90% of semi-metric edges and scales to billion-edge graphs. Real-world graphs exhibit significant semi-metricity, and the backbone provides up to 6x speedups for graph queries and analytics.
1. THE SHORTEST PATH IS NOT
ALWAYS A STRAIGHT LINE
leveraging semi-metricity in large-scale graph analysis
Vasiliki Kalavri (kalavri@kth.se) KTH Royal Institute of Technology
Tiago Simas (tiago.simas@telefonica.com) Telefonica Research
Dionysios Logothetis (dionysios@fb.com) Facebook
4. THE METRIC BACKBONE
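Reduces the graph size while maintaining relevant structure
The minimum subgraph of a weighted graph that preserves the shortest paths of the original graph
[Figure: example weighted graph on nodes A, B, C, D, E with edge weights 2, 3, 10, 4, 2, 1, shown next to its metric backbone with edge weights 2, 3, 2, 1]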
6. WHAT CAN WE USE IT FOR?
• Exact computations
• any algorithm that depends on the shortest paths
• reachability, connectivity
• betweenness centrality, closeness centrality
• Approximation
• PageRank, random walks
• eigenvector centrality
• community detection, clustering
Improves community detection modularity and recommender systems accuracy
7. IMPACT ON LARGE-SCALE SYSTEMS
• Graph Databases
• fewer edges => smaller path search space
• Batch Graph Processing
• CPU and memory requirements depend on #messages
• #messages proportional to #edges
• fewer edges => improved analysis performance
• Graph Compression
• fewer edges => storage reduction
12. SEMI-METRICITY
In a weighted graph, an edge is semi-metric if there exists a shorter indirect path between its endpoints
[Figure: example weighted graph on nodes A, B, C, D, E with edge weights 2, 3, 10, 4, 2, 1]
• CE is 1st-order semi-metric: C-D-E is a shorter 2-hop path
• AD is 2nd-order semi-metric: A-B-C-D is a shorter 3-hop path
• AB, BC, CD, DE are metric
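The definition above can be checked directly for a single edge. The following is a minimal sketch, not the authors' implementation. The transcript lists the weights 2, 3, 10, 4, 2, 1 but not which edge carries which, so the assignment in `EDGES` is an assumption chosen to agree with the labels on the slide. The helper names are hypothetical, and weights are assumed to be positive.

```python
import heapq

# Assumed weight assignment for the slide's example graph (not given in the transcript).
EDGES = {
    ("A", "B"): 2, ("B", "C"): 3, ("C", "D"): 2,
    ("D", "E"): 1, ("A", "D"): 10, ("C", "E"): 4,
}

def build_adjacency(edges):
    """Turn an undirected edge dict into a nested adjacency dict."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def indirect_distance(adj, src, dst):
    """Length of the shortest src-dst path that avoids the direct edge src-dst."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            if u == src and v == dst:
                continue  # skip the direct edge itself
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

def is_semi_metric(adj, u, v):
    """An edge is semi-metric if some indirect path is strictly shorter."""
    return indirect_distance(adj, u, v) < adj[u][v]

adj = build_adjacency(EDGES)
semi_metric = sorted(e for e in EDGES if is_semi_metric(adj, *e))
```

Under the assumed weights, `semi_metric` contains exactly AD and CE, which agrees with the slide. Running one such search per edge is the cost that the backbone algorithm tries to avoid.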
15. BACKBONE CALCULATION
• Calculating the backbone:
• find all semi-metric edges: 1 BFS per edge?
• compute APSP and store O(N^2) paths
Can we calculate or approximate the backbone without solving APSP?
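For reference, the APSP baseline mentioned above can be sketched with Floyd-Warshall on a toy graph. This is an illustration of the brute-force approach, not the method proposed in the talk. The edge weights are an assumed assignment for the example graph, and `metric_backbone_apsp` is a hypothetical name.

```python
from itertools import product

# Assumed weight assignment for the example graph (not given in the transcript).
EDGES = {
    ("A", "B"): 2, ("B", "C"): 3, ("C", "D"): 2,
    ("D", "E"): 1, ("A", "D"): 10, ("C", "E"): 4,
}

def metric_backbone_apsp(edges):
    """Keep the edges whose weight equals the shortest-path distance of their endpoints."""
    nodes = sorted({n for edge in edges for n in edge})
    inf = float("inf")
    # The distance table has N^2 entries, which is what makes this approach
    # impractical for large graphs.
    dist = {(u, v): (0 if u == v else inf) for u, v in product(nodes, repeat=2)}
    for (u, v), w in edges.items():
        dist[u, v] = min(dist[u, v], w)
        dist[v, u] = min(dist[v, u], w)
    # Floyd-Warshall: product() varies the first element slowest, so the
    # intermediate node k is the outermost loop, as the algorithm requires.
    for k, i, j in product(nodes, repeat=3):
        if dist[i, k] + dist[k, j] < dist[i, j]:
            dist[i, j] = dist[i, k] + dist[k, j]
    return {e: w for e, w in edges.items() if w == dist[e]}

backbone = metric_backbone_apsp(EDGES)
```

With the assumed weights, `backbone` keeps AB, BC, CD and DE and drops AD and CE. The exact equality test is safe here because the weights are integers; floating-point weights would need a tolerance.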
32. A 3-PHASE BACKBONE ALGORITHM
1. Find 1st-order semi-metric edges: only look at triangles!
• Scalable & practical for large graphs!
2. Identify metric edges in 2-hop paths
• Most semi-metric edges have been removed
3. Run a BFS for remaining unlabeled edges.
• 1%-9% edges
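The three phases can be sketched on a single machine as follows. This is a simplified illustration under stated assumptions, not the distributed implementation from the talk. The Phase 2 rule used here (the lightest edge at a vertex is metric when all weights are positive) is an assumed sufficient condition, since the transcript does not spell out the exact test. Phase 3 uses a weight-bounded Dijkstra search in place of the BFS named on the slide. The function names and the example weights are hypothetical.

```python
import heapq

# Assumed weight assignment for the example graph (not given in the transcript).
EDGES = {
    ("A", "B"): 2, ("B", "C"): 3, ("C", "D"): 2,
    ("D", "E"): 1, ("A", "D"): 10, ("C", "E"): 4,
}

def has_shorter_path(adj, src, dst, bound):
    """True if some src-dst path has total weight strictly below bound."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return True
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            # Paths reaching the bound are pruned, so the direct edge
            # (whose weight equals the bound) is never followed.
            if nd < bound and nd < dist.get(v, bound):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return False

def three_phase_backbone(edges):
    """Label each undirected edge (listed once, positive weight) as metric or semi-metric."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    label = {}

    # Phase 1: an edge is 1st-order semi-metric if a triangle gives a shorter 2-hop path.
    for (u, v), w in edges.items():
        common = adj[u].keys() & adj[v].keys()
        if any(adj[u][x] + adj[x][v] < w for x in common):
            label[u, v] = "semi-metric"

    # Phase 2 (assumed rule): the lightest edge at a vertex must be metric, because
    # any detour starts with an edge at least as heavy and adds further positive weight.
    for u, nbrs in adj.items():
        v = min(nbrs, key=nbrs.get)
        key = (u, v) if (u, v) in edges else (v, u)
        label.setdefault(key, "metric")

    # Phase 3: an explicit search only for the edges that are still unlabeled.
    for (u, v), w in edges.items():
        if (u, v) not in label:
            shorter = has_shorter_path(adj, u, v, w)
            label[u, v] = "semi-metric" if shorter else "metric"
    return label

labels = three_phase_backbone(EDGES)
```

On the assumed example, Phase 1 labels CE, Phase 2 labels AB, CD and DE, and only BC and AD are left for the search in Phase 3.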
39. EVALUATION GOALS
• How does our algorithm compare to APSP?
• Are large, real-world graphs semi-metric?
• Can we improve graph analysis performance?
43. COMPARISON TO APSP
Computing APSP in Giraph
• multiple SSSPs: in the order of months for million-edge graphs
• multiple MSSPs, i.e. SSSPs from several sources in parallel: in the order of days for million-edge graphs
Our algorithm is 120-180x faster than SSSP and 11-14x faster than MSSP: order of hours for million-edge graphs
51. ALGORITHM PHASES
• Phase 1: very fast and scalable; removes up to 90% of semi-metric edges
• Phase 2: moderately fast; labels up to 60% of the unlabeled edges
• Phase 3: slow; labels 1-9% of the total edges
Phase 1 is the fastest and most useful phase
61. BEST PRACTICES
When to use the backbone?
• semi-metric weighting schemes, e.g. neighborhood similarity
• we can amortize the overhead: e.g. many algorithms on the same graph, multiple distance queries
• lossy compression is ok
When not to use the backbone?
• for metric weighting schemes
• we need to run one-off analysis
• we need lossless compression
62. RECAP: MAIN CONTRIBUTIONS
• An algorithm for computing the metric
backbone without solving APSP
• An open-source distributed implementation
• Graph query and graph analytics speedup on
Neo4j and Apache Giraph