The authors present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions.
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size
1. TRIÈST: Approximating Triangle Counts
in Fully-Dynamic Graph Edge Streams
with Fixed Memory
Matteo Riondato – Labs, Two Sigma Investments
CMU DB Group – October 24, 2016
1 / 26
2. Who am I?
Matteo Riondato
Working at
Labs, Two Sigma Investments (Research Scientist);
CS Dept., Brown U. (Visiting Asst. Prof.);
Doing research in algorithmic data science
(used to be data mining, but somehow we forgot about algorithms...);
algorithmic data science = (theory × practice)^(theory × practice)
Tweeting @teorionda;
“Living” at http://matteo.rionda.to.
2 / 26
3. What am I going to talk about?
TRIÈST: a suite of algorithms for approximately counting triangles in fully-dynamic
edge streams, using a fixed amount of storage/space/memory.
Joint work with:
• Lorenzo De Stefani (Brown);
• Alessandro Epasto (Google Research);
• Eli Upfal (Brown);
Best student paper award at ACM KDD’16;
Journal version under submission to ACM TKDD,
available from http://bit.ly/triestkdd;
TRIÈST: Counting Local and Global Triangles in
Fully-Dynamic Streams with Fixed Memory Size
Lorenzo De Stefani
Brown University
Providence, RI, USA
lorenzo@cs.brown.edu
Alessandro Epasto
Google
New York, NY, USA
aepasto@google.com
Matteo Riondato
Two Sigma Investments
New York, NY, USA
matteo@twosigma.com
Eli Upfal
Brown University
Providence, RI, USA
eli@cs.brown.edu
“Ogni lassada xe persa” (“every missed chance is lost forever”)
– Proverb from Trieste, Italy.
ABSTRACT
We present trièst, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions.
Our algorithms use reservoir sampling and its variants to exploit the user-specified memory space at all times. This is in contrast with previous approaches, which require hard-to-choose parameters (e.g., a fixed sampling probability) and offer no guarantees on the amount of memory they use. We analyze the variance of the estimations and show novel concentration bounds for these quantities.
Our experimental results on very large graphs demonstrate that trièst outperforms state-of-the-art approaches in accuracy and exhibits a small update time.
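The reservoir sampling the abstract refers to can be sketched as follows. This is a minimal, insertion-only illustration (the function name and structure are mine, not the paper's); trièst additionally updates its triangle estimators whenever an edge enters or leaves the sample:

```python
import random

def reservoir_insert(sample, M, t, edge):
    """Standard reservoir step for the t-th stream element (t is 1-based).

    Maintains a uniform random sample of at most M edges seen so far.
    Returns True if `edge` entered the sample.
    """
    if t <= M:                              # reservoir not yet full: always keep
        sample.append(edge)
        return True
    if random.random() < M / t:             # otherwise keep with probability M/t...
        sample[random.randrange(M)] = edge  # ...evicting a uniformly chosen victim
        return True
    return False

# Usage: a toy edge stream processed with memory fixed at M = 3.
random.seed(7)
stream = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (1, 5)]
sample = []
for t, e in enumerate(stream, start=1):
    reservoir_insert(sample, 3, t, e)
assert len(sample) == 3  # memory never exceeds M, however long the stream
```

After processing the t-th edge, each edge seen so far is in the sample with the same probability min(1, M/t), which is what makes unbiased estimation possible without fixing a sampling rate in advance.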
1. INTRODUCTION
Exact computation of characteristic quantities of Web-scale networks is often impractical or even infeasible due to the sheer size of these graphs, so it is natural to resort to approximation of these quantities. For efficiency, the algorithms should aim at exploiting the available memory space as much as possible and they should require only one pass over the stream.
We introduce TRIÈST, a suite of sampling-based, one-pass algorithms for adversarial fully-dynamic streams to approximate the global number of triangles and the local number of triangles incident to each vertex. Mining local and global triangles is a fundamental primitive with many applications (e.g., community detection [4], topic mining [10], spam/anomaly detection [3, 27], ego-networks mining [12], and protein interaction networks analysis [29]).
Many previous works on triangle estimation in streams also employ sampling (see Sect. 3), but they usually require the user to specify in advance an edge sampling probability p that is fixed for the entire stream. This approach has several significant drawbacks. First, choosing a p that yields the desired approximation quality requires knowing or guessing a number of properties of the input (e.g., the size of the stream). Second, a fixed p implies that the sample size grows with the size of the stream, which is problematic when the stream size is not known in advance: if the user
3 / 26
7. What are triangles?
Let G = (V, E) be a graph.
[Figure: example graph on vertices 1–8.]
Triangle: a set of three edges forming a cycle;
Global triangle count ∆G: the no. of triangles in G; E.g., ∆G = 3;
Local triangle count ∆v for v ∈ V : the no. of triangles that v “belongs” to;
E.g., ∆1 = 2, ∆5 = 3, ∆6 = 0, . . .
Applications: community/spam/event detection, link prediction/recommendation,
prototype for more complex patterns, . . .
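These definitions are easy to make concrete. Below is a small exact counter for ∆G and ∆v, run on a toy graph of our own (not the graph drawn on the slide):

```python
from collections import defaultdict

def triangle_counts(edges):
    """Exact global (Delta_G) and local (Delta_v) triangle counts of an
    undirected graph given as a list of edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    local = defaultdict(int)
    seen = 0
    for u, v in edges:
        # Every common neighbor w of u and v closes a triangle {u, v, w}.
        # The edge loop visits each triangle once per edge (3 times total),
        # and exactly once with w as the vertex opposite the current edge,
        # so `local` is exact while `seen` overcounts by a factor of 3.
        for w in adj[u] & adj[v]:
            local[w] += 1
            seen += 1
    return seen // 3, dict(local)

# two triangles sharing vertex 3: {1, 2, 3} and {3, 4, 5}
g, l = triangle_counts([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)])
```

Here g is ∆G = 2 and l[3] is ∆3 = 2, since vertex 3 belongs to both triangles.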
4 / 26
8. What are fully-dynamic edge streams?
Discrete time t, starting at t = 0 and never ending;
At each time step, a new edge update (insertion or deletion) is on the stream:
Time . . . t∗ t∗ + 1 t∗ + 2 t∗ + 3 t∗ + 4 t∗ + 5 . . .
Stream . . . +, (1, 2) +, (3, 2) +, (1, 3) −, (3, 2) +, (1, 5) +, (4, 5) . . .
The order may be fixed in advance by an adversary.
G(t) = (V (t), E(t)): graph induced by the edges inserted and not deleted up to time t.
Example: processing the stream above step by step, each +, (u, v) adds the edge (u, v) to the graph, and −, (3, 2) at time t∗ + 3 removes the edge (3, 2) that was inserted at time t∗ + 1.
[Figure: the graph G(t) after each of the updates from t∗ to t∗ + 5.]
The global and local triangle counts change from G(t) to G(t+1);
Our goal: at each time t, give an estimate of ∆G(t) and ∆v , v ∈ V (t).
5 / 26
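A minimal sketch of this model, assuming the stream is represented as a Python list of (op, (u, v)) pairs (our own encoding, not from the paper), replaying the example updates:

```python
def graph_at(stream, t):
    """Edge set E(t): edges inserted and not deleted among the first
    t + 1 updates of a fully-dynamic edge stream."""
    edges = set()
    for op, (u, v) in stream[:t + 1]:
        e = frozenset((u, v))
        if op == '+':
            edges.add(e)
        else:  # op == '-': deletion of a previously inserted edge
            edges.discard(e)
    return edges

# the stream from the slide, taking time t* as position 0
stream = [('+', (1, 2)), ('+', (3, 2)), ('+', (1, 3)),
          ('-', (3, 2)), ('+', (1, 5)), ('+', (4, 5))]
```

For instance, edge (3, 2) is present in G(t∗ + 2) but gone from G(t∗ + 3) on.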
23. Why is working on fully-dynamic edge streams difficult?
The stream is infinite: storing all (or a constant fraction of) the edges is impossible;
→ TRIÈST stores a user-specified, fixed number M of edges;
There is no end of the stream: post-processing at the end of the stream is impossible;
→ TRIÈST needs no postprocessing.
Updates arrive continuously: re-running an algorithm from scratch after each update
is infeasible; → TRIÈST is incremental and one-pass;
Triangle counts change continuously: spending a long time on each update to get the
exact count is infeasible and illogical; → TRIÈST computes high-quality estimates;
An efficient algorithm for fully-dynamic streams must tackle all these challenges.
TRIÈST does.
6 / 26
24. What is TRIÈST?
(the local dialect name of Trieste, a city in the North-East of Italy, next to Slovenia.)
TRIÈST (TRIangles ESTimation):
A suite of 3 algorithms for approximate triangle counting from edge streams:
• TRIÈST-BASE: baseline algorithm for insertion-only streams;
• TRIÈST-IMPR: improved algorithm for insertion-only streams with reduced variance;
• TRIÈST-FD: algorithm for fully-dynamic streams.
All three algorithms offer unbiased estimators of the local and global triangle counts;
We also present a complete analysis of their variance and give concentration bounds;
7 / 26
25. Aren’t there other algorithms to estimate triangles?
There are many algorithms for estimating triangles from data streams;
Most recent ones are based on independent edge sampling with a fixed probability;
They use an ever-increasing amount of space;
[Table: each prior work (Becchetti et al. 2010, Kolountzakis et al. 2012, Pavan et al. 2013, Jha et al. 2015, Ahmed et al. 2014, Lim et al. 2015) satisfies only a subset of the desiderata: single pass, fixed space, local counts, global counts, fully-dynamic streams. TRIÈST satisfies all five.]
TRIÈST is the first to tackle all the challenges;
It is based on reservoir sampling, a well-known non-independent sampling scheme;
The analysis is challenging, but the gains are worth the price.
8 / 26
26. What is the general idea behind TRIÈST?
Let’s focus on TRIÈST-BASE for now (i.e., insertion-only streams);
TRIÈST-BASE maintains a collection S of M edges from the stream;
The edges in S induce a graph GS = (VS, S);
TRIÈST-BASE maintains the exact values of
• ∆GS: the number of triangles in GS; and
• ∆vS: the number of triangles in GS incident to v ∈ VS.
Maintaining the exact counts ∆GS and ∆vS, v ∈ VS, after each update is fast;
Estimates for ∆G(t) and ∆v, v ∈ V (t), are obtained from ∆GS and ∆vS by weighting by a probability πt (stay tuned!)
9 / 26
27. How does TRIÈST-BASE work?
TRIÈST-BASE uses a random sampling scheme known as reservoir sampling;
At any time t ≤ M, deterministically insert the edge currently on the stream into S;
At any time t > M, flip a coin with tail-bias M/t;
If the outcome is head, do nothing;
If the outcome is tail :
1) Choose an edge in S u.a.r. and replace it with the edge currently on the stream;
2) Decrease ∆GS and ∆vS, v ∈ VS, by the no. of triangles involving the removed edge;
3) Increase ∆GS and ∆vS, v ∈ VS, by the no. of triangles involving the inserted edge;
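The steps above can be sketched in code. This is a simplified rendering of our own, not the paper's pseudocode; the estimate() method applies the weighting by πt that the deck introduces a few slides later:

```python
import random
from collections import defaultdict

class TriestBase:
    """Sketch of TRIÈST-BASE for insertion-only streams (simplified)."""

    def __init__(self, M, seed=0):
        self.M, self.t = M, 0
        self.S = set()                    # reservoir of at most M edges
        self.adj = defaultdict(set)       # adjacency of the sampled graph G_S
        self.tri_S = 0                    # exact triangle count of G_S
        self.rng = random.Random(seed)

    def _closed_by(self, u, v):
        # triangles of G_S involving edge (u, v): common neighbors of u and v
        return len(self.adj[u] & self.adj[v])

    def _insert(self, u, v):
        self.tri_S += self._closed_by(u, v)
        self.S.add(frozenset((u, v)))
        self.adj[u].add(v)
        self.adj[v].add(u)

    def _remove(self, e):
        u, v = tuple(e)
        self.S.discard(e)
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        self.tri_S -= self._closed_by(u, v)

    def update(self, u, v):
        """Process the t-th stream element, the insertion +(u, v)."""
        self.t += 1
        if self.t <= self.M:                        # fill the reservoir
            self._insert(u, v)
        elif self.rng.random() < self.M / self.t:   # "tail" with prob. M/t
            self._remove(self.rng.choice(list(self.S)))
            self._insert(u, v)

    def estimate(self):
        """Unbiased estimate of the global count: tri_S / pi_t."""
        t, M = self.t, self.M
        if t <= M:
            return float(self.tri_S)
        return self.tri_S * (t * (t - 1) * (t - 2)) / (M * (M - 1) * (M - 2))
```

When M is at least the stream length, the sample is the whole graph and the estimate is exact, which makes the sketch easy to sanity-check.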
10 / 26
28. Is an example worth a thousand words?
Memory: M = 8; Time: end of t∗ − 1;
[Figure: sampled graph GS = (VS, S) on vertices 0–5.]
Global triangle count ∆GS: 3
33. Is an example worth a thousand words?
Memory: M = 8; Time: t∗; Edge on the stream: (2, 5);
Coin bias: M/t∗; Coin flip outcome: tail;
Actions: 1) Remove an edge of GS chosen at random (e.g., (0, 1)); 2) Add (2, 5) to GS; 3) Update ∆GS;
Global triangle count ∆GS: 3 − 1 + 1 = 3
35. Is an example worth a thousand words?
Memory: M = 8; Time: t∗ + 1; Edge on the stream: (2, 4);
Coin bias: M/(t∗ + 1); Coin flip outcome: head;
Actions: Do nothing;
Global triangle count ∆GS: 3
11 / 26
39. How does TRIÈST-BASE estimate the number of triangles?
Lemma
The set S ⊆ E(t) is chosen uniformly at random among all subsets of E(t) of size M.
This does not imply/assume that S is a collection of independently sampled edges.
Corollary
The probability that a triangle (a, b, c) of G(t) is in GS at time t is
πt = (t − 3 choose M − 3) / (t choose M)
because (t choose M) counts the M-subsets of E(t) (|E(t)| = t), and (t − 3 choose M − 3) counts the M-subsets of E(t) containing the three edges of (a, b, c).
Hence, TRIÈST-BASE returns ∆GS / πt as the unbiased estimate of ∆G(t).
12 / 26
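The corollary is easy to sanity-check numerically; the closed form below is the standard simplification of the binomial ratio (a quick check of our own, not from the slides):

```python
from math import comb

def pi_binom(t, M):
    """pi_t as a ratio of binomial coefficients: M-subsets of E(t)
    containing the triangle's 3 edges, over all M-subsets (|E(t)| = t)."""
    return comb(t - 3, M - 3) / comb(t, M)

def pi_closed(t, M):
    """Equivalent closed form, convenient when weighting the estimate."""
    return (M * (M - 1) * (M - 2)) / (t * (t - 1) * (t - 2))
```

The two forms agree: (t−3 choose M−3)/(t choose M) simplifies to M(M−1)(M−2)/(t(t−1)(t−2)).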
41. Where are the theorems?
We give a complete analysis of unbiasedness, variance, and novel concentration bounds;
The events “edge a ∈ S at time t” and “edge b ∈ S at time t” are not independent;
This makes the analysis of the variance and the concentration bounds quite challenging;
Theorem (Concentration bound, (ε, δ)-approximation)
Let t ≥ 0 and assume |∆(t)| > 0. For any ε, δ ∈ (0, 1), let
Φ = (8ε^(−2) · (3h(t) + 1)/|∆(t)| · ln((3h(t) + 1)e/δ))^(1/3).
If
M ≥ max{tΦ(1 + (1/2) ln^(2/3)(tΦ)), 12ε^(−1) + e², 25},
then |ξ(t)τ(t) − |∆(t)|| < ε|∆(t)| with probability > 1 − δ.
Proving this was fun:
we used results on graph coloring, Poisson approximations, and Chernoff bounds.
13 / 26
42. Ok, but can I show you something?
To exactly compute the variance of the TRIÈST-BASE estimator ∆GS:
1) Express the variance as a sum of covariances over pairs of triangles:
Var(∆GS) = Σ over pairs (a, b) of Cov(a, b)
2) Explicitly compute the covariance formulas:
2.a) For pairs of triangles sharing an edge, compute the probability of the 5 involved edges being in S:
πt · (M − 3)(M − 4) / ((t − 3)(t − 4))
2.b) For pairs of triangles not sharing an edge, compute the probability of the 6 involved edges being in S:
πt · (M − 3)(M − 4)(M − 5) / ((t − 3)(t − 4)(t − 5))
The variance depends on the real no. of triangles in G(t) and on the no. of triangles in G(t) sharing an edge. 14 / 26
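Both inclusion probabilities follow from the general formula for k specific edges landing in a uniform M-subset of t edges; a quick numerical check of our own:

```python
from math import comb

def p_edges_in_sample(t, M, k):
    """P(k specific edges are all in a uniform M-subset of t edges)."""
    return comb(t - k, M - k) / comb(t, M)

t, M = 200, 50
pi_t = p_edges_in_sample(t, M, 3)   # one triangle: 3 edges
p5 = p_edges_in_sample(t, M, 5)     # two triangles sharing an edge: 5 edges
p6 = p_edges_in_sample(t, M, 6)     # two edge-disjoint triangles: 6 edges
```

The values p5 and p6 match the two closed forms on the slide once each is written as πt times the extra factors for the additional edges.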
44. What is wrong with TRIÈST-BASE?
Weaknesses:
1) -BASE uses the exact value of ∆GS at time t to estimate ∆G(t);
Over time, ∆GS may decrease, and so would the estimate. . .
while ∆G(t) never decreases: ∆G(t′) ≥ ∆G(t) for any t′ > t!
Solution: never decrease the estimate, i.e., use GS only to identify new triangles;
2) -BASE only counts a triangle if all three edges are in S. . . but if two edges are in
S, and the third one is on the stream right now, we may infer that the triangle exists,
so we should count it;
Solution: first increment the counters, then decide whether to insert the edge into S;
TRIÈST-IMPR solves these weaknesses, resulting in estimates with lower variance;
15 / 26
45. How does TRIÈST-IMPR work?
Memory: M = 8; Time: end of t∗ − 1;
[Figure: sampled graph GS = (VS, S) on vertices 0–5.]
Triangle counter λ (= ∆GS): 3
47. How does TRIÈST-IMPR work?
Memory: M = 8; Time: t∗; Edge on the stream: (2, 5);
Action: weighted increment of λ by the no. of triangles closed by (2, 5), each with weight (t∗ − 1)(t∗ − 2)/(M(M − 1));
Coin bias: M/t∗; Coin flip outcome: tail;
Actions: Remove an edge of GS chosen at random (e.g., (0, 1)); Add (2, 5) to GS;
Triangle counter λ: 3 + (t∗ − 1)(t∗ − 2)/(M(M − 1))
52. How does TRIÈST-IMPR work?
Memory: M = 8; Time: t∗ + 1; Edge on the stream: (2, 4);
Action: weighted increment of λ by the no. of triangles closed by (2, 4), each with weight t∗(t∗ − 1)/(M(M − 1));
Coin bias: M/(t∗ + 1); Coin flip outcome: head;
Actions: Do nothing;
Triangle counter λ: 3 + (t∗ − 1)(t∗ − 2)/(M(M − 1)) + 2 t∗(t∗ − 1)/(M(M − 1))
16 / 26
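The two fixes can be sketched in code (a simplified rendering of our own, assuming insertion-only streams; the weight max{1, (t − 1)(t − 2)/(M(M − 1))} equals 1/ρt once the reservoir is full):

```python
import random
from collections import defaultdict

class TriestImpr:
    """Sketch of TRIÈST-IMPR (insertion-only, simplified): the counter
    lambda is incremented, with a weight, for every triangle the arriving
    edge closes in G_S, before the reservoir step, and never decremented."""

    def __init__(self, M, seed=0):
        self.M, self.t, self.lam = M, 0, 0.0
        self.S = set()
        self.adj = defaultdict(set)
        self.rng = random.Random(seed)

    def update(self, u, v):
        self.t += 1
        t, M = self.t, self.M
        # weight 1/rho_t, floored at 1 while the reservoir is still filling
        eta = max(1.0, (t - 1) * (t - 2) / (M * (M - 1)))
        self.lam += eta * len(self.adj[u] & self.adj[v])
        if t <= M:
            self._insert(u, v)
        elif self.rng.random() < M / t:
            old = self.rng.choice(list(self.S))
            a, b = tuple(old)
            self.S.discard(old)
            self.adj[a].discard(b)
            self.adj[b].discard(a)
            self._insert(u, v)

    def _insert(self, u, v):
        self.S.add(frozenset((u, v)))
        self.adj[u].add(v)
        self.adj[v].add(u)

    def estimate(self):
        return self.lam   # lambda itself is the unbiased estimate
```

Note that, unlike in TRIÈST-BASE, evicting an edge never decrements λ.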
54. How does TRIÈST-IMPR estimate the number of triangles?
TRIÈST-IMPR returns λ as the unbiased estimate of ∆G(t).
Corollary
The probability that a triangle of G(t) is “seen”, i.e., causes an increment of λ when the third edge of the triangle is on the stream at time t, is
ρt = M(M − 1) / ((t − 1)(t − 2)).
Since ρt > πt, TRIÈST-IMPR's estimates have lower variance than TRIÈST-BASE's.
17 / 26
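The claim is easy to verify numerically (once both are below 1, the ratio ρt/πt equals t/(M − 2), which exceeds 1; a quick check of our own):

```python
def pi_t(t, M):
    """Prob. that all 3 edges of a triangle are in the reservoir."""
    return (M * (M - 1) * (M - 2)) / (t * (t - 1) * (t - 2))

def rho_t(t, M):
    """Prob. that the first 2 edges are in the reservoir when the
    third one arrives (capped at 1)."""
    return min(1.0, (M * (M - 1)) / ((t - 1) * (t - 2)))
```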
55. Where are the theorems?
The order of the updates on the stream affects the probability of “seeing” a triangle;
This further complicates the analysis of the variance:
Theorem (Upper bound to the variance)
For any time t > M, we have
Var(τ(t)) ≤ |∆(t)| · (max{1, (t − 1)(t − 2)/(M(M − 1))} − 1) + z(t) · (t − 1 − M)/M.
We proceed case-by-case: not intuitive, tedious, pessimistic, inelegant, and loose;
18 / 26
56. What about fully-dynamic edge streams?
Handling deletions is hard;
TRIÈST-FD’s approach is inspired by random pairing (Gemulla et al., 2009).
TRIÈST-FD tracks all deletions, and updates S by removing deleted edges;
This is not enough: the resulting S is no longer a uniform sample of the non-deleted edges in G(t);
TRIÈST-FD also keeps track of the max. number of edges at any time t;
This makes it possible to compute the bias of the current S due to unpaired deletions.
TRIÈST-FD weights ∆GS by the bias, to obtain the estimate for ∆G(t);
19 / 26
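The random-pairing idea can be sketched as follows. This is our own simplification of the sample-maintenance logic only, not the paper's TRIÈST-FD (which also maintains the triangle counters and the unbiasedness correction); the names s, d_in, and d_out are ours, and the plain reservoir step is a simplification:

```python
import random

class RandomPairingSample:
    """Sketch of random-pairing sample maintenance (Gemulla et al. style)
    over a fully-dynamic edge stream, with reservoir size M."""

    def __init__(self, M, seed=0):
        self.M = M
        self.S = set()
        self.s = 0          # edges currently in the graph
        self.d_in = 0       # uncompensated deletions that hit the sample
        self.d_out = 0      # uncompensated deletions that missed it
        self.rng = random.Random(seed)

    def insert(self, u, v):
        self.s += 1
        e = frozenset((u, v))
        if self.d_in + self.d_out == 0:
            # no deletions to compensate: plain reservoir step
            if len(self.S) < self.M:
                self.S.add(e)
            elif self.rng.random() < self.M / self.s:
                self.S.discard(self.rng.choice(list(self.S)))
                self.S.add(e)
        else:
            # pair the insertion with an earlier, uncompensated deletion
            if self.rng.random() < self.d_in / (self.d_in + self.d_out):
                self.S.add(e)
                self.d_in -= 1
            else:
                self.d_out -= 1

    def delete(self, u, v):
        self.s -= 1
        e = frozenset((u, v))
        if e in self.S:
            self.S.discard(e)
            self.d_in += 1
        else:
            self.d_out += 1
```

Pairing later insertions against earlier deletions is what restores the sample distribution and makes an unbiased reweighting possible.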
57. Where are the experiments?
Implementation: C++. Available from http://bit.ly/triestkdd
Graphs: Last.fm, Patent-Cit, Patent-Coaut, Twitter, Yahoo!, and others
Goals: evaluate variance, runtime, scalability.
Environment: Brown CS computing cluster (single core, max 4GB RAM)
20 / 26
58. How does TRIÈST-IMPR perform?
Yahoo! graph with 1.2 billion edges (computing the exact ∆G is infeasible);
Space M = 1 million (≈0.1% of the graph);
[Figure: global triangle count vs. time t, showing the max., min., and avg. estimates.]
Takeaway: The unbiased estimates are highly concentrated around the mean.
21 / 26
59. How does TRIÈST-IMPR perform compared to other methods?
Last.fm graph (40 million edges, 1 billion triangles);
Space M = 100K (0.25% of the graph);
Compared with MASCOT (KDD’15), which uses edge sampling with fixed probability;
[Figures: (left) global triangle count vs. time t, showing the ground truth and the max./min. estimates of TRIEST-IMPR and MASCOT-I; (right) standard deviation of the estimate vs. time t for TRIEST-IMPR and MASCOT-I.]
Takeaway: TRIÈST's estimates are much more accurate and have lower variance.
22 / 26
60. How does TRIÈST-FD perform?
[Figures: global triangle count vs. time t, showing the avg. estimate ± std. dev. (and, where feasible, the ground truth) for (c) Patent (Cit.), (d) LastFm, and (e) Yahoo! Answers.]
Takeaway:
1) The estimates are very accurate;
2) TRIÈST allows one to study the evolution of triangle counts at a level not available before;
E.g., it is possible to detect patterns and anomalies.
23 / 26
61. How scalable is TRIÈST-FD?
We measured the average time to handle an update on the stream;
[Figure: avg. microseconds per update on patent-cit, patent-coaut, lastfm, and yahoo, for M = 200000, 500000, and 1000000.]
Takeaway: between 2 µs/edge and 3 ms/edge;
(i.e., between 500k edges/sec. and 300 edges/sec.) 24 / 26
62. What didn’t I tell you?
The Goods:
Concentration results (the one for TRIÈST-BASE is very elegant);
Theorems for TRIÈST-FD;
TRIÈST for multigraphs (various defs. of triangle counts);
Many more experiments and comparisons with state-of-the-art;
The Bads:
Results on variance are upper bounds, often loose;
Some of the concentration bounds are quite naïve (Chebyshev Ineq.);
The bounds should not depend on the order of the edges on the stream;
The Betters:
We are exploring the use of cube sampling and balanced sampling to solve the issues.
25 / 26
63. What did I talk about?
TRIÈST: three algorithms for triangle count estimation in fully-dynamic edge streams;
• Uses a fixed, constant amount of memory;
• Is intrinsically incremental;
• Scales to billion-edge graphs and handles tens of thousands of edges per second;
• Uses reservoir sampling in a smart way;
• Gives unbiased, low-variance, highly-concentrated estimates;
Complex analysis due to non-independent sampling, but worth the effort!
Thank you!
EML: matteo@twosigma.com TWTR: @teorionda
WWW: http://matteo.rionda.to
26 / 26
64. This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy
any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for investment
advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”).
Such views reflect significant assumptions and subjective views of the author(s) of the document and are subject to change without notice. The
document may employ data derived from third-party sources. No representation is made as to the accuracy of such information and the use of
such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If
so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and
comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any
association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.