https://gist.github.com/wolfram77/54c4a14d9ea547183c6c7b3518bf9cd1
There exist a number of dynamic graph generators. The Barabási-Albert model iteratively attaches new vertices to pre-existing vertices in the graph using preferential attachment (edges to high-degree vertices are more likely: rich get richer, Pareto principle). However, graph size increases monotonically, and the density of the graph keeps increasing (sparsity decreasing).
Gorke's model uses a defined clustering to uniformly add vertices and edges. Purohit's model uses motifs (e.g., triangles) to mimic properties of existing dynamic graphs, such as growth rate, structure, and degree distribution. Kronecker graph generators are used to increase the size of a given graph, with a power-law distribution.
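As a rough illustration of the preferential attachment mentioned above, the following Python sketch grows a graph in the Barabási-Albert style. The function name `barabasi_albert_edges`, the seed clique, and the parameters `n` and `m` are assumptions made for this example; they are not taken from the paper or from any library.

```python
import random

def barabasi_albert_edges(n, m, seed=0):
    """Grow a graph to n vertices; each new vertex attaches to m existing ones.

    Targets are drawn from `endpoints`, a list holding every edge endpoint seen
    so far, so a vertex is picked with probability proportional to its degree.
    """
    rng = random.Random(seed)
    # Assumed starting point: a clique of m + 1 vertices.
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    endpoints = [x for e in edges for x in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:          # m distinct, degree-biased targets
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((new, t))
            endpoints += [new, t]
    return edges

edges = barabasi_albert_edges(n=200, m=2)
```

Every step only adds vertices and edges, which is the monotonic growth noted above.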
To generate dynamic graphs, we must choose a metric to compare two graphs. Common metrics include diameter, clustering coefficient (modularity?), triangle counting (triangle density?), and degree distribution.
In this paper, the authors propose DyGraph, a dynamic graph generator that uses degree distribution as the only metric. The authors observe that many real-world graphs differ from the power-law distribution at the tail end. To address this issue, they propose binning: degrees at or beyond a threshold degree (minDeg = min(deg) s.t. |V(deg)| < H, where H ≈ 10 is a vertex-count threshold, so minDeg is the first degree held by fewer than H vertices) are grouped into bins of degree-width binWidth. Each bin records localMax (the maximum number of vertices sharing a single degree in the bin) and binSize (the number of degrees in the bin with at least one vertex, to keep track of sparsity). This helps the authors generate graphs with a more realistic degree distribution.
The process of generating a dynamic graph is as follows. First, the difference between the desired and the current degree distribution is calculated. The authors then create an edge-addition set where each vertex is present as many times as the number of additional incident edges it must receive. Edges are then created by connecting two vertices chosen randomly from this set, and removing both from the set once connected. Currently, the authors reject self-loops and duplicate edges. Removal of edges is done in a similar fashion.
Authors observe that adding edges with power-law properties dominates the execution time, and consider parallelizing DyGraph as part of future work.
2. GRADES ’22, June 12, 2022, Philadelphia, PA McCrabb, et al.
[Figure 2 diagram: three stages in a loop. Compute: create vertex data. Share: share with each vn. Update: change graph topology.]
Figure 2: Dynamic graph algorithms’ structure: the compute
stage creates per-vertex values for some (or all) vertices. The
share stage transfers these values to neighbors. The update
stage implements all topology updates that have been queued
since the previous iteration.
Graph repositories, like SNAP [21] and Network Repository [31],
provide several graph datasets, but many important types of graphs
used in real-world applications are missing from these repositories.
For example, in real-world scenarios, static graphs may include
billions to trillions of edges [32] [7]. Additionally, many real graphs
are power-law graphs; that is, a small set of vertices have a high
degree (i.e., they have many neighbors) and the degree distribu-
tion plot approximates Power Law [26]. The available datasets are
much smaller and do not cover the rich variety of realistic Power
Law properties. Thus, to evaluate their work on such large graphs,
researchers resort to synthetic graph generators. These tools cre-
ate artificial graph datasets with certain pre-set properties, such
as number of vertices, degree distribution (where degree is the
number of edges incident to the vertex), clustering coefficients, etc.
Synthetic generators thus bridge a crucial gap between what is
publicly available and what is needed in the research community.
This gap is even larger for dynamic graphs. A wider range of dy-
namic graphs are needed to evaluate new research projects because
dynamic graphs are defined by a richer set of properties across
different applications, such as the frequency of graph updates and
changes in degree distribution over time. At the same time, even
fewer dynamic graph datasets are available in public repositories for
multiple reasons. First, collecting dynamic graph data can be pro-
hibitively time-consuming and expensive. Second, existing dynamic
graphs, from sources like social media, often contain inherently
identifiable data, limiting researchers’ ability to share them publicly.
Third, the methods and information that companies use to collect
their dynamic graph datasets may be protected by industry secrets.
Moreover, the offering of synthetic dynamic graph generators
is extremely limited. The few available [2] [10] [29] [30] [35] are
built for specific purposes and are ill-suited to evaluate novel work.
This aspect leaves dynamic graph researchers with few options,
other than artificially creating dynamic graphs out of existing static
graphs, a crude, unrealistic substitute for real-world data.
To bridge this gap in dynamic graph offerings, we present Dy-
Graph, a dataset generator and benchmark suite for dynamic graph
applications. Specifically, DyGraph contributes the following:
• It provides the DyGraph Generator, a novel synthetic dataset
generator capable of both generating graphs with user-specified
uniformly random or power-law properties from scratch and mim-
icking the properties of an input dynamic graph.
• It collects many real-world dynamic graph datasets, offers them
with a uniform representation, and makes them publicly available
in this format, for use as-is or jointly with the DyGraph Generator.
• It demonstrates how the DyGraph Generator may be used to
create graphs with properties that mimic real-world datasets and
modify a graph over time, while maintaining its original charac-
teristics. In our case study, we find that DyGraph is able to closely
match the degree distribution properties of real input datasets (3 to
5.5 times better than Power Law), and that users can control the
properties in the output graphs by applying small changes to an
automatically-generated script.
2 DYNAMIC GRAPHS
2.1 Temporal Representation
A static graph 𝐺 is a structure comprising a set of unique vertices 𝑉
and a set of unique edges 𝐸, such that 𝐺 = (𝑉, 𝐸). Graph algorithms
are most often organized as a series of iterations. Each iteration
includes a “compute” stage and a “share” stage. In the compute
stage, the same instructions are executed independently for each
vertex in 𝑉 , creating some result value. In the share stage, result
values are shared with some or all of the vertex’s neighbors across
all edges in 𝐸. [16], [24], and [34] discuss this and similar paradigms
further. This type of algorithm design only requires that the graph’s
topology is not modified within an iteration, thus it can be applied
to dynamic graphs as well, as long as this condition holds.
Indeed, most dynamic graph algorithms leverage this same al-
gorithmic structure with a small modification: they add an “up-
date” stage after each iteration, during which they apply topology
changes (e.g., adding an edge, removing a vertex, etc.) in batch. Fig-
ure 2 shows the complete algorithm organization. It is therefore
most appropriate to represent dynamic graphs as a sequence of 𝑇
static graphs $G_0, \dots, G_{T-1}$, together with the set of changes between
each $G_t$ and $G_{t+1}$. Note that this discrete representation remains valid
for both streaming (future graph states provided in real time) and
non-streaming (future states available from the start) applications.
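The compute, share, and update stages described above can be sketched in Python as follows. This is only an illustration: the function `step`, the `("add", u, v)` change tuples, and the use of minimum-label propagation as the per-vertex computation are assumptions of this example, not part of DyGraph. Minimum-label propagation also does not handle edge removals correctly; it is used only to keep the sketch short.

```python
def step(adj, label, changes):
    """One iteration: compute, share, then update the topology in batch."""
    # Compute: every vertex independently produces a value (here, its label).
    outbox = {v: label[v] for v in adj}
    # Share: each vertex reads its neighbors' values and keeps the minimum.
    new_label = {v: min([outbox[v]] + [outbox[u] for u in adj[v]]) for v in adj}
    # Update: apply the queued topology changes; the graph was read-only until now.
    for op, u, v in changes:
        if op == "add":
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
            new_label.setdefault(u, u)
            new_label.setdefault(v, v)
        elif op == "remove":
            adj[u].discard(v)
            adj[v].discard(u)
    return new_label

# Two components {0, 1} and {2, 3}; the second batch joins them with edge 1-2.
adj = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
label = {v: v for v in adj}
batches = [[], [("add", 1, 2)], [], [], []]
for changes in batches:
    label = step(adj, label, changes)
```

Each entry of `batches` plays the role of the change set between two consecutive snapshots, and the labels computed for one snapshot are reused as the starting point for the next.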
Algorithms computing on dynamic graphs are often incremental;
that is, they use the solution from $G_t$ as the starting point for $G_{t+1}$
[4]. For example, if a shortest path passes through an edge 𝐴 → 𝐵
in $G_t$ which is removed to create $G_{t+1}$, a dynamic algorithm may
start by searching for a short path from 𝐴 to 𝐵 and maintain the rest
of the overall solution. Similarly, a dynamic PageRank algorithm
may use the values found for $G_t$ as an approximate solution to
$G_{t+1}$. This informs two key differences between dynamic graphs
and a series of independent static graphs: (1) unchanged vertices
and edges must persist across timesteps, with each vertexID
stable and unique across all timesteps, and (2) we must be able to
accurately describe the topological difference between two adjacent
timesteps with a manageable number of changes. For these reasons,
it is not viable to build a dynamic graph using multiple iterations
of an existing static graph generator, as each “timestep” would be
unrealistically different from the previous one.
2.2 Intermediate Static Format
Compressed Sparse Row (CSR) is the most common format for
storing static graphs, as CSR representations are easier to read, offer
better space efficiency, and provide more regular memory access
opportunities than adjacency matrices or lists [33]. CSR contains
two arrays: an edge list and a vertex list.
3. DyGraph: A Dynamic Graph Generator and Benchmark Suite GRADES ’22, June 12, 2022, Philadelphia, PA
Table 1: Key DyGraph Generator Commands
Command                                        Description
add [v] vertices                               Create v disconnected vertices
add [e] random edges                           Create e edges, connecting two uniformly random vertices
add edge power law [Sd] [Ks] [maxDeg] [bins]   Add edges via Power Law to existing graph
remove [v] vertices                            Delete v random vertices and all connecting edges
remove [e] random edges                        Delete e random edges from the graph
commit                                         Save state as G_t. Begin changes for G_{t+1}
build [Sd] [Ks] [maxDeg] [bins]                Add edges via Power Law from scratch (G_0 only)
bin [binID] [binSize] [localMax]               Define bin parameters (after “add edge power law”/“build”)
The vertex list maps each
vertex to the starting index of its list of neighbors in the edge list,
one index per vertex. The edge list has one entry for each edge,
grouped by source vertex, where each entry holds a destination
vertex. For example, if vertices 3 and 4 had vertex list entries of 7
and 12, vertex 3 has five neighbors whose IDs are in slots 7-11 of the
edge list. While CSR is popular for its minimal storage footprint, it
is inefficient for representing dynamic graphs: changes to the graph
topology would require a complete reconstruction of the graph
representation. Moreover, pinpointing the differences between two
graphs in this format entails the complete construction of both
graphs. For these reasons, many research works using dynamic
graphs avoid CSR formats.
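To make the CSR layout concrete, here is a small Python sketch that builds the two arrays from a directed edge list and reproduces the example above, where vertices 3 and 4 have vertex-list entries 7 and 12. The helper names `build_csr` and `neighbors`, the sample edges, and the extra sentinel entry at the end of the vertex list are assumptions of this sketch (the text describes one index per vertex).

```python
def build_csr(num_vertices, edges):
    """Return (vertex_list, edge_list) for directed edges (src, dst)."""
    counts = [0] * num_vertices
    for src, _ in edges:
        counts[src] += 1
    # vertex_list[v] is the first slot of v's neighbors in edge_list.
    # The final sentinel entry is a convenience added by this sketch.
    vertex_list = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        vertex_list[v + 1] = vertex_list[v] + counts[v]
    edge_list = [0] * len(edges)
    cursor = vertex_list[:-1].copy()     # next free slot per source vertex
    for src, dst in edges:
        edge_list[cursor[src]] = dst
        cursor[src] += 1
    return vertex_list, edge_list

def neighbors(vertex_list, edge_list, v):
    return edge_list[vertex_list[v]:vertex_list[v + 1]]

edges = [(0, 1), (0, 2), (0, 3), (1, 0), (1, 2), (2, 0), (2, 1),
         (3, 0), (3, 1), (3, 2), (3, 4), (3, 5), (4, 0)]
vertex_list, edge_list = build_csr(6, edges)
```

Inserting one edge for vertex 0 would shift every later slot, which is why a topology change forces the arrays to be rebuilt.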
Specifically, these works and DyGraph use a regular edge list,
offering a more effective representation for dynamic graphs. Each
edge is represented by two values: a source and a destination vertex
ID. This format requires more storage space (2|𝐸| instead of |𝑉 |+|𝐸|),
but it makes it easier to compute the difference between two graphs, or
two time-based snapshots of the same graph, $G_t$ and $G_{t+1}$, and derive
a change log between them.
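With edge lists, the change log between two snapshots reduces to two set differences. The following Python sketch shows the idea; the function name `change_log` and the sample snapshots are made up for illustration.

```python
def change_log(edges_t, edges_t1):
    """Return (added, removed) edges between two edge-list snapshots."""
    old, new = set(edges_t), set(edges_t1)
    return sorted(new - old), sorted(old - new)

g_t = [(0, 1), (1, 2), (2, 3)]
g_t1 = [(0, 1), (2, 3), (3, 0)]
added, removed = change_log(g_t, g_t1)
```

Neither snapshot has to be converted to another structure first, which is the advantage over CSR noted above.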
3 DYGRAPH
In this section, we present the DyGraph Generator, and then sum-
marize the datasets adapted for benchmark distribution. Both can
be found at adacenter.org/dygraph.
3.1 DyGraph Generator
The DyGraph Generator is designed for three use cases. First, users
can create dynamic graphs from scratch. Second, they can automat-
ically generate datasets with properties that mimic those of other
input dynamic graphs via an automatically-generated intermediate
script. This feature allows users to create and share datasets with
the same profile as other graphs that they may be unable to share.
Third, users can modify this script to create augmented versions
of an input dynamic graph dataset, such as increasing the number
of vertices or edges. The DyGraph-generated scripts assume that
the input graph follows the Power Law, but commands are also
available to add and remove individual vertices and edges. This
latter functionality gives users the flexibility to create graphs with
any degree distribution, where degree is the number of neighbors
that a vertex is connected to. Note that the use of scripting allows
researchers to both share key properties of the datasets used in their
evaluations, and also give other researchers the means to create
their own similarly-profiled graphs, all without needing to publicly
distribute the original datasets.
3.1.1 DyGraph Generator commands. Users create dynamic graphs
from scratch by providing a list of commands (via console or script).
Commands to modify the graph either add vertices, add edges,
remove vertices, remove edges, or commit graph changes to the
current timestep. When a vertex is removed, all its edges are also re-
moved. Table 1 lists the primary DyGraph commands and their com-
mand parameters. DyGraph also provides the opportunity to add
or remove specific vertices by vertexID, and to add or remove spe-
cific edges by pairs of vertexIDs, enabling finer-grain control when
needed. Note that vertices have unique identifiers even when re-
moved from the graph, and that both new and previously-removed
vertices may be re-added using these commands, specified by ver-
texID. These features are detailed in the DyGraph Generator’s user
manual with the suite. Graph modifications to be applied to different
timesteps are separated by commit commands.
Note that when vertices are added, they have no neighbors. Ver-
tices with no neighbors are omitted from output files when a commit
occurs. There are three ways to add edges to vertices: (1) single-item,
(2) with uniform random distribution, and (3) with a power-law
distribution. To add a single edge, users specify the two vertices
to connect. To add randomly distributed edges, users specify the
number of edges to add: DyGraph adds that many edges, selecting
which existing vertices to connect with a uniform probability for
each edge. Finally, to add edges via Power Law, users invoke the
“add edge power law” command, as described below.
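The semantics of the basic commands can be modeled with a small Python class. This is a toy model written for this explanation, not the DyGraph Generator's code or interface: the class `ToyGenerator`, its method names, and the choice to reject duplicate random edges are all assumptions.

```python
import random

class ToyGenerator:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.next_id = 0          # vertex IDs are never reused
        self.vertices = set()
        self.edges = set()        # undirected, stored as (min, max)
        self.snapshots = []       # one edge list per committed timestep

    def add_vertices(self, v):
        """Create v disconnected vertices."""
        for _ in range(v):
            self.vertices.add(self.next_id)
            self.next_id += 1

    def add_random_edges(self, e):
        """Create e edges between uniformly random pairs of existing vertices."""
        pool = sorted(self.vertices)
        added = 0
        while added < e:
            a, b = self.rng.sample(pool, 2)    # two distinct vertices
            edge = (min(a, b), max(a, b))
            if edge not in self.edges:         # assumption: skip duplicates
                self.edges.add(edge)
                added += 1

    def remove_vertex(self, vid):
        """Delete a vertex and every edge that touches it."""
        self.vertices.discard(vid)
        self.edges = {e for e in self.edges if vid not in e}

    def commit(self):
        """Save the current edge list as one timestep."""
        self.snapshots.append(sorted(self.edges))

g = ToyGenerator()
g.add_vertices(10)
g.add_random_edges(12)
g.commit()                 # first timestep
g.remove_vertex(0)
g.add_vertices(2)
g.add_random_edges(3)
g.commit()                 # second timestep
```

Because each committed snapshot is an edge list, vertices without neighbors never appear in it, which matches the note above about omitted vertices.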
3.1.2 Power-law dynamic graph generation. To create a power-law
graph, DyGraph first determines the desired degree distribution,
then modifies the number of edges in the existing set. Note that
commands to create power-law graphs are commands to add edges;
these commands do not add any vertices. Indeed, vertices must be
created before adding any edge. New edges modify the degree of
each vertex so as to fit into a power-law distribution. The proper
degree distribution is calculated by leveraging a combination of (i)
a decaying power-law function and (ii) a set of bins.
The power-law function can be represented as:
$$|V(deg)| = K_s \cdot deg^{S_d} \tag{1}$$
where 𝑑𝑒𝑔 is the degree and |𝑉 (𝑑𝑒𝑔)| is the number of vertices with
degree 𝑑𝑒𝑔. 𝐾𝑠 is a scaling factor for the number of vertices in each
degree, and 𝑆𝑑 (<0) controls how steeply |𝑉 (𝑑𝑒𝑔)| decays
as 𝑑𝑒𝑔 increases. The DyGraph Generator determines how
many edges to add and which pairs of vertices to connect with
those edges, so as to attain the power-law properties specified by 𝐾𝑠
and 𝑆𝑑.
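Equation (1) can be evaluated directly to obtain a target number of vertices per degree. The Python sketch below does so for illustrative parameter values; the function name `power_law_counts`, the rounding to whole vertices, and the values of `ks`, `sd`, and `max_deg` are assumptions of this example rather than DyGraph's actual procedure.

```python
def power_law_counts(ks, sd, max_deg):
    """Target |V(deg)| = ks * deg**sd for deg = 1..max_deg, rounded."""
    return {deg: round(ks * deg ** sd) for deg in range(1, max_deg + 1)}

target = power_law_counts(ks=10_000, sd=-2.0, max_deg=50)
total_vertices = sum(target.values())
# Every edge contributes to the degree of two vertices, so the degree sum
# fixes the number of edges the target distribution implies.
total_edges = sum(deg * count for deg, count in target.items()) / 2
```

With these values, degree 1 has 10,000 vertices, degree 10 has 100, and degree 50 has 4, so `ks` sets the height of the curve and `sd` sets how fast it falls.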
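[Figure 3 plot: # Vertices (log scale, 1 to 100k) versus Degree (10 to 40). Power Law sets the curve until the vertex count drops below the Threshold, where binning starts; Bin 0 covers degrees 14-23, Bin 1 covers degrees 24-33, and so on. Annotations mark BinWidth, BinSize, and LocalMax.]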
Figure 3: Schematic of the DyGraph Generator’s binning
process to approximate real-world power-law graphs. Degree
distribution is set by Power Law for low-degree vertices and
bin parameters for high-degree vertices.
3.1.3 Limitation of basic power-law distribution. We found that
many real-world power-law graphs do not fit precisely in a power-
law distribution, demonstrated in Section 4.2. This is expected,
as Power Law provides an efficient mechanism to approximate
many graphs arising from real-world situations, but it is still just
an approximation. The most impactful difference between an ideal
power-law graph and those we have observed in our real-world
datasets lies in the frequency of the high-degree vertices, that is,
those few vertices that have a high number of incident edges. Indeed,
many real graphs have more such vertices than predicted by a
basic power-law distribution. Figure 3 provides a schematic of this
divergence, showing how the power-law distribution
fitted for a graph effectively approximates low-degree vertices
on the left part of the figure, but fails to model the high-degree
vertices on the right.
In striving to design a high-accuracy generator, we assign each
degree in the distribution to one of two sections: i) Power Law
for low-degree vertices and ii) a process leveraging vertex-binning
for high-degree vertices, as discussed below. A user-defined threshold
determines which of the two approaches DyGraph will use in
modeling the connectivity of each vertex: the threshold separates
high and low degrees:
$$minDeg = \min(deg) \;\; \text{s.t.} \;\; |V(deg)| < H \tag{2}$$
where 𝐻 is the threshold (see Figure 3). Vertices with degrees lower
than 𝑚𝑖𝑛𝐷𝑒𝑔 are connected based on Power Law; vertices with
degrees of 𝑚𝑖𝑛𝐷𝑒𝑔 or higher follow the binning process described below in
Section 3.1.4. We found that the exact value of 𝐻 has little effect on the
quality of the real-world graph approximation, because real-world
graphs diverge slowly from the power-law function as vertices’
degrees increase. We empirically found that several graphs deviate
from Power Law at 𝐻 ≈ 10, so we set 𝐻 = 10.
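Equation (2) amounts to scanning the degree histogram for the first degree held by fewer than 𝐻 vertices. A minimal Python sketch, assuming the histogram is a dictionary from degree to vertex count and that a degree with no entry counts as zero vertices (the name `min_deg` and the sample histogram are hypothetical):

```python
def min_deg(hist, h=10):
    """Smallest degree whose vertex count is below the threshold h."""
    max_deg = max(hist)
    for deg in range(1, max_deg + 1):
        if hist.get(deg, 0) < h:
            return deg
    return max_deg + 1            # no degree falls below the threshold

hist = {1: 500, 2: 120, 3: 40, 4: 12, 5: 9, 6: 11, 7: 3}
threshold_degree = min_deg(hist, h=10)
```

Here degree 5 is the first with fewer than 10 vertices, so binning would start there even though degree 6 happens to have 11 vertices.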
3.1.4 Binning for power-law graphs. In order to model the vertices’
distribution in the segment where 𝑑𝑒𝑔 ≥ 𝑚𝑖𝑛𝐷𝑒𝑔, which is the long
tail of the power-law distribution, we chose to approximate the tail
as a series of boxes, each of width binWidth and of height localMax.
We call these boxes “bins.” Below we describe the key traits we
capture for each bin.
Table 2: DyGraph Power-law Command Parameters
Param.     Description
Ks         Power-law scaling factor
Sd         Power-law decay factor (< 0)
maxDeg     Maximum degree among all vertices
bins       # bins for power-law add commands
binID      Bin ID number (0..bins − 1)
binWidth   # degrees in the bin
binSize    # degrees in the bin with > 0 vertices
localMax   Max # vertices with the same degree
Bins represent degree intervals; that is, each vertex whose degree
is within a certain range is assigned to the same bin. The “add edge
power law” command allows users to specify the number of bins,
while the DyGraph Generator labels each bin with its own identifier
binID. The first bin begins at degree 𝑚𝑖𝑛𝐷𝑒𝑔. The last bin contains
maxDeg, that is the highest degree of any vertex in the graph.
Each bin is specified by three key parameters: binWidth, localMax,
and binSize. BinWidth is the width of the interval in degrees
assigned to the bin (i.e., the degree capacity of a bin). The DyGraph
Generator creates bins that have all the same binWidth. LocalMax
is the maximum number of vertices with a same degree within the
bin. BinSize is the number of degree values within a bin that have
at least one vertex at that degree, that is, |𝑉 (𝑑𝑒𝑔)| > 0. Each bin has
a specific localMax and binSize. Table 2 summarizes all parameters
described.
The binWidth parameter determines the granularity we use in
modeling the distribution. We use localMax to capture how vertices
in each bin become more sparse as degree increases. In other words,
monitoring localMax helps us avoid the generation of spikes in the
distribution. We track the BinSize parameter because the far end
of the tail is often rarified; that is, there are very few vertices, and
many degrees have no vertex associated with them. BinSize and
localMax help us capture and reproduce that sparsity.
Consider the example in Figure 4. The first degree with fewer
than 10 vertices is degree 50, so minDeg is 50. BinWidth has been
set to 100, and bin 0 thus spans degrees 50 to 149. Bin 0 has
a binSize of 97 and a localMax of 18. If the DyGraph Generator
were to add edges so as to preserve the currently captured traits of
the graph distribution, it should add edges only to vertices whose
degree matches one of the 97 unique degrees, between 50 and
149, that already have at least one vertex. In addition, it would not
modify the localMax; thus, each degree would have no more than
18 vertices mapped to it. A similar analysis would take place for
bin 1, where the DyGraph Generator would only add edges so that
vertices fall into one of the 58 degrees, between 150 and 249, that
already had vertices mapped to them. The DyGraph Generator would
also maintain the localMax for vertices in this bin as 5.
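The per-bin parameters can be read off a degree histogram as sketched below in Python. The function `bin_parameters` and the synthetic histogram are assumptions of this example: the histogram is constructed only to reproduce the numbers quoted above (binSize 97 and localMax 18 for bin 0, binSize 58 and localMax 5 for bin 1) and is not the Ubuntu data.

```python
def bin_parameters(hist, min_deg, bin_width):
    """Split degrees >= min_deg into equal-width bins and summarize each."""
    max_deg = max(hist)
    bins = []
    start = min_deg
    while start <= max_deg:
        counts = [hist.get(d, 0) for d in range(start, start + bin_width)]
        bins.append({
            "binID": len(bins),
            "degrees": (start, start + bin_width - 1),
            "binSize": sum(1 for c in counts if c > 0),   # occupied degrees
            "localMax": max(counts),                      # tallest degree
        })
        start += bin_width
    return bins

# Synthetic tail: degrees 50..149 with three empty degrees, then 150..207.
hist = {d: 2 for d in range(50, 150)}
for empty in (70, 101, 133):
    del hist[empty]
hist[55] = 18
hist.update({d: 1 for d in range(150, 208)})
hist[160] = 5
bins = bin_parameters(hist, min_deg=50, bin_width=100)
```

All bins share the same `bin_width`, while `binSize` and `localMax` are recorded separately for each bin.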
3.1.5 Adding edges using binning. The power-law and binning
structures above provide a complete map of how new edges should
be distributed over the entire graph (set of vertices). Once this
map is computed, the DyGraph Generator adds edges such that
the resulting graph’s degree distribution aligns with the new dis-
tribution. Note that the user controls the degree distribution of
the final graph, not the specific number of edges to add.
Table 3: Dynamic Graph Datasets
Name        Description                            |V|    Com. |E|   Max |E|   Timesteps   Time Span
AdTraffic   Online advertisement interactions      6M     410M       5.9M      135         30 days
Bitcoin     User-to-User bitcoin trust network     5k     3.7M       36k       147         N/A
Email       Internal research institution emails   1k     7.5M       167k      27          2.2 yrs
Forum       Private college forum interactions     899    33.7k      848       103         N/A
Higgs       Higgs-Boson Twitter Interactions       456k   34.3M      563k      125         30 days
Hospital    Patient-to-staff proximity             75     2.2M       32k       141         96 hrs
Movies      User-movie streaming network           138k   1.48B      16.4M     193         20 yrs
Music       User-music streaming network           92k    12.7M      186k      184         N/A
Ubuntu      AskUbuntu user-to-user interactions    159k   70.5M      964k      178         7.3 yrs
[Figure 4 plot: # Vertices (log scale, 100 to 1M) versus Degree, with the tail divided into Bin 0 through Bin 3 starting at degree 50. An inset shows LocalMax for Bin 0 and Bin 1, with 97/100 and 58/100 degrees occupied.]
Figure 4: Degree distribution for the Ubuntu dataset at T=50,
showing binning parameter values in bins 0 and 1.
To this end, DyGraph first computes the difference between the number of
vertices in each bin, and the final number of vertices that should
be in each bin. Then DyGraph determines which vertices should
be moved to each bin (unless they are already in the correct bin).
Finally, each vertex in each bin is assigned a final target degree,
and it should be connected to the additional number of edges as
required to reach its target degree. To track all the vertices that
should be connected to additional edges, we create an edge-addition
set, where each vertex is present as many times as the number of ad-
ditional incident edges it must receive. For example, if a vertex has
100 neighbors, and the new degree distribution dictates it should
have 115 neighbors, the vertex is added to the edge-addition set
15 times. Once the edge-addition set is built, edges are created by
connecting two vertices randomly from this set, removing both
from the set when connected. Self-loops (edges connecting a vertex
to itself) and duplicate edges (edges that already exist) are rejected.
These two aspects are easily modifiable by users.
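The edge-addition set and the random pairing step can be sketched in Python as follows. The function names, the cap on attempts (`max_tries`, added here so the loop always terminates), and the sample target degrees are assumptions of this sketch, not the DyGraph Generator's implementation.

```python
import random

def edge_addition_set(current_deg, target_deg):
    """List each vertex once per additional incident edge it must receive."""
    pool = []
    for v, want in target_deg.items():
        pool += [v] * max(0, want - current_deg.get(v, 0))
    return pool

def connect_from_set(pool, existing_edges, seed=0, max_tries=10_000):
    """Pair random entries of the pool into new undirected edges."""
    rng = random.Random(seed)
    pool = list(pool)
    seen = {(min(u, v), max(u, v)) for u, v in existing_edges}
    added = []
    for _ in range(max_tries):
        if len(pool) < 2:
            break
        i, j = rng.sample(range(len(pool)), 2)   # two distinct slots
        u, v = pool[i], pool[j]
        edge = (min(u, v), max(u, v))
        if u == v or edge in seen:
            continue                              # reject self-loop or duplicate
        seen.add(edge)
        added.append(edge)
        for k in sorted((i, j), reverse=True):    # remove both entries
            pool.pop(k)
    return added, pool                            # pool holds unmatched entries

target = {0: 3, 1: 3, 2: 2, 3: 2, 4: 1, 5: 1}
pool = edge_addition_set({}, target)
added, leftover = connect_from_set(pool, existing_edges=[])
```

A vertex with 100 neighbors and a target of 115 appears 15 times in the set. Entries that can only form rejected edges stay in `leftover`, so some vertices may end up below their target degree.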
The DyGraph Generator allows users to add edges with power-
law characteristics through two commands: “build” and “add edge
power law.” “Build” is a special case of “add edge power law”, to be
used for an initial graph with no existing edges. In that case, the
edge-addition set is created solely from a completely new degree
distribution (since the existing degree distribution is empty).
3.1.6 Removing edges. There are two ways to remove edges with
the DyGraph Generator: by removing a single edge, or by removing
multiple edges in a uniformly random fashion. Note that, when
removing edges, DyGraph can only choose to remove edges that
are already present in the graph. If the existing set of edges is
uniformly random among all vertices, then the multiple edge
removal is also uniformly random. Likewise, if the existing set of
edges was distributed in accordance with Power Law, removing
edges randomly also naturally abides by the Power Law.
3.1.7 Automated script generation. In addition to enabling users to
write their own DyGraph scripts, DyGraph can also automatically
generate scripts that mimic an input dynamic graph: it analyzes the
graph to extract the parameters described (vertices to add, power-
law parameters, bin parameters, etc.), using a pre-set binWidth. This
process can be applied to all timesteps, so that the script generates
a dynamic graph with the same degree distribution trends. This
script can then be used as is, or modified before generating the
synthetic graph.
3.2 Datasets
We provide a collection of real-world dynamic graph datasets for
two reasons. First, researchers may use existing datasets as a fast,
valuable first step to measure initial results. Second, users may ana-
lyze these existing datasets to find and adjust values for DyGraph’s
power-law command parameters.
We list the datasets included with DyGraph in Table 3, along with
several of their key characteristics. Though the original datasets
are publicly available in a variety of formats, we have converted
them into a consistent format: a series of separate edge-list files,
one for each timestep, as described in Section 2. With reference
to Table 3, |𝑉 | is the number of unique vertices in the dataset.
Combined Edges (𝐶𝑜𝑚.|𝐸|) reports the sum of all edges across all
timesteps, counting each occurrence of edges that are listed in
multiple discrete timesteps. Max Edges (𝑀𝑎𝑥|𝐸|) reports the peak
number of edges in a timestep, thus tracking how large the dynamic
graph becomes throughout its evolution. Timesteps reports the
number of individual graph states included in the dynamic graph;
it also corresponds to the number of separate files. Finally, Time
Span is the duration of time represented by the dataset, as reported
by its original source.
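Given one edge list per timestep, the columns of Table 3 other than Time Span can be computed as in the Python sketch below. The function name `dataset_stats` and the tiny sample dataset are made up for illustration.

```python
def dataset_stats(snapshots):
    """Summarize a dynamic graph stored as one edge list per timestep."""
    vertices = {v for edges in snapshots for edge in edges for v in edge}
    return {
        "|V|": len(vertices),                                # unique vertices
        "Com.|E|": sum(len(edges) for edges in snapshots),   # edges summed over timesteps
        "Max|E|": max(len(edges) for edges in snapshots),    # peak edges in a timestep
        "Timesteps": len(snapshots),
    }

snapshots = [
    [(0, 1), (1, 2)],
    [(0, 1), (1, 2), (2, 3)],
    [(2, 3)],
]
stats = dataset_stats(snapshots)
```

Edge (0, 1) is counted twice in Com.|E| because it is listed in two timesteps.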
Our dataset repository includes 9 datasets. The AdTraffic dataset
[9] is a sample of live online advertisement traffic from Criteo, a
computational advertising company. The Bitcoin dataset [18] [17]
represents a trust network of users of Bitcoin OTC, an over-the-
counter bitcoin trading marketplace. We include three social media
datasets: Email, Forum, and Higgs. The Email dataset [20] [36]
represents internal email transmissions among members of a Eu-
ropean research institute, thus a corporate communication setting.
The Forum dataset [27] is the network of both group and direct
messages among members of a private forum for college students.
The Higgs dataset [8] contains the replies, retweets, and mentions
of the Higgs-Boson particle discovery on Twitter.
[Figure 5 plots: three panels (T=0, T=50, T=100), # Vertices versus Degree. Series: Original Dataset, Made by DyGraph, Power Law Function.]
Figure 5: Log-log plots of degree distributions for the original input AdTraffic dataset (blue) and the synthetic dataset generated
by the DyGraph Generator (green) at three time steps: 0, 50, and 100. The plots are overlaid with the power-law function derived
from the dataset by DyGraph (pink).
[Figure 6 plots: three panels, # Vertices versus Degree, each with curves for T=0, T=10, and T=100. The center panel lists the fitted parameters: Ks = 10^3.3, Sd = -2.3 (T=0); Ks = 10^5.7, Sd = -6.1 (T=10); Ks = 10^6.8, Sd = -6.0 (T=100).]
Figure 6: Log-log plots of (left) the degree distributions over time for the original AdTraffic dataset, (center) the power-law
functions and their parameters derived from the original dataset by DyGraph, and (right) the synthetic dataset generated by
DyGraph to match the original.
The Hospital
dataset [31] maps which patients and staff had close contact in a
hospital. We also include two datasets for recommendation system
applications. The Music dataset [6] is the network of 2,000 Last.fm
users and the music they played. The Movies dataset [14] is the
set of 5-star ratings and text reviews of randomly selected users of
MovieLens with >20 reviews. Finally, the Ubuntu dataset [28] is
the set of user interactions in the AskUbuntu online forum.
4 EXPERIMENTAL EVALUATION
In this section, we evaluate how the DyGraph Generator can build
graph datasets that closely resemble real-world graphs. We divide
our analysis into three sections: defining the evaluation metric,
showing how DyGraph can create graphs that mimic real-world
graphs, and demonstrating how users can edit these DyGraph
scripts to customize their graphs’ properties to their own needs.
4.1 Evaluation Metric
To evaluate whether DyGraph creates graphs with similar prop-
erties to an input graph, we must choose a metric to compare the
two graphs. Common metrics include diameter, clustering coef-
ficient, triangle counting, and degree distribution. For this work,
metrics must be applicable to all possible real-world graphs, allow
us to compare the size of two graphs, and demonstrate the dataset’s
sparsity, an essential factor in evaluating performance. Diameter
requires that there exist paths from all vertices to all other vertices
and ignores any disconnected components. Many real-world graphs
do not have this property. Triangle counting is heavily affected by
the size of the graph. Clustering coefficient can be measured for all
graphs, but it is unclear whether differences in power-law graph
sizes affect the clustering coefficient. We use degree distribution,
as it is the only common metric that fully meets these conditions.
4.2 Mimicking Existing Graphs
We evaluate whether the DyGraph Generator builds synthetic
graphs with degree distributions like those of other input, real-
world graphs. We use one of the largest of our dynamic graphs:
AdTraffic. As the graph changes over time, the degree distribution
also changes, so we evaluate the distribution at multiple timesteps.
Figure 5 shows degree distributions of AdTraffic and DyGraph’s
generated dataset at three timesteps. For degrees from 1 to 200, the
correlation coefficient (𝑟) of the DyGraph-generated dataset against
the original is 2.98x closer to ideal (𝑟 = 1) than that of the Power
Law function against the original when T=50, and 5.57x closer when
T=100. Similarly, the deviations of Power Law and DyGraph from
the original (two-sample Kolmogorov–Smirnov metrics) improve
from 0.053 to 0.023 for 𝑇 = 50 and 0.086 to 0.015 for 𝑇 = 100.
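The two comparison measures can be sketched in plain Python as follows. How exactly the paper computes them is not stated in this section (for example, whether 𝑟 is taken over raw or log-scaled counts), so everything here is an assumption: the helper names, the sample histograms, and in particular `times_closer`, which is only one plausible reading of "closer to ideal".

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ks_statistic(hist_a, hist_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    na, nb = sum(hist_a.values()), sum(hist_b.values())
    ca = cb = 0
    gap = 0.0
    for deg in sorted(set(hist_a) | set(hist_b)):
        ca += hist_a.get(deg, 0)
        cb += hist_b.get(deg, 0)
        gap = max(gap, abs(ca / na - cb / nb))
    return gap

def times_closer(r_baseline, r_candidate):
    """Assumed reading: ratio of the two distances from the ideal r = 1."""
    return (1 - r_baseline) / (1 - r_candidate)

original = {1: 50, 2: 30, 3: 20}      # degree -> number of vertices
synthetic = {1: 60, 2: 20, 3: 20}
degrees = sorted(original)
r = pearson_r([original[d] for d in degrees], [synthetic[d] for d in degrees])
d_ks = ks_statistic(original, synthetic)
```

For both measures a closer match to the original gives a value nearer its ideal: 1 for the correlation coefficient and 0 for the KS statistic.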
We make a few observations from these plots. First, the power-
law function aligns with the AdTraffic dataset only for low-degree
vertices (i.e., the left side of each plot). Once the power-law line falls
below ∼10 vertices, the degree distributions of both the original and
the synthetic dataset diverge from the power-law line (pink). This
trait shows, first, that the binning process successfully ensures that
the synthetic graph has a similar number of higher-degree vertices,
more than would be included by Power Law alone. Second, the
larger the graph, the more closely DyGraph’s degree distribution
aligns with the original dataset, as DyGraph has more opportu-
nities to create the precise distribution that it is targeting. Third,
while DyGraph generally creates a degree distribution shape which
matches the original far closer than the power-law function, it also
tends to create vertices with slightly lower average degree than the
original.
[Figure 7 plots: (a) # Vertices versus Degree at T=30; (b) # Vertices versus T and (c) # Edges versus T, for T from 10 to 50. Series: Original, DyGraph, DyG (x2).]
Figure 7: (a) Log-log plot of degree distribution, (b) vertex
count across time, and (c) edge count across time for the
original AdTraffic dataset, a matching synthetic DyGraph-generated
graph, and a synthetic graph generated by doubling
the original graph size in the script.
This deviation is caused by two constraints in DyGraph:
self-loops and duplicate edges are eliminated, and vertices with no
neighbors are omitted. As DyGraph is an open-source suite, users
may remove these constraints if desired.
Figure 6 illustrates how the AdTraffic dynamic graph evolves
over time and how the DyGraph-generated graph matches this
evolution. Figure 6 also highlights how the power-law properties
change over time. Initially (T=0 to T=10), more vertices are added
and connected to few neighbors, but few existing vertices are connected
to many neighbors. This leads to a higher scaling factor (𝐾𝑠
rises from 10^3.3 to 10^5.7) but a sharper relative decline as degree
increases (𝑆𝑑 falls from -2.3 to -6.1). After the initial growth (T=10
to T=100), 𝐾𝑠 continues to increase and 𝑆𝑑 stays steady around
-6.0. This reflects a trend in which new vertices with few neighbors
continue to join the graph, but many more of the existing vertices
connect to additional neighbors already in the graph.
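One way to obtain values like 𝐾𝑠 and 𝑆𝑑 from a degree histogram is a least-squares line in log-log space. The sketch below assumes the power-law section is modelled as n(d) = K_s * d^S_d, where n(d) is the number of vertices of degree d. Both that functional form and the fitting method are assumptions for illustration; the text here does not state how DyGraph computes these parameters.

```python
import math

def fit_power_law(degree_counts):
    """Least-squares line through the points (log10 d, log10 n(d)).

    degree_counts maps degree d -> number of vertices n(d), all positive.
    Returns (K_s, S_d) for the assumed model n(d) = K_s * d ** S_d.
    """
    xs = [math.log10(d) for d in degree_counts]
    ys = [math.log10(n) for n in degree_counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return 10 ** intercept, slope

# Toy histogram generated exactly from K_s = 1000 and S_d = -2.
counts = {d: 1000 * d ** -2 for d in (1, 2, 4, 5, 10)}
k_s, s_d = fit_power_law(counts)
print(round(k_s, 3), round(s_d, 3))
```

In this picture, a rising 𝐾𝑠 lifts the whole line (more low-degree vertices), while a more negative 𝑆𝑑 steepens it (relatively fewer high-degree vertices).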
4.3 Customizing Graph Properties
As described in Section 3.1, we designed DyGraph to be capable
of automatically generating a script which can, in turn, be used
to generate a synthetic graph with the same characteristics as the
original. We leveraged this approach to provide users the flexibility
to modify existing graph properties, and create variants from the
original real-world graphs provided with the benchmark suite. To
demonstrate this feature, we take the AdTraffic dataset, have DyGraph
produce the generating script, then modify that script to
double both the vertex and edge counts.
We edited the script as follows. First, for each command adding
𝑛 vertices, we changed 𝑛 to 2𝑛. Second, for each command adding
edges to follow power-law properties, we scaled up both sections
of the degree distribution plot. For the power-law section, we
increased 𝐾𝑠 so as to double the number of vertices at each degree.
Finally, for the binning section of the degree distribution, we attained
our goal by doubling each localMax.
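The effect of the 𝐾𝑠 edit can be checked with a short derivation. It assumes that the power-law section models the number of vertices of degree d as n(d) = K_s d^{S_d}, which matches the description of 𝐾𝑠 as a scaling factor but is not quoted from the script format itself.

```latex
n(d) = K_s \, d^{S_d}
\quad\Longrightarrow\quad
n'(d) = (2K_s)\, d^{S_d} = 2\, n(d),
\qquad
\log_{10} n'(d) = \log_{10} n(d) + \log_{10} 2 \approx \log_{10} n(d) + 0.301 .
```

Under that assumption, doubling 𝐾𝑠 leaves the slope 𝑆𝑑 unchanged and raises the curve by a constant of about 0.3 decades on a log-log plot, so the shape of the degree distribution is preserved.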
[Figure 8 plot: execution time in seconds against T = 0 to 40, annotated with #Edges from 0.57M to 3.25M. Series: Original AdTraffic, 2x Vertices, 2x Edges.]
Figure 8: DyGraph Generator execution time for generating
the synthetic graphs of Figure 7.
Figure 7 plots the degree distribution of the original AdTraffic
dataset, overlaid with that of the matching synthetic DyGraph-
generated graph, and also with that of the “doubled” synthetic graph,
obtained with the modified script. As the figure presents the graphs
in a log-log plot, the doubled synthetic graph closely overlaps with
the original synthetic graph, indicating extremely similar degree
distributions. Figure 7 also reports how the numbers of vertices
and edges change over time. Note how the edge counts of the original
graph and the synthetic DyGraph graph align closely. In addition, the
“doubled” synthetic graph reports approximately double the vertices
and edges. Note also how the baseline synthetic graph (green)
consistently includes slightly fewer vertices than the real-world
graph, because of the constraints described in Section 4.2.
4.4 DyGraph Generator Performance
The DyGraph Generator is open-source software written in C++.
Figure 8 reports execution times for building the synthetic graphs
identified in Figure 7, timed on a machine using Ubuntu 20.04, an
Intel i7-7700, 32GB of memory, and a 2TB HDD. We observe that
increasing either dimension of a graph (vertices or edges) extends
the execution time. We further observe that increasing |𝐸| takes
more time than a similar increase in |𝑉 |. Three factors explain this.
First, adding vertices alone requires almost no additional compu-
tation, as new vertices start with no neighbors. Runtime increases
only because a larger set of vertices increases the size of the distri-
bution when adding edges. Second, adding edges with power law
properties dominates execution time, as all other graph updates re-
quire trivial amounts of compute beyond I/O. Finally, doubling the
vertex count without increasing the edge count results in more vertices
without edges, which are omitted from the output graph. DyGraph
is currently single-threaded, and we hope to parallelize DyGraph
as part of future work.
5 RELATED WORK
Prior works have identified that there are few dynamic graph gen-
erators, emphasizing that better such tools are needed [32] [13] [5]
[35]. Some static graph generators can be used to build dynamic
graphs. Kronecker graph generators [19], for example, use Kro-
necker multiplication to iteratively build increasingly-large graphs
with power-law degree distribution. Similarly, Barabási-Albert (BA)
models [3] iteratively attach new vertices to pre-existing vertices
in the graph using preferential attachment. Sets of new Kronecker
multiplications or preferential attachments may be used as new
graph states, collectively forming a dynamic graph. However, such
generation methods enforce two undesirable restrictions: graph
sizes increase monotonically (i.e., vertices and edges are not re-
moved) and the density increases as the graph size increases.
Görke [10] proposed a model to generate uniformly random
graphs to evaluate algorithms for dynamic graph clustering. These
graphs follow an evolving ground-truth for clustering, which can
be compared against the clusters discovered by the algorithm. Un-
like prior work, vertices and edges in this model may be added or
removed over time. However, the model can only generate graphs
with uniformly-random degree distribution and is thus unable to
mimic many real-world applications.
Purohit’s more flexible model [30] uses atomic, temporal graph
motifs (i.e., sub-graph patterns of ≤ 3 vertices). This model is capa-
ble of mimicking many key properties of existing dynamic graphs,
including the original dataset’s growth rate, structure, and degree
distribution. However, vertices and edges cannot be removed, again
limiting its ability to emulate key applications [23].
Waudby [35] extended the LDBC Social Network Benchmark
[2] to include insertions and deletions for dynamic graphs using
lifespan. However, there is no functionality to mimic properties of
existing graphs. This limitation restricts users’ ability to expand
their own graphs which have specific desired properties, but are
too small to be used for effectively evaluating their solutions.
6 CONCLUSION
Dynamic graph processing is quickly becoming a critical area of
data mining and analytics. As researchers develop new algorithms,
optimizations, and hardware solutions for dynamic graphs, there
is an urgent need for more robust infrastructure to evaluate these
solutions. In this work, we present DyGraph, a solution for dynamic
graph workloads that includes both a wide range of real-world
dynamic graph datasets and a novel, flexible DyGraph Generator
for synthetic dynamic graph datasets. We demonstrate how the
synthetic graphs created by the DyGraph Generator closely mimic
the properties of real-world dynamic graphs, attaining 3 to 5.5 times
more accurate graph datasets than Power Law. Further, we illustrate
that the DyGraph Generator can be leveraged to automatically
produce a script describing a real-world graph by analyzing it: such
scripts can be later modified to generate new synthetic graphs of
any size and any power-law characteristics, allowing users to create
variants of real-world datasets to fit their needs and evaluations.
ACKNOWLEDGMENTS
This work was supported by the Applications Driving Architectures
(ADA) Research Center, a JUMP Center co-sponsored by SRC and
DARPA.
REFERENCES
[1] Abraham Addisie and Valeria Bertacco. 2020. Centaur: Hybrid Processing in
On/Off-chip Memory Architecture for Graph Analytics. In Proc. DAC.
[2] Renzo Angles, János Benjamin Antal, Alex Averbuch, Peter A. Boncz, Orri Erling,
Andrey Gubichev, Vlad Haprian, Moritz Kaufmann, Josep Lluís Larriba-Pey,
Norbert Martínez-Bazan, József Marton, Marcus Paradies, Minh-Duc Pham, Arnau
Prat-Pérez, Mirko Spasic, Benjamin A. Steer, Gábor Szárnyas, and Jack Waudby.
2020. The LDBC Social Network Benchmark. In arXiv CoRR.
[3] Albert-László Barabási and Réka Albert. 1999. Emergence of scaling in random
networks. In Science.
[4] Maciej Besta, Marc Fischer, Vasiliki Kalavri, Michael Kapralov, and Torsten
Hoefler. 2021. Practice of streaming processing of dynamic graphs: Concepts, models,
and systems. In Proc. TPDS.
[5] Angela Bonifati, Irena Holubová, Arnau Prat-Pérez, and Sherif Sakr. 2020. Graph
Generators: State of the Art and Open Challenges. In ACM Comput. Surv.
[6] Iván Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2011. Workshop on Information
Heterogeneity and Fusion in Recommender Systems. In Proc. RecSys.
[7] Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi
Muthukrishnan. 2015. One trillion edges: Graph processing at facebook-scale. In
Proc. VLDB.
[8] Manlio De Domenico, Antonio Lima, Paul Mougel, and Mirco Musolesi. 2013.
The Anatomy of a Scientific Rumor. In Nature Sci. Rep.
[9] Eustache Diemert, Julien Meynet, Pierre Galland, and Damien Lefortier. 2017.
Attribution Modeling Increases Efficiency of Bidding in Display Advertising. In
Proc. ADKDD.
[10] Robert Görke, Roland Kluge, Andrea Schumm, Christian Staudt, and Dorothea
Wagner. 2012. An efficient generator for clustered dynamic random networks. In
Proc. MedAlg.
[11] Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret
Martonosi. 2016. Graphicionado: A high-performance and energy-efficient
accelerator for graph analytics. In Proc. MICRO.
[12] Kathrin Hanauer, Monika Henzinger, and Christian Schulz. 2020. Fully dynamic
single-source reachability in practice: An experimental study. In Proc. ALENEX.
[13] Kathrin Hanauer, Monika Henzinger, and Christian Schulz. 2021. Recent advances
in fully dynamic graph algorithms. arXiv preprint.
[14] Maxwell Harper and Joseph Konstan. 2015. The MovieLens Datasets: History
and Context. In ACM Trans. iiS.
[15] Takanori Hayashi, Takuya Akiba, and Ken-ichi Kawarabayashi. 2016. Fully
dynamic shortest-path distance query acceleration on massive networks. In Proc.
CIKM.
[16] Vasiliki Kalavri, Vladimir Vlassov, and Seif Haridi. 2018. High-Level Programming
Abstractions for Distributed Graph Processing. In IEEE Trans. KDE.
[17] Srijan Kumar, Bryan Hooi, Disha Makhija, Mohit Kumar, Christos Faloutsos, and
V.S. Subrahmanian. 2018. REV2: Fraudulent User Prediction in Rating Platforms.
In Proc. WSDM.
[18] Srijan Kumar, Francesca Spezzano, V. S. Subrahmanian, and Christos Faloutsos.
2016. Edge Weight Prediction in Weighted Signed Networks. In Proc. ICDM.
[19] Jurij Leskovec, Deepayan Chakrabarti, Jon Kleinberg, and Christos Faloutsos.
2005. Realistic, mathematically tractable graph generation and evolution, using
kronecker multiplication. In Proc. ECML PKDD.
[20] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. 2007. Graph Evolution:
Densification and Shrinking Diameters. In ACM Trans. KDD.
[21] Jure Leskovec and Rok Sosič. 2016. SNAP: A General-Purpose Network Analysis
and Graph-Mining Library. In ACM Trans. IST.
[22] Zhe Lin, Fan Zhang, Xuemin Lin, Wenjie Zhang, and Zhihong Tian. 2021.
Hierarchical core maintenance on large dynamic graphs. In Proc. VLDB.
[23] László Lőrincz, Júlia Koltai, Anna Fruzsina Győr, and Károly Takács. 2019.
Collapse of an online social network: Burning social capital to create it? In Jour. Soc.
Netw.
[24] Grzegorz Malewicz, Matthew Austern, Aart Bik, James Dehnert, Ilan Horn, Naty
Leiser, and Grzegorz Czajkowski. 2010. Pregel: a system for large-scale graph
processing. In Proc. SIGMOD.
[25] Andrew McCrabb, Eric Winsor, and Valeria Bertacco. 2019. DREDGE: Dynamic
repartitioning during dynamic graph execution. In Proc. DAC.
[26] Mark Newman. 2005. Power laws, Pareto distributions and Zipf’s law. In Jour.
Contemp. Phys.
[27] Tore Opsahl. 2013. Triadic closure in two-mode networks: Redefining the global
and local clustering coefficients. In Jour. Soc. Netw.
[28] Ashwin Paranjape, Austin Benson, and Jure Leskovec. 2017. Motifs in Temporal
Networks. In Proc. WSDM.
[29] Tiago Peixoto. 2020. The Netzschleuder Network Catalogue and Repository.
[30] Sumit Purohit, Lawrence Holder, and George Chin. 2018. Temporal graph
generation based on a distribution of temporal motifs. In Proc. MLG.
[31] Ryan Rossi and Nesreen Ahmed. 2015. The Network Data Repository with
Interactive Graph Analytics and Visualization. In Proc. AAAI.
[32] Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, and Tamer Özsu.
2017. The Ubiquity of Large Graphs and Surprising Challenges of Graph
Processing. In Proc. VLDB.
[33] Xuanhua Shi, Zhigao Zheng, Yongluan Zhou, Hai Jin, Ligang He, Bo Liu, and
Qiang-Sheng Hua. 2018. Graph processing on GPUs: A survey. In CSUR.
[34] Philip Stutz, Abraham Bernstein, and William Cohen. 2010. Signal/Collect: Graph
Algorithms for the (Semantic) Web. In Proc. ISWC.
[35] Jack Waudby, Benjamin Steer, Arnau Prat-Pérez, and Gábor Szárnyas. 2020.
Supporting Dynamic Graphs and Temporal Entity Deletions in the LDBC Social
Network Benchmark’s Data Generator. In Proc. GRADES-NDA.
[36] Hao Yin, Austin Benson, Jure Leskovec, and David Gleich. 2017. Local Higher-
Order Graph Clustering. In Proc. SIGKDD.