Graph Analysis: New Algorithm Models, New Architectures

E. Jason Riedy and a large supporting cast of students
Georgia Institute of Technology
ACM Computing Frontiers, May

Outline
Motivation and Applications
New Algorithm Model
New Architectures
Closing

(insert prefix here)-scale data analysis
Cyber-security: Identify anomalies, malicious actors
Health care: Finding outbreaks, population epidemiology
Social networks: Advertising, searching, grouping
Intelligence: Decisions at scale, regulating markets, smart & sustainable cities
Systems biology: Understanding interactions, drug design
Power grid: Disruptions, conservation
Simulation: Discrete events, cracking meshes
Changes are important. Cannot stop the world...

Potential Applications
• Social Networks
  • Identify communities, influences, bridges, trends, anomalies (trends before they happen)...
  • Potential to help social sciences, city planning, and others with large-scale data.
• Cyber-security
  • Determine if new connections can access a device or represent new threat in <5ms...
  • Is data transfer by a virus / persistent threat?
• Bioinformatics, health
  • Construct gene sequences, analyze protein interactions, map brain interactions
• Credit fraud forensics ⇒ detection ⇒ monitoring
  • Real-time integration of all the customer’s data

Streaming graph data
Network data rates:
• Gigabit ethernet: k – . M packets per second
• Over flows per second on GigE (< . µs)
Person-level data rates:
• M posts per day on Twitter ( k / sec)
• M posts per minute on Facebook ( k / sec)
Should analyze only changes and not entire graph.
Throughput & latency trade off and expose different
levels of concurrency.
www.internetlivestats.com/twitter-statistics/
www.jeffbullas.com/ / / / -awesome-facebook-facts-and-statistics-you-need-to-check-out/

Streaming graph analysis
Terminology (not universal):
• Streaming changes into a massive, evolving graph
• Need to handle deletions as well as insertions
Previous STINGER performance results (x - ):
Data ingest: > M upd/sec [Ediger, McColl, Poovey, Campbell, & Bader ]
Clustering coefficients: > K upd/sec [R, Meyerhenke, B, E, & Mattson ]
Connected comp.: > M upd/sec [McColl, Green, & B ]
Community clustering: > K upd/sec∗ [R & B ]
PageRank: Up to × latency improvement [R ]

Starting incremental / streaming algorithms
• Incremental and streaming algorithms start somewhere.
• Initial, static computation can take a rather long time...
• During which the graph cannot change?
• What about supporting many simultaneous analyses?
[Figure: Data ingest rates, R-MAT into R-MAT, scales & . Update rate (upd/s) versus batch size, both on log scales; platforms: Power8, Haswell, Haswell−30.]
What can we run while the graph changes?

[Figure: diagram labeled Graph, Changes, PageRank, Clustering Coefficients, Clusters, s-t Path.]

What if we don’t hold up changes?
When is an algorithm valid?
Analyze concurrently with the graph changes, and
produce a result correct for the starting graph and
some implicit subset of concurrent changes.
• No locking beyond atomic operations.
• No versioned data structure.
• No stopping.
Extreme model for extreme data rates.
Chunxing Yin, Riedy, Bader. “Validity of Graph Algorithms on
Streaming Data.” . (in submission)
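
The validity condition above can be read as a check: does the observed result match the starting graph plus some subset of the concurrent changes? Below is a minimal brute-force sketch for one toy property (vertex degrees, no self loops). The helper names (`apply_changes`, `degree_vector`, `is_valid_result`) and the `('+', edge)` / `('-', edge)` change encoding are illustrative assumptions, not taken from the cited paper, and the subset enumeration is exponential, so this is only for tiny examples.

```python
from itertools import combinations

def apply_changes(edges, changes):
    """Apply ('+', (u, v)) insertions and ('-', (u, v)) deletions in order."""
    graph = set(edges)
    for op, (u, v) in changes:
        edge = frozenset((u, v))
        if op == "+":
            graph.add(edge)
        else:
            graph.discard(edge)
    return graph

def degree_vector(edges, n):
    """Degrees of vertices 0..n-1 for a set of undirected edges."""
    deg = [0] * n
    for edge in edges:
        for v in edge:
            deg[v] += 1
    return deg

def is_valid_result(start_edges, changes, observed, n):
    """True if `observed` equals the degrees of the start graph plus some
    subset of the concurrent changes (applied in stream order)."""
    start = {frozenset(e) for e in start_edges}
    for k in range(len(changes) + 1):
        for subset in combinations(changes, k):
            if degree_vector(apply_changes(start, subset), n) == observed:
                return True
    return False

start = [(0, 1), (1, 2)]          # path 0 - 1 - 2
changes = [("-", (0, 1))]         # one concurrent deletion
print(is_valid_result(start, changes, [1, 2, 1], 3))  # deletion not seen
print(is_valid_result(start, changes, [0, 1, 1], 3))  # deletion seen
print(is_valid_result(start, changes, [1, 1, 1], 3))  # matches neither graph
```

The first two results are accepted and the third is rejected, which is the distinction the degree example on the following slides relies on.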

Sample of other execution models
• Put in a query, wait for sufficient data [Phillips, et al. at Sandia]
  • Different but very interesting model.
• Evolving: Sample, accurate w/high-prob.
  • Difficult to generalize into graph results (e.g. shortest path tree).
• Classical: dynamic algorithms, versioned data
  • Can require drastically more storage, possibly a copy of the graph per property, or more overhead for techniques like read-copy-update.
Generally do not address the latency of computing the “static” starting point.

Algorithm validity in our model: Example.
Can you compute degrees in an undirected graph (no self
loops) concurrently with changes?
Algorithm: Iterate over vertices, count the number of
neighbors.
Compute deg(v1) → delete edge → compute deg(v2)
If the deleted edge joins v1 and v2, it is counted at v1 but not at v2.
Cannot correspond to an undirected graph at all!
Valid for our model? No!
Not incorrect, just not valid for our model.
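
A minimal sequential simulation of the vertex-scan failure, assuming a tiny path graph and a single deletion injected between two vertex visits. The function name and the `delete_after` parameter are hypothetical; a real run would have the deletion arrive from another thread. The resulting degrees sum to an odd number, which no undirected graph can produce, since every edge contributes two to the degree sum.

```python
def degrees_by_vertex_scan(adj, order, delete_after, edge):
    """Count neighbors vertex by vertex; the edge deletion lands right
    after vertex `delete_after` has been counted."""
    deg = {}
    for v in order:
        deg[v] = len(adj[v])
        if v == delete_after:
            a, b = edge
            adj[a].discard(b)
            adj[b].discard(a)
    return deg

adj = {1: {2}, 2: {1, 3}, 3: {2}}   # path 1 - 2 - 3
deg = degrees_by_vertex_scan(adj, order=[1, 2, 3], delete_after=1, edge=(1, 2))
print(deg, "degree sum:", sum(deg.values()))
```

Vertex 1 still counts the edge to vertex 2, while vertex 2 no longer counts it, so the output matches neither the starting graph nor the graph after the deletion.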

Algorithm validity in our model: Example.
Can you compute degrees in an undirected graph (no self
loops) concurrently with changes?
Algorithm: Iterate over edges, increment the degrees of
the endpoints.
Inc deg(v1), deg(v2) (later...)
delete edge
Corresponds to the beginning graph plus a subset of
concurrent changes.
Valid for our model? Yes!
Undirected stored as directed: skip edges with v1 ≥ v2.
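
A sequential sketch of the edge-scan variant, again with one injected deletion. It assumes the undirected graph is stored as directed pairs in both directions and that each stored edge is either seen whole or not at all; `delete_at_step` is a hypothetical knob for where the deletion lands during the scan. Whatever the timing, the result equals the degrees of either the starting graph or the graph with the deletion applied.

```python
def exact_degrees(undirected_edges, n):
    """Reference degrees for a fixed undirected edge list."""
    deg = [0] * n
    for u, v in undirected_edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def degrees_by_edge_scan(stored, n, delete_at_step, edge):
    """Scan stored directed edges; skip pairs with u >= v so each
    undirected edge updates both endpoints exactly once."""
    live = set(stored)
    deg = [0] * n
    for step, (u, v) in enumerate(stored):
        if step == delete_at_step:          # the concurrent deletion arrives
            live.discard(edge)
            live.discard(edge[::-1])
        if u >= v:
            continue
        if (u, v) in live:
            deg[u] += 1
            deg[v] += 1
    return deg

stored = [(0, 1), (1, 0), (1, 2), (2, 1)]   # path 0 - 1 - 2, both directions
before = exact_degrees([(0, 1), (1, 2)], 3)
after = exact_degrees([(1, 2)], 3)
# Try every possible deletion time, including "never during the scan".
results = [degrees_by_edge_scan(stored, 3, s, (0, 1))
           for s in range(len(stored) + 1)]
print(all(r in (before, after) for r in results))
```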

Algorithm validity in our model
[Figure: example with source s, edge weights w(e), a weight change, and ∆.]
• What is valid?
  • Typical (direction optimizing) BFS
  • Shiloach-Vishkin connected components
  • PageRank, Katz via Jacobi
  • Label propagation
  • Triangle counting (carefully!)
  • Saved decisions (can make a copy)...
    • Extracting a subgraph or path.
• What may be invalid?
  • Making a decision twice in implementations
    • ∆-stepping SSSP: Decrease a weight below ∆
    • Degree optimization: Cross threshold, miss vertex
  • Applying old or different information
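
The "cross threshold, miss vertex" failure can be sketched as follows. This is an illustrative simulation, not code from any of the implementations mentioned: `ChangingDegrees` is a hypothetical stand-in for a degree array that another thread modifies, with the change injected after a fixed number of reads.

```python
class ChangingDegrees:
    """Degree array where one vertex's degree changes at a given read count."""
    def __init__(self, degrees, change_at_read, vertex, new_degree):
        self.degrees = list(degrees)
        self.change_at_read = change_at_read
        self.vertex = vertex
        self.new_degree = new_degree
        self.reads = 0

    def read(self, v):
        if self.reads == self.change_at_read:   # concurrent update lands here
            self.degrees[self.vertex] = self.new_degree
        self.reads += 1
        return self.degrees[v]

def split_deciding_twice(deg, n, threshold):
    """Two passes, so each vertex's threshold test is evaluated twice."""
    heavy = [v for v in range(n) if deg.read(v) >= threshold]
    light = [v for v in range(n) if deg.read(v) < threshold]
    return heavy, light

def split_deciding_once(deg, n, threshold):
    """One read per vertex; the decision is made once and saved."""
    heavy, light = [], []
    for v in range(n):
        (heavy if deg.read(v) >= threshold else light).append(v)
    return heavy, light

# Vertex 0 grows from degree 1 to 6 right after the first pass.
heavy2, light2 = split_deciding_twice(ChangingDegrees([1, 5, 2], 3, 0, 6), 3, 4)
heavy1, light1 = split_deciding_once(ChangingDegrees([1, 5, 2], 3, 0, 6), 3, 4)
print("twice:", heavy2, light2)
print("once: ", heavy1, light1)
```

In the two-pass version vertex 0 lands in neither list, while the single-pass version places every vertex exactly once.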

Fun properties for one-shot queries
Due to Chunxing Yin, under sensible assumptions:
1. You can produce a single-change stream to demonstrate invalidity.
  • Idea: Start with a graph that incorporates all the visible changes, introduce the one change at the right time.
2. Algorithms that produce a subgraph of their input cannot be guaranteed to run concurrently with changes and always produce moment-in-time outputs.
  • Idea: Any time a snapshot result could happen, delete then re-insert an edge from the output.

On to streaming...
Can we update graph metrics as new data arrives without
just re-running?
• Track what changed during the one-shot query.
• Update locally around those changes, while other changes are occurring.
• If the update is valid, can repeat to follow a streaming graph.
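[Figure: timeline. Initial computation, then repeated updates (Upd. w/∆), with a new batch of changes ∆ arriving during each step.]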
Examples: PageRank via refinement; connected components by maintaining a spanning forest.
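
A rough sketch of the update-instead-of-rerun idea for PageRank, assuming a plain Jacobi-style iteration that is warm-started from the previous vector after a batch of changes. This is sequential and stands in for, rather than reproduces, the refinement method referred to above; the function name, the damping factor `alpha`, and the tiny example graph are all assumptions.

```python
def pagerank(out_edges, n, x0=None, alpha=0.85, tol=1e-10, max_iter=1000):
    """Jacobi-style PageRank; returns (vector, iterations used).
    Dangling vertices spread their rank uniformly."""
    x = list(x0) if x0 is not None else [1.0 / n] * n
    for iteration in range(1, max_iter + 1):
        nxt = [(1.0 - alpha) / n] * n
        for u in range(n):
            targets = out_edges.get(u, [])
            if targets:
                share = alpha * x[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            else:
                share = alpha * x[u] / n
                for v in range(n):
                    nxt[v] += share
        diff = sum(abs(a - b) for a, b in zip(nxt, x))
        x = nxt
        if diff < tol:
            return x, iteration
    return x, max_iter

before = {0: [1], 1: [2], 2: [0], 3: [0]}
x_before, _ = pagerank(before, 4)            # initial, "static" computation

after = {0: [1, 3], 1: [2], 2: [0], 3: [0]}  # batch ∆: insert edge 0 -> 3
x_warm, iters_warm = pagerank(after, 4, x0=x_before)   # update from old result
x_cold, iters_cold = pagerank(after, 4)                # full re-run
print(iters_warm, iters_cold)
print(max(abs(a - b) for a, b in zip(x_warm, x_cold)))
```

Both runs converge to the same vector up to the tolerance; the warm start only changes where the iteration begins.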

Open issues
Difficult problems: Updating triangle counts efficiently!
• Option: re-counting a region around changes,
stopping once counts do not change.
• Can mis-count on the region’s border, but only at
changes.
• Next run can fix those... A looser model?
Some algorithms essentially copy subgraphs.
• What are the size bounds?
• Can those bounds characterize algorithms /
properties?
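
A simplified, sequential sketch of re-counting triangles only in a region around a change. It assumes a single edge insertion into a quiescent graph, in which case the affected region is exactly the two endpoints plus their common neighbors; the open problem above concerns doing this while further changes arrive. Function names are hypothetical.

```python
def triangles_at(adj, v):
    """Number of triangles containing v (undirected, no self loops)."""
    nbrs = adj[v]
    # Each triangle {v, w, x} is found twice: once via w and once via x.
    return sum(len(nbrs & adj[w]) for w in nbrs) // 2

def insert_edge_and_recount(adj, counts, u, v):
    """Insert edge {u, v}, then re-count only the vertices whose
    triangle counts can have changed."""
    adj[u].add(v)
    adj[v].add(u)
    region = {u, v} | (adj[u] & adj[v])
    for w in region:
        counts[w] = triangles_at(adj, w)
    return region

# 4-cycle 0 - 1 - 2 - 3 - 0 has no triangles.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
counts = {v: triangles_at(adj, v) for v in adj}
region = insert_edge_and_recount(adj, counts, 0, 2)   # adds chord 0 - 2
print(sorted(region), counts)
```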

Limitations of current architectures
• Graph analysis often uses relatively narrow memory accesses, e.g. separate -byte integers.
• Currently under-utilizing memory bandwidth.
• One-eighth of a cache line: one-eighth of bandwidth.
• Typical DRAM pages are ≥ KiB. Entire page must be
powered on for an operation.
• New HBM: Kib-wide ⇒ potentially / th BW
A new approach from Emu Technology: Lightweight
threads migrating to data in narrow-channel DRAM.
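
The "one-eighth of a cache line: one-eighth of bandwidth" arithmetic can be written out as below. The 64-byte line size, the 8-byte access size, and the 100 GB/s peak figure are assumptions chosen for illustration; none of them is stated on the slide.

```python
def useful_bandwidth(peak_gb_per_s, access_bytes, line_bytes=64):
    """Fraction of each fetched line that is actually used, and the
    resulting useful bandwidth, if every access pulls in a whole line."""
    fraction = min(access_bytes, line_bytes) / line_bytes
    return fraction, peak_gb_per_s * fraction

# Hypothetical machine: 100 GB/s peak, isolated 8-byte loads.
fraction, effective = useful_bandwidth(100.0, 8)
print(fraction, effective)
```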

Emu PGAS architecture
[Figure: Emu system diagram. 1 nodelet: Gossamer Core 1 ... Gossamer Core 4, Memory-Side Processor, Migration Engine. 8 nodelets per node, 64 nodelets per Chick. Each node also has a Stationary Core, Disk I/O, and RapidIO links.]
• Multithreaded multicore
• Memory-side “processor” for atomics, etc. w/NCDIMM
• Stationary core for OS
• Threads migrate in hardware on reads!

Emu Chick prototype
Experimental system:
• Soft processors (Arria FPGAs)
• One Gossamer Core (GC) per nodelet, max threadlets
• Memory and cores are under-clocked.
• Firmware bugs limit inter-node migration, file I/O

Pointer chasing benchmark
Data-dependent loads, fine-grained access
Ordered
Intra-block shuffle: weak locality
Full block shuffle: weak locality
Eric Hein, Young, Srinivas Eswar, Jiajia Li, Patrick Lavin, Vuduc, Riedy.
“An Initial Characterization of the Emu Chick,” AsHES .
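
A small sketch of how such a pointer-chasing workload might be set up. The mode names follow the plot legend later in the talk; what each shuffle does (shuffle within blocks, shuffle the order of blocks, or both) is an assumption here, not a description of the authors' benchmark code. Python only illustrates the access pattern and is not suitable for measuring memory bandwidth.

```python
import random

MODES = ("ordered", "intra_block_shuffle", "block_shuffle", "full_block_shuffle")

def build_chain(n, block_size, mode, seed=0):
    """Return (next_index, start) for one cycle through n elements."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    rng = random.Random(seed)
    blocks = [list(range(s, min(s + block_size, n)))
              for s in range(0, n, block_size)]
    if mode in ("intra_block_shuffle", "full_block_shuffle"):
        for block in blocks:
            rng.shuffle(block)          # scramble order inside each block
    if mode in ("block_shuffle", "full_block_shuffle"):
        rng.shuffle(blocks)             # scramble the order of the blocks
    order = [i for block in blocks for i in block]
    next_index = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        next_index[a] = b               # each element points to its successor
    return next_index, order[0]

def chase(next_index, start):
    """Follow pointers until returning to start; each load depends on the last."""
    steps, i = 0, start
    while True:
        i = next_index[i]
        steps += 1
        if i == start:
            return steps

steps_by_mode = {m: chase(*build_chain(64, 8, m)) for m in MODES}
print(steps_by_mode)
```

Every mode produces a single cycle over all elements, so only the memory order of the traversal differs between them.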

Pointer Chasing: Intel Xeon
Performance varies drastically.

Pointer Chasing: Emu Chick
Matches simulation to a consistent factor of two.
Simulation of larger, full Emu systems shows promising
results... More later.

Pointer Chasing: Bandwidth utilization
Full shuffle. Measured against STREAM.

Pointer Chasing: Bandwidth scaling
Full machine results. STREAM around GB/s.
Still need many threads, but not as many as MTA/XMT.
(Thanks to Eric Hein.)

[Figure: Pointer Chasing (Emu Chick, 64 nodelets). Memory bandwidth (MB/s) versus block size (number of 16B elements, 1 to 4M), in panels for 1024, 2048, and 4096 threads; series: block_shuffle, intra_block_shuffle, full_block_shuffle; reference line: peak STREAM bandwidth.]

Closing
• Summary
  • Analysis concurrent with graph change can work.
  • But not all implementations are valid.
  • New and novel architectures show promise for fine-grained access and parallelism.
• Future work
  • Track subgraphs / communities for “slow” analyses
  • Can offload subgraphs to accelerators?
  • Develop more valid updating methods, approximation results
  • Experiment with even more new architectures

Introducing the CRNCH Rogues Gallery
A physical & virtual space for hosting novel computing
architectures, systems, and accelerators.
Host / manage remote access for novel architectures!
• Emu Chick
• FPGA + HMC: D stacked
• FPAA: Analog/Neuromorphic
Amortize effort and cost of trying novel architectures.
Break the “but it’s too much work” barrier.
http://crnch.gatech.edu/rogues-gallery
