Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
ย
GraphBLAS: A linear algebraic approach for high-performance graph queries
1. GraphBLAS: A linear algebraic approach
for high-performance graph algorithms
Gรกbor Szรกrnyas
szarnyas@mit.bme.hu
2. WHAT MAKES GRAPH PROCESSING DIFFICULT?
the โcurse of connectednessโ
contemporary computer architectures are good at
processing linear and hierarchical data structures,
such as Lists, Stacks, or Trees
a massive amount of random data access is required,
CPU has frequent cache misses, and implementing
parallelism is difficult
B. Shao, Y. Li, H. Wang, H. Xia (Microsoft Research),
Trinity Graph Engine and its Applications,
IEEE Data Engineering Bulleting 2017
connectedness
computer
architectures
caching and
parallelization
9. BOOKS ON LINEAR ALGEBRA FOR GRAPH PROCESSING
๏ง 1974: Aho-Hopcroft-Ullman book
o The Design and Analysis of Computer Algorithms
๏ง 1990: Cormen-Leiserson-Rivest book
o Introduction to Algorithms
๏ง 2011: GALLA book (ed. Kepner and Gilbert)
o Graph Algorithms in the Language of Linear Algebra
A lot of literature but few practical implementations
and particularly few easy-to-use libraries.
10. THE GRAPHBLAS STANDARD
BLAS GraphBLAS
Hardware architecture Hardware architecture
Numerical applications Graph analytical apps
LAGraphLINPACK/LAPACK
Separation of concernsSeparation of concerns
Goal: separate the concerns of the hardware/library/application designers.
๏ง 1979: BLAS Basic Linear Algebra Subprograms (dense)
๏ง 2001: Sparse BLAS an extension to BLAS (insufficient for graphs, little uptake)
๏ง 2013: GraphBLAS standard building blocks for graph algorithms in LA
13. MATRIX MULTIPLICATION ON SEMIRINGS
๏ง Using the conventional semiring
๐ = ๐๐
๐ ๐, ๐ = ฮฃ
๐
๐ ๐, ๐ โ ๐ ๐, ๐
๏ง Use arbitrary semirings that override the โจ addition and
โจ multiplication operators. Generalized formula (simplified)
๐ = ๐ โจ.โจ ๐
๐ ๐, ๐ = โ
๐
๐ ๐, ๐ โจ๐ ๐, ๐
14. GRAPHBLAS SEMIRINGS
The ๐ท,โ,โ, 0 algebraic structure is a GraphBLAS semiring if
๏ง ๐ท,โ, 0 is a commutative monoid over domain ๐ท with an
addition operator โ and identity 0, where โ๐, ๐, ๐ โ ๐ท:
o Commutative ๐ โ ๐ = ๐ โ ๐
o Associative ๐ โ ๐ โ ๐ = ๐ โ ๐ โ ๐
o Identity ๐ โ 0 = ๐
๏ง The multiplication operator is a closed binary operator
โ: ๐ท ร ๐ท โ ๐ท.
This is less strict than the standard mathematical definition which
requires that โ is a monoid and distributes over โ.
15. COMMON SEMIRINGS
semiring domain โจ โจ 0
integer arithmetic ๐ โ โ + โ 0
real arithmetic ๐ โ โ + โ 0
lor-land ๐ โ F, T โ โ F
Galois field ๐ โ 0,1 xor โ 0
power set ๐ โ โค โช โฉ โ
Notation: ๐ โ.โ ๐ is a matrix multiplication using addition โ and
multiplication โ, e.g. ๐ โจ.โง ๐. The default is ๐ + . โ ๐
17. MATRIX MULTIPLICATION SEMANTICS
semiring domain โจ โจ 0
lor-land ๐ โ F, T โจ โง F
๏
๏
๏ ๏
๏ ๏
๏
Semantics: reachability
๏ ๏ ๏ ๏ ๏ ๏ ๏
๐ฏ F F F T F T F
๐ ๏ ๏ ๏ ๏ ๏ ๏ ๏
๏ T T
๏ T T
๏ T
๏ T T
๏ T
๏ T
๏ T T T
T T
TโงT=T
TโงT=T
TโจT=T ๐ฏ โจ.โง ๐Identity element: F
20. SSSP โ SINGLE-SOURCE SHORTEST PATHS
๏ง Problem:
o From a given start node ๐ , find the shortest paths to every other
(reachable) node in the graph
๏ง Bellman-Ford algorithm:
o Relaxes all edges in each step
o Guaranteed to find the shortest paths using at most ๐ โ 1 steps
๏ง Observation:
o The relaxation step can be captured using a VM multiplication
32. TC: OPTIMIZATION
Observation: Matrix ๐ โ.โ ๐ โ.โ ๐ is no longer sparse.
Optimization: Use element-wise multiplication โ to close
wedges into triangles:
๐๐๐ = ๐ โ.โ ๐ โ ๐
Then, perform a row-wise summation to get the number of
triangles in each row:
๐ญ๐ซ๐ข = โ ๐ ๐๐๐ : , ๐
35. TC: ALGORITHM
Input: adjacency matrix ๐
Output: vector ๐ญ๐ซ๐ข
Workspace: matrix ๐๐๐
1. ๐๐๐ ๐ = ๐ โ.โ ๐ compute the triangle count matrix
2. ๐ญ๐ซ๐ข = โ ๐ ๐๐๐ : , ๐ compute the triangle count vector
Optimization: use ๐, the lower triangular part of ๐ to avoid duplicates.
๐๐๐ ๐ = ๐ โ.โ ๐
Worst-case optimal joins: There are deep theoretical connections between masked matrix multiplication and
relational joins. It has been proven in 2013 that for the triangle query, binary joins always provide suboptimal
runtime, which gave rise to new research on the family of worst-case optimal multi-way joins algorithms.
37. GRAPH ALGORITHMS IN GRAPHBLAS
problem category algorithm
canonical
complexity ฮ
LA-based
complexity ฮ
breadth-first search ๐ ๐
single-source shortest paths
Dijkstra ๐ + ๐ log ๐ ๐2
Bellman-Ford ๐๐ ๐๐
all-pairs shortest paths Floyd-Warshall ๐3 ๐3
minimum spanning tree
Prim ๐ + ๐ log ๐ ๐2
Borลฏvka ๐ log ๐ ๐ log ๐
maximum flow Edmonds-Karp ๐2 ๐ ๐2 ๐
maximal independent set
greedy ๐ + ๐ log ๐ ๐๐ + ๐2
Luby ๐ + ๐ log ๐ ๐ log ๐
Based on the table in J. Kepner:
Analytic Theory of Power Law Graphs,
SIAM Workshop for HPC on Large Graphs, 2008
Notation: ๐ = ๐ , ๐ = |๐ธ|. The complexity cells contain asymptotic bounds.
Takeaway: The majority of common graph algorithms can be expressed efficiently in LA.
See also L. Dhulipala, G.E. Blelloch, J. Shun:
Theoretically Efficient Parallel Graph Algorithms
Can Be Fast and Scalable, SPAA 2018
39. GRAPHBLAS C API
๏ง โA crucial piece of the GraphBLAS effort is to translate the
mathematical specification to an API that
o is faithful to the mathematics as much as possible, and
o enables efficient implementations on modern hardware.โ
mxm(Matrix *C, Matrix M, BinaryOp accum, Semiring op, Matrix A, Matrix B, Descriptor desc)
๐ ยฌ๐ โ= โ.โ ๐โค
, ๐โค
A. Buluรง et al.: Design of the GraphBLAS C API, GABB@IPDPS 2017
40. SUITESPARSE:GRAPHBLAS
๏ง Authored by Prof. Tim Davis at Texas A&M University,
based on his SuiteSparse library (used in MATLAB).
๏ง Additional extension operations for efficiency.
๏ง Sophisticated load balancer for multi-threaded execution.
๏ง CPU-based, single machine implementation.
๏ง Powers the RedisGraph graph database.
T.A. Davis: Algorithm 1000: SuiteSparse:GraphBLAS:
graph algorithms in the language of sparse linear
algebra, ACM TOMS, 2019
T.A. Davis: SuiteSparse:GraphBLAS: graph
algorithms via sparse matrix operations
on semirings, Sparse Days 2017
R. Lipman, T.A. Davis: Graph Algebra โ Graph operations
in the language of linear algebra, RedisConf 2018
R. Lipman: RedisGraph
internals, RedisConf 2019
41. PYTHON WRAPPERS
Two libraries, both offer:
๏ง Concise GraphBLAS operations
๏ง Wrapping SuiteSparse:GrB
๏ง Jupyter support
Difference: pygraphblas is more
Pythonic, grblas strives to stay
close to the C API.
michelp/pygraphblas
jim22k/grblas
43. SUITESPARSE:GRAPHBLAS / LDBC GRAPHALYTICS
Twitter
50M nodes,
2B edges
Results from late 2019,
new version is even faster
graph500-26
33M nodes,
1.1B edges
d-8.8-zf
168M nodes,
413M edges
d-8.6-fb
6M nodes,
422M edges
Ubuntu Server, 512 GB RAM,
64 CPU cores, 128 threads
44. THE GAP BENCHMARK SUITE
S. Beamer, K. Asanovic, D. Patterson:
The GAP Benchmark Suite, arXiv, 2017
๏ง Part of the Berkeley Graph Algorithm Platform project
๏ง Algorithms:
o BFS, SSSP, PageRank, connected components
o betweenness centrality, triangle count
๏ง Very efficient baseline implementation in C++
๏ง Comparing executions of implementations that were
carefully optimized and fine-tuned by research groups
๏ง Ongoing benchmark effort, paper to be submitted in Q2
gap.cs.berkeley.edu/benchmark.html
46. RESOURCES
List of GraphBLAS-related books, papers, presentations, posters, and software
szarnyasg/graphblas-pointers
Library of GraphBLAS algorithms
GraphBLAS/LAGraph
Extended version of this talk: 200+ slides
๏ง Theoretical foundations
๏ง BFS variants, PageRank
๏ง clustering coefficient, ๐-truss and triangle count variants
๏ง Community detection using label propagation
๏ง Lubyโs maximal independent set algorithm
๏ง computing connected components on an overlay graph
๏ง connections to relational algebra
47. SUMMARY
๏ง Linear algebra is a powerful abstraction
o Good expressive power
o Concise formulation of most graph algorithms
o Very good performance
o Still lots of ongoing research
๏ง Trade-offs:
o Learning curve (theory and GraphBLAS API)
o Some algorithms are difficult to formulate in linear algebra
o Only a few GraphBLAS implementations (yet)
๏ง Overall: a very promising programming model for graph
algorithms suited to the age of heterogeneous hardware
48. ACKNOWLEDGEMENTS
๏ง Tim Davis and Tim Mattson for helpful discussions,
members of GraphBLAS mailing list for their detailed
feedback.
๏ง The LDBC Graphalytics task force for creating the
benchmark and assisting in the measurements.
๏ง Masterโs students at BME for developing GraphBLAS-based
algorithms: Bรกlint Hegyi, Mรกrton Elekes, Petra Vรกrhegyi,
Lehel Boรฉr