Fast & Energy-Efficient Breadth-First Search on a Single NUMA System

Yuichiro Yasui & Katsuki Fujisawa
Kyushu University & JST CREST
Yukinori Sato
JAIST & JST CREST
ISC'14 (International Supercomputing Conference 2014)
Research Papers 08 ‒ Energy Efficiency, June 26, 2014

Outline
1.  Background
2.  Fast computation of graph processing
–  Related work and our previous contributions
3.  Bottleneck analysis of our previous NUMA-optimized BFS
4.  Our proposal: Degree-aware BFS
5.  Performance evaluation of the proposed BFS
–  Fast on the Graph500 benchmark
–  Energy-efficient on the Green Graph500 benchmark

Background
•  Large-scale graphs in various fields
–  US road network: 24 million vertices & 58 million edges
–  Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
–  Neuronal network @ Human Brain Project: 89 billion vertices & 100 trillion edges
–  Cyber-security: 15 billion log entries / day
•  Fast and scalable graph processing by using HPC

Graph analysis and important kernel BFS
•  The cycle of graph analysis for understanding real networks
–  Application fields: transportation, social network, cyber-security, bioinformatics
–  Kinds of graph processing: concurrent search (breadth-first search), optimization (single source shortest path), edge-oriented (maximal independent set)
[Figure: cycle of graph analysis in three steps (Step1, Step2, Step3); labels: application field, relationships, graph, graph processing, results, understanding]
Breadth-first search (BFS)
•  One of the most important and fundamental graph-processing kernels
•  Many algorithms and applications are based on it (e.g., max-flow and centrality)
•  Low arithmetic intensity & irregular memory accesses
•  Inputs: graph and source vertex
•  Outputs: distance (Lv.) and predecessor for each vertex from the source
[Figure: BFS tree from the source vertex with levels Lv. 1, Lv. 2, Lv. 3]

Target: NUMA arch. system
•  4-way Intel Xeon E5-4640 (Sandybridge-EP)
–  4 (# of CPU sockets)
–  8 (# of physical cores per socket)
–  2 (# of threads per core)
–  Max. 4 x 8 x 2 = 64 threads
•  NUMA node = CPU socket (16 logical cores) + local RAM
–  Each socket is an 8-core Xeon E5-4640: processor cores with L2 cache and a shared L3 cache
–  Memory access to local RAM (fast)
–  Memory access to remote RAM (slow)
•  NUMA-aware computation
–  Reduces and avoids memory accesses to remote RAM

Graph500 Benchmark
•  Fast graph processing is a significant topic in HPC
•  The Graph500 benchmark measures computer performance using the
TEPS ratio (# of traversed edges per second) in graph processing
such as BFS (breadth-first search)
•  Benchmark flow
–  Input parameters: SCALE and edgefactor (= 16)
–  1. Generation (graph generation)
–  2. Construction (graph construction)
–  3. BFS x 64 (each BFS is followed by validation and yields a TEPS ratio)
–  Results: BFS time, traversed edges, TEPS; the score is the median TEPS
•  Kronecker graph
–  Synthetic scale-free network generated by a recursive Kronecker product
–  2^SCALE vertices and 2^SCALE x edgefactor edges
–  e.g., SCALE 30 and edgefactor 16: 1 billion vertices
and 17.2 billion edges
www.graph500.org
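As a rough illustration of the sizes and the metric described on this slide, the following Python sketch computes the vertex and edge counts of a Kronecker graph from SCALE and edgefactor, and the TEPS ratio of BFS runs. The function names (`kronecker_size`, `teps`, `median_teps`) are hypothetical and are not part of the Graph500 reference code.

```python
import statistics


def kronecker_size(scale: int, edgefactor: int = 16) -> tuple[int, int]:
    """Return (vertices, edges) = (2^SCALE, 2^SCALE * edgefactor)."""
    n = 1 << scale
    return n, n * edgefactor


def teps(traversed_edges: int, bfs_seconds: float) -> float:
    """Traversed edges per second for a single BFS run."""
    return traversed_edges / bfs_seconds


def median_teps(runs) -> float:
    """Median TEPS over (traversed_edges, bfs_seconds) pairs, e.g. 64 runs."""
    return statistics.median(teps(e, t) for e, t in runs)


n, m = kronecker_size(30)
print(n, m)  # 1073741824 17179869184
```

With SCALE 30 and edgefactor 16 this gives about 1.07 billion vertices and 17.2 billion edges, matching the example on the slide.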

Level-synchronized parallel BFS (Top-down)
•  Starts from the source vertex and executes the following two
phases (traversal and swap) at each level
Algorithm 1: Level-synchronized Parallel BFS.
Input    : G = (V, A) : unweighted directed graph.
           s : source vertex.
Variables: QF : frontier queue.
           QN : neighbor queue.
           visited : vertices already visited.
Output   : π(v) : predecessor map of BFS tree.
π(v) ← −1, ∀v ∈ V
π(s) ← s
visited ← {s}
QF ← {s}
QN ← ∅
while QF ≠ ∅ do
    for v ∈ QF in parallel do
        for w ∈ A(v) do
            if w ∉ visited atomic then
                π(w) ← v
                visited ← visited ∪ {w}
                QN ← QN ∪ {w}
    QF ← QN
    QN ← ∅
•  Traversal: finds unvisited adjacent vertices of the current frontier QF
and appends them to the neighbor queue QN
•  Swap: swaps the frontier QF and the neighbor QN for the next level
[Figure: frontier QF at level k and neighbor QN at level k+1, connected by the traversal and swap phases]
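The traversal and swap phases can be sketched in Python as follows. This is a sequential sketch, not the parallel implementation of the slides: the predecessor map doubles as the visited set, so the atomic test-and-set is not needed, and the graph is assumed to be given as a list of adjacency lists. The function name `bfs_top_down` is hypothetical.

```python
def bfs_top_down(adj, source):
    """Level-synchronized BFS.

    adj[v] is the adjacency list A(v). Returns the predecessor map pi,
    where pi[v] == -1 means v was not reached and pi[source] == source.
    """
    pi = [-1] * len(adj)
    pi[source] = source
    frontier = [source]              # QF
    while frontier:
        neighbors = []               # QN
        # Traversal: collect unvisited neighbors of the current frontier.
        for v in frontier:
            for w in adj[v]:
                if pi[w] == -1:      # w not visited yet
                    pi[w] = v
                    neighbors.append(w)
        # Swap: the neighbor queue becomes the next frontier.
        frontier = neighbors
    return pi


graph = [[1, 2], [0, 3], [0, 3], [1, 2], []]
print(bfs_top_down(graph, 0))  # [0, 0, 0, 1, -1]
```

Vertex 4 has no edges, so it keeps the initial predecessor -1.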

Observation of data accesses in forward (top-down) and backward (bottom-up) search
•  Data writes in forward search (top-down): v → w
Input : Directed graph G = (V, AF), Queue QF
Data  : Queue QN, visited, Tree π(v)
QN ← ∅
for v ∈ QF in parallel do
    for w ∈ AF(v) do
        if w ∉ visited atomic then
            π(w) ← v
            visited ← visited ∪ {w}
            QN ← QN ∪ {w}
QF ← QN
•  Data writes in backward search (bottom-up): w → v
Input : Directed graph G = (V, AB), Queue QF
Data  : Queue QN, visited, Tree π(v)
QN ← ∅
for w ∈ V \ visited in parallel do
    for v ∈ AB(w) do
        if v ∈ QF then
            π(w) ← v
            visited ← visited ∪ {w}
            QN ← QN ∪ {w}
            break
QF ← QN
Hybrid-BFS (Direction-optimizing BFS)
•  Chooses either Top-down or Bottom-up according to the frontier size at each level (Beamer2012)
Top-down algorithm
•  Efficient for a small frontier
•  Uses out-going edges
•  Expands the current frontier (level k) to its unvisited neighbors (level k+1)
Bottom-up algorithm
•  Efficient for a large frontier
•  Uses in-coming edges
•  Checks the candidates of neighbors (level k+1) against the current frontier (level k)
•  Skips unnecessary edge traversal
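A minimal sequential sketch of the direction switch, in Python. The switching rule used here (a fixed fraction `threshold` of the vertex count) is an assumption chosen for illustration only; the slides state that the choice depends on the frontier size but do not give the rule. All function names are hypothetical.

```python
def top_down_step(adj_out, frontier, pi):
    """Expand the frontier along out-going edges."""
    nxt = set()
    for v in frontier:
        for w in adj_out[v]:
            if pi[w] == -1:
                pi[w] = v
                nxt.add(w)
    return nxt


def bottom_up_step(adj_in, frontier, pi):
    """Each unvisited vertex looks for a parent in the frontier."""
    nxt = set()
    for w in range(len(adj_in)):
        if pi[w] != -1:
            continue
        for v in adj_in[w]:
            if v in frontier:
                pi[w] = v
                nxt.add(w)
                break            # remaining in-coming edges are skipped
    return nxt


def hybrid_bfs(adj_out, adj_in, source, threshold=0.05):
    """Hybrid BFS; `threshold` is an illustrative switching parameter."""
    n = len(adj_out)
    pi = [-1] * n
    pi[source] = source
    frontier = {source}
    while frontier:
        if len(frontier) < threshold * n:
            frontier = top_down_step(adj_out, frontier, pi)
        else:
            frontier = bottom_up_step(adj_in, frontier, pi)
    return pi
```

For an undirected graph the same adjacency lists can be passed as both `adj_out` and `adj_in`. The two directions may pick different parents for the same vertex, but both assign each vertex a parent from the previous level.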

Hybrid-BFS (Direction-optimizing BFS)
•  Chooses either Top-down or Bottom-up according to the number of traversed edges at each level
•  Hybrid-BFS reduces unnecessary edge traversals (Beamer2012)
Number of traversed edges for a Kronecker graph with SCALE 26
Level   Top-down        Bottom-up       Hybrid
0       2               2,103,840,895   2
1       66,206          1,766,587,029   66,206
2       346,918,235     52,677,691      52,677,691
3       1,727,195,615   12,820,854      12,820,854
4       29,557,400      103,184         103,184
5       82,357          21,467          21,467
6       221             21,240          227
Total   2,103,820,036   3,936,072,360   65,689,631
Ratio   100.00%         187.09%         3.12%
–  Level = distance from the source
–  Graph size: |V| = 2^26, |E| = 2^30 (the slide also carries the annotation "= |E|")
–  Direction labels in the figure, from low to high levels: Top-down, Bottom-up, Top-down

NUMA-optimized BFS
•  Clearly separates accesses to local and remote memory
–  Edge traversal on local RAM
–  All-gather of local queues and bitmaps for remote RAM
•  At each level:
–  Large frontier? Yes: NUMA-optimized Bottom-up. No: NUMA-optimized Top-down.
–  Traversal on local RAM: searches local neighbors from the locally copied frontier
–  Swap on remote RAM: aggregates the local queues QN_0, QN_1, QN_2, QN_3 into the frontier queue QF
•  NUMA-opt. requires two CSR graphs
–  Out-going edges for Top-down
–  In-coming edges for Bottom-up
※ Not the same for an undirected graph

Local Edge Traversal
•  Graph partitioning that takes the NUMA architecture into account (translated; the slide text is partly cropped)
–  Assumes a machine with ℓ CPU sockets (ℓ RAMs)
–  The partial vertex set V_k and the partial adjacency list A_k are placed on the k-th NUMA node
–  V = V_0 | V_1 | · · · | V_{ℓ−1},  A = A_0 | A_1 | · · · | A_{ℓ−1}
–  V_k is a simple one-dimensional partition (n: number of vertices, ℓ: number of CPU sockets)
   V_k = { v_j ∈ V | j ∈ [ kn/ℓ, (k+1)n/ℓ ) }
–  Two kinds of adjacency lists are defined on the ℓ-socket machine
•  The k-th NUMA node holds
–  partial vertices V_k
–  partial edges (v_i, v_j), v_i ∈ V and v_j ∈ V_k
–  local copied frontier QF
[Figure: four NUMA nodes (0th, 1st, 2nd, 3rd), each an 8-core Xeon E5-4640 with shared L3 cache and local RAM; node k holds V_k, A_k, visited_k, QN_k and a copy of QF]
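The one-dimensional partition V_k = { v_j | j ∈ [kn/ℓ, (k+1)n/ℓ) } and the rule that node k holds the edges (v_i, v_j) with v_j ∈ V_k can be sketched in Python as follows. Integer (floor) division for the boundaries and the helper names (`boundaries`, `owner`, `partition_edges`) are assumptions for illustration; the slides do not show the actual data layout.

```python
from bisect import bisect_right


def boundaries(n, num_nodes):
    """Start index of each partition, plus n as the final end index."""
    return [k * n // num_nodes for k in range(num_nodes + 1)]


def owner(j, bounds):
    """Index k of the NUMA node whose range [bounds[k], bounds[k+1]) contains j."""
    return bisect_right(bounds, j) - 1


def partition_edges(edges, n, num_nodes):
    """Assign each edge (vi, vj) to the node that owns vj."""
    bounds = boundaries(n, num_nodes)
    parts = [[] for _ in range(num_nodes)]
    for vi, vj in edges:
        parts[owner(vj, bounds)].append((vi, vj))
    return parts


print(boundaries(8, 4))                                # [0, 2, 4, 6, 8]
print(partition_edges([(0, 7), (3, 1), (5, 4)], 8, 4))
# [[(3, 1)], [], [(5, 4)], [(0, 7)]]
```

Because every edge lives on the node that owns its destination vertex, writes to π, visited and QN for that vertex stay in local RAM during traversal.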

CPU Affinity and local memory binding
•  ULIBC: Ubiquity Library for Intelligently Binding Cores
–  Provides routines for CPU affinity + local memory binding
–  Manages each processor core (processor ID) by topology
information as a tuple of (SMT ID, core ID, package ID)
•  Topology IDs
–  Processor ID: index of logical processor core
–  Package ID: index of CPU socket
–  Core ID: index of physical core in each CPU socket
–  SMT ID: index of thread in each physical core
•  CPU affinity
–  1. Detects online processors (those allocated to the current process, out of all processors)
using the sched_getaffinity system call
–  2. Binds each thread to a logical core
using the sched_setaffinity system call or
the Intel compiler Thread Affinity Interface
[Figure: NUMA node 0 and NUMA node 1 with cores 0 to 3 and their local RAM; some cores are in use by other processes]
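The two steps above can be tried from Python on Linux, where the standard library exposes the same system calls as `os.sched_getaffinity` and `os.sched_setaffinity`. This sketch is not ULIBC and does not reproduce its API; the helper names and the round-robin mapping are assumptions for illustration.

```python
import os


def online_processors():
    """Step 1: logical processors the current process may run on (Linux only)."""
    return sorted(os.sched_getaffinity(0))


def round_robin_assignment(num_threads, cpus):
    """Illustrative mapping from thread index to logical processor ID."""
    return [cpus[i % len(cpus)] for i in range(num_threads)]


def bind_to(cpu):
    """Step 2: restrict the caller to one logical processor (Linux only)."""
    os.sched_setaffinity(0, {cpu})


if hasattr(os, "sched_getaffinity"):
    cpus = online_processors()
    print("online processors:", cpus)
    print("thread -> cpu:", round_robin_assignment(4, cpus))
```

A NUMA-aware runtime would additionally group the processor IDs by package ID so that threads sharing a socket also share local RAM; that topology lookup is outside this sketch.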

Related work: TEPS ratios on a single node
•  Our BFS achieves 31.7 GTEPS for a Kronecker graph (SCALE 27)
–  Reference code: 0.1 GTEPS
–  Agarwal2010: NUMA-aware Top-down BFS, 4-way Intel Xeon 7560, 0.8 GTEPS (m/n=64, 1.1 GTEPS)
–  Beamer2012: Hybrid-BFS, 4-way Intel Xeon E5-8870, 5.1 GTEPS
–  Yasui2013: NUMA-opt. Hybrid-BFS, 11.1 GTEPS
–  Yasui2014 (this paper): Degree-aware NUMA-opt. BFS, 31.7 GTEPS
[Figure: GTEPS (0 to 35) versus SCALE (20 to 29) for reference code, Agarwal2010, Beamer2012, Yasui2013 and Yasui2014; speedup annotations: x 2.2, x 2.6 and x 5.9 faster]

Breakdown of hybrid-BFS
•  Most of the CPU time is spent in the Bottom-up steps of Hybrid-BFS.
•  In particular, the Bottom-up step at Level 2 accounts for most of the edge traversals.
#Traversed edges for Kronecker graph with SCALE 27 (2^27 vertices and 2^31 edges)
Level   Step        Hybrid-BFS
0       Top-down    22
1       Top-down    239,930
2       Bottom-up   150,006,673
3       Bottom-up   19,742,764
4       Bottom-up   139,817
5       Bottom-up   41,846
6       Top-down    260
Total   –           170,171,312
%                   4.0 %
–  Level 2 alone: 88.1 % of the traversed edges
–  All Bottom-up steps: 99.9 % of the traversed edges
–  The total is about 8 % of 2^31 = 2,147,483,648 edges (100 %)
Breakdown of vertex traversal
•  Most of the vertex traversal is done in the Bottom-up steps of Hybrid-BFS.
•  About half of the vertices are unvisited.
Total vertices: 134,217,728 = 2^27 (100.0 %)
=  Visited vertices: Top-down 283 (0.0 %) + Bottom-up 63,035,833 (47.0 %)
+  Unvisited vertices: Zero-degree 71,140,085 (53.0 %) + Isolated 41,527 (0.0 %)

Influence of ordering for adjacency vertices
•  The computational complexity of the Bottom-up step depends on
the ordering of adjacent vertices in each adjacency list
•  Sorted adjacency list A(v): adjacent vertices w ordered by the out-degree of w
(descending order: high-degree first, low-degree last)
•  The Bottom-up loop over A(v) breaks as soon as it finds a frontier vertex;
the loop count τ is the number of traversed adjacent vertices, and the remaining ones are skipped
•  The number of traversed edges is strongly affected by the ordering at Lv. 2; descending order is better.
Table 3. Number of traversed edges in a BFS for Kronecker graph with scale 27.
                                               Hybrid algorithm
Level   Top-down        Bottom-up       Step   Ascending     Randomized    Descending
0       22              4,223,250,243   T      22            22            22
1       239,930         3,258,645,723   T      239,930       239,930       239,930
2       1,040,268,126   83,878,899      B      848,743,124   150,006,673   83,878,899
3       3,145,608,885   19,616,130      B      19,935,737    19,742,764    19,616,130
4       37,007,608      139,606         B      139,868       139,817       139,606
5       98,339          41,846          B      41,846        41,846        41,846
6       260             41,586          T      260           260           260
Total   4,223,223,170   7,585,614,033   –      869,100,787   170,171,312   103,916,693
%       100 %           179.6 %                20.6 %        4.0 %         2.5 %
(Step: T = Top-down, B = Bottom-up)
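The effect of the ordering on the Bottom-up loop count τ can be reproduced on a toy graph with the Python sketch below. It assumes an undirected graph stored as adjacency lists, uses the list length as the degree, and the function names are hypothetical.

```python
def sort_by_degree(adj, descending=True):
    """Return adjacency lists reordered by the degree of each neighbor."""
    degree = [len(a) for a in adj]
    return [sorted(a, key=lambda v: degree[v], reverse=descending) for a in adj]


def bottom_up_loop_count(neighbors, frontier):
    """Loop count tau: entries scanned until a frontier vertex is found.

    If no neighbor is in the frontier, the whole list is scanned.
    """
    for tau, v in enumerate(neighbors, start=1):
        if v in frontier:
            return tau
    return len(neighbors)


# Hub vertex 0 is connected to everyone; 1 and 2 are also connected.
adj = [[1, 2, 3, 4], [2, 0], [1, 0], [0], [0]]
frontier = {0}
desc = sort_by_degree(adj, descending=True)
asc = sort_by_degree(adj, descending=False)
print(bottom_up_loop_count(desc[1], frontier))  # 1
print(bottom_up_loop_count(asc[1], frontier))   # 2
```

High-degree vertices tend to enter the frontier early, so placing them first lets the loop break sooner, which is the behavior Table 3 shows at Lv. 2.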

Analysis of the loop count for each vertex
•  Bottom-up found 46.8 % of the vertices
•  Descending order finds most vertices at the first loop (τ = 1), i.e., at the first vertex of the adjacency list
Ordering     Found at τ = 1   Found at τ ≥ 2   Max. τ
Ascending    19.0 %           27.8 %           5,873
Randomized   28.1 %           18.7 %           58
Descending   45.0 %           1.8 %            28
Fig. 3. Distribution of the loop count τ of the bottom-up step at each level in a BFS
[Figure: three panels (Ascending, Randomized, Descending); x-axis: loop count τ; y-axis: number of fixed vertices (10^0 to 10^8); series: Lv.2, Lv.3, Lv.4, Lv.5]

3 features and Degree-aware BFS
1. Half of the vertices have no adjacent vertices
→ Suppression of zero-degree vertices by renumbering
the non-zero-degree vertices
2. The computational complexity of Bottom-up depends on
the ordering of adjacent vertices for each vertex
→ Adjacency lists sorted by out-degree in descending order
3. Most vertices are found at the first loop of Bottom-up
→ Separated graph representation: highest-degree
adjacency vertex list A+ and remaining CSR graph A-
[Figure: standard CSR graph (index array of size n, adjacency array of size m, each list sorted from high-degree to low-degree) = A+ (highest-degree adjacent vertex of each vertex, size n) + A- (remaining adjacency, size m-n); labels: Zero-degree opt., High-degree opt.]
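The renumbering and the separated representation can be sketched in Python as follows. The sketch assumes an undirected graph stored as adjacency lists (so every neighbor has non-zero degree), uses plain lists instead of CSR arrays, and marks a missing entry in A+ with -1. All names are hypothetical and this is not the authors' data layout.

```python
def renumber_nonzero(adj):
    """Drop zero-degree vertices and renumber the rest consecutively.

    Returns (compact adjacency lists, old_ids), where old_ids[new] is the
    original vertex ID.
    """
    old_ids = [v for v, a in enumerate(adj) if a]
    new_id = {old: new for new, old in enumerate(old_ids)}
    compact = [[new_id[w] for w in adj[old]] for old in old_ids]
    return compact, old_ids


def split_highest_degree(adj):
    """Split into A+ (highest-degree neighbor) and A- (the rest, descending)."""
    degree = [len(a) for a in adj]
    a_plus, a_minus = [], []
    for a in adj:
        ordered = sorted(a, key=lambda w: degree[w], reverse=True)
        a_plus.append(ordered[0] if ordered else -1)
        a_minus.append(ordered[1:])
    return a_plus, a_minus


def find_parent(w, a_plus, a_minus, frontier):
    """Bottom-up check for vertex w: try A+ first, then fall back to A-."""
    if a_plus[w] in frontier:
        return a_plus[w]
    for v in a_minus[w]:
        if v in frontier:
            return v
    return -1


adj = [[2], [2], [0, 1, 4], [], [2]]        # vertex 3 has degree zero
compact, old_ids = renumber_nonzero(adj)
a_plus, a_minus = split_highest_degree(compact)
print(old_ids, a_plus, a_minus)
```

Since most vertices are found at τ = 1, the first check usually touches only the compact A+ array and never reads A-.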

Performance improvements
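•  Degree-aware BFS is 2.68 times faster than NUMA-opt.
–  NUMA-opt. (previous paper): x 1.00
–  + High-deg: x 1.34
–  + zero-deg: x 2.03
–  Degree-aware (this paper): x 2.68 (annotated "1.34 x 2.03" on the slide)
[Figure: bar chart of GTEPS (0 to 30) for NUMA-opt., + High-deg, + zero-deg and Degree-aware]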

Intel Xeon (64) vs. SGI Altix UV1000 (512)
–  536.9 million vertices, 8.59 billion edges
•  Intel Xeon (Sandybridge-EP arch.)
–  Intel Xeon E5-4650 @ 2.70GHz
–  64 threads = 4 NUMA x 8 cores x 2 threads
–  512 GB RAM
•  SGI Altix UV1000 (Westmere-EX arch.)
–  512 threads = 64 NUMA x 8 cores
–  4.0 TB RAM
[Figure: CPU time (ms, 0 to 300) per level (Init, Lv.0 to Lv.7) on each system, split into Traversal (local, for local RAM) and Swap (all-gather, for remote RAM), with the Bottom-up levels marked; panel labels: 64 threads, 400 GB; 512 threads, 1.0 TBytes; 31.81 GE/s and 21.81 GE/s]

Intel Xeon (64) vs. SGI Altix UV1000 (512)
–  1.07 billion vertices, 17.18 billion edges
•  Intel Xeon (Sandybridge-EP arch.)
–  64 threads = 4 NUMA x 8 cores x 2 threads
–  512 GB RAM
–  Out of memory
•  SGI Altix UV1000 (Westmere-EX arch.)
–  512 threads = 64 NUMA x 8 cores
–  4.0 TB RAM
–  512 threads, 2.0 TBytes: 37.70 GE/s
–  Rank 50 and fastest single-node entry on the Nov. 2013 list
[Figure: CPU time (ms, 0 to 300) per level on SGI Altix UV1000, split into Traversal (local, for local RAM) and Swap (all-gather, for remote RAM), with the Bottom-up levels marked]

Strong scaling on SGI Altix UV1000
–  1.07 billion vertices, 17.18 billion edges
Threads                  Throughput   Local    Remote   L : R
128 threads              18.76 GE/s   733 ms   182 ms   80% : 20%
256 threads              26.17 GE/s   438 ms   218 ms   67% : 33%
512 threads (one rack)   37.70 GE/s   258 ms   197 ms   57% : 43%
•  As the number of threads increases,
–  the CPU time for local memory access improves
–  the CPU time for remote memory access stays about the same
•  512 threads: Rank 50 and fastest single-node entry on the Nov. 2013 list
[Figure: CPU time (ms, 0 to 600) per level for 128, 256 and 512 threads, split into Traversal (local) and Swap (all-gather), with the Bottom-up levels marked]
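To put the three throughput figures of this slide in proportion, the following Python sketch computes the speedup and parallel efficiency relative to the smallest run. It assumes the pairing 128 threads: 18.76 GE/s, 256 threads: 26.17 GE/s, 512 threads: 37.70 GE/s, and the function name `strong_scaling` is hypothetical.

```python
def strong_scaling(results):
    """results: list of (threads, throughput), smallest thread count first.

    Returns (threads, speedup, efficiency) tuples relative to the first entry,
    where efficiency = speedup / (threads / base_threads).
    """
    base_threads, base_rate = results[0]
    rows = []
    for threads, rate in results:
        speedup = rate / base_rate
        efficiency = speedup / (threads / base_threads)
        rows.append((threads, speedup, efficiency))
    return rows


rows = strong_scaling([(128, 18.76), (256, 26.17), (512, 37.70)])
for threads, speedup, efficiency in rows:
    print(f"{threads:4d} threads: speedup x{speedup:.2f}, efficiency {efficiency:.0%}")
```

Going from 128 to 512 threads roughly doubles the throughput, which is consistent with the remote (all-gather) time staying flat while only the local traversal time shrinks.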

BFS performance on real networks
•  Suitable for small-world networks
–  Efficient for networks with a low diameter and a large edgefactor
•  Small-world: Twitter follow-ship network in 2009
–  61.6 million vertices & 1.47 billion edges
–  10.90 GTEPS (max. 24.09 GTEPS)
•  Non-small-world: US road network
–  24 million vertices & 58 million edges
–  0.09 GTEPS (max. 0.11 GTEPS)
Excerpt from the paper (beginning cropped): "… faster than the former owing to its edgefactor being 1.68 times larger." In addition, twitter and friendster show similar BFS performance of approximately 10 GTEPS because they have similar edgefactors and similar diameters. These results indicate that the performance of our BFS is affected by both the edgefactor and the diameter of the network, and that high performance is achieved for large-scale small-world networks with a large edgefactor.
Table 9. BFS performance of real-world network on Sandybridge-EP system.
                        Graph size            edgefactor   Diameter   GTEPS
Instance                n         m           m/n          diam'_G    min    1/4     median   3/4     max
wiki-Talk [23, 24]      2.39 M    5.02 M      2.1          8          0.29   0.61    0.75     0.87    1.26
USA-road-d [25]         23.95 M   58.33 M     2.4          8,098      0.07   0.08    0.09     0.09    0.11
LiveJournal [26, 27]    4.85 M    68.99 M     14.2         16         2.76   3.76    4.07     4.32    4.94
twitter [28]            61.58 M   1,468.37 M  23.8         16         7.58   10.02   10.90    12.68   24.09
friendster [29]         65.61 M   1,806.07 M  27.5         25         4.89   9.61    10.74    11.29   11.81

The Green Graph500 list in Nov. 2013
•  Measures power efficiency using the TEPS/W ratio
•  Our results cover various systems such as Xeon servers
and Android devices
http://green.graph500.org
•  Benchmark flow: 1. Generation, 2. Construction, 3. BFS phase x 64
–  Input parameters: SCALE and edgefactor
–  Results: BFS time, traversed edges, TEPS
•  Graph500: median TEPS of the 64 TEPS ratios
•  Green Graph500: TEPS/W, from the median TEPS and the power (watts)
–  Power consumption is measured during the BFS phase

TEPS and TEPS/W on 4-way Xeon
ℓ x t CPU affinity (number of threads)   GTEPS   Power     MTEPS/W
1 x 16 (16)                              7.92    364.9 W   21.71
2 x 8 (16)                               11.83   452.6 W   26.13
4 x 4 (16)                               13.96   517.8 W   26.96
4 x 8 (32)                               22.03   586.7 W   37.55
4 x 16 (64)                              29.03   639.1 W   45.43
–  ℓ: number of NUMA nodes, t: number of threads per NUMA node
–  #threads = 16 (1 x 16, 2 x 8, 4 x 4): labeled "Fast" toward 4 x 4
–  #NUMA nodes = 4 (4 x 8, 4 x 16): labeled "Energy efficient" toward 4 x 16
[Figure: GTEPS (0 to 30) versus ℓ x t CPU affinity: 1x1 (1), 4x1 (4), 4x2 (8), 1x16 (16), 2x8 (16), 4x4 (16), 2x16 (32), 4x8 (32), 4x16 (64), w/o (64); series: Reference, NUMA-opt., Degree-aware]
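The MTEPS/W values on this slide follow from dividing the throughput by the measured power. A small Python check, using the slide's figures for the 4 sockets x 16 threads run (29.03 GTEPS at 639.1 W); the function name `mteps_per_watt` is hypothetical.

```python
def mteps_per_watt(gteps: float, watts: float) -> float:
    """Energy efficiency in MTEPS/W from throughput in GTEPS and power in W."""
    return gteps * 1000.0 / watts


# (GTEPS, W) pairs as printed on the slide
measurements = [(7.92, 364.9), (11.83, 452.6), (13.96, 517.8),
                (22.03, 586.7), (29.03, 639.1)]
for gteps, watts in measurements:
    print(f"{gteps:6.2f} GTEPS / {watts:5.1f} W = {mteps_per_watt(gteps, watts):.2f} MTEPS/W")
```

The results agree with the slide to within rounding of the printed GTEPS and watt values.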

Green Graph500 on Xperia-A-SO-04E
•  Smartphone: SONY Xperia-A-SO-04E (CPU: 4-core Snapdragon, RAM: 2 GB)
•  Manages to be both fast and energy-efficient
–  153.17 MTEPS/W (477.64 MTEPS): #1 in Nov. 2013 list
–  Roughly the same power consumption regardless of # threads
–  About x150 more energy-efficient than the reference code (about 1 MTEPS/W)
Excerpt from the paper (beginning cropped): "… suggesting that the effective power is not strongly affected by the number of threads and the algorithm used." With regard to energy-efficient computation, our BFS is around 100 times faster than the reference code for roughly the same effective power of 3.0 W; specifically, our BFS shows an energy-efficient performance of 153.17 MTEPS/W.
Fig. 7. MTEPS of reference BFS and Degree-aware BFS on XperiaA SO-04E.
[Figure: MTEPS (0 to 500) versus SCALE (10 to 20) for Reference (p = 4) and Degree-aware BFS (p = 4)]
Table 10. Energy efficiency of BFS for Kronecker graph on XperiaA SO-04E.
Implementation       SCALE   MTEPS    watt   MTEPS/W
Reference (p = 1)    20      3.25     3.15   1.03
Reference (p = 4)    20      4.58     3.22   1.42
This study (p = 1)   20      136.29   3.23   42.25
This study (p = 2)   20      248.08   2.99   82.92
This study (p = 4)   20      477.63   3.12   153.17

Conclusion
•  Degree-aware BFS
–  Speedup techniques that consider the vertex degree
–  1) Zero-degree vertex suppression
–  2) Separated graph representation
–  2.68 times faster than our previous algorithm
•  Our BFS is the fastest single-node implementation
–  37.7 GTEPS for SCALE 30 on SGI Altix UV1000 (one rack)
•  Investigated affinity and power consumption
–  The 4 sockets x 16 threads affinity gives the highest MTEPS
and MTEPS/W on the 4-way Intel Xeon server
–  First position in the small data category of the 2nd Green
Graph500 on an Android device
