Gasnet library evaluation on Barrelfish and
Intel SCC
June 30, 2012
Zeus Gómez Marmolejo
Barcelona Supercomputing Center
Contents
1 Introduction
Motivation
Project goals
Software architecture
2 Tests
Hardware
Configurations
MP: 1 to 1
MP: N to N
3 Conclusions
4 Contributions
5 Future work
Introduction
Motivation
Future trends:
Multi-core CPUs and multi-core GPUs in a single chip.
Shared memory and cache coherence complexity: this may not scale in the future.
Problems with shared memory OS like Linux or Windows
and many core systems.
Message passing OS like Barrelfish.
Experiments on non-coherent shared-memory multi-core architectures: the Intel SCC and its MPBs (message passing buffers).
Linux approach
Multi-core operating systems using shared memory
[Diagram: cores 0..N all accessing a shared kernel structure, e.g.
struct page {
    ...
    spinlock_t ptl;
};
]
Data sharing issues:
Access locks
False sharing
Memory contention
Hardware cache coherence
Barrelfish approach
No sharing, but message passing
System Knowledge Base:
No driver software!
Message passing:
No sharing at all
System processes
Asynchronous calls
Interconnect drivers
Project goals
Looking for the appropriate library that meets the desired features
Port a well-known message passing library to
Barrelfish...
Desired features:
Portable across different architectures, systems and OSs.
Highly efficient.
Used in many applications and parallel languages.
Able to run standard OpenMP programs via the nanos runtime.
The Gasnet library from the University of California, Berkeley fulfills these expectations.
Gasnet library
Low level communication library
[Diagram: the Gasnet core API layered over the UDP, SMP, MPI and BF conduits, on top of the network hardware]
Low-level communication library: used to implement UPC, Titanium, OmpSs.
Different categories: AMShort, AMMedium, AMLong
Message types: requests, replies
Private Shared Memory (PSHM) mode for a conduit
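As a rough illustration of the request/reply active-message model listed above, the following is a minimal sketch against the Gasnet core API; the handler indices, the one-argument AMShort variants and the overall program are assumptions chosen for brevity, not code from this project.

#include <gasnet.h>

#define HIDX_PING 201   /* hypothetical handler indices (client range 128-255) */
#define HIDX_PONG 202

/* Request handler: runs on the destination node and sends back a reply. */
static void ping_handler(gasnet_token_t token, gasnet_handlerarg_t arg) {
    gasnet_AMReplyShort1(token, HIDX_PONG, arg);
}

/* Reply handler: runs back on the requesting node. */
static volatile int got_reply = 0;
static void pong_handler(gasnet_token_t token, gasnet_handlerarg_t arg) {
    (void)token; (void)arg;
    got_reply = 1;
}

int main(int argc, char **argv) {
    gasnet_handlerentry_t handlers[] = {
        { HIDX_PING, (void (*)())ping_handler },
        { HIDX_PONG, (void (*)())pong_handler },
    };
    gasnet_init(&argc, &argv);
    gasnet_attach(handlers, 2, GASNET_PAGESIZE, 0);

    if (gasnet_mynode() == 0 && gasnet_nodes() > 1) {
        gasnet_AMRequestShort1(1, HIDX_PING, 42);  /* AMShort request to node 1 */
        GASNET_BLOCKUNTIL(got_reply);              /* poll until the reply handler has run */
    }
    gasnet_exit(0);
    return 0;
}

The AMMedium and AMLong categories follow the same pattern but additionally carry a payload buffer (and, for AMLong, a destination address in the remote segment).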
Barrelfish Message Passing
Generated stubs for efficient message passing
[Diagram: core 0 process: msg() in user.c enters the flounder-generated user_flounder_bindings.c, whose ump_hdlr() writes the message to the channel (cache write). Core 1 process: event_dispatch() in waitset.c polls the channel, ump_rx() in the generated bindings reassembles the message and invokes the closure handler, which delivers msg() to user.c]
Non-blocking asynchronous calls.
Continuation closure, called also asynchronously.
Messages sent as RPC.
C code generated by the flounder tool (written in Haskell), depending on the interconnect driver.
Fast event handling code on receiving side (polling).
When all arguments are assembled, the call is made to the user program.
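To make the control flow above concrete, here is a schematic sketch of the send/receive pattern in C. The interface name (user), its message (msg) and the payload are hypothetical, and binding setup (export/bind) is omitted; only the overall shape (tx_vtbl send with a continuation, rx_vtbl handler, event_dispatch loop) follows the usual Barrelfish flounder conventions.

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <barrelfish/barrelfish.h>
#include <barrelfish/waitset.h>
#include <if/user_defs.h>          /* hypothetical flounder-generated header */

static void send_done(void *arg)
{
    /* Continuation closure: called asynchronously once the message has left. */
}

static void msg_rx(struct user_binding *b, uint64_t payload)
{
    /* Receive handler on the other core: called when all arguments of
     * the message have been assembled by the generated stubs. */
}

static struct user_rx_vtbl rx_vtbl = {
    .msg = msg_rx,
};

static void init_binding(struct user_binding *b)
{
    /* Install the receive handlers on an already-established binding. */
    b->rx_vtbl = rx_vtbl;
}

static void send_one(struct user_binding *b, uint64_t payload)
{
    /* Non-blocking send: returns immediately, completion is reported
     * through the continuation. */
    errval_t err = b->tx_vtbl.msg(b, MKCONT(send_done, NULL), payload);
    assert(err_is_ok(err) || err_no(err) == FLOUNDER_ERR_TX_BUSY);
}

static void dispatch_loop(void)
{
    /* Receiving side: poll the waitset and run handlers as events arrive. */
    struct waitset *ws = get_default_waitset();
    while (true) {
        errval_t err = event_dispatch(ws);
        assert(err_is_ok(err));
    }
}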
Differences with Barrelfish MP model
Similarities, differences and solutions proposed
Similarities:
Gasnet Nodes → Barrelfish Cores
Messages as RPC
Necessity to send large buffers
Be able to send back replies
Differences:
Gasnet calls must block
Non-blocking message handlers
Not thread-safe
Solutions:
2 threads: application & Gasnet
Leader-followers thread serving model
4 independent channels per peer
Gasnet BF conduit implementation
Details of the BF implementation using previous solutions
Details:
Uses the BF flounder generated stub to pass messages.
To simulate the synchronous behavior of Gasnet, we use 2 threads; one of them comes from the thread pool.
However, the binding cannot be handled by two threads concurrently without proper locking.
[Sequence diagram: on core 1, gasnet_AMShort() in thread 1 invokes ambf_ump_send_handler(); Barrelfish carries the message to core 2, where thread 2 performs the Gasnet handler call; the acks travel back through Barrelfish to the waiting thread on core 1]
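A minimal sketch of this two-thread, blocking-over-asynchronous pattern, written with POSIX threads for portability; the real conduit uses Barrelfish threads and the flounder binding, and all names here (blocking_am_short, dispatch_thread, ...) are hypothetical.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool request_pending = false;
static bool acked = false;

/* Runs in the dispatching (Gasnet) thread when the ack message arrives. */
static void ack_handler(void)
{
    pthread_mutex_lock(&lock);
    acked = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* Dispatch thread: stands in for the Barrelfish event_dispatch() loop;
 * here it simply pretends the remote ack arrives after a short delay. */
static void *dispatch_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        bool pending = request_pending;
        request_pending = false;
        pthread_mutex_unlock(&lock);
        if (pending) {
            usleep(1000);          /* "remote handler" runs, ack comes back */
            ack_handler();
        }
        usleep(100);
    }
    return NULL;
}

/* Blocking call as seen by the application thread: post the request,
 * then sleep until the dispatch thread has delivered the ack. */
static void blocking_am_short(int dest_core, int handler_idx, int arg0)
{
    (void)dest_core; (void)handler_idx; (void)arg0;
    pthread_mutex_lock(&lock);
    acked = false;
    request_pending = true;        /* hypothetical stand-in for the flounder send */
    while (!acked)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, dispatch_thread, NULL);
    blocking_am_short(1, 201, 42);
    printf("request acknowledged\n");
    return 0;
}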
Test systems
Intel x86_64 SMP system
Sun Fire X2270 M2, with 2× Intel Xeon E5620 @ 2.40 GHz:
[Diagram: chip 0 (8 CPUs) and chip 1 (8 CPUs) connected by a QPI link, each with three DDR3 channels]
Features:
Intel x86_64 architecture
2 chips × 4 cores × 2 SMT = 16 CPUs
32 GB NUMA RAM
QPI link: 25 GB/s
Test systems
Intel Single-chip Cloud Computer system
Features:
48 Intel 32-bit P54C CPUs in a single chip
Non-coherent caches
Routers and MPBs for message passing
4 DDR3 memory controllers
Shipped as a cluster of 48 Linux systems accessed via SSH.
No OS prior to Barrelfish sees it as a single system.
Test configurations
Possible combinations of architecture & OS & Gasnet conduit/interconnect driver
x86_64-linux-pshm. SMP with Linux and PSHM in Gasnet.
x86_64-linux-mpi. SMP with Linux and the MPI conduit for Gasnet.
x86_64-barrelfish-pshm. SMP running Barrelfish and PSHM in Gasnet.
x86_64-barrelfish-ump. SMP running Barrelfish and User-level Message Passing (UMP).
scc-linux-mpi. Intel SCC running Linux on all cores with the Gasnet MPI conduit, compiled with the MPIRCK CH2 driver.
scc-barrelfish-ump_ipi. Intel SCC running Barrelfish with the Gasnet BF conduit, using the UMP & Inter-Processor Interrupts (IPI) backend.
Not tested:
32-bit SMP system
The Barrelfish bulk transfer mode
The Intel SCC MPBs flounder backend: fast, but messages are very short and unprotected. The SCC is seen as an accelerator
Message Passing: 1 to 1
Only two nodes are sending messages
Testam benchmark from Gasnet. 1000 messages of:
Ping-pong roundtrip Request - Reply (prqp)
Ping-pong roundtrip Request - Request (prqq)
Flood one-way Request (foq)
Flood roundtrip Request - Reply (frqp)
Flood roundtrip Request - Request (frqq)
[Plots: AMShort testam x86_64-linux-pshm, delay (µs) per test type; AMLong testam x86_64-linux-pshm, throughput (MB/s) vs buffer size (1 B to 1 MB) for prqp, prqq, foq, frqp, frqq]
Message Passing: 1 to 1
Simplest comparison
x86_64-linux-pshm
vs
x86_64-barrelfish-pshm
[Plots: AMShort testam x86_64-*-pshm, delay (µs) per test type, linux vs barrelfish; AMLong testam x86_64-*-pshm, throughput (MB/s) vs buffer size, linux vs barrelfish]
In the AMShort category Barrelfish is much faster, as the asynchronous MP handlers are very efficient
In the AMMedium and AMLong categories, when the buffer is >= 2048 Kb, Barrelfish performs worse
Message Passing: 1 to 1
Performance analysis breakdown
For long messages, most of the time is spent in the memcpy()
libc function, copying bytes from one region to another
We tried different implementations to see the result
[Plot: throughput (MB/s) vs buffer size for different memcpy() implementations: linux-glibc, bf-glibc, bf-newlib, bf-oldc]
Test runs:
1 Linux with GNU GLIBC, using supplemental SSE3 and the REP prefix
2 Barrelfish with the GNU GLIBC memcpy()
3 Barrelfish with Red Hat Newlib, using REP
4 Barrelfish with the old libc (plain C)
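As a point of reference for the "REP" variants above, a minimal x86-64 memcpy() built on the REP prefix looks roughly like the sketch below; this is only an illustrative assumption about the style of those implementations, not code taken from Newlib or glibc, which add alignment handling and wider moves.

#include <stdio.h>
#include <stddef.h>

/* Minimal REP-prefix byte copy (GCC/Clang inline assembly, x86-64). */
static void *rep_memcpy(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
    return ret;
}

int main(void)
{
    char buf[16];
    rep_memcpy(buf, "hello", 6);
    printf("%s\n", buf);
    return 0;
}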
Message Passing: 1 to 1
MPI vs UMP on the SMP
x86_64-linux-mpi
vs
x86_64-barrelfish-ump
[Plot: AMMedium testam x86_64-{linux-mpi,bf-ump}, throughput (MB/s) vs buffer size, linux vs barrelfish]
(on linux-mpi the maximum AMMedium size equals the AMLong size)
Barrelfish always performs worse; the UMP interconnect driver is not designed for sending large buffers.
Again Newlib memcpy() against GLIBC memcpy()!
UMP decomposes buffers into fragments
An ACK is needed for each fragment. Piggybacking is implemented but not used, to avoid handler deadlocks
Bulk transfer is designed for this purpose: large buffers
Message Passing: 1 to 1
MPI vs UMP on the SCC
scc-linux-mpi
vs
scc-barrelfish-ump_ipi
[Plots: AMMedium testam scc-{linux-mpi,bf-ump} and AMLong testam scc-{linux-mpi,bf-ump}, throughput (MB/s) vs buffer size, linux vs barrelfish]
Again Newlib memcpy() against GLIBC memcpy()
Same problems as before
On Linux, the SCC MPBs are fully used (a single application)
On Linux we can see a strong performance degradation when the SCC MPBs overflow (after 8 KB)
Message Passing: N to N
Description of the test application
Now we want to model a real system:
A node can send messages to any other node of the
system with the same probability.
Messages are sent as a Poisson process; idle times follow an exponential distribution with rate parameter λ. From a uniform random variable U we obtain the idle time (see the sketch after this list) by:
T = −ln(U) / λ
We can choose the probability of sending AMShort,
AMMedium and AMLong.
Now buffers are fixed size.
We also model the probability for a request to have a reply.
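A minimal sketch of drawing such exponentially distributed idle times from a uniform generator (inverse-transform sampling, as in the formula above); the rate value and the use of drand48() are illustrative assumptions.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Draw one idle time T = -ln(U)/lambda from a uniform U in (0,1]. */
static double exp_idle_time(double lambda)
{
    double u;
    do {
        u = drand48();              /* uniform in [0,1) */
    } while (u == 0.0);             /* avoid log(0) */
    return -log(u) / lambda;
}

int main(void)
{
    const double lambda = 1000.0;   /* e.g. 1000 msg/s */
    srand48(42);
    double total = 0.0;
    for (int i = 0; i < 10; i++) {
        double t = exp_idle_time(lambda);
        total += t;
        printf("idle time %d: %.6f s\n", i, t);
    }
    printf("mean over 10 draws: %.6f s (expected 1/lambda = %.6f s)\n",
           total / 10.0, 1.0 / lambda);
    return 0;
}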
Message Passing: N to N
Values for the test runs
Test runs:
λ = 1 to 10^6 msg/s, in powers of 2
3 runs:
1 Majority of shorts: (ps = 0.7, pm = 0.2, pl = 0.1),
2 Majority of longs: (ps = 0.1, pm = 0.2, pl = 0.7)
3 All categories balanced (ps = 0.33, pm = 0.33, pl = 0.33)
Medium block size = 8 KB, long size = 64 KB
Reply probability = 0.33
We run every test for 5 minutes, as longer runs do not affect the numbers
Message Passing: N to N
x86_64 architecture results
x86_64-linux-pshm vs x86_64-barrelfish-pshm vs x86_64-linux-mpi
[Plots: real rate (msg/s) vs perfect rate (msg/s), x86_64 with 70% short msgs (16 cores) and x86_64 with 70% long msgs (16 cores), for lin-pshm, bf-pshm and lin-mpi]
16 cores running simultaneously
Saturation rate (64 kmsg/s short, 8 kmsg/s long)
PSHM runs are better, even on Barrelfish
Again Newlib memcpy() against GLIBC memcpy(); even so, barrelfish-pshm outperforms linux-mpi
Greater gap with MPI for longs: it works better for shorts.
Message Passing: N to N
MPI running on the SCC
scc-linux-mpi
[Plot: Intel SCC with MPIRCK (48 cores), real rate (msg/s) vs perfect rate (msg/s) for short, balanced and long runs]
(48 cores simultaneously)
scc-barrelfish-ump_ipi could not be evaluated due to severe deadlocks and race conditions.
Evaluation
Compiled with the MPIRCK CH2 driver, using the SCC MPBs
Slower convergence ratio than when running with 16 cores
Convergence area: 3:1 ratio for short vs balanced, and 2:1 for balanced vs long
Conclusions
Summary of results
When there is no buffer involved, Barrelfish performs much faster due to the asynchronous design of its message passing.
When a buffer is involved, memcpy() becomes critical
The libc shipped with Barrelfish has the worst performance
GNU GLIBC on Linux is highly optimized but non-portable
There is a compromise between the two: Newlib
The Barrelfish UMP driver is not suitable for large buffers
On the x86_64 architecture
All PSHM setups outperform MPI (even on Barrelfish)
On the Intel SCC
The SCC MPBs are not very suitable for an operating system due to the lack of hardware protection
The MPBs are very small for multitasking; messages longer than the per-core MPB size need flow control.
Conclusions
Final project conclusions
After evaluating the project, we found:
Barrelfish is not mature for everyday work; a lot of engineering work remains
Debugging race conditions is time-consuming, given the lack of a proper debugger and simulator
It has a lot of potential, especially because of its asynchronous nature, which can undoubtedly be exploited.
The Gasnet and Barrelfish MP models are different; many quirks were needed to make it work
The Intel SCC platform has been designed more as an accelerator than as a standalone system.
Contributions
Accepted contributions
Barrelfish
17 commits accepted into the official Barrelfish tree:
Port of the Newlib C library; all programs in the tree now link with it by default
IOAPIC index register access in 32-bit words
Cross-compiler C++ language support
System V shared memory extension
Thread mutex additional operations
Compiler/libc type decoupling
Hake tool extension for creating libraries from libraries
Bochs emulator
Accepted patch to the Bochs emulator to continue deterministic execution in debugging mode.
Contributions
Pending contributions and cross-compiler features
Pending contributions:
GNU cross-compiler tools for building programs on
Barrelfish
Gasnet BF conduit and internal modifications for running
it on Barrelfish
Cross-compiler features
Thanks to this project it is now possible to run standard C++ programs on Barrelfish
Standard GNU programs can be compiled with the cross-compiler with minor changes, e.g.:
./configure --host=x86_64-pc-barrelfish
Example: GNU bash
Contributions
GNU Bash screenshot
Future work
Proposals for continuing the project
Future work:
Redesign the Barrelfish bulk transfer for flexible bucket size and full-duplex operation.
Rewrite flounder to be thread-safe.
A better UMP driver with longer buffer windows.
Faster memcpy() implementations.
Running OpenMP/OmpSs programs with the nanos runtime using the current C++ cross-compiler
End
Questions?
Questions?
Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC

More Related Content

What's hot

P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadOpen-NFP
 
FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)Kirill Tsym
 
Kernel Recipes 2017 - EBPF and XDP - Eric Leblond
Kernel Recipes 2017 - EBPF and XDP - Eric LeblondKernel Recipes 2017 - EBPF and XDP - Eric Leblond
Kernel Recipes 2017 - EBPF and XDP - Eric LeblondAnne Nicolas
 
VSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinks
VSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinksVSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinks
VSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinksOPNFV
 
LF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloadsLF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloadsLF_DPDK
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machineAlexei Starovoitov
 
Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)micchie
 
Replacing iptables with eBPF in Kubernetes with Cilium
Replacing iptables with eBPF in Kubernetes with CiliumReplacing iptables with eBPF in Kubernetes with Cilium
Replacing iptables with eBPF in Kubernetes with CiliumMichal Rostecki
 
mSwitch: A Highly-Scalable, Modular Software Switch
mSwitch: A Highly-Scalable, Modular Software SwitchmSwitch: A Highly-Scalable, Modular Software Switch
mSwitch: A Highly-Scalable, Modular Software Switchmicchie
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecturehugo lu
 
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDKLF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDKLF_DPDK
 
PASTE: Network Stacks Must Integrate with NVMM Abstractions
PASTE: Network Stacks Must Integrate with NVMM AbstractionsPASTE: Network Stacks Must Integrate with NVMM Abstractions
PASTE: Network Stacks Must Integrate with NVMM Abstractionsmicchie
 
ebpf and IO Visor: The What, how, and what next!
ebpf and IO Visor: The What, how, and what next!ebpf and IO Visor: The What, how, and what next!
ebpf and IO Visor: The What, how, and what next!Affan Syed
 
Kernel Recipes 2013 - Nftables, what motivations and what solutions
Kernel Recipes 2013 - Nftables, what motivations and what solutionsKernel Recipes 2013 - Nftables, what motivations and what solutions
Kernel Recipes 2013 - Nftables, what motivations and what solutionsAnne Nicolas
 
Advanced Components on Top of L4Re
Advanced Components on Top of L4ReAdvanced Components on Top of L4Re
Advanced Components on Top of L4ReVasily Sartakov
 
Socket Programming- Data Link Access
Socket Programming- Data Link AccessSocket Programming- Data Link Access
Socket Programming- Data Link AccessLJ PROJECTS
 
introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack monad bobo
 
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...LF_DPDK
 
Cilium - Network security for microservices
Cilium - Network security for microservicesCilium - Network security for microservices
Cilium - Network security for microservicesThomas Graf
 

What's hot (20)

P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC Offload
 
FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)
 
Kernel Recipes 2017 - EBPF and XDP - Eric Leblond
Kernel Recipes 2017 - EBPF and XDP - Eric LeblondKernel Recipes 2017 - EBPF and XDP - Eric Leblond
Kernel Recipes 2017 - EBPF and XDP - Eric Leblond
 
VSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinks
VSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinksVSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinks
VSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinks
 
LF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloadsLF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloads
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machine
 
Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)
 
Replacing iptables with eBPF in Kubernetes with Cilium
Replacing iptables with eBPF in Kubernetes with CiliumReplacing iptables with eBPF in Kubernetes with Cilium
Replacing iptables with eBPF in Kubernetes with Cilium
 
mSwitch: A Highly-Scalable, Modular Software Switch
mSwitch: A Highly-Scalable, Modular Software SwitchmSwitch: A Highly-Scalable, Modular Software Switch
mSwitch: A Highly-Scalable, Modular Software Switch
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecture
 
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDKLF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
 
PASTE: Network Stacks Must Integrate with NVMM Abstractions
PASTE: Network Stacks Must Integrate with NVMM AbstractionsPASTE: Network Stacks Must Integrate with NVMM Abstractions
PASTE: Network Stacks Must Integrate with NVMM Abstractions
 
ebpf and IO Visor: The What, how, and what next!
ebpf and IO Visor: The What, how, and what next!ebpf and IO Visor: The What, how, and what next!
ebpf and IO Visor: The What, how, and what next!
 
Kernel Recipes 2013 - Nftables, what motivations and what solutions
Kernel Recipes 2013 - Nftables, what motivations and what solutionsKernel Recipes 2013 - Nftables, what motivations and what solutions
Kernel Recipes 2013 - Nftables, what motivations and what solutions
 
Memory, IPC and L4Re
Memory, IPC and L4ReMemory, IPC and L4Re
Memory, IPC and L4Re
 
Advanced Components on Top of L4Re
Advanced Components on Top of L4ReAdvanced Components on Top of L4Re
Advanced Components on Top of L4Re
 
Socket Programming- Data Link Access
Socket Programming- Data Link AccessSocket Programming- Data Link Access
Socket Programming- Data Link Access
 
introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack
 
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
 
Cilium - Network security for microservices
Cilium - Network security for microservicesCilium - Network security for microservices
Cilium - Network security for microservices
 

Viewers also liked

Denuncia penal 25 enero 2016
Denuncia penal 25 enero 2016Denuncia penal 25 enero 2016
Denuncia penal 25 enero 2016SINAUT SUNAT
 
PAD 520 WEEK 7 ASSIGNMENT 3
PAD 520 WEEK 7 ASSIGNMENT 3PAD 520 WEEK 7 ASSIGNMENT 3
PAD 520 WEEK 7 ASSIGNMENT 3hwacer123
 
Newsletter dated 14th February 2017
Newsletter dated 14th February 2017Newsletter dated 14th February 2017
Newsletter dated 14th February 2017Rajiv Bajaj
 
Presentación1electiva4
Presentación1electiva4Presentación1electiva4
Presentación1electiva4Carlos mota
 
Patricia Pearson-Kotz Recorded PowerPoint Resume
Patricia Pearson-Kotz Recorded PowerPoint ResumePatricia Pearson-Kotz Recorded PowerPoint Resume
Patricia Pearson-Kotz Recorded PowerPoint ResumePatti Pearson-Kotz
 
ParanaVision Presentation (English)
ParanaVision Presentation (English)ParanaVision Presentation (English)
ParanaVision Presentation (English)Alphan Manas
 
Akıllı Durak Projesi Tanıtımı (1999)
Akıllı Durak Projesi Tanıtımı (1999)Akıllı Durak Projesi Tanıtımı (1999)
Akıllı Durak Projesi Tanıtımı (1999)Alphan Manas
 

Viewers also liked (10)

Denuncia penal 25 enero 2016
Denuncia penal 25 enero 2016Denuncia penal 25 enero 2016
Denuncia penal 25 enero 2016
 
PAD 520 WEEK 7 ASSIGNMENT 3
PAD 520 WEEK 7 ASSIGNMENT 3PAD 520 WEEK 7 ASSIGNMENT 3
PAD 520 WEEK 7 ASSIGNMENT 3
 
Newsletter dated 14th February 2017
Newsletter dated 14th February 2017Newsletter dated 14th February 2017
Newsletter dated 14th February 2017
 
Resuelto power
Resuelto powerResuelto power
Resuelto power
 
Presentación1electiva4
Presentación1electiva4Presentación1electiva4
Presentación1electiva4
 
Patricia Pearson-Kotz Recorded PowerPoint Resume
Patricia Pearson-Kotz Recorded PowerPoint ResumePatricia Pearson-Kotz Recorded PowerPoint Resume
Patricia Pearson-Kotz Recorded PowerPoint Resume
 
Izquierda y derecha
Izquierda y derechaIzquierda y derecha
Izquierda y derecha
 
ParanaVision Presentation (English)
ParanaVision Presentation (English)ParanaVision Presentation (English)
ParanaVision Presentation (English)
 
Akıllı Durak Projesi Tanıtımı (1999)
Akıllı Durak Projesi Tanıtımı (1999)Akıllı Durak Projesi Tanıtımı (1999)
Akıllı Durak Projesi Tanıtımı (1999)
 
Ibiza
IbizaIbiza
Ibiza
 

Similar to bfgasnet_pr-v2

Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDKKernel TLV
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesDr. Fabio Baruffa
 
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)Talal Khaliq
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPAnil Bohare
 
Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...PT Datacomm Diangraha
 
bfarm-v2
bfarm-v2bfarm-v2
bfarm-v2Zeus G
 
Summit 16: How to Compose a New OPNFV Solution Stack?
Summit 16: How to Compose a New OPNFV Solution Stack?Summit 16: How to Compose a New OPNFV Solution Stack?
Summit 16: How to Compose a New OPNFV Solution Stack?OPNFV
 
Feedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureFeedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureAntoine Poliakov
 
Application Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance CenterApplication Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance Centerinside-BigData.com
 
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuAlan Sill
 
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxScyllaDB
 
Pacemaker+DRBD
Pacemaker+DRBDPacemaker+DRBD
Pacemaker+DRBDDan Frincu
 
ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...
ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...
ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...Neil Armstrong
 
directCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressdirectCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressHeiko Joerg Schick
 
Architecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceArchitecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceSpeck&Tech
 
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...Eric Van Hensbergen
 
04536342
0453634204536342
04536342fidan78
 
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running LinuxLinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linuxbrouer
 
9/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'169/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'16Kangaroot
 

Similar to bfgasnet_pr-v2 (20)

Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
 
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
 
Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...
 
bfarm-v2
bfarm-v2bfarm-v2
bfarm-v2
 
Summit 16: How to Compose a New OPNFV Solution Stack?
Summit 16: How to Compose a New OPNFV Solution Stack?Summit 16: How to Compose a New OPNFV Solution Stack?
Summit 16: How to Compose a New OPNFV Solution Stack?
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
 
Feedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureFeedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows Azure
 
Application Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance CenterApplication Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance Center
 
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttu
 
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
 
Pacemaker+DRBD
Pacemaker+DRBDPacemaker+DRBD
Pacemaker+DRBD
 
ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...
ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...
ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...
 
directCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressdirectCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI Express
 
Architecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceArchitecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for science
 
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
 
04536342
0453634204536342
04536342
 
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running LinuxLinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
 
9/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'169/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'16
 

bfgasnet_pr-v2

  • 1. Introduction Tests Conclusions Contributions Future work Gasnet library evaluation on Barrelfish and Intel SCC June 30, 2012 Zeus G´omez Marmolejo Barcelona Supercomputing Center Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 2. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Contents 1 Introduction Motivation Project goals Software architecture 2 Tests Hardware Configurations MP: 1 to 1 MP: N to N 3 Conclusions 4 Contributions 5 Future work Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 3. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Introduction Motivation Future trends: Multi-core CPUs and multi-core GPUs in a single chip. Shared memory and cache coherence complexity. This May not scale in the future. Problems with shared memory OS like Linux or Windows and many core systems. Message passing OS like Barrelfish. Experiments on non-coherent multi-core shared architectures: Intel SCC and its MPBs. Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 4. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Linux approach Multi-core operating systems using shared memory core 0 struct page { ... spinlock_t ptl; }; core 1 core 2 core N... Data sharing: Access locks False sharing Memory Contention Hardware cache coherence Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 5. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Barrelfish approach No sharing, but message passing System Knowledge Base: No driver software! Message passing: No sharing at all System processes Asynchronous calls Interconnect drivers Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 6. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Project goals Looking for the appropriate library that meets the desired features Port a well-known message passing library to Barrelfish... Desired features: Portable across different architectures, systems and OSs. Highly efficient. Used in many applications and parallel languages. Be able to run standard OpenMP programs via the nanos runtime. Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 7. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Project goals Looking for the appropriate library that meets the desired features Port a well-known message passing library to Barrelfish... Desired features: Portable across different architectures, systems and OSs. Highly efficient. Used in many applications and parallel languages. Be able to run standard OpenMP programs via the nanos runtime. The Gasnet library from the University of Berkeley fulfills these expectations. Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 8. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Gasnet library Low level communication library Network hardware UDP conduit SMP conduit MPI conduit BF conduit Gasnet core API Low level communication library: implements UPC, Titanium, OmpSs. Different categories: AMShort, AMMedium, AMLong Message types: requests, replies Private Shared Memory (PSHM) mode for a conduit Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 9. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Barrelfish Message Passing Generated stubs for efficient message passing msg() user.c ... ... ump_hdlr() { user_flounder _bindings.c ... cache write } core 0 process event_dispatch() { waitset.c ... ump_rx() { ... msg() } core 1 process closure.handler() ... ... user_flounder _bindings.c msg() { ... } user.c } ... Non-blocking asynchronous calls. Continuation closure, called also asynchronously. Messages sent as RPC. Generated C code by flounder tool, in Haskell, depending on the interconnect driver. Fast event handling code on receiving side (polling). When all arguments are assembled, call is made to user program. Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 10. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Differences with Barrelfish MP model Similarities, differences and solutions proposed Similarities: Gasnet Nodes → Barrelfish Cores Messages as RPC Necessity to send large buffers Be able to send back replies Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 11. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Differences with Barrelfish MP model Similarities, differences and solutions proposed Similarities: Gasnet Nodes → Barrelfish Cores Messages as RPC Necessity to send large buffers Be able to send back replies Differences: Gasnet calls must block Non-blocking message handlers No thread-safe Solutions: 2 threads: application & Gasnet Leader-followers thread serving model 4 independent channels per peer Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 12. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Gasnet BF conduit implementation Details of the BF implementation using previous solutions Details: Uses the BF flounder generated stub to pass messages. To simulate the synchronous behavior of Gasnet, we use 2 threads. One of them is coming from the pool. However, the binding cannot be handled by two threads concurrently without proper locking. thread 1 GASNET BARRELFISH core 1 BARRELFISH core 2 gasnet_AMShort() ack ack ack GASNET handler call ambf_ump_send_handler() endr() thread 2 thread 2 thread 1 Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 13. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Contents 1 Introduction Motivation Project goals Software architecture 2 Tests Hardware Configurations MP: 1 to 1 MP: N to N 3 Conclusions 4 Contributions 5 Future work Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 14. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Test systems Intel x86 64 SMP system Sun Fire X2270 M2, with 2 Intel Xeon CPU E5620@ 2.40GHz: chip 0 chip 1QPI link (8 CPUs) (8 CPUs) DDR3 0 DDR3 1 DDR3 2 DDR3 0 DDR3 1 DDR3 2 SMP system Features: Intel x86 64 architecture 2 chips x 4 SMP x 2 SMT = 16 CPUs 32 GB RAM NUMA QPI link: 25 GB/s Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 15. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Test systems Intel Single-chip Cloud Computer system Features: 48 CPUs Intel 32-bit P54C in a single chip Non-coherent caches Routers and MPBs for message passing 4 DDR3 memory controllers Shipped as: A cluster of 48 linux systems accessed by SSH No OS prev to Barrelfish sees it as a single system Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 16. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Test configurations Possible combinations of architecture & OS & Gasnet conduit/interconnect driver x86 64-linux-pshm. SMP with Linux and PSHM in Gasnet. x86 64-linux-mpi. SMP with Linux with MPI conduit for Gasnet. x86 64-barrelfish-pshm. SMP running Barrelfish and PSHM in Gasnet. x86 64-barrelfish-ump. SMP running Barrelfish and the User-Level Message Passing. scc-linux-mpi. Intel SCC running Linux on all cores with Gasnet MPI conduit, compiled with the MPIRCK CH2 driver. scc-barrelfish-ump ipi. Intel SCC running Barrelfish with Gasnet BF conduit with UMP & Inter-Process Interrupts backend. Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 17. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Test configurations Possible combinations of architecture & OS & Gasnet conduit/interconnect driver x86 64-linux-pshm. SMP with Linux and PSHM in Gasnet. x86 64-linux-mpi. SMP with Linux with MPI conduit for Gasnet. x86 64-barrelfish-pshm. SMP running Barrelfish and PSHM in Gasnet. x86 64-barrelfish-ump. SMP running Barrelfish and the User-Level Message Passing. scc-linux-mpi. Intel SCC running Linux on all cores with Gasnet MPI conduit, compiled with the MPIRCK CH2 driver. scc-barrelfish-ump ipi. Intel SCC running Barrelfish with Gasnet BF conduit with UMP & Inter-Process Interrupts backend. Not tested: 32-bit SMP system Bulk transfer Barrelfish transfer mode Intel SCC MPBs flounder backend: Fast, but very short and unprotected. SCC is seen as an accelerator Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 18. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Message Passing: 1 to 1 Only two nodes are sending messages Testam benchmark from Gasnet. 1000 mesages of: Ping-pong roundtrip Request - Reply (prqp) Ping-pong roundtrip Request - Request (prqq) Flood one-way Request (foq) Flood roundtrip Request - Reply (frqp) Flood roundtrip Request - Request (frqq) Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 19. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Message Passing: 1 to 1 Only two nodes are sending messages Testam benchmark from Gasnet. 1000 mesages of: Ping-pong roundtrip Request - Reply (prqp) Ping-pong roundtrip Request - Request (prqq) Flood one-way Request (foq) Flood roundtrip Request - Reply (frqp) Flood roundtrip Request - Request (frqq) 0.5 1 1.5 2 2.5 3 3.5 4 4.5 prqp prqq foq frqp delay(us) test type AMShort testam x86_64-linux-pshm linux-pshm 0 2000 4000 6000 8000 10000 12000 14000 1 16 256 4k 64k 1M throughput(Mbytes/s) buffer size (bytes) AMLong testam x86_64-linux-pshm prqp prqq foq frqp frqq Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 20. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Message Passing: 1 to 1 Simplest comparison x86 64-linux-pshm vs x86 64-barrelfish-pshm Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 21. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Message Passing: 1 to 1 Simplest comparison x86 64-linux-pshm vs x86 64-barrelfish-pshm 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 prqp prqq foq frqp delay(us) test type AMShort testam x86_64-*-pshm linux barrelfish 0 2000 4000 6000 8000 10000 12000 1 16 256 4k 64k 1M throughput(Mbytes/s) buffer size (bytes) AMLong testam x86_64-*-pshm linux barrelfish Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 22. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Message Passing: 1 to 1 Simplest comparison x86 64-linux-pshm vs x86 64-barrelfish-pshm 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 prqp prqq foq frqp delay(us) test type AMShort testam x86_64-*-pshm linux barrelfish 0 2000 4000 6000 8000 10000 12000 1 16 256 4k 64k 1M throughput(Mbytes/s) buffer size (bytes) AMLong testam x86_64-*-pshm linux barrelfish On AMShort category Barrelfish is much faster, as async MP handlers are very efficient On AMMedium and AMLong categories, when the buffer >= 2048 Kb Barrelfish performs worse Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 23. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Message Passing: 1 to 1 Performance analysis breakdown For long messages, most of the time is spent in the memcpy() libc function, copying bytes from one region to another We tried different implementations to see the result Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 24. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Message Passing: 1 to 1 Performance analysis breakdown For long messages, most of the time is spent in the memcpy() libc function, copying bytes from one region to another We tried different implementations to see the result 0 2000 4000 6000 8000 10000 12000 1 16 256 4k 64k 1M throughput(Mbytes/s) buffer size (bytes) Different memcpy() implementations linux-glibc bf-glibc bf-newlib bf-oldc Test runs: 1 Linux with GNU GLIBC using supl. SSE3 and REP prefix 2 Barrelfish GNU GLIBC memcpy() 3 Barrelfish with Red Hat Newlib using REP 4 Barrelfish with old libc (C language) Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 25-27. Message Passing: 1 to 1 — MPI vs UMP on the SMP: x86_64-linux-mpi vs x86_64-barrelfish-ump.
  [Plot: AMMedium testam throughput (MB/s) vs. buffer size, linux vs barrelfish; on linux-mpi the AMMedium maximum size equals the AMLong one]
  Barrelfish always performs worse here: the UMP interconnect driver is not designed for sending large buffers, and again the Newlib memcpy() is competing against the glibc memcpy().
  UMP decomposes buffers into fragments and needs an ACK for each fragment. Piggybacking is implemented but not used, to avoid handler deadlocks.
  The bulk-transfer mechanism is the one designed for this purpose: large buffers.
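  To see why a per-fragment ACK hurts, a small back-of-the-envelope model (not Barrelfish or UMP code) helps: with stop-and-wait, throughput is capped at roughly one fragment per round trip, while keeping W fragments in flight raises the cap by about a factor of W. The fragment payload size and round-trip time below are purely illustrative assumptions.

  #include <stdio.h>

  /* Throughput cap (MB/s) for a channel that splits a buffer into fragments
   * and keeps `in_flight` fragments outstanding per round trip. */
  static double cap_mbps(double frag_bytes, double rtt_us, int in_flight) {
    double bytes_per_sec = in_flight * frag_bytes / (rtt_us * 1e-6);
    return bytes_per_sec / (1024.0 * 1024.0);
  }

  int main(void) {
    double frag = 56.0;  /* payload bytes per fragment (assumed) */
    double rtt  = 1.0;   /* fragment + ACK round trip in microseconds (assumed) */
    printf("stop-and-wait (1 in flight): %7.1f MB/s\n", cap_mbps(frag, rtt, 1));
    printf("window of 16 in flight     : %7.1f MB/s\n", cap_mbps(frag, rtt, 16));
    return 0;
  }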
  • 28-30. Message Passing: 1 to 1 — MPI vs UMP on the SCC: scc-linux-mpi vs scc-barrelfish-ump_ipi.
  [Plots: AMMedium and AMLong testam throughput (MB/s) vs. buffer size on the SCC, linux vs barrelfish]
  Again the Newlib memcpy() competes against the glibc memcpy(), with the same problems as before.
  On Linux the SCC MPBs are fully used, since a single application owns them.
  On Linux there is a strong performance degradation once the SCC MPBs overflow (beyond 8 KB).
  • 31. Message Passing: N to N — Description of the test application.
  Now we want to model a real system:
  A node can send messages to any other node of the system with equal probability.
  Messages are sent as a Poisson process; idle times follow an exponential distribution with rate parameter λ.
  From a uniform random variable U we obtain the idle time as T = −ln(U) / λ.
  We can choose the probability of sending AMShort, AMMedium and AMLong messages; buffers now have a fixed size.
  We also model the probability that a request gets a reply.
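  As a sketch of how such idle times can be drawn, the following applies inverse-transform sampling to the formula above; the rand()-based uniform generator and the λ value are illustrative assumptions, not the project's exact generator.

  #include <math.h>     /* link with -lm */
  #include <stdio.h>
  #include <stdlib.h>

  /* Draw an exponentially distributed idle time (seconds) with rate lambda,
   * using inverse-transform sampling: T = -ln(U) / lambda, with U in (0,1]. */
  static double exp_idle_time(double lambda) {
    double u = (rand() + 1.0) / ((double)RAND_MAX + 1.0);  /* avoid U == 0 */
    return -log(u) / lambda;
  }

  int main(void) {
    double lambda = 1024.0;       /* illustrative rate: 1024 msg/s */
    double total = 0.0;
    srand(42);
    for (int i = 0; i < 10; i++) {
      double t = exp_idle_time(lambda);
      total += t;
      printf("idle time %d: %.6f s\n", i, t);
    }
    printf("mean of 10 draws: %.6f s (expected 1/lambda = %.6f s)\n",
           total / 10.0, 1.0 / lambda);
    return 0;
  }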
  • 32. Message Passing: N to N — Values for the test runs.
  λ = 1 to 10^6 msg/s, in powers of 2.
  3 runs:
  1 Majority of shorts (ps = 0.7, pm = 0.2, pl = 0.1)
  2 Majority of longs (ps = 0.1, pm = 0.2, pl = 0.7)
  3 All categories balanced (ps = 0.33, pm = 0.33, pl = 0.33)
  Medium block size = 8 KB, long block size = 64 KB.
  Reply probability = 0.33.
  Every test runs for 5 minutes; longer runs do not change the numbers.
  • 33-35. Message Passing: N to N — x86_64 architecture results: x86_64-linux-pshm vs x86_64-barrelfish-pshm vs x86_64-linux-mpi.
  [Plots: real rate vs. perfect rate (msg/s) on x86_64 with 16 cores, for 70% short and 70% long messages; lin-pshm, bf-pshm, lin-mpi]
  16 cores run simultaneously.
  Saturation rate: about 64 kmsg/s for shorts, 8 kmsg/s for longs.
  The PSHM runs are better, even on Barrelfish.
  Again the Newlib memcpy() competes against the glibc memcpy(), yet barrelfish-pshm still outperforms linux-mpi.
  The gap with MPI is larger for long messages; MPI does comparatively better with shorts.
  • 36-37. Message Passing: N to N — MPI running on the SCC: scc-linux-mpi.
  [Plot: real rate vs. perfect rate (msg/s) on the Intel SCC with RCKMPI, 48 cores, for short / balanced / long message mixes]
  48 cores run simultaneously. scc-barrelfish-ump_ipi could not be evaluated due to severe deadlocks and race conditions.
  Evaluation:
  Compiled against RCKMPI (MPICH2-based), which uses the SCC MPBs.
  Convergence is slower than when running with 16 cores.
  In the convergence area the ratio is 3:1 between short and balanced, and 2:1 between balanced and long.
  • 38. Contents — next section: Conclusions.
  • 39. Conclusions — Summary of results.
  When no buffer is involved, Barrelfish performs much faster thanks to the asynchronous design of its message passing.
  When a buffer is involved, memcpy() becomes critical:
  The libc shipped with Barrelfish has the worst performance.
  GNU glibc on Linux is highly optimized but not portable.
  Newlib is a compromise between the two.
  The UMP Barrelfish driver is not suitable for large buffers.
  On the x86_64 architecture, all PSHM setups outperform MPI (even the Barrelfish one).
  On the Intel SCC, the MPBs are not well suited to an operating system because they lack hardware protection.
  The MPB space available per core for multitasking is very small; messages longer than the per-core MPB size need flow control.
  • 40. Conclusions — Final project conclusions.
  After evaluating the project, we found:
  Barrelfish is not yet mature for everyday work; a lot of engineering work remains.
  Debugging race conditions is time-consuming, given the lack of a proper debugger and simulator.
  It has a lot of potential, especially because of its asynchronous nature, which can undoubtedly be exploited.
  The Gasnet and Barrelfish message-passing models are different; many workarounds were needed to make them work together.
  The Intel SCC platform was designed more as an accelerator than as a standalone system.
  • 41. Contents — next section: Contributions.
  • 42. Contributions — Accepted contributions.
  Barrelfish: 17 commits accepted into Barrelfish's official tree:
  Port of the Newlib C library; all programs in the tree now link against it by default.
  IOAPIC index register access in 32-bit words.
  Cross-compiler C++ language support.
  System V shared memory extension.
  Additional thread mutex operations.
  Compiler/libc type decoupling.
  Hake tool extension for creating libraries from libraries.
  Bochs emulator: accepted patch to continue deterministic execution in debugging mode.
  • 43. Contributions — Pending contributions and cross-compiler features.
  Pending contributions:
  GNU cross-compiler tools for building programs for Barrelfish.
  Gasnet BF conduit and the internal modifications needed to run it on Barrelfish.
  Cross-compiler features:
  Thanks to this project it is now possible to run standard C++ programs on Barrelfish.
  Standard GNU programs can be compiled with the cross-compiler with only minor changes, such as: ./configure --host=x86_64-pc-barrelfish
  Example: GNU bash.
  • 44. Contributions — GNU Bash running on Barrelfish (screenshot).
  • 45. Contents — next section: Future work.
  • 46. Future work — Proposals for continuing the project.
  Redesign the Barrelfish bulk-transfer mechanism for flexible bucket sizes and full-duplex operation.
  Rewrite Flounder to be thread-safe.
  Build a better UMP driver with longer buffer windows.
  Provide faster memcpy() implementations.
  Run OpenMP/OmpSs programs with the Nanos runtime using the current C++ cross-compiler.
  • 47. End — Questions?