Gasnet library evaluation on Barrelfish and
Intel SCC
June 30, 2012
Zeus Gómez Marmolejo
Barcelona Supercomputing Center
Contents
1 Introduction
Motivation
Project goals
Software architecture
2 Tests
Hardware
Configurations
MP: 1 to 1
MP: N to N
3 Conclusions
4 Contributions
5 Future work
Introduction
Motivation
Future trends:
Multi-core CPUs and multi-core GPUs in a single chip.
Shared memory and cache coherence complexity: this may not scale in the future.
Problems with shared memory OS like Linux or Windows
and many core systems.
Message passing OS like Barrelfish.
Experiments on non-coherent shared-memory multi-core architectures: the Intel SCC and its MPBs (message passing buffers).
Linux approach
Multi-core operating systems using shared memory
[Diagram: cores 0..N all accessing a shared kernel structure, e.g.
struct page {
    ...
    spinlock_t ptl;
};
]
Data sharing issues:
Access locks
False sharing
Memory contention
Hardware cache coherence
Barrelfish approach
No sharing, but message passing
System Knowledge Base:
No driver software!
Message passing:
No sharing at all
System processes
Asynchronous calls
Interconnect drivers
Project goals
Looking for the appropriate library that meets the desired features
Port a well-known message passing library to
Barrelfish...
Desired features:
Portable across different architectures, systems and OSs.
Highly efficient.
Used in many applications and parallel languages.
Able to run standard OpenMP programs via the nanos runtime.
The Gasnet library from the University of California, Berkeley fulfills these expectations.
Gasnet library
Low level communication library
[Diagram: the Gasnet core API layered over the UDP, SMP, MPI and BF conduits, on top of the network hardware]
Low-level communication library: used to implement UPC, Titanium, OmpSs.
Different categories: AMShort, AMMedium, AMLong
Message types: requests, replies
Private Shared Memory (PSHM) mode for a conduit
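As a rough illustration of the request/reply active-message model listed above, the following is a minimal sketch against the Gasnet core API; the handler indices, the one-argument AMShort variants and the overall program are assumptions chosen for brevity, not code from this project.

#include <gasnet.h>

#define HIDX_PING 201   /* hypothetical handler indices (client range 128-255) */
#define HIDX_PONG 202

/* Request handler: runs on the destination node and sends back a reply. */
static void ping_handler(gasnet_token_t token, gasnet_handlerarg_t arg) {
    gasnet_AMReplyShort1(token, HIDX_PONG, arg);
}

/* Reply handler: runs back on the requesting node. */
static volatile int got_reply = 0;
static void pong_handler(gasnet_token_t token, gasnet_handlerarg_t arg) {
    (void)token; (void)arg;
    got_reply = 1;
}

int main(int argc, char **argv) {
    gasnet_handlerentry_t handlers[] = {
        { HIDX_PING, (void (*)())ping_handler },
        { HIDX_PONG, (void (*)())pong_handler },
    };
    gasnet_init(&argc, &argv);
    gasnet_attach(handlers, 2, GASNET_PAGESIZE, 0);

    if (gasnet_mynode() == 0 && gasnet_nodes() > 1) {
        gasnet_AMRequestShort1(1, HIDX_PING, 42);  /* AMShort request to node 1 */
        GASNET_BLOCKUNTIL(got_reply);              /* poll until the reply handler has run */
    }
    gasnet_exit(0);
    return 0;
}

The AMMedium and AMLong categories follow the same pattern but additionally carry a payload buffer (and, for AMLong, a destination address in the remote segment).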
Barrelfish Message Passing
Generated stubs for efficient message passing
[Diagram: core 0 process: msg() in user.c enters the flounder-generated user_flounder_bindings.c, whose ump_hdlr() writes the message to the channel (cache write). Core 1 process: event_dispatch() in waitset.c polls the channel, ump_rx() in the generated bindings reassembles the message and invokes the closure handler, which delivers msg() to user.c]
Non-blocking asynchronous calls.
Continuation closure, called also asynchronously.
Messages sent as RPC.
C code generated by the flounder tool (written in Haskell), depending on the interconnect driver.
Fast event handling code on receiving side (polling).
When all arguments are assembled, the call is made to the user program.
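To make the control flow above concrete, here is a schematic sketch of the send/receive pattern in C. The interface name (user), its message (msg) and the payload are hypothetical, and binding setup (export/bind) is omitted; only the overall shape (tx_vtbl send with a continuation, rx_vtbl handler, event_dispatch loop) follows the usual Barrelfish flounder conventions.

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <barrelfish/barrelfish.h>
#include <barrelfish/waitset.h>
#include <if/user_defs.h>          /* hypothetical flounder-generated header */

static void send_done(void *arg)
{
    /* Continuation closure: called asynchronously once the message has left. */
}

static void msg_rx(struct user_binding *b, uint64_t payload)
{
    /* Receive handler on the other core: called when all arguments of
     * the message have been assembled by the generated stubs. */
}

static struct user_rx_vtbl rx_vtbl = {
    .msg = msg_rx,
};

static void init_binding(struct user_binding *b)
{
    /* Install the receive handlers on an already-established binding. */
    b->rx_vtbl = rx_vtbl;
}

static void send_one(struct user_binding *b, uint64_t payload)
{
    /* Non-blocking send: returns immediately, completion is reported
     * through the continuation. */
    errval_t err = b->tx_vtbl.msg(b, MKCONT(send_done, NULL), payload);
    assert(err_is_ok(err) || err_no(err) == FLOUNDER_ERR_TX_BUSY);
}

static void dispatch_loop(void)
{
    /* Receiving side: poll the waitset and run handlers as events arrive. */
    struct waitset *ws = get_default_waitset();
    while (true) {
        errval_t err = event_dispatch(ws);
        assert(err_is_ok(err));
    }
}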
Differences with Barrelfish MP model
Similarities, differences and solutions proposed
Similarities:
Gasnet Nodes → Barrelfish Cores
Messages as RPC
Necessity to send large buffers
Be able to send back replies
Differences:
Gasnet calls must block
Non-blocking message handlers
Not thread-safe
Solutions:
2 threads: application & Gasnet
Leader-followers thread serving model
4 independent channels per peer
Gasnet BF conduit implementation
Details of the BF implementation using previous solutions
Details:
Uses the BF flounder generated stub to pass messages.
To simulate the synchronous behavior of Gasnet, we use 2 threads; one of them comes from the thread pool.
However, the binding cannot be handled by two threads concurrently without proper locking.
[Sequence diagram: on core 1, gasnet_AMShort() in thread 1 invokes ambf_ump_send_handler(); Barrelfish carries the message to core 2, where thread 2 performs the Gasnet handler call; the acks travel back through Barrelfish to the waiting thread on core 1]
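A minimal sketch of this two-thread, blocking-over-asynchronous pattern, written with POSIX threads for portability; the real conduit uses Barrelfish threads and the flounder binding, and all names here (blocking_am_short, dispatch_thread, ...) are hypothetical.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool request_pending = false;
static bool acked = false;

/* Runs in the dispatching (Gasnet) thread when the ack message arrives. */
static void ack_handler(void)
{
    pthread_mutex_lock(&lock);
    acked = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* Dispatch thread: stands in for the Barrelfish event_dispatch() loop;
 * here it simply pretends the remote ack arrives after a short delay. */
static void *dispatch_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        bool pending = request_pending;
        request_pending = false;
        pthread_mutex_unlock(&lock);
        if (pending) {
            usleep(1000);          /* "remote handler" runs, ack comes back */
            ack_handler();
        }
        usleep(100);
    }
    return NULL;
}

/* Blocking call as seen by the application thread: post the request,
 * then sleep until the dispatch thread has delivered the ack. */
static void blocking_am_short(int dest_core, int handler_idx, int arg0)
{
    (void)dest_core; (void)handler_idx; (void)arg0;
    pthread_mutex_lock(&lock);
    acked = false;
    request_pending = true;        /* hypothetical stand-in for the flounder send */
    while (!acked)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, dispatch_thread, NULL);
    blocking_am_short(1, 201, 42);
    printf("request acknowledged\n");
    return 0;
}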
Test systems
Intel x86_64 SMP system
Sun Fire X2270 M2, with 2× Intel Xeon E5620 @ 2.40 GHz:
[Diagram: chip 0 (8 CPUs) and chip 1 (8 CPUs) connected by a QPI link, each with three DDR3 channels]
Features:
Intel x86_64 architecture
2 chips × 4 cores × 2 SMT = 16 CPUs
32 GB NUMA RAM
QPI link: 25 GB/s
Test systems
Intel Single-chip Cloud Computer system
Features:
48 Intel 32-bit P54C CPUs in a single chip
Non-coherent caches
Routers and MPBs for message passing
4 DDR3 memory controllers
Shipped as a cluster of 48 Linux systems accessed via SSH.
No OS prior to Barrelfish sees it as a single system.
Test configurations
Possible combinations of architecture & OS & Gasnet conduit/interconnect driver
x86_64-linux-pshm. SMP with Linux and PSHM in Gasnet.
x86_64-linux-mpi. SMP with Linux and the MPI conduit for Gasnet.
x86_64-barrelfish-pshm. SMP running Barrelfish and PSHM in Gasnet.
x86_64-barrelfish-ump. SMP running Barrelfish and User-level Message Passing (UMP).
scc-linux-mpi. Intel SCC running Linux on all cores with the Gasnet MPI conduit, compiled with the MPIRCK CH2 driver.
scc-barrelfish-ump_ipi. Intel SCC running Barrelfish with the Gasnet BF conduit, using the UMP & Inter-Processor Interrupts (IPI) backend.
Not tested:
32-bit SMP system
The Barrelfish bulk transfer mode
The Intel SCC MPBs flounder backend: fast, but messages are very short and unprotected. The SCC is seen as an accelerator
Message Passing: 1 to 1
Only two nodes are sending messages
Testam benchmark from Gasnet. 1000 messages of:
Ping-pong roundtrip Request - Reply (prqp)
Ping-pong roundtrip Request - Request (prqq)
Flood one-way Request (foq)
Flood roundtrip Request - Reply (frqp)
Flood roundtrip Request - Request (frqq)
[Plots: AMShort testam x86_64-linux-pshm, delay (µs) per test type; AMLong testam x86_64-linux-pshm, throughput (MB/s) vs buffer size (1 B to 1 MB) for prqp, prqq, foq, frqp, frqq]
Message Passing: 1 to 1
Simplest comparison
x86_64-linux-pshm
vs
x86_64-barrelfish-pshm
[Plots: AMShort testam x86_64-*-pshm, delay (µs) per test type, linux vs barrelfish; AMLong testam x86_64-*-pshm, throughput (MB/s) vs buffer size, linux vs barrelfish]
In the AMShort category Barrelfish is much faster, as the asynchronous MP handlers are very efficient
In the AMMedium and AMLong categories, when the buffer is >= 2048 Kb, Barrelfish performs worse
Message Passing: 1 to 1
Performance analysis breakdown
For long messages, most of the time is spent in the memcpy()
libc function, copying bytes from one region to another
We tried different implementations to see the result
[Plot: throughput (MB/s) vs buffer size for different memcpy() implementations: linux-glibc, bf-glibc, bf-newlib, bf-oldc]
Test runs:
1 Linux with GNU GLIBC, using supplemental SSE3 and the REP prefix
2 Barrelfish with the GNU GLIBC memcpy()
3 Barrelfish with Red Hat Newlib, using REP
4 Barrelfish with the old libc (plain C)
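As a point of reference for the "REP" variants above, a minimal x86-64 memcpy() built on the REP prefix looks roughly like the sketch below; this is only an illustrative assumption about the style of those implementations, not code taken from Newlib or glibc, which add alignment handling and wider moves.

#include <stdio.h>
#include <stddef.h>

/* Minimal REP-prefix byte copy (GCC/Clang inline assembly, x86-64). */
static void *rep_memcpy(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
    return ret;
}

int main(void)
{
    char buf[16];
    rep_memcpy(buf, "hello", 6);
    printf("%s\n", buf);
    return 0;
}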
Message Passing: 1 to 1
MPI vs UMP on the SMP
x86_64-linux-mpi
vs
x86_64-barrelfish-ump
[Plot: AMMedium testam x86_64-{linux-mpi,bf-ump}, throughput (MB/s) vs buffer size, linux vs barrelfish]
(on linux-mpi the maximum AMMedium size equals the AMLong size)
Barrelfish always performs worse; the UMP interconnect driver is not designed for sending large buffers.
Again Newlib memcpy() against GLIBC memcpy()!
UMP decomposes buffers into fragments
An ACK is needed for each fragment. Piggybacking is implemented but not used, to avoid handler deadlocks
Bulk transfer is designed for this purpose: large buffers
Message Passing: 1 to 1
MPI vs UMP on the SCC
scc-linux-mpi
vs
scc-barrelfish-ump_ipi
[Plots: AMMedium testam scc-{linux-mpi,bf-ump} and AMLong testam scc-{linux-mpi,bf-ump}, throughput (MB/s) vs buffer size, linux vs barrelfish]
Again Newlib memcpy() against GLIBC memcpy()
Same problems as before
On Linux, the SCC MPBs are fully used (a single application)
On Linux we can see a strong performance degradation when the SCC MPBs overflow (after 8 KB)
Message Passing: N to N
Description of the test application
Now we want to model a real system:
A node can send messages to any other node of the
system with the same probability.
Messages are sent as a Poisson process; idle times follow an exponential distribution with rate parameter λ. From a uniform random variable U we obtain the idle time (see the sketch after this list) by:
T = −ln(U) / λ
We can choose the probability of sending AMShort,
AMMedium and AMLong.
Now buffers are fixed size.
We also model the probability for a request to have a reply.
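A minimal sketch of drawing such exponentially distributed idle times from a uniform generator (inverse-transform sampling, as in the formula above); the rate value and the use of drand48() are illustrative assumptions.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Draw one idle time T = -ln(U)/lambda from a uniform U in (0,1]. */
static double exp_idle_time(double lambda)
{
    double u;
    do {
        u = drand48();              /* uniform in [0,1) */
    } while (u == 0.0);             /* avoid log(0) */
    return -log(u) / lambda;
}

int main(void)
{
    const double lambda = 1000.0;   /* e.g. 1000 msg/s */
    srand48(42);
    double total = 0.0;
    for (int i = 0; i < 10; i++) {
        double t = exp_idle_time(lambda);
        total += t;
        printf("idle time %d: %.6f s\n", i, t);
    }
    printf("mean over 10 draws: %.6f s (expected 1/lambda = %.6f s)\n",
           total / 10.0, 1.0 / lambda);
    return 0;
}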
Message Passing: N to N
Values for the test runs
Test runs:
λ = 1 to 10^6 msg/s, in powers of 2
3 runs:
1 Majority of shorts: (ps = 0.7, pm = 0.2, pl = 0.1),
2 Majority of longs: (ps = 0.1, pm = 0.2, pl = 0.7)
3 All categories balanced (ps = 0.33, pm = 0.33, pl = 0.33)
Medium block size = 8 KB, long size = 64 KB
Reply probability = 0.33
We run every test for 5 minutes, as longer runs do not affect the numbers
Message Passing: N to N
x86_64 architecture results
x86_64-linux-pshm vs x86_64-barrelfish-pshm vs x86_64-linux-mpi
[Plots: real rate (msg/s) vs perfect rate (msg/s), x86_64 with 70% short msgs (16 cores) and x86_64 with 70% long msgs (16 cores), for lin-pshm, bf-pshm and lin-mpi]
16 cores running simultaneously
Saturation rate (64 kmsg/s short, 8 kmsg/s long)
PSHM runs are better, even on Barrelfish
Again Newlib memcpy() against GLIBC memcpy(); even so, barrelfish-pshm outperforms linux-mpi
Greater gap with MPI for longs: it works better for shorts.
Message Passing: N to N
MPI running on the SCC
scc-linux-mpi
[Plot: Intel SCC with MPIRCK (48 cores), real rate (msg/s) vs perfect rate (msg/s) for short, balanced and long runs]
(48 cores simultaneously)
scc-barrelfish-ump_ipi could not be evaluated due to severe deadlocks and race conditions.
Evaluation
Compiled with the MPIRCK CH2 driver, using the SCC MPBs
Slower convergence ratio than when running with 16 cores
Convergence area: 3:1 ratio for short vs balanced, and 2:1 for balanced vs long
Conclusions
Summary of results
When there is no buffer involved, Barrelfish performs much faster due to the asynchronous design of its message passing.
When a buffer is involved, memcpy() becomes critical
The libc shipped with Barrelfish has the worst performance
GNU GLIBC on Linux is highly optimized but non-portable
There is a compromise between the two: Newlib
The Barrelfish UMP driver is not suitable for large buffers
On the x86_64 architecture
All PSHM setups outperform MPI (even on Barrelfish)
On the Intel SCC
The SCC MPBs are not very suitable for an operating system due to the lack of hardware protection
The MPBs are very small for multitasking; messages longer than the per-core MPB size need flow control.
Conclusions
Final project conclusions
After evaluating the project, we found:
Barrelfish is not mature for everyday work; a lot of engineering work remains
Debugging race conditions is time-consuming, given the lack of a proper debugger and simulator
It has a lot of potential, especially because of its asynchronous nature, which can undoubtedly be exploited.
The Gasnet and Barrelfish MP models are different; many quirks were needed to make it work
The Intel SCC platform has been designed more as an accelerator than as a standalone system.
Contributions
Accepted contributions
Barrelfish
17 commits accepted into the official Barrelfish tree:
Port of the Newlib C library; all programs in the tree now link with it by default
IOAPIC index register access in 32-bit words
Cross-compiler C++ language support
System V shared memory extension
Thread mutex additional operations
Compiler/libc type decoupling
Hake tool extension for creating libraries from libraries
Bochs emulator
Accepted patch to the Bochs emulator to continue deterministic execution in debugging mode.
Contributions
Pending contributions and cross-compiler features
Pending contributions:
GNU cross-compiler tools for building programs on
Barrelfish
Gasnet BF conduit and internal modifications for running
it on Barrelfish
Cross-compiler features
Thanks to this project it is now possible to run standard C++ programs on Barrelfish
Standard GNU programs can be compiled with the cross-compiler with minor changes, e.g.:
./configure --host=x86_64-pc-barrelfish
Example: GNU bash
Contributions
GNU Bash screenshot
Future work
Proposals for continuing the project
Future work:
Redesign the Barrelfish bulk transfer for flexible bucket size and full-duplex operation.
Rewrite flounder to be thread-safe.
A better UMP driver with longer buffer windows.
Faster memcpy() implementations.
Running OpenMP/OmpSs programs with the nanos runtime using the current C++ cross-compiler
End
Questions?
Questions?
Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC

More Related Content

What's hot

P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadOpen-NFP
 
FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)Kirill Tsym
 
Kernel Recipes 2017 - EBPF and XDP - Eric Leblond
Kernel Recipes 2017 - EBPF and XDP - Eric LeblondKernel Recipes 2017 - EBPF and XDP - Eric Leblond
Kernel Recipes 2017 - EBPF and XDP - Eric LeblondAnne Nicolas
 
VSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinks
VSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinksVSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinks
VSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinksOPNFV
 
LF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloadsLF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloadsLF_DPDK
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machineAlexei Starovoitov
 
Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)micchie
 
Replacing iptables with eBPF in Kubernetes with Cilium
Replacing iptables with eBPF in Kubernetes with CiliumReplacing iptables with eBPF in Kubernetes with Cilium
Replacing iptables with eBPF in Kubernetes with CiliumMichal Rostecki
 
mSwitch: A Highly-Scalable, Modular Software Switch
mSwitch: A Highly-Scalable, Modular Software SwitchmSwitch: A Highly-Scalable, Modular Software Switch
mSwitch: A Highly-Scalable, Modular Software Switchmicchie
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecturehugo lu
 
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDKLF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDKLF_DPDK
 
PASTE: Network Stacks Must Integrate with NVMM Abstractions
PASTE: Network Stacks Must Integrate with NVMM AbstractionsPASTE: Network Stacks Must Integrate with NVMM Abstractions
PASTE: Network Stacks Must Integrate with NVMM Abstractionsmicchie
 
ebpf and IO Visor: The What, how, and what next!
ebpf and IO Visor: The What, how, and what next!ebpf and IO Visor: The What, how, and what next!
ebpf and IO Visor: The What, how, and what next!Affan Syed
 
Kernel Recipes 2013 - Nftables, what motivations and what solutions
Kernel Recipes 2013 - Nftables, what motivations and what solutionsKernel Recipes 2013 - Nftables, what motivations and what solutions
Kernel Recipes 2013 - Nftables, what motivations and what solutionsAnne Nicolas
 
Advanced Components on Top of L4Re
Advanced Components on Top of L4ReAdvanced Components on Top of L4Re
Advanced Components on Top of L4ReVasily Sartakov
 
Socket Programming- Data Link Access
Socket Programming- Data Link AccessSocket Programming- Data Link Access
Socket Programming- Data Link AccessLJ PROJECTS
 
introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack monad bobo
 
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...LF_DPDK
 
Cilium - Network security for microservices
Cilium - Network security for microservicesCilium - Network security for microservices
Cilium - Network security for microservicesThomas Graf
 

What's hot (20)

P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC Offload
 
FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)
 
Kernel Recipes 2017 - EBPF and XDP - Eric Leblond
Kernel Recipes 2017 - EBPF and XDP - Eric LeblondKernel Recipes 2017 - EBPF and XDP - Eric Leblond
Kernel Recipes 2017 - EBPF and XDP - Eric Leblond
 
VSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinks
VSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinksVSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinks
VSPERF BEnchmarking the Network Data Plane of NFV VDevices and VLinks
 
LF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloadsLF_DPDK17_DPDK support for new hardware offloads
LF_DPDK17_DPDK support for new hardware offloads
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machine
 
Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)
 
Replacing iptables with eBPF in Kubernetes with Cilium
Replacing iptables with eBPF in Kubernetes with CiliumReplacing iptables with eBPF in Kubernetes with Cilium
Replacing iptables with eBPF in Kubernetes with Cilium
 
mSwitch: A Highly-Scalable, Modular Software Switch
mSwitch: A Highly-Scalable, Modular Software SwitchmSwitch: A Highly-Scalable, Modular Software Switch
mSwitch: A Highly-Scalable, Modular Software Switch
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecture
 
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDKLF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
 
PASTE: Network Stacks Must Integrate with NVMM Abstractions
PASTE: Network Stacks Must Integrate with NVMM AbstractionsPASTE: Network Stacks Must Integrate with NVMM Abstractions
PASTE: Network Stacks Must Integrate with NVMM Abstractions
 
ebpf and IO Visor: The What, how, and what next!
ebpf and IO Visor: The What, how, and what next!ebpf and IO Visor: The What, how, and what next!
ebpf and IO Visor: The What, how, and what next!
 
Kernel Recipes 2013 - Nftables, what motivations and what solutions
Kernel Recipes 2013 - Nftables, what motivations and what solutionsKernel Recipes 2013 - Nftables, what motivations and what solutions
Kernel Recipes 2013 - Nftables, what motivations and what solutions
 
Memory, IPC and L4Re
Memory, IPC and L4ReMemory, IPC and L4Re
Memory, IPC and L4Re
 
Advanced Components on Top of L4Re
Advanced Components on Top of L4ReAdvanced Components on Top of L4Re
Advanced Components on Top of L4Re
 
Socket Programming- Data Link Access
Socket Programming- Data Link AccessSocket Programming- Data Link Access
Socket Programming- Data Link Access
 
introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack
 
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
 
Cilium - Network security for microservices
Cilium - Network security for microservicesCilium - Network security for microservices
Cilium - Network security for microservices
 

Viewers also liked

Denuncia penal 25 enero 2016
Denuncia penal 25 enero 2016Denuncia penal 25 enero 2016
Denuncia penal 25 enero 2016SINAUT SUNAT
 
PAD 520 WEEK 7 ASSIGNMENT 3
PAD 520 WEEK 7 ASSIGNMENT 3PAD 520 WEEK 7 ASSIGNMENT 3
PAD 520 WEEK 7 ASSIGNMENT 3hwacer123
 
Newsletter dated 14th February 2017
Newsletter dated 14th February 2017Newsletter dated 14th February 2017
Newsletter dated 14th February 2017Rajiv Bajaj
 
Presentación1electiva4
Presentación1electiva4Presentación1electiva4
Presentación1electiva4Carlos mota
 
Patricia Pearson-Kotz Recorded PowerPoint Resume
Patricia Pearson-Kotz Recorded PowerPoint ResumePatricia Pearson-Kotz Recorded PowerPoint Resume
Patricia Pearson-Kotz Recorded PowerPoint ResumePatti Pearson-Kotz
 
ParanaVision Presentation (English)
ParanaVision Presentation (English)ParanaVision Presentation (English)
ParanaVision Presentation (English)Alphan Manas
 
Akıllı Durak Projesi Tanıtımı (1999)
Akıllı Durak Projesi Tanıtımı (1999)Akıllı Durak Projesi Tanıtımı (1999)
Akıllı Durak Projesi Tanıtımı (1999)Alphan Manas
 

Viewers also liked (10)

Denuncia penal 25 enero 2016
Denuncia penal 25 enero 2016Denuncia penal 25 enero 2016
Denuncia penal 25 enero 2016
 
PAD 520 WEEK 7 ASSIGNMENT 3
PAD 520 WEEK 7 ASSIGNMENT 3PAD 520 WEEK 7 ASSIGNMENT 3
PAD 520 WEEK 7 ASSIGNMENT 3
 
Newsletter dated 14th February 2017
Newsletter dated 14th February 2017Newsletter dated 14th February 2017
Newsletter dated 14th February 2017
 
Resuelto power
Resuelto powerResuelto power
Resuelto power
 
Presentación1electiva4
Presentación1electiva4Presentación1electiva4
Presentación1electiva4
 
Patricia Pearson-Kotz Recorded PowerPoint Resume
Patricia Pearson-Kotz Recorded PowerPoint ResumePatricia Pearson-Kotz Recorded PowerPoint Resume
Patricia Pearson-Kotz Recorded PowerPoint Resume
 
Izquierda y derecha
Izquierda y derechaIzquierda y derecha
Izquierda y derecha
 
ParanaVision Presentation (English)
ParanaVision Presentation (English)ParanaVision Presentation (English)
ParanaVision Presentation (English)
 
Akıllı Durak Projesi Tanıtımı (1999)
Akıllı Durak Projesi Tanıtımı (1999)Akıllı Durak Projesi Tanıtımı (1999)
Akıllı Durak Projesi Tanıtımı (1999)
 
Ibiza
IbizaIbiza
Ibiza
 

Similar to bfgasnet_pr-v2

Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDKKernel TLV
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesDr. Fabio Baruffa
 
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)Talal Khaliq
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPAnil Bohare
 
Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...PT Datacomm Diangraha
 
bfarm-v2
bfarm-v2bfarm-v2
bfarm-v2Zeus G
 
Summit 16: How to Compose a New OPNFV Solution Stack?
Summit 16: How to Compose a New OPNFV Solution Stack?Summit 16: How to Compose a New OPNFV Solution Stack?
Summit 16: How to Compose a New OPNFV Solution Stack?OPNFV
 
Feedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureFeedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureAntoine Poliakov
 
Application Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance CenterApplication Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance Centerinside-BigData.com
 
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuAlan Sill
 
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxScyllaDB
 
Pacemaker+DRBD
Pacemaker+DRBDPacemaker+DRBD
Pacemaker+DRBDDan Frincu
 
ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...
ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...
ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...Neil Armstrong
 
directCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressdirectCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressHeiko Joerg Schick
 
Architecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceArchitecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceSpeck&Tech
 
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...Eric Van Hensbergen
 
04536342
0453634204536342
04536342fidan78
 
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running LinuxLinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linuxbrouer
 
9/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'169/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'16Kangaroot
 

Similar to bfgasnet_pr-v2 (20)

Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
 
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
 
Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...
 
bfarm-v2
bfarm-v2bfarm-v2
bfarm-v2
 
Summit 16: How to Compose a New OPNFV Solution Stack?
Summit 16: How to Compose a New OPNFV Solution Stack?Summit 16: How to Compose a New OPNFV Solution Stack?
Summit 16: How to Compose a New OPNFV Solution Stack?
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
 
Feedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureFeedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows Azure
 
Application Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance CenterApplication Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance Center
 
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttu
 
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
 
Pacemaker+DRBD
Pacemaker+DRBDPacemaker+DRBD
Pacemaker+DRBD
 
ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...
ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...
ELC-E 2016 Neil Armstrong - No, it's never too late to upstream your legacy l...
 
directCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressdirectCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI Express
 
Architecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceArchitecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for science
 
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
 
04536342
0453634204536342
04536342
 
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running LinuxLinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
 
9/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'169/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'16
 

bfgasnet_pr-v2

  • 1. Introduction Tests Conclusions Contributions Future work Gasnet library evaluation on Barrelfish and Intel SCC June 30, 2012 Zeus G´omez Marmolejo Barcelona Supercomputing Center Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 2. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Contents 1 Introduction Motivation Project goals Software architecture 2 Tests Hardware Configurations MP: 1 to 1 MP: N to N 3 Conclusions 4 Contributions 5 Future work Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 3. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Introduction Motivation Future trends: Multi-core CPUs and multi-core GPUs in a single chip. Shared memory and cache coherence complexity. This May not scale in the future. Problems with shared memory OS like Linux or Windows and many core systems. Message passing OS like Barrelfish. Experiments on non-coherent multi-core shared architectures: Intel SCC and its MPBs. Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 4. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Linux approach Multi-core operating systems using shared memory core 0 struct page { ... spinlock_t ptl; }; core 1 core 2 core N... Data sharing: Access locks False sharing Memory Contention Hardware cache coherence Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 5. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Barrelfish approach No sharing, but message passing System Knowledge Base: No driver software! Message passing: No sharing at all System processes Asynchronous calls Interconnect drivers Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 6. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Project goals Looking for the appropriate library that meets the desired features Port a well-known message passing library to Barrelfish... Desired features: Portable across different architectures, systems and OSs. Highly efficient. Used in many applications and parallel languages. Be able to run standard OpenMP programs via the nanos runtime. Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 7. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Project goals Looking for the appropriate library that meets the desired features Port a well-known message passing library to Barrelfish... Desired features: Portable across different architectures, systems and OSs. Highly efficient. Used in many applications and parallel languages. Be able to run standard OpenMP programs via the nanos runtime. The Gasnet library from the University of Berkeley fulfills these expectations. Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 8. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Gasnet library Low level communication library Network hardware UDP conduit SMP conduit MPI conduit BF conduit Gasnet core API Low level communication library: implements UPC, Titanium, OmpSs. Different categories: AMShort, AMMedium, AMLong Message types: requests, replies Private Shared Memory (PSHM) mode for a conduit Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 9. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Barrelfish Message Passing Generated stubs for efficient message passing msg() user.c ... ... ump_hdlr() { user_flounder _bindings.c ... cache write } core 0 process event_dispatch() { waitset.c ... ump_rx() { ... msg() } core 1 process closure.handler() ... ... user_flounder _bindings.c msg() { ... } user.c } ... Non-blocking asynchronous calls. Continuation closure, called also asynchronously. Messages sent as RPC. Generated C code by flounder tool, in Haskell, depending on the interconnect driver. Fast event handling code on receiving side (polling). When all arguments are assembled, call is made to user program. Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 10. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Differences with Barrelfish MP model Similarities, differences and solutions proposed Similarities: Gasnet Nodes → Barrelfish Cores Messages as RPC Necessity to send large buffers Be able to send back replies Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 11. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Differences with Barrelfish MP model Similarities, differences and solutions proposed Similarities: Gasnet Nodes → Barrelfish Cores Messages as RPC Necessity to send large buffers Be able to send back replies Differences: Gasnet calls must block Non-blocking message handlers No thread-safe Solutions: 2 threads: application & Gasnet Leader-followers thread serving model 4 independent channels per peer Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 12. Introduction Tests Conclusions Contributions Future work Motivation Project goals Software architecture Gasnet BF conduit implementation Details of the BF implementation using previous solutions Details: Uses the BF flounder generated stub to pass messages. To simulate the synchronous behavior of Gasnet, we use 2 threads. One of them is coming from the pool. However, the binding cannot be handled by two threads concurrently without proper locking. thread 1 GASNET BARRELFISH core 1 BARRELFISH core 2 gasnet_AMShort() ack ack ack GASNET handler call ambf_ump_send_handler() endr() thread 2 thread 2 thread 1 Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 13. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Contents 1 Introduction Motivation Project goals Software architecture 2 Tests Hardware Configurations MP: 1 to 1 MP: N to N 3 Conclusions 4 Contributions 5 Future work Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 14. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Test systems Intel x86 64 SMP system Sun Fire X2270 M2, with 2 Intel Xeon CPU E5620@ 2.40GHz: chip 0 chip 1QPI link (8 CPUs) (8 CPUs) DDR3 0 DDR3 1 DDR3 2 DDR3 0 DDR3 1 DDR3 2 SMP system Features: Intel x86 64 architecture 2 chips x 4 SMP x 2 SMT = 16 CPUs 32 GB RAM NUMA QPI link: 25 GB/s Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 15. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Test systems Intel Single-chip Cloud Computer system Features: 48 CPUs Intel 32-bit P54C in a single chip Non-coherent caches Routers and MPBs for message passing 4 DDR3 memory controllers Shipped as: A cluster of 48 linux systems accessed by SSH No OS prev to Barrelfish sees it as a single system Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 16. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Test configurations Possible combinations of architecture & OS & Gasnet conduit/interconnect driver x86 64-linux-pshm. SMP with Linux and PSHM in Gasnet. x86 64-linux-mpi. SMP with Linux with MPI conduit for Gasnet. x86 64-barrelfish-pshm. SMP running Barrelfish and PSHM in Gasnet. x86 64-barrelfish-ump. SMP running Barrelfish and the User-Level Message Passing. scc-linux-mpi. Intel SCC running Linux on all cores with Gasnet MPI conduit, compiled with the MPIRCK CH2 driver. scc-barrelfish-ump ipi. Intel SCC running Barrelfish with Gasnet BF conduit with UMP & Inter-Process Interrupts backend. Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 17. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Test configurations Possible combinations of architecture & OS & Gasnet conduit/interconnect driver x86 64-linux-pshm. SMP with Linux and PSHM in Gasnet. x86 64-linux-mpi. SMP with Linux with MPI conduit for Gasnet. x86 64-barrelfish-pshm. SMP running Barrelfish and PSHM in Gasnet. x86 64-barrelfish-ump. SMP running Barrelfish and the User-Level Message Passing. scc-linux-mpi. Intel SCC running Linux on all cores with Gasnet MPI conduit, compiled with the MPIRCK CH2 driver. scc-barrelfish-ump ipi. Intel SCC running Barrelfish with Gasnet BF conduit with UMP & Inter-Process Interrupts backend. Not tested: 32-bit SMP system Bulk transfer Barrelfish transfer mode Intel SCC MPBs flounder backend: Fast, but very short and unprotected. SCC is seen as an accelerator Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 18. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Message Passing: 1 to 1 Only two nodes are sending messages Testam benchmark from Gasnet. 1000 mesages of: Ping-pong roundtrip Request - Reply (prqp) Ping-pong roundtrip Request - Request (prqq) Flood one-way Request (foq) Flood roundtrip Request - Reply (frqp) Flood roundtrip Request - Request (frqq) Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 19. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Message Passing: 1 to 1 Only two nodes are sending messages Testam benchmark from Gasnet. 1000 mesages of: Ping-pong roundtrip Request - Reply (prqp) Ping-pong roundtrip Request - Request (prqq) Flood one-way Request (foq) Flood roundtrip Request - Reply (frqp) Flood roundtrip Request - Request (frqq) 0.5 1 1.5 2 2.5 3 3.5 4 4.5 prqp prqq foq frqp delay(us) test type AMShort testam x86_64-linux-pshm linux-pshm 0 2000 4000 6000 8000 10000 12000 14000 1 16 256 4k 64k 1M throughput(Mbytes/s) buffer size (bytes) AMLong testam x86_64-linux-pshm prqp prqq foq frqp frqq Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 20. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Message Passing: 1 to 1 Simplest comparison x86 64-linux-pshm vs x86 64-barrelfish-pshm Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 21. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Message Passing: 1 to 1 Simplest comparison x86 64-linux-pshm vs x86 64-barrelfish-pshm 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 prqp prqq foq frqp delay(us) test type AMShort testam x86_64-*-pshm linux barrelfish 0 2000 4000 6000 8000 10000 12000 1 16 256 4k 64k 1M throughput(Mbytes/s) buffer size (bytes) AMLong testam x86_64-*-pshm linux barrelfish Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 22. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Message Passing: 1 to 1 Simplest comparison x86 64-linux-pshm vs x86 64-barrelfish-pshm 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 prqp prqq foq frqp delay(us) test type AMShort testam x86_64-*-pshm linux barrelfish 0 2000 4000 6000 8000 10000 12000 1 16 256 4k 64k 1M throughput(Mbytes/s) buffer size (bytes) AMLong testam x86_64-*-pshm linux barrelfish On AMShort category Barrelfish is much faster, as async MP handlers are very efficient On AMMedium and AMLong categories, when the buffer >= 2048 Kb Barrelfish performs worse Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 23. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Message Passing: 1 to 1 Performance analysis breakdown For long messages, most of the time is spent in the memcpy() libc function, copying bytes from one region to another We tried different implementations to see the result Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 24. Introduction Tests Conclusions Contributions Future work Hardware Configurations MP: 1 to 1 MP: N to N Message Passing: 1 to 1 Performance analysis breakdown For long messages, most of the time is spent in the memcpy() libc function, copying bytes from one region to another We tried different implementations to see the result 0 2000 4000 6000 8000 10000 12000 1 16 256 4k 64k 1M throughput(Mbytes/s) buffer size (bytes) Different memcpy() implementations linux-glibc bf-glibc bf-newlib bf-oldc Test runs: 1 Linux with GNU GLIBC using supl. SSE3 and REP prefix 2 Barrelfish GNU GLIBC memcpy() 3 Barrelfish with Red Hat Newlib using REP 4 Barrelfish with old libc (C language) Zeus G´omez Marmolejo Gasnet library evaluation on Barrelfish and Intel SCC
  • 25-27. Message Passing: 1 to 1 — MPI vs UMP on the SMP: x86_64-linux-mpi vs x86_64-barrelfish-ump.
  [Plot: AMMedium testam throughput (MB/s) vs. buffer size, linux vs barrelfish; on linux-mpi the AMMedium maximum size equals the AMLong one]
  Barrelfish always performs worse here: the UMP interconnect driver is not designed for sending large buffers, and again the Newlib memcpy() is competing against the glibc memcpy().
  UMP decomposes buffers into fragments and needs an ACK for each fragment. Piggybacking is implemented but not used, to avoid handler deadlocks.
  The bulk-transfer mechanism is the one designed for this purpose: large buffers.
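  To see why a per-fragment ACK hurts, a small back-of-the-envelope model (not Barrelfish or UMP code) helps: with stop-and-wait, throughput is capped at roughly one fragment per round trip, while keeping W fragments in flight raises the cap by about a factor of W. The fragment payload size and round-trip time below are purely illustrative assumptions.

  #include <stdio.h>

  /* Throughput cap (MB/s) for a channel that splits a buffer into fragments
   * and keeps `in_flight` fragments outstanding per round trip. */
  static double cap_mbps(double frag_bytes, double rtt_us, int in_flight) {
    double bytes_per_sec = in_flight * frag_bytes / (rtt_us * 1e-6);
    return bytes_per_sec / (1024.0 * 1024.0);
  }

  int main(void) {
    double frag = 56.0;  /* payload bytes per fragment (assumed) */
    double rtt  = 1.0;   /* fragment + ACK round trip in microseconds (assumed) */
    printf("stop-and-wait (1 in flight): %7.1f MB/s\n", cap_mbps(frag, rtt, 1));
    printf("window of 16 in flight     : %7.1f MB/s\n", cap_mbps(frag, rtt, 16));
    return 0;
  }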
  • 28-30. Message Passing: 1 to 1 — MPI vs UMP on the SCC: scc-linux-mpi vs scc-barrelfish-ump_ipi.
  [Plots: AMMedium and AMLong testam throughput (MB/s) vs. buffer size on the SCC, linux vs barrelfish]
  Again the Newlib memcpy() competes against the glibc memcpy(), with the same problems as before.
  On Linux the SCC MPBs are fully used, since a single application owns them.
  On Linux there is a strong performance degradation once the SCC MPBs overflow (beyond 8 KB).
  • 31. Message Passing: N to N — Description of the test application.
  Now we want to model a real system:
  A node can send messages to any other node of the system with equal probability.
  Messages are sent as a Poisson process; idle times follow an exponential distribution with rate parameter λ.
  From a uniform random variable U we obtain the idle time as T = −ln(U) / λ.
  We can choose the probability of sending AMShort, AMMedium and AMLong messages; buffers now have a fixed size.
  We also model the probability that a request gets a reply.
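  As a sketch of how such idle times can be drawn, the following applies inverse-transform sampling to the formula above; the rand()-based uniform generator and the λ value are illustrative assumptions, not the project's exact generator.

  #include <math.h>     /* link with -lm */
  #include <stdio.h>
  #include <stdlib.h>

  /* Draw an exponentially distributed idle time (seconds) with rate lambda,
   * using inverse-transform sampling: T = -ln(U) / lambda, with U in (0,1]. */
  static double exp_idle_time(double lambda) {
    double u = (rand() + 1.0) / ((double)RAND_MAX + 1.0);  /* avoid U == 0 */
    return -log(u) / lambda;
  }

  int main(void) {
    double lambda = 1024.0;       /* illustrative rate: 1024 msg/s */
    double total = 0.0;
    srand(42);
    for (int i = 0; i < 10; i++) {
      double t = exp_idle_time(lambda);
      total += t;
      printf("idle time %d: %.6f s\n", i, t);
    }
    printf("mean of 10 draws: %.6f s (expected 1/lambda = %.6f s)\n",
           total / 10.0, 1.0 / lambda);
    return 0;
  }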
  • 32. Message Passing: N to N — Values for the test runs.
  λ = 1 to 10^6 msg/s, in powers of 2.
  3 runs:
  1 Majority of shorts (ps = 0.7, pm = 0.2, pl = 0.1)
  2 Majority of longs (ps = 0.1, pm = 0.2, pl = 0.7)
  3 All categories balanced (ps = 0.33, pm = 0.33, pl = 0.33)
  Medium block size = 8 KB, long block size = 64 KB.
  Reply probability = 0.33.
  Every test runs for 5 minutes; longer runs do not change the numbers.
  • 33-35. Message Passing: N to N — x86_64 architecture results: x86_64-linux-pshm vs x86_64-barrelfish-pshm vs x86_64-linux-mpi.
  [Plots: real rate vs. perfect rate (msg/s) on x86_64 with 16 cores, for 70% short and 70% long messages; lin-pshm, bf-pshm, lin-mpi]
  16 cores run simultaneously.
  Saturation rate: about 64 kmsg/s for shorts, 8 kmsg/s for longs.
  The PSHM runs are better, even on Barrelfish.
  Again the Newlib memcpy() competes against the glibc memcpy(), yet barrelfish-pshm still outperforms linux-mpi.
  The gap with MPI is larger for long messages; MPI does comparatively better with shorts.
  • 36-37. Message Passing: N to N — MPI running on the SCC: scc-linux-mpi.
  [Plot: real rate vs. perfect rate (msg/s) on the Intel SCC with RCKMPI, 48 cores, for short / balanced / long message mixes]
  48 cores run simultaneously. scc-barrelfish-ump_ipi could not be evaluated due to severe deadlocks and race conditions.
  Evaluation:
  Compiled against RCKMPI (MPICH2-based), which uses the SCC MPBs.
  Convergence is slower than when running with 16 cores.
  In the convergence area the ratio is 3:1 between short and balanced, and 2:1 between balanced and long.
  • 38. Contents — next section: Conclusions.
  • 39. Conclusions — Summary of results.
  When no buffer is involved, Barrelfish performs much faster thanks to the asynchronous design of its message passing.
  When a buffer is involved, memcpy() becomes critical:
  The libc shipped with Barrelfish has the worst performance.
  GNU glibc on Linux is highly optimized but not portable.
  Newlib is a compromise between the two.
  The UMP Barrelfish driver is not suitable for large buffers.
  On the x86_64 architecture, all PSHM setups outperform MPI (even the Barrelfish one).
  On the Intel SCC, the MPBs are not well suited to an operating system because they lack hardware protection.
  The MPB space available per core for multitasking is very small; messages longer than the per-core MPB size need flow control.
  • 40. Conclusions — Final project conclusions.
  After evaluating the project, we found:
  Barrelfish is not yet mature for everyday work; a lot of engineering work remains.
  Debugging race conditions is time-consuming, given the lack of a proper debugger and simulator.
  It has a lot of potential, especially because of its asynchronous nature, which can undoubtedly be exploited.
  The Gasnet and Barrelfish message-passing models are different; many workarounds were needed to make them work together.
  The Intel SCC platform was designed more as an accelerator than as a standalone system.
  • 41. Contents — next section: Contributions.
  • 42. Contributions — Accepted contributions.
  Barrelfish: 17 commits accepted into Barrelfish's official tree:
  Port of the Newlib C library; all programs in the tree now link against it by default.
  IOAPIC index register access in 32-bit words.
  Cross-compiler C++ language support.
  System V shared memory extension.
  Additional thread mutex operations.
  Compiler/libc type decoupling.
  Hake tool extension for creating libraries from libraries.
  Bochs emulator: accepted patch to continue deterministic execution in debugging mode.
  • 43. Contributions — Pending contributions and cross-compiler features.
  Pending contributions:
  GNU cross-compiler tools for building programs for Barrelfish.
  Gasnet BF conduit and the internal modifications needed to run it on Barrelfish.
  Cross-compiler features:
  Thanks to this project it is now possible to run standard C++ programs on Barrelfish.
  Standard GNU programs can be compiled with the cross-compiler with only minor changes, such as: ./configure --host=x86_64-pc-barrelfish
  Example: GNU bash.
  • 44. Contributions — GNU Bash running on Barrelfish (screenshot).
  • 45. Contents — next section: Future work.
  • 46. Future work — Proposals for continuing the project.
  Redesign the Barrelfish bulk-transfer mechanism for flexible bucket sizes and full-duplex operation.
  Rewrite Flounder to be thread-safe.
  Build a better UMP driver with longer buffer windows.
  Provide faster memcpy() implementations.
  Run OpenMP/OmpSs programs with the Nanos runtime using the current C++ cross-compiler.
  • 47. End — Questions?