Inter-process communication on steroids

  1. INTER-PROCESS COMMUNICATION ON STEROIDS
     Roberto A. Vitillo (LBNL)
     ATLAS SOFTWARE & COMPUTING WORKSHOP, OCTOBER 2012
  2. CONTEXT
     • AthenaMP communications
       ‣ reader process sends events to workers
     • Coprocessor communications
       ‣ Athena[MP] jobs interacting with a GPU server process
     • Available IPC mechanisms
       ‣ shared memory with explicit synchronization
       ‣ message passing with implicit synchronization
  3. MESSAGE PASSING MODEL
     • One of the most successful models for providing concurrency
       ‣ data and synchronization travel in a single unit
     • Actor model
       ‣ processes have an identity
       ‣ communicate by sending messages to mailbox addresses
       ‣ e.g. Erlang, Scala
     • Process calculi
       ‣ processes are anonymous
       ‣ communicate by sending messages through named channels
       ‣ e.g. the Go programming language
  4. PATTERNS
     • Producer & consumer
       ‣ producer pushes messages
       ‣ consumer pulls messages
     • Client & server
       ‣ client makes a request
       ‣ server replies to the request
  5. CHANNELS
     • Properties of a channel:
       ‣ name
       ‣ context (thread, local-process, distributed-process)
       ‣ asynchronous(k)
       ‣ topology: one-to-one, one-to-many, many-to-many
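     A minimal sketch of how differently configured channels might be declared with the Channel class shown on the next slides; ONE_TO_ONE appears in the slides, while ONE_TO_MANY, MANY_TO_MANY and the channel names used here are assumptions, not confirmed API:

         // Hypothetical channel declarations; only ONE_TO_ONE is confirmed
         // by the slides, the other topology constants are assumed names.
         Channel control("control", ONE_TO_ONE);        // point-to-point
         Channel events("events", ONE_TO_MANY);         // one producer, many consumers
         Channel monitor("monitor", MANY_TO_MANY);      // brokered, see slide 10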
  6. SOCKETS
     • Each end of a channel is attached to a Socket
     • Different patterns use different Socket types
       ‣ e.g. ProducerSocket, ConsumerSocket
     • A Socket can:
       ‣ send() buffers of data to its peers (buffer-blocking)
       ‣ receive() buffers of data from its peers (blocking)
  7. SOCKETS
     client:
         Channel channel("service", ONE_TO_ONE);
         ISocket *socket = factory->createClientSocket(channel);
         socket->send("ping", 5);            // "ping" plus the terminating '\0'
         socket->receive(&buffer);
     server:
         Channel channel("service", ONE_TO_ONE);
         ISocket *socket = factory->createServerSocket(channel);
         while (true) {
             socket->receive(&buffer);
             socket->send("pong", 5);
         }
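     To complement the client/server example above, a minimal sketch of the producer & consumer pattern from slide 4; createProducerSocket()/createConsumerSocket() mirror the ProducerSocket/ConsumerSocket types named on slide 6 but are assumed factory methods, and the buffer handling is simplified:

         // Producer side: pushes events, never waits for a reply.
         Channel channel("events", ONE_TO_MANY);                       // assumed constant
         ISocket *producer = factory->createProducerSocket(channel);   // assumed name
         producer->send(event, event_size);

         // Consumer side (possibly one of many): pulls events as they arrive.
         ISocket *consumer = factory->createConsumerSocket(channel);   // assumed name
         while (true) {
             consumer->receive(&buffer);
             // ... process the event held in buffer ...
         }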
  8. DATA TRANSFER TECHNIQUES
     • Zero-copy
       ‣ page table entries are transferred from the source to the destination process
       ‣ requires the buffers to have the same alignment
     • Double copy in shared memory
       ‣ double buffering allows the sender and receiver to transfer data in parallel
     • Delegate the copy to a NIC
       ‣ avoids cache pollution
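     A minimal sketch of the double-copy-in-shared-memory idea using POSIX shared memory; the segment name and size are illustrative, both copies are shown in one process for brevity, and the double buffering and synchronization of the real scheme are elided:

         // Sketch: double copy through a POSIX shared memory segment.
         // The sender memcpy()s into the segment (copy 1), the receiver
         // memcpy()s out of it (copy 2).
         #include <fcntl.h>
         #include <sys/mman.h>
         #include <unistd.h>
         #include <cstring>
         #include <vector>

         int main() {
             const size_t kSize = 1 << 20;                    // 1 MB payload, illustrative
             std::vector<char> src(kSize, 'x'), dst(kSize);

             int fd = shm_open("/ipc_demo", O_CREAT | O_RDWR, 0600);
             ftruncate(fd, kSize);
             char *shm = static_cast<char*>(
                 mmap(nullptr, kSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

             // In the real pattern these copies run in different processes,
             // overlapped via double buffering.
             std::memcpy(shm, src.data(), kSize);             // copy 1: sender user -> shared
             std::memcpy(dst.data(), shm, kSize);             // copy 2: shared -> receiver user

             munmap(shm, kSize);
             shm_unlink("/ipc_demo");
             return 0;
         }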
  9. DATA TRANSFER TECHNIQUES
     • Single copy with ptrace()
       ‣ debugging facility; the sender process is stopped
       ‣ the receiver process accesses the sender's address space and performs a copy
     • Single copy through a kernel facility
       ‣ vmsplice (kernel 2.6.17)
       ‣ KNEM (kernel 2.6.15, requires a module)
       ‣ Cross Memory Attach (kernel 3.2)
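     A minimal sketch of the Cross Memory Attach route via process_vm_readv(); the remote pid and address are placeholders that would normally be exchanged over a control channel beforehand:

         // Sketch: single-copy read from another process's address space
         // using Cross Memory Attach (process_vm_readv, Linux >= 3.2).
         #include <sys/uio.h>
         #include <sys/types.h>
         #include <vector>

         std::vector<char> read_remote(pid_t remote_pid, void *remote_addr, size_t len) {
             std::vector<char> local(len);
             struct iovec local_iov  = { local.data(), len };
             struct iovec remote_iov = { remote_addr,  len };
             // One copy: the kernel moves the data directly from the remote
             // process's pages into our buffer, with no intermediate buffer.
             ssize_t n = process_vm_readv(remote_pid, &local_iov, 1, &remote_iov, 1, 0);
             local.resize(n > 0 ? static_cast<size_t>(n) : 0);
             return local;
         }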
  10. IMPLEMENTATION
     • The API is currently implemented on top of ZeroMQ
       ‣ provides the default fallback implementation
       ‣ lock-free queues for threads
       ‣ AF_UNIX sockets for local processes
       ‣ TCP sockets for distributed processes
     • The implementation switches according to the channel configuration
       ‣ a many-to-many channel triggers the creation of a broker process that handles all communications
       ‣ a one-to-one producer-consumer channel uses a UNIX pipe with vmsplice()
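     As an illustration of what the ZeroMQ backing could look like for a one-to-one local-process client/server channel, here is a minimal sketch with the libzmq C API over an ipc:// (AF_UNIX) endpoint; the endpoint name is illustrative, both ends run in one process for brevity, and the slides' benchmarks used ZMQ 2.2, whose message API differs slightly from this newer buffer-based one:

         // Sketch: a ZeroMQ REQ/REP pair over an AF_UNIX endpoint.
         #include <zmq.h>
         #include <cstdio>

         int main() {
             void *ctx = zmq_ctx_new();

             void *rep = zmq_socket(ctx, ZMQ_REP);       // "server" end
             zmq_bind(rep, "ipc:///tmp/service");        // AF_UNIX socket under the hood

             void *req = zmq_socket(ctx, ZMQ_REQ);       // "client" end
             zmq_connect(req, "ipc:///tmp/service");

             char buf[16];
             zmq_send(req, "ping", 5, 0);                // client request
             zmq_recv(rep, buf, sizeof(buf), 0);         // server receives "ping"
             zmq_send(rep, "pong", 5, 0);                // server reply
             zmq_recv(req, buf, sizeof(buf), 0);         // client receives "pong"
             std::printf("%s\n", buf);

             zmq_close(req);
             zmq_close(rep);
             zmq_ctx_destroy(ctx);
             return 0;
         }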
  11. BENCHMARK: LATENCY
     Ivy Bridge, ZMQ 2.2, OpenMPI 1.6.2 without KNEM
     [chart: latency on SLC6 for pipe, vmsplice, OpenMPI and ZMQ; scale 0-100 μs]
     • Test: 1-byte ping pong
     • MPI uses shared memory
       ‣ double copy
       ‣ avoids the expensive system call, which dominates for small messages
     • The pipe implementations are dominated by the system call overhead
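     For reference, a minimal sketch of how a 1-byte ping-pong latency measurement over a plain pipe pair can be done (this is not the benchmark code behind the plot): the parent times N round trips while a forked child echoes each byte back.

         // Sketch: 1-byte ping-pong latency over a pair of pipes.
         #include <unistd.h>
         #include <chrono>
         #include <cstdio>

         int main() {
             int ping[2], pong[2];
             pipe(ping);                       // parent -> child
             pipe(pong);                       // child  -> parent
             char byte = 'x';
             const int N = 100000;

             if (fork() == 0) {                // child: echo loop
                 for (int i = 0; i < N; ++i) {
                     read(ping[0], &byte, 1);
                     write(pong[1], &byte, 1);
                 }
                 _exit(0);
             }

             auto start = std::chrono::steady_clock::now();
             for (int i = 0; i < N; ++i) {     // parent: ping, wait for pong
                 write(ping[1], &byte, 1);
                 read(pong[0], &byte, 1);
             }
             auto elapsed = std::chrono::duration<double, std::micro>(
                 std::chrono::steady_clock::now() - start).count();
             std::printf("round-trip latency: %.2f us\n", elapsed / N);
             return 0;
         }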
  12. BENCHMARK: BANDWIDTH
     [chart: bandwidth on SLC6 for pipe, vmsplice, ZMQ and OpenMPI; scale 0-15 GB/s]
     • Benchmark: 1-megabyte ping pong
     • Pipe performs 2 copies (U-K-U, user → kernel → user)
     • Spliced pipe performs a single copy (K-U, kernel → user)
     • OpenMPI uses shared memory as a buffer and performs 2 copies
       ‣ copies are performed in parallel on both ends
       ‣ consumes more CPU
       ‣ pollutes the caches
     • ZMQ uses AF_UNIX sockets, i.e. two copies are performed (U-K-U)
       ‣ all sorts of buffers in between
       ‣ optimized for distributed environments
  13. BENCHMARK: BANDWIDTH
     [chart: bandwidth on Linux 3.6 for pipe, vmsplice, ZMQ and OpenMPI; scale 0-15 GB/s]
     • Benchmark: 1-megabyte ping pong, repeated on Linux 3.6
     • Implementations and copy counts as on the previous slide (SLC6)
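     A minimal sketch of the spliced-pipe idea measured above: vmsplice() maps the sender's user pages into the pipe instead of copying them, so only the kernel → user copy at the receiver remains. Both ends run in one process here for brevity, the payload is kept below the default pipe capacity, and the pages must not be modified until the reader has consumed them.

         // Sketch: vmsplice()-based single-copy transfer through a pipe.
         #include <fcntl.h>      // vmsplice (needs _GNU_SOURCE; g++ defines it by default)
         #include <sys/uio.h>
         #include <unistd.h>
         #include <vector>
         #include <cstdio>

         int main() {
             int fds[2];
             pipe(fds);

             std::vector<char> payload(1 << 15, 'x');        // 32 KB, fits the default pipe buffer
             struct iovec iov = { payload.data(), payload.size() };

             // Sender: reference the user pages into the pipe, no user -> kernel copy.
             ssize_t moved = vmsplice(fds[1], &iov, 1, 0);

             // Receiver: one kernel -> user copy.
             std::vector<char> out(1 << 15);
             ssize_t got = read(fds[0], out.data(), out.size());
             std::printf("spliced %zd bytes, read back %zd\n", moved, got);
             return 0;
         }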
  14. IN ACTION: ATHENAMP
     reader:
         while (event) {
             ...
             socket->receive(&ready_notification);    // wait for an idle worker
             socket->send(event, event_size);          // hand it the next event
             ...
         }
     worker (one such loop per worker process):
         while (true) {
             ...
             socket->send(ready_notification, size);   // announce readiness
             socket->receive(&event);                  // get an event to process
             ...
         }
     • See Vakho's talk
     • Temporary hybrid approach
       ‣ metadata in shared memory
     • Benefits:
       ‣ load balancing
       ‣ synchronization
  15. CONCLUSION
     • The library provides a uniform message-passing abstraction for inter-process communication
     • Data and synchronization travel in a single unit
     • Communication patterns and topologies make it possible to
       ‣ reduce latency
       ‣ increase bandwidth
       ‣ express parallelism
     • More implementations will be considered
       ‣ shared memory segment for small messages (<10K, low latency)
       ‣ MPI? Doesn't behave well with forking processes...
  16. QUESTIONS?
