The document discusses achieving high performance TCP transmission over 40GbE networks on NUMA architectures for data acquisition systems like CMS. It introduces the CMS data acquisition requirements, software framework (XDAQ), and a new TCP Layered Architecture (TCPLA) developed to optimize performance on multi-core NUMA systems. Preliminary results show TCPLA with affinity settings can saturate 40GbE links, handling data streams from multiple sources at rates over 4.5GB/s. The TCPLA provides an asynchronous event-driven model for high-performance socket handling on NUMA hardware.
1. Achieving High Performance with TCP over 40GbE
on NUMA Architectures for CMS Data Acquisition
Andrea Petrucci – CERN (PH/CMD)
The 19th Real Time Conference
27 May 2014, Nara, Japan
2. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 2
Outline
New Processor Architectures
DAQ at the CMS Experiment
CMS Online Software Framework (XDAQ)
Architecture Foundation
Memory and Thread Managements
Data Transmission
TCPLA - TCP Layered Architecture
Motivation
Architecture
Performance Tuning
Preliminary Results
Summary
3. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 3
New Processor Architectures
In middle of 2000’s the processor frequency stabilized and the number
of cores per processor started to increase
The “golden” era for software developers ended in 2004 when Intel
cancelled its high-performance uniprocessor projects
“The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in software”
(Herb Sutter, 2005)
Software designed with concurrency in mind resulted in more efficient
use of new processors
Processor evolution (source: Sam Naffziger, AMD)
4. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 4
Non-Uniform Memory Access (NUMA)
Non-Uniform Memory Access (NUMA)
The distance between processing cores and memory or I/O interrupts
varies within Non-Uniform Memory Access (NUMA) architectures
CPU
Memory
Controller
CPU
Memory
Controller
CPU
Memory
Controller
CPU
Memory
Controller
I/O
Device
s
I/O
Device
s
I/O
Device
s
I/O
Device
s
6. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 6
CMS DAQ Requirements for LHC Run 2
Parameters
Data Sources (FEDs) ~ 620
Trigger levels 2
First Level rate 100 kHz
Event size 1 to 2 MB
Readout Throughput 200 GB/s
High Level Trigger 1 kHz
Storage Bandwidth 2 GB/s
7. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 7
CMS DAQ System for LHC Run 2
CPU
Socket
CPU
Socket
I/OI/O
40 Gb/s
Ethernet
56 Gb/s
Infiniband FDR
Readout Unit
9. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 9
XDAQ Framework
The XDAQ is software platform created specifically for the
development of distributed data acquisition systems
Implemented in C++, developed by the CMS DAQ group
Provides platform independent services, tools for inter-process
communication, configuration and control
Builds upon industrial standards, open protocols and libraries, and is
designed according to the object-oriented model
For further information about XDAQ see:
J. Gutleber, S. Murray and L. Orsini, Towards a homogeneous architecture for high-energy physics
data acquisition systems published in Computer Physics Communications, vol. 153, issue 2, pp.
155-163, 2003
http://www.sciencedirect.com/science/article/pii/S0010465503001619
10. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 10
XDAQ Architecture Foundation
Core
Executive
Plugin
Interface
11. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 11
Application
Plugin
Core
Executive
Plugin
Interface
XDAQ Architecture Foundation
XML
Configuration
12. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 12
XDAQ Architecture Foundation
XML
Configuration
SOAP
HTTP
RESTful
Control and Interface
Application
Plugin
Core
Executive
Plugin
Interface
13. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 13
XDAQ Architecture Foundation
XML
Configuration
SOAP
HTTP
RESTful
Control and Interface
Application
Plugin
Core
Executive
Plugin
Interface
Hardware Access
VMEPCI Memory
14. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 14
XDAQ Architecture Foundation
XML
Configuration
SOAP
HTTP
RESTful
Control and Interface
Application
Plugin
Core
Executive
Plugin
Interface
Hardware Access
VMEPCI Memory
Protocols and Formats
Uniform building blocks - One or more executives per computer
contain application and service components
15. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 15
CPU
Memory – Memory Pools
Memory Pool
Cached buffers
All memory allocation
can bound to specific
NUMA node
Memory pools allocate and
cache memory
log2 best-fit allocation
No fragmentation of
memory over long runs
Buffers are recycled during
runs – constant time for
retrieval
16. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 16
Memory – Buffer Loaning
Step 1 Step 2 Step 3
Task A Task B
Loans
reference
Allocates
reference
Releases
reference
Buffer loaning allows zero-copy of data between software layers
and processes
Task A Task B Task A Task B
17. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 17
I/O
Controller
I/O
Controller
CPU
Multithreading
Workloop
Thread
Application
Thread
Workloops can be bound to run on specific CPU cores by
configuration
Work assigned by application
Workloops provide easy use of threads
Assignment
of work
18. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 18
Protocol
level
XDAQ
User Application
Logical
communication
NIC
Peer
Transport
Data Transmission – Peer Transports
XDAQ
User
Application
NIC
Peer
Transport
Peer
Transport
Application uses network
transparently through XDAQ
framework
User application is network and protocol independent
Routing defined by XDAQ configuration
Connections setup through Peer to Peer model
20. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 20
Motivation
Learned lessons from experience using previous peer
transports for TCP in a real environment lead to a more
advanced design to cope with
Separation of TCP socket handling and XDAQ protocols
Exploitation of a multi-threading environment
Optimisation through configurability to improve performance
Grouping sockets and also to associate different threads to
different roles
21. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 21
TCPLA inspired by uDAPL
Direct Access Programming Library
Developed by DAT collaborative
http://www.datcollaborative.org/
Defines user Direct Access Programming Library (uDAPL) and kernel Direct
Access Programming Library (kDAPL) APIs
TCPLA is inspired by
uDAPL specification,
in particular the
send/receive
semantics it describes
22. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 22
What is the TCP Layered Architecture
TCP Layered Architecture (TCPLA) is a lightweight, transport and
platform independent user-level library for handling socket
processing
Provides the user with an event driven model of handling network
communications
All calls to send and receive data are performed asynchronously
Two peer transports have been developed
ptFRL is specific for reading TCP/IP streams from FEROLs in RU machine
ptUTCP is for general purpose data flow
ptUTCP
General protocols
ptFRL
FEROL protocol
23. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 23
TCPLA Architecture Principles (I)
Send
Queue
Receive
Queue
Completion Event Queue
Received Buffer
Connection Request
Sent Buffer
Connection established
Peer Transport
Pre-allocated
buffers for receiving
Filled buffers for
sending
Event handler
TCP Layered Architecture
24. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 24
TCPLA Architecture Principles (II)
socket i/o
socket i/o
socket i/o
socket i/o
socket i/o
socket i/o
socket i/o
Receive Workloop
Send Workloop
Event Workloop
Associate 1..N socket per workloop
26. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 26
Performance Factors
CPU affinity
I/O Interrupt affinity
Memory affinity
TCP Custom Kernel Settings
27. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 27
Readout Unit Machine Affinities
Affinities
40 Gb/s
Ethernet
56 Gb/s
Infiniband
I/O I/O
Readout Unit
15131197531
CPU Socket 1
NUMA Node 1 (16 GB)
14121086420
CPU Socket 0
NUMA Node 0 (16 GB)
40 GbE - I/O Interrupts
40 GbE - Socket reading
Infiniband - I/O Interrupts
DAQ threads
DAQ Memory allocation
Cores available to operating system
29. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 29
P2P Example: Out of the Box vs Affinity
Linksaturation
withaffinity
30. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 30
FEROL Aggregation into One RU
Aggregation n-to-1, for example
12 Streams from 12 FEROLs each
sending fragment size from 2 to 4 kB
16 Streams from 8 FEROLs each
sending fragment size from 1 to 2 kB
Concentrated in one 40 GbE NIC in
RU PC
Mellanox SX 1024 with 48 ports at 10
GbE and 12 ports at 40 GbE
Reliability and congestion handled
by TCP/IP
Post frames are enabled in the 10/40 GbE
switch
12/16 Streams
31. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 31
DAQ Test Bed
Final performance indicator of the TCPLA is based upon the
performance achieved while executing simultaneous input
and output in the RU
Test bed up to 47 FEROLs (10 GbE) in input
CPU CPU I/OI/O
40 Gb/s
Ethernet
56 Gb/s
Infiniband FDR
Readout Unit
32. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 32
Test with 1Data Source from FEROL (I)
Fragment Size (bytes)
300 400 1000 2000 3000 10000
Av.ThroughputperRU(MB/s)
0
1000
2000
3000
4000
5000
12 FEROLs x 1 RU x 4 BUs
100 kHz (12 streams)
40 GE
33. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 33
Test with 1Data Source from FEROL (II)
Fragment Size (bytes)
300 400 1000 2000 3000 10000
Av.ThroughputperRU(MB/s)
0
1000
2000
3000
4000
5000
12 FEROLs x 1 RU x 4 BUs
24 FEROLs x 2 RUs x 4 BUs
47 FEROLs x 4 RUs x 8 BUs
100 kHz (12 streams)
40 GE
working
range
90% efficiency of 40
GE
at 4.5 GB/s
35. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 35
Test with 2 Data Sources from FEROL
Fragment Size (bytes)
300 400 1000 2000 3000 10000
Av.ThroughputperRU(MB/s)
0
1000
2000
3000
4000
5000
8 FEROLs x 1 RU x 4 BUs
16 FEROLs x 2 RUs x 4 BUs
32 FEROLs x 4 RUs x 8 BUs
100 kHz (16 streams)
40 GE
working
range
90% efficiency of 40
GE
at 4.5 GB/s
36. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 36
Summary
TCPLA is based on the standard socket library with optimizations for
NUMA environments using the XDAQ framework
It has been developed to allow high performance of critical
applications in the new CMS DAQ system by exploiting multi-core
architectures
It will be used for both data flow and monitoring
37. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 37
Questions ?
Send Timeout
Receive Timeout
Affinity
Single-Threaded
Dispatcher
Multi-Threaded
Dispatcher
Poll
Select
Subnet Scanning
Port Scanning
TCP
SDP
Blocking
Non-Blocking
Connect On Request
Polling CycleDatagram
Auto Connect
FRL
I2O
B2IN
I/O Queue Size
Polling Work Loop
Waiting Work Loop
Event Queue Size
Receive Buffers
Receive Block Size
Max Clients
Max FD
TCPLA is a Powerful Tool
39. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 39
Software Architecture
PeerTransport
EventHandler InterfaceAdapter
EndPoint
SelectPoll
PublicServicePoint
Poll SelectDispatcher
40. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 40
Connecting
EventHandlerEventQueueInterfaceAdapter
Connect
Connect
Wait for establishment
Push established event
Consume event
a
a
ml
a
Thread
Mutex-less
Asynchronous
Legend:
ml ml
41. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 41
Sending
Push sent event
EventHandlerEventQueueInterfaceAdapter
Push into send queue
Buffer to
send
Check sending
is possible
Consume event
Pop buffer
from send queue
Send buffer
a
a
a
a ml
a
Thread
Mutex-less
Asynchronous
Legend:
ml ml
42. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 42
Receiving
Dispatch as
reference
Check for incoming
EventHandlerEventQueue Dispatcher
Incoming
packet
Push receive event Consume event
Provide free buffer
DispatcherPublicServicePoint
Get free buffer
Socket receive
a
a
a
ml ml
ml
a
Thread
Mutex-less
Asynchronous
Legend:
43. The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 43
Readout Unit Machine Affinities
Affinities
40 Gb/s
Ethernet
56 Gb/s
Infiniband
I/O I/O
Readout Unit
15131197531
CPU Socket 1
NUMA Node 1 (16 GB)
14121086420
CPU Socket 0
NUMA Node 0 (16 GB)
40 GbE - I/O Interrupts
40 GbE - Socket reading
Infiniband - I/O Interrupts
DAQ threads
DAQ Memory allocation
Cores available to operating system
Editor's Notes
memory is NEVER de-allocated, they are internally recyled
The peer transport PROTOCOLs define the FORMAT of MESSAGES for transporting
The peer transport PROTOCOLs define the FORMAT of MESSAGES for transporting