Achieving High Performance TCP over 40GbE on NUMA

Achieving High Performance with TCP over 40GbE
on NUMA Architectures for CMS Data Acquisition
Andrea Petrucci – CERN (PH/CMD)
The 19th Real Time Conference
27 May 2014, Nara, Japan

The 19th Real Time Conference, May 27 2014 , Nara Prefectural New Public Hall, Nara, Japan – Andrea Petrucci CERN (PH/CMD) 2
Outline
 New Processor Architectures
 DAQ at the CMS Experiment
 CMS Online Software Framework (XDAQ)
 Architecture Foundation
 Memory and Thread Managements
 Data Transmission
 TCPLA - TCP Layered Architecture
 Motivation
 Architecture
 Performance Tuning
 Preliminary Results
 Summary

New Processor Architectures
 In middle of 2000’s the processor frequency stabilized and the number
of cores per processor started to increase
 The “golden” era for software developers ended in 2004 when Intel
cancelled its high-performance uniprocessor projects
 “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in software”
(Herb Sutter, 2005)
 Software designed with concurrency in mind resulted in more efficient
use of new processors
Processor evolution (source: Sam Naffziger, AMD)

Non-Uniform Memory Access (NUMA)
Non-Uniform Memory Access (NUMA)
 The distance between processing cores and memory or I/O interrupts
varies within Non-Uniform Memory Access (NUMA) architectures
CPU
Memory
Controller
CPU
Memory
Controller
CPU
Memory
Controller
CPU
Memory
Controller
I/O
Device
s
I/O
Device
s
I/O
Device
s
I/O
Device
s

CMS DAQ Requirements for LHC Run 2
Parameters
Data Sources (FEDs) ~ 620
Trigger levels 2
First Level rate 100 kHz
Event size 1 to 2 MB
Readout Throughput 200 GB/s
High Level Trigger 1 kHz
Storage Bandwidth 2 GB/s

CMS DAQ System for LHC Run 2
CPU
Socket
CPU
Socket
I/OI/O
40 Gb/s
Ethernet
56 Gb/s
Infiniband FDR
Readout Unit

CMS Online Software Framework (XDAQ)

XDAQ Framework
 The XDAQ is software platform created specifically for the
development of distributed data acquisition systems
 Implemented in C++, developed by the CMS DAQ group
 Provides platform independent services, tools for inter-process
communication, configuration and control
 Builds upon industrial standards, open protocols and libraries, and is
designed according to the object-oriented model
For further information about XDAQ see:
J. Gutleber, S. Murray and L. Orsini, Towards a homogeneous architecture for high-energy physics
data acquisition systems published in Computer Physics Communications, vol. 153, issue 2, pp.
155-163, 2003
http://www.sciencedirect.com/science/article/pii/S0010465503001619

XDAQ Architecture Foundation
Core
Executive
Plugin
Interface

Application
Plugin
Core
Executive
Plugin
Interface
XML
Configuration

XML
Configuration
SOAP
HTTP
RESTful
Control and Interface
Application
Plugin
Core
Executive
Plugin
Interface

XML
Configuration
SOAP
HTTP
RESTful
Application
Plugin
Core
Executive
Plugin
Interface
Hardware Access
VMEPCI Memory

XML
Configuration
SOAP
HTTP
RESTful
Application
Plugin
Core
Executive
Plugin
Interface
Hardware Access
VMEPCI Memory
Protocols and Formats
 Uniform building blocks - One or more executives per computer
contain application and service components

CPU
Memory – Memory Pools
Memory Pool
Cached buffers
All memory allocation
can bound to specific
NUMA node
 Memory pools allocate and
cache memory
 log2 best-fit allocation
 No fragmentation of
memory over long runs
 Buffers are recycled during
runs – constant time for
retrieval

Memory – Buffer Loaning
Step 1 Step 2 Step 3
Task A Task B
Loans
reference
Allocates
reference
Releases
reference
 Buffer loaning allows zero-copy of data between software layers
and processes
Task A Task B Task A Task B

I/O
Controller
I/O
Controller
CPU
Multithreading
Workloop
Thread
Application
Thread
 Workloops can be bound to run on specific CPU cores by
configuration
 Work assigned by application
 Workloops provide easy use of threads
Assignment
of work

Protocol
level
XDAQ
User Application
Logical
communication
NIC
Peer
Transport
Data Transmission – Peer Transports
XDAQ
User
Application
NIC
Peer
Transport
Peer
Transport
Application uses network
transparently through XDAQ
framework
 User application is network and protocol independent
 Routing defined by XDAQ configuration
 Connections setup through Peer to Peer model

Motivation
 Learned lessons from experience using previous peer
transports for TCP in a real environment lead to a more
advanced design to cope with
 Separation of TCP socket handling and XDAQ protocols
 Exploitation of a multi-threading environment
 Optimisation through configurability to improve performance
 Grouping sockets and also to associate different threads to
different roles

TCPLA inspired by uDAPL
Direct Access Programming Library
 Developed by DAT collaborative
 http://www.datcollaborative.org/
 Defines user Direct Access Programming Library (uDAPL) and kernel Direct
Access Programming Library (kDAPL) APIs
 TCPLA is inspired by
uDAPL specification,
in particular the
send/receive
semantics it describes

What is the TCP Layered Architecture
 TCP Layered Architecture (TCPLA) is a lightweight, transport and
platform independent user-level library for handling socket
processing
 Provides the user with an event driven model of handling network
communications
 All calls to send and receive data are performed asynchronously
 Two peer transports have been developed
 ptFRL is specific for reading TCP/IP streams from FEROLs in RU machine
 ptUTCP is for general purpose data flow
ptUTCP
General protocols
ptFRL
FEROL protocol

TCPLA Architecture Principles (I)
Send
Queue
Receive
Queue
Completion Event Queue
 Received Buffer
 Connection Request
 Sent Buffer
 Connection established
Peer Transport
Pre-allocated
buffers for receiving
Filled buffers for
sending
Event handler
TCP Layered Architecture

TCPLA Architecture Principles (II)
socket i/o
socket i/o
socket i/o
socket i/o
socket i/o
socket i/o
socket i/o
Receive Workloop
Send Workloop
Event Workloop
Associate 1..N socket per workloop

Performance Factors
CPU affinity
I/O Interrupt affinity
Memory affinity
TCP Custom Kernel Settings

Readout Unit Machine Affinities
Affinities
40 Gb/s
Ethernet
56 Gb/s
Infiniband
I/O I/O
Readout Unit
15131197531
CPU Socket 1
NUMA Node 1 (16 GB)
14121086420
CPU Socket 0
NUMA Node 0 (16 GB)
40 GbE - I/O Interrupts
40 GbE - Socket reading
Infiniband - I/O Interrupts
DAQ threads
DAQ Memory allocation
Cores available to operating system

P2P Example: Out of the Box vs Affinity
Linksaturation
withaffinity

FEROL Aggregation into One RU
 Aggregation n-to-1, for example
 12 Streams from 12 FEROLs each
sending fragment size from 2 to 4 kB
 16 Streams from 8 FEROLs each
sending fragment size from 1 to 2 kB
 Concentrated in one 40 GbE NIC in
RU PC
 Mellanox SX 1024 with 48 ports at 10
GbE and 12 ports at 40 GbE
 Reliability and congestion handled
by TCP/IP
 Post frames are enabled in the 10/40 GbE
switch
12/16 Streams

DAQ Test Bed
 Final performance indicator of the TCPLA is based upon the
performance achieved while executing simultaneous input
and output in the RU
 Test bed up to 47 FEROLs (10 GbE) in input
CPU CPU I/OI/O
40 Gb/s
Ethernet
56 Gb/s
Infiniband FDR
Readout Unit

Test with 1Data Source from FEROL (I)
Fragment Size (bytes)
300 400 1000 2000 3000 10000
Av.ThroughputperRU(MB/s)
0
1000
2000
3000
4000
5000
12 FEROLs x 1 RU x 4 BUs
100 kHz (12 streams)
40 GE

Test with 1Data Source from FEROL (II)
300 400 1000 2000 3000 10000
0
1000
2000
3000
4000
5000
24 FEROLs x 2 RUs x 4 BUs
40 GE
working
range
90% efficiency of 40
GE
at 4.5 GB/s

Test with 2 Data Sources from FEROL

Test with 2 Data Sources from FEROL
300 400 1000 2000 3000 10000
0
1000
2000
3000
4000
5000
40 GE
working
range
90% efficiency of 40
GE
at 4.5 GB/s

Summary
 TCPLA is based on the standard socket library with optimizations for
NUMA environments using the XDAQ framework
 It has been developed to allow high performance of critical
applications in the new CMS DAQ system by exploiting multi-core
architectures
 It will be used for both data flow and monitoring

Questions ?
Send Timeout
Receive Timeout
Affinity
Single-Threaded
Dispatcher
Multi-Threaded
Dispatcher
Poll
Select
Subnet Scanning
Port Scanning
TCP
SDP
Blocking
Non-Blocking
Connect On Request
Polling CycleDatagram
Auto Connect
FRL
I2O
B2IN
I/O Queue Size
Polling Work Loop
Waiting Work Loop
Event Queue Size
Receive Buffers
Receive Block Size
Max Clients
Max FD
TCPLA is a Powerful Tool

Software Architecture
PeerTransport
EventHandler InterfaceAdapter
EndPoint
SelectPoll
PublicServicePoint
Poll SelectDispatcher

Connecting
EventHandlerEventQueueInterfaceAdapter
Connect
Connect
Wait for establishment
Push established event
Consume event
a
a
ml
a
Thread
Mutex-less
Asynchronous
Legend:
ml ml

Sending
Push sent event
EventHandlerEventQueueInterfaceAdapter
Push into send queue
Buffer to
send
Check sending
is possible
Consume event
Pop buffer
from send queue
Send buffer
a
a
a
a ml
a
Thread
Mutex-less
Asynchronous
Legend:
ml ml

Receiving
Dispatch as
reference
Check for incoming
EventHandlerEventQueue Dispatcher
Incoming
packet
Push receive event Consume event
Provide free buffer
DispatcherPublicServicePoint
Get free buffer
Socket receive
a
a
a
ml ml
ml
a
Thread
Mutex-less
Asynchronous
Legend:

Readout Unit Machine Affinities
Affinities
40 Gb/s
Ethernet
56 Gb/s
Infiniband
I/O I/O
Readout Unit
15131197531
CPU Socket 1
NUMA Node 1 (16 GB)
14121086420
CPU Socket 0
NUMA Node 0 (16 GB)
40 GbE - I/O Interrupts
40 GbE - Socket reading
Infiniband - I/O Interrupts
DAQ threads
DAQ Memory allocation
Cores available to operating system

Achieving High Performance TCP over 40GbE on NUMA

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Achieving High Performance TCP over 40GbE on NUMA

Similar to Achieving High Performance TCP over 40GbE on NUMA (20)

More from Andrea PETRUCCI

More from Andrea PETRUCCI (6)

Achieving High Performance TCP over 40GbE on NUMA

Editor's Notes