Achieving High Performance with TCP over 40GbE
on NUMA Architectures for CMS Data Acquisition
Andrea Petrucci – CERN (PH/CMD)
The 19th Real Time Conference
27 May 2014, Nara, Japan
Outline
 New Processor Architectures
 DAQ at the CMS Experiment
 CMS Online Software Framework (XDAQ)
 Architecture Foundation
 Memory and Thread Management
 Data Transmission
 TCPLA - TCP Layered Architecture
 Motivation
 Architecture
 Performance Tuning
 Preliminary Results
 Summary
New Processor Architectures
 In the mid-2000s, processor frequencies stabilized and the number
of cores per processor started to increase
 The “golden” era for software developers ended in 2004, when Intel
cancelled its high-performance uniprocessor projects
 “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in
Software” (Herb Sutter, 2005)
 Software designed with concurrency in mind resulted in more efficient
use of new processors
Processor evolution (source: Sam Naffziger, AMD)
Non-Uniform Memory Access (NUMA)
 The distance between processing cores and memory or I/O interrupts
varies within Non-Uniform Memory Access (NUMA) architectures
[Diagram: four CPUs, each with its own memory controller and local I/O
devices.]
CMS Experiment
CMS DAQ Requirements for LHC Run 2
Parameter                  Value
Data Sources (FEDs)        ~620
Trigger levels             2
First-level trigger rate   100 kHz
Event size                 1 to 2 MB
Readout throughput         200 GB/s
High Level Trigger rate    1 kHz
Storage bandwidth          2 GB/s
CMS DAQ System for LHC Run 2
[Diagram: Readout Unit with two CPU sockets, each with local I/O;
40 Gb/s Ethernet on the input side, 56 Gb/s InfiniBand FDR on the
output side.]
CMS Online Software Framework (XDAQ)
XDAQ Framework
 XDAQ is a software platform created specifically for the
development of distributed data acquisition systems
 Implemented in C++, developed by the CMS DAQ group
 Provides platform independent services, tools for inter-process
communication, configuration and control
 Builds upon industrial standards, open protocols and libraries, and is
designed according to the object-oriented model
For further information about XDAQ see:
J. Gutleber, S. Murray and L. Orsini, “Towards a homogeneous architecture
for high-energy physics data acquisition systems”, Computer Physics
Communications, vol. 153, issue 2, pp. 155–163, 2003.
http://www.sciencedirect.com/science/article/pii/S0010465503001619
XDAQ Architecture Foundation
[Diagram, built up incrementally over several slides: an Executive core
exposes a Plugin Interface into which Application Plugins are loaded;
an XML Configuration defines the setup; SOAP, HTTP and RESTful
interfaces provide control; Hardware Access plugins cover VME, PCI and
memory; Protocols and Formats complete the stack.]
 Uniform building blocks - one or more executives per computer
contain application and service components
Memory – Memory Pools
[Diagram: a memory pool of cached buffers attached to a CPU; all memory
allocation can be bound to a specific NUMA node.]
 Memory pools allocate and cache memory
 log2 best-fit allocation
 No fragmentation of memory over long runs
 Buffers are recycled during runs – constant-time retrieval (see the
sketch below)
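A minimal C++ sketch of such a pool (invented names, not the XDAQ
classes): each request is rounded up to the next power of two and
released buffers are cached in per-size free lists, so retrieval is
constant-time and memory never fragments over a long run. In the real
system the backing allocation could additionally be bound to a NUMA node
(e.g. via numa_alloc_onnode).

```cpp
#include <array>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Illustrative log2 best-fit pool: buffers are never freed back to the
// OS, only cached and recycled.
class Log2Pool {
public:
    void* acquire(std::size_t size) {
        unsigned bin = binFor(size);
        std::vector<void*>& freeList = bins_[bin];
        if (!freeList.empty()) {               // recycle a cached buffer: O(1)
            void* p = freeList.back();
            freeList.pop_back();
            return p;
        }
        // First use of this size class: grow the pool. A NUMA-aware pool
        // would call numa_alloc_onnode(1UL << bin, node) here instead.
        return std::malloc(std::size_t(1) << bin);
    }
    void release(void* p, std::size_t size) {
        bins_[binFor(size)].push_back(p);      // cache for reuse, never free
    }
private:
    static unsigned binFor(std::size_t size) { // next power of two >= size
        unsigned bin = 0;
        while ((std::size_t(1) << bin) < size) ++bin;
        return bin;
    }
    std::array<std::vector<void*>, 48> bins_;  // one free list per size class
};
```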
Memory – Buffer Loaning
 Buffer loaning allows zero-copy hand-over of data between software
layers and processes (a sketch follows)
[Diagram, three steps between Task A and Task B: Task A allocates a
buffer reference (step 1), loans the reference to Task B (step 2), and
the reference is released when no longer needed (step 3).]
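A sketch of the three steps using standard C++ reference counting
(illustrative only, not the actual XDAQ mechanism): copying the
reference is the loan, no payload is copied, and the buffer is reclaimed
when the last reference is released.

```cpp
#include <memory>

struct Buffer { char data[4096]; };

// Step 1: Task A allocates a buffer and gets a reference to it.
std::shared_ptr<Buffer> allocate() {
    return std::shared_ptr<Buffer>(new Buffer, [](Buffer* b) {
        // In a pooled allocator this deleter would recycle the buffer
        // instead of freeing it.
        delete b;
    });
}

// Step 2: Task B borrows the reference; only the pointer is copied.
void taskB(std::shared_ptr<Buffer> loaned) {
    // ... consume loaned->data, then return (drops Task B's reference)
}

int main() {
    std::shared_ptr<Buffer> ref = allocate(); // Task A allocates
    taskB(ref);                               // loan to Task B, zero-copy
    ref.reset();                              // Step 3: last release reclaims
}
```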
Multithreading
[Diagram: an application thread assigns work to workloop threads running
on a CPU attached to its I/O controllers.]
 Workloops can be bound to run on specific CPU cores by configuration
 Work is assigned by the application
 Workloops provide easy use of threads (see the sketch below)
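A condensed sketch of such a workloop (plain C++/pthreads on Linux, not
the XDAQ workloop API; shutdown handling is omitted): the thread pins
itself to the configured core before processing the work the application
assigns to it.

```cpp
#include <pthread.h>
#include <sched.h>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class WorkLoop {
public:
    explicit WorkLoop(int core) : thread_([this, core] { run(core); }) {}
    void submit(std::function<void()> job) {   // application assigns work
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void run(int core) {
        cpu_set_t set;                          // bind to the configured core
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !jobs_.empty(); });
            std::function<void()> job = std::move(jobs_.front());
            jobs_.pop();
            lk.unlock();
            job();                              // process assigned work
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    std::thread thread_;                        // declared last: starts only
                                                // after the members above exist
};
```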
Data Transmission – Peer Transports
[Diagram: on each host the User Application sits on top of XDAQ, which
delegates to a Peer Transport driving the NIC; the applications
communicate logically at the protocol level while the peer transports
carry the data. The application uses the network transparently through
the XDAQ framework.]
 User application is network and protocol independent
 Routing is defined by the XDAQ configuration
 Connections are set up through a peer-to-peer model
TCP Layered Architecture
Motivation
 Lessons learned from using the previous peer transports for TCP in a
real environment led to a more advanced design that copes with
 Separation of TCP socket handling from the XDAQ protocols
 Exploitation of a multi-threaded environment
 Optimisation through configurability to improve performance
 Grouping of sockets, and association of different threads with
different roles
TCPLA inspired by uDAPL
Direct Access Programming Library
 Developed by the DAT Collaborative
 http://www.datcollaborative.org/
 Defines the user-level Direct Access Programming Library (uDAPL) and
kernel-level Direct Access Programming Library (kDAPL) APIs
 TCPLA is inspired by the uDAPL specification, in particular the
send/receive semantics it describes
What is the TCP Layered Architecture
 TCP Layered Architecture (TCPLA) is a lightweight, transport- and
platform-independent user-level library for handling socket processing
 Provides the user with an event-driven model for handling network
communications
 All calls to send and receive data are performed asynchronously (an
interface sketch follows this list)
 Two peer transports have been developed on top of it
 ptFRL is specific to reading TCP/IP streams from FEROLs in the RU
machine
 ptUTCP is for general-purpose data flow
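The event-driven, asynchronous model can be pictured with this C++
interface sketch; all names (AsyncTransport, postSend, ...) are
illustrative placeholders rather than the real TCPLA API. Send and
receive calls only enqueue work and return immediately; outcomes are
reported later as completion events.

```cpp
#include <cstddef>

// Completion event types, mirroring those listed under the
// architecture principles below.
enum class EventType {
    ReceivedBuffer, SentBuffer, ConnectionRequest, ConnectionEstablished
};

struct Event {
    EventType   type;
    void*       buffer;   // the buffer the completion refers to, if any
    std::size_t length;
};

class AsyncTransport {
public:
    virtual ~AsyncTransport() = default;
    virtual void postSend(void* buffer, std::size_t length) = 0;    // returns immediately
    virtual void postReceive(void* buffer, std::size_t length) = 0; // pre-post a free buffer
    virtual bool nextEvent(Event& out) = 0;  // pop the next completion, if any
};
```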
TCPLA Architecture Principles (I)
[Diagram: the peer transport posts pre-allocated buffers for receiving
onto a Receive Queue and filled buffers for sending onto a Send Queue;
a Completion Event Queue feeds the event handler. Completion events:
received buffer, sent buffer, connection request, connection
established.]
The event handler drains the completion queue (see the event-loop
sketch below).
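Reusing the hypothetical AsyncTransport and Event types from the
interface sketch above, a minimal event handler draining the completion
event queue could look like this:

```cpp
// One action per completion type; the loop exits when the queue is empty.
void eventLoop(AsyncTransport& transport) {
    Event ev;
    while (transport.nextEvent(ev)) {
        switch (ev.type) {
        case EventType::ReceivedBuffer:
            // hand the filled buffer to the application, by reference
            break;
        case EventType::SentBuffer:
            // return the drained buffer to the memory pool
            break;
        case EventType::ConnectionRequest:
            // accept the connection and register the new socket
            break;
        case EventType::ConnectionEstablished:
            // start posting receive buffers on the new connection
            break;
        }
    }
}
```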
TCPLA Architecture Principles (II)
[Diagram: many socket i/o endpoints fan into a Receive Workloop, a Send
Workloop and an Event Workloop.]
 Associate 1..N sockets per workloop (see the epoll sketch below)
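One common Linux implementation of this socket-to-workloop association
is epoll; in this hedged sketch (not the actual TCPLA code) each receive
workloop owns one epoll instance and services only the sockets
registered with it:

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <vector>

// Each receive workloop runs this function with its own group of sockets.
void receiveWorkloop(const std::vector<int>& sockets) {
    int ep = epoll_create1(0);
    for (int fd : sockets) {                   // associate 1..N sockets
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.fd = fd;
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
    }
    epoll_event ready[16];
    for (;;) {
        int n = epoll_wait(ep, ready, 16, -1); // block until a socket is readable
        for (int i = 0; i < n; ++i) {
            char buf[65536];
            ssize_t got = read(ready[i].data.fd, buf, sizeof(buf));
            (void)got;                         // hand 'got' bytes downstream
        }
    }
}
```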
Performance Tuning
Performance Factors
 CPU affinity
 I/O interrupt affinity
 Memory affinity
 Custom TCP kernel settings (a per-socket tuning sketch follows)
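System-wide kernel settings (e.g. the net.ipv4.tcp_* sysctls) are
applied outside the application, but per-socket options can be tuned
from user code. A small sketch with placeholder values, not the
production settings:

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void tuneSocket(int fd) {
    int rcvbuf = 4 * 1024 * 1024;   // ask for a large receive buffer
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));

    int sndbuf = 4 * 1024 * 1024;   // and a large send buffer
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));

    int nodelay = 1;                // disable Nagle for latency-sensitive sends
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &nodelay, sizeof(nodelay));
}
```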
Readout Unit Machine Affinities
[Diagram: the Readout Unit's 40 Gb/s Ethernet and 56 Gb/s InfiniBand
NICs each hang off one I/O hub.]
CPU Socket 0 – NUMA Node 0 (16 GB): cores 0, 2, 4, 6, 8, 10, 12, 14
CPU Socket 1 – NUMA Node 1 (16 GB): cores 1, 3, 5, 7, 9, 11, 13, 15
Roles distributed across the cores (a binding sketch follows):
 40 GbE - I/O interrupts
 40 GbE - socket reading
 InfiniBand - I/O interrupts
 DAQ threads
 DAQ memory allocation
 Cores available to the operating system
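These affinities can be enforced programmatically; a sketch using the
standard Linux APIs (sched_setaffinity and libnuma, linked with -lnuma).
The node and core numbers simply mirror the example layout above:

```cpp
#include <numa.h>      // libnuma: NUMA-aware allocation
#include <sched.h>     // sched_setaffinity
#include <cstddef>

// Pin the calling DAQ thread to core 3 (an odd core on socket 1).
void pinDaqThread() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    sched_setaffinity(0, sizeof(set), &set);   // 0 = the calling thread
}

// Allocate DAQ buffers on NUMA node 1, local to the socket-reading cores.
void* allocateOnNode(std::size_t bytes) {
    return numa_alloc_onnode(bytes, 1);        // release with numa_free()
}
```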
Preliminary Results
P2P Example: Out of the Box vs Affinity
[Plot annotation: link saturation is reached with affinity tuning.]
FEROL Aggregation into One RU
 Aggregation n-to-1, for example
 12 streams from 12 FEROLs, each sending fragments of 2 to 4 kB
 16 streams from 8 FEROLs, each sending fragments of 1 to 2 kB
 Concentrated into one 40 GbE NIC in the RU PC
 Mellanox SX1024 with 48 ports at 10 GbE and 12 ports at 40 GbE
 Reliability and congestion handled by TCP/IP
 Pause frames are enabled in the 10/40 GbE switch
[Diagram: 12/16 streams aggregated into the RU.]
DAQ Test Bed
 The final performance indicator for TCPLA is the throughput achieved
while executing simultaneous input and output in the RU
 Test bed with up to 47 FEROLs (10 GbE) on the input side
[Diagram: Readout Unit with two CPU sockets and I/O, 40 Gb/s Ethernet
input, 56 Gb/s InfiniBand FDR output.]
Test with 1 Data Source from FEROL (I)
[Plot: average throughput per RU (MB/s, 0 to 5000) versus fragment size
(bytes, 300 to 10000) for 12 FEROLs x 1 RU x 4 BUs, with reference lines
for 100 kHz at 12 streams and the 40 GbE limit.]
Test with 1 Data Source from FEROL (II)
[Plot: average throughput per RU (MB/s, 0 to 5000) versus fragment size
(bytes, 300 to 10000) for 12 FEROLs x 1 RU x 4 BUs, 24 FEROLs x 2 RUs x
4 BUs and 47 FEROLs x 4 RUs x 8 BUs, with reference lines for 100 kHz at
12 streams and the 40 GbE limit; in the working range the RU sustains
4.5 GB/s, 90% efficiency of 40 GbE.]
Test with 2 Data Sources from FEROL
[Plot: average throughput per RU (MB/s, 0 to 5000) versus fragment size
(bytes, 300 to 10000) for 8 FEROLs x 1 RU x 4 BUs, 16 FEROLs x 2 RUs x
4 BUs and 32 FEROLs x 4 RUs x 8 BUs, with reference lines for 100 kHz at
16 streams and the 40 GbE limit; in the working range the RU sustains
4.5 GB/s, 90% efficiency of 40 GbE.]
Summary
 TCPLA builds on the standard socket library, with optimizations for
NUMA environments, and is used through the XDAQ framework
 It has been developed to deliver high performance to critical
applications in the new CMS DAQ system by exploiting multi-core
architectures
 It will be used for both data flow and monitoring
Questions ?
TCPLA is a powerful tool – its configurable options (shown as a word
cloud): Send Timeout, Receive Timeout, Affinity, Single-Threaded
Dispatcher, Multi-Threaded Dispatcher, Poll, Select, Subnet Scanning,
Port Scanning, TCP, SDP, Blocking, Non-Blocking, Connect On Request,
Polling Cycle, Datagram, Auto Connect, FRL, I2O, B2IN, I/O Queue Size,
Polling Work Loop, Waiting Work Loop, Event Queue Size, Receive Buffers,
Receive Block Size, Max Clients, Max FD
Backup Slides
Software Architecture
[Class diagram: PeerTransport, EventHandler, InterfaceAdapter, EndPoint,
PublicServicePoint and Dispatcher, with Poll and Select variants of the
polling/dispatching components.]
Connecting
[Sequence diagram across InterfaceAdapter, EventQueue and EventHandler:
the InterfaceAdapter issues the connect, waits for establishment, pushes
a "connection established" event into the EventQueue, and the
EventHandler consumes it. Legend: a = asynchronous, ml = mutex-less;
thread boundaries are marked.]
Sending
[Sequence diagram across InterfaceAdapter, EventQueue and EventHandler:
the buffer to send is pushed into the send queue; after checking that
sending is possible, the buffer is popped from the send queue and sent,
and a "sent buffer" event is pushed into the EventQueue for the
EventHandler to consume. Legend: a = asynchronous, ml = mutex-less;
thread boundaries are marked.]
Receiving
[Sequence diagram across PublicServicePoint, Dispatcher, EventQueue and
EventHandler: the Dispatcher checks for an incoming packet, gets a free
buffer from the PublicServicePoint, performs the socket receive, and
dispatches the data as a reference by pushing a receive event into the
EventQueue; the EventHandler consumes the event and provides a free
buffer back. Legend: a = asynchronous, ml = mutex-less; thread
boundaries are marked.]
Readout Unit Machine Affinities (backup copy of the affinities slide
shown earlier)
Editor's Notes

  1. Memory is never de-allocated; buffers are internally recycled
  2. The peer transport protocols define the format of the messages used
for transporting data