Seminar "Using modern information technologies to solve modern problems in particle physics" at the Yandex office in Moscow, July 3, 2012
Niko Neufeld, CERN
1. LHCb Trigger & DAQ
an Introductory Overview
Niko Neufeld
CERN/PH Department
Yandex, Moscow, July 3rd
2. The Large Hadron Collider
3. Physics, Detectors, Trigger & DAQ
[Diagram] High-rate collider, many collisions, rare signals → fast electronics make the trigger decisions → data acquisition → event filter → mass storage
4. The Data Acquisition Challenge at LHC
• 15 million detector channels
• @ 40 MHz
• = ~15 * 1,000,000 * 40 * 1,000,000 bytes
• = ~ 600 TB/sec
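A quick back-of-the-envelope check of this number (a Python sketch; one byte per channel per crossing is the implicit assumption of the slide's arithmetic):

# Raw data rate implied by the numbers above
channels = 15e6          # detector channels
crossing_rate = 40e6     # Hz (bunch-crossing rate)
bytes_per_channel = 1    # byte per channel per crossing (assumption from the slide)

rate = channels * crossing_rate * bytes_per_channel
print(f"{rate / 1e12:.0f} TB/s")   # -> 600 TB/s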
5. Should we read everything?
• A typical collision is “boring” (total interaction rate ~10^9 Hz) – although we also need some of these “boring” data as a cross-check, as a calibration tool, and for some important “low-energy” physics (~5 x 10^6 Hz)
• “Interesting” physics (EWK & top) is about 6–8 orders of magnitude rarer: EWK 20–100 Hz
• “Exciting” physics involving new particles/discoveries is 9 orders of magnitude below σ_tot (~10 Hz and below):
– 100 GeV Higgs: 0.1 Hz*
– 600 GeV Higgs: 0.01 Hz
• We just need to efficiently identify these rare processes against the overwhelming background before reading out & storing the whole event
*Note: this is just the production rate; properly finding it is much rarer!
6. Know Your Enemy:
pp Collisions at 14 TeV at 10^34 cm^-2 s^-1
• σ(pp) ≈ 70 mb → more than 7 x 10^8 interactions/s (!)
• In ATLAS and CMS* 20–30 minimum-bias events overlap in every bunch crossing
• H → ZZ, Z → μμ: H → 4 muons is the cleanest (“golden”) signature
• [Event display] Reconstructed tracks with pT > 25 GeV – and this (not the H though…) repeats every 25 ns…
*) LHCb @ 4 x 10^33 cm^-2 s^-1 isn’t much nicer, and ALICE (Pb-Pb) is even busier
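The interaction rate quoted above follows directly from luminosity times cross-section:

$R = \mathcal{L}\,\sigma_{pp} = 10^{34}\,\mathrm{cm^{-2}s^{-1}} \times 70\,\mathrm{mb} = 10^{34}\,\mathrm{cm^{-2}s^{-1}} \times 7\times10^{-26}\,\mathrm{cm^{2}} = 7\times10^{8}\,\mathrm{s^{-1}}$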
7. Trivial DAQ with a real trigger (2)
[Block diagram] Sensor → delay → ADC → processing → storage. A discriminator on the sensor signal forms the trigger; the trigger starts the ADC only when the busy logic is idle (a flip-flop is set on start and cleared when processing signals “ready”).
Dead time (%) is the ratio between the time the DAQ is busy and the total time.
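A minimal sketch of the dead-time arithmetic behind this scheme (non-paralyzable model; the rates below are purely illustrative):

# Non-paralyzable dead time: the DAQ is busy for a time tau after each
# accepted trigger, and further triggers arriving in that window are lost.
trigger_rate = 100e3   # Hz, input trigger rate (illustrative)
tau = 5e-6             # s, busy time per accepted event (illustrative)

accepted_rate = trigger_rate / (1 + trigger_rate * tau)
dead_time = accepted_rate * tau          # busy time / total time

print(f"accepted rate: {accepted_rate / 1e3:.1f} kHz")   # -> 66.7 kHz
print(f"dead time    : {dead_time:.1%}")                 # -> 33.3%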
8. A “simple” 40 MHz track trigger – the
LHCb PileUp system
9. Finding vertices in FPGAs
• Use the r-coordinates of hits in silicon detector discs (the detector geometry was made for this task!)
• Find coincidences between hits on the two discs
• Count & histogram (see the sketch below)
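A sketch of the underlying idea in Python (the real algorithm runs in FPGAs; disc positions and hit radii here are purely illustrative):

# Two discs at z = z_A and z = z_B measure the radius r of each hit.
# A track from a vertex on the beam axis fixes the ratio r_B/r_A, so every
# (r_A, r_B) coincidence yields a z-vertex estimate; the estimates are
# histogrammed and the highest bin is taken as the vertex candidate.
import numpy as np

z_A, z_B = -320.0, -220.0   # mm, disc positions (illustrative)

def z_vertex(r_A, r_B):
    # straight line through (z_A, r_A) and (z_B, r_B), extrapolated to r = 0
    return (z_A * r_B - z_B * r_A) / (r_B - r_A)

hits_A = np.array([10.2, 15.7, 22.1])   # mm, hit radii on disc A (illustrative)
hits_B = np.array([7.0, 10.8, 15.2])    # mm, hit radii on disc B (illustrative)

z_est = [z_vertex(a, b) for a in hits_A for b in hits_B if a != b]
hist, edges = np.histogram(z_est, bins=100, range=(-200.0, 200.0))
print("vertex candidate near z =", edges[np.argmax(hist)], "mm")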
10. LHCb PileUp: finding multiple vertices and quality
Comparing with the “offline” truth (full tracking, calibration, alignment)
11. LHCb Pileup Algorithm
• Time budget for this algorithm: about 2 µs
• Runs in conventional FPGAs in a radiation-safe area
• Limited to low pile-up (OK for LHCb)
13. DAQ design guidelines
• Scalability – cope with changes in event size and luminosity (pileup!)
• Robust (very little dead time, high efficiency, non-expert operators) → intelligent control systems
• Use industry-standard, commercial technologies (long-term maintenance) → PCs, Ethernet
• Low cost → PCs, standard LANs
• High bandwidth (many gigabytes/s) → use local area networks (LANs)
• “Creative” & “flexible” (open for new things) → use software and reconfigurable logic (FPGAs)
14. One network to rule them all
• Ethernet, IEEE 802.3xx, has almost become synonymous with Local Area Networking
• Ethernet has many nice features: cheap, simple, cheap, etc…
• Ethernet does not:
– guarantee delivery of messages
– allow multiple network paths
– provide quality of service or bandwidth assignment (albeit to a varying degree this is provided by many switches)
• Because of this, raw Ethernet is rarely used; usually it serves as a transport medium for IP, UDP, TCP etc…
• Flow control in standard Ethernet is only defined between immediate neighbors
• A sending station is free to throw away x-offed (Xoff) frames – and often does
15. Generic DAQ implemented on a LAN
Typical number of pieces:
– Detector: 1
– Custom links from the detector: ~1000
– “Readout Units” for protocol adaptation: 100 to 1000
– Powerful core routers: 2 to 8
– Edge switches: 50 to 100
– Servers for event filtering: > 1000
16. Congestion
• "Bang" translates into
2 2 random, uncontrolled packet-
loss
• In Ethernet this is perfectly
valid behavior and
implemented by many low-
latency devices
• This problem comes from
synchronized sources sending
to the same destination at the
same time
Bang • Either a higher level “event-
building” protocol avoids this
congestion or the switches
must avoid packet loss with
2 deep buffer memories
LHC Trigger & DAQ - Niko Neufeld, CERN 16
17. Push-Based Event Building with store-and-forward switching and load balancing
• Sources do not buffer, so the switch must buffer to avoid packet loss due to overcommitment
• Event Builders notify the Event Manager of available capacity (“Send me an event!”)
• The Event Manager ensures that data are sent only to nodes with available capacity (“Send next event to EB1/EB2/EB3”) – it relies on feedback from the Event Builders (see the sketch below)
[Figure: readout system → Data Acquisition Switch → Event Builders 1–3, coordinated by the Event Manager]
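A toy illustration of this credit-based load balancing in Python (class and builder names are made up for the example):

# Event Builders report free capacity to the Event Manager ("Send me an event!");
# the Event Manager hands each new event to a builder that still has room.
from collections import deque

class EventManager:
    def __init__(self):
        self.credits = deque()              # builders that announced free capacity

    def notify_capacity(self, builder):     # "Send me an event!"
        self.credits.append(builder)

    def assign_next_event(self):            # "Send next event to EB_x"
        if not self.credits:
            raise RuntimeError("no builder has free capacity - readout must throttle")
        return self.credits.popleft()

manager = EventManager()
for builder in ("EB1", "EB2", "EB3"):
    manager.notify_capacity(builder)

for event_id in range(3):
    target = manager.assign_next_event()
    print(f"event {event_id} -> {target}")  # all readout boards push this event to 'target'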
18. LHCb DAQ
[Architecture diagram]
Detector (VELO, ST, OT, RICH, ECal, HCal, Muon) → front-end electronics → Readout Boards; the L0 trigger and the TFC system (driven by the LHC clock) steer the front-end, and the Experiment Control System (ECS) supervises everything.
Readout Boards → READOUT NETWORK (55 GB/s event building, MEP requests) → switches → HLT farm and MON farm (CPUs) → 200–300 MB/s to storage.
Legend: event data, timing and fast control signals, control and monitoring data.
Average event size: 55 kB; average rate into the farm: 1 MHz; average rate to tape: 4–5 kHz.
19. LHCb DAQ
• Events are very small (about 55 kB in total)
– each readout board contributes only about 200 bytes(!)
– a UDP message on Ethernet costs 8 + 14 + 20 + 8 + 4 = 54 bytes → roughly 25% overhead (see the arithmetic below)
• LHCb uses coalescence of messages, packing about 10 to 15 events into one message (called a MEP) → message rate is ~80 kHz (c.f. CMS, ATLAS)
• The protocol is a simple, single-stage push; every farm node builds complete events; the TTC system is used to assign IP addresses coherently to the readout boards
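The overhead arithmetic above, spelled out (Python sketch; the packing factor of 13 is just one value in the 10–15 range quoted on the slide):

# Per-packet overhead on Ethernet: preamble 8 + Ethernet header 14 + IP header 20
# + UDP header 8 + FCS 4 = 54 bytes, compared with a ~200-byte event fragment.
payload = 200                     # bytes per readout-board fragment
overhead = 8 + 14 + 20 + 8 + 4    # = 54 bytes

print(f"one event per packet: {overhead / payload:.0%} overhead")
mep_factor = 13                   # events coalesced into one MEP (10-15 on the slide)
print(f"one MEP of {mep_factor} events: {overhead / (mep_factor * payload):.1%} overhead")
print(f"message rate: {1e6 / mep_factor / 1e3:.0f} kHz")   # 1 MHz / 13 ~ 77 kHz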
22. High Level Trigger Farms
And that, in simple terms, is what
we do in the High Level Trigger
23. Online Trigger Farms 2012
Number of cores (+ hyperthreading): ALICE 2700, ATLAS 17000, CMS 13200, LHCb 15500
Number of servers (mainboards): ATLAS ~2000, CMS ~1300, LHCb 1574
Total available cooling power: ALICE ~500, ATLAS ~820, CMS 800, LHCb 525
Total available rack space (Us): ALICE ~2000, ATLAS 2400, CMS ~3600, LHCb 2200
CPU type(s): ALICE – AMD Opteron, Intel 54xx, Intel 56xx; ATLAS – Intel 54xx, Intel 56xx; CMS – Intel 54xx, Intel 56xx, Intel E5-2670; LHCb – Intel 5450, Intel 5650, AMD 6220
And counting…
24. LHC planning
(Not yet approved!)
• Long Shutdown 1 (LS1): CMS: Myrinet → InfiniBand / Ethernet; ATLAS: merge L2 and Event Collection infrastructures
• Long Shutdown 2 (LS2): ALICE continuous read-out; LHCb 40 MHz read-out
• Long Shutdown 3 (LS3): CMS track-trigger
25. Motivation
• The LHC (large hadron collider) collides protons
every 25 ns (40 MHz)
• Each collision produces about 100 kB of data in
the detector
• Currently a pre-selection in custom electronics rejects 97.5% of these events – unfortunately a lot of them contain interesting physics
• In 2017 the detector will be changed so that all events can be read out into a standard compute platform for detailed inspection
26. LHCb after LS2
• Ready for an all-software trigger (resources permitting)
• Zero-suppression on the front-end electronics is mandatory!
• Event size about 100 kB, readout rate up to 40 MHz
• Will need a network scalable up to 32 Tbit/s: InfiniBand, 10/40/100 Gigabit Ethernet?
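The 32 Tbit/s figure is just the event size times the readout rate:

$100\,\mathrm{kB/event} \times 40\times10^{6}\,\mathrm{events/s} = 4\,\mathrm{TB/s} = 32\,\mathrm{Tbit/s}$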
27. Key figures
• Minimum required bandwidth: > 32 Tbit/s
• # of 100 Gigabit/s links > 320
• # of compute units > 1500
• An event (“snapshot of a collision”) is about 100
kB of data
• # of events processed every second: 10 to 40 million
• # of events retained after filtering: 20,000 to 30,000 (a data reduction of at least a factor of 1000)
28. LHCb DAQ as of 2018
[Architecture diagram]
Detector → GBT: custom radiation-hard link over MMF, 3.2 Gbit/s (about 10000 links), through 100 m of rock → Readout Units → input into the DAQ network (10/40 Gigabit Ethernet or FDR IB, 1000 to 4000 links) → DAQ network → output into the compute-unit clusters (100 Gbit Ethernet / EDR IB, 200 to 400 links) → Compute Units.
Compute units could be servers with GPUs or other coprocessors.
29. Readout Unit
• The Readout Unit needs to collect the custom links
• Some pre-processing
• Buffering
• Coalescing of data fragments → reduces message rate / transport overheads (see the sketch below)
• Needs an FPGA
• Sends data using a standard network protocol (IB, Ethernet)
• Sending of data can be done directly from the FPGA or via standard network silicon
• Works together with the Compute Units to build events
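A hedged sketch of the coalescing idea (the byte layout below is made up for illustration; it is not the real LHCb MEP format):

# The readout unit buffers one fragment per event and ships a single message
# covering several consecutive events, cutting the message rate accordingly.
import struct

def coalesce(fragments, first_event_id):
    """Pack consecutive event fragments into one message:
    header = (first_event_id, n_fragments), then (length, payload) per fragment."""
    msg = struct.pack("<II", first_event_id, len(fragments))
    for frag in fragments:
        msg += struct.pack("<I", len(frag)) + frag
    return msg

fragments = [bytes(200) for _ in range(13)]        # thirteen ~200-byte fragments
message = coalesce(fragments, first_event_id=42)
print(len(message), "bytes in one message instead of 13 separate packets")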
30. Compute Unit
• A compute unit is a destination for the event-
data fragments from the readout units
• It assembles the fragments into a complete
“event” and runs various selection algorithms on
this event
• About 0.1% of events are retained
• A compute unit will be a high-density server platform (a mainboard with standard CPUs), probably augmented with a co-processor card (like an Intel MIC or a GPU)
31. Future DAQ systems: trends
• Certainly LAN based
– InfiniBand deserves a serious evaluation for high bandwidth (> 100 GB/s)
– In Ethernet, if DCB works, we might be able to build networks from smaller units; otherwise we will stay with large store-and-forward boxes
• The trend towards “trigger-free” (do everything in software → bigger DAQ) will continue
– Physics data handling in commodity CPUs
• Will there be a place for many-core / coprocessor cards (Intel MIC / CUDA)?
– IMHO this will depend on whether we can establish a development framework which allows long-term maintenance of the software by non-”geek” users, much more than on the actual technology
32. Fat-Tree Topology for One Slice
• 48-port 10 GbE switches
• Mix readout boards (ROBs) and filter-farm servers in one switch:
– 15 x readout boards
– 18 x servers
– 15 x uplinks
→ non-blocking switching; uses 65% of the installed bandwidth (a classical DAQ only 50%)
• Each slice accommodates
– 690 x inputs (ROBs)
– 828 x outputs (servers)
→ the server/ROB ratio is adjustable (see the arithmetic below)
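A quick consistency check of the slice numbers (a sketch assuming identical 48-port leaf switches with the 15/18/15 port split quoted above):

# Each leaf switch dedicates 15 ports to readout boards, 18 to servers
# and 15 to uplinks, which exactly fills a 48-port switch.
ports_per_switch = 48
robs, servers, uplinks = 15, 18, 15
assert robs + servers + uplinks == ports_per_switch

leaf_switches = 690 // robs              # 690 ROB inputs per slice
print(leaf_switches)                     # -> 46 leaf switches per slice
print(leaf_switches * servers)           # -> 828 servers, matching the slide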
33. Pull-Based Event Building
• Event Builders notify the Event Manager of available capacity (“Send me an event!”)
• The Event Manager selects an event-builder node (“EB1, get next event”, “EB2, get next event”, …)
• The selected Event Builder then requests the data from the readout system (“Send event 1 to EB1!”) – readout traffic is driven by the Event Builders
[Figure: readout system → Data Acquisition Switch → Event Builders 1–3, coordinated by the Event Manager]
34. Summary
• Large modern DAQ systems are based entirely (mostly) on Ethernet and
big PC-server farms
• Bursty, uni-directional traffic is a challenge in the network and the
receivers, and requires substantial buffering in the switches
• The future:
– It seems that buffering in switches is being reduced (latency vs. buffering)
– Advanced flow control is coming, but it will need to be tested whether it is sufficient for DAQ
– Ethernet is still strongest, but InfiniBand looks like a very interesting
alternative
– Integrated protocols (RDMA) can offload servers, but will be more complex
– Integration of GPUs, non-Intel processors and other many-cores will need to be studied
• For the DAQ and triggering the question is not whether we can do it, but how we can do it so that we can afford it!
36. Cut-through switching – Head-of-Line Blocking
[Figure: a packet to node 4 must wait in the input FIFO even though the port to node 4 is free]
• The reason for this is the First-In First-Out (FIFO) structure of the input buffer
• Queuing theory tells us* that for random traffic (and infinitely many switch ports) the throughput of the switch will go down to 58.6%; that means on a 100 Mbit/s network the nodes will effectively "see" only ~58 Mbit/s (see the simulation sketch below)
*) "Input Versus Output Queueing on a Space-Division Packet Switch"; Karol, M. et al.; IEEE Trans. Comm., 35/12
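A small Monte-Carlo sketch of that limit (pure Python, illustrative parameters; the 58.6% value is the asymptotic result of Karol et al. for infinitely many ports):

# Saturated input-queued switch with FIFO inputs: every input always has a
# head-of-line (HOL) cell addressed to a uniformly random output; each slot,
# every contested output serves one of its contenders, the rest stay blocked.
import random

def hol_throughput(n_ports=32, slots=20000, seed=1):
    random.seed(seed)
    hol = [random.randrange(n_ports) for _ in range(n_ports)]   # HOL destinations
    delivered = 0
    for _ in range(slots):
        contenders = {}
        for inp, dst in enumerate(hol):
            contenders.setdefault(dst, []).append(inp)
        for dst, inputs in contenders.items():
            winner = random.choice(inputs)                       # output serves one input
            delivered += 1
            hol[winner] = random.randrange(n_ports)              # next cell in that FIFO
    return delivered / (slots * n_ports)

print(f"throughput ~ {hol_throughput():.2f}")   # close to the 0.586 limit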
37. Event-building
Detector Readout Units send to Compute Units; Compute Units receive passively → “push architecture”
[Diagram] Detector → GBT: custom radiation-hard link over MMF, 3.2 Gbit/s (about 10000 links), through 100 m of rock → Readout Units → input into the DAQ network (10/40 Gigabit Ethernet or FDR IB, 1000 to 4000 links) → DAQ network → output into the compute-unit clusters (100 Gbit Ethernet / EDR IB, 200 to 400 links) → Compute Units
39. Runcontrol challenges
• Start, configure and control O(10000)
processes on farms of several 1000 nodes
• Configure and monitor O(10000) front-end
elements
• Fast database access, caching, pre-loading, parallelization – and all of this 100% reliable!
40. Runcontrol technologies
• Communication:
– CORBA (ATLAS)
– HTTP/SOAP (CMS)
– DIM (LHCb, ALICE)
• Behavior & Automation:
– SMI++ (ALICE)
– CLIPS (ATLAS)
– RCMS (CMS)
– SMI++ (in PVSS) (used also in the DCS)
• Job/Process control:
– Based on XDAQ, CORBA, …
– FMC/PVSS (LHCb, does also fabric monitoring)
• Logging:
– log4C, log4j, syslog, FMC (again), …
Editor's Notes
Scheme showing the basic principle of the PU vertex algorithm implemented on the VEPROBs: on top, the $r$-coordinates of the hits are combined in a coincidence matrix; then the sum of entries in a wedge between lines of constant $\frac{R_B}{R_A}$ is used to extract the vertex information; finally the $z$-position of all vertex candidates is projected onto a histogram. The highest peak is labeled as the primary vertex (PV).
Left: Vertex histogram obtained from the combinations of PU hits from a collision event of 2011 data. The histogram filled in black (red) is obtained before (after) the ``peak-masking'' phase. The second peak, that is the peak with the maximum number of entries in a 3-bin-wide window, is now clearly visible. Right: Distance in mm between the $z$-position of the PU vertex candidate and the $z$-position of the offline (reconstructed) vertex, for events with at least 2 PU vertices and 2 reconstructed vertices. The histogram is obtained after applying the misalignment corrections to the Pile-Up.
TFC (TTC) system used as a load-balancer. No separate event-builder units – event-building done directly on each trigger farm node. Trigger farm nodes send event-requests to TFC system. TFC system broadcasts IP address to read-out board. Readout boards push data to trigger-farm node. Single stage read-out. Unreliable network protocol. Relies on large buffers in network and some over-provisioning. Typical link-load in DAQ 70 to 80% (for up-links)