5. TRANSFORMING NETWORKING & STORAGE
Processor 0
Physical
Core 0
Linux* Control Plane
NUMA
Pool Caches
Queue/Rings
Buffers
10 GbE
10 GbE
Physical
Core 1
Intel® DPDK
PMD Packet I/O
Packet work
Rx
Tx
Physical
Core 2
Intel® DPDK
PMD Packet I/O
Flow work
Rx
Tx
Physical
Core 3
Intel® DPDK
PMD Packet I/O
Flow
Classification
App A, B, C
Rx
Tx
Physical
Core 5
Intel® DPDK
PMD Packet I/O
Flow Classification
App A, B, C
Rx
Tx
Run to Completion model
• I/O and Application workload can be handled on a single core
• I/O can be scaled over multiple cores
PCIe* connectivity and core usage
Using run-to-completion or pipeline software models
10 GbE
Pipeline model
• I/O application disperses packets to other cores
• Application work performed on other cores
Processor 1
Physical
Core 4
Intel® DPDK
10 GbE
Physical
Core 5
Intel® DPDK
Physical
Core 0
Intel® DPDK
PMD Packet I/O
Hash
Physical
Core 1
Intel® DPDK
App A App B App C
Physical
Core 2
Intel® DPDK
App A App B App C
Physical
Core 3
Intel® DPDK
Rx
Tx
10 GbE
Pkt Pkt
Physical
Core 4
Intel® DPDK
PMD Packet I/O
Flow Classification
App A, B, C
Rx
Tx
Pkt Pkt
Pkt Pkt
Pkt
Pkt
RSS
Mode
QPI
PCIePCIePCIePCIe
PCIePCIe
NUMA
Pool Caches
Queue/Rings
Buffers
Look at more I/O on
fewer cores with
vectorization
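The two models can be sketched in plain C. This is a conceptual sketch only: the packet type, `rx_burst`/`tx_burst` and the ring below are stub stand-ins for the Intel® DPDK PMD and ring APIs, and the two pipeline stages run back to back here, whereas a real deployment gives each loop its own core.

```c
#include <stddef.h>

#define BURST 4
#define MAXPKT 16

/* Stub "NIC": wire_in holds arriving packets, wire_out the transmitted
 * ones. rx_burst/tx_burst stand in for the DPDK PMD burst calls. */
static int    wire_in[MAXPKT], wire_out[MAXPKT];
static size_t wire_pos, wire_len, out_len;

void load_wire(const int *pkts, size_t n)        /* (re)arm the stub NIC */
{
    for (size_t i = 0; i < n; i++) wire_in[i] = pkts[i];
    wire_pos = 0; wire_len = n; out_len = 0;
}

static size_t rx_burst(int *out, size_t max)     /* poll up to max packets */
{
    size_t n = 0;
    while (n < max && wire_pos < wire_len) out[n++] = wire_in[wire_pos++];
    return n;
}

static void tx_burst(const int *in, size_t n)
{
    for (size_t i = 0; i < n; i++) wire_out[out_len++] = in[i];
}

static int do_work(int p) { return p * 2; }      /* "application" work */

/* Run-to-completion: the same core polls, processes and transmits. */
size_t run_to_completion(void)
{
    int burst[BURST];
    size_t n, done = 0;
    while ((n = rx_burst(burst, BURST)) > 0) {
        for (size_t i = 0; i < n; i++) burst[i] = do_work(burst[i]);
        tx_burst(burst, n);
        done += n;
    }
    return done;
}

/* Pipeline: an I/O core only polls and disperses packets onto a ring;
 * a worker core dequeues, processes and transmits. */
static int    ring[MAXPKT];
static size_t ring_w, ring_r;

size_t pipeline(void)
{
    int burst[BURST];
    size_t n, done = 0;
    ring_w = ring_r = 0;
    while ((n = rx_burst(burst, BURST)) > 0)     /* I/O core's loop */
        for (size_t i = 0; i < n; i++) ring[ring_w++] = burst[i];
    while (ring_r < ring_w) {                    /* worker core's loop */
        int p = do_work(ring[ring_r++]);
        tx_burst(&p, 1);
        done++;
    }
    return done;
}
```

Either dispatcher moves every packet from the stub NIC, through the application work, and back out; only the core layout differs.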
6. TRANSFORMING NETWORKING & STORAGE
When to Choose Run-to-Completion vs. Pipeline
Applications will generally employ both models.
Technical questions to consider:
• How many cycles/packet do I need for my algorithms?
• Are there large data structures that need to be shared with read/write access across packets?
• Will I support timer / packet-ordering functions?
• Can I take advantage of a specific optimization if I restrict an algorithm to one core?
• How much data would I need to exchange between software modules?
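The cycles-per-packet question has a quick back-of-envelope answer. At 10 GbE line rate with 64-byte frames, each frame also occupies 20 bytes of preamble, start-of-frame delimiter and inter-frame gap on the wire, giving about 14.88 Mpps, so a 3 GHz core has roughly 200 cycles per packet. A small sketch of that arithmetic (helper names are illustrative):

```c
/* Cycles-per-packet budget for a given line rate, frame size and core
 * clock. On the wire, each Ethernet frame carries 20 extra bytes of
 * preamble, start-of-frame delimiter and inter-frame gap. */

double packets_per_sec(double line_rate_bps, unsigned frame_bytes)
{
    return line_rate_bps / ((frame_bytes + 20) * 8.0);
}

double cycles_per_packet(double core_hz, double line_rate_bps,
                         unsigned frame_bytes)
{
    return core_hz / packets_per_sec(line_rate_bps, frame_bytes);
}
```

For example, `packets_per_sec(10e9, 64)` is about 14.88 million, and `cycles_per_packet(3e9, 10e9, 64)` is about 201 cycles: if your algorithms need more than that per packet, the work has to be spread over more cores.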
7. TRANSFORMING NETWORKING & STORAGE
More Run-to-Completion vs. Pipeline…
General architecture questions to consider:
• Do some cores have easier/faster access to a hardware resource?
• Do you want to view cores as offload engines?
Development environment questions to consider:
• Do you need to employ legacy software modules?
• Does ease of code maintenance trump performance?
8. TRANSFORMING NETWORKING & STORAGE
Example: Building a More Complicated Pipeline
• Applications can be distributed/pipelined across as many cores as needed to achieve throughput
• Trade-offs on when to distribute vs. consolidate applications will vary
• The queue/ring API serves as the communication mechanism
• Current focus is a static (boot-time) configuration of queues
• The NIC driver pushes data to the flow classifier
• The classifier branches each packet out to the appropriate handler depending on packet inspection
• IPsec packets could be sent to the CPM via the CPM PMD, or handled on-CPU for non-accelerated platforms
• This is just an example

[Figure: Poll Mode Driver Rx → Flow Classification → {Inbound IPsec pre-processing, L3 Forwarding Application, Discard Application} → IPsec post-processing → Poll Mode Driver Tx, with the NIC and a Cave Creek CPM each attached through their own poll mode drivers.]
10. TRANSFORMING NETWORKING & STORAGE
Connection Between DPDK Elements -- Rings
• Rings are the primary mechanism to move data between software units, or between software and I/O sources or hardware accelerators

[Figure: a dispatch loop linking DPDK components over rings — Poll Mode Driver Rx, Flow Classification, Inbound IPsec pre-processing, an L3 Forwarding Application (with its FIB), a Discard Application, IPsec post-processing, a free list, and Poll Mode Driver Tx — alongside customer applications and packets forwarded to other cores; a NIC and an accelerator sit at the edges.]
11. TRANSFORMING NETWORKING & STORAGE
Queue/Ring Management API
Effectively a FIFO implementation in software
• Lockless implementations for single- or multi-producer, single- or multi-consumer enqueue/dequeue
• Supports bulk enqueue/dequeue for packet bunching
• Implements watermark thresholds for back-pressure/flow control
Essential to optimizing throughput
• Used to decouple stages of a pipeline
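To make the head/tail index discipline concrete, here is a minimal single-producer/single-consumer lockless FIFO in C11 atomics. This is a hypothetical sketch, far simpler than DPDK's `rte_ring` (no multi-producer path, no watermarks): the indices only ever increase and wrap modulo 2^32, so `head - tail` always gives the fill level.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SZ   8                 /* must be a power of two */
#define RING_MASK (RING_SZ - 1)

typedef struct {
    _Atomic uint32_t head;          /* next slot the producer writes */
    _Atomic uint32_t tail;          /* next slot the consumer reads  */
    void *slots[RING_SZ];
} spsc_ring;

/* Bulk enqueue: all-or-nothing, returns the number enqueued (0 or n). */
size_t ring_enqueue_bulk(spsc_ring *r, void **objs, size_t n)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (RING_SZ - (head - tail) < n)
        return 0;                               /* not enough free slots */
    for (size_t i = 0; i < n; i++)
        r->slots[(head + i) & RING_MASK] = objs[i];
    /* Release: the stores above become visible before the new head. */
    atomic_store_explicit(&r->head, head + n, memory_order_release);
    return n;
}

/* Bulk dequeue: all-or-nothing, returns the number dequeued (0 or n). */
size_t ring_dequeue_bulk(spsc_ring *r, void **objs, size_t n)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head - tail < n)
        return 0;                               /* not enough entries */
    for (size_t i = 0; i < n; i++)
        objs[i] = r->slots[(tail + i) & RING_MASK];
    atomic_store_explicit(&r->tail, tail + n, memory_order_release);
    return n;
}
```

The all-or-nothing bulk semantics are what make packet bunching cheap: one pair of index updates amortizes over the whole burst.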
12. TRANSFORMING NETWORKING & STORAGE
How "Lockless" Operations Are Implemented
Multiple-producer enqueue example
The head and tail "pointers" are 32-bit indices, so they wrap naturally in a space of 2^32 values.
Steps:
1. ring->prod_head and ring->cons_tail are copied to local variables
2. A compare-and-swap advances ring->prod_head only if it still equals the local prod_head
3. The enqueued object is written into the ring
4. ring->prod_tail is advanced only once it equals the local prod_head, i.e. after any earlier producers have finished
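The four steps translate almost directly into C11 atomics. The sketch below is hypothetical single-object enqueue code following the recipe above; it is not the actual DPDK implementation, which also handles bulk enqueue and more elaborate free-space accounting.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MP_RING_SZ   8
#define MP_RING_MASK (MP_RING_SZ - 1)

typedef struct {
    _Atomic uint32_t prod_head;   /* slots claimed by producers     */
    _Atomic uint32_t prod_tail;   /* slots published to the consumer */
    _Atomic uint32_t cons_tail;   /* consumer progress               */
    void *slots[MP_RING_SZ];
} mp_ring;

bool mp_enqueue(mp_ring *r, void *obj)
{
    uint32_t head, next;

    /* Steps 1-2: snapshot prod_head and cons_tail, then CAS prod_head
     * forward -- it succeeds only if no other producer moved it. */
    do {
        head = atomic_load_explicit(&r->prod_head, memory_order_relaxed);
        uint32_t cons = atomic_load_explicit(&r->cons_tail,
                                             memory_order_acquire);
        if (head - cons >= MP_RING_SZ)
            return false;                       /* ring full */
        next = head + 1;                        /* wraps mod 2^32 */
    } while (!atomic_compare_exchange_weak_explicit(
                 &r->prod_head, &head, next,
                 memory_order_relaxed, memory_order_relaxed));

    /* Step 3: write the object into the slot we claimed. */
    r->slots[head & MP_RING_MASK] = obj;

    /* Step 4: publish. Wait until prod_tail reaches our old head, so
     * earlier producers complete first, then move it past our slot. */
    while (atomic_load_explicit(&r->prod_tail, memory_order_relaxed) != head)
        ;                                       /* spin */
    atomic_store_explicit(&r->prod_tail, next, memory_order_release);
    return true;
}
```

No lock is ever held: a stalled producer can delay the final publish step, but the slot claim itself never blocks, and the consumer only ever reads up to prod_tail.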
14. TRANSFORMING NETWORKING & STORAGE
Platform Quality of Service
Cache Monitoring Technology – the ability to monitor Last Level Cache occupancy for a set of RMIDs (Resource Monitoring IDs). An extensible architecture for future monitoring events.
Cache Allocation Technology – the ability to partition the Last Level Cache, enforced on a per-core basis through Class of Service mapping.
https://software.intel.com/en-us/blogs/2014/12/11/intels-cache-monitoring-technology-software-support-and-tools
15. TRANSFORMING NETWORKING & STORAGE
Cache Allocation Technology – flow
1. QoS enumeration/configuration: enumerate the QoS capabilities and configure each Class of Service with a capacity bitmask
2. QoS association: on a context switch, the OS sets the thread's Class of Service (e.g. COS=2) in the PQR register
3. Enforcement: each application memory request is tagged with its cache Class of Service, and the QoS-aware cache subsystem restricts which ways of each LLC set (way 1 … way 16 across sets 1 … n) the request may allocate into

[Figure: each COS (1–4) selects an enforcement way mask (WayMask1–4); how the architectural bitmask (BitMask1–4) maps onto a physical way mask on the enforcement target is implementation dependent.]
16. TRANSFORMING NETWORKING & STORAGE
Cache Allocation Technology
Bitmask examples: only masks with contiguous '1's are allowed.
Apps can be separated or can share LLC space:
• Isolated: determinism benefit
• Shared/overlapped: throughput benefit

Examples of overlap and isolation (8-bit masks, M7…M0):

Isolated bitmasks
• COS 1: M7–M4 set (50% of the LLC)
• COS 2: M3–M2 set (25%)
• COS 3: M1 set (12.5%)
• COS 4: M0 set (12.5%)

Overlapped bitmasks
• COS 1: M7–M0 set (100%)
• COS 2: M3–M0 set (50%)
• COS 3: M1–M0 set (25%)
• COS 4: M0 set (12.5%)
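The contiguity rule and the percentages above are easy to check programmatically. A small sketch (the helper names are illustrative, not part of any Intel API) that validates a capacity bitmask and computes its share of the LLC:

```c
#include <stdbool.h>
#include <stdint.h>

/* A CAT capacity bitmask is valid only if its set bits are contiguous. */
bool cat_mask_valid(uint32_t mask)
{
    if (mask == 0)
        return false;
    uint32_t shifted = mask / (mask & -mask);  /* drop trailing zeros  */
    return (shifted & (shifted + 1)) == 0;     /* remaining bits all 1? */
}

/* Share of the LLC granted by a mask, for a cache with n_ways ways. */
double cat_mask_share(uint32_t mask, unsigned n_ways)
{
    unsigned set = 0;
    for (uint32_t m = mask; m; m &= m - 1)
        set++;                                 /* popcount */
    return (double)set / n_ways;
}
```

For the 8-bit examples above, `cat_mask_share(0xF0, 8)` gives 0.5 (the 50% isolated COS 1), while a mask like `0xA0` fails `cat_mask_valid` because its set bits are not contiguous.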
20. TRANSFORMING NETWORKING & STORAGE
Haswell Cluster on Die (COD) Mode
On Haswell CPUs, not all of the L3 cache is on the same ring:
• Some L3 cache has higher latency to access
• Similar to NUMA, but for the L3 cache
Supported on 2S HSW-EP SKUs with 2 home agents (10+ cores)
• Targeted at NUMA workloads where latency matters more than sharing data across caching agents (Cbo)
• Reduces average LLC hit and local memory latencies
• Each home agent mostly sees requests from a reduced set of threads, which can lead to higher memory bandwidth
• OS/VMM own NUMA and process-affinity decisions

[Figure: COD mode for an 18-core HSW-EP die — the cores with their Cbo/LLC slices, the Sbo units, and home agents HA0/HA1 are split into Cluster0 and Cluster1; QPI 0/1 and the IIO are shared.]
22. TRANSFORMING NETWORKING & STORAGE
40GbE Fortville family (XL710/X710)
Comparing controller typical power:
• 82599EB: 2 x 10GbE, 5.2 watts typical power¹
• XL710: 1 x 40GbE, 3.3 watts typical power²
Source as of Aug 2014 – 1: 82599 datasheet rev 2.0, Table 11.5, 2x10GbE Twinax typical power [W]; 2: XL710 datasheet rev 1.21, Table 14-7, typical active power, 1x40GbE [W]

Power efficiency improvements:
• Up to 30% reduction in typical power
• Up to 65% reduction in watts per gigabit
• 2x increase in total bandwidth
23. TRANSFORMING NETWORKING & STORAGE
40GbE Fortville family (XL710/X710)
Port configurations: 2x10, 4x10, 1x40, 2x40
• Low-power, single-chip design for PCI Express 3.0
• Intelligent load balancing for high-performance traffic flows
• Network virtualization overlay stateless offloads for VXLAN, NVGRE, and Geneve
• Flexible pipeline processing – new features can be added after production via firmware upgrade
25. TRANSFORMING NETWORKING & STORAGE
DPPD: What is it?
• Data Plane Performance Demonstrators
• An open-source DPDK application
• BSD 3-clause license
• Available on 01.org (https://01.org/intel-data-plane-performance-demonstrators/downloads)
• Runs on the host, in a VM, and on OVS
26. TRANSFORMING NETWORKING & STORAGE
DPPD – What is it? (continued)
• The config file defines
  • Which cores are used
  • Which interfaces are used
  • Which tasks are executed and how they are configured
• This allows you to
  • Find bottlenecks and measure performance
  • Try and compare different core layouts without changing code
  • Reuse a config file on different systems (CPUs, hyper-threads, sockets, interfaces)
29. TRANSFORMING NETWORKING & STORAGE
Configuration and design
• Easily reconfigurable (parses a config file)
• Different pipelines through configuration
  • WiFi gateway
  • BNG
  • QoS
  • A combination of, or part of, the above
• Assign work to different cores
• Cache QoS management
• Configuration follows design
  • Each core is assigned a task (or set of tasks) to execute
  • Tasks are executed in round-robin fashion
  • Tasks communicate through rings
30. TRANSFORMING NETWORKING & STORAGE
DPPD: Very simple port forwarding

[Diagram: a single FWD task receives on ETH1 and transmits on ETH2]

[port 0] ;DPDK port number
name=cpe0
mac=00:00:00:00:00:01

[port 1] ;DPDK port number
name=cpe1
mac=00:00:00:00:00:02

[core 1]
name=FWD
task=0
mode=none
rx port=cpe0
tx port=cpe1