5. TRANSFORMING NETWORKING & STORAGE
Processor 0
Physical
Core 0
Linux* Control Plane
NUMA
Pool Caches
Queue/Rings
Buffers
10 GbE
10 GbE
Physical
Core 1
Intel® DPDK
PMD Packet I/O
Packet work
Rx
Tx
Physical
Core 2
Intel® DPDK
PMD Packet I/O
Flow work
Rx
Tx
Physical
Core 3
Intel® DPDK
PMD Packet I/O
Flow
Classification
App A, B, C
Rx
Tx
Physical
Core 5
Intel® DPDK
PMD Packet I/O
Flow Classification
App A, B, C
Rx
Tx
Run to Completion model
• I/O and Application workload can be handled on a single core
• I/O can be scaled over multiple cores
PCIe* connectivity and core usage
Using run-to-completion or pipeline software models
10 GbE
Pipeline model
• I/O application disperses packets to other cores
• Application work performed on other cores
Processor 1
Physical
Core 4
Intel® DPDK
10 GbE
Physical
Core 5
Intel® DPDK
Physical
Core 0
Intel® DPDK
PMD Packet I/O
Hash
Physical
Core 1
Intel® DPDK
App A App B App C
Physical
Core 2
Intel® DPDK
App A App B App C
Physical
Core 3
Intel® DPDK
Rx
Tx
10 GbE
Pkt Pkt
Physical
Core 4
Intel® DPDK
PMD Packet I/O
Flow Classification
App A, B, C
Rx
Tx
Pkt Pkt
Pkt Pkt
Pkt
Pkt
RSS
Mode
QPI
PCIePCIePCIePCIe
PCIePCIe
NUMA
Pool Caches
Queue/Rings
Buffers
Look at more I/O on
fewer cores with
vectorization
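The two models can be sketched in plain C. This is a conceptual sketch only: the packet type, `rx_burst`/`tx_burst` and the ring below are stub stand-ins for the Intel® DPDK PMD and ring APIs, and the two pipeline stages run back to back here, whereas a real deployment gives each loop its own core.

```c
#include <stddef.h>

#define BURST 4
#define MAXPKT 16

/* Stub "NIC": wire_in holds arriving packets, wire_out the transmitted
 * ones. rx_burst/tx_burst stand in for the DPDK PMD burst calls. */
static int    wire_in[MAXPKT], wire_out[MAXPKT];
static size_t wire_pos, wire_len, out_len;

void load_wire(const int *pkts, size_t n)        /* (re)arm the stub NIC */
{
    for (size_t i = 0; i < n; i++) wire_in[i] = pkts[i];
    wire_pos = 0; wire_len = n; out_len = 0;
}

static size_t rx_burst(int *out, size_t max)     /* poll up to max packets */
{
    size_t n = 0;
    while (n < max && wire_pos < wire_len) out[n++] = wire_in[wire_pos++];
    return n;
}

static void tx_burst(const int *in, size_t n)
{
    for (size_t i = 0; i < n; i++) wire_out[out_len++] = in[i];
}

static int do_work(int p) { return p * 2; }      /* "application" work */

/* Run-to-completion: the same core polls, processes and transmits. */
size_t run_to_completion(void)
{
    int burst[BURST];
    size_t n, done = 0;
    while ((n = rx_burst(burst, BURST)) > 0) {
        for (size_t i = 0; i < n; i++) burst[i] = do_work(burst[i]);
        tx_burst(burst, n);
        done += n;
    }
    return done;
}

/* Pipeline: an I/O core only polls and disperses packets onto a ring;
 * a worker core dequeues, processes and transmits. */
static int    ring[MAXPKT];
static size_t ring_w, ring_r;

size_t pipeline(void)
{
    int burst[BURST];
    size_t n, done = 0;
    ring_w = ring_r = 0;
    while ((n = rx_burst(burst, BURST)) > 0)     /* I/O core's loop */
        for (size_t i = 0; i < n; i++) ring[ring_w++] = burst[i];
    while (ring_r < ring_w) {                    /* worker core's loop */
        int p = do_work(ring[ring_r++]);
        tx_burst(&p, 1);
        done++;
    }
    return done;
}
```

Either dispatcher moves every packet from the stub NIC, through the application work, and back out; only the core layout differs.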
6. TRANSFORMING NETWORKING & STORAGE
When to Choose Run-to-Completion vs. Pipeline
Applications will generally employ both models.
Technical questions to consider:
• How many cycles/packet do I need for my algorithms?
• Are there large data structures that need to be shared with read/write access across packets?
• Will I support timer / packet-ordering functions?
• Can I take advantage of a specific optimization if I restrict an algorithm to one core?
• How much data would I need to exchange between software modules?
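The cycles-per-packet question has a quick back-of-envelope answer. At 10 GbE line rate with 64-byte frames, each frame also occupies 20 bytes of preamble, start-of-frame delimiter and inter-frame gap on the wire, giving about 14.88 Mpps, so a 3 GHz core has roughly 200 cycles per packet. A small sketch of that arithmetic (helper names are illustrative):

```c
/* Cycles-per-packet budget for a given line rate, frame size and core
 * clock. On the wire, each Ethernet frame carries 20 extra bytes of
 * preamble, start-of-frame delimiter and inter-frame gap. */

double packets_per_sec(double line_rate_bps, unsigned frame_bytes)
{
    return line_rate_bps / ((frame_bytes + 20) * 8.0);
}

double cycles_per_packet(double core_hz, double line_rate_bps,
                         unsigned frame_bytes)
{
    return core_hz / packets_per_sec(line_rate_bps, frame_bytes);
}
```

For example, `packets_per_sec(10e9, 64)` is about 14.88 million, and `cycles_per_packet(3e9, 10e9, 64)` is about 201 cycles: if your algorithms need more than that per packet, the work has to be spread over more cores.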
7. TRANSFORMING NETWORKING & STORAGE
More Run-to-Completion vs. Pipeline…
General architecture questions to consider:
• Do some cores have easier/faster access to a hardware resource?
• Do you want to view cores as offload engines?
Development environment questions to consider:
• Do you need to employ legacy software modules?
• Does ease of code maintenance trump performance?
8. TRANSFORMING NETWORKING & STORAGE
Example: Building a More Complicated Pipeline
• Applications can be distributed/pipelined across as many cores as needed to achieve throughput
• Trade-offs on when to distribute vs. consolidate applications will vary
• The queue/ring API serves as the communication mechanism
• Current focus is a static (boot-time) configuration of queues
• The NIC driver pushes data to the flow classifier
• The classifier branches each packet out to the appropriate handler depending on packet inspection
• IPsec packets could be sent to the CPM via the CPM PMD, or handled on-CPU for non-accelerated platforms
• This is just an example

[Figure: Poll Mode Driver Rx → Flow Classification → {Inbound IPsec pre-processing, L3 Forwarding Application, Discard Application} → IPsec post-processing → Poll Mode Driver Tx, with the NIC and a Cave Creek CPM each attached through their own poll mode drivers.]
10. TRANSFORMING NETWORKING & STORAGE
Connection Between DPDK Elements -- Rings
• Rings are the primary mechanism to move data between software units, or between software and I/O sources or hardware accelerators

[Figure: a dispatch loop linking DPDK components over rings — Poll Mode Driver Rx, Flow Classification, Inbound IPsec pre-processing, an L3 Forwarding Application (with its FIB), a Discard Application, IPsec post-processing, a free list, and Poll Mode Driver Tx — alongside customer applications and packets forwarded to other cores; a NIC and an accelerator sit at the edges.]
11. TRANSFORMING NETWORKING & STORAGE
Queue/Ring Management API
Effectively a FIFO implementation in software
• Lockless implementations for single- or multi-producer, single- or multi-consumer enqueue/dequeue
• Supports bulk enqueue/dequeue for packet bunching
• Implements watermark thresholds for back-pressure/flow control
Essential to optimizing throughput
• Used to decouple stages of a pipeline
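To make the head/tail index discipline concrete, here is a minimal single-producer/single-consumer lockless FIFO in C11 atomics. This is a hypothetical sketch, far simpler than DPDK's `rte_ring` (no multi-producer path, no watermarks): the indices only ever increase and wrap modulo 2^32, so `head - tail` always gives the fill level.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SZ   8                 /* must be a power of two */
#define RING_MASK (RING_SZ - 1)

typedef struct {
    _Atomic uint32_t head;          /* next slot the producer writes */
    _Atomic uint32_t tail;          /* next slot the consumer reads  */
    void *slots[RING_SZ];
} spsc_ring;

/* Bulk enqueue: all-or-nothing, returns the number enqueued (0 or n). */
size_t ring_enqueue_bulk(spsc_ring *r, void **objs, size_t n)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (RING_SZ - (head - tail) < n)
        return 0;                               /* not enough free slots */
    for (size_t i = 0; i < n; i++)
        r->slots[(head + i) & RING_MASK] = objs[i];
    /* Release: the stores above become visible before the new head. */
    atomic_store_explicit(&r->head, head + n, memory_order_release);
    return n;
}

/* Bulk dequeue: all-or-nothing, returns the number dequeued (0 or n). */
size_t ring_dequeue_bulk(spsc_ring *r, void **objs, size_t n)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head - tail < n)
        return 0;                               /* not enough entries */
    for (size_t i = 0; i < n; i++)
        objs[i] = r->slots[(tail + i) & RING_MASK];
    atomic_store_explicit(&r->tail, tail + n, memory_order_release);
    return n;
}
```

The all-or-nothing bulk semantics are what make packet bunching cheap: one pair of index updates amortizes over the whole burst.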
12. TRANSFORMING NETWORKING & STORAGE
How "Lockless" Operations Are Implemented
Multiple-producer enqueue example
The head and tail "pointers" are 32-bit indices, so they wrap naturally in a space of 2^32 values.
Steps:
1. ring->prod_head and ring->cons_tail are copied to local variables
2. A compare-and-swap advances ring->prod_head only if it still equals the local prod_head
3. The enqueued object is written into the ring
4. ring->prod_tail is advanced only once it equals the local prod_head, i.e. after any earlier producers have finished
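The four steps translate almost directly into C11 atomics. The sketch below is hypothetical single-object enqueue code following the recipe above; it is not the actual DPDK implementation, which also handles bulk enqueue and more elaborate free-space accounting.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MP_RING_SZ   8
#define MP_RING_MASK (MP_RING_SZ - 1)

typedef struct {
    _Atomic uint32_t prod_head;   /* slots claimed by producers     */
    _Atomic uint32_t prod_tail;   /* slots published to the consumer */
    _Atomic uint32_t cons_tail;   /* consumer progress               */
    void *slots[MP_RING_SZ];
} mp_ring;

bool mp_enqueue(mp_ring *r, void *obj)
{
    uint32_t head, next;

    /* Steps 1-2: snapshot prod_head and cons_tail, then CAS prod_head
     * forward -- it succeeds only if no other producer moved it. */
    do {
        head = atomic_load_explicit(&r->prod_head, memory_order_relaxed);
        uint32_t cons = atomic_load_explicit(&r->cons_tail,
                                             memory_order_acquire);
        if (head - cons >= MP_RING_SZ)
            return false;                       /* ring full */
        next = head + 1;                        /* wraps mod 2^32 */
    } while (!atomic_compare_exchange_weak_explicit(
                 &r->prod_head, &head, next,
                 memory_order_relaxed, memory_order_relaxed));

    /* Step 3: write the object into the slot we claimed. */
    r->slots[head & MP_RING_MASK] = obj;

    /* Step 4: publish. Wait until prod_tail reaches our old head, so
     * earlier producers complete first, then move it past our slot. */
    while (atomic_load_explicit(&r->prod_tail, memory_order_relaxed) != head)
        ;                                       /* spin */
    atomic_store_explicit(&r->prod_tail, next, memory_order_release);
    return true;
}
```

No lock is ever held: a stalled producer can delay the final publish step, but the slot claim itself never blocks, and the consumer only ever reads up to prod_tail.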
14. TRANSFORMING NETWORKING & STORAGE
Platform Quality of Service
Cache Monitoring Technology – the ability to monitor Last Level Cache occupancy for a set of RMIDs (Resource Monitoring IDs). An extensible architecture for future monitoring events.
Cache Allocation Technology – the ability to partition the Last Level Cache, enforced on a per-core basis through Class of Service mapping.
https://software.intel.com/en-us/blogs/2014/12/11/intels-cache-monitoring-technology-software-support-and-tools
15. TRANSFORMING NETWORKING & STORAGE
Cache Allocation Technology – flow
1. QoS enumeration/configuration: enumerate the QoS capabilities and configure each Class of Service with a capacity bitmask
2. QoS association: on a context switch, the OS sets the thread's Class of Service (e.g. COS=2) in the PQR register
3. Enforcement: each application memory request is tagged with its cache Class of Service, and the QoS-aware cache subsystem restricts which ways of each LLC set (way 1 … way 16 across sets 1 … n) the request may allocate into

[Figure: each COS (1–4) selects an enforcement way mask (WayMask1–4); how the architectural bitmask (BitMask1–4) maps onto a physical way mask on the enforcement target is implementation dependent.]
16. TRANSFORMING NETWORKING & STORAGE
Cache Allocation Technology
Bitmask examples: only masks with contiguous '1's are allowed.
Apps can be separated or can share LLC space:
• Isolated: determinism benefit
• Shared/overlapped: throughput benefit

Examples of overlap and isolation (8-bit masks, M7…M0):

Isolated bitmasks
• COS 1: M7–M4 set (50% of the LLC)
• COS 2: M3–M2 set (25%)
• COS 3: M1 set (12.5%)
• COS 4: M0 set (12.5%)

Overlapped bitmasks
• COS 1: M7–M0 set (100%)
• COS 2: M3–M0 set (50%)
• COS 3: M1–M0 set (25%)
• COS 4: M0 set (12.5%)
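The contiguity rule and the percentages above are easy to check programmatically. A small sketch (the helper names are illustrative, not part of any Intel API) that validates a capacity bitmask and computes its share of the LLC:

```c
#include <stdbool.h>
#include <stdint.h>

/* A CAT capacity bitmask is valid only if its set bits are contiguous. */
bool cat_mask_valid(uint32_t mask)
{
    if (mask == 0)
        return false;
    uint32_t shifted = mask / (mask & -mask);  /* drop trailing zeros  */
    return (shifted & (shifted + 1)) == 0;     /* remaining bits all 1? */
}

/* Share of the LLC granted by a mask, for a cache with n_ways ways. */
double cat_mask_share(uint32_t mask, unsigned n_ways)
{
    unsigned set = 0;
    for (uint32_t m = mask; m; m &= m - 1)
        set++;                                 /* popcount */
    return (double)set / n_ways;
}
```

For the 8-bit examples above, `cat_mask_share(0xF0, 8)` gives 0.5 (the 50% isolated COS 1), while a mask like `0xA0` fails `cat_mask_valid` because its set bits are not contiguous.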
20. TRANSFORMING NETWORKING & STORAGE
Haswell Cluster on Die (COD) Mode
On Haswell CPUs, not all of the L3 cache is on the same ring:
• Some L3 cache has higher latency to access
• Similar to NUMA, but for the L3 cache
Supported on 2S HSW-EP SKUs with 2 home agents (10+ cores)
• Targeted at NUMA workloads where latency matters more than sharing data across caching agents (Cbo)
• Reduces average LLC hit and local memory latencies
• Each home agent mostly sees requests from a reduced set of threads, which can lead to higher memory bandwidth
• OS/VMM own NUMA and process-affinity decisions

[Figure: COD mode for an 18-core HSW-EP die — the cores with their Cbo/LLC slices, the Sbo units, and home agents HA0/HA1 are split into Cluster0 and Cluster1; QPI 0/1 and the IIO are shared.]
22. TRANSFORMING NETWORKING & STORAGE
40GbE Fortville family (XL710/X710)
Comparing controller typical power:
• 82599EB: 2 x 10GbE, 5.2 watts typical power¹
• XL710: 1 x 40GbE, 3.3 watts typical power²
Source as of Aug 2014 – 1: 82599 datasheet rev 2.0, Table 11.5, 2x10GbE Twinax typical power [W]; 2: XL710 datasheet rev 1.21, Table 14-7, typical active power, 1x40GbE [W]

Power efficiency improvements:
• Up to 30% reduction in typical power
• Up to 65% reduction in watts per gigabit
• 2x increase in total bandwidth
23. TRANSFORMING NETWORKING & STORAGE
40GbE Fortville family (XL710/X710)
Port configurations: 2x10, 4x10, 1x40, 2x40
• Low-power, single-chip design for PCI Express 3.0
• Intelligent load balancing for high-performance traffic flows
• Network virtualization overlay stateless offloads for VXLAN, NVGRE, and Geneve
• Flexible pipeline processing – new features can be added after production via firmware upgrade
25. TRANSFORMING NETWORKING & STORAGE
DPPD: What is it?
• Data Plane Performance Demonstrators
• An open-source DPDK application
• BSD 3-clause license
• Available on 01.org (https://01.org/intel-data-plane-performance-demonstrators/downloads)
• Runs on the host, in a VM, and on OVS
26. TRANSFORMING NETWORKING & STORAGE
DPPD – What is it? (continued)
• The config file defines
  • Which cores are used
  • Which interfaces are used
  • Which tasks are executed and how they are configured
• This allows you to
  • Find bottlenecks and measure performance
  • Try and compare different core layouts without changing code
  • Reuse a config file on different systems (CPUs, hyper-threads, sockets, interfaces)
29. TRANSFORMING NETWORKING & STORAGE
Configuration and design
• Easily reconfigurable (parses a config file)
• Different pipelines through configuration
  • WiFi gateway
  • BNG
  • QoS
  • A combination of, or part of, the above
• Assign work to different cores
• Cache QoS management
• Configuration follows design
  • Each core is assigned a task (or set of tasks) to execute
  • Tasks are executed in round-robin fashion
  • Tasks communicate through rings
30. TRANSFORMING NETWORKING & STORAGE
DPPD: Very simple port forwarding

[Diagram: a single FWD task receives on ETH1 and transmits on ETH2]

[port 0] ;DPDK port number
name=cpe0
mac=00:00:00:00:00:01

[port 1] ;DPDK port number
name=cpe1
mac=00:00:00:00:00:02

[core 1]
name=FWD
task=0
mode=none
rx port=cpe0
tx port=cpe1