Exploit the Integrated Graphics in Packet Processing
Speaker: Francesco Corazza
Supervisor: Prof. Fulvio Risso
Course: Progetto di Reti Locali
Academic year: 2010/2011
2. Francesco Corazza 2
Scenario
Packet processing demands ever more performance:
• Increasing network speed
• More intelligence in network devices
• Deeper packet analysis
• …
Intel is the best network hardware choice thanks to:
• Economies of scale
• Price/quality ratio
• Power consumption
We will deal with packet processing on Intel platforms…
Overview
Issues:
• Intel
• Has not yet delivered efficient tools for our needs
• Discrete GPU
• Heavy
• Expensive
• Not power-saving
• Affected by the bus bottleneck
Focus:
• Consumer platforms
• CPU + GPU solutions
Two different objectives can be identified…
Presentation Structure
Objectives:
• Focus on the Field
• Focus on Integrated Graphics
Chapter Division (Focus on the Field):
• What kind of application is packet processing?
• Which features differentiate it from general computing?
• Which hardware best fits these applications?
• Which hardware is most profitable for these applications?
• How can convenient hardware be exploited in these applications?
Chapter Division (Focus on Integrated Graphics):
• GPU solutions
• CPU+GPU solutions
6. Francesco Corazza Focus on the Field 6
Focus on the field
• What kind of application is packet processing?
• Which features differentiate it from general computing?
• Which hardware best fits these applications?
• Which hardware is most profitable for these applications?
• How can convenient hardware be exploited in these applications?
Packet processing Applications
• Memory intensive
• Frequent data load from packet
• Huge amount of data involved in the processing
• No data locality
• Unpredictable loads from different memory areas
• Small tasks, over a large number of packets
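The traits listed above can be illustrated with a minimal sketch in C. The packet descriptor, the 1024-entry flow table and the hash function are hypothetical and chosen only for illustration; they do not come from the presentation. Each packet triggers one tiny task whose table access depends on the packet contents, which is why caches help so little.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024u /* hypothetical flow-table size (power of two) */

/* Hypothetical minimal packet descriptor: only the fields the task reads. */
struct packet {
    uint32_t src_ip;
    uint32_t dst_ip;
};

/* Illustrative integer mix so that nearby addresses land in distant slots. */
static uint32_t flow_hash(uint32_t src, uint32_t dst)
{
    uint32_t h = src * 2654435761u;
    h ^= dst * 2246822519u;
    h ^= h >> 15;
    return h & (TABLE_SIZE - 1u);
}

/* One small task per packet: load two header fields, then update a counter
 * at a data-dependent (hence hard to predict and to cache) table slot. */
static void count_flows(const struct packet *pkts, size_t n,
                        uint32_t table[TABLE_SIZE])
{
    assert(pkts != NULL && table != NULL);
    for (size_t i = 0; i < n; i++)
        table[flow_hash(pkts[i].src_ip, pkts[i].dst_ip)]++;
}
```

The work per packet is a handful of instructions, while the memory traffic (one packet load plus one scattered table update) dominates, matching the "memory intensive, no data locality" profile.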
General computing vs. Packet processing
General Computing Application:
• Core activity: CPU bound, ALU-based computation
• Memory access patterns: locality pattern, caches are useful
• Structure: complex tasks launched once, small amount of memory required
Packet Processing Application:
• Core activity: memory bound, load/store-based computation
• Memory access patterns: random pattern, unpredictable loads from memory
• Structure: very repetitive small tasks, huge amount of memory involved
Differences in hardware will mirror differences in software…
Network Processors
• Memory
• Narrow data buses
• Multiple data buses
• Memory Hierarchies
• Few caches
• Superscalar execution
• Massive number of threads
• Thread-level parallelism
• Zero-overhead switching
• Asynchronous code
Packet processing Applications
• Memory intensive
• Huge amount of data involved in the processing
• Frequent data load from packet
• No data locality
• Unpredictable loads from different memory areas
• Small tasks, over a large number of packets
Packet processing is a market niche, so the industry was obliged to
move to solutions borrowed from mainstream consumer market…
Network Hardware Evolution
Economies of scale have pushed out specialized hardware (in chronological order):
• Network Processors
• Cisco
• Tilera
• …
• Consumer Processors
• GPU solutions
• Nvidia Fermi
• CPU+GPU solutions
• Our investigation lies here
• Hybrid Processors
• Intel Many Integrated Core
• AMD Fusion
Focus on the field
• What kind of application is packet processing?
• Which features differentiate it from general computing?
• Which hardware best fits these applications?
• Which hardware is most profitable for these applications?
• GPU
• CPU + GPU
• Intel MIC
• How can convenient hardware be exploited in these applications?
GPU – Features
• Shared Memory
• High bandwidth
• Coalesced access
• Lots of Execution Units
• Slow cores
• Massive parallelism
• SIMT execution model
• More flexible than SIMD
Packet processing Applications
• Memory intensive
• Huge amount of data involved in the processing
• Frequent data load from packet
• No data locality
• Unpredictable loads from different memory areas
• Small tasks, over a large number of packets
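To make "coalesced access" concrete, here is a rough sketch in C that counts how many memory segments a group of threads touches for a given access stride. The group size of 32 threads and the 128-byte segment are assumptions picked for illustration, not figures from the slides; real hardware differs by vendor and generation.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_THREADS 32   /* hypothetical number of threads served together */
#define SEGMENT_BYTES 128  /* hypothetical size of one memory transaction */

/* Count the distinct SEGMENT_BYTES-sized segments touched when thread t of a
 * group reads the 4-byte word at index t * stride_words. Fewer segments
 * means fewer memory transactions, i.e. better coalescing. */
static int segments_touched(size_t stride_words)
{
    assert(stride_words >= 1);
    int count = 0;
    size_t last_segment = (size_t)-1; /* sentinel: no segment seen yet */
    for (int t = 0; t < GROUP_THREADS; t++) {
        size_t byte_addr = (size_t)t * stride_words * 4u;
        size_t segment = byte_addr / SEGMENT_BYTES;
        /* Addresses grow with t, so a different segment is always a new one. */
        if (segment != last_segment) {
            count++;
            last_segment = segment;
        }
    }
    return count;
}
```

With a stride of one word, all 32 threads fall into a single segment; with a stride of 32 words, every thread needs its own segment. The scattered, data-dependent loads typical of packet processing tend toward the second case.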
CPU + GPU solutions
… just wait a few slides to find out how it ends up
Let's take a look at the architectures that we will face in the future…
Intel MIC (Many Integrated Core)
• Built on the Single-Chip Cloud Computer and Larrabee research projects
• Programming GPU with x86 Instruction Set
• Development tools in common with Xeon
• Same tools can compile both for the processor and for the co-processor
• HPC market target
• Knights Corner (First Implementation):
• 50 x86 cores: four threads, 64KB L1, 256KB L2 cache, 512-bit
vector unit, GDDR5 memory, PCI Express 2.0
Focus on the field
• What kind of application is packet processing?
• Which features differentiate it from general computing?
• Which hardware best fits these applications?
• Which hardware is most profitable for these applications?
• How can convenient hardware be exploited in these applications?
• GPGPU
• DirectCompute
• OpenCL
GPGPU – Overview
• General-Purpose computing on graphics processing units
• Programming GPUs through accessible programming interfaces
and industry-standard languages such as C
• Allows software developers to use stream processing on non-graphics data
• Competing interfaces
• Nvidia Compute Unified Device Architecture (CUDA)
• AMD Stream (now merged into OpenCL)
• Microsoft DirectCompute (new subset of the DirectX 10/11 APIs)
• Convergence towards standardization (like OpenGL)
• Khronos Group OpenCL
These frameworks lie just above the hardware…
GPGPU – Layer representation
• Applications: media playback or processing, media UI, recognition, etc.; technical
• Domain Languages: Accelerator, Brook+, Rapidmind, Ct
• Domain Libraries: MKL, ACML, cuFFT, D3DX, etc.
• Compute Languages: DirectCompute, CUDA, CAL, OpenCL, LRB Native, etc.
• Processors: CPU, GPU, Larrabee (nVidia, Intel, AMD, S3, etc.)
GPGPU – Analysis
• CUDA
• Tight hardware integration
• Dependence on Nvidia hardware
• OpenCL
• Gives up lower-level hooks into the architecture
• Heterogeneous computational resources
• Integration in the Khronos family (e.g. OpenGL)
• DirectCompute
• Windows only (Wine/Mono support is immature)
• Integration in DirectX APIs
• GPGPU under the hood of Windows 7
Given how widespread they are, we are going to cover the latter two…
DirectCompute
Exposes the compute functionality of the GPU as a new
type of shader (tool that determines the final appearance of an object's surface)
• Compute Shader
• Delivers the performance of 3-D games to new applications
• Rendering integration
• Demonstrates tight integration between computation and rendering
• Supported by all processor vendors
• DirectX 10.1/11.0 respectively support Compute Shader 4.0/5.0
• Scalable parallel processing model
• Code should scale for several generations
DirectCompute – Rendering Pipeline
Render scene
Write out scene image
Use Compute for
image post-processing
Output final image
DirectCompute – Programming Model
Dispatch
• 3D grid of thread groups
Thread Group
• 3D grid of threads
• numThreads(nX, nY, nZ)
Thread
• One invocation of a shader
Threads in the same group run concurrently
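The dispatch / thread group / thread hierarchy maps onto simple index arithmetic. The sketch below, in plain C rather than HLSL, mirrors how the HLSL system values relate to each other: `SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID`, and `SV_GroupIndex` is the flattened position inside the group. The `struct uint3` type and the function names are stand-ins invented for this illustration.

```c
#include <assert.h>

/* Plain-C stand-in for HLSL's uint3. */
struct uint3 { unsigned x, y, z; };

/* Global thread coordinates: group offset plus position inside the group. */
static struct uint3 dispatch_thread_id(struct uint3 group_id,
                                       struct uint3 group_thread_id,
                                       struct uint3 num_threads)
{
    assert(group_thread_id.x < num_threads.x);
    assert(group_thread_id.y < num_threads.y);
    assert(group_thread_id.z < num_threads.z);
    struct uint3 id;
    id.x = group_id.x * num_threads.x + group_thread_id.x;
    id.y = group_id.y * num_threads.y + group_thread_id.y;
    id.z = group_id.z * num_threads.z + group_thread_id.z;
    return id;
}

/* Flattened index of a thread inside its own group. */
static unsigned group_index(struct uint3 group_thread_id,
                            struct uint3 num_threads)
{
    return group_thread_id.z * num_threads.x * num_threads.y
         + group_thread_id.y * num_threads.x
         + group_thread_id.x;
}

/* Total shader invocations started by one dispatch. */
static unsigned total_threads(struct uint3 groups, struct uint3 num_threads)
{
    return groups.x * groups.y * groups.z
         * num_threads.x * num_threads.y * num_threads.z;
}
```

For example, a dispatch of 16 x 16 x 1 groups with 4 x 4 x 1 threads each starts 4096 shader invocations, one per element of a 64 x 64 buffer.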
DirectCompute – Execution Model
• A thread is executed by a scalar processor
• A thread group is executed on a multiprocessor
• A compute shader kernel is launched as a grid of thread groups (only one grid of thread groups can execute on a device at one time)
DirectCompute – Example HLSL code
struct BufferStruct { uint4 color; };
// Thread-group size (threads per group along x and y)
#define thread_group_size_x 4
#define thread_group_size_y 4
RWStructuredBuffer<BufferStruct> g_OutBuff;
// Number of threads in a thread group: 4x4x1 in this example,
// i.e. equivalent to [numthreads( 4, 4, 1 )]
[numthreads( thread_group_size_x, thread_group_size_y, 1 )]
void main( uint3 threadIDInGroup : SV_GroupThreadID,
           uint3 groupID : SV_GroupID,
           uint groupIndex : SV_GroupIndex,
           uint3 dispatchThreadID : SV_DispatchThreadID )
{
    // Thread groups along x; must match the host-side Dispatch(16, 16, 1)
    const uint N_THREAD_GROUPS_X = 16;
    // Buffer stride; assumes data stride = data width (i.e. no padding)
    uint stride = thread_group_size_x * N_THREAD_GROUPS_X;
    uint idx = dispatchThreadID.y * stride + dispatchThreadID.x;
    // The buffer element is a uint4, so build the value as uint4 (not float4)
    uint4 color = uint4(groupID.x, groupID.y, dispatchThreadID.x, dispatchThreadID.y);
    g_OutBuff[ idx ].color = color;
}
OpenCL – Overview
Open Computing Language
• Access to heterogeneous computational resources
• Parallel execution on single or multiple processors
• GPU, CPU, GPU + CPU or multiple GPUs
• Desktop and Handheld Profiles
• Work with graphics APIs
• OpenGL
• C99 with extensions
• Familiar to developers
• Rich set of built-in functions
• Easy to develop data- and task- parallel compute programs
• Defines hardware and numerical precision requirements
OpenCL – Execution Model (I)
• Work item
• Basic unit of work on an OpenCL device
• Kernel
• Basic unit of executable code
• Similar to a C function
• Data-parallel or task-parallel
• Program
• Collection of kernels and functions
• Analogous to a dynamic library
• Context
• Environment within which work-items execute
• Applications
• Queue kernel execution instances
• In-order: one queue per device
• Executed in-order or out-of-order
OpenCL – Coding (I)
• Work-item
• Smallest execution entity
• Every time a Kernel is launched, lots of work-items (a number
specified by the programmer) are launched, each one executing the
same code
• Unique ID
• Accessible from the kernel
• Used to distinguish the data to be processed by each work-item
• Work-group
• Allows communication and cooperation between work-items
• Reflects how work-items are organized
• (N-dimensional grid of work-items, N = 1, 2 or 3)
• Independent element of execution in the N-D domain
• ND-Range
• Computation domain (Organization level)
• Specify how work-groups are organized
• (N-dimensional grid of work-groups, N = 1, 2 or 3)
• Defines the total number of work-items that execute in parallel
OpenCL – Coding (III)
Process a 1024 x 1024 image
Global problem dimensions:
• 1024 x 1024 = 1 kernel execution per pixel
• 1,048,576 total executions
scalar:
void scalar_mul(int n,
                const float *a,
                const float *b,
                float *result)
{
    int i;
    for (i = 0; i < n; i++)
        result[i] = a[i] * b[i];
}
data-parallel:
kernel void dp_mul(global const float *a,
                   global const float *b,
                   global float *result)
{
    int id = get_global_id(0);
    result[id] = a[id] * b[id];
}
// execute dp_mul over "n" work-items
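The relation between the scalar loop and the data-parallel kernel can be emulated on a CPU in plain C. In this sketch the work-item id is passed as an explicit argument (inside a real kernel it would come from `get_global_id(0)`), and `run_1d_range` is a hypothetical stand-in for enqueueing a one-dimensional ND-Range; it is not an OpenCL API call and runs the work-items sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the dp_mul kernel, with the work-item id passed explicitly. */
static void dp_mul_work_item(size_t id, const float *a, const float *b,
                             float *result)
{
    result[id] = a[id] * b[id];
}

/* Hypothetical stand-in for launching a 1-D range of global_size work-items:
 * every id in [0, global_size) runs the kernel body exactly once. A device
 * is free to run them in any order, or in parallel, since they are
 * independent of each other. */
static void run_1d_range(size_t global_size, const float *a, const float *b,
                         float *result)
{
    assert(a != NULL && b != NULL && result != NULL);
    for (size_t id = 0; id < global_size; id++)
        dp_mul_work_item(id, a, b, result);
}

/* Global problem size for an image processed with one work-item per pixel. */
static size_t work_items_for_image(size_t width, size_t height)
{
    return width * height;
}
```

For the 1024 x 1024 image of the slide, `work_items_for_image(1024, 1024)` gives the 1,048,576 executions mentioned above.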
32. Francesco Corazza Focus on Integrated Graphics 47
CPU+GPU solutions
The architectures involved are:
• 2nd Generation Intel Core (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
• AMD Fusion
Let’s compare them…
CPU+GPU solutions
Architecture         Market Target                  Release Date
Sandy Bridge         Desktop / Hi-End               01/2011
Atom E600            Mobile / Industrial embedded   11/2010
Tegra 2              Mobile / Tablets               01/2010
AMD Fusion           Consumer / Desktop             01/2011
Focus on Integrated Graphics
• 2nd Generation Intel Core (Sandy Bridge)
• Features
• Integrated GPU
• AVX (Advanced Vector Extensions)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
• AMD Fusion
Sandy Bridge – Features (I)
• CPU die redesigned
• Chip’s northbridge and GPU are both on-die (in the previous
versions they were on a physically separate chip)
• LLC (Last Level Cache, formerly L3 Cache)
• Thanks to the new ring bus, the LLC is shared amongst all components, including the GPU
• Previously, each individual core had its own private path to the LLC
• Unified Memory Architecture (UMA)
• Architecture where the graphics subsystem does not have
exclusive dedicated memory and uses the host system’s memory
• Dynamic Video Memory Technology (DVMT)
• Hyper Threading
Sandy Bridge – Features (II)
• Turbo Boost Technology 2.0
• Adjusts the processor core and GPU frequencies to increase performance while staying within the allotted power/thermal budget
• Processor can increase individual core speed or graphics speed as the workload dictates
• Developers cannot directly control it
• AVX (Advanced Vector Extensions)
• Extends SIMD instructions from 128 bits to 256 bits
• AVX enables a single instruction to work on eight single-precision floating-point values at a time instead of the four that 128-bit SSE provides
• Increased processor performance with a minimal increase in power (HUGI: Hurry Up And Get Idle)
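The "eight instead of four" figure follows from the register width: 256 / 32 = 8 single-precision lanes, against 128 / 32 = 4. The sketch below shows the arithmetic and the blocked loop shape in portable C; it deliberately uses no AVX intrinsics, so whether a compiler turns the inner loop into 256-bit instructions is an assumption that depends on the compiler and its flags. The function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

#define FLOAT_BITS 32u
#define AVX_LANES (256u / FLOAT_BITS) /* 8 floats per 256-bit register */
#define SSE_LANES (128u / FLOAT_BITS) /* 4 floats per 128-bit register */

/* Multiply two arrays in blocks of AVX_LANES elements. Each pass of the
 * outer loop corresponds to the work of one 256-bit multiply. */
static void mul_in_blocks(const float *a, const float *b, float *out,
                          size_t n)
{
    assert(n % AVX_LANES == 0); /* sketch only: no remainder handling */
    for (size_t i = 0; i < n; i += AVX_LANES)
        for (size_t lane = 0; lane < AVX_LANES; lane++)
            out[i + lane] = a[i + lane] * b[i + lane];
}

/* Number of vector operations needed for n elements (rounded up). */
static size_t vector_ops_needed(size_t n, size_t lanes)
{
    assert(lanes > 0);
    return (n + lanes - 1) / lanes;
}
```

For 1024 floats this gives 128 vector operations with 8 lanes against 256 with 4 lanes, which is the halving that the wider registers promise for fully data-parallel loops.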
The next diagram shows the level of integration that Intel has reached…
Sandy Bridge – Block Diagram
Now we have to zoom in on the graphics processor…
Sandy Bridge – Integrated GPU (I)
Sandy Bridge – Integrated GPU (II)
• DirectCompute support
• DirectX 10.1
• The internal ISA maps one-to-one with most DirectX10 API
instructions resulting in a very CISC-like architecture
• Execution Unit (EU)
• The pipeline decoder uses only fixed-function logic to limit the overall power consumption (unlike NVIDIA and AMD, which have programmable stream processors)
• Each EU can dual issue picking instructions from multiple threads
• Transcendental math is handled by hardware in the EU and its
performance has been sped up considerably
The GPU’s parallel capabilities are exploited thanks to DirectCompute, but what about the CPU?
AVX – Overview
• KEY FEATURES
• Wider Vectors
• Increased from 128 to 256 bits
• Two 128-bit load ports
• Enhanced Data Rearrangement
• New 256-bit primitives for broadcasts, masked loads and stores, and data permutes
• Three and four Operands
• Non-destructive source for both AVX 128 and AVX 256
• Flexible unaligned memory access support
• Extensible new opcode (VEX)
• BENEFITS
• Higher peak FLOPs with good power efficiency
• Organize, access and pull only necessary data more quickly and efficiently
• Fewer register copies, better register use for both vector and scalar code
• More opportunities to fuse load and compute operations
• Code size reduction
Some assembly instructions can show the power of AVX…
AVX – Instructions (I)
AVX – Instructions (II)
AVX – Benchmarks
SIMD processing works best with data-parallel applications where the data is arranged in a structure of arrays (SoA) format. Graphics and image processing applications are often highly parallel and well-structured, and thus are typically good candidates for SIMD processing. Geometry or mesh data, on the other hand, is not always uniformly structured in a neat grid.
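The difference between the usual array of structures (AoS) and the structure of arrays (SoA) layout can be sketched in C. The packet fields and the batch size of 8 are hypothetical and only serve the illustration; the presentation does not prescribe a layout for packet data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_PKTS 8 /* hypothetical batch size */

/* Array of structures (AoS): the fields of one packet are adjacent. */
struct pkt_aos {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t length;
};

/* Structure of arrays (SoA): the same field of all packets is adjacent,
 * so consecutive SIMD lanes can load consecutive memory locations. */
struct pkts_soa {
    uint32_t src_ip[N_PKTS];
    uint32_t dst_ip[N_PKTS];
    uint16_t length[N_PKTS];
};

/* Transpose a batch of N_PKTS packets from AoS to SoA. */
static void aos_to_soa(const struct pkt_aos *in, struct pkts_soa *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < N_PKTS; i++) {
        out->src_ip[i] = in[i].src_ip;
        out->dst_ip[i] = in[i].dst_ip;
        out->length[i] = in[i].length;
    }
}

/* A per-field reduction walks one contiguous array in the SoA layout. */
static uint32_t total_length_soa(const struct pkts_soa *p)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < N_PKTS; i++)
        sum += p->length[i];
    return sum;
}
```

The transposition itself costs memory traffic, so this layout pays off mainly when several passes run over the same batch of fields.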
Sandy Bridge – Conclusion
• Interesting features for packet processing
• Integrated Memory controller
• DirectCompute
• AVX
• CPU+GPU integration is only at the physical level
• Packet processing can exploit CPU or GPU
• Unpredictable evolution
• DirectCompute could exploit CPU
• AVX could exploit GPU
• The upcoming Ivy Bridge will support both OpenCL and DirectX 11
Focus on Integrated Graphics
• 2nd Generation Intel Core (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
• Features
• Block Diagram
• Customization
• Nvidia Tegra (Tegra 2)
• AMD Fusion
Atom E600 – Features (I)
• SoC (System on Chip)
• Power optimized
• Fanless performance
• Flexible and open I/O
• Flexible for application-specific needs
• PCIe instead of proprietary FSB
• 7-year long-life support
• Hyper-Threading Technology
• Two logical processors
• SSE3 (Streaming SIMD Extensions 3)
• Support for SIMD instructions
Atom E600 – Features (II)
• Power saving
• Intel SpeedStep Technology
• Enables the operating system to program a processor to transition to
lower frequency and/or voltage levels while executing a workload
• Deep power down technology
• Able to reduce static power consumption by turning off power to cache
and other sub-systems in the processor.
• In-order processing
• Guarantees greater power efficiency, CPU will not reorder an instruction
stream to extract instruction-level parallelism
• DirectCompute support
• None: Tunnel Creek supports only DirectX 9
The next diagram gives an inside view of the Atom architecture…
Atom E600 – Block Diagram
Atom does not support DirectCompute, so we have to concentrate on the great flexibility of the architecture…
Atom E600 – Customization
• Open connection
• Developers can attach the processor to a variety of chipsets
• Application-specific third-party chipsets
• FPGAs
• ASICs
• Processor can be used without a chipset (limited I/O needs)
• The processor’s four PCIe connections can attach to discrete PCIe peripherals such as Ethernet controllers
Atom E600 – Conclusion
• Interesting features for packet processing
• Power saving features
• Long support
• Flexible Architecture
• No support for GPGPU
• Old school GPGPU
• Use OpenGL ES 2.0 shaders (programmable shaders)
• Rewrite the code as a fragment shader
• Wait for Cedar Trail (2011 – not yet released)
• DirectX 10.1
Focus on Integrated Graphics
• 2nd Generation Intel Core (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
• Features
• Block Diagram
• AMD Fusion
Tegra – Features
• SoC (System-on-a-chip)
• ARM CPU Dual Core
• GeForce GPU
• ULP (Ultra-low power consumption)
• Graphics support
• No DirectX support
• No CUDA support
• OpenGL ES 2.0 support
The next diagram gives a quantitative view of a Tegra chip…
Tegra – Block Diagram
Tegra – Conclusion
• Interesting features for packet processing
• Integrated Memory controller
• Low power consumption
• No support for GPGPU
• Old school GPGPU
• Use OpenGL ES 2.0 shaders (programmable shaders)
• Rewrite the code as a fragment shader
• Wait for Tegra 3 (third quarter of 2011)
• DirectX 11
• CUDA
Focus on Integrated Graphics
• 2nd Generation Intel Core (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
• AMD Fusion
• AMD Vision
• Features
• APU Roadmap
• Integration Highlights
Fusion – AMD Vision
Fusion is a step-forward technology:
AMD has realized this heterogeneous architecture by developing APUs…
Fusion – Features (I)
Video
Fusion – Features (II)
• DirectCompute support (DirectX 11)
• OpenCL 1.1
• Additive capabilities of an APU and a discrete graphics solution
• Power-oriented benefits
• Massive SIMD GPU (SSE5)
• Programmable scalar and vector processor cores
• APU family
• Bulldozer (Sandy Bridge’s competitor)
• Performance and scalability
• Bobcat (Atom’s competitor)
Let’s compare these two solutions…
Fusion – Features (III)
Bulldozer and Bobcat also differ in their market target…
Fusion – APU roadmap
The high level of integration differentiates APUs from CPUs…
Fusion – Integration Highlights
• Shared memory
• Lower latencies
• PCI Express
• Cut down some latencies
• No discrete GPU means less:
• Cost
• Power
• Motherboard complexity
Fusion – Conclusion
• Interesting features for packet processing
• OpenCL/DirectCompute/SSE5
• Tightly integrated architecture
• New technology (First-Come-First-Served)
• OpenCL
• Could be the “El Dorado” for packet processing
• CPU/GPU working in AND/OR configuration
• Shared Memory
• Embedded implementation of Fusion technology
• AMD has declared its support for it, to bring the power of heterogeneous computing to the mainstream
66. Francesco Corazza Conclusions 85
Summary (I)
This presentation has shown several ways of exploiting integrated graphics and, more generally, consumer architectures for packet processing:
• GPGPU-driven solutions
• CUDA, OpenCL, DirectX11
• SIMD-driven solutions
• Exploit highly parallel operations through SIMD instruction sets
• AVX, SSE
• Custom hardware solutions
• Design flexible modules tailored to specific needs
• FPGA
The former solutions are the most in vogue at the moment…
Summary (II)
            OpenCL   DirectCompute   OpenGL   SSE          FPGA
Row 1       X        X               V        V (AVX)      V
Row 2       X        V               X        V (SSE 3)    V
Row 3       X        X               X        V (SSE 3)    V
Row 4       V        X               V        V (SSE 5)    V
Recommendations
Writing explicitly parallel code is more efficient than relying on hardware parallelization: