Exploit the Integrated Graphics in Packet Processing


Speaker: Francesco Corazza
Supervisor: Prof. Fulvio Risso
Course: Progetto di Reti Locali
Academic year: 2010/2011


Scenario
Packet processing is demanding more performance:
• Increasing network speed
• More intelligence in network devices
• Deeper packet analysis
• …

Intel is the best network hardware choice thanks to:
• Economies of scale
• Price/quality ratio
• Power Consumption

We will deal with packet processing on Intel platforms…


Overview
Issues:
• Intel
• Has not yet deployed efficient tools for our needs
• Discrete GPU
• Heavy
• Expensive
• Not power-saving
• Affected by the bus bottleneck

Focus:
• Consumer platforms
• CPU + GPU solutions

Two different objectives can be identified…


Presentation Structure
Objectives:
• Focus on the Field
• Focus on Integrated Graphics

Chapter Division:
• What kind of application is packet processing?
• Which features differentiate it from general computing?
• Which hardware best fits these applications?
• Which hardware is most profitable for these applications?
• How can convenient hardware be exploited in these applications?
• GPU solutions
• CPU+GPU solutions


Focus on the field
• What kind of application is packet processing?
• Which features differentiate it from general computing?
• Which hardware best fits these applications?
• Which hardware is most profitable for these applications?
• How can convenient hardware be exploited in these applications?


Packet processing Applications
• Memory intensive
• Frequent data load from packet
• Huge amount of data involved in the processing
• No data locality
• Unpredictable loads from different memory areas
• Small tasks, over a large number of packets
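
The characteristics above can be made concrete with a minimal C sketch of a per-packet task. Everything here is hypothetical and chosen only for illustration: the table size, the toy hash, and the names `flow_hash` and `process_packet` do not come from the presentation. The only fixed facts are the IPv4 header offsets of the source (byte 12) and destination (byte 16) addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLOW_TABLE_SIZE 4096u  /* hypothetical size, must be a power of two */

/* Hypothetical per-flow state: one packet counter per table slot. */
static uint32_t flow_packets[FLOW_TABLE_SIZE];

/* Toy multiplicative hash of the IPv4 source/destination pair. */
static uint32_t flow_hash(uint32_t src_ip, uint32_t dst_ip)
{
    uint32_t h = src_ip * 2654435761u;
    h ^= dst_ip * 2246822519u;
    h ^= h >> 15;
    return h & (FLOW_TABLE_SIZE - 1u);
}

/* The whole task for one packet: a few loads from the header, one hash,
   and one read-modify-write at a slot chosen by the packet contents.
   Consecutive packets usually land on unrelated slots, so caches help little. */
static uint32_t process_packet(const uint8_t *ipv4_header)
{
    assert(ipv4_header != NULL);
    uint32_t src = 0, dst = 0;
    for (int i = 0; i < 4; i++) {
        src = (src << 8) | ipv4_header[12 + i];  /* source address bytes */
        dst = (dst << 8) | ipv4_header[16 + i];  /* destination address bytes */
    }
    uint32_t slot = flow_hash(src, dst);
    return ++flow_packets[slot];  /* packets seen so far in this slot */
}
```

The work per packet is tiny and dominated by loads and stores, while the slot that gets touched depends on traffic rather than on program order: this is the "small tasks, no data locality" profile in miniature.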



General computing vs. Packet processing
                      General Computing Application      Packet Processing Application
Core activity         CPU bound;                         Memory bound;
                      ALU-based computation              load/store-based computation
Memory access         Locality pattern;                  Random pattern;
patterns              caches are useful                  unpredictable loads from memory
Structure             Complex tasks launched once;       Very repetitive small tasks;
                      small amount of memory required    huge amount of memory involved

Differences in hardware will mirror differences in software…




Network Processors
Hardware features:
• Memory
• Narrow data buses
• Multiple data buses
• Memory Hierarchies
• Few caches
• Superscalar execution
• Massive number of threads
• Thread-level parallelism
• Zero-overhead switching
• Asynchronous code

Packet processing characteristics (for comparison):
• Memory intensive
• Huge amount of data involved in the processing
• Frequent data load from packet
• No data locality
• Unpredictable loads from different memory areas
• Small tasks, over a large number of packets

Packet processing is a niche market, so the industry was forced to
move to solutions borrowed from the mainstream consumer market…


Network Hardware Evolution
Economies of scale have pushed out specialized hardware (in chronological order):

• Network Processors
• Cisco
• Tilera
• …
• Consumer Processors
• GPU solutions
• Nvidia Fermi
• CPU+GPU solutions
• Our investigation lies here
• Hybrid Processors
• Intel Many Integrated Core
• AMD Fusion


Focus on the field
• GPU
• CPU + GPU
• Intel MIC



GPU – Features
Hardware features:
• Shared Memory
• High bandwidth
• Coalesced access
• Lots of Execution Units
• Slow cores
• Massive parallelism
• SIMT execution model
• More flexible than SIMD

Packet processing characteristics (for comparison):
• Memory intensive
• Huge amount of data involved in the processing
• Frequent data load from packet
• No data locality
• Unpredictable loads from different memory areas
• Small tasks, over a large number of packets


CPU + GPU solutions
… just wait a few slides to find out how it ends up

Let's take a look at the architectures that we will face in the future…


Intel MIC (Many Integrated Core)
• Built on the Single-Chip Cloud Computer and Larrabee research projects
• Programming GPU with x86 Instruction Set

• Development tools in common with Xeon
• Same tools can compile both for the processor and for the co-processor
• HPC market target

• Knights Corner (First Implementation):
• 50 x86 cores: four threads, 64KB L1, 256KB L2 cache, 512-bit
vector unit, GDDR5 memory, PCI Express 2.0


Focus on the field
• GPGPU
• DirectCompute
• OpenCL


GPGPU – Overview
• General-Purpose computing on graphics processing units
• Programming GPUs through accessible programming interfaces
and industry-standard languages such as C
• Allows software developers to use stream processing on non-
graphics data
• Competing interfaces
• Nvidia Compute Unified Device Architecture (CUDA)
• AMD Stream (now joined into OpenCL)
• Microsoft DirectCompute (new subset of DirectX10/11 APIs)
• Convergence towards standardization (like OpenGL)
• Khronos Group OpenCL

These frameworks lie just above the hardware…


GPGPU – Layer representation

Layers, from top to bottom:
• Applications: media playback or processing, media UI, recognition, technical, etc.
• Domain Libraries: MKL, ACML, cuFFT, D3DX, etc.
• Domain Languages: Accelerator, Brook+, Rapidmind, Ct
• Compute Languages: DirectCompute, CUDA, CAL, OpenCL, LRB Native, etc.
• Processors: CPU, GPU, Larrabee (nVidia, Intel, AMD, S3, etc.)


GPGPU – Analysis
• CUDA
• Tight hardware integration
• Dependence on Nvidia hardware
• OpenCL
• Gives up lower-level hooks into the architecture
• Heterogeneous computational resources
• Integration in the Khronos family (e.g. OpenGL)
• DirectCompute
• Only Windows (Wine/Mono are immature)
• Integration in DirectX APIs
• GPGPU under the hood of Windows 7

Given how widespread they are, we are going to cover the latter two…


DirectCompute
Exposes the compute functionality of the GPU as a new
type of shader (tool that determines the final appearance of an object's surface)
• Compute Shader
• Delivers the performance of 3-D games to new applications
• Rendering integration
• Demonstrates tight integration between computation and rendering
• Supported by all processor vendors
• DirectX 10, 10.1 and 11 class hardware supports Compute Shader 4.0, 4.1 and 5.0 respectively
• Scalable parallel processing model
• Code should scale for several generations


DirectCompute – Rendering Pipeline

Render scene

Write out scene image

Use Compute for
image post-processing

Output final image


DirectCompute – Programming Model
Dispatch
• 3D grid of thread groups

Thread Group
• 3D grid of threads
• numThreads(nX, nY, nZ)

Thread
• One invocation of a shader

Threads in the same group run concurrently
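
The relationship between the three levels can be sketched in plain C. This is only an illustration of the index arithmetic that Direct3D 11 documents for the compute shader system values; the `uint3` struct and the function names are hypothetical stand-ins, not part of any API.

```c
#include <assert.h>

/* Hypothetical stand-in for the HLSL uint3 type. */
typedef struct { unsigned x, y, z; } uint3;

/* SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID */
static uint3 dispatch_thread_id(uint3 group_id, uint3 group_thread_id, uint3 num_threads)
{
    assert(group_thread_id.x < num_threads.x);
    assert(group_thread_id.y < num_threads.y);
    assert(group_thread_id.z < num_threads.z);
    uint3 id = {
        group_id.x * num_threads.x + group_thread_id.x,
        group_id.y * num_threads.y + group_thread_id.y,
        group_id.z * num_threads.z + group_thread_id.z
    };
    return id;
}

/* SV_GroupIndex = z * nX * nY + y * nX + x (flattened index inside the group) */
static unsigned group_index(uint3 group_thread_id, uint3 num_threads)
{
    return group_thread_id.z * num_threads.x * num_threads.y
         + group_thread_id.y * num_threads.x
         + group_thread_id.x;
}

/* Total shader invocations for Dispatch(groups.x, groups.y, groups.z). */
static unsigned long total_threads(uint3 groups, uint3 num_threads)
{
    return (unsigned long)groups.x * groups.y * groups.z
         * num_threads.x * num_threads.y * num_threads.z;
}
```

For example, a dispatch of 16 x 16 x 1 groups with numThreads(4, 4, 1) launches 4096 threads, and thread (1, 2, 0) of group (2, 3, 0) gets dispatch thread ID (9, 14, 0).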


DirectCompute – Execution Model

• A thread is executed by a scalar processor

• A thread group is executed on a multiprocessor

• A compute shader kernel is launched as a grid of thread groups
(only one grid of thread groups can execute on a device at one time)


DirectCompute – Example HLSL code
struct BufferStruct { uint4 color; };

// Thread-group size
#define thread_group_size_x 4
#define thread_group_size_y 4

RWStructuredBuffer<BufferStruct> g_OutBuff;

// Number of threads in a thread group: 4x4x1 in this example,
// i.e. equivalent to [numthreads(4, 4, 1)]
[numthreads(thread_group_size_x, thread_group_size_y, 1)]
void main(uint3 threadIDInGroup  : SV_GroupThreadID,
          uint3 groupID          : SV_GroupID,
          uint  groupIndex       : SV_GroupIndex,
          uint3 dispatchThreadID : SV_DispatchThreadID)
{
    // Assumes the host calls Dispatch(16, 16, 1)
    const uint N_THREAD_GROUPS_X = 16;
    // Buffer stride; assumes data stride == data width (i.e. no padding)
    const uint stride = thread_group_size_x * N_THREAD_GROUPS_X;
    uint idx = dispatchThreadID.y * stride + dispatchThreadID.x;
    // The buffer element is a uint4, so build the value as uint4 directly
    uint4 color = uint4(groupID.x, groupID.y, dispatchThreadID.x, dispatchThreadID.y);
    g_OutBuff[idx].color = color;
}


OpenCL – Overview
Open Computing Language
• Access to heterogeneous computational resources
• Parallel execution on single or multiple processors
• GPU, CPU, GPU + CPU or multiple GPUs
• Desktop and Handheld Profiles
• Work with graphics APIs
• OpenGL
• C99 with extensions
• Familiar to developers
• Rich set of built-in functions
• Easy to develop data- and task- parallel compute programs
• Defines hardware and numerical precision requirements


OpenCL – Execution Model (I)
• Work item
• Basic unit of work on an OpenCL device
• Kernel
• Basic unit of executable code
• Similar to a C function
• Data-parallel or task-parallel
• Program
• Collection of kernels and functions
• Analogous to a dynamic library
• Context
• Environment within which work-items execute
• Applications
• Queue kernel execution instances
• In-order: one queue to a device
• Executed in-order or out-of-order


OpenCL – Coding (I)
• Work-item
• Smallest execution entity
• Every time a Kernel is launched, lots of work-items (a number
specified by the programmer) are launched, each one executing the
same code
• Unique ID
• Accessible from the kernel
• Used to distinguish the data to be processed by each work-item
• Work-group
• Allows communication and cooperation between work-items
• Reflects how work-items are organized
• (N-dimensional grid of work-items, N = 1, 2 or 3)
• Independent element of execution in the N-D domain
• ND-Range
• Computation domain (organization level)
• Specifies how work-groups are organized
• (N-dimensional grid of work-groups, N = 1, 2 or 3)
• Defines the total number of work-items that execute in parallel
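
The sizing arithmetic behind work-items, work-groups and the ND-Range can be sketched for one dimension in plain C. The helper names are hypothetical; the sketch assumes a zero global offset and the OpenCL 1.x requirement that the global size be a multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Round the number of items up to a multiple of the work-group size,
   a common way to pick the global size on the host side. */
static size_t round_up_global(size_t n_items, size_t local_size)
{
    assert(local_size > 0);
    return ((n_items + local_size - 1) / local_size) * local_size;
}

/* Number of work-groups in one dimension of the ND-Range. */
static size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* With a zero global offset:
   get_global_id(d) == get_group_id(d) * get_local_size(d) + get_local_id(d) */
static size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}
```

For instance, 1000 items with work-groups of 64 give a global size of 1024 and 16 work-groups; the kernel then has to ignore the 24 padding work-items, typically with an `if (id < n)` guard.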


OpenCL – Coding (II)


OpenCL – Coding (III)
Process a 1024 x 1024 image
Global problem dimensions:
• 1024 x 1024 = 1 kernel execution per pixel
• 1,048,576 total executions

Scalar:

void scalar_mul(int n,
                const float *a,
                const float *b,
                float *result)
{
    int i;
    for (i = 0; i < n; i++)
        result[i] = a[i] * b[i];
}

Data-parallel:

kernel void dp_mul(global const float *a,
                   global const float *b,
                   global float *result)
{
    int id = get_global_id(0);
    result[id] = a[id] * b[id];
}
// execute dp_mul over "n" work-items
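
To see what "execute dp_mul over n work-items" means, here is a sequential C stand-in for the launch. It is not OpenCL host code: `dp_mul_item` and `launch_dp_mul` are hypothetical names, and the `id` parameter plays the role of `get_global_id(0)`. A real runtime would run the work-items in parallel and in no guaranteed order.

```c
#include <assert.h>
#include <stddef.h>

/* Body of dp_mul for a single work-item; id stands in for get_global_id(0). */
static void dp_mul_item(size_t id, const float *a, const float *b, float *result)
{
    result[id] = a[id] * b[id];
}

/* Sequential stand-in for launching dp_mul over a 1-D range of n work-items. */
static void launch_dp_mul(size_t n, const float *a, const float *b, float *result)
{
    assert(a != NULL && b != NULL && result != NULL);
    for (size_t id = 0; id < n; id++)
        dp_mul_item(id, a, b, result);
}
```

The loop of `scalar_mul` has not disappeared: it has moved out of the kernel and into the runtime, which is free to spread its iterations across the device.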


CPU+GPU solutions
The architectures involved are:
• 2nd Generation Intel Core (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
• AMD Fusion

Let’s compare them…


CPU+GPU solutions
Architecture                   Market Target                    Release Date
Intel Sandy Bridge             Desktop / Hi-End                 01/2011
Intel Atom E600                Mobile / Industrial embedded     11/2010
Nvidia Tegra 2                 Mobile / Tablets                 01/2010
AMD Fusion                     Consumer / Desktop               01/2011


Focus on Integrated Graphics
• Features
• Integrated GPU
• AVX (Advanced Vector Extensions)
• AMD Fusion


Sandy Bridge – Features (I)
• CPU die redesigned
• Chip’s northbridge and GPU are both on-die (in the previous
versions they were on a physically separate chip)
• LLC (Last Level Cache, formerly L3 Cache)
• Thanks to the new ring bus, the LLC is shared amongst all components,
including the GPU
• In previous generations, each individual core had its own private path to the LLC
• Unified Memory Architecture (UMA)
• Architecture where the graphics subsystem does not have
exclusive dedicated memory and uses the host system’s memory
• Dynamic Video Memory Technology (DVMT)
• Hyper Threading


Sandy Bridge – Features (II)
• Turbo Boost Technology 2.0
• Adjusts the processor core and GPU frequencies to increase
performance while staying within the allotted power/thermal budget
• Processor can increase individual core speed or graphics speed as
the workload dictates
• Developers cannot directly control it
• AVX (Advanced Vector Extensions)
• Extends SIMD instructions from 128 bits to 256 bits
• AVX enables a single instruction to work on eight single-precision
floating-point values at a time instead of the four that 128-bit SSE provides
• Increased processor performance with a minimal increase in power
(HUGI: Hurry Up and Get Idle)

The next diagram shows the level of integration that Intel has reached…


Sandy Bridge – Block Diagram

Now we have to zoom in on the graphics processor…


Sandy Bridge – Integrated GPU (I)


Sandy Bridge – Integrated GPU (II)
• DirectCompute support
• DirectX 10.1
• The internal ISA maps one-to-one with most DirectX10 API
instructions resulting in a very CISC-like architecture
• Execution Unit (EU)
• The pipeline decoder uses only fixed-function logic to limit the
overall power consumption (unlike NVIDIA and AMD, which have
programmable stream processors)
• Each EU can dual issue picking instructions from multiple threads
• Transcendental math is handled by hardware in the EU and its
performance has been sped up considerably

The GPU's parallel capabilities are exploited thanks to DirectCompute, but
what about the CPU?


AVX – Overview
• KEY FEATURES
• Wider vectors
• Increased from 128 to 256 bits
• Two 128-bit load ports
• Enhanced data rearrangement
• New 256-bit primitives for broadcasts, masked loads and stores, and data permutes
• Three and four operands
• Non-destructive source operands for both AVX 128 and AVX 256
• Flexible unaligned memory access support
• Extensible new opcode encoding (VEX)
• BENEFITS
• Higher peak FLOPS with good power efficiency
• Organize, access, and pull only the necessary data more quickly and efficiently
• Fewer register copies, better register use for both vector and scalar code
• More opportunities to fuse load and compute operations
• Code size reduction

Some assembly instructions can show the power of AVX…


AVX – Instructions (I)


AVX – Instructions (II)


AVX – Code Example (I)

High level code:

#include <immintrin.h>

void foo(float *a, float *b, float *r)
{
    __m256 s1, s2, res;

    s1 = _mm256_loadu_ps(a);
    s2 = _mm256_loadu_ps(b);

    res = _mm256_add_ps(s1, s2);

    _mm256_storeu_ps(r, res);
}

Assembly:

; -- Begin _foo
ALIGN 16
PUBLIC _foo
_foo PROC NEAR
; parameter 1: 4 + esp
; parameter 2: 8 + esp
; parameter 3: 12 + esp
$B2$1: ; Preds $B2$0
    mov eax, DWORD PTR [4+esp]
    mov edx, DWORD PTR [8+esp]
    mov ecx, DWORD PTR [12+esp]
    vmovups ymm0, YMMWORD PTR [eax]
    vaddps ymm1, ymm0, YMMWORD PTR [edx]
    vmovups YMMWORD PTR [ecx], ymm1
; LOE ebx ebp esi edi
$B2$2: ; Preds $B2$1
    ret ;10.1
    ALIGN 16
; LOE
_foo ENDP
;_foo ENDS
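
The `foo` example handles exactly eight floats. A minimal sketch of how the same intrinsics might be applied to arrays of arbitrary length is shown below; the function name `add_arrays` is hypothetical. It assumes the compiler defines `__AVX__` when AVX code generation is enabled (for example with `-mavx` on GCC or Clang) and falls back to plain scalar code otherwise.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Element-wise r[i] = a[i] + b[i] for n floats. */
static void add_arrays(const float *a, const float *b, float *r, size_t n)
{
    assert(a != NULL && b != NULL && r != NULL);
    size_t i = 0;
#ifdef __AVX__
    /* Main loop: one 256-bit register holds eight single-precision floats. */
    for (; i + 8 <= n; i += 8) {
        __m256 s1 = _mm256_loadu_ps(a + i);   /* unaligned load */
        __m256 s2 = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(r + i, _mm256_add_ps(s1, s2));
    }
#endif
    /* Scalar tail: the last n % 8 elements, or everything without AVX. */
    for (; i < n; i++)
        r[i] = a[i] + b[i];
}
```

The unaligned load/store variants are used, as in `foo`, so the caller does not have to guarantee 32-byte alignment.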


AVX – Benchmarks


AVX – Benchmarks

SIMD processing works best with data-parallel
applications where the data is arranged in a
structure of arrays (SoA) format. Graphics and image
processing applications are often highly parallel and
well-structured, and thus are typically good
candidates for SIMD processing. Geometry or mesh
data, on the other hand, is not always uniformly
structured in a neat grid.
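
As an illustration of the SoA layout applied to this deck's subject, the sketch below converts a small batch of packet descriptors from array-of-structures to structure-of-arrays form. The field selection, the batch size and all names are hypothetical and chosen only to show the two layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_PACKETS 8  /* hypothetical batch size */

/* Array of structures (AoS): the fields of one packet are adjacent. */
struct pkt_aos {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t length;
    uint8_t  ttl;
};

/* Structure of arrays (SoA): the same field of consecutive packets is
   adjacent, so one vector load can fetch that field for several packets. */
struct pkt_soa {
    uint32_t src_ip[N_PACKETS];
    uint32_t dst_ip[N_PACKETS];
    uint16_t length[N_PACKETS];
    uint8_t  ttl[N_PACKETS];
};

/* Transpose a batch from AoS to SoA. */
static void aos_to_soa(const struct pkt_aos *in, struct pkt_soa *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < N_PACKETS; i++) {
        out->src_ip[i] = in[i].src_ip;
        out->dst_ip[i] = in[i].dst_ip;
        out->length[i] = in[i].length;
        out->ttl[i]    = in[i].ttl;
    }
}

/* A per-field pass over contiguous data: the access pattern SIMD favors. */
static uint32_t total_length(const struct pkt_soa *batch)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < N_PACKETS; i++)
        sum += batch->length[i];
    return sum;
}
```

Whether the transposition pays off depends on how many per-field passes follow it; for a single cheap operation per packet the copy may cost more than it saves.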


Sandy Bridge – Conclusion
• Interesting features for packet processing
• Integrated Memory controller
• DirectCompute
• AVX
• CPU+GPU integration is only on the physical layer
• Packet processing can exploit CPU or GPU
• Unpredictable evolution
• DirectCompute could exploit CPU
• AVX could exploit GPU
• Next Ivy Bridge will support both OpenCL and DirectX11


• Features
• Block Diagram
• Customization
• AMD Fusion


Atom E600 – Features (I)
• SoC (System on Chip)
• Power optimized
• Fanless performance
• I/O flexible and open
• Flexible for application-specific needs
• PCIe instead of proprietary FSB
• 7-year long-life support

• Hyper-Threading Technology
• Two logical processors
• SSE3 (Streaming SIMD Extensions)
• Support for SIMD instructions


Atom E600 – Features (II)
• Power saving
• Intel SpeedStep Technology
• Enables the operating system to program a processor to transition to
lower frequency and/or voltage levels while executing a workload
• Deep power down technology
• Able to reduce static power consumption by turning off power to cache
and other sub-systems in the processor.
• In-order processing
• Guarantees greater power efficiency, CPU will not reorder an instruction
stream to extract instruction-level parallelism
• DirectCompute support
• Tunnel Creek supports only DirectX 9, so DirectCompute is not available

The next diagram shows the internals of the Atom architecture…


Atom E600 – Block Diagram

Atom does not support DirectCompute, so we have to concentrate on the
great flexibility of the architecture…


Atom E600 – Customization
• Open connection
• Developers can attach the processor to a variety of chipsets
• Application-specific third-party chipsets
• FPGAs
• ASICs
• Processor can be used without a chipset (limited I/O needs)
• The processor's four PCIe connections can attach to discrete
PCIe peripherals such as Ethernet controllers


Atom E600 – Conclusion
• Power saving features
• Long support
• Flexible Architecture
• No support for GPGPU
• Old school GPGPU
• Use OpenGL ES 2.0 shaders (programmable shaders)
• Rewrite the code as a fragment shader
• Wait for Cedar Trail (2011 – not yet released)
• DirectX 10.1


• Features
• Block Diagram

• AMD Fusion


Tegra – Features
• SoC (System-on-a-chip)
• ARM CPU Dual Core
• GeForce GPU
• ULP (Ultra-low power consumption)
• Graphics support
• No DirectX support
• No CUDA support
• OpenGL ES 2.0 support

The next diagram gives a quantitative view of a Tegra chip…


Tegra – Block Diagram


Tegra – Conclusion
• Integrated Memory controller
• Low power consumption
• No support for GPGPU
• Old school GPGPU
• Use OpenGL ES 2.0 shaders (programmable shaders)
• Rewrite the code as a fragment shader
• Wait for Tegra 3 (third quarter of 2011)
• DirectX 11
• CUDA


• 2nd Generation Intel Core (Sandy Bridge)
• AMD Fusion
• AMD Vision
• Features
• APU Roadmap
• Integration Highlights


Fusion – AMD Vision
Fusion is a step-forward technology:

AMD has realized this heterogeneous architecture by developing APUs…


Fusion – Features (I)
Video


Fusion – Features (II)
• DirectCompute support (DirectX 11)
• OpenCL 1.1
• Additive capabilities of an APU and a
discrete graphics solution
• Power-oriented benefits
• Massive SIMD GPU (SSE5)
• Programmable scalar and vector
processor cores
• APU family
• Bulldozer (Sandy Bridge’s opponent)
• Performance and scalability
• Bobcat (Atom’s opponent)

Let’s compare these two solutions…


Fusion – Features (III)

Bulldozer and Bobcat also differ in their market target…


Fusion – APU roadmap

The high level of integration differentiates APUs from CPUs…


Fusion – Integration Highlights
• Shared memory
• Lower latencies
• PCI Express
• Cut down some latencies
• No discrete GPU, which means less:
• Cost
• Power
• Motherboard complexity


Fusion – Conclusion
• OpenCL/DirectCompute/SSE5
• Tightly integrated architecture
• New technology (First-Come-First-Served)
• OpenCL
• Could be the “El Dorado” for packet processing
• CPU/GPU working in AND/OR configuration
• Shared Memory
• Embedded implementation of Fusion technology
• AMD has declared its support, aiming to bring the power of
heterogeneous computing into the mainstream


Summary (I)
This presentation has presented several ways of exploiting
integrated graphics and, more generally, consumer architectures
for packet processing:

• GPGPU-driven solutions
• CUDA, OpenCL, DirectX11
• SIMD-driven solutions
• Exploit highly parallel operations through SIMD instruction sets
• AVX, SSE
• Custom hardware solutions
• Design flexible modules tailored to specific needs
• FPGA

The former solutions are the most in vogue at the moment…


Summary (II)
OpenCL   DirectCompute   OpenGL   SSE   FPGA

V
X X V V
(AVX)

V
X V X V
(SSE 3)

V
X X X V
(SSE 3)

V
V X V V
(SSE 5)


Recommendations
Writing explicitly parallel code is more efficient than relying on
hardware parallelization:


