Exploit the Integrated Graphics in Packet Processing

    Presentation Transcript

    • EXPLOIT THE INTEGRATED GRAPHICS IN PACKET PROCESSING
      Speaker: Francesco Corazza
      Supervisor: Prof. Fulvio Risso
      Course: Progetto di Reti Locali
      Academic year: 2010/2011
    • Scenario
      Packet processing demands ever more performance:
      • Increasing network speed
      • More intelligence in network devices
      • Deeper packet analysis
      • …
      Intel is the best network hardware choice thanks to:
      • Economies of scale
      • Price/quality ratio
      • Power consumption
      We will deal with packet processing on Intel platforms…
    • Overview
      Issues:
      • Intel
        • Has not yet deployed efficient tools for our needs
      • Discrete GPU
        • Heavy
        • Expensive
        • Not power-saving
        • Affected by the bus bottleneck
      Focus:
      • Consumer platforms
      • CPU + GPU solutions
      Two different objectives can be identified…
    • Presentation Structure
      Objectives:
      • Focus on the Field
      • Focus on Integrated Graphics
      Chapter division:
      • What kind of application is packet processing?
      • Which features differentiate it from general computing?
      • Which hardware best fits these applications?
      • Which hardware is most profitable for these applications?
      • How can convenient hardware be exploited in these applications?
      • GPU solutions
      • CPU+GPU solutions
    • FOCUS ON THE FIELD
    • Focus on the Field
      • What kind of application is packet processing?
      • Which features differentiate it from general computing?
      • Which hardware best fits these applications?
      • Which hardware is most profitable for these applications?
      • How can convenient hardware be exploited in these applications?
    • Packet Processing Applications
      • Memory intensive
        • Frequent data loads from the packet
        • Huge amount of data involved in the processing
      • No data locality
        • Unpredictable loads from different memory areas
      • Small tasks, over a large number of packets
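The traits listed on this slide can be made concrete with a small sketch. It is not taken from the presentation: the `packet` descriptor, the table size, and the hash function are all hypothetical, chosen only to show a tiny per-packet task whose memory access depends on header contents and therefore has no locality.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLOW_TABLE_SIZE 4096u /* hypothetical size, must be a power of two */

/* Hypothetical packet descriptor holding only the fields this sketch reads. */
struct packet {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
};

/* Illustrative hash: the slot depends on header contents, so consecutive
   packets land on unrelated table entries (no data locality). */
static uint32_t flow_hash(const struct packet *p)
{
    uint32_t h = p->src_ip * 2654435761u;
    h ^= p->dst_ip * 2246822519u;
    h ^= ((uint32_t)p->src_port << 16) | p->dst_port;
    h ^= h >> 15;
    return h & (FLOW_TABLE_SIZE - 1u);
}

/* One small task per packet: a single load and store on a table slot
   that is unknown until the header has been read. */
void count_flows(const struct packet *pkts, size_t n, uint32_t *table)
{
    assert(pkts != NULL && table != NULL);
    for (size_t i = 0; i < n; i++)
        table[flow_hash(&pkts[i])] += 1u;
}
```

The arithmetic per packet is trivial. What costs time is the table access, which is why such workloads tend to be memory bound rather than CPU bound.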
    • General computing vs. Packet processing
      • General computing application
        • Core activity: CPU bound, ALU-based computation
        • Memory access patterns: locality pattern, caches are useful
        • Structure: complex tasks launched once, small amount of memory required
      • Packet processing application
        • Core activity: memory bound, load/store-based computation
        • Memory access patterns: random pattern, unpredictable loads from memory
        • Structure: very repetitive small tasks, huge amount of memory involved
      Differences in hardware will mirror differences in software…
    • Network Processors
      • Memory
        • Narrow data buses
        • Multiple data buses
        • Memory hierarchies
        • Few caches
      • Superscalar execution
        • Massive number of threads
        • Thread-level parallelism
        • Zero-overhead switching
        • Asynchronous code
      Packet processing applications:
      • Memory intensive
        • Huge amount of data involved in the processing
        • Frequent data loads from the packet
      • No data locality
        • Unpredictable loads from different memory areas
      • Small tasks, over a large number of packets
      Packet processing is a market niche, so the industry was obliged to move to solutions borrowed from the mainstream consumer market…
    • Network Hardware Evolution
      Economies of scale have pushed out specialized hardware (listed in order of time):
      • Network processors
        • Cisco
        • Tilera
        • …
      • Consumer processors
        • GPU solutions
          • Nvidia Fermi
        • CPU+GPU solutions
          • Our investigation lies here
      • Hybrid processors
        • Intel Many Integrated Core
        • AMD Fusion
    • Focus on the Field
      • What kind of application is packet processing?
      • Which features differentiate it from general computing?
      • Which hardware best fits these applications?
      • Which hardware is most profitable for these applications?
        • GPU
        • CPU + GPU
        • Intel MIC
      • How can convenient hardware be exploited in these applications?
    • GPU – Features
      • Shared memory
        • High bandwidth
        • Coalesced access
      • Lots of execution units
        • Slow cores
        • Massive parallelism
      • SIMT execution model
        • More flexible than SIMD
      Packet processing applications:
      • Memory intensive
        • Huge amount of data involved in the processing
        • Frequent data loads from the packet
      • No data locality
        • Unpredictable loads from different memory areas
      • Small tasks, over a large number of packets
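Coalesced access is the GPU feature that clashes most directly with the "no data locality" trait of packet processing. The sketch below is only an illustrative model, not a description of any specific GPU: the 128-byte transaction size and the 32-thread group used in the usage note are assumptions. It counts how many memory segments a group of threads touches when thread `t` reads the 32-bit word at index `t * stride_words`.

```c
#include <assert.h>
#include <stddef.h>

#define WORD_BYTES    4u   /* each thread reads one 32-bit word */
#define SEGMENT_BYTES 128u /* assumed size of one memory transaction */

/* Thread t reads the word at index t * stride_words. Returns how many
   distinct SEGMENT_BYTES-sized segments the whole group touches.
   Addresses grow with t, so comparing with the previous segment is enough. */
unsigned segments_touched(unsigned threads, size_t stride_words)
{
    assert(threads > 0u);
    size_t previous = 0;
    unsigned count = 1; /* thread 0 always touches segment 0 */
    for (unsigned t = 1; t < threads; t++) {
        size_t segment = ((size_t)t * stride_words * WORD_BYTES) / SEGMENT_BYTES;
        if (segment != previous) {
            count++;
            previous = segment;
        }
    }
    return count;
}
```

Under these assumptions, 32 threads reading consecutive words (`segments_touched(32, 1)`) fit in a single segment, while a stride of 32 words spreads the same 32 reads over 32 segments. Scattered per-packet lookups behave like the second case.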
    • CPU + GPU solutions
      … just wait a few slides to find out how it will end up.
      Let's take a look at the architectures that we will face in the future…
    • Intel MIC (Many Integrated Core)
      • Built on the Single-Chip Cloud Computer and Larrabee research projects
        • Programming a GPU with the x86 instruction set
      • Development tools in common with Xeon
        • The same tools can compile both for the processor and for the co-processor
        • HPC market target
      • Knights Corner (first implementation):
        • 50 x86 cores: four threads, 64KB L1, 256KB L2 cache, 512-bit vector unit, GDDR5 memory, PCI Express 2.0
    • Focus on the Field
      • What kind of application is packet processing?
      • Which features differentiate it from general computing?
      • Which hardware best fits these applications?
      • Which hardware is most profitable for these applications?
      • How can convenient hardware be exploited in these applications?
        • GPGPU
        • DirectCompute
        • OpenCL
    • GPGPU – Overview
      • General-purpose computing on graphics processing units
        • Programming GPUs through accessible programming interfaces and industry-standard languages such as C
        • Allows software developers to use stream processing on non-graphics data
      • Competing interfaces
        • Nvidia Compute Unified Device Architecture (CUDA)
        • AMD Stream (now merged into OpenCL)
        • Microsoft DirectCompute (new subset of the DirectX 10/11 APIs)
      • Convergence towards standardization (like OpenGL)
        • Khronos Group OpenCL
      These frameworks lie just above the hardware…
    • GPGPU – Layer representation
      • Applications: media playback or processing, media UI, recognition, etc.; technical
      • Domain languages: Accelerator, Brook+, Rapidmind, Ct
      • Domain libraries: MKL, ACML, cuFFT, D3DX, etc.
      • Compute languages: DirectCompute, CUDA, CAL, OpenCL, LRB Native, etc.
      • Processors: CPU, GPU, Larrabee (nVidia, Intel, AMD, S3, etc.)
    • GPGPU – Analysis
      • CUDA
        • Tight hardware integration
        • Dependence on Nvidia hardware
      • OpenCL
        • Gives up lower-level hooks into the architecture
        • Heterogeneous computational resources
        • Integration in the Khronos family (e.g. OpenGL)
      • DirectCompute
        • Windows only (Wine/Mono are immature)
        • Integration in the DirectX APIs
        • GPGPU under the hood of Windows 7
      Given how widespread they are, we are going to cover the latter two languages…
    • DirectCompute
      Exposes the compute functionality of the GPU as a new type of shader (a tool that determines the final appearance of an object's surface)
      • Compute Shader
        • Delivers the performance of 3-D games to new applications
      • Rendering integration
        • Demonstrates tight integration between computation and rendering
      • Supported by all processor vendors
        • DirectX 10.x hardware supports Compute Shader 4.x, DirectX 11 hardware supports Compute Shader 5.0
      • Scalable parallel processing model
        • Code should scale for several generations
    • DirectCompute – Rendering Pipeline
      • Render scene
      • Write out scene image
      • Use Compute for image post-processing
      • Output final image
    • DirectCompute – Programming Model
      • Dispatch
        • 3D grid of thread groups
      • Thread group
        • 3D grid of threads
        • numthreads(nX, nY, nZ)
      • Thread
        • One invocation of a shader
      Threads in the same group run concurrently
    • DirectCompute – Execution Model
      • A thread is executed by a scalar processor
      • A thread group is executed on a multiprocessor
      • A compute shader kernel is launched as a grid of thread groups (only one grid of thread groups can execute on a device at one time)
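The dispatch / thread group / thread hierarchy determines the system values a compute shader receives. The relations HLSL defines can be written down in plain C. This is a sketch for illustration only: the `uint3` struct and the function names are stand-ins invented here, not part of any DirectX header.

```c
#include <assert.h>

/* Plain C stand-in for the HLSL uint3 type. */
struct uint3 { unsigned x, y, z; };

/* SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID */
struct uint3 dispatch_thread_id(struct uint3 group_id,
                                struct uint3 group_thread_id,
                                struct uint3 num_threads)
{
    assert(group_thread_id.x < num_threads.x);
    assert(group_thread_id.y < num_threads.y);
    assert(group_thread_id.z < num_threads.z);
    struct uint3 id = {
        group_id.x * num_threads.x + group_thread_id.x,
        group_id.y * num_threads.y + group_thread_id.y,
        group_id.z * num_threads.z + group_thread_id.z
    };
    return id;
}

/* SV_GroupIndex: the thread's position flattened inside its own group. */
unsigned group_index(struct uint3 group_thread_id, struct uint3 num_threads)
{
    return group_thread_id.z * num_threads.x * num_threads.y
         + group_thread_id.y * num_threads.x
         + group_thread_id.x;
}
```

For example, with `numthreads(4, 4, 1)`, thread (1, 2, 0) of group (2, 3, 0) has dispatch thread ID (9, 14, 0) and group index 9.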
    • DirectCompute – Example HLSL code
      struct BufferStruct
      {
          uint4 color;
      };
      // group size
      #define thread_group_size_x 4
      #define thread_group_size_y 4
      RWStructuredBuffer<BufferStruct> g_OutBuff;
      /* This is the number of threads in a thread group, 4x4x1 in this example case */
      // e.g.: [numthreads( 4, 4, 1 )]
      [numthreads( thread_group_size_x, thread_group_size_y, 1 )]
      void main( uint3 threadIDInGroup : SV_GroupThreadID,
                 uint3 groupID : SV_GroupID,
                 uint groupIndex : SV_GroupIndex,
                 uint3 dispatchThreadID : SV_DispatchThreadID )
      {
          int N_THREAD_GROUPS_X = 16; // assumed equal to 16 in Dispatch(16,16,1)
          // buffer stride, assumes data stride = data width (i.e. no padding)
          int stride = thread_group_size_x * N_THREAD_GROUPS_X;
          int idx = dispatchThreadID.y * stride + dispatchThreadID.x;
          // uint4 matches the type of BufferStruct.color
          uint4 color = uint4(groupID.x, groupID.y, dispatchThreadID.x, dispatchThreadID.y);
          g_OutBuff[ idx ].color = color;
      }
    • OpenCL – Overview
      Open Computing Language
      • Access to heterogeneous computational resources
      • Parallel execution on single or multiple processors
        • GPU, CPU, GPU + CPU or multiple GPUs
      • Desktop and handheld profiles
      • Works with graphics APIs
        • OpenGL
      • C99 with extensions
        • Familiar to developers
        • Rich set of built-in functions
        • Easy to develop data- and task-parallel compute programs
        • Defines hardware and numerical precision requirements
    • OpenCL – Execution Model (I)
      • Work-item
        • Basic unit of work on an OpenCL device
      • Kernel
        • Basic unit of executable code
        • Similar to a C function
        • Data-parallel or task-parallel
      • Program
        • Collection of kernels and functions
        • Analogous to a dynamic library
      • Context
        • Environment within which work-items execute
      • Applications
        • Queue kernel execution instances
        • Queued in-order (one queue per device)
        • Executed in-order or out-of-order
    • OpenCL – Coding (I)
      • Work-item
        • Smallest execution entity
        • Every time a kernel is launched, many work-items (a number specified by the programmer) are launched, each one executing the same code
        • Unique ID
          • Accessible from the kernel
          • Used to distinguish the data to be processed by each work-item
      • Work-group
        • Allows communication and cooperation between work-items
        • Reflects work-item organization
          • (N-dimensional grid of work-items, N = 1, 2 or 3)
        • Independent element of execution in the N-D domain
      • ND-Range
        • Computation domain (organization level)
        • Specifies how work-groups are organized
          • (N-dimensional grid of work-groups, N = 1, 2 or 3)
        • Defines the total number of work-items that execute in parallel
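The relation between work-items, work-groups and the ND-Range comes down to a little integer arithmetic that host code usually performs before launching a kernel. The helpers below are a sketch with hypothetical names, not OpenCL API calls. They assume one dimension, no global offset, and the OpenCL 1.x rule that an explicitly chosen work-group size must divide the global size evenly.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds the number of items up to the next multiple of the work-group size. */
size_t padded_global_size(size_t items, size_t local_size)
{
    assert(local_size > 0);
    return ((items + local_size - 1) / local_size) * local_size;
}

/* Number of work-groups along one dimension of the ND-Range. */
size_t work_group_count(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Global ID of a work-item, as get_global_id() reports it when no
   global offset is used. */
size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}
```

With 1000 packets and work-groups of 64 work-items, the padded global size is 1024, giving 16 work-groups. The kernel then has to skip the 24 surplus IDs with a bounds check.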
    • OpenCL – Coding (II)
    • OpenCL – Coding (III)
      Process a 1024 x 1024 image
      Global problem dimensions:
      • 1024 x 1024 = 1 kernel execution per pixel
      • 1,048,576 total executions
      Scalar:
      void scalar_mul( int n,
                       const float *a,
                       const float *b,
                       float *result )
      {
          int i;
          for (i = 0; i < n; i++)
              result[i] = a[i] * b[i];
      }
      Data-parallel:
      kernel void dp_mul( global const float *a,
                          global const float *b,
                          global float *result )
      {
          int id = get_global_id(0);
          result[id] = a[id] * b[id];
      }
      // execute dp_mul over "n" work-items
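To see why the data-parallel kernel has no loop, it can help to emulate the launch on the host. The following plain C sketch is not OpenCL code: `run_dp_mul` is a hypothetical stand-in for enqueuing the kernel over a one-dimensional ND-Range, and `dp_mul_item` reproduces the kernel body for one work-item.

```c
#include <assert.h>
#include <stddef.h>

/* Body of dp_mul for a single work-item; `id` plays the role of
   get_global_id(0). */
static void dp_mul_item(size_t id, const float *a, const float *b, float *result)
{
    result[id] = a[id] * b[id];
}

/* Hypothetical stand-in for a 1-D ND-Range launch: a real device may run the
   work-items in any order or in parallel, here they simply run one by one. */
void run_dp_mul(size_t global_size, const float *a, const float *b, float *result)
{
    assert(a != NULL && b != NULL && result != NULL);
    for (size_t id = 0; id < global_size; id++)
        dp_mul_item(id, a, b, result);
}
```

The loop of the scalar version has moved out of the kernel and into the runtime. Because every work-item touches only its own index, the iterations are independent and can be scheduled freely.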
    • FOCUS ON INTEGRATED GRAPHICS
    • CPU+GPU solutions
      The architectures involved are:
      • Intel Core 2nd Generation (Sandy Bridge)
      • Intel Atom E600 Series (Tunnel Creek)
      • Nvidia Tegra (Tegra 2)
      • AMD Fusion
      Let's compare them…
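    • CPU+GPU solutions
      • Intel Core 2nd Generation (Sandy Bridge): market target Desktop / Hi-End, release date 01/2011
      • Intel Atom E600 Series (Tunnel Creek): market target Mobile / Industrial embedded, release date 11/2010
      • Nvidia Tegra (Tegra 2): market target Mobile / Tablets, release date 01/2010
      • AMD Fusion: market target Consumer / Desktop, release date 01/2011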
    • Focus on Integrated Graphics
      • Intel Core 2nd Generation (Sandy Bridge)
        • Features
        • Integrated GPU
        • AVX (Advanced Vector Extensions)
      • Intel Atom E600 Series (Tunnel Creek)
      • Nvidia Tegra (Tegra 2)
      • AMD Fusion
    • Sandy Bridge – Features (I)
      • CPU die redesigned
        • The chip's northbridge and GPU are both on-die (in previous versions they were on a physically separate chip)
      • LLC (Last Level Cache, formerly L3 cache)
        • Thanks to the new ring bus, the LLC is shared amongst all components, including the GPU
        • In previous generations each individual core had its own private path to the LLC
      • Unified Memory Architecture (UMA)
        • Architecture where the graphics subsystem does not have exclusive dedicated memory and uses the host system's memory
        • Dynamic Video Memory Technology (DVMT)
      • Hyper-Threading
    • Sandy Bridge – Features (II)
      • Turbo Boost Technology 2.0
        • Adjusts the processor core and GPU frequencies to increase performance while staying within the allotted power/thermal budget
        • The processor can increase individual core speed or graphics speed as the workload dictates
        • Developers cannot directly control it
      • AVX (Advanced Vector Extensions)
        • Extends SIMD instructions from 128 bits to 256 bits
        • A single AVX instruction works on eight single-precision floating-point values at a time, instead of the four that SSE provides
        • Increased processor performance with a minimal increase in power consumption (HUGI: Hurry Up and Get Idle)
      The next diagram shows the level of integration that Intel has reached…
    • Sandy Bridge – Block Diagram
      Now we have to zoom in on the graphics processor…
    • Sandy Bridge – Integrated GPU (I)
    • Sandy Bridge – Integrated GPU (II)
      • DirectCompute support
        • DirectX 10.1
        • The internal ISA maps one-to-one to most DirectX 10 API instructions, resulting in a very CISC-like architecture
      • Execution Unit (EU)
        • The pipeline decoder uses only fixed-function logic to limit the overall power consumption (unlike NVIDIA and AMD, which have programmable stream processors)
        • Each EU can dual-issue, picking instructions from multiple threads
        • Transcendental math is handled by hardware in the EU and its performance has been sped up considerably
      The GPU's parallel capabilities are exploited thanks to DirectCompute, but what about the CPU?
    • AVX – Overview
      • Key features
        • Wider vectors
          • Increased from 128 to 256 bits
          • Two 128-bit load ports
        • Enhanced data rearrangement
          • Use the new 256-bit primitives to broadcast, mask loads and stores, and permute data
        • Three and four operands
          • Non-destructive source for both AVX 128 and AVX 256
        • Flexible unaligned memory access support
        • Extensible new opcode (VEX)
      • Benefits
        • Higher peak FLOPs with good power efficiency
        • Organize, access and pull only necessary data more quickly and efficiently
        • Fewer register copies, better register use for both vector and scalar code
        • More opportunities to fuse load and compute operations
        • Code size reduction
      Some assembly instructions can show the power of AVX…
    • AVX – Instructions (I)
    • AVX – Instructions (II)
    • AVX – Code Example (I)
      High level code:
      #include <immintrin.h>
      void foo(float *a, float *b, float *r) {
          __m256 s1, s2, res;
          s1 = _mm256_loadu_ps(a);
          s2 = _mm256_loadu_ps(b);
          res = _mm256_add_ps(s1, s2);
          _mm256_storeu_ps(r, res);
      }
      Assembly:
      ; -- Begin _foo
      ALIGN 16
      PUBLIC _foo
      _foo PROC NEAR
      ; parameter 1: 4 + esp
      ; parameter 2: 8 + esp
      ; parameter 3: 12 + esp
      $B2$1:                                  ; Preds $B2$0
          mov eax, DWORD PTR [4+esp]
          mov edx, DWORD PTR [8+esp]
          mov ecx, DWORD PTR [12+esp]
          vmovups ymm0, YMMWORD PTR [eax]
          vaddps ymm1, ymm0, YMMWORD PTR [edx]
          vmovups YMMWORD PTR [ecx], ymm1
          ; LOE ebx ebp esi edi
      $B2$2:                                  ; Preds $B2$1
          ret                                 ;10.1
          ALIGN 16
          ; LOE
      _foo ENDP
      ;_foo ENDS
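The slide's `foo` adds exactly eight floats. A minimal sketch of how the same intrinsics might be applied to an array of arbitrary length follows. The function name `add_arrays` and the scalar tail loop are assumptions of this sketch, and the AVX path is compiled only when the compiler defines `__AVX__` (for example with `-mavx` on GCC or Clang), so the code still builds and runs without AVX.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* r[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(const float *a, const float *b, float *r, size_t n)
{
    size_t i = 0;
    assert(a != NULL && b != NULL && r != NULL);
#if defined(__AVX__)
    /* Main loop: eight floats per iteration, unaligned loads and stores. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(r + i, _mm256_add_ps(va, vb));
    }
#endif
    /* Scalar tail: the leftover elements, or everything when AVX is off. */
    for (; i < n; i++)
        r[i] = a[i] + b[i];
}
```

With `n = 11` and AVX enabled, one vector iteration handles elements 0 to 7 and the tail loop handles the remaining three.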
    • AVX – Benchmarks
    • AVX – Benchmarks
      SIMD processing works best with data-parallel applications where the data is arranged in a structure of arrays (SoA) format. Graphics and image processing applications are often highly parallel and well-structured, and thus are typically good candidates for SIMD processing. Geometry or mesh data, on the other hand, is not always uniformly structured in a neat grid.
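The difference between an array of structures (AoS) and the structure of arrays (SoA) layout mentioned above can be sketched with packet headers. The field set and all names below are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Array of structures: the fields of one header sit next to each other. */
struct header_aos {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t length;
    uint16_t protocol;
};

/* Structure of arrays: every field gets its own contiguous array. */
struct headers_soa {
    uint32_t *src_ip;
    uint32_t *dst_ip;
    uint16_t *length;
    uint16_t *protocol;
};

/* Copies n headers from the AoS layout into caller-provided SoA arrays. */
void aos_to_soa(const struct header_aos *in, size_t n, struct headers_soa *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out->src_ip[i]   = in[i].src_ip;
        out->dst_ip[i]   = in[i].dst_ip;
        out->length[i]   = in[i].length;
        out->protocol[i] = in[i].protocol;
    }
}

/* Reads one field only: in SoA form the loads are unit-stride, which is the
   access pattern SIMD loads are built for. */
uint64_t total_length(const struct headers_soa *h, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += h->length[i];
    return sum;
}
```

In the AoS layout the same sum would step over the other fields of every header, so a vector load would pull in mostly unwanted bytes.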
    • Sandy Bridge – Conclusion
      • Interesting features for packet processing
        • Integrated memory controller
        • DirectCompute
        • AVX
      • CPU+GPU integration is only at the physical layer
        • Packet processing can exploit the CPU or the GPU
        • Unpredictable evolution
          • DirectCompute could exploit the CPU
          • AVX could exploit the GPU
        • The upcoming Ivy Bridge will support both OpenCL and DirectX 11
    • Focus on Integrated Graphics
      • Intel Core 2nd Generation (Sandy Bridge)
      • Intel Atom E600 Series (Tunnel Creek)
        • Features
        • Block Diagram
        • Customization
      • Nvidia Tegra (Tegra 2)
      • AMD Fusion
    • Atom E600 – Features (I)
      • SoC (System on Chip)
      • Power optimized
        • Fanless performance
      • Flexible and open I/O
        • Adapts to application-specific needs
        • PCIe instead of a proprietary FSB
      • 7-year long-life support
      • Hyper-Threading Technology
        • Two logical processors
      • SSE3 (Streaming SIMD Extensions 3)
        • Support for SIMD instructions
    • Atom E600 – Features (II)
      • Power saving
        • Intel SpeedStep Technology
          • Enables the operating system to program the processor to transition to lower frequency and/or voltage levels while executing a workload
        • Deep Power Down technology
          • Reduces static power consumption by turning off power to the cache and other subsystems in the processor
        • In-order processing
          • Guarantees greater power efficiency: the CPU will not reorder an instruction stream to extract instruction-level parallelism
      • DirectCompute support
        • Not available: Tunnel Creek supports only DirectX 9
      The next diagram shows the inside of the Atom architecture…
    • Atom E600 – Block Diagram
      Atom does not support DirectCompute, so we have to concentrate on the great flexibility of the architecture…
    • Atom E600 – Customization
      • Open connection
        • Developers can attach the processor to a variety of chipsets
          • Application-specific third-party chipsets
          • FPGAs
          • ASICs
        • The processor can be used without a chipset (limited I/O needs)
          • The processor's four PCIe connections can attach to discrete PCIe peripherals such as Ethernet controllers
    • Atom E600 – Conclusion
      • Interesting features for packet processing
        • Power saving features
        • Long support
        • Flexible architecture
      • No direct support for GPGPU
        • Old-school GPGPU
          • Use OpenGL ES 2.0 shaders (programmable shaders)
          • Rewrite the code as a fragment shader
        • Wait for Cedar Trail (2011, not yet released)
          • DirectX 10.1
    • Focus on Integrated Graphics
      • Intel Core 2nd Generation (Sandy Bridge)
      • Intel Atom E600 Series (Tunnel Creek)
      • Nvidia Tegra (Tegra 2)
        • Features
        • Block Diagram
      • AMD Fusion
    • Tegra – Features
      • SoC (System on a Chip)
        • Dual-core ARM CPU
        • GeForce GPU
      • ULP (ultra-low power consumption)
      • Graphics support
        • No DirectX support
        • No CUDA support
        • OpenGL ES 2.0 support
      The next diagram shows a view of a Tegra chip…
    • Tegra – Block Diagram
    • Tegra – Conclusion
      • Interesting features for packet processing
        • Integrated memory controller
        • Low power consumption
      • No direct support for GPGPU
        • Old-school GPGPU
          • Use OpenGL ES 2.0 shaders (programmable shaders)
          • Rewrite the code as a fragment shader
        • Wait for Tegra 3 (third quarter of 2011)
          • DirectX 11
          • CUDA
    • Focus on Integrated Graphics
      • Intel Core 2nd Generation (Sandy Bridge)
      • Intel Atom E600 Series (Tunnel Creek)
      • Nvidia Tegra (Tegra 2)
      • AMD Fusion
        • AMD Vision
        • Features
        • APU Roadmap
        • Integration Highlights
    • Fusion – AMD Vision
      Fusion is a step-forward technology.
      AMD has realized this heterogeneous architecture by developing APUs…
    • Fusion – Features (I)
      Video
    • Fusion – Features (II)
      • DirectCompute support (DirectX 11)
      • OpenCL 1.1
        • Additive capabilities of an APU and a discrete graphics solution
        • Power-oriented benefits
      • Massive SIMD GPU (SSE5)
        • Programmable scalar and vector processor cores
      • APU family
        • Bulldozer (Sandy Bridge's opponent)
          • Performance and scalability
        • Bobcat (Atom's opponent)
      Let's compare these two solutions…
    • Fusion – Features (III)
      The difference between Bulldozer and Bobcat is also the market target…
    • Fusion – APU roadmap
      The high level of integration differentiates APUs from CPUs…
    • Fusion – Integration Highlights
      • Shared memory
        • Lower latencies
      • PCI Express
        • Cuts down some latencies
      • No discrete GPU, so less
        • Cost
        • Power
        • Motherboard complexity
    • Fusion – Conclusion
      • Interesting features for packet processing
        • OpenCL/DirectCompute/SSE5
        • Tightly integrated architecture
        • New technology (First-Come-First-Served)
      • OpenCL
        • Could be the "El Dorado" for packet processing
          • CPU/GPU working in AND/OR configuration
          • Shared memory
        • Embedded implementation of Fusion technology
        • AMD has declared its support for it, to bring the power of heterogeneous computing to the mainstream
    • CONCLUSIONS
    • Summary (I)
      This presentation has shown several ways of exploiting integrated graphics and, more generally, consumer architectures for packet processing:
      • GPGPU-driven solutions
        • CUDA, OpenCL, DirectX 11
      • SIMD-driven solutions
        • Exploit highly parallel operations through SIMD implementations
        • AVX, SSE
      • Custom hardware solutions
        • Design flexible modules tailored to specific needs
        • FPGA
      The first kind of solution is the most in vogue at the moment…
    • Summary (II)
      Columns: OpenCL, DirectCompute, OpenGL, SSE, FPGA
      V X X V V (AVX) V X V X V (SSE 3) V X X X V (SSE 3) V V X V V (SSE 5)
    • Recommendations
      Writing explicitly parallel code is more efficient than relying on hardware parallelization.
    • THANK YOU
      Questions?