Exploit the Integrated Graphics in Packet Processing


Transcript of "Exploit the Integrated Graphics in Packet Processing"

  1. EXPLOIT THE INTEGRATED GRAPHICS IN PACKET PROCESSING
     Speaker: Francesco Corazza
     Supervisor: Prof. Fulvio Risso
     Course: Progetto di Reti Locali
     Academic year: 2010/2011
  2. Scenario
     Packet processing demands ever more performance:
     • Increasing network speed
     • More intelligence in network devices
     • Deeper packet analysis
     • …
     Intel is the best network hardware choice thanks to:
     • Economies of scale
     • Price/quality ratio
     • Power consumption
     We will deal with packet processing on Intel platforms…
  3. Overview
     Issues:
     • Intel
       • Has not yet deployed efficient tools for our needs
     • Discrete GPU
       • Heavy
       • Expensive
       • Not power-saving
       • Affected by the bus bottleneck
     Focus:
     • Consumer platforms
     • CPU + GPU solutions
     Two different objectives can be identified…
  4. Presentation Structure
     Objectives:
     • Focus on the Field
     • Focus on Integrated Graphics
     Chapter division:
     • What kind of application is packet processing?
     • Which features differentiate it from general computing?
     • Which hardware best fits these applications?
     • Which hardware is most profitable for these applications?
     • How can convenient hardware be exploited in these applications?
     • GPU solutions
     • CPU+GPU solutions
  5. FOCUS ON THE FIELD
  6. Focus on the field
     • What kind of application is packet processing?
     • Which features differentiate it from general computing?
     • Which hardware best fits these applications?
     • Which hardware is most profitable for these applications?
     • How can convenient hardware be exploited in these applications?
  7. Packet processing applications
     • Memory intensive
       • Frequent data loads from the packet
       • Huge amount of data involved in the processing
     • No data locality
       • Unpredictable loads from different memory areas
     • Small tasks, over a large number of packets
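The traits listed on this slide can be illustrated with a minimal C sketch of a per-flow packet counter. Everything in it (the table size, the `flow_entry` layout, the multiplicative hash and the function names) is a hypothetical illustration, not something taken from the presentation. It only aims to show why the work per packet is tiny while the memory access is scattered: each packet triggers one load from an essentially unpredictable table slot, a trivial update, and a store.

```c
#include <stdint.h>
#include <stddef.h>

#define FLOW_TABLE_SIZE 4096u  /* hypothetical size, must be a power of two */

/* Hypothetical per-flow record; the field layout is an assumption. */
struct flow_entry {
    uint32_t key;      /* e.g. a hash of addresses and ports */
    uint64_t packets;
    uint64_t bytes;
};

static struct flow_entry flow_table[FLOW_TABLE_SIZE];

/* Multiplicative hash: consecutive keys land in scattered slots,
 * so successive packets touch unrelated cache lines. */
static uint32_t slot_of(uint32_t key)
{
    return (key * 2654435761u) & (FLOW_TABLE_SIZE - 1u);
}

/* The whole per-packet task: one scattered load, a tiny update, one store. */
void account_packet(uint32_t flow_key, uint32_t length)
{
    struct flow_entry *e = &flow_table[slot_of(flow_key)];
    if (e->key != flow_key) {   /* slot held another flow: evict it */
        e->key = flow_key;
        e->packets = 0;
        e->bytes = 0;
    }
    e->packets += 1;
    e->bytes += length;
}

/* Number of packets recorded for a flow, or 0 if the flow is not in the table. */
uint64_t packets_seen(uint32_t flow_key)
{
    const struct flow_entry *e = &flow_table[slot_of(flow_key)];
    return e->key == flow_key ? e->packets : 0;
}
```

Calling `account_packet` once per received packet repeats the same few instructions over and over, which is the "small tasks, over a large number of packets" pattern; a real classifier would of course handle collisions more carefully.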
  8. Focus on the field
     • What kind of application is packet processing?
     • Which features differentiate it from general computing?
     • Which hardware best fits these applications?
     • Which hardware is most profitable for these applications?
     • How can convenient hardware be exploited in these applications?
  9. General computing vs. Packet processing
     General computing application
     • Core activity: CPU bound, ALU-based computation
     • Memory access patterns: locality pattern, caches are useful
     • Structure: complex tasks launched once, small amount of memory required
     Packet processing application
     • Core activity: memory bound, load/store-based computation
     • Memory access patterns: random pattern, unpredictable loads from memory
     • Structure: very repetitive small tasks, huge amount of memory involved
     Differences in hardware will mirror differences in software…
  10. Focus on the field
     • What kind of application is packet processing?
     • Which features differentiate it from general computing?
     • Which hardware best fits these applications?
     • Which hardware is most profitable for these applications?
     • How can convenient hardware be exploited in these applications?
  11. Network Processors
     • Memory
       • Narrow data buses
       • Multiple data buses
     • Memory hierarchies
       • Few caches
     • Superscalar execution
       • Massive number of threads
       • Thread-level parallelism
       • Zero-overhead switching
       • Asynchronous code
     Packet processing applications
     • Memory intensive
       • Huge amount of data involved in the processing
       • Frequent data loads from the packet
     • No data locality
       • Unpredictable loads from different memory areas
     • Small tasks, over a large number of packets
     Packet processing is a market niche, so the industry was obliged to move to solutions borrowed from the mainstream consumer market…
  12. Network Hardware Evolution
     Economies of scale have pushed out specialized hardware (in order of time):
     • Network processors
       • Cisco
       • Tilera
       • …
     • Consumer processors
       • GPU solutions
         • Nvidia Fermi
       • CPU+GPU solutions
         • Our investigation lies here
     • Hybrid processors
       • Intel Many Integrated Core
       • AMD Fusion
  13. Focus on the field
     • What kind of application is packet processing?
     • Which features differentiate it from general computing?
     • Which hardware best fits these applications?
     • Which hardware is most profitable for these applications?
       • GPU
       • CPU + GPU
       • Intel MIC
     • How can convenient hardware be exploited in these applications?
  14. GPU – Features
     • Shared memory
       • High bandwidth
       • Coalesced access
     • Lots of execution units
       • Slow cores
       • Massive parallelism
     • SIMT execution model
       • More flexible than SIMD
     Packet processing applications
     • Memory intensive
       • Huge amount of data involved in the processing
       • Frequent data loads from the packet
     • No data locality
       • Unpredictable loads from different memory areas
     • Small tasks, over a large number of packets
  15. CPU + GPU solutions
     … just wait a few slides to find out how it ends.
     Let's take a look at the architectures that we will face in the future…
  16. Intel MIC (Many Integrated Core)
     • Built on the Single-Chip Cloud Computer and Larrabee research projects
       • Programming the GPU with the x86 instruction set
     • Development tools in common with Xeon
       • The same tools can compile both for the processor and for the co-processor
       • HPC market target
     • Knights Corner (first implementation):
       • 50 x86 cores: four threads, 64KB L1, 256KB L2 cache, 512-bit vector unit, GDDR5 memory, PCI Express 2.0
  17. Focus on the field
     • What kind of application is packet processing?
     • Which features differentiate it from general computing?
     • Which hardware best fits these applications?
     • Which hardware is most profitable for these applications?
     • How can convenient hardware be exploited in these applications?
       • GPGPU
       • DirectCompute
       • OpenCL
  18. GPGPU – Overview
     • General-purpose computing on graphics processing units
       • Programming GPUs through accessible programming interfaces and industry-standard languages such as C
       • Allows software developers to use stream processing on non-graphics data
     • Competing interfaces
       • Nvidia Compute Unified Device Architecture (CUDA)
       • AMD Stream (now merged into OpenCL)
       • Microsoft DirectCompute (new subset of the DirectX 10/11 APIs)
     • Convergence towards standardization (like OpenGL)
       • Khronos Group OpenCL
     These frameworks lie just above the hardware…
  19. GPGPU – Layer representation
     • Applications: media playback or processing, media UI, recognition, technical, etc.
     • Domain languages: Accelerator, Brook+, Rapidmind, Ct
     • Domain libraries: MKL, ACML, cuFFT, D3DX, etc.
     • Compute languages: DirectCompute, CUDA, CAL, OpenCL, LRB Native, etc.
     • Processors: CPU, GPU, Larrabee (nVidia, Intel, AMD, S3, etc.)
  20. GPGPU – Analysis
     • CUDA
       • Tight hardware integration
       • Dependence on Nvidia hardware
     • OpenCL
       • Gives up lower-level hooks into the architecture
       • Heterogeneous computational resources
       • Integration in the Khronos family (e.g. OpenGL)
     • DirectCompute
       • Windows only (Wine/Mono are immature)
       • Integration in the DirectX APIs
       • GPGPU under the hood of Windows 7
     Given how widespread they are, we are going to cover the latter two…
  21. DirectCompute
     Exposes the compute functionality of the GPU as a new type of shader (a tool that determines the final appearance of an object's surface)
     • Compute Shader
       • Delivers the performance of 3-D games to new applications
     • Rendering integration
       • Demonstrates tight integration between computation and rendering
     • Supported by all processor vendors
       • DirectX 10.1/11.0 support Compute Shader 4.0/5.0 respectively
     • Scalable parallel processing model
       • Code should scale for several generations
  22. DirectCompute – Rendering Pipeline
     Render scene → Write out scene image → Use Compute for image post-processing → Output final image
  23. DirectCompute – Programming Model
     • Dispatch
       • 3D grid of thread groups
     • Thread group
       • 3D grid of threads
       • numThreads(nX, nY, nZ)
     • Thread
       • One invocation of a shader
     Threads in the same group run concurrently
  24. DirectCompute – Execution Model
     • A thread is executed by a scalar processor
     • A thread group is executed on a multiprocessor
     • A compute shader kernel is launched as a grid of thread groups (only one grid of thread groups can execute on a device at one time)
  25. DirectCompute – Example HLSL code
     struct BufferStruct
     {
         uint4 color;
     };

     // group size
     #define thread_group_size_x 4
     #define thread_group_size_y 4

     RWStructuredBuffer<BufferStruct> g_OutBuff;

     /* This is the number of threads in a thread group, 4x4x1 in this example case */
     // e.g.: [numthreads( 4, 4, 1 )]
     [numthreads( thread_group_size_x, thread_group_size_y, 1 )]
     void main( uint3 threadIDInGroup : SV_GroupThreadID, uint3 groupID : SV_GroupID,
                uint groupIndex : SV_GroupIndex, uint3 dispatchThreadID : SV_DispatchThreadID )
     {
         int N_THREAD_GROUPS_X = 16; // assumed equal to 16 in dispatch(16,16,1)
         // buffer stride, assumes data stride = data width (i.e. no padding)
         int stride = thread_group_size_x * N_THREAD_GROUPS_X;
         int idx = dispatchThreadID.y * stride + dispatchThreadID.x;
         // uint4 matches the type of BufferStruct.color
         uint4 color = uint4(groupID.x, groupID.y, dispatchThreadID.x, dispatchThreadID.y);
         g_OutBuff[ idx ].color = color;
     }
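The index arithmetic in the HLSL example can be checked on the CPU with a small C sketch. It relies on the documented relation SV_DispatchThreadID = SV_GroupID × numthreads + SV_GroupThreadID (per component); the `uint3` struct and the function names are stand-ins introduced here for illustration, not part of any DirectX API.

```c
#include <stdint.h>

/* Stand-in for the HLSL uint3 type (hypothetical helper). */
struct uint3 { uint32_t x, y, z; };

/* Global thread coordinates: group coordinates scaled by the group size,
 * plus the thread's position inside its group. */
struct uint3 dispatch_thread_id(struct uint3 group_id,
                                struct uint3 group_size,
                                struct uint3 thread_in_group)
{
    struct uint3 id;
    id.x = group_id.x * group_size.x + thread_in_group.x;
    id.y = group_id.y * group_size.y + thread_in_group.y;
    id.z = group_id.z * group_size.z + thread_in_group.z;
    return id;
}

/* Row-major buffer index as in the shader: idx = y * stride + x,
 * where stride = threads per group in x * number of groups in x. */
uint32_t buffer_index(struct uint3 dtid, uint32_t group_size_x, uint32_t n_groups_x)
{
    uint32_t stride = group_size_x * n_groups_x;
    return dtid.y * stride + dtid.x;
}
```

With the values used on the slide (4x4x1 threads per group, 16 groups along x) the stride is 64, so each thread of the 64x64 grid writes to its own slot of the output buffer.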
  26. OpenCL – Overview
     Open Computing Language
     • Access to heterogeneous computational resources
     • Parallel execution on single or multiple processors
       • GPU, CPU, GPU + CPU or multiple GPUs
     • Desktop and handheld profiles
     • Works with graphics APIs
       • OpenGL
     • C99 with extensions
       • Familiar to developers
       • Rich set of built-in functions
       • Easy to develop data- and task-parallel compute programs
       • Defines hardware and numerical precision requirements
  27. OpenCL – Execution Model (I)
     • Work-item
       • Basic unit of work on an OpenCL device
     • Kernel
       • Basic unit of executable code
       • Similar to a C function
       • Data-parallel or task-parallel
     • Program
       • Collection of kernels and functions
       • Analogous to a dynamic library
     • Context
       • Environment within which work-items execute
     • Applications
       • Queue kernel execution instances
       • In-order: one queue to a device
       • Executed in-order or out-of-order
  28. OpenCL – Coding (I)
     • Work-item
       • Smallest execution entity
       • Every time a kernel is launched, many work-items (a number specified by the programmer) are launched, each one executing the same code
       • Unique ID
         • Accessible from the kernel
         • Used to distinguish the data to be processed by each work-item
     • Work-group
       • Allows communication and cooperation between work-items
       • Reflects the organization of work-items (N-dimensional grid of work-items, N = 1, 2 or 3)
       • Independent element of execution in the N-D domain
     • ND-Range
       • Computation domain (organization level)
       • Specifies how work-groups are organized (N-dimensional grid of work-groups, N = 1, 2 or 3)
       • Defines the total number of work-items that execute in parallel
  29. OpenCL – Coding (II)
  30. OpenCL – Coding (III)
     Process a 1024 x 1024 image
     Global problem dimensions:
     • 1024 x 1024, one kernel execution per pixel
     • 1,048,576 total executions

     Scalar:
     void scalar_mul(int n,
                     const float *a,
                     const float *b,
                     float *result)
     {
         int i;
         for (i = 0; i < n; i++)
             result[i] = a[i] * b[i];
     }

     Data-parallel:
     kernel void dp_mul(global const float *a,
                        global const float *b,
                        global float *result)
     {
         int id = get_global_id(0);
         result[id] = a[id] * b[id];
     }
     // execute dp_mul over "n" work-items
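To make the relation between the two versions concrete, the following C sketch emulates the data-parallel launch sequentially on the CPU. It is only an illustration under stated assumptions: `dp_mul_item` and `run_dp_mul` are hypothetical names, the `id` parameter plays the role of `get_global_id(0)`, and a real OpenCL runtime would execute the work-items in parallel and in no guaranteed order rather than in a loop.

```c
#include <stddef.h>

/* Body of dp_mul for a single work-item; `id` stands in for get_global_id(0). */
static void dp_mul_item(size_t id, const float *a, const float *b, float *result)
{
    result[id] = a[id] * b[id];
}

/* Sequential stand-in for executing dp_mul over a 1-D range of n work-items.
 * Each iteration is independent of the others, which is what lets a device
 * run them concurrently. */
void run_dp_mul(size_t n, const float *a, const float *b, float *result)
{
    for (size_t id = 0; id < n; id++)
        dp_mul_item(id, a, b, result);
}
```

The loop of `scalar_mul` has moved out of the kernel and into the launch: the kernel describes one element, and the range size says how many elements there are.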
  31. FOCUS ON INTEGRATED GRAPHICS
  32. CPU+GPU solutions
     The architectures involved are:
     • Intel Core 2nd Generation (Sandy Bridge)
     • Intel Atom E600 Series (Tunnel Creek)
     • Nvidia Tegra (Tegra 2)
     • AMD Fusion
     Let's compare them…
  33. CPU+GPU solutions
     Architecture                    Market target                   Release date
     Intel Core (Sandy Bridge)       Desktop / Hi-End                01/2011
     Intel Atom E600                 Mobile / Industrial embedded    11/2010
     Nvidia Tegra 2                  Mobile / Tablets                01/2010
     AMD Fusion                      Consumer / Desktop              01/2011
  34. Focus on Integrated Graphics
     • Intel Core 2nd Generation (Sandy Bridge)
       • Features
       • Integrated GPU
       • AVX (Advanced Vector Extensions)
     • Intel Atom E600 Series (Tunnel Creek)
     • Nvidia Tegra (Tegra 2)
     • AMD Fusion
  35. Sandy Bridge – Features (I)
     • CPU die redesigned
       • The chip's northbridge and GPU are both on-die (in previous versions they were on a physically separate chip)
     • LLC (Last Level Cache, formerly L3 cache)
       • Thanks to the new ring bus, the LLC is shared amongst all components, including the GPU
       • Each individual core has its own private path to the LLC
     • Unified Memory Architecture (UMA)
       • Architecture where the graphics subsystem does not have exclusive dedicated memory and uses the host system's memory
       • Dynamic Video Memory Technology (DVMT)
     • Hyper-Threading
  36. Sandy Bridge – Features (II)
     • Turbo Boost Technology 2.0
       • Adjusts the processor core and GPU frequencies to increase performance while staying within the allotted power/thermal budget
       • The processor can increase individual core speed or graphics speed as the workload dictates
       • Developers cannot directly control it
     • AVX (Advanced Vector Extensions)
       • Extends SIMD instructions from 128 bits to 256 bits
       • AVX enables a single instruction to work on eight single-precision floating-point values at a time instead of the four that SSE provides
       • Increased processor performance with a minimal increase in power (HUGI: Hurry Up and Get Idle)
     The next diagram shows the level of integration that Intel has reached…
  37. Sandy Bridge – Block Diagram
     Now we have to zoom in on the graphics processor…
  38. Sandy Bridge – Integrated GPU (I)
  39. Sandy Bridge – Integrated GPU (II)
     • DirectCompute support
       • DirectX 10.1
       • The internal ISA maps one-to-one to most DirectX 10 API instructions, resulting in a very CISC-like architecture
     • Execution Unit (EU)
       • The pipeline decoder uses only fixed-function logic to limit the overall power consumption (unlike NVIDIA and AMD, which have programmable stream processors)
       • Each EU can dual-issue, picking instructions from multiple threads
       • Transcendental math is handled by hardware in the EU and its performance has been sped up considerably
     The GPU's parallel capabilities are exploited thanks to DirectCompute, but what about the CPU?
  40. AVX – Overview
     • Key features
       • Wider vectors
         • Increased from 128 to 256 bits
         • Two 128-bit load ports
       • Enhanced data rearrangement
         • Use the new 256-bit primitives to broadcast, mask loads and stores, and permute data
       • Three and four operands
         • Non-destructive source for both AVX 128 and AVX 256
       • Flexible unaligned memory access support
       • Extensible new opcode (VEX)
     • Benefits
       • Higher peak FLOPs with good power efficiency
       • Organize, access and pull only the necessary data more quickly and efficiently
       • Fewer register copies, better register use for both vector and scalar code
       • More opportunities to fuse load and compute operations
       • Code size reduction
     Some assembly instructions can show the power of AVX…
  41. AVX – Instructions (I)
  42. AVX – Instructions (II)
  43. AVX – Code Example (I)
     High level code:
     #include <immintrin.h>

     void foo(float *a, float *b, float *r)
     {
         __m256 s1, s2, res;

         s1 = _mm256_loadu_ps(a);
         s2 = _mm256_loadu_ps(b);

         res = _mm256_add_ps(s1, s2);
         _mm256_storeu_ps(r, res);
     }

     Assembly:
     ; -- Begin _foo
             ALIGN 16
             PUBLIC _foo
     _foo PROC NEAR
     ; parameter 1: 4 + esp
     ; parameter 2: 8 + esp
     ; parameter 3: 12 + esp
     $B2$1:                          ; Preds $B2$0
             mov eax, DWORD PTR [4+esp]
             mov edx, DWORD PTR [8+esp]
             mov ecx, DWORD PTR [12+esp]
             vmovups ymm0, YMMWORD PTR [eax]
             vaddps ymm1, ymm0, YMMWORD PTR [edx]
             vmovups YMMWORD PTR [ecx], ymm1
                                     ; LOE ebx ebp esi edi
     $B2$2:                          ; Preds $B2$1
             ret                     ;10.1
             ALIGN 16
                                     ; LOE
     _foo ENDP
     ;_foo ENDS
  44. AVX – Benchmarks
  45. AVX – Benchmarks
     SIMD processing works best with data-parallel applications where the data is arranged in a structure of arrays (SoA) format. Graphics and image processing applications are often highly parallel and well-structured, and thus are typically good candidates for SIMD processing. Geometry or mesh data, on the other hand, is not always uniformly structured in a neat grid.
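The structure-of-arrays layout mentioned above can be sketched in plain C. The type and function names below (`point_aos`, `points_soa`, `aos_to_soa`, `scale_x`) and the choice of eight points are assumptions made for illustration only. The point of the sketch is the memory layout: in the SoA form all `x` values are contiguous, so eight single-precision floats line up with one 256-bit vector, whereas in the array-of-structures form they are interleaved with `y` and `z`.

```c
#include <stddef.h>

#define N_POINTS 8  /* eight floats correspond to one 256-bit vector */

/* Array of structures: x, y, z of one point are adjacent,
 * so consecutive x values are three floats apart. */
struct point_aos { float x, y, z; };

/* Structure of arrays: all x values are contiguous, then all y, then all z. */
struct points_soa {
    float x[N_POINTS];
    float y[N_POINTS];
    float z[N_POINTS];
};

/* Rearrange N_POINTS points from AoS to SoA. */
void aos_to_soa(const struct point_aos *in, struct points_soa *out)
{
    for (size_t i = 0; i < N_POINTS; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* Unit-stride loop over contiguous floats: the kind of loop that maps
 * naturally onto packed SIMD instructions. */
void scale_x(struct points_soa *p, float factor)
{
    for (size_t i = 0; i < N_POINTS; i++)
        p->x[i] *= factor;
}
```

Whether a given compiler actually vectorizes `scale_x` depends on its flags and target; the sketch only shows the data arrangement that makes this possible.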
  46. Sandy Bridge – Conclusion
     • Interesting features for packet processing
       • Integrated memory controller
       • DirectCompute
       • AVX
     • CPU+GPU integration is only at the physical layer
       • Packet processing can exploit the CPU or the GPU
     • Unpredictable evolution
       • DirectCompute could exploit the CPU
       • AVX could exploit the GPU
       • The upcoming Ivy Bridge will support both OpenCL and DirectX 11
  47. Focus on Integrated Graphics
     • Intel Core 2nd Generation (Sandy Bridge)
     • Intel Atom E600 Series (Tunnel Creek)
       • Features
       • Block Diagram
       • Customization
     • Nvidia Tegra (Tegra 2)
     • AMD Fusion
  48. Atom E600 – Features (I)
     • SoC (System on Chip)
     • Power optimized
       • Fanless performance
     • I/O flexible and open
       • Flexible for application-specific needs
       • PCIe instead of a proprietary FSB
     • 7-year long-life support
     • Hyper-Threading Technology
       • Two logical processors
     • SSE3 (Streaming SIMD Extensions)
       • Support for SIMD instructions
  49. Atom E600 – Features (II)
     • Power saving
       • Intel SpeedStep Technology
         • Enables the operating system to program a processor to transition to lower frequency and/or voltage levels while executing a workload
       • Deep power down technology
         • Able to reduce static power consumption by turning off power to the cache and other sub-systems in the processor
       • In-order processing
         • Guarantees greater power efficiency: the CPU will not reorder an instruction stream to extract instruction-level parallelism
     • DirectCompute support
       • Tunnel Creek supports only DirectX 9
     The next diagram gives an insight into the Atom architecture…
  50. Atom E600 – Block Diagram
     Atom does not support DirectCompute, so we have to concentrate on the great flexibility of the architecture…
  51. Atom E600 – Customization
     • Open connection
       • Developers can attach the processor to a variety of chipsets
         • Application-specific third-party chipsets
         • FPGAs
         • ASICs
       • The processor can be used without a chipset (limited I/O needs)
         • The processor's four PCIe connections can attach to discrete PCIe peripherals such as Ethernet controllers
  52. Atom E600 – Conclusion
     • Interesting features for packet processing
       • Power saving features
       • Long support
       • Flexible architecture
     • Options for GPGPU support
       • Old-school GPGPU
         • Use OpenGL ES 2.0 shaders (programmable shaders)
         • Rewrite the code as a fragment shader
       • Wait for Cedar Trail (2011, not yet released)
         • DirectX 10.1
  53. Focus on Integrated Graphics
     • Intel Core 2nd Generation (Sandy Bridge)
     • Intel Atom E600 Series (Tunnel Creek)
     • Nvidia Tegra (Tegra 2)
       • Features
       • Block Diagram
     • AMD Fusion
  54. Tegra – Features
     • SoC (System-on-a-chip)
       • Dual-core ARM CPU
       • GeForce GPU
     • ULP (ultra-low power consumption)
     • Graphics support
       • No DirectX support
       • No CUDA support
       • OpenGL ES 2.0 support
     The next diagram gives an overall view of a Tegra chip…
  55. Tegra – Block Diagram
  56. Tegra – Conclusion
     • Interesting features for packet processing
       • Integrated memory controller
       • Low power consumption
     • Options for GPGPU support
       • Old-school GPGPU
         • Use OpenGL ES 2.0 shaders (programmable shaders)
         • Rewrite the code as a fragment shader
       • Wait for Tegra 3 (third quarter of 2011)
         • DirectX 11
         • CUDA
  57. Focus on Integrated Graphics
     • Intel Core 2nd Generation (Sandy Bridge)
     • Intel Atom E600 Series (Tunnel Creek)
     • Nvidia Tegra (Tegra 2)
     • AMD Fusion
       • AMD Vision
       • Features
       • APU Roadmap
       • Integration Highlights
  58. Fusion – AMD Vision
     Fusion is a step-forward technology:
     AMD has realized this heterogeneous architecture by developing APUs…
  59. Fusion – Features (I)
     Video
  60. Fusion – Features (II)
     • DirectCompute support (DirectX 11)
     • OpenCL 1.1
       • Additive capabilities of an APU and a discrete graphics solution
       • Power-oriented benefits
     • Massive SIMD GPU (SSE5)
       • Programmable scalar and vector processor cores
     • APU family
       • Bulldozer (Sandy Bridge's opponent)
         • Performance and scalability
       • Bobcat (Atom's opponent)
     Let's compare these two solutions…
  61. Fusion – Features (III)
     Bulldozer and Bobcat also differ in their market target…
  62. Fusion – APU roadmap
     The high level of integration differentiates APUs from CPUs…
  63. Fusion – Integration Highlights
     • Shared memory
       • Lower latencies
     • PCI Express
       • Cuts down some latencies
     • No discrete GPU, less
       • Cost
       • Power
       • Motherboard complexity
  64. Fusion – Conclusion
     • Interesting features for packet processing
       • OpenCL/DirectCompute/SSE5
       • Tightly integrated architecture
       • New technology (First-Come-First-Served)
     • OpenCL
       • Could be the "El Dorado" for packet processing
         • CPU/GPU working in AND/OR configuration
         • Shared memory
         • Embedded implementation of Fusion technology
       • AMD openly supports it to bring the power of heterogeneous computing to the mainstream
  65. CONCLUSIONS
  66. Summary (I)
     This presentation has shown several ways of exploiting integrated graphics and, more generally, consumer architectures for packet processing:
     • GPGPU-driven solutions
       • CUDA, OpenCL, DirectX 11
     • SIMD-driven solutions
       • Exploit highly parallel operations through the SIMD implementation
       • AVX, SSE
     • Custom hardware solutions
       • Design flexible modules tailored to specific needs
       • FPGA
     The first of these are the most in vogue at the moment…
  67. Summary (II)
     Columns: OpenCL, DirectCompute, OpenGL, SSE, FPGA
     V X X V V (AVX)
     V X V X V (SSE 3)
     V X X X V (SSE 3)
     V V X V V (SSE 5)
  68. Recommendations
     Writing explicitly parallel code is more efficient than relying on hardware parallelization:
  69. THANK YOU
     Questions?