Embedded Solutions 2010: Intel Multicore by Eastronics


Introduction to embedded Intel® Architecture:
  • Intel® Nehalem microarchitecture
  • Intel® multicore platforms roadmap
  • Intel® software libraries and multithreading tools

Published in: Technology

  1. Intel Multi Core Micro Architecture and Software Tools. Nir Arazy, Field Application Engineer, Eastronics, June 2010. Nir.arazy@easx.co.il
  2. Legal Disclaimer
     • Intel Corporation may have patents or pending patent applications, trademarks, copyrights, or other intellectual property rights that relate to the presented subject matter. The furnishing of documents and other materials and information does not provide any license, express or implied, by estoppel or otherwise, to any such patents, trademarks, copyrights, or other intellectual property rights.
     • The Intel product(s) referred to in this document are intended for standard commercial use only. Customers are solely responsible for assessing the suitability of the product for use in particular applications. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.
     • Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel® products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm or call (U.S.) 1-800-628-8686 or 1-916-356-3104.
     • All information provided related to future Intel products and plans is preliminary and subject to change at any time, without notice. All dates provided are subject to change without notice. Intel may make changes to specifications and product descriptions at any time, without notice.
     • Celeron, Intel, Intel logo, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel SpeedStep, Intel XScale, Itanium, Pentium, Pentium Inside, VTune, Xeon, and Xeon Inside are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
     • * Other names and brands may be claimed as the property of others.
     • Other vendors are listed by Intel as a convenience to Intel's general customer base, but Intel does not make any representations or warranties whatsoever regarding quality, reliability, functionality, or compatibility of these devices. This list and/or these devices may be subject to change without notice.
     • Copyright © 2009, Intel Corporation. All rights reserved.
     Document# 408075 Intel Confidential
  3. [Figure: timeline spanning 2007, 2008, 2009, 2010 and 2011]
  4. Moore’s Law: GHz to Multi-Core
     • [Figure: performance through frequency versus performance through multi-core, with the axis marked at 2006]
     • “Concurrency is the next major revolution in how we write software” (Herb Sutter, Dr. Dobb’s Journal, March 2005)
     • Intel MC assistance: threading, multi-tasking, training, tools
  5. Multi-core is Mainstream: Is Your Software Ready? Multiple execution cores are ramping across Intel platforms.
  6. Agenda
     • HW based parallelism
       – Multi-Cores
       – Turbo boost
       – SMT
       – SSE
     • SW tools to enable efficient parallelism
       – IPP
       – TBB
       – Thread Checker
       – Thread Profiler
  7. Simultaneous Multi-Threading (SMT)
     • SMT: run 2 threads at the same time per core
     • Take advantage of the 4-wide execution engine
       – Keep it fed with multiple threads
       – Hide latency of a single thread
     • Most power-efficient performance feature
       – Very low die area cost
       – Can provide significant performance benefit depending on application
       – Much more efficient than adding an entire core
     • Nehalem/Westmere advantages
       – Larger caches
       – Massive memory BW
     [Figure: execution-unit occupancy over time (processor cycles), without SMT versus with SMT. Note: each box represents a processor execution unit.]
     Simultaneous multi-threading enhances performance and energy efficiency.
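To make the "2 threads per core" point concrete: with SMT enabled, the operating system typically sees two logical processors per physical core, and a program can ask how many hardware threads are available before sizing its thread pool. The sketch below uses only the C++ standard library; the function name `logical_processor_count` is hypothetical, and the value returned by `std::thread::hardware_concurrency()` is only a hint that may be 0 when the runtime cannot determine it.

```cpp
#include <thread>

// Returns the number of hardware threads reported by the C++ runtime.
// The standard allows hardware_concurrency() to return 0 when the count
// is unknown, so fall back to 1 in that case.
unsigned logical_processor_count() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1u : n;
}
```

A caller would typically use this value as the default number of worker threads. It does not tell you whether the count comes from SMT or from additional physical cores.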
  8. Enhanced Cache Subsystem
     • 3-level cache hierarchy
       – First Level Cache (FLC): 32 KB instruction and 32 KB data per core; equivalent to the L1 cache in Intel® Core™ microarchitecture
       – Mid Level Cache (MLC): 256 KB per core
       – Last Level Cache: up to 4 MB shared across both cores; inclusive cache policy to minimize snoop traffic; equivalent to the L2 cache in Intel® Core™2 Duo microarchitecture
     [Figure: processor cache subsystem. Core 0 and Core 1 each have a 32 KB instruction FLC, a 32 KB data FLC and a 256 KB MLC, and share a last level cache of up to 4 MB.]
  9. All New 2010 Intel® Core™ Performance-Based Technology Overview: Core 2010 Features
     • Intel® Hyper-Threading Technology: smart multitasking by doubling the number of processor threads per core
     • Intel® Turbo Boost Technology (see Note 1): intelligently and seamlessly delivers improved CPU performance to match your workload when thermal and power headroom exist
     • Intel® HD Graphics with Dynamic Frequency (available on mobile only): delivers a graphics performance boost to graphics-intensive applications, provided thermal and power headroom exist
     New Intel® Core processors use Intel® Turbo Boost Technology and Dynamic Frequency to maximize performance of CPU- and graphics-intensive tasks.
     Note 1: See the Intel® Turbo Boost Technology disclaimer in the back-up.
  10. Intel® Turbo Boost Technology
      [Figure: previous generation (+1 speed bin, single-core CPU turbo) versus current platform (multiple speed bins, single-core and dual-core CPU turbo, GFX turbo, idle cores in C3 state or lower). Scenario 1: CPU-intensive load. Scenario 2: GFX-intensive load.]
      • Intel® Intelligent Power Sharing: dynamically trade TDP budget between CPU and GFX
      • Note: CPU and GFX can turbo simultaneously
      • Strategy: maximize CPU and GFX performance while staying within the processor TDP and Tjmax
      • Note: some features may be available only on certain SKUs
  11. Intel® Turbo Boost Technology
      [Figure: processor with Turbo versus processor without Turbo]
      Intel® Turbo Boost Technology is targeted to deliver additional performance gains on the platform.
      • Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations.
      • Results have been estimated based on internal Intel analysis and are provided for informational purposes only.
      • Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or model.
  12. Intel® Advanced Digital Media Boost: Single Cycle SSE in Each Core
      • SSE operation (SSE/SSE2/SSE3) on 128-bit operands (bits 127 to 0): source X4 X3 X2 X1, source Y4 Y3 Y2 Y1, destination Xi op Yi for each element
      • Previous: clock cycle 1 executes X2opY2 and X1opY1; clock cycle 2 executes X4opY4 and X3opY3
      • Intel® Core™ Microarchitecture: clock cycle 1 executes X4opY4, X3opY3, X2opY2 and X1opY1
      128-bit single cycle in each core.
  13. Single Instruction Multiple Data (SIMD)
      • Anything that fits into 16 bytes: 4x floats, 2x doubles, 16x bytes, 8x words, 4x dwords, 2x qwords, 1x dqword
      • And all conversions!
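As an illustration of the "4x floats in 16 bytes" case, the sketch below adds four pairs of floats with a single packed SSE instruction, using the standard `<xmmintrin.h>` intrinsics. The function name `add4` is hypothetical, and the code assumes an x86 or x86-64 target whose compiler provides the SSE intrinsics.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 targets only)
#include <array>

// Adds four pairs of single-precision floats with one packed SSE add,
// i.e. X4+Y4, X3+Y3, X2+Y2 and X1+Y1 are computed by a single instruction.
std::array<float, 4> add4(const std::array<float, 4>& x,
                          const std::array<float, 4>& y) {
    __m128 vx = _mm_loadu_ps(x.data());  // unaligned load of 4 floats
    __m128 vy = _mm_loadu_ps(y.data());
    __m128 sum = _mm_add_ps(vx, vy);     // element-wise add on all 4 lanes
    std::array<float, 4> out{};
    _mm_storeu_ps(out.data(), sum);      // unaligned store of 4 floats
    return out;
}
```

The unaligned load and store variants are used so the sketch works for any `std::array`. Code that controls its own data layout would usually align buffers to 16 bytes and use the aligned forms.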
  14. Intel® Advanced Vector Extension (Intel® AVX)
      • Features:
        – New 256-bit Instruction Set Architecture (ISA)
        – Built on legacy 128-bit SIMD (SSEx) and 64-bit SIMD (MMX) ISA extensions
        – Enhancements to 128-bit SIMD instructions
        – Support for 3- and 4-operand syntax
      • Expected Intel® AVX benefits:
        – Image, video and audio processing
        – CNC* and PLC compute performance
        – High-performance Digital Signal & Image Processing (DSIP) within small Size, Weight and total Power (SW&P)
      • Targeted segments:
        – Military/Aerospace/Government
        – Medical imaging
        – Comms, industrial controllers and digital signage
      Performance improvements for floating-point-intensive applications.
      Source: http://software.intel.com/en-us/avx/
  15. Agenda
      • HW based parallelism
        – Multi-Cores
        – SMT
        – Turbo boost
        – SSE
      • SW tools to enable efficient parallelism
        – IPP
        – TBB
        – Thread Checker
        – Thread Profiler
  16. Simplified Threaded Development with Intel® Tools
      • Architectural analysis (analyzers)
        – Find the code that can benefit from threading
        – Find hotspots that limit performance
      • Introduce threads (compilers and libraries)
        – Compilers: built-in optimization, OpenMP
        – Libraries: multimedia and data processing, math processing, threading
      • Confidence/correctness (checkers)
        – Find deadlocks and race conditions
      • Optimize/tune (analyzers)
        – Tune for performance and scalability
        – Visualize efficiency of threaded code
  17. Intel® Integrated Performance Primitives (Intel® IPP): Overview and Benefits
      • Application source code uses Intel IPP usage code samples (free code samples, rapid application development)
        – Sample video/audio/speech codecs
        – Image processing and JPEG
        – Signal processing
        – Data compression
        – .NET and Java integration
      • API calls go to the Intel IPP library C/C++ API (cross-platform compatibility API, code re-use)
        – Cryptography, data compression, data integrity
        – Image processing, image color conversion, JPEG / JPEG2000, computer vision
        – Signal processing, matrix mathematics, vector mathematics, string processing
        – Video coding, audio coding, speech coding, speech recognition
      • Static/dynamic link to Intel IPP processor-optimized binaries (processor-optimized implementation, outstanding performance)
        – Intel® Atom™ Processors
        – Intel® Core™ i7 Processors
        – Intel® Core™ 2 Duo and Core™ Extreme Processors
        – Intel® Core™ Duo and Core™ Solo Processors
        – Intel® Pentium® D Dual-Core Processors
        – Intel® Xeon® 64-bit Dual-Core Processors
        – Intel® Pentium® M and Pentium® 4 Processors
        – Intel® Itanium® 64-bit Processor Family
        – Intel® Xeon® DP and MP Processors
  18. Intel® IPP Function Library
      • Over 11,000 functions in 15 domains
      • Threaded application support
        – All functions are fully thread-safe
        – Many functions are internally threaded
      • Multiple data type support
        – Fixed and floating point data types
        – 8, 16, 32 and 64-bit
      • Supports both static and dynamic linking
        – Maximize performance while balancing application size
  19. Intel® Integrated Performance Primitives (IPP): Intel IPP vs. C on a single processor
      • 200% faster (average over all domains)
      • Optimized C performance normalized to 1
      System configuration: Intel® Xeon® 4 Processor, 2.8GHz, 2GB using Windows* XP
  20. Threading In Application
  21. Threading Inside Intel IPP
  22. Intel® IPP Code Samples: Multithreaded H.264 Video Decode
      Measured using a Dell* Inspiron* 9400 PC with an Intel® Core™ Duo Processor 2.2GHz, 512MB RAM using Microsoft Windows* XP SP2. Codec samples compiled using Intel® C++ Compiler 9.1 using compilation options $(ICL_OMPLIB_OPT) /Qwd9,171,188,593,810,981,1125,1418 -D_OMP_KARABAS -D_OPENMP -Qopenmp
  23. Intel® Threading Building Blocks: Extend C++ for parallelism
      Highlights
      • A C++ runtime library that does thread management, letting developers focus on proven parallel patterns
      • Appropriately scales to the number of HW threads available
      • Supports nested parallelism
      • The thread library API is portable across Linux, Windows, and Mac OS* platforms. The open source community extended support to FreeBSD*, IA Solaris* and XBox* 360
      • Run-time library provides optimal size thread pool, task granularity and performance-oriented scheduling
      • Automatic load balancing through task stealing
      • Cache efficiency and memory reuse
      • Committed to compiler independence, processor independence and OS independence
      Both GPL and commercial licenses are available. http://threadingbuildingblocks.org
      *Other names and brands may be claimed as the property of others
  24. Check Intel® TBB online: www.threadingbuildingblocks.org
      • Active user forums, FAQs, technical blogs, latest documentation
      • Open source package and license information
      • Several very important contributions were made by the open source community, allowing TBB 2.1 to build and work on XBox* 360, Sun Solaris* and AIX*
      • TBB news column and introductory videos
      *Other names and brands may be claimed as the property of others
  25. Threading Tools
      Intel Software Solutions Group: http://www.intel.com/software
      • Intel® Thread Checker: used to create correct multi-threaded code
      • Intel® Thread Profiler: used to analyze performance
  26. Data Race
      • Suppose a=1, b=2
      • Thread1 executes x = a + b; Thread2 executes b = 42
      • What is the value of x if:
        – Thread1 runs before Thread2? x = 3
        – Thread2 runs before Thread1? x = 43
      Execution order is not guaranteed.
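The two outcomes on this slide can be reproduced deterministically by forcing each order in turn. The sketch below is only an illustration: the function name `run_in_order` is hypothetical, and each thread is joined before the next one starts, so the program itself contains no race. Running both threads concurrently without synchronization would be a data race on `b`, which is undefined behavior in C++.

```cpp
#include <thread>

// Runs the two statements from the slide on separate threads, forcing an
// order by joining the first thread before starting the second.
int run_in_order(bool thread1_first) {
    int a = 1, b = 2, x = 0;
    auto thread1_body = [&] { x = a + b; };  // Thread1: x = a + b
    auto thread2_body = [&] { b = 42; };     // Thread2: b = 42
    if (thread1_first) {
        std::thread t1(thread1_body);
        t1.join();
        std::thread t2(thread2_body);
        t2.join();
    } else {
        std::thread t2(thread2_body);
        t2.join();
        std::thread t1(thread1_body);
        t1.join();
    }
    return x;
}
```

`run_in_order(true)` returns 3 and `run_in_order(false)` returns 43, matching the slide. In production code the usual remedies are a mutex around the shared variables or an explicit ordering such as the joins used here.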
  27. Intel® Thread Checker Diagnostics
  28. Source Code Viewer
  29. Performance Profile
      [Figure: speedup (0 to 3) versus number of threads (1 to 4)]
      Possible causes for this scalability profile:
      1. Insufficient parallel work
      2. Load imbalance
      3. Synchronization overhead
      4. Memory bandwidth limitations
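The first cause, insufficient parallel work, can be quantified with Amdahl's law, which is not named on the slide but is the standard bound for this situation. A minimal sketch, with the hypothetical function name `amdahl_speedup`, assuming a fraction `p` of the single-threaded run time parallelizes perfectly and the rest stays serial:

```cpp
#include <cassert>

// Amdahl's law: upper bound on speedup with n threads when a fraction p
// of the single-threaded run time can be parallelized perfectly.
double amdahl_speedup(double p, unsigned n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}
```

For example, with half of the run time parallelizable (`p = 0.5`) and four threads, the bound is 1 / (0.5 + 0.125) = 1.6, so a curve that flattens well below the thread count may simply reflect the serial fraction.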
  30. Finding Serial and Parallel Time
  31. Load Imbalance
      • Unequal work loads lead to idle threads and wasted time
      [Figure: busy and idle time for Threads 0 to 3 over time, between "start threads" and "join threads"]
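One common way to reduce load imbalance is to hand out work dynamically instead of giving each thread a fixed slice. The sketch below is an assumption-laden illustration using only the C++ standard library: the function name `process_dynamic` and the per-item work (squaring an integer) are hypothetical, and the shared-counter scheme shown here is simpler than the task stealing that schedulers such as TBB use.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sums the squares of all items using num_threads workers. Workers claim
// the next unprocessed index from a shared atomic counter, so a thread
// that finishes its item early immediately picks up more work.
long long process_dynamic(const std::vector<int>& items, unsigned num_threads) {
    assert(num_threads >= 1);
    std::atomic<std::size_t> next{0};
    std::vector<long long> partial(num_threads, 0);  // one slot per worker
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (;;) {
                std::size_t i = next.fetch_add(1);   // claim one item
                if (i >= items.size()) break;        // no work left
                // Hypothetical per-item work: square the value.
                partial[t] += static_cast<long long>(items[i]) * items[i];
            }
        });
    }
    for (auto& w : workers) w.join();
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}
```

Claiming one item at a time gives the best balance but the most traffic on the shared counter. Claiming small batches is a usual compromise when items are cheap.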
  32. Synchronization
      • By definition, synchronization serializes execution
      • Lock contention means more idle time for threads
      [Figure: busy, idle and in-critical-section time for Threads 0 to 3 over time]
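A typical way to cut lock contention is to keep the critical section short, for instance by accumulating into a thread-private variable and taking the lock once per thread rather than once per update. A minimal sketch under that assumption, with the hypothetical function name `count_with_short_critical_section`:

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Each worker counts into a private variable and takes the lock only once
// to merge its result, instead of locking on every increment.
long long count_with_short_critical_section(unsigned num_threads,
                                            long long increments_per_thread) {
    long long total = 0;
    std::mutex total_mutex;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            long long local = 0;  // private to this thread, no lock needed
            for (long long i = 0; i < increments_per_thread; ++i) ++local;
            std::lock_guard<std::mutex> lock(total_mutex);
            total += local;       // single short critical section
        });
    }
    for (auto& w : workers) w.join();
    return total;
}
```

With four threads and 1000 increments each, the function returns 4000 while entering the critical section only four times.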
  33. Real example: before fix (Thread Profiler)
      [Figure: time split into serial, parallel and switching overhead]
  34. Real example: after fix
      [Figure: time split into serial and parallel, showing a 2x speedup]
  35. Summary
      • If the hardware doesn’t win outright (unlikely), then it is the SW’s fault, and we can fix the SW
      • Parallelization is an imperative
      • Intel offers a set of tools, world-wide experience and online support
      • Questions to be asked:
        – Have we enabled SMT?
        – Have we investigated the capabilities of SSE?
        – Did we license Intel SW tools? (IPP/TBB/Thread Checker…)
        – Where can I find an Intel acronym dictionary?
  36. Thank You!