Under the Armor of Knights Corner: Intel MIC Architecture at Hot Chips 2012

George Chrysos, the leading architect of the Intel Xeon Phi coprocessor, shared new architecture details of Intel's upcoming HPC powerhouse. Designed for highly parallel applications, the Intel Xeon Phi coprocessor, based on the Intel Many Integrated Core architecture, will combine industry-leading performance per watt with the ability to reuse existing code and applications without rewriting them. Equipped with more than 50 cores and built using Intel's latest 22nm 3D Tri-Gate transistor technology, the new coprocessors will be in production this year, with the first supercomputers on the Top500 list already taking advantage of this technology.


Presentation Transcript

  • Intel® Xeon Phi™ coprocessor (codename Knights Corner)
    George Chrysos, Senior Principal Engineer
    Hot Chips, August 28, 2012
    More on Twitter @IntelITS
  • Legal Disclaimers
    Copyright © 2012 Intel Corporation. All rights reserved.
    INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
    A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
    Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
    The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
    Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
    Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
    Intel, the Intel logo, Xeon, Intel Core and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others.
    Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
    For more complete information about performance and benchmark results, visit Performance Test Disclosure.
    This document contains information on products in the design phase of development. All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
    Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
    WARNING: Altering clock frequency and/or voltage may: (i) reduce system stability and useful life of the system and processor; (ii) cause the processor and other system components to fail; (iii) cause reductions in system performance; (iv) cause additional heat or other damage; and (v) affect system data integrity. Intel has not tested, and does not warranty, the operation of the processor beyond its specifications. Intel assumes no responsibility that the processor, including if used with altered clock frequencies and/or voltages, will be fit for any particular purpose. For more information, visit Overclocking Intel Processors.
    Warning: Altering PC memory frequency and/or voltage may (i) reduce system stability and useful life of the system, memory and processor; (ii) cause the processor and other system components to fail; (iii) cause reductions in system performance; (iv) cause additional heat or other damage; and (v) affect system data integrity. Intel assumes no responsibility that the memory, including if used with altered clock frequencies and/or voltages, will be fit for any particular purpose. Check with memory manufacturer for warranty and additional details.
    Available on select Intel® Core™, Intel® Xeon® and Intel® Xeon Phi™ processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.
    Requires a system with a 64-bit enabled processor, chipset, BIOS and software. Performance will vary depending on the specific hardware and software you use. Consult your PC manufacturer for more information. For more information, visit http://www.intel.com/info/em64t
    Requires a system with Intel® Turbo Boost Technology. Intel Turbo Boost Technology and Intel Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult your PC manufacturer. Performance varies depending on hardware, software, and system configuration. For more information, visit http://www.intel.com/go/turbo
    ENERGY STAR is a system-level energy specification, defined by the Environmental Protection Agency, that relies on all system components (such as processor, chipset, power supply, etc.). For more information, visit http://www.intel.com/technology/epa/index.html
  • Intel® Many Integrated Core (Intel MIC) Architecture
    Targeted at highly parallel HPC workloads
      • Physics, Chemistry, Biology, Financial Services
    Power efficient cores, support for parallelism
      • Cores: less speculation, threads, wider SIMD
      • Scalability: high BW on-die interconnect and memory
    General Purpose Programming Environment
      • Runs Linux (full service, open source OS)
      • Runs applications written in Fortran, C, C++, …
      • Supports x86 memory model, IEEE 754
      • x86 collateral (libraries, compilers, Intel® VTune™, debuggers, etc.)
    Visual and Parallel Computing Group. Copyright © 2012 Intel Corporation. All rights reserved.
  • Knights Corner Coprocessor
    [Block diagram: an Intel® Xeon® processor with system memory connects over PCIe x16 to KNC cards. Recoverable labels for the KNC card: > 50 cores, Linux OS, TCP/IP, multiple GDDR5 channels, >= 8GB GDDR5 memory.]
  • Knights Corner – Power Efficient
    Performance per Watt of a prototype Knights Corner Cluster compared to the 2 Top Graphics Accelerated Clusters (MFLOPS/Watt, higher is better):
      • Intel Corp (Knights Corner): 1381 MFLOPS/Watt, Top500 #150, 72.9 kW
      • Nagasaki Univ. (ATI Radeon): 1380 MFLOPS/Watt, Top500 #456, 47 kW
      • Barcelona Supercomputing Center (Nvidia Tesla 2090): 1266 MFLOPS/Watt, Top500 #177, 81.5 kW
    Source: www.green500.org
  • Knights Corner Micro-architecture
    [Die diagram: cores, each with an L2 cache and a tag directory (TD), plus GDDR memory controllers (MC) and PCIe client logic.]
  • Knights Corner Core
    [Core block diagram. Recoverable labels:]
      • Pipeline stages: PPF, PF, D0, D1, D2, E, WB
      • 4 threads, in-order; one instruction pointer per thread (T0 IP to T3 IP)
      • L1 TLB and 32KB code cache; 16B/cycle (2 IPC)
      • Decode, uCode; Pipe 0 and Pipe 1
      • Register files: VPU RF, X87 RF, Scalar RF
      • Execution units: VPU (512b SIMD), X87, ALU 0, ALU 1
      • L1 TLB and 32KB data cache
      • L2 TLB, TLB miss handler, HWP, L2 Ctl, 512KB L2 cache, to on-die interconnect
      • Miss paths: code cache miss, DCache miss, TLB miss
    X86 specific logic < 2% of core + L2 area
  • Vector Processing Unit
    [Pipeline diagram. Recoverable labels:]
      • Core pipeline: PPF, PF, D0, D1, D2, E, WB
      • Vector pipeline: D2, E, VC1, VC2, V1, V2, V3, V4, WB
      • VPU blocks: DEC, LD, ST, EMU, RF (3R, 1W), Mask RF, Scatter/Gather
      • Vector ALUs: 16 wide x 32 bit, 8 wide x 64 bit, fused multiply-add
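The vector ALU labels above (16 wide x 32 bit, fused multiply-add, a mask register file, three register reads and one write) can be illustrated with a scalar emulation. This is only a sketch: the function names are hypothetical, and the rule that masked-off lanes keep their previous destination value is an assumption about write-masking rather than something the slide states.

```c
#include <assert.h>
#include <stdint.h>

#define VPU_LANES_SP 16  /* 512 bits / 32-bit single-precision elements */

/* Scalar emulation of one masked multiply-add across 16 SP lanes:
 * dst[i] = a[i] * b[i] + c[i] for every lane whose mask bit is set.
 * Assumption: lanes with a clear mask bit keep their previous dst value.
 * Three sources and one destination mirror the "3R, 1W" register file.
 * Plain C may round the product and the sum separately; a true fused
 * multiply-add rounds only once. */
void vpu_fma_masked(float dst[VPU_LANES_SP],
                    const float a[VPU_LANES_SP],
                    const float b[VPU_LANES_SP],
                    const float c[VPU_LANES_SP],
                    uint16_t mask)
{
    for (int lane = 0; lane < VPU_LANES_SP; lane++) {
        if (mask & (1u << lane)) {
            dst[lane] = a[lane] * b[lane] + c[lane];
        }
    }
}

/* Peak floating-point operations per cycle for one 512-bit VPU,
 * counting a fused multiply-add as two operations. */
int vpu_peak_flops_per_cycle(int element_bits)
{
    assert(element_bits == 32 || element_bits == 64);
    return (512 / element_bits) * 2;
}
```

Under these assumptions one VPU peaks at 32 single-precision or 16 double-precision operations per cycle, which is simply the lane count times two.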
  • Interconnect
    [Ring diagram of cores, L2 caches, and tag directories (TD). Rings:]
      • BL: data, 64 bytes
      • AD: command and address
      • AK: coherence and credits
  • Distributed Tag Directories
    [Ring diagram of cores, L2 caches, and tag directories (TD). Each TD entry holds: TAG, Core Valid Mask, State.]
    Tag Directories track cache-lines in all L2s
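A tag directory entry with the three fields named on the slide (TAG, Core Valid Mask, State) might be modelled as below. This is a hedged sketch, not the hardware layout: the state names, the 64-core bound, and every function name are hypothetical, and the slide does not describe the actual coherence protocol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TD_MAX_CORES 64  /* assumed bound so the mask fits in 64 bits */

/* Hypothetical line states; the slide only names a "State" field. */
typedef enum { TD_INVALID, TD_SHARED, TD_EXCLUSIVE } td_state_t;

typedef struct {
    uint64_t   tag;              /* identifies the tracked cache line      */
    uint64_t   core_valid_mask;  /* bit c set: core c's L2 holds the line  */
    td_state_t state;
} td_entry_t;

/* Record that a core's L2 now holds the line. */
void td_add_holder(td_entry_t *e, unsigned core, td_state_t new_state)
{
    assert(core < TD_MAX_CORES && new_state != TD_INVALID);
    e->core_valid_mask |= UINT64_C(1) << core;
    e->state = new_state;
}

/* Record that a core's L2 dropped the line; the entry becomes invalid
 * once no L2 holds it. */
void td_remove_holder(td_entry_t *e, unsigned core)
{
    assert(core < TD_MAX_CORES);
    e->core_valid_mask &= ~(UINT64_C(1) << core);
    if (e->core_valid_mask == 0) {
        e->state = TD_INVALID;
    }
}

bool td_is_held_by(const td_entry_t *e, unsigned core)
{
    assert(core < TD_MAX_CORES);
    return ((e->core_valid_mask >> core) & 1u) != 0;
}

/* Number of L2 caches currently holding the line. */
int td_holder_count(const td_entry_t *e)
{
    int n = 0;
    for (uint64_t m = e->core_valid_mask; m != 0; m &= m - 1) {
        n++;  /* each iteration clears the lowest set bit */
    }
    return n;
}
```

The point of the bitmask is that a single lookup in one directory tells a requester which L2 caches to query, without broadcasting to all cores.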
  • Interleaved Memory Access
    [Ring diagram: GDDR memory controllers (MC) interleaved among the cores, L2 caches, and tag directories (TD).]
  • Interconnect: 2X AD/AK
    [Ring diagram of cores, L2 caches, and tag directories (TD), with the AD and AK rings doubled (2x) and the BL rings carrying 64 bytes of data.]
  • Multi-threaded Triad – Saturation for 1 AD/AK Ring
    [Chart: performance vs. cores running (0 to 50).]
    Simulation data indicates saturation for a single AD/AK ring.
    Results measured in development labs at Intel on Knights Corner prototype hardware and systems. For more information go to http://www.intel.com/performance
  • Multi-threaded Triad – Benefit of Doubling AD/AK
    [Chart: performance vs. cores running (0 to 50).]
    Silicon data for 2 AD + AK rings: > 40% performance.
    Simulation data indicates saturation for a single AD/AK ring.
    Results measured in development labs at Intel on Knights Corner prototype hardware and systems. For more information go to http://www.intel.com/performance
  • Streaming Stores
    Streams Triad:
        for (i=0; i<HUGE; i++) A[i] = k*B[i] + C[i];
    Without Streaming Stores: Read A, B, C, Write A. 256 Bytes transferred to/from memory per iteration
    With Streaming Stores: Read B, C, Write A. 192 Bytes transferred to/from memory per iteration
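The 256-byte and 192-byte figures follow from counting 64-byte cache lines. The sketch below reproduces that arithmetic next to a plain C triad; it assumes that one "iteration" on the slide means one cache line's worth of elements per array, and the function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64

/* Memory traffic for one cache line's worth of the triad
 * A[i] = k*B[i] + C[i].
 * Without streaming stores the line of A is read into the cache before
 * it is overwritten, so three lines are read (A, B, C) and one is
 * written. With streaming stores the read of A is skipped. */
int triad_bytes_per_line(int use_streaming_stores)
{
    assert(use_streaming_stores == 0 || use_streaming_stores == 1);
    int lines_read    = use_streaming_stores ? 2 : 3;
    int lines_written = 1;
    return (lines_read + lines_written) * CACHE_LINE_BYTES;
}

/* The triad kernel itself, in portable C. Whether the store to a[] is
 * emitted as a streaming store depends on the compiler and target. */
void triad(double *a, const double *b, const double *c, double k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = k * b[i] + c[i];
    }
}
```

Going from 256 to 192 bytes is 25% less traffic, so a purely bandwidth-bound loop could run up to about 1.33x faster, which is in line with the greater-than-30% gain reported for streaming stores on the following slide.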
  • Multi-threaded Triad – with Streaming Stores
    [Chart: performance vs. cores running (0 to 50), silicon data.]
    Streaming Stores: > 30% performance.
    Results measured in development labs at Intel on Knights Corner prototype hardware and systems. For more information go to http://www.intel.com/performance
  • Cache Hierarchy Micro-architecture Choices
      • L2 TLB: 64 entry, holds PTEs and PDEs vs. no L2 TLB
      • Dcache Capability: simultaneous 512b load and 512b store vs. 1 load or store per cycle
      • L2 Cache: 512 KB vs. 256 KB
      • Hardware Prefetcher: 16 stream detectors, prefetch into the L2 vs. no HWP (rely only on software prefetching)
  • Per-Core ST Performance Improvement (per cycle)
    [Bar chart: Spec FP 2006, performance impact of KNC core uArch improvements (scale 0.0 to 3.0).]
    >1.8x Average Performance/Cycle Improvement – 1 Core, 1 Thread
    Results measured in development labs at Intel on Knights Corner and Knights Ferry prototype hardware and systems. For more information go to http://www.intel.com/performance
  • Caches – For or Against?
    [Bar chart: relative BW and relative BW/Watt (scale 0 to 50) for memory BW, L2 cache BW, and L1 cache BW.]
    Caches:
      • high data BW
      • low energy per byte of data supplied
      • programmer friendly (coherence just works)
    Coherent Caches are a key MIC Architecture Advantage
    Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance.
  • Example: Stencils
    [Figure: spatial time-step simulation of a physical system.]
    L2$-sized cache blocking promotes much higher performance and performance/watt vs. memory streaming
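As a minimal illustration of cache blocking for a stencil, the sketch below performs one sweep of a 5-point stencil tile by tile and picks a tile size from a cache budget. The 5-point kernel, the 0.2 weights, the function names, and the sizing rule are all assumptions made for the example; the slide does not say which stencil or blocking scheme was measured, and this version tiles in space only, within a single time step.

```c
#include <assert.h>
#include <stddef.h>

/* One sweep of a 5-point averaging stencil over the interior of an
 * ny-by-nx row-major grid, visited in tile_y-by-tile_x blocks so that
 * each block's working set can stay resident in the L2 cache.
 * Boundary cells of out[] are left untouched. */
void stencil_sweep_blocked(const float *in, float *out,
                           size_t ny, size_t nx,
                           size_t tile_y, size_t tile_x)
{
    assert(ny >= 3 && nx >= 3 && tile_y > 0 && tile_x > 0);
    for (size_t by = 1; by < ny - 1; by += tile_y) {
        size_t y_end = (by + tile_y < ny - 1) ? by + tile_y : ny - 1;
        for (size_t bx = 1; bx < nx - 1; bx += tile_x) {
            size_t x_end = (bx + tile_x < nx - 1) ? bx + tile_x : nx - 1;
            for (size_t y = by; y < y_end; y++) {
                for (size_t x = bx; x < x_end; x++) {
                    out[y * nx + x] = 0.2f * (in[y * nx + x]
                                            + in[(y - 1) * nx + x]
                                            + in[(y + 1) * nx + x]
                                            + in[y * nx + x - 1]
                                            + in[y * nx + x + 1]);
                }
            }
        }
    }
}

/* Largest square tile edge such that one input tile plus one output
 * tile of floats fits in budget_bytes (halo cells are ignored). */
size_t tile_edge_for_budget(size_t budget_bytes)
{
    size_t edge = 1;
    while (2 * sizeof(float) * (edge + 1) * (edge + 1) <= budget_bytes) {
        edge++;
    }
    return edge;
}
```

With a 512 KB budget, matching the L2 size given earlier in the deck, this sizing rule yields a 256 x 256 tile. A production code would also block across time steps, which is where most of the reuse comes from.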
  • Power Management: All On and Running
    [Die diagram: cores with L2 caches and tag directories (TD), GDDR memory controllers (MC), GDDR IO, PCIe client logic, and PCIe IO, surrounded by GDDR5 devices.]
  • Core C1: Clock Gate Core
    [Die diagram: cores, L2 caches, TDs, GDDR MCs, GDDR IO, PCIe logic, GDDR5 devices.]
    When all 4 threads (4T) on a core have halted, the core clock-gates itself
  • Core C6: Power Gate Core
    [Die diagram: cores, L2 caches, TDs, GDDR MCs, GDDR IO, PCIe logic, GDDR5 devices.]
    C1 time-out: power gate the core, save leakage; requires core re-init
  • Package Auto C3
    [Die diagram: cores, L2 caches, TDs, GDDR MCs, GDDR IO, PCIe logic, GDDR5 devices.]
    Timeout when all cores have been in C6: clock gate the L2 and interconnect
  • Package C6
    [Die diagram: cores, L2 caches, TDs, GDDR MCs, GDDR IO, PCIe logic, GDDR5 devices.]
    Host Driver can initiate Package C6: Uncore Voltage Off; requires partial restart
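The power-state slides above describe a progression that can be written down as two small decision functions. This is a sketch only: the timeout parameters, the function names, and the rule that Package C6 is entered only once every core is in C6 are assumptions, and the actual policy lives in hardware and the host driver.

```c
#include <assert.h>
#include <stdbool.h>

#define THREADS_PER_CORE 4

typedef enum { CORE_C0, CORE_C1, CORE_C6 } core_state_t;
typedef enum { PKG_ON, PKG_AUTO_C3, PKG_C6 } pkg_state_t;

/* Core level: running until all four threads halt, then clock-gated
 * (C1), then power-gated (C6) once the C1 time-out expires.
 * c6_timeout is a hypothetical tunable. */
core_state_t core_next_state(int halted_threads,
                             unsigned idle_ticks,
                             unsigned c6_timeout)
{
    assert(halted_threads >= 0 && halted_threads <= THREADS_PER_CORE);
    if (halted_threads < THREADS_PER_CORE) {
        return CORE_C0;
    }
    return (idle_ticks >= c6_timeout) ? CORE_C6 : CORE_C1;
}

/* Package level: once every core has been in C6 for c3_timeout ticks,
 * the L2 and interconnect are clock-gated (Auto C3). A host driver
 * request moves the package to C6 (uncore voltage off).
 * Assumption: the request is honoured only when all cores are in C6. */
pkg_state_t pkg_next_state(const core_state_t *cores, int ncores,
                           unsigned all_c6_ticks,
                           unsigned c3_timeout,
                           bool host_requests_c6)
{
    assert(ncores > 0);
    for (int i = 0; i < ncores; i++) {
        if (cores[i] != CORE_C6) {
            return PKG_ON;
        }
    }
    if (host_requests_c6) {
        return PKG_C6;
    }
    return (all_c6_ticks >= c3_timeout) ? PKG_AUTO_C3 : PKG_ON;
}
```

Each deeper state saves more power but costs more to leave: C1 only stops the clock, core C6 needs a core re-init, and Package C6 needs a partial restart.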
  • Summary
    Intel® Xeon Phi™ coprocessor provides:
      • Performance and Performance/Watt for highly parallel HPC with cores, threads, wide-SIMD, caches, memory BW
      • Intel Architecture general purpose programming environment
      • advanced power management technology
    KNC delivers programmability and performance/watt for highly parallel HPC
  • Thank You
    Knights Corner brought to you by:
      IAG (Intel Architecture Group)
        • DCSG (Data Center and Systems Group)
        • VPG (Visual and Parallel Group) MIC – HW Architecture – HW Design – SW
      SSG (Software and Services Group) MIC
      IL PCL (Intel Labs – Parallel Computing Lab)
  • Vector Processor: 512b SIMD Width
    [Lane diagram: 16 SP lanes (SP0 to SP15) and 8 DP lanes (DP0 to DP7) across register-file slices RF0 to RF3, with a shared multiplier circuit for SP/DP.]
    16 wide SP SIMD, 8 wide DP SIMD
    2:1 Ratio good for circuit optimization
  • Gather/Scatter Address Machinery
    Gather Instruction Loop:
        gather-prime
        loop: gather-step; jump-mask-not-zero loop
    [Diagram: a vector register of indices (Index0 to Index7) is added to a scalar-register base address to form Addr0 to Addr7; a mask register holds one bit per element (all 1s at the start). Labels: Find First, Access Address, To TLB/DCACHE, comparators (=), Clear.]
    Gather/Scatter machine takes advantage of cache-line locality
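The gather loop on this slide can be emulated in scalar C as follows. It is an interpretation of the diagram, not a specification of the instruction: each step is assumed to pick the first pending element, fetch every pending element that shares its 64-byte cache line, and clear those mask bits. The base array is assumed to be cache-line aligned so that lines can be derived from byte offsets, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GATHER_LANES 8   /* the slide shows Index0 to Index7 */
#define LINE_BYTES   64

/* Cache line number of element idx, relative to an aligned base. */
static size_t line_of(uint32_t idx)
{
    return ((size_t)idx * sizeof(float)) / LINE_BYTES;
}

/* One "gather-step": find the first lane whose mask bit is set, load
 * every pending lane that falls in the same cache line, and clear the
 * mask bits of the lanes just loaded. Returns the remaining mask. */
uint8_t gather_step(float dst[GATHER_LANES], const float *base,
                    const uint32_t idx[GATHER_LANES], uint8_t mask)
{
    int first = -1;
    for (int i = 0; i < GATHER_LANES; i++) {
        if (mask & (1u << i)) { first = i; break; }
    }
    if (first < 0) {
        return 0;  /* nothing pending */
    }
    size_t line = line_of(idx[first]);
    for (int i = first; i < GATHER_LANES; i++) {
        if ((mask & (1u << i)) && line_of(idx[i]) == line) {
            dst[i] = base[idx[i]];
            mask = (uint8_t)(mask & ~(1u << i));
        }
    }
    return mask;
}

/* The "jump-mask-not-zero" loop: repeat steps until the mask is empty.
 * Returns the number of steps, i.e. distinct cache lines touched. */
int gather(float dst[GATHER_LANES], const float *base,
           const uint32_t idx[GATHER_LANES], uint8_t mask)
{
    int steps = 0;
    while (mask != 0) {
        mask = gather_step(dst, base, idx, mask);
        steps++;
    }
    return steps;
}
```

Under this model, eight indices that all fall in one cache line finish in a single step, while eight indices in eight different lines take eight steps, which is the cache-line locality the slide refers to.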
  • Package Deep C3
    [Die diagram: cores, L2 caches, TDs, GDDR MCs, GDDR IO, PCIe logic, GDDR5 devices.]
    Host Driver Initiated: L2/Ring/TDs dropped to retention V, memory in self refresh