
HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Architecture



HPC DAY 2017 - http://www.hpcday.eu/

Accelerating tomorrow's HPC and AI workflows with Intel Architecture

Atanas Atanasov | HPC solution architect, EMEA region at Intel

Published in: Technology


  1. 1. © Copyright 2017 Intel Corporation Atanas Atanasov
  2. 2. Intel Confidential | NDA Required Intel technologies may require enabled hardware, specific software, or services activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. Cost reduction scenarios described are intended as examples of how a given Intel- based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. For more information go to http://www.intel.com/performance. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. No computer system can be absolutely secure. Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K. Intel, the Intel logo, Xeon, Intel vPro, Intel Xeon Phi, Look Inside., are trademarks of Intel Corporation in the U.S. and/or other countries. 
*Other names and brands may be claimed as the property of others. Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries. © 2017 Intel Corporation. LegalDisclaimers 2
  3. 3. Disclaimers Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. The cost reduction scenarios described are intended to enable you to get a better understanding of how the purchase of a given Intel based product, combined with a number of situation-specific variables, might affect future costs and savings. Circumstances will vary and there may be unaccounted-for costs related to the use and deployment of a given product. Nothing in this document should be interpreted as either a promise of or contract for a given level of costs or cost reduction. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804. No computer system can be absolutely secure. Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. 
Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo. Intel processors of the same SKU may vary in frequency or power as a result of natural variability in the production process. SPEC, SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). © 2016 Intel Corporation. Intel, the Intel logo, Xeon, Xeon Phi, Xeon Phi logos and Xeon logos are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 3Intel Confidential
  4. 4. Agenda
     • Challenges in HPC/AI and SSF
     • Compute: Xeon Scalable Family
     • Fabric: Omni-Path
     • Storage: Optane
     • AI: Nervana
  5. 5. HPC is Foundational to Insight
     • Business Innovation: High ROI: $515 Average Return Per $1 of HPC Investment (1)
     • Fundamental Discovery: Advancing Science And Our Understanding of the Universe
     • A New Science Paradigm: Data-Driven Analytics Joins Theory, Experimentation, and Computational Science
     Application areas: Aerospace, Biology, Brain Modeling, Chemistry/Chemical Engineering, Climate, Computer Aided Engineering, Cosmology, Cybersecurity, Defense, Digital Content Creation, EDA, Economics/Financial Services, Fraud Detection, Fluid Dynamics, Genomics, Geosciences / Oil & Gas, Government Lab, Life Sciences, Manufacturing / Design, Metallurgy, Particle Physics, Pharmacology, Social Sciences (literature, linguistics, marketing), University Academic, Weather
     1 Source: IDC HPC and ROI Study Update (September 2015)
     2 Source: IDC 2015 Q1 World Wide x86 Server Tracker vs IDC 2015 Q1 World Wide HPC Server Tracker
  6. 6. Growing Challenges in HPC
     • System Bottlenecks (“The Walls”): Memory | I/O | Storage; Energy Efficient Performance; Space | Resiliency | Unoptimized Software
     • Divergent Infrastructure: Resources Split Among Modeling and Simulation | Big Data Analytics | Machine Learning | Visualization
     • Barriers to Extending Usage: HPC Optimized; Democratization at Every Scale | Cloud Access | Exploration of New Parallel Programming Models
     Diagram labels: Big Data, HPC, Machine Learning, Visualization
  7. 7. 11 What Makes a Great HPC Solution? Parallel File SystemSwitch Fabric Login and Management Nodes . . . Actual configurations depend on specific OEM offerings and implementation. Intel® Omni-Path Fabric 1GbE for administration IBA 10/40 GbE Networking Gateways Intel® Software Tools Intel® Parallel Studio Intel® Node Manager Intel® Trace Analyzer I/O Nodes Intel® Networking Intel® Omni-Path Fabric Intel® Silicon Photonics Burst Buffer Intel® Xeon® Processors Intel® Omni-Path Fabric Intel® Optane™ Technology Compute Nodes Intel® Compute Intel® Xeon Phi™ Processors Intel® Xeon® Processors Intel® Optane™ Technology Intel® Omni-Path Fabric Intel® Solutions for Lustre* Intel® Enterprise Edition for Lustre* Intel® Foundation Edition for Lustre* Intel® Cloud Edition for Lustre* Reference Architecture Intel® Cluster Ready Intel® Scalable System Framework
  8. 8. A Holistic Architectural Approach is Required. Compute, Memory, Fabric, Storage. Chart axes: Performance | Capability versus Time. System Software: Innovative Technologies, Tighter Integration. Application: Modernized Code (Community, ISV, Proprietary). System: Memory, Cores, Graphics, Fabric, FPGA, I/O
  9. 9. 5 Intel® Scalable System Framework A Holistic Design Solution for All HPC Needs Small Clusters Through Supercomputers Compute and Data-Centric Computing Standards-Based Programmability On-Premise and Cloud-Based Intel® Xeon® Processors Intel® Xeon Phi™ Processors Intel® Xeon Phi™ Coprocessors Intel® Server Boards and Platforms Intel® Solutions for Lustre* Intel® Optane™ Technology 3D XPoint™ Technology Intel® SSDs Intel® Omni-Path Architecture Intel® True Scale Fabric Intel® Ethernet Intel® Silicon Photonics HPC System Software Stack Intel® Software Tools Intel® Cluster Ready Program Intel Supported SDVis Compute Memory/Storage Fabric Software Intel Silicon Photonics
  10. 10. Compute: Xeon Scalable Family
  11. 11. Intel® Xeon® Scalable platform. The foundation of Data Center Innovation: Agile & Trusted Infrastructure. Delivers 1.65x average performance boost over prior generation (1)
     • Performance: Pervasive through compute, storage, and network
     • Agility: Rapid service delivery
     • Security: Pervasive data security with near zero performance overhead
     1 Up to 1.65x Geomean based on Normalized Generational Performance going from Intel® Xeon® processor E5-26xx v4 to Intel® Xeon® Scalable processor (estimated based on Intel internal testing of OLTP Brokerage, SAP SD 2-Tier, HammerDB, Server-side Java, SPEC*int_rate_base2006, SPEC*fp_rate_base2006, Server Virtualization, STREAM* triad, LAMMPS, DPDK L3 Packet Forwarding, Black-Scholes, Intel Distribution for LINPACK).
     Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.
  12. 12. Typical 2-socket configuration
     Intel Xeon E5 v4 (2016):
       • Four DDR4 memory channels, up to 24 DIMMs
       • Up to 80 PCIe lanes
       • Two QPI links (up to 9.6 GT/s)
     Intel Xeon Scalable (2017, Purley):
       • Six DDR4 memory channels, up to 24 DIMMs
       • Up to 96 PCIe lanes
       • Two UPI links (up to 10.4 GT/s); up to 3 UPI links in 4S and 8S configurations
       • Integrated Intel® Omni-Path Architecture (Fabric)
     Diagram labels (2016): two CPUs linked by Intel® QPI; DDR4 DIMMs; PCIe* x8, x8, x4, x4; DMI 2
     Diagram labels (2017): two CPUs linked by Intel® UPI; per CPU 3x16 PCIe* and 1x100G Intel® OP Fabric; LBG attached via DMI and x4
     ** PCIe* uplink connection for Intel® QuickAssist Technology and Intel® Ethernet
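To put the channel counts on this slide into numbers, the sketch below estimates theoretical peak memory bandwidth per CPU. It assumes a 64-bit (8-byte) data path per DDR4 channel and uses the DDR4-2666 speed quoted elsewhere in this deck. The function name is illustrative, and holding the transfer rate equal across both generations is a simplification, so treat the result as a rough estimate rather than a measured figure.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: float,
                        bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth for one CPU socket, in GB/s.

    bus_bytes=8 assumes a 64-bit data path per DDR4 channel.
    """
    bytes_per_second = channels * mega_transfers_per_s * 1e6 * bus_bytes
    return bytes_per_second / 1e9


# Six channels (Xeon Scalable) vs. four channels (Xeon E5 v4),
# both evaluated at DDR4-2666 to isolate the effect of channel count.
six_channel = peak_bandwidth_gb_s(6, 2666)
four_channel = peak_bandwidth_gb_s(4, 2666)

print(f"6 channels: {six_channel:.1f} GB/s")
print(f"4 channels: {four_channel:.1f} GB/s")
print(f"ratio: {six_channel / four_channel:.2f}x")
```

At an equal transfer rate the ratio comes purely from the channel count (6/4), which is consistent with the "up to 1.5x" memory bandwidth claim made later in the deck.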
  13. 13. Intel® Xeon® Scalable processors: The Foundation for Agile, Secure, Workload-Optimized Hybrid Cloud
     Tier labels and features, in slide order:
     • Mainstream; Good; light tasks; scalable performance at low power; entry scalable performance; hardware-enhanced security; standard RAS; standard RAS; moderate tasks; Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology; for moderate workloads; for light workloads
     • Up to 22 cores; 2 & 4 socket support; up to 3 UPI links; advanced reliability, availability and serviceability
     • Up to 28 cores; 2, 4 & 8 socket support; up to 3 UPI links; 2666 MHz DDR4; with up to 1.5 TB topline memory channel bandwidth; highest accelerator throughput
     • Entry: efficient; entry performance, price sensitive
  14. 14. Designed for next-generation Data Centers. Ring Architecture (2009-2017+) vs. Mesh Architecture (new in 2017)
     • Maximizes performance
     • Enables consistent, low latencies
     • Optimized for data sharing and memory access between all CPU cores/threads for ideal memory bandwidth and capacity
     • Data flows scale efficiently for 2, 4 & 8+ socket configurations
     • Designed for modern virtualized and hybrid cloud implementations
  15. 15. Re-Architected L2 & L3 Cache Hierarchy
     Previous architectures: shared L3, 2.5 MB/core (inclusive); private L2, 256 KB per core
     Intel® Xeon® Scalable processor architecture: shared L3, 1.375 MB/core (non-inclusive); private L2, 1 MB per core
     • On-chip cache balance shifted from shared-distributed (prior architectures) to private-local (Skylake architecture):
       • Shared-distributed → shared-distributed L3 is primary cache
       • Private-local → private L2 becomes primary cache with shared L3 used as overflow cache
     • Shared L3 changed from inclusive to non-inclusive:
       • Inclusive (prior architectures) → L3 has copies of all lines in L2
       • Non-inclusive (Skylake architecture) → lines in L2 may not exist in L3
     Skylake-SP cache hierarchy architected specifically for the data center use case
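The inclusive vs. non-inclusive distinction on this slide changes how much distinct data the L2 and L3 caches can hold together. The sketch below works that out from the per-core sizes on the slide. The core counts (22 and 28) are the maximums quoted elsewhere in this deck, the function and key names are illustrative, and "max unique" is an upper bound rather than a measured working-set size.

```python
def cache_totals_mb(cores: int, l2_mb_per_core: float,
                    l3_mb_per_core: float, inclusive: bool) -> dict:
    """Aggregate L2/L3 capacity and an upper bound on distinct cached data."""
    l2_total = cores * l2_mb_per_core
    l3_total = cores * l3_mb_per_core
    # Inclusive L3: every L2 line is duplicated in L3, so L3 bounds the
    # amount of distinct data. Non-inclusive L3: L2 lines need not be in
    # L3, so up to L2 + L3 worth of distinct lines can be cached.
    max_unique = l3_total if inclusive else l2_total + l3_total
    return {"l2_total": l2_total, "l3_total": l3_total, "max_unique": max_unique}


# Prior generation: 256 KB private L2, 2.5 MB/core inclusive L3, 22 cores.
prior = cache_totals_mb(22, 0.25, 2.5, inclusive=True)
# Xeon Scalable: 1 MB private L2, 1.375 MB/core non-inclusive L3, 28 cores.
skylake = cache_totals_mb(28, 1.0, 1.375, inclusive=False)

print(prior)
print(skylake)
```

Under these assumptions the shared L3 shrinks per core, but the private L2 quadruples and no longer has to be mirrored in L3, which is the "private-local" shift the slide describes.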
  16. 16. Intel® Xeon® Scalable Processors for Technical Computing (HPC)
     Powerful and balanced performance for diverse HPC workloads
     Powerful performance:
       • Up to 28 cores vs. 24 cores/22 cores (on Intel® Xeon® processor E7 v4 / Intel Xeon processor E5-2600 v4 families)
       • Intel® AVX-512 delivers up to 2X FLOPs/clock-cycle peak performance capability optimized for HPC, data analytics, and cryptography workloads (1)
       • New Intel® Mesh architecture with 3 Intel® Ultra Path Interconnect links provides greater inter-CPU bandwidth for the most data-hungry, latency-sensitive applications
     Significantly increased memory and I/O bandwidth:
       • Up to 1.5x gen-to-gen memory bandwidth increase per CPU (6 memory channels) for extremely large compute- and data-intensive workloads
       • More I/O bandwidth with 48 PCIe 3.0 lanes vs. 40 lanes on Intel Xeon processor E5-2600 v4
       • Intel® Optane™ and Intel® 3D NAND solid state drives deliver industry-leading combination of high throughput, low latency, high quality of service (QoS), and ultra high endurance (6) to break data access bottlenecks
     Integrated interconnect for compelling efficiency
     Integrated Intel® Omni-Path Architecture designed for today’s HPC systems:
       • Provides 100 Gbps high-bandwidth and low-latency fabric for HPC clusters
       • Reduces number of required switches and lowers fabric costs (7), freeing up budget for up to 24% more compute nodes (8)
       • Denser 48-port switch chip delivers a 33 percent increase (9) over traditional InfiniBand switch, resulting in power, space and maintenance savings
     Converged parallel programming environment for Intel® Xeon® Scalable processors & Intel® Xeon Phi™ processors
     Highly integrated portfolio of superior technologies and optimized software tools ensures code portability across IA solutions:
       • Intel AVX-512 enables converged programming environment for Intel Xeon Scalable Processor and Intel® Xeon Phi™ Processor compute nodes
       • Intel® Modern Code Developer Program enables the next decade of discovery
       • Intel® Parallel Studio XE 2017 upgrades developer toolkit for HPC and technical computing
       • Intel® HPC Orchestrator simplifies installation and ongoing maintenance of HPC system software stack
     For footnotes and configurations, see slides 29-30.
  17. 17. Intel® Advanced Vector Extensions 512 (AVX-512)
     End Customer Value: Workload-optimized performance, throughput increases, and H/W-enhanced security improvements for familiar analytics, HPC, video transcode, cryptography, and compression software.
     Problems Solved: 1. Achieve more work per cycle (doubles width of data registers) 2. Minimize latency & overhead (doubles the number of registers) with ultra-wide (512-bit) vector processing capabilities (note that 2x FMA processing engines are available on Intel® Xeon® Platinum and Intel® Xeon® Gold Processors)
     Proof points: Up to 2x FLOPS/clock cycle (1); up to 4x greater throughput (2). Accelerates performance for your most demanding computational tasks
     Value pillars: performance, security
     Segments: Cloud Service Providers, Comms Service Providers, Enterprise
     * FLOPs = Floating Point Operations
     1 Peak performance vs. Intel® AVX2. As measured by Intel® Xeon® Processor Scalable Family with Intel® AVX-512 compared to an Intel® Xeon® E5 v4 with Intel® AVX2
     2 Vectorized floating-point throughput. As measured by Intel® Xeon® Processor Scalable Family with Intel® AVX-512 compared to an Intel® Xeon® E5 v4 with Intel® AVX2
  18. 18. Intel® Advanced Vector Extensions 512 (AVX-512): Powerful instruction set for data-parallel computation
     • 512-bit wide vectors
     • 32 operand registers
     • 8 64b mask registers
     • Embedded broadcast
     • Embedded rounding

     Microarchitecture     Instruction Set         SP FLOPs / cycle   DP FLOPs / cycle
     Skylake               Intel® AVX-512 & FMA    64                 32
     Haswell / Broadwell   Intel AVX2 & FMA        32                 16
     Sandybridge           Intel AVX (256b)        16                 8
     Nehalem               SSE (128b)              8                  4

     Intel AVX-512 Instruction Types:
     AVX-512-F    AVX-512 Foundation Instructions
     AVX-512-VL   Vector Length Orthogonality: ability to operate on sub-512 vector sizes
     AVX-512-BW   512-bit Byte/Word support
     AVX-512-DQ   Additional D/Q/SP/DP instructions (converts, transcendental support, etc.)
     AVX-512-CD   Conflict Detect: used in vectorizing loops with potential address conflicts
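The FLOPs-per-cycle figures in the table above follow from vector width, element size, and whether each lane performs a fused multiply-add. The sketch below reproduces them under a simple model that is an assumption of this example, not something stated on the slide: each core issues to two floating-point vector units per cycle (two FMA units on Skylake and Haswell/Broadwell, one add plus one multiply unit on Sandy Bridge and Nehalem), and an FMA counts as two operations per lane.

```python
def flops_per_cycle(vector_bits: int, element_bits: int,
                    fma: bool, units: int = 2) -> int:
    """Peak floating-point operations per core per clock cycle.

    units=2 is a modelling assumption: two FP vector units issue per cycle.
    """
    lanes = vector_bits // element_bits   # elements processed per instruction
    ops_per_lane = 2 if fma else 1        # an FMA is a multiply plus an add
    return lanes * ops_per_lane * units


# (vector width in bits, FMA available) per microarchitecture
ARCHS = {
    "Skylake (AVX-512 & FMA)": (512, True),
    "Haswell / Broadwell (AVX2 & FMA)": (256, True),
    "Sandybridge (AVX 256b)": (256, False),
    "Nehalem (SSE 128b)": (128, False),
}

for name, (bits, fma) in ARCHS.items():
    sp = flops_per_cycle(bits, 32, fma)   # single precision: 32-bit elements
    dp = flops_per_cycle(bits, 64, fma)   # double precision: 64-bit elements
    print(f"{name}: SP={sp} DP={dp}")
```

Each generation doubles the peak either by widening the vector (SSE to AVX, AVX2 to AVX-512) or by adding FMA (AVX to AVX2), which is where the "up to 2x FLOPs/clock cycle" claim comes from.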
  19. 19. Performance and Efficiency with Intel® AVX-512
     LINPACK performance by instruction set (GFLOPs/Watt and GFLOPs/GHz are normalized to SSE4.2):

     Instruction set   GFLOPs   System power (W)   Core frequency (GHz)   GFLOPs/Watt   GFLOPs/GHz
     SSE4.2            669      760                3.1                    1.00          1.00
     AVX               1178     768                2.8                    1.74          1.95
     AVX2              2034     791                2.5                    2.92          3.77
     AVX512            3259     767                2.1                    4.83          7.19

     Intel® AVX-512 delivers significant performance and efficiency gains
     Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
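The normalized efficiency series on this slide can be rederived from its raw GFLOPs, power, and frequency values. The sketch below does that arithmetic. The dictionary layout and function name are illustrative, and the mapping of each number to its series is inferred from the chart legend.

```python
# Raw LINPACK chart values from the slide, keyed by instruction set.
RESULTS = {
    "SSE4.2": {"gflops": 669,  "power_w": 760, "freq_ghz": 3.1},
    "AVX":    {"gflops": 1178, "power_w": 768, "freq_ghz": 2.8},
    "AVX2":   {"gflops": 2034, "power_w": 791, "freq_ghz": 2.5},
    "AVX512": {"gflops": 3259, "power_w": 767, "freq_ghz": 2.1},
}


def normalized(metric, baseline: str = "SSE4.2") -> dict:
    """Apply `metric` to each row and express it relative to the baseline row."""
    base = metric(RESULTS[baseline])
    return {isa: round(metric(row) / base, 2) for isa, row in RESULTS.items()}


gflops_per_watt = normalized(lambda r: r["gflops"] / r["power_w"])
gflops_per_ghz = normalized(lambda r: r["gflops"] / r["freq_ghz"])

print(gflops_per_watt)
print(gflops_per_ghz)
```

The computed ratios agree with the slide's normalized series. Core frequency drops from 3.1 to 2.1 GHz as the vectors widen while system power stays nearly flat, so the efficiency gain comes from doing more work per cycle.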
  20. 20. Fabric: Omni-Path
  21. 21. Intel® Omni-Path Architecture in 30 secs. The Interconnect Landscape: Why Intel® OPA?
     • Performance: I/O struggling to keep up with CPU innovation
     • Increasing Scale: From 10K nodes to 200K+. Previous solutions reaching limits of scalability, manageability and reliability
     • Fabric: Cluster Budget (1): Fabric an increasing % of HPC hardware costs. Today: 20%-30%. Tomorrow: 30 to 40%
     Goal: Keep cluster costs in check → maximize COMPUTE power per dollar
     1 Source: Internal analysis based on a 256-node to 2048-node clusters configured with Mellanox FDR and EDR InfiniBand products. Mellanox component pricing from www.kernelsoftware.com Prices as of November 3, 2015. Compute node pricing based on Dell PowerEdge R730 server from www.dell.com. Prices as of May 26, 2015. Intel® OPA (x8) utilizes a 2-1 over-subscribed Fabric. Intel® OPA pricing based on estimated reseller pricing using projected Intel MSRP pricing on day of launch.
  22. 22. Intel® Omni-Path Architecture: HPC’s Next-Generation Fabric. The Future of High Performance Fabrics
     • Better Scaling vs EDR: 48 Radix Chip Ports; Up to 26% More Servers than InfiniBand* EDR within the Same Budget (1); Up to 60% Lower Power and Cooling Costs (2)
     • Configurable / Resilient: Job Prioritization (Traffic Flow Optimization); No-Compromise Resiliency (Packet Integrity Protection and Dynamic Lane Scaling)
     • Market Adoption: >100 OEM and HPC Storage Vendor Offerings Expected for Platforms, Switches, and Adapters (3)
     1. Assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of 648-port director switches and 36-port edge switches. Mellanox component pricing from www.kernelsoftware.com, with prices as of November 3, 2015. Compute node pricing based on Dell PowerEdge R730 server from www.dell.com, with prices as of May 26, 2015. Intel® OPA pricing based on estimated reseller pricing based on Intel MSRP pricing on ark.intel.com.
     2. Assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of director switches and edge switches. Mellanox power data based on Mellanox CS7500 Director Switch, Mellanox SB7700/SB7790 Edge switch, and Mellanox ConnectX-4 VPI adapter card installation documentation posted on www.mellanox.com as of November 1, 2015. Intel OPA power data based on product briefs posted on www.intel.com as of November 16, 2015. Intel® OPA pricing based on estimated reseller pricing based on Intel MSRP pricing on ark.intel.com.
     3. Intel internal information. Design win count based on OEM and HPC storage vendors who are planning to offer either Intel-branded or custom switch products, along with the total number of OEM platforms that are currently planned to support custom and/or standard Intel® OPA adapters. Design win count as of November 1, 2015 and subject to change without notice based on vendor product plans.
     *Other names and brands may be claimed as property of others.
     Intel® Scalable System Framework
  23. 23. Intel® Omni-Path Architecture: HPC’s Next-Generation Fabric
     Chart: switch chips required vs. nodes, Intel® OPA 48-port switch compared with InfiniBand* 36-port switch. FEWER SWITCHES REQUIRED
     • 26% More Servers than EDR (1)
     • 60% Lower Cooling Costs (2)
     • 2.3X Greater Fabric Scalability (3)
     1. Assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of 648-port director switches and 36-port edge switches. Mellanox component pricing from www.kernelsoftware.com, with prices as of November 3, 2015. Compute node pricing based on Dell PowerEdge R730 server from www.dell.com, with prices as of May 26, 2015. Intel® OPA pricing based on estimated reseller pricing based on Intel MSRP pricing on ark.intel.com.
     2. Assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of director switches and edge switches. Mellanox power data based on Mellanox CS7500 Director Switch, Mellanox SB7700/SB7790 Edge switch, and Mellanox ConnectX-4 VPI adapter card installation documentation posted on www.mellanox.com as of November 1, 2015. Intel OPA power data based on product briefs posted on www.intel.com as of November 16, 2015. Intel® OPA pricing based on estimated reseller pricing based on Intel MSRP pricing on ark.intel.com.
     3. Number of switch chips required, switch density, and fabric scalability are based on a full bisectional bandwidth (FBB) Fat-Tree configuration, using a 48-port switch for Intel® Omni-Path Architecture and 36-port switch ASIC for either Mellanox or Intel® True Scale Fabric. 2.3X fabric scalability based on a 27,648-node cluster configured with the Intel® Omni-Path Architecture using 48-port switch ASICs, as compared with a 36-port switch chip that can support up to 11,664 nodes.
     *Other names and brands may be claimed as the property of others.
     Intel® Scalable System Framework
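The node counts in footnote 3 are what a full-bisection-bandwidth fat tree gives when built from switch chips of a fixed port count (radix). The sketch below uses the standard folded-Clos relation, in which a fat tree with t tiers of radix-k switches supports 2 × (k/2)^t end nodes. The function name is illustrative, and the formula is the generic textbook bound rather than anything taken from Intel's sizing tools.

```python
def max_fat_tree_nodes(radix: int, tiers: int) -> int:
    """Maximum end nodes in a full-bisection fat tree of `tiers` switch levels.

    Every switch splits its ports evenly between down-links and up-links,
    except the top tier, which uses all ports as down-links.
    """
    half = radix // 2
    return 2 * half ** tiers


opa_nodes = max_fat_tree_nodes(48, 3)   # 48-port Intel OPA switch chip
ib_nodes = max_fat_tree_nodes(36, 3)    # 36-port InfiniBand switch chip

print(f"48-port, 3 tiers: {opa_nodes} nodes")
print(f"36-port, 3 tiers: {ib_nodes} nodes")
print(f"scalability ratio: {opa_nodes / ib_nodes:.2f}x")
print(f"36-port, 2 tiers: {max_fat_tree_nodes(36, 2)} nodes")
```

Three tiers give 27,648 and 11,664 nodes, matching the footnote, and the ratio of about 2.37 is what the slide rounds down to 2.3X. The two-tier 36-port case gives 648 ports, the size of the director switch named in footnote 1.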
  24. 24. Intel® Omni-Path Architecture Xeon Phi™ Processor-F (KNL-F) Maximizing Support for Heterogeneous Clusters Intel Xeon Processor (HSW, BDW & SKL) PCI Card Xeon Phi™ Processor (KNL) HFI Greater flexibility for creating compute islands depending on user requirements 24 WFR HFI Intel Xeon Processor-F (SKL-F) HFI WFR HFI Intel Xeon Processor-F (SKL-F) HFI GPU GPU GPU memory GPU memory PCI bus Intel Xeon Processor (SKL) GPU Direct v3 provided in Intel® OPA 10.3 release PCI Card PCI Card WFR HFI
  25. 25. Intel® Omni-Path Architecture Next Up for Intel® OPA: Artificial Intelligence Intel offers a complete AI Portfolio  From CPUs to software to computer vision to libraries and tools Intel® OPA offers breakthrough performance on scale-out apps  Low latency  High bandwidth  High message rate  GPU Direct RDMA support  Xeon Phi Integration 25 Things &devices Cloud DATACenter Accelerant Technologies World-class interconnect solution for shorter time to train
  26. 26. Intel® Omni-Path Architecture NVMe* over OPA Intel® OPA + Intel® SSD and Optane™ Technology  High Endurance  Low latency  High Efficiency  Complete NVMe over Fabric Solution NVMe-over-OPA status  Supported in 10.4.3 IFS release  Compliant with NVMeF spec 1.0 Target and Host system configuration: 2 x Intel® Xeon® CPU E5-2699 v3 @ 2.30Ghz, Intel® Server Board S2600WT, 128GB DDR4, CentOS 7.3.1611, kernel 4.10.12, IFS 10.4.1, NULL- BLK, FIO 2.19 options hfi1 krcvqs=8 sge_copy_mode=2 wss_threshold=70 26 *Other names and brands may be claimed as the property of others. Only Intel is delivering a total NVMe over Fabric solution! NVMe Host Driver RDMA Transport Intel® OPA HFI NVMe Host Driver NVMe Target Driver RDMA Transport NVMe Storage Intel® OPA HFI Host Target PCIe Transport ~1.5M 4k Random IOPS 99% Bandwidth Efficiency
  27. 27. Storage: Optane
  28. 28. 9 Tighter System-Level Integration Innovative Memory-Storage Hierarchy *cache, memory or hybrid mode Compute Node Processor Memory Bus I/O Node Remote Storage Compute Today Caches Local Memory Local Storage Parallel File System (Hard Drive Storage) HigherBandwidth. LowerLatencyandCapacity Much larger memory capacities keep data in local memory Local memory is now faster & in processor package Compute Future Caches Intel® DIMMs based on 3D XPoint™ Technology Burst Buffer Node with Intel® Optane™ Technology SSDs Parallel File System (Hard Drive Storage) On-Package High Bandwidth Memory* SSD Storage Intel® Optane™ Technology SSDsI/O Node storage moves to compute node Some remote data moves onto I/O node Local Memory Intel® Scalable System Framework
  29. 29. Bridging the Memory-Storage Gap: Intel® Optane™ Technology Based on 3D XPoint™
     • SSD: Intel® Optane™ SSDs: 5-7x Current Flagship NAND-Based SSDs (IOPS) (1)
     • Intel® DIMMs Based on 3D XPoint™:
       • DRAM-like performance: 1,000x Faster than NAND (1); 1,000x the Endurance of NAND (2)
       • Hard drive capacities: 10x More Dense than Conventional Memory (3)
     1 Performance difference based on comparison between 3D XPoint™ Technology and other industry NAND
     2 Endurance difference based on comparison between 3D XPoint™ Technology and other industry NAND
     3 Density difference based on comparison between 3D XPoint™ Technology and other industry DRAM
     Intel® Scalable System Framework
  30. 30. NVM Solutions Group. Intel® Optane™ SSDs for Data Center
     • Breakthrough Performance (IOPS)
     • Predictably Fast Service (QoS)
     • Responsive Under Load (Low Latency)
     • Ultra-high Endurance
     Technology claims are based on comparisons of latency, density and write cycling metrics amongst memory technologies recorded on published specifications of in-market memory products against internal Intel specifications. Intel® Optane™ SSD prototype compared to the Intel® SSD DC P3700 Series (NAND)
  31. 31. NVM Solutions Group 31 Intel® Optane™ SSD Use Cases DRAM PCIe* PCIe Intel® 3D NAND SSDs Intel® Optane™ SSD Fast Storage and Cache Intel® Xeon® ‘memory pool’DRAM PCIe Intel® 3D NAND SSDs Intel® Optane™ SSD DDR DDR PCIe Extend Memory Intel® Xeon® *Other names and brands may be claimed as the property of others
  32. 32. NVM Solutions Group 32 5-8x faster at low Queue Depths1 Vast majority of applications generate low QD storage workloads 1. Common Configuration - Intel 2U Server System, OS CentOS 7.2, kernel 3.10.0-327.el7.x86_64, CPU 2 x Intel® Xeon® E5-2699 v4 @ 2.20GHz (22 cores), RAM 396GB DDR @ 2133MHz. Configuration – Intel® Optane™ SSD DC P4800X 375GB and Intel® SSD DC P3700 1600GB. Performance – measured under 4K 70-30 workload at QD1-16 using fio-2.15. Breakthrough Performance Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance.
  33. 33. NVM Solutions Group 33 up to 60x better at 99% QoS1 Ideal for critical applications with aggressive latency requirements 1. Common Configuration – Intel 2U Server System, OS CentOS 7.2, kernel 3.10.0-327.el7.x86_64, CPU 2 x Intel® Xeon® E5-2699 v4 @ 2.20GHz (22 cores), RAM 396GB DDR @ 2133MHz. Configuration – Intel® Optane™ SSD DC P4800X 375GB and Intel® SSD DC P3700 1600GB. QoS – measures 99% QoS under 4K 70-30 workload at QD1 using fio-2.15. Predictably Fast Service Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance.
  34. 34. NVM Solutions Group. Ultra Endurance
     Endurance (DWPD): 0.5 and 3 for MLC/TLC 2D/3D NAND SSDs; 30 for Intel® Optane™ SSD
     Up to 10x more Total Bytes Written at similar capacity (1)
     Architected for endurance scaling:
       • ‘Write in place’ technology
       • Non-destructive write process
     Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance.
     1. Comparing projected Intel® Optane™ SSD 750GB specifications to actual Intel® SSD DC P4600 1.6TB specifications. Total Bytes Written (TBW) calculated by multiplying specified or projected DWPD x specified or projected warranty duration x 365 days/year.
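The Total Bytes Written calculation in footnote 1 can be sketched as below. DWPD (drive writes per day) is expressed in multiples of the drive's capacity, so capacity enters the product alongside the DWPD, warranty duration, and 365 days/year terms that the footnote lists. The 5-year warranty is a hypothetical value chosen only for illustration, since the slide does not state one, and the function name is likewise illustrative.

```python
def total_bytes_written_tb(dwpd: float, capacity_gb: float,
                           warranty_years: float) -> float:
    """Total Bytes Written over the warranty period, in TB (1 TB = 1000 GB)."""
    gb_per_day = dwpd * capacity_gb        # one "drive write" = full capacity
    return gb_per_day * warranty_years * 365 / 1000


WARRANTY_YEARS = 5  # hypothetical; the slide does not state the warranty term

# Equal capacity (750 GB), comparing the 30 DWPD and 3 DWPD figures on the slide.
optane_tbw = total_bytes_written_tb(30, 750, WARRANTY_YEARS)
nand_tbw = total_bytes_written_tb(3, 750, WARRANTY_YEARS)

print(f"30 DWPD: {optane_tbw:.1f} TB written")
print(f" 3 DWPD: {nand_tbw:.1f} TB written")
print(f"ratio: {optane_tbw / nand_tbw:.1f}x")
```

At equal capacity and warranty length the ratio reduces to the ratio of the DWPD ratings, which is how 30 versus 3 DWPD lines up with the "up to 10x" claim at similar capacity.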
  35. 35. AI: Nervana
  36. 36. The coming flood of data. By 2020:
     • The average internet user will generate ~1.5 GB of traffic per day
     • Smart hospitals will generate over 3,000 GB per day
     • Self driving cars will generate over 4,000 GB per day… each
     • A connected plane will generate over 40,000 GB per day
     • A connected factory will generate over 1,000,000 GB per day
     Self driving car sensor data rates: radar ~10-100 KB per second; sonar ~10-100 KB per second; GPS ~50 KB per second; lidar ~10-70 MB per second; cameras ~20-40 MB per second
     All numbers are approximated
     http://www.cisco.com/c/en/us/solutions/service-provider/vni-network-traffic-forecast/infographic.html
     http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
     https://datafloq.com/read/self-driving-cars-create-2-petabytes-data-annually/172
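The per-sensor rates on this slide can be summed to see how they relate to the "over 4,000 GB per day" figure for a self-driving car. The sketch below is a back-of-the-envelope estimate. It assumes decimal units (1 MB = 1000 KB, 1 GB = 1000 MB) and that all sensors stream continuously for the stated number of hours, neither of which the slide specifies. All names are illustrative.

```python
# Approximate (low, high) data rates from the slide, converted to MB/s.
SENSOR_RATES_MB_S = {
    "radar":   (0.01, 0.1),    # ~10-100 KB per second
    "sonar":   (0.01, 0.1),    # ~10-100 KB per second
    "gps":     (0.05, 0.05),   # ~50 KB per second
    "lidar":   (10.0, 70.0),   # ~10-70 MB per second
    "cameras": (20.0, 40.0),   # ~20-40 MB per second
}


def daily_volume_gb(hours_driven: float) -> tuple:
    """Return (low, high) data volume in GB for the given hours of operation."""
    seconds = hours_driven * 3600
    low_rate = sum(low for low, _ in SENSOR_RATES_MB_S.values())
    high_rate = sum(high for _, high in SENSOR_RATES_MB_S.values())
    return low_rate * seconds / 1000, high_rate * seconds / 1000


low_gb, high_gb = daily_volume_gb(24)
print(f"24 h of continuous operation: {low_gb:.0f} to {high_gb:.0f} GB")
```

Under these assumptions the 4,000 GB per day figure sits inside the range for a full day of operation, and lidar and cameras dominate the total while radar, sonar, and GPS contribute well under 1 MB/s combined.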
  37. 37. Analytics needs AI. Hindsight: What Happened. Insight: What Happened and Why. Foresight: What Will Happen, When, and Why. Simulation-Driven Analysis and Decision-Making. Self-Learning and Completely Automated Enterprise. Mature Data Lake. Computerized Human Thought Simulation and Actions Towards Autonomic Enterprise. Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, Prescriptive Analytics, Cognitive Analytics. AI is a large category all on its own, and a vital tool for reaching higher maturity & scale data analytics. Advanced Analytics, Operational Analytics. Today, Emerging
  38. 38. AI compute cycles will grow 12X by 2020. The next big wave: mainframes, standards-based servers, cloud computing, artificial intelligence. Data deluge. Compute breakthrough. Innovation surge. Source: Intel forecast.
  39. 39. End-to-end AI. Intel has a complete end-to-end portfolio. Machine/deep learning; reasoning systems; tools & standards; computer vision. Programmable solutions; memory/storage; networking; communications; 5G. Things & devices; cloud; data center. Accelerant technologies …
  40. 40. Intel strategy: optimized deep learning environment. Fuel the development of vertical solutions. Deliver best single node and multi-node performance. Accelerate design, training, and deployment. Drive optimizations across open source machine learning frameworks. Maximum performance on Intel architecture. Nervana Cloud™; Intel® Math Kernel Library (Intel® MKL); Intel® MKL-DNN; Intel® Nervana™ Graph; Training; Inference. © 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
  41. 41. AI datacenter. Categories: all purpose; highly-parallel; flexible acceleration; deep learning. Intel® Xeon® Processor Family (training & inference): scalable performance for widest variety of AI & other datacenter workloads – including deep learning training & inference. Intel® Xeon Phi™ Processor (Knights Mill✝) (faster DL training): scalable performance optimized for even faster deep learning training and select highly-parallel datacenter workloads*. Intel® FPGA (enhanced DL inference): scalable acceleration for deep learning inference in real-time with higher efficiency, and wide range of workloads & configurations. Crest Family✝ (deep learning by design): scalable acceleration with best performance for intensive deep learning training & inference. ✝Codename for product that is coming soon. All performance positioning claims are relative to other processor technologies in Intel's AI datacenter portfolio. *Knights Mill (KNM); select = single-precision highly-parallel workloads generally scale to >100 threads and benefit from more vectorization, and may also benefit from greater memory bandwidth e.g. energy (reverse time migration), deep learning training, etc. All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
  42. 42. Most agile AI platform: Intel® Xeon® Scalable processors for AI. Scalable performance for widest variety of AI & other datacenter workloads – including deep learning. Built-in ROI: begin your AI journey today using existing, familiar infrastructure. Potent performance: train in hours instead of days, with up to 113X (2) perf vs. Intel Xeon E5 v3 (2.2x excluding optimized SW (1)). Production-ready: robust support for full range of AI deployments. 1,2 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Source: Intel measured as of November 2016. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804. See slide 15 for configuration details.
  43. 43. Intel® Xeon® inference & training performance. Advance previous generation AI workload performance with Intel® Xeon® Scalable Processors. INFERENCE THROUGHPUT: Intel® Xeon® Platinum 8180 Processor delivers up to 2.4x higher Neon ResNet 18 inference throughput compared to Intel® Xeon® Processor E5-2699 v4 (inference throughput batch size: 1). TRAINING THROUGHPUT: Intel® Xeon® Platinum 8180 Processor delivers up to 2.2x higher Neon ResNet 18 training throughput compared to Intel® Xeon® Processor E5-2699 v4 (training throughput batch size: 256). Configuration details on slides 18, 20. Source: Intel measured as of June 2017. Inference and training throughput measured with FP32 instructions. Inference with INT8 will be higher.
  44. 44. Intel® Xeon® platform performance: hardware plus optimized software. Deliver significant AI performance with hardware and software optimizations on Intel® Xeon® Scalable Processors (optimized frameworks, optimized Intel® MKL libraries). INFERENCE THROUGHPUT: Intel® Xeon® Platinum 8180 Processor delivers up to 138x higher Intel Optimized Caffe GoogleNet v1 with Intel® MKL inference throughput compared to Intel® Xeon® Processor E5-2699 v3 with BVLC-Caffe. TRAINING THROUGHPUT: Intel® Xeon® Platinum 8180 Processor delivers up to 113x higher Intel Optimized Caffe AlexNet with Intel® MKL training throughput compared to Intel® Xeon® Processor E5-2699 v3 with BVLC-Caffe. Inference using FP32; batch size: Caffe GoogleNet v1, 256; AlexNet, 256. Configuration details on slides 18, 25. Source: Intel measured as of June 2017. Inference and training throughput measured with FP32 instructions. Inference with INT8 will be higher.
  45. 45. Intel® Xeon Phi™ processor (Knights Mill): faster DL training, highly-parallel. Scalable performance optimized for even faster deep learning training and select highly-parallel datacenter workloads*. Themes: faster time-to-train; efficient scaling; future ready. Delivers up to 4X deep learning performance over Knights Landing✝. New instruction sets deliver enhanced lower precision performance. Time-to-train reduction is the primary benchmark to judge deep learning training performance. Direct access of up to 400 GB of memory with no PCIe performance lag (vs. GPU: 16 GB). Efficient scaling further reduces time-to-train when utilizing scaled Knights Mill systems. Up to 400X deep learning performance on existing HW via Intel SW optimization. Share deep learning software investments across Intel platforms via Intel deep learning software tools. Binary-compatible with Intel® Xeon® processor. ✝Knights Landing is the former codename for the Intel® Xeon Phi™ processor family that was released in 2016. Configuration details on final slides. *Knights Mill (KNM); select = single-precision highly-parallel workloads generally scale to >100 threads and benefit from more vectorization, and may also benefit from greater memory bandwidth e.g. energy (reverse time migration), deep learning training, etc. Source: Intel measured as of November 2016.
  46. 46. Deep learning by design: Crest family. Scalable acceleration with best performance for intensive deep learning training & inference, period. Themes: custom hardware; blazing data access; high-speed scalability. Unprecedented compute density. Large reduction in time-to-train. 32 GB of in-package memory via HBM2 technology. 8 Tera-bits/s of memory access speed. 12 bi-directional high-bandwidth links. Seamless data transfer via interconnects. 1 Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. Source: Intel measured as of November 2016. 2017
  47. 47. AI frameworks optimized for Intel architecture: BigDL, MLlib, and more frameworks enabled via Intel® Nervana™ Graph (future). See Roadmap for availability. Other names and brands may be claimed as the property of others. Intel®'s reference deep learning framework committed to best performance on all hardware: intelnervana.com/neon
  48. 48. Legal Notices and Disclaimers. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. Intel, the Intel logo, Xeon, Intel Optane, and 3D XPoint are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © 2017 Intel Corporation.
