This document discusses trends in high-performance computing (HPC) and big data analytics. It notes that while HPC and big data have traditionally had different resource needs and programming models, they are converging as big data workloads require more real-time processing and HPC workloads incorporate more data-driven analytics. The document outlines challenges in both HPC and big data, such as system bottlenecks, energy efficiency, and barriers to wider usage, and advocates for more integrated solutions that combine storage, networking, processing, and memory to address these challenges.
High Performance Computing and Big Data Geoffrey Fox
We propose a hybrid software stack in which large-scale data systems for both research and commercial applications run on the commodity (Apache) Big Data Stack (ABDS), with High Performance Computing (HPC) enhancements used typically to improve performance. We give several examples taken from bio- and financial informatics.
We look in detail at parallel and distributed run-times, including MPI from HPC and Apache Storm, Heron, Spark, and Flink from ABDS, stressing that one needs to distinguish the different needs of parallel (tightly coupled) and distributed (loosely coupled) systems.
We also study "Java Grande", the principles that allow Java codes to perform as fast as those written in more traditional HPC languages. We also note the differences between capability (individual jobs using many nodes) and capacity (many independent jobs) computing.
We discuss how this HPC-ABDS concept allows one to discuss convergence of Big Data, Big Simulation, Cloud and HPC Systems. See http://hpc-abds.org/kaleidoscope/
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Databricks
At Sam's Club we have a long history of using Apache Spark and Hadoop. Projects from all parts of the company use Apache Spark, from fraud detection to product recommendations. Because of the scale of our business, with billions of transactions and trillions of events, it is often essential to use big data technologies. Until recently all of this work ran on several large on-premises Hadoop clusters. As part of our transition to the public cloud we needed to build out an enterprise-scale data platform. Azure Databricks is a key component of this platform, giving our data scientists, engineers, and business users the ability to easily work with the company's data. We will discuss the architecture considerations that led to using multiple Databricks workspaces and external Azure blob storage. We will also discuss how we move massive amounts of data to Azure on a daily basis with Airflow, as well as the self-service tools we created to help users get their data to Azure and for us to manage the platform. Finally we will discuss our security considerations and how they played out in our architecture.
Authors: Andrew Ray, Craig Covey
State of the Art Robot Predictive Maintenance with Real-time Sensor DataMathieu Dumoulin
Our Strata Beijing 2017 presentation slides, in which we show how to use data from a movement sensor, in real time, to do anomaly detection at scale using standard enterprise big data software.
An Introduction to the MapR Converged Data PlatformMapR Technologies
Listen to the webinar on-demand: http://info.mapr.com/WB_Partner_CDP_Intro_EMEA_DG_17.05.31_RegistrationPage.html
In this 90-minute webinar, we discuss:
- The MapR Converged Data Platform and its components
- Use cases for the Converged Data Platform
- MapR Converged Partner Program
- How to get started with MapR
- Becoming a partner
Indexing 3-dimensional trajectories: Apache Spark and Cassandra integrationCesare Cugnasco
Data visualization can be a tricky problem, even more so if the dataset is made of several billion 3-dimensional particles moving over time. The talk will focus on some simple indexing and data-thinning techniques and on how (and how not) to implement them with Cassandra and Spark.
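As an illustration of the kind of simple indexing the abstract alludes to, here is a minimal sketch of a Z-order (Morton) key for 3-D particle positions, plus naive thinning. Interleaving the bits of quantized x/y/z coordinates yields a single integer that keeps spatially close particles close in key space, so it could serve as a Cassandra partition or clustering key. The 10-bit-per-axis resolution and the thinning stride are illustrative assumptions, not values from the talk.

```python
def part1by2(n: int) -> int:
    """Spread the low 10 bits of n so two zero bits separate each data bit."""
    n &= 0x3FF
    n = (n ^ (n << 16)) & 0xFF0000FF
    n = (n ^ (n << 8)) & 0x0300F00F
    n = (n ^ (n << 4)) & 0x030C30C3
    n = (n ^ (n << 2)) & 0x09249249
    return n

def morton3d(x: int, y: int, z: int) -> int:
    """Interleave three 10-bit coordinates into one 30-bit Morton key."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

def thin(particles, stride: int):
    """Naive data thinning: keep every stride-th particle after sorting by
    Morton key, so the retained sample stays spatially balanced."""
    ordered = sorted(particles, key=lambda p: morton3d(*p))
    return ordered[::stride]
```

Sorting or range-partitioning on such a key is one common way to give a key-value store like Cassandra spatial locality without a dedicated spatial index.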
Watch this recorded webcast and listen to Infochimps CSO and Co-Founder, Dhruv Bansal, and Think Big Analytics Principal Architect, Douglas Moore, share successful use cases and recommendations for building real-time predictive analytics in your enterprise.
Delivered this talk as part of Spark & Kafka Summit 2017 organized by Unicom Learning Conference.
Big data processing is undoubtedly one of the most exciting areas in computing today, and remains an area of fast evolution and introduction of new ideas. Apache Spark is at the cusp of overtaking MapReduce to emerge as the de facto standard for big data processing. Thanks to its multi-functional capabilities (SQL, Structured Streaming, ML Pipelines, and GraphX) under one unified platform, Spark is now a dominant compute technology across various industry use cases and real-time analytics applications. In the past few years Apache Spark has seen successful production and commercial deployments across the e-commerce, healthcare, and travel industries.
The session gave the audience an understanding of the latest and upcoming trends in big data analytics and the role of Spark in enabling future use cases of advanced analytics.
The session explored the latest concepts from Apache Spark 2.x and introduced various ML/DL frameworks that can run on Spark, along with some real-life use cases and applications from the retail and IoT verticals.
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahDatabricks
Insnap, a hyper-personalized ML-based platform acquired by The Honest Company, has been used to build a real-time data platform based on Apache Spark, Cassandra and Redshift. Users’ behavioral and transactional data have been used to build data models and ML models, and to drive use cases for marketing, growth, finance and operations.
Learn how The Honest Company has used Spark as a workhorse for: 1) collecting, transforming (ETL), and storing data from various sources including MySQL, MongoDB, JDE, Google Analytics, Facebook, Localytics, and REST APIs; 2) building data models and aggregating and generating reports on revenue, order-fulfillment tracking, data-pipeline monitoring, and subscriptions; 3) using ML to build models for user-acquisition, LTV, and recommendation use cases. Spark replaced the monolithic codebase with flexible, scalable, and robust pipelines, and Databricks helped The Honest Company focus on data instead of maintaining infrastructure. While Honest users got delightful recommendations that improved their experience, data users at Honest came to understand users much better, segmenting them with behavioral information and advanced ML models, leading to increased revenue and retention.
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDatabricks
Does more data always improve ML models? Is it better to use distributed ML instead of single node ML?
In this talk I will show that while more data often improves DL models in high-variance problem spaces (with semi- or unstructured data) such as NLP, image, and video, more data does not significantly improve high-bias problem spaces where traditional ML is more appropriate. Additionally, even in the deep learning domain, single-node models can still outperform distributed models via transfer learning.
Data scientists face pain points when running many models in parallel, automating the experimental setup, and getting others (especially analysts) within an organization to use their models. Databricks addresses these problems using pandas UDFs, the ML runtime, and MLflow.
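The "many models in parallel" pattern the talk refers to can be sketched with the standard library alone. This is a hypothetical stand-in for a Spark grouped-map pandas UDF: partition the data by key, then fit an independent model (here a trivial least-squares line) per partition. With PySpark, each per-group fit would run on a separate executor; here we simply map over the groups.

```python
from collections import defaultdict
from statistics import mean

def fit_line(points):
    """Ordinary least squares for y = a*x + b over (x, y) pairs."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def fit_per_group(rows):
    """rows: iterable of (group_key, x, y) records.
    Returns {group_key: (slope, intercept)}, one model per group."""
    groups = defaultdict(list)
    for key, x, y in rows:
        groups[key].append((x, y))
    return {key: fit_line(pts) for key, pts in groups.items()}
```

In the distributed version, the dictionary comprehension is what a grouped-map UDF parallelizes, and a tool like MLflow would log each group's fitted parameters as a separate run.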
Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...Databricks
Analytics for risk management is applied in many fields, especially financial services. We present a framework for accelerated risk analytics and show a large-scale financial-sector application where this framework is used to run backtesting algorithms on risk-based securities such as options. These applications require highly computationally intensive operations on extremely large data sets with objects numbering in the tens of billions.
Intel FPGA and the FinLib library for financial applications are used to offload the computation; however, another challenging problem (which we have resolved) is how to feed data to the FPGA at optimal speed without customized coding. A combination of Apache Spark and Levyx's persistent dataframes addresses this problem: these dataframes absorb the computation from Spark and offload it to FinLib in an automated way. This example can be expanded to many other areas of risk management, such as insurance and cybersecurity.
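To give a flavor of the per-option computation being offloaded, here is a minimal sketch of a Monte Carlo price for a European call under geometric Brownian motion. This is purely illustrative: FinLib's actual API is not published in this abstract, and every parameter and function name below is an assumption.

```python
import math
import random

def mc_call_price(spot, strike, rate, vol, maturity, n_paths=100_000, seed=42):
    """Monte Carlo price of a European call: simulate terminal prices under
    GBM, average the discounted payoff max(S_T - K, 0)."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol * vol) * maturity
    diff = vol * math.sqrt(maturity)
    total_payoff = 0.0
    for _ in range(n_paths):
        s_t = spot * math.exp(drift + diff * rng.gauss(0.0, 1.0))
        total_payoff += max(s_t - strike, 0.0)
    return math.exp(-rate * maturity) * total_payoff / n_paths
```

Backtesting tens of billions of such objects is exactly the workload where per-path loops like this one get pushed down to an accelerator, with the host framework (Spark plus persistent dataframes, in the talk's design) handling only data movement.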
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
Data warehouses have been the standard tool for analyzing data created by business operations. In recent years, increasing data volumes, new types of data formats, and emerging analytics technologies such as machine learning have given rise to modern data lakes. Connecting application databases, data warehouses, and data lakes using real-time data pipelines can significantly improve the time to action for business decisions. More: http://info.mapr.com/WB_MapR-StreamSets-Data-Warehouse-Modernization_Global_DG_17.08.16_RegistrationPage.html
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...inside-BigData.com
DK Panda from Ohio State University presented this deck at the Switzerland HPC Conference.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Mem- cached on modern HPC clusters. An overview of RDMA-based designs for multiple com- ponents of Hadoop (HDFS, MapReduce, RPC and HBase), Spark, and Memcached will be presented. Enhanced designs for these components to exploit in-memory technology and parallel file systems (such as Lustre) will be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video presentation: https://www.youtube.com/watch?v=glf2KITDdVs
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Big Data Taiwan 2014 Track2-2: Informatica Big Data SolutionEtu Solution
Speaker: Senior Product Consultant, Informatica | 尹寒柏
Session abstract: In the Big Data era, the competition is not about the quantity of data but about the depth with which you understand it. Now that big data technology has matured, CXOs without an IT background can turn CI (Customer Intelligence), once just a piece of jargon, into practice: moving from BI to CI, tracking the pulse of the consumer economy and gaining insight into customer intent. One mindset to keep in the Big Data era is that, in the end, the competition is not only about growing data volumes but about who understands the data more deeply. Informatica is the answer to this challenge. With Informatica, enterprises relieve the enormous pressure of delivering trustworthy data in a timely manner; and as data volume and complexity keep rising, Informatica can aggregate data faster, making it meaningful and usable for improving efficiency, raising quality, ensuring certainty, and exploiting advantages. Informatica offers a faster and more effective way to reach this goal and is 精誠集團 (SYSTEX)'s tool of choice for the Big Data era.
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Precisely
So you built your Hadoop cluster. How do you get data from hundreds of database tables, streaming Kafka sources, and data shared by 20-year-old COBOL programs all in there, working together quickly, efficiently, and securely? With many customers asking this same question, Hortonworks recently expanded its partnership with Syncsort to provide optimized ETL onboarding for Hadoop. During this talk, we'll discuss how a next-generation ETL tool, built on contributions to the open source community and natively integrated with Hadoop, can drive lasting value for your organization: 1) Seamlessly onboard data from all your enterprise sources, batch and streaming, into Hadoop for fast and easy analytics. 2) Stay agile and simplify your environment with a "design once, deploy anywhere" approach that minimizes disruption and risk in the face of a rapidly evolving big data ecosystem. 3) Secure, govern, and manage your data with full integration with Apache Ambari, Apache Ranger, and more. These benefits come to life with real customer case studies. Learn how a national insurance company and a global hotel chain are using Hortonworks HDP and Syncsort DMX-h to get bigger insights from their enterprise data, securely, efficiently, and cost-effectively, without spending hundreds of man-hours.
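The "design once, deploy anywhere" idea can be sketched in a few lines: express the transformation as a plain function over records, then reuse it unchanged for a batch source (a list or file) or a streaming source (a generator). The record field names below are invented for illustration and not taken from the talk.

```python
def transform(record):
    """One onboarding step: normalize field names and types."""
    return {
        "customer_id": int(record["CUST_ID"]),
        "amount_usd": round(float(record["AMT"]), 2),
    }

def run_pipeline(source, sink):
    """Works identically whether `source` is a finite batch or an
    unbounded stream, because the pipeline only iterates over it."""
    for record in source:
        sink.append(transform(record))
    return sink
```

A production tool generalizes this: the same declared transformation graph is compiled to a MapReduce/Spark batch job or to a streaming job, so the logic is not rewritten when the runtime changes.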
Enabling Real-Time Business with Change Data CaptureMapR Technologies
Machine learning (ML) and artificial intelligence (AI) enable intelligent processes that can autonomously make decisions in real-time. The real challenge for effective ML and AI is getting all relevant data to a converged data platform in real-time, where it can be processed using modern technologies and integrated into any downstream systems.
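As a minimal sketch of what change data capture does, assuming a polling-style approach for simplicity: diff the current table snapshot against the previous one and emit insert/update/delete events that a downstream consumer (such as a stream processor feeding the converged platform) can apply. Real CDC tools usually read the database's transaction log rather than polling snapshots.

```python
def capture_changes(previous, current):
    """previous/current: {primary_key: row}. Returns ordered change events."""
    events = []
    for pk, row in current.items():
        if pk not in previous:
            events.append(("insert", pk, row))
        elif previous[pk] != row:
            events.append(("update", pk, row))
    for pk in previous:
        if pk not in current:
            events.append(("delete", pk, None))
    return events
```

Applying these events in order to a downstream store keeps it continuously in sync with the source, which is the property that makes real-time ML and AI pipelines possible.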
The Synapse IoT Stack: Technology Trends in IOT and Big DataInMobi Technology
This is the presentation from Big Data November Bangalore Meetup 2014.
http://technology.inmobi.com/events/bigdata-meetup
Talk Outline:
- What does THE HIVE provide?
- Goals of Synapse Tech Stack
- THE HIVE Startups
- Demystifying IoT Market
- Synapse Stack for IoT
- Big Data Challenge
- Synapse Lambda Architecture
- Synapse Components
- Synapse Internals
- AKILI – Synapse Machine Learning
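The Lambda Architecture item in the outline above can be sketched as follows: a batch layer periodically recomputes an exact view from the master dataset, a speed layer keeps incremental state for events that arrived since the last batch, and queries merge the two. The event shape (simple keyed counts) is an invented simplification, not Synapse's actual schema.

```python
from collections import Counter

class LambdaCounts:
    """Toy Lambda Architecture: batch view + speed view, merged at query time."""

    def __init__(self):
        self.batch_view = Counter()   # recomputed in full from the master dataset
        self.speed_view = Counter()   # incremental, covers only recent events

    def ingest(self, key):
        self.speed_view[key] += 1     # speed layer: visible immediately

    def rebuild_batch(self, master_events):
        self.batch_view = Counter(master_events)  # full batch recompute
        self.speed_view.clear()       # recent events now covered by the batch view

    def query(self, key):
        return self.batch_view[key] + self.speed_view[key]
```

In a real deployment the batch view would be rebuilt by a Hadoop/Spark job and the speed view maintained by a stream processor; the merge-at-query-time step is what hides the batch layer's latency.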
H2O Machine Learning with KNIME Analytics Platform - Christian Dietz - H2O AI...Sri Ambati
This talk was recorded in London on October 30, 2018.
KNIME Analytics Platform is an easy-to-use and comprehensive open source data integration, analysis, and exploration platform, enabling data scientists to visually compose end-to-end data analysis workflows. The over 2,000 available modules ("nodes") cover each step of the analysis workflow, including blending heterogeneous data types, data transformation, wrangling and cleansing, advanced data visualization, and model training and deployment.
Many of these nodes are provided through open source integrations (why reinvent the wheel?). This provides seamless access to large open source projects such as Keras and Tensorflow for deep learning, Apache Spark for big data processing, Python and R for scripting, and more. These integrations can be used in combination with other KNIME nodes meaning that data scientists can freely select from a vast variety of options when tackling an analysis problem.
The integration of H2O in KNIME offers an extensive number of nodes encapsulating the functionality of the H2O open source machine learning libraries, making it easy to use H2O algorithms from a KNIME workflow without touching any code: each of the H2O nodes looks and feels just like a normal KNIME node, and the data scientist benefits from H2O's high-performance libraries and proven quality during execution. For prototyping, these algorithms are executed locally; however, training and deployment can easily be scaled up using a Sparkling Water cluster.
In our talk we give a short introduction to KNIME Analytics Platform and then demonstrate how data scientists benefit from using KNIME Analytics Platform and H2O Machine Learning in combination by using a real world analysis example.
Bio: Christian received a Master’s degree in Computer Science from the University of Konstanz. Having gained experience as a research software engineer at the University of Konstanz, where he developed frameworks and libraries in the fields of bioimage analysis and machine learning, Christian moved on to become a software engineer at KNIME. He now focuses on developing new functionalities and extensions for KNIME Analytics Platform. Some of his recent projects include deep learning integrations built upon Keras and Tensorflow, extensions for image analysis and active learning, and the integration of H2O Machine Learning and H2O Sparkling Water in KNIME Analytics Platform.
Real-Time Robot Predictive Maintenance in ActionDataWorks Summit
Industry 4.0 IoT applications promise vast gains in productivity from reduced downtime, higher product quality and higher efficiency. Modern industrial robots integrate hundreds of sensors of all kinds, generating tremendous volumes of data rich in valuable information. However, the reality is that some of the most advanced industrial makers in the world are barely getting started making use of this data, with relatively rudimentary, bespoke monitoring systems built at tremendous cost.
We believe that it is now possible, using a well-chosen selection of enterprise open source big data projects, to successfully deploy Industry 4.0 pilot use cases in a matter of months, at a small fraction of the cost of equivalent projects at leading high-tech makers. We propose to show a working prototype of just such a system, and explain in some detail how it was made.
Our presentation describes a working real-time ML-based anomaly detection system. We show a working industrial robot analog fitted with a wireless movement sensor. Our system scores the data in a cloud-based cluster. For added realism, the system we demonstrate live includes a working augmented-reality headset that can show the real-time status overlaid on the working robot.
This talk is about demonstrating a concrete example of a real-time predictive maintenance system, built as a series of microservices connected by Kafka streams and powered by the excellent H2O distributed Machine Learning tool. Our goal is for our attendees to get a feel for what can be realistically achieved by a few non-genius-level engineers in a few months of effort using the best in open source technology for real-time streams (Kafka) and Machine learning (H2O).
Where appropriate, we’ll mention how our choice of using the MapR Converged Data Platform made the development easier thanks to some of its unique features.
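As a concrete illustration of the scoring step in such a pipeline, here is a minimal sketch, assuming a simple rolling z-score detector in place of the H2O model the talk actually uses (whose details are not given in this abstract). Each sensor reading is flagged when it deviates from the recent window by more than `threshold` standard deviations; the window size and threshold are illustrative defaults.

```python
from collections import deque
from statistics import mean, pstdev

class AnomalyDetector:
    """Rolling z-score anomaly detector over a stream of sensor readings."""

    def __init__(self, window=50, threshold=3.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def score(self, x):
        """Return True if x is anomalous relative to the current window,
        then fold x into the window."""
        anomalous = False
        if len(self.window) >= 10:  # wait for a minimal baseline
            mu, sigma = mean(self.window), pstdev(self.window)
            if sigma > 0 and abs(x - mu) > self.threshold * sigma:
                anomalous = True
        self.window.append(x)
        return anomalous
```

In the microservice design the talk describes, each Kafka message would carry one reading, and `score` would be the body of the consumer loop, with flagged events published to a downstream alerting topic.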
Speaker
Cao Yi, MapR
Blue Pill/Red Pill: The Matrix of Thousands of Data StreamsDatabricks
Designing a streaming application that has to process data from one or two streams is easy; any streaming framework that provides scalability, high throughput, and fault tolerance will do. But when the number of streams grows into the hundreds or thousands, managing them can be daunting. How would you share resources among thousands of streams that all run 24×7, manage their state, apply advanced streaming operations, and add or delete streams without restarting? This talk explains common scenarios and shows techniques that can handle thousands of streams using Spark Structured Streaming.
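One way to frame the multiplexing idea in miniature, under the assumption of simple keyed running state: route every event through a single physical pipeline and keep per-stream state in a keyed map, so adding or deleting a logical stream is a dictionary operation rather than a job restart. Structured Streaming's techniques operate at a much larger scale, but the shape is the same.

```python
from collections import defaultdict

class MultiplexedPipeline:
    """Thousands of logical streams sharing one physical pipeline."""

    def __init__(self):
        self.state = defaultdict(int)  # per-stream running state (here: a sum)
        self.deleted = set()

    def process(self, stream_id, value):
        if stream_id in self.deleted:
            return None                # stream was removed: drop, no restart
        self.state[stream_id] += value
        return self.state[stream_id]

    def delete_stream(self, stream_id):
        """Remove a logical stream while all others keep running."""
        self.deleted.add(stream_id)
        self.state.pop(stream_id, None)
```

The design choice this illustrates: resources (here one process loop, in Spark one set of executors) are shared across all streams, while isolation lives entirely in the keyed state.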
High Performance Computing - A Key Factor for the Competitiveness of the Country, ...Igor José F. Freitas
Video: https://www.youtube.com/watch?v=8cFqNwhQ7uE
A key factor for the competitiveness of the country, of science, and of industry.
Talk delivered during Intel Innovation Week 2015.
This talk is about demonstrating a concrete example of a real-time predictive maintenance system, built as a series of microservices connected by Kafka streams and powered by the excellent H2O distributed Machine Learning tool. Our goal is for our attendees to get a feel for what can be realistically achieved by a few non-genius-level engineers in a few months of effort using the best in open source technology for real-time streams (Kafka) and Machine learning (H2O).
Where appropriate, we’ll mention how our choice of using the MapR Converged Data Platform made the development easier thanks to some of its unique features.
Speaker
Cao Yi, MapR
Blue Pill/Red Pill: The Matrix of Thousands of Data StreamsDatabricks
Designing a streaming application which has to process data from 1 or 2 streams is easy. Any streaming framework which provides scalability, high-throughput, and fault-tolerance would work. But when the number of streams start growing in order 100s or 1000s, managing them can be daunting. How would you share resources among 1000s of streams with all of them running 24×7? Manage their state, Apply advanced streaming operations, Add/Delete streams without restarting? This talk explains common scenarios & shows techniques that can handle thousands of streams using Spark Structured Streaming.
Computação de Alto Desempenho - Fator chave para a competitividade do País, d...Igor José F. Freitas
Vídeo: https://www.youtube.com/watch?v=8cFqNwhQ7uE
Fator chave para a competitividade do País, da Ciência e da Indústria.
Palestra ministrada durante o Intel Innovation Week 2015 .
If you're like most of the world, you're on an aggressive race to implement machine learning applications and on a path to get to deep learning. If you can give better service at a lower cost, you will be the winners in 2030. But infrastructure is a key challenge to getting there. What does the technology infrastructure look like over the next decade as you move from Petabytes to Exabytes? How are you budgeting for more colossal data growth over the next decade? How do your data scientists share data today and will it scale for 5-10 years? Do you have the appropriate security, governance, back-up and archiving processes in place? This session will address these issues and discuss strategies for customers as they ramp up their AI journey with a long term view.
Intel's Data Center & Connected Systems Group and Diane Bryant shares the latest news on the latest Intel Xeon E5v2 family of processors and technologies like Intel Network Builders to enable the re-architecture of the Data Center.
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciIntel® Software
Preprocess, visualize, and Build AI Faster at-Scale on Intel Architecture. Develop end-to-end AI pipelines for inferencing including data ingestion, preprocessing, and model inferencing with tabular, NLP, RecSys, video and image using Intel oneAPI AI Analytics Toolkit and other optimized libraries. Build at-scale performant pipelines with Databricks and end-to-end Xeon optimizations. Learn how to visualize with the OmniSci Immerse Platform and experience a live demonstration of the Intel Distribution of Modin and OmniSci.
IBM POWER - An ideal platform for scale-out deploymentsthinkASG
IBM Power Systems is the ideal platform for scale-out deployments such as Big Data, SAP HANA and anything else the requires heavy compute to achieve business goals, faster.
Intel, en el corazón del Software Defined Datacenter:
La nueva familia de procesadores Intel Xeon E5 v3
y la visión de Intel en relación con la nube híbrida y el Software Defined Infrastructure
Python Data Science and Machine Learning at Scale with Intel and AnacondaIntel® Software
Python is the number 1 language for data scientists, and Anaconda is the most popular python platform. Intel and Anaconda have partnered to bring scalability and near-native performance to Python with simple installations. Learn how data scientists can now access oneAPI-optimized Python packages such as NumPy, Scikit-Learn, Modin, Pandas, and XGBoost directly from the Anaconda repository through simple installation and minimal code changes.
Introduction to Software Defined Visualization (SDVis)Intel® Software
Software defined visualization (SDVis) is an open-source initiative from Intel and industry collaborators. Improve the visual fidelity, performance, and efficiency of prominent visualization solutions, while supporting the rapidly growing big data use on workstations through high-performance computing (HPC) on supercomputing clusters without memory limitations and cost of GPU-based solutions.
How to optimize Hortonworks Apache Spark ML workloads on Power - POWER 8/9 architecture is the latest offering from IBM and OpenPower foundation. It is the perfect platform for optimizing Hortonworks Spark's performance. During this presentation we will walk the audience through steps required to optimize YARN, HDFS, and Spark on a Power cluster.
Step required:
1) Classify workload into CPU, Memory, IO or mixed (CPU, memory, IO) intensive
2) Characterize "out-of-box" Hortonworks spark workload to understand CPU, Memory, IO and Network performance characteristics
3) Floor Plan cluster resources
4) Tune "out-of-box" workload to navigate "Roofline" Performance space in the above named dimensions
5) If workload is Memory / IO/ Network intensive bound then tune SPARK to increase operational intensity operations/byte as much as possible to make it CPU bound
6) Divide search space into regions and perform exhaustive search.
7) Identify Performance bottlenecks by resource monitoring and tune the System, JVM or application layer by profiling application and hardware counters if required.
Accelerate Machine Learning Software on Intel Architecture Intel® Software
This session presents performance data for deep learning training for image recognition that achieves greater than 24 times speedup performance with a single Intel® Xeon Phi™ processor 7250 when compared to Caffe*. In addition, we present performance data that shows training time is further reduced by 40 times the speedup with a 128-node Intel® Xeon Phi™ processor cluster over Intel® Omni-Path Architecture (Intel® OPA).
Join us for an exciting and informative preview of the broadest range of next-generation systems optimized for tomorrow’s data center workloads, Powered by 4th Gen Intel® Xeon® Scalable Processors (formerly codenamed Sapphire Rapids).
Experts from Supermicro and Intel will discuss how the upcoming Supermicro X13 systems will enable new performance levels utilizing state-of-the-art technology, including DDR5, PCIe 5.0, Compute Express Link™ 1.1, and Intel® Advanced Matrix Extensions (Intel AMX).
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...2023240532
Quantitative data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
Trends towards the merge of HPC + Big Data systems
1. WSCAD 2016 - XVII Simpósio em Sistemas Computacionais de Alto Desempenho
Aracaju, Sergipe, Brazil
October 7th, 2016
Igor Freitas
igor.freitas@intel.com
2. WSCAD 2016
Big Data Analytics: HPC != Big Data?
*Other brands and names are the property of their respective owners.
How the two stacks differ, layer by layer (HPC vs. Big Data):
- Programming model: FORTRAN / C++ applications with MPI (high performance) vs. Java, Python, Go, etc.* applications with Hadoop* (simple to use)
- Resource manager: SLURM (supports large-scale startup) vs. YARN* (more resilient to hardware failures)
- File system: Lustre* (remote storage) vs. HDFS*, Spark* (local storage)
- Hardware: compute- and memory-focused, high-performance components vs. storage-focused, standard server components
- Infrastructure: server storage on SSDs with a fabric switch vs. server storage on HDDs with an Ethernet switch
3. WSCAD 2016
Varied Resource Needs: Typical HPC vs. Typical Big Data Workloads
Workloads fall along two axes, data volume and compute intensity:
- Small data + small compute: e.g., data analysis
- Big data + small compute: e.g., search, streaming, data preconditioning
- Small data + big compute: e.g., mechanical design, multi-physics
- Big data + big compute ("Big Data Analytics / HPC in real time"): e.g., high-frequency trading, numeric weather simulation, oil & gas seismic, video traffic monitoring, personal digital health
The system cost balance across processor, memory, interconnect, and storage differs accordingly for each workload class.
4. WSCAD 2016
Trends in HPC + Big Data
- Performance: code modernization (vector instructions), many-core, FPGA
- Usability: faster time-to-market, lower costs (HPC in the cloud?), better products, HW & SW that are easy to maintain
- Portability: open, common environments
- Standards and business viability
- Integrated solutions: storage + network + processing + memory
- Public investments
6. WSCAD 2016
HPC is Foundational to Insight
Application domains: aerospace, biology, brain modeling, chemistry/chemical engineering, climate, computer-aided engineering, cosmology, cybersecurity, defense, pharmacology, particle physics, metallurgy, manufacturing/design, life sciences, government labs, geosciences/oil & gas, genomics, fluid dynamics, digital content creation, EDA, economics/financial services, fraud detection, social sciences (literature, linguistics, marketing), university/academic, weather.
- Business innovation - high ROI: $515 average return per $1 of HPC investment1
- A new science paradigm - data-driven analytics joins theory, experimentation, and computational science
- Fundamental discovery - advancing science and our understanding of the universe
1 Source: IDC HPC and ROI Study Update (September 2015)
2 Source: IDC 2015 Q1 Worldwide x86 Server Tracker vs. IDC 2015 Q1 Worldwide HPC Server Tracker
7. WSCAD 2016
Growing Challenges in HPC: "The Walls"
- System bottlenecks: memory | I/O | storage; energy-efficient performance; space | resiliency | unoptimized software
- Divergent infrastructure: resources split among modeling and simulation | big data analytics | machine learning | visualization, each wanting an HPC-optimized system
- Barriers to extending usage: democratization at every scale | cloud access | exploration of new parallel programming models
8. WSCAD 2016
HPC & the Competitiveness of Industry & Science in the USA
Public Investments
- Executive order from President Obama establishing a national supercomputing program
- HPC named a "top priority" to leverage USA competitiveness
"In order to maximize the benefits of HPC for economic competitiveness and scientific discovery, the United States Government must create a coordinated Federal strategy in HPC research, development, and deployment" - Executive Order, Barack Obama
Source: The White House, Office of the Press Secretary
9. WSCAD 2016
HPC & the Competitiveness of Industry & Science in the USA
Public Investments
- The U.S. makes a Top 10 supercomputer available to anyone who can "boost" America*, aiming to:
  - Boost American competitiveness
  - Accelerate advances in science and technology
  - Develop the country's skilled high-performance computing (HPC) workforce
Source: The White House, Office of the Press Secretary
10. WSCAD 2016
China's New Supercomputer Puts the US Even Further Behind*
Public Investments
- Sunway TaihuLight officially became the fastest supercomputer in the world
- What it really means for HPC:
  - Innovation through HPC
  - Government recognition of HPC's competitive value
  - Software is the key: performance, productivity, programmability
*Source: https://www.wired.com/2016/06/fastest-supercomputer-sunway-taihulight/
12. WSCAD 2016
Democratizing HPC for Big Data Workloads
Performance: Vector Instructions
- In the 70s and 80s, vector machines were the rule
- Why did they become "old stuff" in the 90s? According to Eugene D. Brooks, the reason was simple: they were custom machines*
- The near future? Vectors again - but now in general-purpose CPUs: affordable, easy to code, and associated with multi-threaded programming
*Source: https://www.hpcwire.com/2016/09/26/vectors-old-became-new-supercomputing/
Chart: Gigaflops from vector machines vs. parallel machines*
14. WSCAD 2016
Vectorization and Threading Are Critical on Modern Hardware
Performance: Vector Instructions
Chart key, fastest to slowest: vectorized & threaded; threaded; vectorized; serial.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Configurations at the end of this presentation.
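The gap between the "serial" and "vectorized" bars above can be sketched in plain Python with NumPy (the library used elsewhere in this deck). This is an analogy, not the Intel toolchain: NumPy's dispatch to compiled, SIMD-capable kernels stands in for compiler-level vectorization of a scalar loop.

```python
# Illustrative only: an interpreted scalar loop vs. the same arithmetic
# dispatched to NumPy's compiled, vectorizable kernel (a stand-in for
# compiler SIMD vectorization such as AVX-512).
import numpy as np
import time

def saxpy_scalar(a, x, y):
    # One multiply-add per interpreted loop iteration.
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

def saxpy_vectorized(a, x, y):
    # Same arithmetic, executed by a single compiled array operation.
    return a * x + y

n = 200_000
x = np.random.rand(n)
y = np.random.rand(n)

t0 = time.perf_counter()
ref = saxpy_scalar(2.0, x, y)
t_scalar = time.perf_counter() - t0

t0 = time.perf_counter()
vec = saxpy_vectorized(2.0, x, y)
t_vec = time.perf_counter() - t0

assert np.allclose(ref, vec)   # identical results, very different speed
print(f"scalar: {t_scalar:.4f}s  vectorized: {t_vec:.4f}s")
```

The speedup observed this way is library-level; the slides' point is that the same principle applies one level down, inside compiled loops.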
16. WSCAD 2016
Intel® DAAL Overview
Industry-leading performance: a C++/Java/Python library for machine learning and deep learning optimized for Intel® architectures, covering the full pipeline from pre-processing and transformation through analysis, modeling, validation, and decision making, for scientific/engineering, web/social, and business data.
Algorithms include: (de-)compression; PCA; statistical moments; variance matrix; QR, SVD, and Cholesky decompositions; Apriori; linear regression; Naïve Bayes; SVM; classifier boosting; K-means; EM for GMM; collaborative filtering; neural networks.
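To make the "analysis" stage of that pipeline concrete, here is a minimal NumPy stand-in for two of the primitives listed (statistical moments and the variance-covariance matrix). It mirrors what such library calls compute, not DAAL's API; the function names are this sketch's own.

```python
# Minimal NumPy stand-in for DAAL-style analysis primitives: low-order
# statistical moments and the variance-covariance matrix of a
# samples-by-features table. Illustrative functions, not the DAAL API.
import numpy as np

def low_order_moments(data):
    """Per-column moments for a samples-by-features matrix."""
    return {
        "minimum": data.min(axis=0),
        "maximum": data.max(axis=0),
        "mean": data.mean(axis=0),
        "variance": data.var(axis=0, ddof=1),   # sample variance
    }

def covariance_matrix(data):
    """Sample variance-covariance matrix (features x features)."""
    return np.cov(data, rowvar=False)

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))
moments = low_order_moments(data)
cov = covariance_matrix(data)
print(moments["mean"], cov.shape)
```

A library like DAAL computes the same quantities, but in blocked, vectorized, multi-threaded kernels that also handle out-of-core and distributed inputs.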
18. WSCAD 2016
Python* Landscape
Adoption of Python continues to grow among domain specialists and developers for its productivity benefits.
- Challenge #1: domain specialists are not professional software programmers.
- Challenge #2: Python performance limits migration to production systems.
Intel's solution is to accelerate Python performance, enable easy access, and empower the community.
20. WSCAD 2016
PCA Performance Boosts Using Intel® DAAL vs. Spark* MLlib on Intel® Architectures
PCA (correlation method) on an 8-node Hadoop* cluster based on Intel® Xeon® Processors E5-2697 v3; speedup over Spark MLlib by table size:
  1M x 200: 4X | 1M x 400: 6X | 1M x 600: 6X | 1M x 800: 7X | 1M x 1000: 7X
Configuration info - Versions: Intel® Data Analytics Acceleration Library 2016, CDH v5.3.1, Apache Spark* v1.2.0; Hardware: Intel® Xeon® Processor E5-2699 v3, 2 eighteen-core CPUs (45MB LLC, 2.3GHz), 128GB of RAM per node; Operating System: CentOS 6.6 x86_64. Benchmark source: Intel Corporation. * Other brands and names are the property of their respective owners.
Optimization notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Notice revision #20110804.
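The algorithm benchmarked above, PCA by the correlation method, reduces to an eigendecomposition of the correlation matrix. The sketch below is a plain NumPy illustration of that computation on a toy table, not the DAAL or Spark MLlib implementation, and far smaller than the 1M-row benchmark inputs.

```python
# Sketch of PCA via the correlation method: build the correlation matrix
# of the feature columns, eigendecompose it, keep the leading components.
# Plain NumPy illustration only, not the DAAL/MLlib code under benchmark.
import numpy as np

def pca_correlation(data, n_components):
    # 1) Correlation matrix of the columns (features x features).
    corr = np.corrcoef(data, rowvar=False)
    # 2) Eigendecomposition; eigh is appropriate for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(corr)
    # 3) eigh returns eigenvalues ascending; keep the largest n_components.
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(42)
table = rng.normal(size=(10_000, 10))
eigvals, components = pca_correlation(table, n_components=3)
print(eigvals.shape, components.shape)
```

The costly steps at scale are the correlation matrix build (a large matrix product over the table) and the eigensolve, which is where optimized BLAS/LAPACK kernels such as DAAL's earn the speedups shown above.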
21. WSCAD 2016
What's New: Intel® DAAL 2017
- Neural networks
- Python* API (a.k.a. PyDAAL), with easy installation through Anaconda or pip
- New data source connector for KDB+
- Open source project on GitHub: https://github.com/01org/daal
22. WSCAD 2016
Intel® VTune™ Amplifier Performance Profiler
New for 2017: Python*, FLOPS, Storage, and More
- Profile Python* and mixed Python / C++ / Fortran*
- Tune the latest Intel® Xeon Phi™ processors
- Quickly see three keys to HPC performance
- Optimize memory access
- Storage analysis: I/O bound or CPU bound?
- Enhanced OpenCL* and GPU profiling
- Easier remote and command-line usage
- Add custom counters to the timeline
- Preview: application and storage performance snapshots
- Intel® Advisor: optimize vectorization for Intel® AVX-512 (with or without the hardware)
23. WSCAD 2016
Optimize Memory Access
Memory Access Analysis: Intel® VTune™ Amplifier 2017 (improved)
- Tune data structures for performance: attribute cache misses to data structures (not just the code causing the miss); support for custom memory allocators
- Optimize NUMA latency and scalability: true- and false-sharing optimization; auto-detect max system bandwidth; easier tuning of inter-socket bandwidth
- Easier install, latest processors: no special drivers required on Linux*; Intel® Xeon Phi™ processor MCDRAM (high-bandwidth memory) analysis
24. WSCAD 2016
Storage Device Analysis (HDD, SATA, or NVMe SSD)
Intel® VTune™ Amplifier (new)
Are you I/O bound or CPU bound?
- Explore imbalance between I/O operations (async and sync) and compute
- Storage accesses mapped to the source code
- See when the CPU is waiting for I/O
- Measure bus bandwidth to storage
Latency analysis:
- Tune storage accesses with the latency histogram
- Distribution of I/O over multiple devices
Screenshot: a slow task with I/O wait highlighted; sliders set thresholds for I/O queue depth.
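A profiler answers the "I/O bound or CPU bound?" question with hardware and OS counters, but a crude first approximation needs only the standard library: compare wall-clock time with CPU time, since time spent blocked on I/O shows up in the gap between the two. This is a rough heuristic sketch, not how VTune works.

```python
# Crude stdlib approximation of "I/O bound or CPU bound?": if the process
# accumulated CPU time for most of the wall-clock interval, it was
# computing; if not, it was mostly waiting (on I/O, locks, or sleep).
import time

def classify(workload, threshold=0.5):
    wall0, cpu0 = time.perf_counter(), time.process_time()
    workload()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return "cpu-bound" if cpu / wall > threshold else "io-bound"

def compute_task():
    sum(i * i for i in range(1_000_000))   # pure computation

def waiting_task():
    time.sleep(0.2)                        # stands in for a blocking read

print(classify(compute_task))   # cpu-bound
print(classify(waiting_task))   # io-bound
```

A real tool goes much further, attributing the wait time to specific storage accesses and source lines, which is exactly what the slide's latency histogram provides.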
25. WSCAD 2016
Intel® Performance Snapshots
Three Fast Ways to Discover Untapped Performance
Is your application making good use of modern computer hardware? Run a test case during your coffee break: a high-level summary shows which apps can benefit most from code modernization and faster storage.
Pick a performance snapshot:
- Application: for non-MPI apps
- MPI: for MPI apps
- Storage: for systems, servers, and workstations with directly attached storage
Free download: http://www.intel.com/performance-snapshot
Also included with Intel® Parallel Studio and Intel® VTune™ Amplifier products.
26. WSCAD 2016
Python API (a.k.a. PyDAAL)
- Sticks closely to DAAL's overall design: object-oriented, namespace hierarchy, plug & play
- Seamless interfacing with NumPy
- Anaconda package: http://anaconda.org/intel/
- Co-exists with the proprietary version
- Apache 2.0 license
- Lives on github.com
...
# Create a NumPy array as our input
a = np.array([[1, 2, 4],
              [2, 1, 23],
              [4, 23, 1]])
# Create a DAAL Matrix using our NumPy array
m = daal.Matrix(a)
# Create an algorithm object for Cholesky decomposition using the default method
algorithm = cholesky.Batch()
# Set input arguments of the algorithm
algorithm.input.set(cholesky.data, m)
# Compute the Cholesky decomposition
res = algorithm.compute()
# Get the computed Cholesky factor
tbl = res.get(choleskyFactor)
# Get and print the NumPy array
print(tbl.getArray())
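The same decomposition can be cross-checked without DAAL using NumPy alone. One caveat worth noting: Cholesky factorization requires a symmetric positive-definite input, and the 3x3 example matrix on the slide is not positive definite (its second leading minor is negative), so this sketch substitutes a matrix that is.

```python
# Cholesky decomposition without DAAL, as a cross-check. NumPy's
# linalg.cholesky returns the lower-triangular factor L with a = L @ L.T.
# The matrix below is symmetric positive definite, as Cholesky requires.
import numpy as np

a = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 3.0],
              [2.0, 3.0, 6.0]])

L = np.linalg.cholesky(a)        # lower-triangular Cholesky factor
assert np.allclose(L @ L.T, a)   # the factor reproduces the input matrix
print(L)                         # rows: [2,0,0], [1,2,0], [1,1,2]
```

For small dense problems the two paths are interchangeable; DAAL's value shows up on large tables, where its batch/streaming/distributed execution modes apply.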
28. WSCAD 2016
Growing Need for a New Class of Memory
Performance & lower costs: integrated solutions
Segments with that need: virtualization, big data & cloud, in-memory DB, OLTP, workstation, supply chain management, enterprise ERP, database, storage, HPC.
What they ask for:
- "Give me a faster storage interface"
- "Allow in-memory data to survive soft reset or hard reboot"
- "Minimal latency for huge memory"
- "Make large-memory servers less expensive"
29. WSCAD 2016
Bridging the Memory-Storage Gap
Intel® Optane™ Technology Based on 3D XPoint™
- SSD: Intel® Optane™ SSDs deliver 5-7x the IOPS of current flagship NAND-based SSDs1
- DRAM-like performance: Intel® DIMMs based on 3D XPoint™, 1,000x faster than NAND1 with 1,000x the endurance of NAND2
- Hard-drive-like capacities: 10x more dense than conventional memory3
1 Performance difference based on comparison between 3D XPoint™ technology and other industry NAND
2 Endurance difference based on comparison between 3D XPoint™ technology and other industry NAND
3 Density difference based on comparison between 3D XPoint™ technology and other industry DRAM
Intel® Scalable System Framework
30. WSCAD 2016
CPU
DDR
INTEL®DIMMS
Intel®Optane™SSD
NAND SSD
Hard Disk Drives
1000Xfaster
Than NAND1
1000Xendurance
Of NAND2
10Xdenser
Than DRAM3
30
Intel® Scalable
System Framework
Bridging the Memory-Storage Gap
Intel® Optane™ Technology
Performance & Lower costs: Integrated solutions
1Performancedifferencebasedoncomparisonbetween3DXPoint™Technologyandotherindustry NAND
2 Density differencebasedoncomparisonbetween3DXPoint™Technologyandotherindustry DRAM
2 Endurancedifferencebasedoncomparisonbetween3DXPoint™Technologyandother industryNAND
Data granularity:
64B cacheline
31. WSCAD 2016
Yesterday Today Near Future
31
Storage Evolution
Performance & Lower costs: Integrated solutions
Memory
&
Storage
Storage
NAND based Intel P3700
(Fultondale) for NVMe
3D XPoint™ based
Coldstream SSD for NVMe
3D XPoint™ based
Apache Pass (AEP) for DDR4
Revolutionary
Storage
Class Memory
World’s
Fastest
NVMe SSD
3D XPoint enables world’s fastest NVMe SSD and
revolutionary storage class memory
32. WSCAD 2016
Code Modernization
Democratizing HPC performance for Big Data workloads
Tools, from ease of use to fine tuning:
- Intel® Math Kernel Library and Intel® Data Analytics Acceleration Library
- Array notation: Intel® Cilk™ Plus
- Auto vectorization
- Semi-auto vectorization: #pragma (vector, ivdep, simd)
- C/C++ vector classes (F32vec16, F64vec8)
Knights Landing server processor (with memory, fabric, and coprocessor heritage):
- Memory bandwidth: ~500 GB/s STREAM
- Memory capacity: over 25x* KNC
- Resiliency: systems scalable to >100 PF
- Power efficiency: over 25% better than the card1
- I/O: up to 100 GB/s with integrated fabric
- Cost: less costly than discrete parts1
- Flexibility: limitless configurations
- Density: 3+ KNL with fabric in 1U3
*Comparison to 1st Generation Intel® Xeon Phi™ 7120P Coprocessor (formerly codenamed Knights Corner)
1 Results based on internal Intel analysis using estimated power consumption and projected component pricing in the 2015 timeframe. This analysis is provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
2 Comparison to a discrete Knights Landing processor and discrete fabric component.
3 Theoretical density for air-cooled system; other cooling solutions and configurations will enable lower or higher density.
33. WSCAD 2016
Three Knights Landing Products
- KNL and KNL-F processors - "self-boot" Intel® Xeon Phi™ processor platform: Knights Landing IS the host processor and boots standard off-the-shelf OSs. Benefits: higher performance density for highly parallel applications2, reduced system power consumption2, higher perf/Watt and perf/$3.
- KNL coprocessor - requires an Intel® Xeon® processor host (Adams Pass platform): a solution for general-purpose servers and workstations. Benefits: targeted at applications with larger sections of serial work1; upgrade path from Knights Corner as a PCIe card.
1 Projections based on early product definition and as compared to prior-generation Intel® Xeon Phi™ coprocessors.
2 Based on Intel internal analysis. Lower power based on power consumption estimates between (2) HCAs compared to 15W additional power for KNL-F. Higher density based on removal of PCIe slots and associated HCAs populated in those slots.
3 Results have been estimated based on internal Intel analysis using estimated theoretical Flops/s for KNL processors, along with estimated system power consumption and component pricing in the 2015 timeframe, and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. See backup for complete system configurations.
34. WSCAD 2016
DDR4
x4 DMI2 to PCH
36 Lanes PCIe* Gen3 (x16, x16, x4)
MCDRAM MCDRAM
MCDRAM MCDRAM
DDR4
TILE:
(up to
36)
Tile IMC (integrated memory controller)EDC (embedded DRAM controller) IIO (integrated I/O controller)
KNL
Package
Enhanced Intel® Atom™ cores based on
Silvermont Microarchitecture
2D Mesh Architecture
Out-of-Order Cores
3X single-thread vs. KNC
ISA
Intel® Xeon® Processor Binary-Compatible (w/Broadwell)
On-package memory
Up to 16GB, ~465 GB/s STREAM at launch
Fixed Bottlenecks
Platform Memory
Up to 384GB (6ch DDR4-2400 MHz)
2VPU
Core
2VPU
Core
1MB
L2
HUB
KNL Architecture Overview
Bi-directional
tile connections
(same bit width
as Xeon core
interconnect)
34
35. WSCAD 2016
F
CONNECTOR
Lower cost: cost adder expected to be lower than (2) adapters or on-board controllers
Lower power: only ~15W TDP adder, which expected to be less than (2) adapters
Higher density: enables denser form factor – no slots, adapters, on-board controllers
Future-ready: sets stage for future hetero clusters (future Intel® Xeon® processor w/ int fabric)
1. KNL with TWO Fabric Adapters
(2) x16 PCIe slots
(2) x16 PCIe lanes
2. KNL w/ TWO on-board controllers
3. KNL-F with Storm Lake
Fabric
controller
Fabric
controller
QSPF
connector
QSPF
connector
1 Based on Intel internal estimates. Lower cost based on expected price delta between KNL and KNL-F processor, compared to two InfiniBand* or Storm Lake HCA via PCIe Express slots. Lower power based on power
consumption estimates between (2) HCAs (~20W)compared to 15W additional power for KNL-F over a comparable KNL processor.. Higher density based on removal of PCIe slots and HCAs populated in those slots.
QSPF
connector
QSPF
connector
QSFP
module
Same socket for KNL and KNL-F
Design common platform with keep-out zone
and to support additional 15W TDP.
KNL-F Benefits:1
Why KNL-F? (Integrated Fabric)
Dual-Port
100 GB/s bi-
directional
35
36. WSCAD 2016
Integrated Fabric CPU Requirements
Components required to support a CPU with integrated fabric (two ports):
- (1) IFP (Internal-to-Faceplate Processor) cable supporting two ports
- (1) 2-port "carrier card", with two main options:
  - a PCB that plugs into a PCIe slot (aka "PCIe carrier card")
  - a custom OEM PCB with power and sideband cables
- The PCIe "carrier card" implementation requires: PCB, (2) IFT (Internal Faceplate Transition) connectors, (2) IFT cages, and a sideband cable
- Each port requires: (1) IFT connector and (1) IFT cage
On the 2-port PCIe carrier board, the sideband cable, IFT connectors, and cages sit on the underside of the card. The IFT carrier card design kit (including BOM and design guide) is now posted on IBL (Doc #558210).
37. WSCAD 2016
Tighter Component Integration
Benefits
Bandwidth
Density
Latency
Power
Cost
Cores
Graphics
Fabric
FPGA
I/O
Memory
Intel® Scalable
System Framework
37
38. WSCAD 2016
Source: IDC 2014 (Worldwide High-Performance Systems Revenue by Applications) and https://software.intel.com/en-us/file/xeonphi-catalogpdf/download
CAE
Geosciences
Weather
Other
Mechanical Design
DCC & Distrib
Defense
University /
Academic
Government Lab
Bio-Sciences
EDA / IT / ISV
Economics /
Financial
Chem
Engineering
Balanced ApplicationsMemory Bandwidth Intensive Compute Intensive
CAE
Altair RADIOSS*
Ansys* Mechanical
Matevo MinFE
SIMULIA Abaqus*
Financial Services
Binomial Options Pricing Model
Binomial SP and DP
BlackScholes Merton Formula
BlackScholes SP and DP
Monte Carlo European Options Pricing
Monte Carlo RND SP and DP
Monte Carlo SP and DP
STAC A2
Xcelerit
Bioinformatics
BLAST
Bowtie 2
Burrows Wheeler Alignment (BWA)
Cryo-EM techniques
MPI-HMMER 2.3
Computational Chemistry
DiRAC Codes
GAMESS
Integral Calculation Library
NEURON
NWChem
Molecular Dynamics
AMBER
BUDE
DL_POLY
GROMACS
LAMMPS
NAMD
Geophysics
ELMER/Ice
SeisSol
SPECFEM3D Cartesian
UTBench
Climate/Weather
ADCIRC
CAMS
CFSv2
COSMO
ECHAM6
HARMONIE
HBM
MPAS
NOAA NIM
WRF
Digital Content Creation
EMBREE
Superresolution processing
Energy
Acceleware* AxRTM
DownUnder GeoSolutions
ISO3DFD
RTM Petrobras
TTI 3DFD
CFD
AVBP
FrontFlow/Blue code
LBS3D
NASA Overflow
OpenFOAM
OpenLB
ROTORSIM
SU2
TAU and TRACE
software.intel.com/XeonPhiCatalog
Intel® Xeon Phi™ Application Catalog
Over 100 applications to date listed as available or in-flight
39. WSCAD 2016
Developer Tools for the Knights Landing Platform
Intel Parallel Studio XE components and supported features in PSXE 2016 Gold:
• Intel® C/C++ and Fortran compilers 16.0: (1) the -xMIC-AVX512 compiler option enables KNL-specific optimizations, including loop optimizations and vectorization; (2) use the Intel® Fortran compiler to build for MCDRAM.
• Intel® Math Kernel Library 11.3: partial optimizations for all major MKL domains (BLAS, FFT, Sparse BLAS, VML, VSL), delivered via AVX-512 optimizations.
• Intel® MPI 5.1.1 and ITAC 9.1: support for the KNL platform and initial performance tuning are part of Intel MPI 5.1.1.
• VTune Amplifier XE 2016 (NDA package): collection on KNL targets (advanced hotspots and custom event collection based on SEP and perf; user API); analysis types for KNL profiling (advanced hotspots with full OpenMP analysis, custom core and uncore events, Intel MPI spins, general exploration); HBM profiling on Xeon with KNL bandwidth modeling.
• Advisor XE 2016 (NDA package): survey analysis for AVX-512 (includes hotspot collection and compiler static data).
• Data Analytics Acceleration Library 2016: includes KNL-specific performance optimizations.
• Intel® Integrated Performance Primitives 9.0: more than 70% of hot-list functions have AVX-512 optimizations.
44. WSCAD 2016
Current State of System Software Efforts in the HPC Ecosystem
THE REALITY: We, the HPC ecosystem, will not be able to get to where we want to go without a major change in system software development.
• With system margins under pressure, there is unwillingness to invest in system software.
• There is a desire to get exascale performance & speed up software adoption of HW innovation.
• Efforts are fragmented across the ecosystem: “Everyone is building their own solution.”
• New, complex workloads (ML, Big Data, etc.) drive more complexity into the software stack.
45. WSCAD 2016
Desired Future State
Stable HPC system software, built around a shared repository, that:
• fuels a vibrant and efficient HPC software ecosystem
• takes advantage of hardware innovation & drives revolutionary technologies
• eases traditional HPC application development and testing at scale
• extends to new workloads (ML, analytics, big data)
• accommodates new environments (e.g. cloud)
46. WSCAD 2016
Official Members as of 6/1/2016
Goal: A common system software platform for the HPC community that works across
multiple segments and on which ecosystem partners can collaborate and innovate
48. WSCAD 2016
OpenHPC to Intel® HPC Orchestrator system software products
• OpenHPC: an open source community for HPC software. Intel seeded the community with a pre-integrated, pre-tested, and validated HPC system software stack & will continue contributions along with other members of the community.
• Intel will offer Intel-supported products based on the open source OpenHPC software.
• Intel HPC Orchestrator products: premium software, advanced testing, and support.
Intel HPC Orchestrator products are the realization of the software portion of the Intel® Scalable System Framework.
49. WSCAD 2016
Open source accelerating HPC + Big Data
Open standards:
• PBS Pro is now open source
• OpenHPC
• Cloud for HPC
And how about Brazil?
• Intel Innovation Center at Rio – partnership with AMT (www.amt.com.br)
Pay less + ease of use = democratizing HPC for Big Data
51. WSCAD 2016
Intel’s HPC initiatives in Brazil
Code Modernization – open source software
• Modernizing applications to increase parallelism and
scalability
• Leverage cores, caches, threads, and vector capabilities of
microprocessors and coprocessors.
• Current centers in Brazil
52. WSCAD 2016
Intel Modern Code Partner program
Intel Modern Code Partners
Code Modernization – driving developers to develop modern code to modern hardware
• Create Faster Code…Faster
• High Performance Scalable Code
• C++, C, Fortran*, Python* and Java*
• Standards-driven parallel models:
• OpenMP*, MPI, and TBB
• To teach developers how to fully exploit Xeon and Xeon Phi performance: vectors + multi-threading
More at: http://software.intel.com/moderncode
Free HPC & Big Data workshops across Brazil
53. WSCAD 2016
Code Modernization initiatives in the Brazilian HPC Ecosystem
• Oil & Gas – reservoir simulator at PETROBRAS: up to 10.5x performance gains in their reservoir simulator software¹ (white-paper link).
• LNCC – National Laboratory for Scientific Computing, the largest HPC cluster in Latin America: up to 30x performance gain in Oil & Gas applications² (initial results – white-paper link).
• INPE/CPTEC – code modernization of BRAMS: up to 3.4x speedup via AVX (vector instructions); article link.
• Health & Life Sciences – up to 11x speedup in molecular dynamics, NCC/UNESP & LNCC³ (white-paper link):
  – Xeon only, original code vs. modernized code: up to 11x speedup
  – Xeon + 1 Xeon Phi (same optimized code): 1.14x speedup
Authors:
¹ CENPES team and Gilvan Vieira – gilvandsv@gmail.com
² LNCC – Frederico Cabral – fredluiscabral@gmail.com
³ NCC/UNESP – Silvio Stanzani – silvio.stanzani@gmail.com
55. WSCAD 2016
Conclusions
• As with other products, technologies, and services, “lower cost + scale + ease of use” will drive HPC to the masses:
1st wave: near bare-metal in the cloud (lower cost + scale)
2nd wave: frameworks offering “free performance” to unlock insights (usability)
3rd wave: even small and medium businesses will rely on HPC / Big Data to drive business
56. WSCAD 2016
Big Data Analytics
Integrated solutions: HPC && Big Data
[Diagram: a converged stack, layer by layer.]
• Programming model – HPC: FORTRAN / C++ applications with MPI (high performance); Big Data: Python, frameworks, Java* applications, and others on Hadoop* / Spark / others (simple to use).
• Resource manager: an HPC & Big Data-aware resource manager.
• File system: Lustre* with Hadoop* adapter – remote storage that is both compute and big-data capable.
• Hardware infrastructure: scalable performance components – servers, storage (SSDs and burst buffers), and the Intel® Omni-Path Architecture.
*Other names and brands may be claimed as the property of others
57. WSCAD 2016
Next steps for HPC & Big Data: a new paradigm in memory and storage
Today, from compute node through I/O node to remote storage, the hierarchy is: processor caches → local memory → SSD storage → parallel file system (hard-drive storage), with bandwidth increasing and latency and capacity decreasing toward the processor.
In the future: processor caches → in-package high-bandwidth memory* → non-volatile memory → burst-buffer storage → parallel file system (hard-drive storage):
• Some remote data moves onto the I/O node.
• I/O-node storage moves to the compute node.
• Local memory is now faster & in the processor package.
*cache, memory, or hybrid mode
58. WSCAD 2016
Conclusions
A holistic architectural approach is required.
[Diagram: performance/capability grows over time through innovative technologies and tighter integration of compute, memory, fabric, and storage (cores, graphics, fabric, FPGA, I/O, memory), together with system software and modernized application code from the community, ISVs, and proprietary system vendors.]
60. WSCAD 2016
Intel® Modern Code Developer Community – a global online community
software.intel.com/moderncode
Topics:
- Vectorization / Single Instruction, Multiple Data (SIMD)
- Multi-threading
- Multi-node / clustering
- Taking advantage of on-package high-bandwidth memory
- Increasing memory and power efficiency
Developer zone:
- Modern Code Zone
- Software tools, training webinars
- How-to guides, parallel programming BKMs
- Remote access to hardware
- Support forums
Experts:
- Black Belts & Intel engineer experts
- Technical content, training: webinars, F2F, forum support
- Conferences and tradeshows: keynotes, presentations, BOFs, demos, tutorials
61. WSCAD 2016
Machine/Deep Learning | Resources
Training Classes:
U.Oxford Class on Deep Learning
Stanford Class on Machine Learning
Google Class on Deep Learning
Intel Caffe Repo: (Support for Multi-node Training)
https://github.com/intelcaffe/caffe
Spark MLLib Repo:
http://spark.apache.org/mllib/
Intel Machine Learning Blog Posts:
Myth Busted - CPUs and Neural Network Training
Caffe Scoring on Xeon Processors
Caffe Training on Multi-node Distributed Memory Systems
Trusted Analytics Platform:
http://trustedanalytics.org/
Performance Libraries:
MKL for Neural Networks - Technical Preview
Math Kernel Library
MKL Community License
Data Analytics Acceleration Library