Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 Technical Sessions

SIGGRAPH 2019 | LOS ANGELES | 28 JULY - 1 AUGUST

Hisham Chowdhury
Software Architect, Intel Corporation
Accelerating Machine Learning with Intel® Processor Graphics

What is Machine Learning?
“Machine learning is an application of artificial intelligence (AI) that provides systems
the ability to automatically learn and improve from experience without being explicitly
programmed. Machine learning focuses on the development of computer
programs that can access data and use it to learn for themselves.”
*Source: expertsystem.com
Training Inference

ML Usage
*Source: Pixelmator Pro, Apple.com

Popular CNN Architecture and Accuracy
*Source: towardsdatascience.com

Machine Learning on Intel® Processor Graphics

End-to-End AI Compute
Data center: many-to-many hyperscale for stream and massive batch data processing
Gateway: 1-to-many with majority streaming data from devices
Edge: 1-to-1 devices with lower power and often UX requirements
Connectivity: Ethernet & wireless; wireless and non-IP wired protocols
✓ Secure
✓ High throughput
✓ Real-time
Intel® Xeon® Processors
Intel® Core™ & Atom™ Processors
Intel® FPGA
Intel® Xeon Phi™ Processors*
Crest Family (Nervana ASIC)*
Intel® Processor Graphics
Movidius Myriad (VPU): Vision
Intel® GNA (IP)*: Speech

Intel® Processor Graphics Inference Landscape

Windows Machine Learning on Intel® Processor Graphics

WinML
• Load model, load video/images
• Bind input/output resources
• Evaluate result:
• Get probability and prediction
• Transform inputs (style transfer, denoising, etc.)
• Supports CPU, GPU, and accelerators (VPU)
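The load, bind, and evaluate steps listed for WinML can be sketched in plain Python. This is not the WinML API: the `TinyModel` class, its weights, and the label names are hypothetical stand-ins chosen so the flow runs anywhere. A real application would load a trained model through WinML instead.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyModel:
    """Hypothetical stand-in for a loaded model: one linear layer plus softmax."""

    def __init__(self, weights, labels):
        self.weights = weights  # one row of weights per output label
        self.labels = labels

    def evaluate(self, bindings):
        # Read the bound input, compute one score per label, fill the bound output.
        x = bindings["input"]
        scores = [sum(w * v for w, v in zip(row, x)) for row in self.weights]
        bindings["output"] = softmax(scores)
        return bindings


def classify(model, pixels):
    # 1. Bind the input resource; the output slot is filled during evaluation.
    bindings = {"input": pixels, "output": None}
    # 2. Evaluate the model against the bindings.
    result = model.evaluate(bindings)
    # 3. Get probability and prediction from the output.
    probs = result["output"]
    best = max(range(len(probs)), key=probs.__getitem__)
    return model.labels[best], probs[best]


# "Load" a toy two-class model and classify one two-value input.
model = TinyModel(weights=[[1.0, 0.0], [0.0, 1.0]], labels=["cat", "dog"])
label, prob = classify(model, [0.2, 0.9])
print(label, round(prob, 3))
```

The separation between binding and evaluation mirrors the list: inputs and outputs are attached first, and the prediction is read back only after evaluation completes.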

DirectML
• Low-level API for machine learning (ML)
• Hardware-accelerated machine learning primitives (called operators) are the building blocks of DirectML
• Can be integrated into D3D12 games and applications
• Metacommands
• DirectML uses the Direct3D 12 metacommands feature, which allows hardware vendors to provide the most efficient implementation of the primitives for the underlying hardware
• Achieves high hardware efficiency on Intel® hardware using metacommands

macOS Machine Learning on Intel® Processor Graphics

Inference Workflow
*Source: mitochrome.com

Inference Architecture
[Diagram layers, top to bottom: Inference Application 1; Vision, Natural Language Processing, GameplayKit; Core ML; Accelerate and BNNS, Metal Performance Shaders; CPU, iGPU]
• CoreML
• CPU, GPU, accelerators
• Image analysis, natural language processing, audio to text, identifying sounds in audio
• Built on top of low-level primitives like Accelerate and BNNS, and Metal Performance Shaders (MPS)
• Metal Performance Shaders (MPS)
• GPU only
• Low-level primitive API (MPS Graph API is also supported) for ML, image processing, and ray tracing needs
• Most efficient for the underlying Intel® architecture
• Can be integrated into Metal games and applications and dispatched as part of the same GPU command buffer

Bringing Machine Learning Training to the Edge

CreateML
• ML models can now be created directly on the macOS device using CreateML
*Source: Apple.com

macOS ML Architecture with Training
[Diagram layers, top to bottom: Inference Application 1 (inference), Training Application 1 and Training Application 2 (training); Vision, Natural Language Processing, GameplayKit, Turi Create, Create ML; Core ML; Accelerate and BNNS, Metal Performance Shaders; CPU, iGPU]

Web Machine Learning: POC
[Diagram layers, top to bottom]
Web App: JS ML frameworks; ONNX models, TensorFlow models, other models
Web Browser: WebAssembly, WebGL/WebGPU (existing); WebML/NN (new)
OS ML API: CoreML/BNNS/MPS (macOS/iOS); WinML/DirectML (Windows); TF-Lite/NN API (Android)
Hardware: CPU, GPU, accelerators
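The layering in this diagram can be read as a dispatch decision: a JS ML framework goes through the new WebML/NN path to the OS ML API when one is available, and otherwise stays on the existing browser paths. The sketch below only illustrates that reading. The function name, the platform keys, and the fallback policy are assumptions, not part of the proof of concept.

```python
# OS ML API per platform, as listed in the diagram.
OS_ML_API = {
    "macos": "CoreML/BNNS/MPS",
    "ios": "CoreML/BNNS/MPS",
    "windows": "WinML/DirectML",
    "android": "TF-Lite/NN API",
}

# Existing browser paths, used when the new API cannot be used (assumed policy).
EXISTING_PATH = "WebGL/WebGPU or WebAssembly"


def pick_backend(platform, webml_available):
    """Return the execution path a JS ML framework might take (hypothetical helper)."""
    api = OS_ML_API.get(platform.lower())
    if webml_available and api is not None:
        return f"WebML/NN -> {api}"
    return EXISTING_PATH


for name in ("macOS", "Windows", "Android", "Linux"):
    print(name, "=>", pick_backend(name, webml_available=True))
```

Keeping the platform table separate from the selection logic makes it easy to see which rows come from the slide and which behavior is assumed.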

Web Machine Learning: with TensorFlow.js
Platform                                                                    TensorFlow.js (WebGL) (ms)   TensorFlow.js (WebML/MPS) (ms)   Speedup
MBP 15" 2016, 2.7GHz Intel Core i7 + Intel HD Graphics 530 1536MB           130.810                      18.371                           7.120
MBP 15" 2016, 2.7GHz Intel Core i7 + AMD Radeon Pro 455 1536MB              46.756                       19.362                           2.415
MBP 13" 2017, 3.5GHz Intel Core i7 + Intel Iris Plus Graphics 650 1536MB    66.479                       19.885                           3.343
MBP 13" 2016, 2.9GHz Intel Core i5 + Intel Iris Graphics 550 1536MB         71.128                       18.904                           3.763
Disclaimer
• Platforms used for these numbers: MacBook Pro 13” and 15” with Intel Graphics 530, 550, 650 and AMD Radeon Pro 455, running macOS High Sierra (10.13.4)
• All testing was performed at Intel. Numbers may differ based on actual hardware used and/or based on how the benchmark is written. Intel® makes no guarantee on the specific numbers; they are intended for reference only
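The Speedup column is the WebGL time divided by the WebML/MPS time. A short check in Python, using only the timings reported in the table (the shortened GPU names are for display only):

```python
# (GPU, TensorFlow.js WebGL ms, TensorFlow.js WebML/MPS ms) as reported in the table.
RESULTS = [
    ("Intel HD Graphics 530", 130.810, 18.371),
    ("AMD Radeon Pro 455", 46.756, 19.362),
    ("Intel Iris Plus Graphics 650", 66.479, 19.885),
    ("Intel Iris Graphics 550", 71.128, 18.904),
]


def speedup(baseline_ms, optimized_ms):
    # Ratio of inference times: values above 1 mean the optimized path is faster.
    return baseline_ms / optimized_ms


for gpu, webgl_ms, webml_ms in RESULTS:
    print(f"{gpu}: {speedup(webgl_ms, webml_ms):.3f}x")
```

The WebML/MPS times sit in a narrow band around 18 to 20 ms on every platform, so the differences in speedup come almost entirely from the WebGL baseline.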

Performance State
Windows and macOS

WebML using Metal Performance Shaders (MPS) vs WebGL, WASM (Legacy)
[Chart: WebML Chromium POC. Inference time in msecs (lower is better), axis from 0 to 600, for MobileNet, SqueezeNet, and TensorFlow.js. Series: WASM, WebGL 2, WebML with MPS.]
Disclaimer
• Configurations used for test and perf data: MacBook Pro 13” with Intel Iris Graphics 550 and 530, some with a fixed 850 MHz frequency and some with dynamic frequency
• All testing was performed at Intel® Folsom
• Numbers may differ based on actual hardware used and/or based on how the benchmark is written. Intel® makes no guarantee on the specific numbers; they are intended for reference only

GEMM Efficiency
Intel® Gen9 Processor Graphics
[Chart: fp16 GEMM. Y axis: GFLOPS (0 to 1400); X axis: matrix dimensions from 256x256x256 to 4096x4096x4096. Series: Intel Optimized, HW Theoretical Max, 80% HW Theoretical Max.]
[Chart: fp32 GEMM. Y axis: GFLOPS (0 to 700); X axis: matrix dimensions from 256x256x256 to 4096x4096x4096. Series: Intel Optimized, HW Theoretical Max, 80% HW Theoretical Max.]
Disclaimer
• Configurations used for test and perf data: MacBook Pro 13” with Intel Iris Graphics 550 and 530, some with a fixed 850 MHz frequency and some with dynamic frequency
• All testing was performed at Intel® Folsom
• Numbers may differ based on actual hardware used and/or based on how the benchmark is written. Intel® makes no guarantee on the specific numbers; they are intended for reference only
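For readers who want to relate a GEMM timing to the GFLOPS axis, the usual convention counts 2·M·N·K floating-point operations for an M×K by K×N multiply. The sketch below applies that convention. The timing and the peak value in the example are hypothetical and are not taken from the charts.

```python
def gemm_flops(m, n, k):
    # Each of the m*n outputs needs k multiplies and k adds.
    return 2 * m * n * k


def achieved_gflops(m, n, k, seconds):
    # Floating-point operations per second, expressed in units of 1e9.
    return gemm_flops(m, n, k) / seconds / 1e9


def efficiency(achieved, theoretical_peak):
    # Fraction of the hardware's theoretical peak that was actually reached.
    return achieved / theoretical_peak


# Hypothetical example: a 1024x1024x1024 GEMM that takes 5 ms
# on hardware with an assumed 600 GFLOPS theoretical peak.
g = achieved_gflops(1024, 1024, 1024, 0.005)
print(f"{g:.1f} GFLOPS, {efficiency(g, 600.0):.0%} of peak")
```

An efficiency of 0.8 corresponds to the "80% HW Theoretical Max" reference line in the charts.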

macOS Mojave => macOS Catalina: % Improvements
[Chart: CoreML. % improvement (axis 0 to 160) for VGG19, VGG16, InceptionV4, InceptionV3, ResNet50, InceptionV1, AlexNet, GoogleNetPlaces, MobileNet, SqueezeNet, Denoiser.]
[Chart: Metal Performance Shaders. % improvement (axis 0 to 90) for VGG19, VGG16, InceptionV3, ResNet50, InceptionV1, AlexNet, GoogleNetPlaces, SqueezeNet.]
[Chart: Adobe Lightroom Enhance Details. % improvement (axis 0 to 70) for Fuji 22 MP, Fuji 24 MP, Canon 22 MP, Canon 50 MP.]
Disclaimer
• Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Configurations used for test and perf data: MacBook Pro 13” with Intel Iris Graphics 530, some with dynamic frequency. Mojave numbers are from macOS 10.14.5 and Catalina numbers are from macOS 10.15 beta.
• All testing was performed at Intel® Folsom. Numbers may differ based on actual hardware used and/or based on how the benchmark is written. Intel® makes no guarantee on the specific numbers; they are intended for reference only

Windows Oct 2018 => Windows May 2019
[Chart: Adobe Lightroom Enhance Details. % improvement, Windows Oct 2018 -> May 2019 (axis 0 to 160), for Canon 22 MP, Canon 50 MP, Fuji 24 MP.]
Disclaimer
• Configurations used for test and perf data: latest Windows OS and Intel® Kaby Lake graphics
• All testing was performed at Intel. Numbers may differ based on actual hardware used and/or based on how the benchmark is written. Intel® makes no guarantee on the specific numbers; they are intended for reference only

Use Cases Running on Intel® Processor Graphics

Photo Enhancement – Pixelmator Pro
Intel GPU on macOS using the CoreML AI framework
Professionally enhance your photos without time-consuming manual trial and error
[Images: Original (nice, but overexposed); Post ML Enhance on Pixelmator Pro]

Enhance Details – Adobe Lightroom
Intel GPU on macOS using CoreML and on Windows using WinML AI frameworks
https://theblog.adobe.com/enhance-details/

Smart Retail – Cashier-less Store
Components: kiosk; smart shelf with pressure sensor; smart weighing station; IA edge computing workstation
• Recognize who picks up what and how many, and add the goods to the user account’s shopping cart for payment
• Track stop positions and count people by gender and age to generate a heat map
• Recognize goods, how many, how much, and handle payment
• The camera on the shelf can also check whether goods are displayed in the right position
• Identify the customer and associate them with an account
• Recognize people’s gender and age to push ads
Intel GPU on Linux using OpenVINO AI SDK

Reinforcement Learning for Developing Agents in Games
Demonstrated on Intel graphics by Unity at the Game Developers Conference, March 2019
A real dog uses vision and other senses to orient itself and to
decide where to go. Puppo follows the same methodology. It
collects observations about the scene such as proximity to
the target, the relative position between itself and the target
and the orientation of its own legs, so it can decide what
action to take next. In Puppo’s case, the action describes
how to rotate the joint motors in order to move.
After each action Puppo performs, we give a reward to the
agent. The reward is comprised of:
The dog learned to walk rather quickly in about 1 min.
Then, as the training continued, the dog learned to run.
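The observe, act, reward cycle described here can be sketched with a toy environment. This is not Unity's implementation and it contains no training algorithm: the one-dimensional `FetchEnv1D` scene, the two policies, and in particular the reward (one point per step closer to the target) are assumptions made for illustration, since the slide does not list what Puppo's reward is made of.

```python
import random


class FetchEnv1D:
    """Hypothetical 1-D stand-in for the fetch scene: the agent walks toward a target."""

    def __init__(self, target=10, start=0):
        self.target = target
        self.position = start

    def observe(self):
        # Observation: signed offset from the agent to the target.
        return self.target - self.position

    def step(self, action):
        # action is +1 (step forward) or -1 (step back).
        before = abs(self.observe())
        self.position += action
        after = abs(self.observe())
        reward = before - after  # assumed reward: +1 if closer, -1 if farther
        done = after == 0
        return self.observe(), reward, done


def run_episode(env, policy, max_steps=100):
    # Repeat: observe, choose an action, apply it, collect the reward.
    total = 0
    obs = env.observe()
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


def greedy_policy(obs):
    # Always step toward the target.
    return 1 if obs > 0 else -1


def random_policy(obs):
    # Ignore the observation, as an untrained agent would.
    return random.choice([-1, 1])


print("greedy:", run_episode(FetchEnv1D(), greedy_policy))
print("random:", run_episode(FetchEnv1D(), random_policy))
```

A learning algorithm would sit between the two policies shown: it starts out behaving like `random_policy` and uses the accumulated rewards to move toward something like `greedy_policy`.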
Link to demo: https://blogs.unity3d.com/wp-content/uploads/2018/10/DogFetchTraining.mp4?_=1
Courtesy Unity
Intel GPU on Windows using DirectML AI framework
Save developer time to deliver game agents; improve game experience

AWS DeepRacer – AI for Computer Vision and Reinforcement Learning on Intel Atom® Processor
Intel GPU on Linux using OpenVINO AI SDK
Applicable to teaching robots, from vacuum cleaners to strawberry pickers

Style Transfer

PoseNet
Real-time human pose estimation in the browser
Browser-based PoseNet using WebML on Intel GPU with clDNN (Windows/Linux) and Metal Performance Shaders (macOS) backends

AI-Based Denoising: Intel Open Image Denoiser

Improvements with 11th Generation Intel® Processor Graphics “Ice Lake”

• 10 nm process
• 64 execution units (EUs), which increases the core compute capability by 2.67x¹ over Gen9
• Gen11 addresses the corresponding bandwidth needs by improving compression, increasing L3 cache, and increasing peak memory bandwidth
• ~1 TF FP32 perf; ~2 TF FP16 perf
• Improved Shared Local Memory (SLM) performance (~1/4 latency vs Gen9)
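The headline figures can be reproduced with simple arithmetic. The per-EU throughput used below (two SIMD-4 floating-point units per EU, with a fused multiply-add counted as two operations, and double rate for FP16) and the 1.0 GHz clock are assumptions that do not appear on the slide. The EU counts of 64 for Gen11 and 24 for Gen9 are the ones quoted in the test configurations elsewhere in this deck.

```python
def peak_gflops(eus, flops_per_eu_per_clock, clock_ghz):
    # Peak throughput = number of EUs x FLOPs per EU per clock x clock rate.
    return eus * flops_per_eu_per_clock * clock_ghz


# Assumed per-EU throughput: 2 FPUs x 4 SIMD lanes x 2 ops per FMA.
FP32_PER_EU_CLOCK = 2 * 4 * 2
# Assumed double rate for half precision.
FP16_PER_EU_CLOCK = 2 * FP32_PER_EU_CLOCK
# Assumed clock; actual frequency varies by part and workload.
CLOCK_GHZ = 1.0

gen11_fp32 = peak_gflops(64, FP32_PER_EU_CLOCK, CLOCK_GHZ)
gen11_fp16 = peak_gflops(64, FP16_PER_EU_CLOCK, CLOCK_GHZ)
eu_ratio = 64 / 24  # Gen11 EUs over Gen9 EUs

print(f"FP32 peak: {gen11_fp32:.0f} GFLOPS")
print(f"FP16 peak: {gen11_fp16:.0f} GFLOPS")
print(f"EU ratio vs Gen9: {eu_ratio:.2f}x")
```

Under these assumptions the results land on roughly 1 TF for FP32 and 2 TF for FP16, and the EU ratio matches the 2.67x compute increase quoted in the list.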
[Block diagram: an Intel® Core processor SoC with CPU cores, LLC cache slices, a system agent (display controller, PCIe, memory controller), and the SoC ring interconnect, attached to Intel® Processor Graphics Gen11. The Gen11 slice shows 8 subslices, each with 8 EUs, instruction cache and thread dispatch, sampler, SLM, dataport (LD/ST), texture cache, and media sampler, together with L3 cache, slice common (raster, HiZ/depth, pixel dispatch, pixel backend), and global assets (geometry, GTI, blitter, media fixed function).]

ML Bench: x improvement, Gen9 vs Gen11
[Chart: y axis from 1.50 to 2.70; workloads VGG16_b01, VGG16_b04, VGG16_b16, VGG19_b01, VGG19_b04, VGG19_b16, InceptionV3_b01, InceptionV3_b04, InceptionV3_b16, ResNet50_b01, ResNet50_b04, ResNet50_b16.]
Disclaimer
• Configurations used for test and perf data: Intel Gen9 graphics (24 EU) and Intel Gen11 graphics (64 EU), some with fixed frequency and some with dynamic frequency
• All testing was performed at Intel® Folsom

ISV Application Improvements
[Chart: Adobe Lightroom Enhance Details, x improvement Gen9 vs Gen11; y axis from 1.88 to 1.98; workloads Fuji 22 MP, Fuji 24 MP, Canon 22 MP, Canon 50 MP.]
Disclaimer
• Configurations used for test and perf data: Intel Gen9 graphics (24 EU) and Intel Gen11 graphics (64 EU), some with fixed frequency and some with dynamic frequency
• All testing was performed at Intel® Folsom

AI/ML Possibilities
Stylize a 15 min video w/ AI (1, Cyberlink PowerDirector): 48 minutes vs. 30 minutes
Enhancing 250 images w/ ML (2, Adobe Lightroom Classic & CC): 1.1 hours vs. 42 minutes
Chart legend: Gen9, Gen11
Performance: 1.0x, 1.7-2.7x
1. Stylize video using Cyberlink PowerDirector Style Transfer leveraging Intel OpenVINO
2. 250 22MP images, using WinML, CoreML, and Adobe Lightroom Classic and CC
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks
Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator. Any difference in system hardware or software design or configuration may affect actual performance.
System Configurations: ICL Media performance is based on projections and subject to change. Gen 9 performance is based on KBL-R U42 system

Summary
• Machine learning is here on the edge!
• Use Intel® integrated graphics for your machine learning acceleration
• Ships with most Windows and Mac platforms
• Intel-optimized ML stack is enabled by default
• Automatic improvements delivered with OS and driver updates
• Large improvement with 11th Gen Intel® Processor Graphics
• Intel is continuously working with OSVs (Apple, Microsoft), ISVs, the open source community, and others to improve Intel® graphics software and hardware for ML needs

References
• Intel® Processor Graphics Gen11, aka “Ice Lake”
• Apple Machine learning on Intel®
• CreateML
• CoreML
• Metal Performance Shaders
• Windows AI
• WebML
• Intel® Open Image Denoiser
• Windows May2019 ML improvements on Intel®
• Adobe Enhance Details
• Unity AI
• WinML Get Started
• DirectML

Acknowledgements
• Aaftab Munshi
• Joseph Van De Water
• Sudhir Tonse
• Ningxin Hu
• Gokul N Tonpe
• Insoo Woo
• Ben Ashbaugh
• Murali Ramadoss
• Thanh-Kevin Dang
• Jay Patel
• Prashanth Palaniappan
• Xiaoqing Wu
• Sachin Sane
• Katen Shah
• Brian Jacobosky
• Arzhange Safdarzadeh
• Anthony Bernecky
• Leland E Martin
• Antal Tungler
• Damien Triolet
• Jacek Krol
• Jacek Nowak
• Kalyan Muthukumar

Legal Disclaimer
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service
activation. Performance varies depending on system configuration. No product or component can be absolutely secure. Check with your
system manufacturer or retailer or learn more at [intel.com].
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors and Intel
Integrated GPU. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components,
software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that
product when combined with other products. For more information go to www.intel.com/benchmarks.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by
Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not
specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference
Guides for more information regarding the specific instruction sets covered by this notice.
All testing was performed at Intel® Folsom
Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© Intel Corporation.
