Software AI Accelerators: The Next Frontier | Software for AI Optimization Summit 2021 Keynote


Software AI Accelerators deliver orders of magnitude performance gain for AI across deep learning, classical machine learning, and graph analytics and are key to enabling AI Everywhere. Get started on your AI Developer Journey @ software.intel.com/ai.

  1. Software AI Accelerators: The Next Frontier. Software for AI Optimization Summit. Wei Li, VP & GM, Machine Learning Performance, Intel Corporation.
  2. Hardware AI Accelerators: HW acceleration.
  3. Software AI Accelerators: up to 10-100x on top of HW acceleration when SW acceleration is added. (Photo source: NASA)
  4. AI Hardware Spectrum: from general-purpose to purpose-built: CPU, GPU, accelerators.
  5. Unscalable to Scalable Software: instead of duplicating a full stack (middleware, frameworks and runtimes; low-level libraries; virtualization/orchestration; OS; drivers; FW; IP & BIOS) per device for CPU, GPU, and accelerators [1]..[N], scalable software shares services & solutions, applications, and a common middleware/frameworks/runtimes layer across CPU, GPU, and accelerators.
  6. AI Software Stack, for data scientists & developers. AI/analytics tools, toolkits, and verticals: Intel LPOT (Low Precision Optimization Tool), Analytics Zoo, Intel oneAPI AI Analytics Toolkit, SigOpt, OpenVINO. Deep learning, machine learning, and big data frameworks: TensorFlow, PyTorch, MXNet, PaddlePaddle, Python/Numba, TVM, Spark SQL + ML/DL scale-out, Modin, NumPy, Pandas, Scikit-learn, XGBoost. Beneath these sit libraries & compilers, running on CPU, GPU, and accelerator hardware.
  7. Kernel Optimization Example: a simple program is good, but may be slow. The optimized convolution in oneDNN applies vectorization, data reuse, and parallelization.
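To make the data-reuse idea on this slide concrete, here is a plain-Python sketch (not the actual oneDNN kernel) contrasting a naive 2-D convolution with a version restructured so each input row is loaded once and reused across output positions, mirroring the register/cache blocking an optimized kernel performs:

```python
# Naive convolution: every output element re-fetches its inputs.
def conv2d_naive(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            acc = 0.0
            for ki in range(kh):
                for kj in range(kw):
                    acc += image[i + ki][j + kj] * kernel[ki][kj]
            out[i][j] = acc
    return out

# Restructured for data reuse: the loop order hoists the row lookup
# out of the inner loops, so each image row is read once per output row.
def conv2d_reuse(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for ki in range(kh):
            row = image[i + ki]      # loaded once, reused for all j
            krow = kernel[ki]
            for j in range(ow):
                for kj in range(kw):
                    out[i][j] += row[j + kj] * krow[kj]
    return out

image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
kernel = [[1.0, 0.0], [0.0, -1.0]]
assert conv2d_naive(image, kernel) == conv2d_reuse(image, kernel)
```

In a real kernel the same restructuring lets the compiler vectorize the inner loop and keeps hot data in registers and cache; oneDNN additionally parallelizes across output rows.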
  8. Graph Optimization Example: starting from a baseline graph (Conv1x1 + BatchNorm + ReLU, Conv3x3 + BatchNorm + ReLU, Conv1x1 + BatchNorm, Sum, ReLU), the Intel Low Precision Optimization Tool generates an INT8-optimized model by folding BatchNorm into the preceding convolution, fusing Conv + ReLU, and fusing Conv + Sum. The fused INT8 kernels map onto hardware dot-product instructions that accumulate, e.g., C0 = A0*B0 + A1*B1 + A2*B2 + A3*B3 + C0, processing 64 int8 input pairs A0..A63, B0..B63 into 16 int32 accumulators C0..C15.
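The BatchNorm-folding pass mentioned above can be sketched in a few lines: a Conv followed by BatchNorm is replaced by a single Conv with rescaled weight and bias. This is a plain-Python illustration for a one-channel 1x1 convolution, with illustrative names rather than LPOT's API:

```python
import math

def conv1x1(x, w, b):
    # 1x1 convolution on one channel is just an affine map per element.
    return [w * xi + b for xi in x]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    return [gamma * (yi - mean) / math.sqrt(var + eps) + beta for yi in y]

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # Fold BN into the conv: w' = w*gamma/sqrt(var+eps),
    # b' = (b - mean)*gamma/sqrt(var+eps) + beta.
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

x = [0.5, -1.0, 2.0]
w, b = 3.0, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.4, 2.0

reference = batchnorm(conv1x1(x, w, b), gamma, beta, mean, var)
wf, bf = fold_bn(w, b, gamma, beta, mean, var)
folded = conv1x1(x, wf, bf)
assert all(abs(a - c) < 1e-9 for a, c in zip(reference, folded))
```

Folding removes the BatchNorm node entirely at inference time, which is why it appears first in the slide's pass pipeline, before the Conv + ReLU and Conv + Sum fusions.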
  9-10. Intel Optimization for TensorFlow: immediate performance benefits. Platinum 8380: 1-node, 2x Intel Xeon Platinum 8380 processor with 1 TB (16 slots / 64 GB / 3200) total DDR4 memory, ucode 0xd000280, HT on, Turbo on, Ubuntu 20.04.1 LTS, 5.4.0-73-generic, Intel 900 GB SSD OS drive; ResNet50 v1.5, FP32/INT8, BS=128, https://github.com/IntelAI/models/blob/master/benchmarks/image_recognition/tensorflow/resnet50v1_5/README.md; SSD-MobileNetv1, FP32/INT8, BS=448, https://github.com/IntelAI/models/blob/master/benchmarks/object_detection/tensorflow/ssd-mobilenet/README.md. Software: TensorFlow 2.4.0 for FP32 & Intel-TensorFlow (icx-base) for both FP32 and INT8, tested by Intel on 5/12/2021. Results may vary. For workloads and configurations visit www.Intel.com/PerformanceIndex.
  11-12. Intel Optimization for PyTorch: immediate performance benefits. Platinum 8380: 1-node, 2x Intel Xeon Platinum 8380 processor with 1 TB (16 slots / 64 GB / 3200) total DDR4 memory, ucode 0xd000280, HT on, Turbo on, Ubuntu 20.04.1 LTS, 5.4.0-73-generic, Intel 900 GB SSD OS drive; ResNet50 v1.5, FP32/INT8, BS=128, https://github.com/IntelAI/models/blob/icx-launch-public/quickstart/ipex-bkc/resnet50-icx/inference; DLRM, FP32/INT8, BS=16, https://github.com/IntelAI/models/blob/icx-launch-public/quickstart/ipex-bkc/dlrm-icx/inference/fp32/README.md. Software: PyTorch v1.5 w/o DNNL build for FP32 & PyTorch v1.5 + IPEX (icx) for both FP32 and INT8, tested by Intel on 5/12/2021. Results may vary. For workloads and configurations visit www.Intel.com/PerformanceIndex. (Photo source: NASA)
  13-14. Intel Optimization for MXNet: immediate performance benefits. Platinum 8380: 1-node, 2x Intel Xeon Platinum 8380 processor with 1 TB (16 slots / 64 GB / 3200) total DDR4 memory, ucode 0xd000280, HT on, Turbo on, Ubuntu 20.04.1 LTS, 5.4.0-73-generic, Intel 900 GB SSD OS drive; ResNet50 v1, FP32/INT8, BS=128, https://github.com/apache/incubator-mxnet/blob/v2.0.0.alpha/python/mxnet/gluon/model_zoo/vision/resnet.py; MobileNetv2, FP32/INT8, BS=128, https://github.com/apache/incubator-mxnet/blob/v2.0.0.alpha/python/mxnet/gluon/model_zoo/vision/mobilenet.py. Software: MXNet 2.0.0.alpha w/o DNNL build for FP32 & MXNet 2.0.0.alpha for both FP32 and INT8, tested by Intel on 5/12/2021. Results may vary. For workloads and configurations visit www.Intel.com/PerformanceIndex. (Photo source: NASA)
  15. Intel Extension for Scikit-learn. Config: Intel Xeon Platinum 8276L CPU @ 2.20 GHz, 2 sockets, 28 cores per socket. For workloads and configurations visit www.Intel.com/PerformanceIndex. Details: https://medium.com/intel-analytics-software/accelerate-your-scikit-learn-applications-a06cacf44912
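The extension is typically enabled by patching scikit-learn before estimators are imported; a minimal sketch, assuming the `scikit-learn-intelex` package is installed, and using a toy dataset rather than any workload from the deck:

```python
# Patch scikit-learn so that supported estimators are re-routed to
# oneDAL-accelerated implementations; existing sklearn code is unchanged.
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.neighbors import KNeighborsClassifier

X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 0, 1, 1]
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
pred = clf.predict([[0.9, 0.9]])
assert list(pred) == [1]
```

Because patching happens at import time, the same script falls back to stock scikit-learn behavior (and stock performance) if the patch call is simply removed, which is what the "stock vs. extension" comparisons on the following slide measure.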
  16. Performance in Kaggle Competitions:
      Kaggle challenge             | Domain                      | Algorithm(s)  | Stock E2E time (min) | Intel Extension for Scikit-learn E2E time (min) | Speedup
      KDD Cup 1999                 | Computer networks           | kNN           | 282   | 1.24 | 227.4x
      Credit Card Default          | Finance                     | SVC           | 11.9  | 0.2  | 59.5x
      Digit Recognizer (KNN)       | Image classification        | SVC           | 84.32 | 1.47 | 57.5x
      Melanoma Identification      | Image classification        | kNN           | 99.89 | 2.08 | 48x
      Digit Recognizer (SVM)       | Image classification        | PCA, SVC      | 125.5 | 4.92 | 25.5x
      What's cooking?              | Natural language processing | SVC, XGBoost  | 35.8  | 2.66 | 13.5x
      Real or Not? Disaster Tweets | Natural language processing | SVC           | 37.8  | 4.27 | 8.9x
      Home Credit Default          | Finance                     | Random Forest | 2.9   | 1.44 | 2x
      Config: Intel Xeon Gold 5218 @ 2.3 GHz (2nd generation Intel Xeon Scalable processors), 2 sockets, 16 cores per socket, HT: off, Turbo: off. For workloads and configurations visit www.Intel.com/PerformanceIndex. Details: https://medium.com/intel-analytics-software/accelerate-kaggle-challenges-using-intel-ai-analytics-toolkit-beb148f66d5a
  17. Graph Analytics with oneDAL: triangle counting speedups (due in part to vertex relabeling), by data set (V = vertices, E = edges): Enron (V: 0.03M, E: 0.4M) 1.38x; Pokec (V: 1.6M, E: 30.6M) 1.67x; Google (V: 0.9M, E: 5.1M) 1.74x; Indochina-2004 (V: 7.4M, E: 151M) 1.82x; Wikipedia (V: 12.1M, E: 378M) 2.98x; Twitter (V: 61M, E: 1202M) 8.02x; Web (V: 50M, E: 1810M) 166.1x. Config: Intel Xeon Platinum 8280 CPU @ 2.70 GHz, 2x28 cores, HT: on. For workloads and configurations visit www.Intel.com/PerformanceIndex. Data sets: https://github.com/sbeamer/gapbs | https://snap.stanford.edu/data
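For readers unfamiliar with the benchmarked algorithm, here is a minimal reference implementation of triangle counting (not oneDAL's), which counts each triangle once by only intersecting neighborhoods of higher-numbered vertices:

```python
from itertools import combinations

def count_triangles(edges):
    # Build an undirected adjacency map.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    count = 0
    for u in adj:
        # Consider only neighbors with a higher id, so each
        # triangle {u, v, w} with u < v < w is counted exactly once.
        higher = sorted(v for v in adj[u] if v > u)
        for v, w in combinations(higher, 2):
            if w in adj[v]:
                count += 1
    return count

# K4 (complete graph on 4 vertices) contains C(4,3) = 4 triangles.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
assert count_triangles(edges) == 4
```

The neighbor-intersection step dominates the runtime; relabeling vertices (e.g., by degree), as the slide notes, is one way an optimized library makes those intersections shorter and more cache-friendly, which is where the reported speedups come from.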
  18. E2E Workload Performance: phase-wise percentage breakdown (readcsv, ETL, train/test split, ML) for the Census and PLAsTiCC workloads. Combining optimized software with optimized hyperparameters improves total end-to-end time by 23x and 29x on the Census and PLAsTiCC workloads versus unoptimized software (higher is better). Config: Intel Xeon Platinum 8280L @ 28 cores. For workloads and configurations visit www.Intel.com/PerformanceIndex. Details: https://medium.com/intel-analytics-software/performance-optimizations-for-end-to-end-ai-pipelines-231e0966505a
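The four phases measured above can be sketched in miniature with only the standard library; the dataset and the threshold "model" below are toy stand-ins, not the Census or PLAsTiCC workloads:

```python
import csv
import io
import random

raw = "age,hours,income\n25,40,0\n38,45,1\n52,60,1\n29,20,0\n44,50,1\n33,35,0\n"

# Phase 1: readcsv
rows = list(csv.DictReader(io.StringIO(raw)))

# Phase 2: ETL (cast string fields to numeric types)
data = [(float(r["age"]), float(r["hours"]), int(r["income"])) for r in rows]

# Phase 3: train/test split
random.seed(0)
random.shuffle(data)
split = int(0.67 * len(data))
train, test = data[:split], data[split:]

# Phase 4: ML (fit a trivial threshold classifier on training data)
threshold = sum(h for _, h, _ in train) / len(train)
predict = lambda hours: int(hours >= threshold)
accuracy = sum(predict(h) == y for _, h, y in test) / len(test)
assert 0.0 <= accuracy <= 1.0
```

The point of the slide is that every phase, not just the ML step, benefits from software acceleration (e.g., Modin for readcsv/ETL, the scikit-learn extension for ML), which is why the speedups are reported end to end.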
  19. AI Applications from Partnerships: athlete training, telecom network quality, drug discovery.
  20. Summary and Call to Action: Software AI Accelerators can deliver orders-of-magnitude performance gains, and there is even more potential for the AI software community: ▪ Create compiler technologies to automate kernel optimizations ▪ Increase parallelism to achieve higher compute utilization ▪ Optimize for memory bandwidth, memory size, and NUMA ▪ Scale to large distributed compute. Find more at: ai.intel.com
  21. 21. NOTICES & DISCLAIMERS 21 ▪ Results have been estimated or simulated. ▪ Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex​. ▪ Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available ​updates. See backup for configuration details. No product or component can be absolutely secure. ▪ Your costs and results may vary. ▪ Intel technologies may require enabled hardware, software or service activation. ▪ All product plans and roadmaps are subject to change without notice. ▪ Intel contributes to the development of benchmarks by participating in, sponsoring, and/or contributing technical support to various benchmarking groups, including the BenchmarkXPRT Development Community administered by Principled Technologies. ▪ © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. ​