Enabling Ubiquitous Visual Intelligence
Through Deep Learning	

Dr. Ren Wu	

Distinguished Scientist, Baidu	

wuren@baidu.com 	

@韧在百度
Dr. Ren Wu	

•  Distinguished Scientist, Baidu	

•  HSA Chief Software Architect, AMD	

•  PI, HP Labs CUDA Research Center	

•  World Computer Xiangqi Champion	

•  AI expert	

•  Heterogeneous Computing expert	

•  Computational scientist
Eighteen Years Ago - 05/11/1997
Deep Blue

A classic example of application-specific system design: an IBM supercomputer with 480 custom-made VLSI chess chips, running a massively parallel search algorithm with a highly optimized implementation.
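For context, the core of such a search is alpha-beta minimax. Below is a toy, hedged sketch of the serial negamax form; the game callbacks are invented for illustration, and Deep Blue's actual parallel algorithm and evaluation were far more elaborate.

```python
def alphabeta(state, depth, alpha, beta, evaluate, moves, apply_move):
    """Toy negamax alpha-beta search. evaluate/moves/apply_move are
    hypothetical game-specific callbacks, not Deep Blue's interfaces."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)
    for m in legal:
        score = -alphabeta(apply_move(state, m), depth - 1,
                           -beta, -alpha, evaluate, moves, apply_move)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # cutoff: the opponent will never allow this line
    return alpha

# Usage on a trivial game: take 1 or 2 stones; taking the last stone wins.
moves = lambda n: [m for m in (1, 2) if m <= n]
apply_move = lambda n, m: n - m
evaluate = lambda n: -1 if n == 0 else 0  # side to move has already lost
print(alphabeta(4, 10, -2, 2, evaluate, moves, apply_move))  # 1: a win
```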
Computer Chess and Moore’s Law
Deep Learning Works

"We deepened our investment in advanced technologies like Deep Learning, which is already yielding near-term enhancements in customer ROI and is expected to drive transformational change over the longer term."

- Robin Li, Baidu CEO
Deep Learning

[Chart: performance vs. amount of data. Deep learning keeps improving as data grows, while older algorithms plateau.]
Deep Learning vs. Human Brain

Learned feature hierarchy: pixels → edges → object parts (combinations of edges) → object models.

Deep architecture in the brain: Retina → Area V1 → Area V2 → Area V4, i.e., pixels → edge detectors → primitive shape detectors → higher-level visual abstractions.

Slide credit: Andrew Ng
[Diagram: deep learning applied across voice, text, image, and user data]
Deep Convolutional Neural Networks

* "Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster," courtesy of Jonatan Ward, Sergey Andreev, Francisco Heredia, Bogdan Lazar, Zlatka Manevska
Big Data

•  Storage: >2,000 PB
•  Processing: 10-100 PB/day
•  Web pages: 100b-1,000b
•  Index: 100b-1,000b
•  Updates: 1b-10b/day
•  Logs: 100 TB-1 PB/day
Heterogeneous Computing

1993 world #1: Thinking Machines CM-5/1024, 131 GFlops
2013: Samsung Galaxy Note 3 smartphone (Qualcomm Snapdragon 800), 129 GFlops

2000 world #1: ASCI White (IBM RS/6000 SP), 12.3 TFlops, 6 MW power, 106 tons
2013: two Mac Pro workstations (dual AMD GPUs each), 14 TFlops
Deep Learning: Two-Step Process

Supercomputers are used for training, and the trained models are then deployed everywhere:

•  Datacenters
•  Tablets, smartphones
•  Wearable devices
•  IoT devices
Deep Learning: Training

Big data + Deep learning + High-performance computing = Intelligence

Big data + Deep learning + Heterogeneous computing = Success
Image Recognition: Human vs. Machine

http://7-themes.com/6977111-cute-little-girl-play-white-dog.html
ImageNet Classification Challenge

•  ImageNet dataset
•  More than 15 million images in about 22,000 categories
•  ILSVRC (ImageNet Large-Scale Visual Recognition Challenge)
•  Classification task: 1.2 million images in 1,000 categories
•  One of the most challenging computer vision benchmarks
•  Increasing attention from both industry and academia

* Olga Russakovsky et al., ECCV 2014
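For reference, the classification task is scored by top-5 error: a prediction counts as correct if the true label appears among the model's five highest-scoring classes. A minimal sketch of the metric (array shapes are illustrative):

```python
import numpy as np

def top5_error(scores, labels):
    """scores: (N, num_classes) class scores; labels: (N,) true class ids.
    Returns the fraction of examples whose true label is not in the top 5."""
    top5 = np.argsort(scores, axis=1)[:, -5:]      # 5 best classes per example
    hits = (top5 == labels[:, None]).any(axis=1)   # true label among them?
    return 1.0 - hits.mean()

# Toy usage: 4 examples, 10 classes.
rng = np.random.default_rng(0)
scores = rng.random((4, 10))
labels = rng.integers(0, 10, size=4)
print(f"top-5 error: {top5_error(scores, labels):.2%}")
```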
ImageNet Classification Challenge

[Figure slides: challenge overview, courtesy of Fei-Fei Li]
ImageNet Classification 2012-2014

Team         Year  Place  Error (top-5)  Uses external data
SuperVision  2012  -      16.4%          no
SuperVision  2012  1st    15.3%          ImageNet 22k
Clarifai     2013  -      11.7%          no
Clarifai     2013  1st    11.2%          ImageNet 22k
MSRA         2014  3rd    7.35%          no
VGG          2014  2nd    7.32%          no
GoogLeNet    2014  1st    6.67%          no

Slide credit: Yangqing Jia, Google

Invincible?
Latest Results

Team        Date        Top-5 test error
GoogLeNet   2014        6.67%
Deep Image  01/12/2015  5.98%
Deep Image  02/05/2015  5.33%
Microsoft   02/05/2015  4.94%
Google      03/02/2015  4.82%
Deep Image  05/10/2015  4.58%
Insights and Inspirations

"多算胜，少算不胜" (Sun Tzu, The Art of War, "Laying Plans," 544-496 BC): more calculation wins; less calculation loses.

"元元本本，殚见洽闻" (Ban Gu, "Rhapsody on the Western Capital," 32-92 AD): the more you see, the more you know.

"明足以察秋毫之末" (Mencius, "King Hui of Liang I," 372-289 BC): sight keen enough to perceive the tip of an autumn hair, i.e., the ability to see very fine detail.
Project Minwa (百度敏娲)

•  Minerva + Athena + 女娲 (Nüwa)
•  Athena: goddess of wisdom, warfare, divine intelligence, architecture, and crafts
•  Minerva: goddess of wisdom, magic, medicine, arts, commerce, and defense
•  女娲 (Nüwa): molded humans from clay, mended the heavens with smelted stone; goddess of marriage and music

World's Largest Artificial Neural Networks

•  Pushing the state of the art
•  ~100x bigger than previous ones
•  A new kind of intelligence?
Hardware/Software Co-design

•  Stochastic gradient descent (SGD)
•  High compute density
•  Scales up to 100 nodes
•  High bandwidth, low latency
•  36 nodes, 144 GPUs, 6.9 TB host memory, 1.7 TB device memory
•  0.6 PFLOPS
•  Highly optimized software stack
•  RDMA / GPUDirect
•  New data partition and communication strategies (a toy sketch follows below)

[Diagram: GPUs connected via InfiniBand]
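A minimal sketch of the data-parallel SGD pattern such a cluster accelerates: each worker computes a gradient on its own shard, and the averaged gradient (an all-reduce) updates the shared weights. This is a schematic assumption, not Minwa's actual partitioning or communication scheme.

```python
import numpy as np

def data_parallel_sgd(w, shards, grad_fn, lr=0.05, steps=200):
    """Simulate synchronous data-parallel SGD: one gradient per 'GPU',
    averaged (the all-reduce step) and applied to the shared weights w."""
    for _ in range(steps):
        grads = [grad_fn(w, shard) for shard in shards]  # in parallel on a cluster
        w = w - lr * np.mean(grads, axis=0)              # averaged update
    return w

# Toy usage: least-squares regression with data split across 4 "GPUs".
rng = np.random.default_rng(1)
X, y = rng.normal(size=(400, 3)), rng.normal(size=400)
shards = [(X[i::4], y[i::4]) for i in range(4)]
grad_fn = lambda w, s: 2 * s[0].T @ (s[0] @ w - s[1]) / len(s[1])
print(data_parallel_sgd(np.zeros(3), shards, grad_fn))
```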
Minwa
Speedup (wall time to convergence)

[Chart: validation-set accuracy over training time (hours, log scale) for 1, 16, and 32 GPUs]

To reach 80% accuracy: 32 GPUs take 8.6 hours vs. 212 hours on 1 GPU, a 24.7x speedup.
Data Augmentation: "见多识广" (the more you see, the more you know)

You can never have enough training examples!

Key observations:
•  Invariant to the illuminant of the scene
•  Invariant to the observer

Augmentation approaches:
•  Color casting
•  Optical distortion
•  Rotation, cropping, etc.
And Color Constancy

Key observations:
•  Invariant to the illuminant of the scene
•  Invariant to the observer

Augmentation approaches:
•  Color casting
•  Optical distortion
•  Rotation, cropping, etc.

The Color of the Dress

"Inspired by the color constancy principle. Essentially, this 'forces' our neural network to develop its own color constancy ability."
Data Augmentation

Possible variations:

Augmentation     Number of possible changes
Color casting    68,920
Vignetting       1,960
Lens distortion  260
Rotation         20
Flipping         2
Cropping         82,944 (crop size 224x224, input image size 512x512)

The Deep Image system learned from ~2 billion examples, out of 90 billion possible candidates.
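A rough sketch of how a few of these augmentations compose per training example. The ranges are illustrative guesses, not Deep Image's actual parameters, but the crop count matches the table: (512 - 224)^2 = 288^2 = 82,944 positions.

```python
import numpy as np

def augment(img, rng):
    """img: 512x512x3 uint8 image. Returns one random 224x224 variant."""
    # Color casting: shift each RGB channel independently (range is a guess).
    cast = rng.integers(-20, 21, size=3)
    img = np.clip(img.astype(np.int16) + cast, 0, 255).astype(np.uint8)
    # Horizontal flip with probability 0.5 (the "Flipping: 2" row).
    if rng.random() < 0.5:
        img = img[:, ::-1]
    # Random crop: 288 x 288 = 82,944 possible positions, as in the table.
    y, x = rng.integers(0, 512 - 224, size=2)
    return img[y:y + 224, x:x + 224]

rng = np.random.default_rng(0)
print(augment(np.zeros((512, 512, 3), dtype=np.uint8), rng).shape)  # (224, 224, 3)
```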
Data Augmentation vs. Overfitting
Examples

Bathtub, isopod, Indian elephant, ice bear: some hard cases addressed by adding our data augmentation.
Multi-scale Training: "明察秋毫" (seeing the finest detail)

•  Same crop size, different resolutions; fixed 224x224 crops
•  Downsized training images reduce computational cost, but are not state-of-the-art
•  Different models trained on different image sizes: 256x256 and 512x512
•  The high-resolution model works: 7.96% top-5 error at 256x256 vs. 7.42% at 512x512
•  Multi-scale models are complementary: the fused model reaches 6.97% (a fusion sketch follows below)
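Fusion can be as simple as averaging the class probabilities of the per-scale models; a schematic sketch (uniform averaging is my assumption; the actual fusion rule may weight models differently):

```python
import numpy as np

def fuse(prob_256, prob_512):
    """Average the softmax outputs of the 256x256 and 512x512 models.
    Each input: (N, num_classes) array of class probabilities."""
    return (prob_256 + prob_512) / 2.0

# Toy usage: the two scales disagree; the fused model splits the difference.
p256 = np.array([[0.70, 0.20, 0.10]])
p512 = np.array([[0.30, 0.60, 0.10]])
print(fuse(p256, p512))            # [[0.5 0.4 0.1]]
print(fuse(p256, p512).argmax(1))  # fused top-1 class: 0
```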
Multi-scale Training

[Examples: tricycle, washer, backpack, little blue heron]
Single Model Performance

•  One basic configuration has 16 layers
•  This configuration has 212.7M weights, about 40% more than VGG's

Team              Top-5 val. error
VGG               8.0%
GoogLeNet         7.89%
BN-Inception      5.82%
MSRA (PReLU-net)  5.71%
Deep Image        5.40%
Robustness
Major Differentiators

•  Custom-built supercomputer dedicated to deep learning
•  Simple, scalable algorithm + fully optimized software stack
•  Larger models
•  More aggressive data augmentation
•  Multi-scale training, including high-resolution images

Scalability + insights, pushed to the extreme
Deep Learning: Deployment

Big data + Deep learning + High-performance computing = Intelligence

Big data + Deep learning + Heterogeneous computing = Success
Owl of Minwa (百度敏鸮)

Supercomputers → Datacenters → Tablets, smartphones

Models are trained on supercomputers; the trained models are then deployed in many ways: data centers (cloud), smartphones, and even wearables and IoT devices.

OpenCL-based, lightweight, and high-performance.

DNNs everywhere!

Knowledge, wisdom, perspicacity, and erudition.
DNNs Everywhere

Supercomputers: 1000s of GPUs
Datacenters: 100k-1m servers
Tablets, smartphones: 2b (in China)
Wearable devices, IoT: 50b in 2020?

Supercomputers are used for training; trained DNNs are then deployed to data centers (cloud), smartphones, and even wearables and IoT devices.
Offline Mobile DNN App

•  Image recognition on a mobile device
•  Real-time, no connectivity needed
•  Works directly from the video stream: what you point at is what you get
•  Everything is done within the device (a schematic sketch follows below)

•  OpenCL-based, highly optimized
•  Large deep neural network models
•  Thousands of objects: flowers, dogs, bags, etc.
•  Unleashes the full potential of the device hardware

•  Smartphones now; wearables and IoT tomorrow
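Conceptually, the on-device loop is just preprocess + forward pass per video frame, with no network I/O anywhere. A toy numpy stand-in (the real app runs a large CNN through OpenCL; the single linear layer and shapes here are invented for illustration):

```python
import numpy as np

def classify_frame(frame, w_fc, class_names):
    """Toy on-device 'forward pass': normalize a frame, apply one linear
    layer + softmax, return the top label. Nothing leaves the device."""
    x = (frame.astype(np.float32) / 255.0).reshape(-1)  # flatten pixels
    logits = w_fc @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return class_names[int(probs.argmax())]

rng = np.random.default_rng(0)
class_names = ["dog", "flower", "bag"]
w_fc = rng.normal(size=(3, 32 * 32 * 3))                # hypothetical weights
frame = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
print(classify_frame(frame, w_fc, class_names))
```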
Cloud Computing: What's Missing?

Bandwidth? Latency? And power consumption?

Moving data around is expensive, very expensive!

* Artem Vasilyev: CNN optimizations for embedded systems and FFT
Cloud Computing: What's Missing?

How about privacy?
What's Next?

Dedicated Hardware + Heterogeneous Computing

* Mark Horowitz
Heterogeneous Computing

"Human mind and brain is not a single general-purpose processor but a collection of highly specialized components, each solving a different, specific problem and yet collectively making up who we are as human beings and thinkers." - Prof. Nancy Kanwisher
Vision Processing Power Efficiency (slide credit: Khronos Group, 2015)

•  Wearables will need 'always-on' vision
   -  With a smaller thermal limit / battery than phones!
•  GPUs have 10x the imaging power efficiency of CPUs
   -  GPUs are architected for efficient pixel handling
•  Dedicated hardware/DSPs can be even more efficient
   -  With some loss of generality
•  Mobile SoCs have space for more transistors
   -  But they can't all turn on at the same time = dark silicon
   -  More gates can be integrated 'for free' if you are careful about how and when they are used

[Chart: power efficiency (1x, 10x, 100x) vs. computation flexibility, rising from multi-core CPU to GPU compute to dedicated hardware]

There is potential for dedicated sensor/vision silicon to be integrated into mobile processors. But how will it be programmed for PORTABILITY and POWER EFFICIENCY?
OpenCL Ecosystem (slide credit: Khronos Group, 2015)

[Diagram: implementers (desktop/mobile/FPGA); working group members; apps/tools/tests/courseware; layered on single-source C++ programming, a portable kernel intermediate language, and the core API and language specs]
Everything Connected, Everything Intelligent

From the big data era to the AI era

I²oT: the Intelligent Internet of Things
Thank you!

"Enabling Ubiquitous Visual Intelligence Through Deep Learning," a Keynote Presentation from Baidu

  • 1.
    Enabling Ubiquitous VisualIntelligence Through Deep Learning Dr. Ren Wu Distinguished Scientist, Baidu wuren@baidu.com @韧在百度
  • 2.
    Dr. Ren Wu • Distinguished Scientist, Baidu •  HSA Chief Software Architect, AMD •  PI, HP Labs CUDA Research Center •  World Computer Xiangqi Champion •  AI expert •  Heterogeneous Computing expert •  Computational scientist
  • 3.
    Eight Years Ago- 05/11/1997
  • 4.
    Deep Blue A classicexample of application-specific system design comprised of an IBM supercomputer with 480 custom-madeVLSI chess chips, running massively parallel search algorithm with highly optimized implementation.
  • 5.
    Computer Chess andMoore’s Law
  • 6.
    Deep Learning Works “Wedeepened our investment in advanced technologies like Deep Learning, which is already yielding near term enhancements in customer ROI and is expected to drive transformational change over the longer term.” – Robin Li, Baidu CEO
  • 7.
    Amount of data Performance Deeplearning Old algorithms Deep Learning
  • 8.
    Deep Learning vs.Human Brain pixels edges object parts (combination of edges) object models Deep Architecture in the Brain Retina Area V1 Area V2 Area V4 pixels Edge detectors Primitive shape detectors Higher level visual abstractions Slide credit: Andrew Ng Voice Text Image User
  • 9.
    Deep Convolutional NeuralNetworks * Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster courtesy of Jonatan Ward, Sergey Andreev, Francisco Heredia, Bogdan Lazar, Zlatka Manevska
  • 10.
    Big Data •  >2000PBStorage • 10-100PB/dayProcessing •  100b-1000bWebpages •  100b-1000bIndex •  1b-10b/dayUpdate •  100TB~1PB/dayLog
  • 11.
    Heterogeneous Computing 1993 world#1 Think Machine CM5/1024 131 GFlops 2013 Samsung Note 3 smartphone (Qualcomm SnapDragon 800) 129 Gflops 2000 world #1 ASCI White (IBM RS/6000SP) 6MW power, 106 tons 12.3 TFlops 2013 Two MacPro workstation (dual AMD GPUs each) 14 TFlops
  • 12.
    Deep Learning: TwoStep Process Supercomputers used for training And then deploy the trained models everywhere! Datacenters Tablets, smartphones Wearable devices IoTs
  • 13.
    Deep Learning: Training Big data + Deep learning + High performance computing = Intelligence Big data + Deep learning + Heterogeneous computing = Success
  • 14.
    Image Recognition human vs.machine http://7-themes.com/6977111-cute-little-girl-play-white-dog.html
  • 15.
    ImageNet Classification Challenge •  ImageNet dataset •  More than 15 million images belonging to about 22,000 categories •  ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) •  Classification task: 1.2 million images contains 1,000 categories •  One of the most challenging computer vision benchmarks •  Increasing attention both from industry and academic communities * Olga Russakovsky et al. ECCV 2014
  • 16.
  • 17.
  • 18.
    ImageNet Classification 2012-2014 Team Year Place Error (top-5) Uses external data SuperVision 2012 - 16.4% no SuperVision 2012 1st 15.3% ImageNet 22k Clarifai 2013 - 11.7% no Clarifai 2013 1st 11.2% ImageNet 22k MSRA 2014 3rd 7.35% no VGG 2014 2nd 7.32% no GoogLeNet 2014 1st 6.67% no Slide credit: Yangqing Jia, Google Invincible ?
  • 19.
  • 20.
    Latest Results Team DateTop-5 test error GoogLeNet 2014 6.66% Deep Image 01/12/2015 5.98% Deep Image 02/05/2015 5.33% Microsoft 02/05/2015 4.94% Google 03/02/2015 4.82% Deep Image 05/10/2015 4.58%
  • 21.
    Insights and Inspirations 多算胜少算不胜 孙⼦子计篇 (Sun Tzu, 544-496 BC) More calculations win, few calculation lose 元元本本殚⻅见洽闻 班固 ⻄西都赋(Gu Ban, 32-92 AD) Meaning the more you see the more you know 明⾜足以察秋毫之末 孟⼦子梁惠⺩王上 (Mencius, 372-289 BC) ability to see very fine details
  • 22.
    Project Minwa (百度敏娲) • Minerva + Athena + ⼥女娲 •  Athena: Goddess of Wisdom,Warfare, Divine Intelligence,Architecture, and Crafts •  Minerva: Goddess of wisdom, magic, medicine, arts, commerce and defense •  ⼥女娲: 抟⼟土造⼈人, 炼⽯石补天, 婚姻, 乐器 World’s Largest Artificial Neural Networks v Pushing the State-of-the-Art v ~ 100x bigger than previous ones v New kind of Intelligence?
  • 23.
    Hardware/Software Co-design •  Stochasticgradient descent (SGD) •  High compute density •  Scale up, up to 100 nodes •  High bandwidth low latency •  36 nodes, 144 GPUs, 6.9TB Host, 1.7TB Device •  0.6 PFLOPS •  Highly Optimized software stack •  RDMA/GPU Direct •  New data partition and communication strategies GPUs Infiniband
  • 24.
  • 25.
    Speedup (wall timefor convergence) Validation set accuracy for different numbers of GPUs 0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   0.25   0.5   1   2   4   8   16   32   64   128   256   Accuracy Time (hours) 32 GPU 16 GPU 1 GPU Accuracy 80% 32 GPU: 8.6 hours 1 GPU: 212 hours Speedup: 24.7x
  • 26.
    Never have enoughtraining examples! Key observations •  Invariant to illuminant of the scene •  Invariant to observers Augmentation approaches •  Color casting •  Optical distortion •  Rotation and cropping etc Data Augmentation “⻅见多识⼲⼴广”
  • 27.
    And the ColorConstancy Key observations •  Invariant to illuminant of the scene •  Invariant to observers Augmentation approaches •  Color casting •  Optical distortion •  Rotation and cropping etc The Color of the Dress “Inspired by the color constancy principal. Essentially, this ‘forces’ our neural network to develop its own color constancy ability.”
  • 28.
    Data Augmentation Augmentation Thenumber of possible changes Color casting 68920 Vignetting 1960 Lens distortion 260 Rotation 20 Flipping 2 Cropping 82944(crop size is 224x224, input image size is 512x512) Possible variations The Deep Image system learned from ~2 billion examples, out of 90 billion possible candidates.
  • 29.
  • 30.
    Examples Bathtub Isopod Indian elephant Ice bear Some hard cases addressed by adding our data augmentation.
  • 31.
    Multi-scale Training •  Samecrop size, different resolution •  Fixed-size 224*224 •  Downsized training images •  Reduces computational costs •  But not for state-of-the-art •  Different models trained by different image sizes 256*256 512*512 •  High-resolution model works •  256x256: top-5 7.96% •  512x512: top-5 7.42% •  Multi-scale models are complementary •  Fused model: 6.97% “明查秋毫”
  • 32.
  • 33.
  • 34.
    Single Model Performance • One basic configuration has 16 layers •  The number of weights in our configuration is 212.7M •  About 40% bigger than VGG’s Team Top-5 val. error VGG 8.0% GoogLeNet 7.89% BN-Inception 5.82% MSRA, PReLU-net 5.71% Deep Image 5.40%
  • 35.
  • 40.
    Major Differentiators • Customized built supercomputer dedicated for DL •  Simple, scalable algorithm + Fully optimized software stack •  Larger models •  More Aggressive data augmentation •  Multi-scale, include high-resolution images Scalability + Insights and push for extreme
  • 41.
    Deep Learning: Deployment Bigdata + Deep learning + High performance computing = Intelligence Big data + Deep learning + Heterogeneous computing = Success
  • 42.
    Owl of Minwa(百度敏鸮) Supercomputers Datacenters Tablets, smartphones Models trained by supercomputers Trained models will be deployed in many ways data centers (cloud), smartphones, and even wearables and IoTs d OpenCL based, light weight and high performance DNNs everywhere ! knowledge, wisdom, perspicacity and erudition
  • 43.
    DNNs Everywhere Supercomputers Datacenters Tablets, smartphones Wearable devices IoTs 1000s GPUs 100k-1m servers 2b (in China) 50b in 2020? Supercomputer used for training Trained DNNs then deployed to data centers (cloud), smartphones, and even wearables and IoTs
  • 44.
    Offline Mobile DNNApp •  Image recognition on mobile device •  Real time and no connectivity needed •  directly from video stream, what you point is what you get •  Everything is done within the device •  OpenCL based, highly optimized •  Large deep neural network models •  Thousands of objects, flowers, dogs, and bags etc •  Unleashed the full potential of the device hardware •  Smart phones now, Wearables and IoTs tomorrow
  • 46.
    Cloud Computing: What’sMissing? Bandwidth? Latency? and Power consumption? *ArtemVasilyev: CNN optimizations for embedded systems and FFT Moving data around is expensive, very expensive!
  • 47.
    Cloud Computing: What’sMissing? How about privacy?
  • 48.
    What’s Next? Dedicated Hardware+ Heterogeneous Computing *MarkHorowitz
  • 49.
    Heterogeneous Computing “Human mindand brain is not a single general-purpose processor but a collection of highly specialized components, each solving a different, specific problem and yet collectively making up who we are as human beings and thinkers. “ - Prof. Nancy Kanwisher
  • 50.
    © Copyright KhronosGroup 2015 - Page 50 Vision Processing Power Efficiency • Wearables will need ‘always-on’ vision -  With smaller thermal limit / battery than phones! • GPUs have x10 imaging power efficiency over CPU -  GPUs architected for efficient pixel handling • Dedicated Hardware/DSPs can be even more efficient -  With some loss of generality • Mobile SOCs have space for more transistors -  But can’t turn on at same time = Dark Silicon -  Can integrate more gates ‘for free’ if careful how and when they are used PowerEfficiency Computation Flexibility Dedicated Hardware GPU Compute Multi-core CPU X1 X10 X100 Potential for dedicated sensor/vision silicon to be integrated into Mobile Processors But how will they be programmed for PORTABILITY and POWER EFFICIENCY?
  • 51.
    © Copyright KhronosGroup 2015 - Page 51 OpenCL Ecosystem Implementers Desktop/Mobile/FPGA Working Group Members Apps/Tools/Tests/Courseware Single Source C++ Programming Portable Kernel Intermediate Language Core API and Language Specs
  • 52.
    Everything Connected Everything Intelligent Big data eraAI era I2 oT Intelligent Internet of Things
  • 53.