Copyright © 2017 1
How Image Sensor and Video
Compression Parameters Impact
Vision Algorithms
Ilya Brailovskiy, PhD, Principal CV Engineer @ Amazon Lab126
May 2017
Disclaimer
The views expressed in the following slides do not represent the views of
Amazon.com. They are solely the opinions of the presenter, Ilya Brailovskiy,
based on trends and projections provided by the sources cited. No
representations are made regarding the completeness, timeliness,
suitability or validity of any information presented.
All images and representations are sourced from the public domain; they
are used for the express purpose of research and education, not for promoting
any business, products, services, use or individuals.
What is this about?
• Algorithms reaching human-level accuracy
• Quick check against “real-life” footage
• What can we learn from this?
• Image sizes: some theory behind object detection
• Adding video compression
• Bitrate impact
• Resolution impact
• Conclusions
Human and Face Detection Task
• Feature-based algorithms
  • Histogram of oriented gradients (HOG)
  • Viola-Jones family
  • Aggregate channel features (ACF) methods
  • …
• Deep learning algorithms
  • Fast(er) R-CNN
  • YOLO
  • SSD
  • …
Human vs. algorithm object detection accuracy
[Figure: ImageNet classification top-5 error (%)]
Sources: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” CVPR 2016;
Russakovsky et al., IJCV, 2015
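For readers unfamiliar with the metric on this chart: top-5 error counts a prediction as wrong only when the true label is missing from the five highest-scoring classes. The sketch below is illustrative only; the function name and the toy scores are assumptions, not data from the cited papers.

```python
def top_k_error(scores_per_image, true_labels, k=5):
    """Fraction of images whose true label is not among the k top-scoring classes."""
    misses = 0
    for scores, label in zip(scores_per_image, true_labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(true_labels)


# Toy example with 6 classes and 2 images (made-up scores).
toy_scores = [
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.10],  # true class 2 has the highest score
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],  # true class 5 has the lowest score
]
print(top_k_error(toy_scores, [2, 5]))
```

In the toy data the first image counts as a hit and the second as a miss, so half of the predictions are top-5 errors.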
Let’s see!
YOLO (Darknet) run against awkward 1080p @ 30 fps footage: recall error > 35%
(output resolution reduced to 360p)
Source: Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. “You Only Look Once: Unified, Real-Time Object Detection.” CVPR 2016
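One way to read the “recall error” figure is as a miss rate: the share of annotated objects for which the detector produced no matching box, i.e. 1 − recall. The sketch below assumes that reading, boxes given as (x1, y1, x2, y2), and a match criterion of IoU ≥ 0.5; the slide states none of these, so treat them as assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_error(ground_truth, detections, iou_threshold=0.5):
    """Fraction of ground-truth boxes with no matching detection (1 - recall)."""
    if not ground_truth:
        return 0.0
    unused = list(detections)
    matched = 0
    for gt in ground_truth:
        # Greedily take the best-overlapping detection that is still unused.
        best = max(unused, key=lambda d: iou(gt, d), default=None)
        if best is not None and iou(gt, best) >= iou_threshold:
            matched += 1
            unused.remove(best)
    return 1.0 - matched / len(ground_truth)


# Made-up boxes: three annotated objects, two of them detected.
truth = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
found = [(1, 1, 10, 10), (21, 19, 31, 29)]
print(recall_error(truth, found))
```

Greedy matching is the simplest choice here; benchmark toolkits usually sort detections by confidence first, which this sketch omits.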
What could be fixed?
• Improve the algorithm(s)
  • Design a better algorithm (for example, by adding modalities)
  • Use better, more accurate training for your models
• Provide instant, frame-level feedback from the CV algorithm to the camera algorithm
  • Similar to face/human autofocus: embed your “road sign” detection into autofocus
  • (Auto)focus is just one example
    • Auto exposure and auto white balance
    • Low-light behaviors (noise, colorization, etc.)
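As one possible shape for such a feedback loop, the sketch below meters exposure on the detected regions rather than on the whole frame. It is a hypothetical illustration, not a real camera API: the function name, the 8-bit target level of 118, and the per-frame step limit are all assumptions.

```python
def roi_exposure_gain(luma, boxes, target=118.0, max_step=2.0):
    """Suggest a multiplicative exposure change from the brightness inside detected boxes.

    luma  : 2-D list of 8-bit luminance values (rows of pixels)
    boxes : detections as (x1, y1, x2, y2), with x2 and y2 exclusive
    """
    total, count = 0, 0
    for x1, y1, x2, y2 in boxes:
        for row in luma[y1:y2]:
            total += sum(row[x1:x2])
            count += len(row[x1:x2])
    if count == 0:
        return 1.0  # nothing detected: leave exposure unchanged
    gain = target / max(total / count, 1.0)
    # Limit the per-frame change so the loop does not oscillate.
    return min(max(gain, 1.0 / max_step), max_step)


# Made-up frame: a dark object (value 59) on a bright background (value 200).
frame = [
    [200, 200, 200, 200],
    [200, 59, 59, 200],
    [200, 59, 59, 200],
    [200, 200, 200, 200],
]
print(roi_exposure_gain(frame, [(1, 1, 3, 3)]))
```

Whole-frame metering would see a bright scene here and lower the exposure; metering on the detection box asks for more exposure instead, which is the point of feeding CV results back to the camera.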
Camera options: sensors and lenses
Camera resolutions: 1 Mp, 2 Mp, 4 Mp, etc.
These translate to horizontal (H) pixel counts: 1280, 1920, 3840, etc.
Typical Field of View (FOV) options, in degrees

Field of View    “SmartPhone” FOV    Wide FOV    Ultra-Wide FOV
Horizontal       75                  110         130
Vertical         47                  78          101
Diagonal         83                  117         136
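The three rows of the FOV table are not independent. For an ideal rectilinear (distortion-free) lens, the diagonal FOV follows from the horizontal and vertical values via tan²(D/2) = tan²(H/2) + tan²(V/2). The sketch below applies that relation; the rectilinear model is an assumption, and real wide-angle lenses can deviate from it.

```python
import math


def diagonal_fov(h_fov_deg, v_fov_deg):
    """Diagonal FOV in degrees of an ideal rectilinear lens, from horizontal and vertical FOV."""
    tan_h = math.tan(math.radians(h_fov_deg) / 2)
    tan_v = math.tan(math.radians(v_fov_deg) / 2)
    return 2 * math.degrees(math.atan(math.hypot(tan_h, tan_v)))


for name, h, v in [("SmartPhone", 75, 47), ("Wide", 110, 78), ("Ultra-Wide", 130, 101)]:
    print(f"{name}: {diagonal_fov(h, v):.0f} degrees diagonal")
```

Under this model the computed diagonals agree with the table's diagonal values to the nearest degree.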
Pixels per meter for different FOVs/lenses
$$2 \cdot \text{Object Distance} \cdot \tan\left(\frac{\text{FOV}}{2}\right) = \text{Camera Width Pixels} \cdot \frac{\text{Object Width}}{\text{Object Width Pixels}}$$
Objects must exceed a certain width and height in pixels to be detected
(the minimum size is ~5-10x smaller for deep learning than for
feature-based algorithms)
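Solving the relation on this slide for the object's width in pixels gives Object Width Pixels = Camera Width Pixels × Object Width / (2 × Object Distance × tan(FOV/2)). The sketch below evaluates that expression and its inverse (the farthest distance at which an object still spans a given number of pixels). It assumes an ideal rectilinear lens and an object facing the camera; the 0.5 m object width and the 40-pixel minimum in the example are made-up values.

```python
import math


def object_width_pixels(camera_width_px, h_fov_deg, distance_m, object_width_m):
    """Width of an object in pixels at a given distance, per the slide's relation."""
    scene_width_m = 2 * distance_m * math.tan(math.radians(h_fov_deg) / 2)
    return camera_width_px * object_width_m / scene_width_m


def max_distance_m(camera_width_px, h_fov_deg, object_width_m, min_width_px):
    """Farthest distance at which the object still spans min_width_px pixels."""
    half_fov_tan = math.tan(math.radians(h_fov_deg) / 2)
    return camera_width_px * object_width_m / (2 * min_width_px * half_fov_tan)


# Hypothetical case: a 0.5 m wide object seen by a 1920-pixel-wide camera.
for fov in (75, 110, 130):
    px = object_width_pixels(1920, fov, distance_m=5.0, object_width_m=0.5)
    reach = max_distance_m(1920, fov, object_width_m=0.5, min_width_px=40)
    print(f"FOV {fov}: {px:.1f} px at 5 m, at least 40 px out to {reach:.1f} m")
```

Widening the FOV at a fixed sensor resolution spreads the same pixels over a wider scene, so the usable detection distance shrinks.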
Let’s add compression
• JPEG
• JPEG2000
• H.264
• SVC
• VP8/VP9
• HEVC
• What’s next? S-HEVC? AV1?
Example of bitrate dependency
Source: P. Korshunov and W. T. Ooi. “Video quality for face detection, recognition, and tracking.” ACM
Transactions on Multimedia Computing, Communications, and Applications
Performance drops when the bitrate falls below a threshold:
~500 Kbps in this Viola-Jones example
(results are similar for other algorithms, both feature-based and DNN-based)
[Figure: detection performance vs. bitrate (x-axis), with the threshold marked at 500 Kbps]
Add scaling
[Diagram: two paths compared at the same bitrate B]
• Original video → compression at bitrate B → compressed video
• Original video → downsample → reduced-res video → compression at bitrate B → reduced-res compressed video → upsample
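The reduced-resolution path in this diagram can be sketched as three steps: downsample, compress at bitrate B, upsample. The code below is a minimal stand-in, assuming a 2x scale factor, box-filter downsampling, and nearest-neighbour upsampling, none of which the slide specifies. The codec is a hypothetical placeholder callable; a real encoder/decoder pair would take its place.

```python
def downsample2x(frame):
    """Halve width and height by averaging each 2x2 block (2-D list with even dimensions)."""
    return [
        [
            (frame[y][x] + frame[y][x + 1] + frame[y + 1][x] + frame[y + 1][x + 1]) / 4
            for x in range(0, len(frame[0]), 2)
        ]
        for y in range(0, len(frame), 2)
    ]


def upsample2x(frame):
    """Double width and height by repeating each pixel (nearest neighbour)."""
    out = []
    for row in frame:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def identity_codec(frame):
    """Placeholder for 'compression at bitrate B': returns the frame unchanged."""
    return frame


def reduced_resolution_path(frame, codec=identity_codec):
    """Downsample, pass through the codec, then upsample back to the original size."""
    return upsample2x(codec(downsample2x(frame)))


original = [
    [0, 2, 4, 6],
    [2, 4, 6, 8],
    [1, 1, 3, 3],
    [1, 1, 3, 3],
]
for row in reduced_resolution_path(original):
    print(row)
```

Running the detector on the output of both paths at the same bitrate B gives the comparison the diagram describes.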
Downscaled video performance
Performance also drops below a bitrate threshold at lower resolutions,
but the threshold is lower: ~250 Kbps in this example.
So as long as the object size stays above the distance-curve threshold,
it is safe to downscale (this can be made part of the encoding
decisions, as in SVC/VP9/AV1).
[Figures: detection performance vs. bitrate (x-axis), with thresholds marked at 500 Kbps and 250 Kbps]
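The rule of thumb on this slide can be written as a small decision function. This is a sketch only: the 500 and 250 Kbps defaults are the thresholds from this one example and will differ for other content and detectors, while the 2x scale factor and the minimum object size are hypothetical parameters that a real system would measure.

```python
def choose_encoding(bitrate_kbps, object_width_px, min_object_px,
                    full_res_threshold_kbps=500,
                    reduced_res_threshold_kbps=250,
                    scale=2):
    """Pick a resolution mode that keeps the detector above its performance thresholds.

    object_width_px : object width at full resolution
    min_object_px   : smallest object width the detector handles reliably
    """
    if bitrate_kbps >= full_res_threshold_kbps:
        return "full"
    downscaled_width = object_width_px / scale
    if bitrate_kbps >= reduced_res_threshold_kbps and downscaled_width >= min_object_px:
        return "reduced"
    # Neither resolution is expected to preserve detection performance.
    return "below-threshold"


for bitrate, width in [(600, 80), (300, 80), (300, 30), (200, 80)]:
    print(bitrate, width, choose_encoding(bitrate, width, min_object_px=20))
```

The object-size check is what ties this decision back to the FOV and distance analysis: downscaling only helps while the objects of interest remain large enough in pixels.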
What is human understanding, really?
• ITU-T Recommendation P.912: “Subjective Video Quality Assessment Methods for Recognition Tasks”
• Experiments to validate the analysis
• Test: 12 sequences, 3 of the 12 sequences are obvious
• Some screening was required to remove subjects who were:
  • Not paying attention
  • Not understanding the task
• According to the study, recognition depends on light levels
• Sunlight: recognition is 38 times better than in the dark
• Recognition drops for objects that are further apart
• Still objects are recognized 2.7 times better than moving ones
Summary
• Improve your models
• Train your models for your camera
• Use DNN/CNN if it’s affordable
• Build a feedback loop from your CV to your camera
• Decide on FOV and resolution based on your object sizes
• Consider reducing resolution if you need to handle lower bitrates: in many cases you may be able to recover your detection performance
• And be very patient with your humans and your embedded CV algorithms ☺
Q & A
The end!
References
• Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” CVPR 2016
• Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge.” IJCV, 2015
• P. Viola and M. J. Jones. “Robust Real-Time Face Detection.” International Journal of Computer Vision, Vol. 57, 2004, p. 137
• P. Dollár, R. Appel, S. Belongie, and P. Perona. “Fast Feature Pyramids for Object Detection.” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 36, Issue 8, 2014, pp. 1532–1545
• P. Dollár, C. Wojek, B. Schiele, and P. Perona. “Pedestrian Detection: An Evaluation of the State of the Art.” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, Issue 4, 2012, pp. 743–761
• Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” NIPS 2015, pp. 91–99
• Joseph Redmon and Ali Farhadi. “YOLO9000: Better, Faster, Stronger.” Preprint arXiv:1612.08242, 2016
• Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. “You Only Look Once: Unified, Real-Time Object Detection.” CVPR 2016
• Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. “SSD: Single Shot MultiBox Detector.” ECCV 2016, preprint arXiv:1512.02325
• Junichi Nakamura (Editor). Image Sensors and Signal Processing for Digital Still Cameras (Optical Science and Engineering). Taylor & Francis Group. ISBN-10: 0849335450
• P. Korshunov and W. T. Ooi. “Video Quality for Face Detection, Recognition, and Tracking.” ACM Transactions on Multimedia Computing, Communications, and Applications
• ITU-T Recommendation P.912. “Subjective Video Quality Assessment Methods for Recognition Tasks.” 03/2016. https://www.itu.int/rec/T-REC-P.912

