"How Image Sensor and Video Compression Parameters Impact Vision Algorithms," a Presentation from Amazon Lab126

Copyright © 2017 1
How Image Sensor and Video
Compression Parameters Impact
Vision Algorithms
Ilya Brailovskiy, PhD, Principal CV Engineer @ Amazon Lab126
May 2017

The views expressed in the following slides do not represent the view of
Amazon.com. They are solely the opinions of the presenter, Ilya Brailovskiy,
based on trends and projections provided by the sources cited. No
representations are made regarding the completeness, timeliness,
suitability or validity of any information presented.
All images and representations are sourced from the open public domain; they
are used for the express purpose of research and education, not to promote
any business, products, services, use or individuals.
Disclaimer

• Algorithms reaching human-level accuracy
• Quick check against “real-life” footage
• What can we learn from this?
• Image sizes: some theory behind object detection
• Adding video compression
• Bitrate impact
• Resolution impact
• Conclusions
What is this about?

• Feature-based algorithms
• Histogram of gradients (HOG)
• Viola-Jones family
• ACF methods
• …
• Deep Learning algorithms
• Fast(er) RCNN
• Yolo
• SSD
• …
Human and Face Detection Task

Human vs algorithm object detection accuracy
ImageNet Classification top-5 error (%)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” CVPR 2016
Russakovsky et al., IJCV, 2015

Let's see!
YOLO (Darknet) against
awkward 1080p @ 30 fps
footage: recall error > 35%
(output resolution reduced to 360p)
“You Only Look Once: Unified, Real-Time Object Detection” Joseph Redmon,
Santosh Divvala, Ross Girshick, and Ali Farhadi, CVPR 2016
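
For reference, a minimal sketch of how a recall error like the one quoted above could be computed. It assumes "recall error" means 1 minus recall (the fraction of ground-truth objects the detector missed); the slide does not define the term, and the counts below are hypothetical, not the numbers behind the 35% figure.

```python
def recall_error(true_positives: int, false_negatives: int) -> float:
    """Fraction of ground-truth objects that were missed (assumed: 1 - recall)."""
    ground_truth = true_positives + false_negatives
    if ground_truth == 0:
        raise ValueError("no ground-truth objects to evaluate against")
    recall = true_positives / ground_truth
    return 1.0 - recall


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    error = recall_error(true_positives=60, false_negatives=40)
    print(f"recall error: {error:.0%}")
```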

• Improve algorithm(s)
• Design a better algorithm (for example by increasing modalities)
• Use better/more accurate training for your models
• Provide instant, frame-level feedback from the CV algorithm to the camera
algorithm
• Similar to face/human autofocus, embed your “road sign” detection
with auto-focus
• (Auto) Focus is just one example
• Auto Exposure and Auto White Balance
• Low light behaviors (noise, colorization, etc.)
What could be fixed?

Camera options: sensors and lens
Camera resolutions: 1 Mp, 2 Mp, 4 Mp, etc.
It translates to Horizontal (H) pixels: 1280, 1920, 3840, etc.
Typical field of view (FOV) options, in degrees:
                 “SmartPhone” FOV   Wide FOV   Ultra-Wide FOV
Horizontal       75                 110        130
Vertical         47                 78         101
Diagonal         83                 117        136
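
The diagonal values listed above can be cross-checked from the horizontal and vertical ones. The sketch below assumes an ideal rectilinear (pinhole) lens, which the slides do not state; the function name is illustrative. Under that assumption the three FOV options reproduce the listed diagonals to within a degree.

```python
import math


def diagonal_fov_deg(h_fov_deg: float, v_fov_deg: float) -> float:
    """Diagonal FOV of an ideal rectilinear lens from its horizontal and vertical FOV."""
    half_h = math.tan(math.radians(h_fov_deg) / 2.0)
    half_v = math.tan(math.radians(v_fov_deg) / 2.0)
    # The half-diagonal on the image plane is the hypotenuse of the two half-extents.
    return math.degrees(2.0 * math.atan(math.hypot(half_h, half_v)))


if __name__ == "__main__":
    for h_fov, v_fov in ((75, 47), (110, 78), (130, 101)):
        print(f"H {h_fov}, V {v_fov} -> diagonal {diagonal_fov_deg(h_fov, v_fov):.1f} deg")
```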

Pixels per meter for different FOV/lenses
$$2 \cdot \text{ObjectDistance} \cdot \tan\left(\frac{\text{FOV}}{2}\right) = \text{CameraWidthPixels} \cdot \frac{\text{ObjectWidth}}{\text{ObjectWidthPixels}}$$
Objects must exceed a certain
width and height in pixels to be
detected (the minimum is ~5-10x
smaller for deep learning than
for feature-based algorithms)
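
The relation above can be solved for the object width in pixels. The following is a minimal sketch, assuming an ideal rectilinear lens and an object facing the camera near the image center; the function names and the example numbers (a 0.5 m wide object at 5 m, a 1920-pixel-wide sensor) are illustrative and do not come from the slides.

```python
import math


def scene_width_m(distance_m: float, h_fov_deg: float) -> float:
    """Width of the scene covered by the camera at a given distance."""
    return 2.0 * distance_m * math.tan(math.radians(h_fov_deg) / 2.0)


def object_width_px(camera_width_px: int, h_fov_deg: float,
                    distance_m: float, object_width_m: float) -> float:
    """Object width in pixels, from the pixels-per-meter relation."""
    pixels_per_meter = camera_width_px / scene_width_m(distance_m, h_fov_deg)
    return pixels_per_meter * object_width_m


if __name__ == "__main__":
    # Hypothetical scenario: 0.5 m wide object, 5 m away, 1920 px wide image.
    for h_fov in (75, 110, 130):
        px = object_width_px(1920, h_fov, distance_m=5.0, object_width_m=0.5)
        print(f"H FOV {h_fov} deg: object spans about {px:.0f} px")
```

Comparing the result against the minimum size a given detector needs indicates whether a FOV and resolution choice leaves enough pixels on the object.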

• JPEG
• JPEG2000
• H.264
• SVC
• VP8/VP9
• HEVC
• What’s next? S-HEVC? AV1?
Let’s add compression

Example of bitrate dependency
“Video quality for face detection, recognition, and tracking”, P Korshunov, WT Ooi ACM
Transactions on Multimedia Computing, Communications, and Applications
Performance drops when the
bitrate falls below a threshold:
~500 Kbps in this Viola-Jones
example
(similar results for other
algorithms, feature-based or
DNN)
[Plot: detection performance vs. bitrate (x-axis); threshold marked at 500 Kbps]

Add scaling
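[Diagram: two encoding paths compared at the same bitrate B]
Path 1: Original video -> compression at bitrate B -> compressed video
Path 2: Original video -> downsample -> reduced-res video -> compression at bitrate B -> reduced-res compressed video -> upsample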
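
One rough way to reason about why scaling before compression can help is bits per pixel: at a fixed bitrate B, fewer pixels per frame leave more bits for each pixel. The sketch below is a back-of-the-envelope heuristic, not the analysis behind the slides; the resolutions, frame rate, and 2x downscale factor are assumptions chosen for illustration.

```python
def bits_per_pixel(bitrate_bps: float, width: int, height: int, fps: float) -> float:
    """Average number of encoded bits available per pixel per frame."""
    return bitrate_bps / (width * height * fps)


if __name__ == "__main__":
    bitrate_bps = 500_000  # bitrate B, here 500 Kbps
    full_res = bits_per_pixel(bitrate_bps, 1920, 1080, 30)
    half_res = bits_per_pixel(bitrate_bps, 960, 540, 30)  # assumed 2x downscale per axis
    print(f"full resolution: {full_res:.4f} bits/pixel")
    print(f"half resolution: {half_res:.4f} bits/pixel")
    print(f"ratio: {half_res / full_res:.1f}x")
```

Bits per pixel ignores content complexity and codec efficiency, so it only suggests a trend; measured detector accuracy remains the real test.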

Downscaled video performance
Performance also drops below a
bitrate threshold at lower
resolutions, but the threshold is
lower: ~250 Kbps in this example.
So as long as the object size stays
above the distance-curve
threshold, it is safe to downscale
(this can be made part of the
encoding decisions, as in
SVC/VP9/AV1)
[Plots: detection performance vs. bitrate (x-axis); thresholds marked at 500 Kbps (original resolution) and 250 Kbps (downscaled)]
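
The "safe to downscale" condition can be sketched as a simple check: after scaling, the object must still be at least as wide as the smallest size the detector handles. The function names and the 24-pixel minimum below are hypothetical; the real minimum depends on the detector and has to be measured.

```python
from typing import Optional


def is_downscale_safe(object_width_px: float, scale: float,
                      min_detectable_px: float) -> bool:
    """True if the object still meets the detector's minimum width after scaling."""
    if not 0.0 < scale <= 1.0:
        raise ValueError("scale must be in (0, 1]")
    return object_width_px * scale >= min_detectable_px


def smallest_safe_scale(object_width_px: float,
                        min_detectable_px: float) -> Optional[float]:
    """Strongest downscale that keeps the object detectable, or None if it already is not."""
    if object_width_px < min_detectable_px:
        return None
    return min_detectable_px / object_width_px


if __name__ == "__main__":
    # Hypothetical: object spans 80 px at full resolution, detector needs 24 px.
    print(is_downscale_safe(80, 0.5, 24))   # 40 px remain
    print(is_downscale_safe(80, 0.25, 24))  # 20 px remain
    print(smallest_safe_scale(80, 24))
```

An encoder-side decision, as in the scalable codecs mentioned above, could use a check of this kind to choose the lowest resolution that keeps the objects of interest above the detector's threshold.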

• ITU-T Recommendation P.912: “Subjective Video Quality Assessment Methods
for Recognition Tasks”
• Experiments to validate the analysis
• Test: 12 sequences, 3 of the 12 sequences are obvious
• Some screening was required to remove subjects:
• Not paying attention
• Not understanding the task
• Based on the study, recognition depends on light levels
• Sunlight: recognition is 38 times better than in the dark
• Objects further apart: recognition drops
• Still objects are recognized 2.7 times better than moving ones
What is human understanding, really?

Summary
• Improve your models
• Train your models for your camera
• Use DNN/CNN if it’s affordable
• Build a feedback loop from your CV to your camera
• Decide on FOV and resolution based on your object sizes
• Examine reduced resolution if you need to handle lower bitrates: in
many cases you might be able to get your performance back
• And be very patient with your humans and your embedded CV
algorithms ☺

Q & A
The end!

References
• Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” CVPR 2016
• Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge.” IJCV, 2015
• Viola, P. and Jones, M. J. International Journal of Computer Vision, Vol. 57, 2004, p. 137
• Dollar, P., R. Appel, S. Belongie, and P. Perona. “Fast feature pyramids for object detection.” Pattern Analysis and Machine Intelligence, IEEE Transactions. Vol. 36, Issue 8, 2014, pp. 1532–1545
• Dollar, P., C. Wojek, B. Schiele, and P. Perona. “Pedestrian detection: An evaluation of the state of the art.” Pattern Analysis and Machine Intelligence, IEEE Transactions. Vol. 34, Issue 4, 2012, pp. 743–761
• Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” NIPS 2015, pp. 91–99
• Joseph Redmon and Ali Farhadi. “YOLO9000: Better, Faster, Stronger.” Preprint arXiv:1612.08242, 2016
• Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. “You Only Look Once: Unified, Real-Time Object Detection.” CVPR 2016
• Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. “SSD: Single Shot MultiBox Detector.” ECCV 2016, preprint arXiv:1512.02325
• Junichi Nakamura (Editor). Image Sensors and Signal Processing for Digital Still Cameras (Optical Science and Engineering). Taylor & Francis Group, ISBN-10: 0849335450
• P. Korshunov and W. T. Ooi. “Video quality for face detection, recognition, and tracking.” ACM Transactions on Multimedia Computing, Communications, and Applications
• ITU-T Recommendation P.912: “Subjective video quality assessment methods for recognition tasks,” 03/2016. https://www.itu.int/rec/T-REC-P.912

