During this presentation we will look at different variants of SaaS architectures built on top of ML / computer vision: – the advantages and disadvantages of different service design patterns – modes of “serving” models (in most cases, TF) – the influence of the architecture and its implementation on product development. Bonus: does a data scientist need to know anything other than data science?
3. Main tasks in CV for surveillance
● Verification (1:1, border control)
● Age/gender/emotion recognition (sometimes other properties)
● Identification (1:N, surveillance)
● Events and action recognition
(On the slide the tasks are arranged on an interest vs. difficulty chart.)
ⒸDataI
4. Let's build one together
Definition of product (technical) success: our CV product has to answer these questions:
● Who?
● Where?
● When?
● What?
5. Why bother? Let's use an API (or SDK)
● Amazon Rekognition
● Face++
● Meerkat
● Azure Cognitive Services
● Google Cloud Vision (no face recognition)
● VeriLook SDK
● iFace SDK
● Cognitec FaceVACS SDK
● Luxand FaceSDK
● Affectiva SDK
● Betaface SDK
7. Module pipeline
(Pipeline diagram; roughly:)
● Face branch: face detector -> face alignment -> face selector -> age + gender models; FacePrint -> face search
● Body branch: person detector -> body keypoint classifier -> action recognition
● Tracker: combines detections into a TRACK with metadata -> decisions, BA material
9. Objects and attributes

Object | Static attributes               | Dynamic attributes
Body   | Body embedding                  | Location, actions
Head   | (none listed)                   | Count, head pose
Face   | Embedding/ID; age, gender, race | Emotions
12. Video Streaming (Edge)
OpenCV
+ very simple to use
- Python's GIL limits multithreading
NVIDIA DeepStream SDK
+ fast & flexible
- runs only on Jetson & Tesla hardware
(Diagram: ⓒ NVIDIA DeepStream SDK)
13. Video Streaming (Edge)
3. GStreamer
+ plugin-based architecture
+ easy/fast video recording (any supported format)
+ easy/fast overlay drawing (using Cairo)
+ wide variety of ready-to-use plugins
- hard to understand
- weak community support
- many plugins have internal bugs
ⒸDataI & Taras Lishchenko
18. Min. GStreamer Pipeline
Supported video sources
Live:
● web camera
● RTSP camera
Not live:
● video file (any supported format)
● multiple files: video01.mp4, video02.mp4, video03.mp4 ...
Minimal pipeline:
● acquire frames from the video source
● decode
● convert to RGB
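The minimal pipeline above maps directly onto a gst-launch-style description string. A small helper that builds one per source type (a sketch, not the talk's code; the element names are standard GStreamer 1.x plugins, the helper itself is illustrative):

```python
# Build gst-launch-1.0-style descriptions for the minimal pipeline:
# acquire frames -> decode -> convert to RGB -> hand frames to the app.
CONVERT_TAIL = "videoconvert ! video/x-raw,format=RGB ! appsink"

def build_pipeline(source: str, location: str = "") -> str:
    """Return a pipeline description string for a given source type."""
    if source == "webcam":
        # v4l2src already emits raw frames, so no decode step is needed.
        return f"v4l2src ! {CONVERT_TAIL}"
    if source == "rtsp":
        return f"rtspsrc location={location} ! decodebin ! {CONVERT_TAIL}"
    if source == "file":
        return f"filesrc location={location} ! decodebin ! {CONVERT_TAIL}"
    raise ValueError(f"unsupported source: {source}")

# Example; the result can be fed to Gst.parse_launch() or gst-launch-1.0:
print(build_pipeline("rtsp", "rtsp://cam.local/stream"))
```

The same string works for single files and for multiple files by instantiating one pipeline per file.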
23. Problems with GStreamer
- Buffer offset/timestamps
  - with live video the offset is constant; with non-live video it is in the range [0, maxint]
  - Solution: GstIdentity (force offset increment over [0, maxint])
  - timestamps use CLOCK_MONOTONIC (the project requires CLOCK_REALTIME to sync video with annotations)
  - Solution: store a map CLOCK_MONOTONIC -> CLOCK_REALTIME
- x264enc is too slow (requires a lot of CPU power)
  - Solution: use plugins (h264parse) for DVR without conversion from RGB (drawback: can't draw on a non-RGB buffer)
- Python has limitations compared to the C version:
  - passing objects from Python to C buffers (metadata)
  - Solution: DIY Python wrappers for the C libs that work with GStreamer objects
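The CLOCK_MONOTONIC -> CLOCK_REALTIME map can be as simple as one offset captured when the stream starts, since both clocks advance at the same rate (a sketch, not the talk's actual code; it assumes the wall clock is not stepped mid-stream):

```python
import time

class ClockMap:
    """Map CLOCK_MONOTONIC timestamps (as carried by GStreamer buffers)
    to CLOCK_REALTIME wall-clock time, so frames can be synced with
    externally produced annotations."""

    def __init__(self):
        # Sample both clocks at (almost) the same instant; the offset
        # stays valid as long as the wall clock is not adjusted.
        self.offset = (time.clock_gettime(time.CLOCK_REALTIME)
                       - time.clock_gettime(time.CLOCK_MONOTONIC))

    def to_realtime(self, monotonic_ts: float) -> float:
        """Convert a monotonic timestamp (seconds) to wall-clock time."""
        return monotonic_ts + self.offset
```

In practice the map would be refreshed (or stored per stream) if NTP steps the wall clock.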
26. Sync batch mode
+ guarantees buffer order
+ easy to sync annotations with frame data
- N waiting points
- drops buffers
- GStreamer should emit buffers with a small delay to reduce wait time in the batch collector
27. Async batch mode
+ no waiting points
+ no need for GStreamer to emit buffers faster
+ better GPU load
- order between streams is not guaranteed
- hard to sync frames with annotations
- additional complexity to handle queues
29. Evolution
1. Face detection (hard to track):
a. Dlib
i. bad for small faces (the image must be upscaled -> performance decreases)
ii. too slow (runs on CPU)
b. MTCNN
i. bad with many faces (performance degrades as the number of faces grows, due to the architecture)
2. Person detection:
a. Haar cascade
i. poor quality
ii. too slow (runs on CPU)
b. Blob person detector
i. not invariant to noisy images
ii. too slow due to background subtraction
c. TinyYOLOv2 (Darkflow)
d. MobileNet-SSD
30. CPU usage
                Min      Max      Mean
test_all_cpu    26.0656  28.6466  27.1232
test_on_7_cpu   26.9883  32.5846  28.8077
test_on_6_cpu   26.9680  31.8769  29.0395
test_on_5_cpu   27.0186  32.4739  30.0355
test_on_4_cpu   28.0154  37.4240  32.2915
test_on_3_cpu   34.6486  44.4604  37.7983
test_on_2_cpu   48.0222  60.0206  50.0859
test_on_1_cpu   83.4291  96.7086  88.9920
Model: BodyEmbeddings
CPU: i7-7700HQ @ 2.80GHz
TensorFlow settings (explanation):
● intra_op_parallelism_threads = [0, NUM_CORES] (0 is best)
● inter_op_parallelism_threads = [0, NUM_CORES] (0 is best)
Conclusion: some models can be executed in parallel without a huge performance loss, as long as each one gets no more than half of the total number of cores.
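One way to give a model process "half of the cores" is CPU affinity. A Linux-only sketch using the standard library (not the talk's code; the talk's measurements set TF thread counts instead, but affinity achieves a similar split between two parallel model processes):

```python
import os

def pin_to_half_of_cores() -> set:
    """Restrict the current process to the first half of the CPUs it is
    allowed to run on, leaving the other half free for a second model
    process running in parallel. Linux-only (os.sched_setaffinity)."""
    allowed = sorted(os.sched_getaffinity(0))
    half = set(allowed[: max(1, len(allowed) // 2)])
    os.sched_setaffinity(0, half)
    return half
```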
31. One Session vs Multiple Sessions
Model: Body Embeddings
Conclusion: performance benefits from a single graph in a single session only on the CPU, and even there the difference is not significant (~12%).
32. Resize Methods
Nearest-neighbour resize with different implementations
Conclusion: OpenCV is the fastest at resizing. Use the nearest-neighbour method to get maximum performance.
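Nearest-neighbour resizing is just index arithmetic, which is why it is the cheapest method. A NumPy-only sketch of the idea (illustrative; in production one would use OpenCV's `cv2.resize` with nearest-neighbour interpolation, per the conclusion above):

```python
import numpy as np

def resize_nearest(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour resize: every output pixel copies the closest
    source pixel, so there is no interpolation arithmetic at all."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source col for each output col
    return img[rows[:, None], cols]

frame = np.arange(12, dtype=np.uint8).reshape(3, 4)
small = resize_nearest(frame, 2, 2)   # downscale 3x4 -> 2x2
```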
33. np.stack instead of np.concatenate (batch)
Batch size: 20
Image size: 640x360x3
Conclusion: when collecting images into a batch:
- append them to a list
- call np.stack(list)
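The point is that np.stack adds the batch dimension for you, whereas np.concatenate would require every frame to carry an extra leading axis first. A minimal sketch of the recommended pattern:

```python
import numpy as np

# Collect frames into a plain list, then build the batch in one call.
frames = [np.zeros((360, 640, 3), dtype=np.uint8) for _ in range(20)]

batch = np.stack(frames)  # shape: (20, 360, 640, 3)

# The np.concatenate equivalent needs each frame reshaped to (1, H, W, C)
# first, costing an extra per-frame operation:
batch2 = np.concatenate([f[None] for f in frames])
```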
41. Closed-set vs open-set search
Closed-set evaluation:
● cumulative match characteristic (CMC) curves
● receiver operating characteristic (ROC) curves
Open-set evaluation:
● detection and identification rate (DIR) curves (TPIR, FPIR, FNIR, ...)
42. Open-set video evaluation
Gallery: a set of images of interest.
Probes: a set of images used for querying.
In our case a probe might be:
● the best-quality face image among all images within the same person track
● any face image from the video
ⒸDataI & Vlad Khizanov
43. Metrics for Open-Set Evaluation
Definition: a query succeeds if its top result has similarity greater than t.
FPIR(t) = # of successful non-mate search queries / # of queries
TPIR(t) = # of successful mate search queries / # of queries
MISS(t) = # of unsuccessful search queries / # of queries
FPIR(t) + TPIR(t) + MISS(t) = 1
Note: sometimes FPIR* = 1 - FPIR
Note: here TPIR(t) = TPIR(t, 1), where 1 is the rank
Mate searches are those for which the person in the search image has a face image in the enrolled dataset.
Non-mate searches are those for which the person in the search image does not have a face image in the enrolled dataset.
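These three rates fall out directly from per-query records. A small sketch (the record layout is illustrative, not from the talk):

```python
def open_set_metrics(queries, t):
    """Each query is (top_similarity, is_mate). A query 'succeeds' when
    its top result's similarity exceeds threshold t; successes split into
    TPIR (mate queries) and FPIR (non-mate queries), the rest are MISS."""
    n = len(queries)
    tpir = sum(1 for sim, mate in queries if sim > t and mate) / n
    fpir = sum(1 for sim, mate in queries if sim > t and not mate) / n
    miss = sum(1 for sim, _ in queries if sim <= t) / n
    return fpir, tpir, miss

queries = [(0.9, True), (0.8, False), (0.3, True), (0.2, False)]
fpir, tpir, miss = open_set_metrics(queries, t=0.5)
```

Sweeping t over [0, 1] and plotting (FPIR(t), TPIR(t)) yields the parametric curve described on the next slide.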
44. Metrics on a chart
Usually the metrics are visualized as a curve in parametric form:
x(t) = FPIR(t)
y(t) = TPIR(t)
t = 0.0, 0.01, 0.02, ..., 1.0
Note: in practice it is useful to pick an optimal threshold, e.g. FPIR at a fixed FNIR (where FNIR = 1 - TPIR).
45. Extreme Value Machine
Given the conditions for the Margin Distribution Theorem, the probability that x' is included in the boundary estimated by xi is given by:

Ψ(xi, x'; κi, λi) = exp(−(‖xi − x'‖ / λi)^κi)    (1)

where ‖xi − x'‖ is the distance of x' from sample xi, and κi, λi are the Weibull shape and scale parameters, respectively, obtained from fitting to the smallest pairwise margin estimates.
ⒸDataI & Oleksiy Udod
https://arxiv.org/pdf/1506.06112.pdf
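Equation (1) is a Weibull survival-style expression and takes only a couple of lines of code (a sketch; symbols follow the slide):

```python
import math

def evm_psi(dist: float, kappa: float, lam: float) -> float:
    """EVM inclusion probability, Eq. (1):
    psi = exp(-(dist / lambda)^kappa),
    where dist = ||x_i - x'|| and kappa, lam are the Weibull shape and
    scale parameters fitted for sample x_i."""
    return math.exp(-((dist / lam) ** kappa))
```

At zero distance the probability is 1, and it decays monotonically as the probe moves away from xi.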
48. Serving with a RESTful API
(Architecture diagram:) each market server (1..N) runs a local Kafka broker and a MirrorMaker that replicates into a central cluster on AWS (Kafka brokers 1-3, N partitions each). Consumers A/B/C in an auto-scaling group read from the cluster, call the models (Model1, Model2..n) over a RESTful API, and write results to storage.
ⒸDataI & Konstantin Bulgakov
49. Serving in Kafka Streams
(Architecture diagram:) the same layout, but the models run inside the consumers: market servers 1..N mirror their local Kafka brokers into the central 3-broker AWS cluster, and consumers A/B/C (with TF loaded in-process, in an auto-scaling group) consume the partitions directly; there is no separate RESTful model service.
50. Message size distribution
The size distribution for 14K records was measured on the producer side by counting the string length of every message (1 string element ~ at least 1 byte).
ⒸDataI & Olesia Stestsiuk
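Measuring the distribution on the producer side amounts to recording len(message) per serialized record and summarizing. A stdlib-only sketch (the record layout is illustrative, not the talk's schema):

```python
import json
import statistics

def message_size_stats(records):
    """Serialize each record the way the producer would and summarize
    the string lengths (>= byte count for ASCII-dominated payloads)."""
    sizes = [len(json.dumps(r)) for r in records]
    return {
        "min": min(sizes),
        "max": max(sizes),
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
    }

# Hypothetical track-annotation messages, just for illustration:
records = [{"track_id": i, "embedding": [0.0] * 8} for i in range(100)]
stats = message_size_stats(records)
```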
52. Pyflame
● based on the Linux ptrace(2) system call, not sys.settrace()
● no modification of the source code required
● can profile embedded Python interpreters like uWSGI
● can profile multi-threaded Python programs
● written in C++, with attention to speed and performance
● usually introduces less overhead than the built-in profile (or cProfile) modules, and emits richer profiling data
Just:
sudo pyflame -s 600 -r 0.001 --threads -p 1493 | ./flamegraph.pl > 10_min_every_milisec.svg
http://eng.uber.com/pyflame/
53. How to read Flame Graphs
● Each box represents a function in the stack (a "stack frame").
● The y-axis shows stack depth (number of frames on the stack). The top box shows the function
that was on-CPU. Everything beneath that is ancestry. The function beneath a function is its
parent, just like the stack traces shown earlier.
● The x-axis spans the sample population. It does not show the passing of time from left to right, as
most graphs do. The left to right ordering has no meaning (it's sorted alphabetically to maximize
frame merging).
● The width of the box shows the total time it was on-CPU or part of an ancestry that was on-CPU
(based on sample count). Functions with wide boxes may consume more CPU per execution than
those with narrow boxes, or, they may simply be called more often. The call count is not shown
(or known via sampling).
● The sample count can exceed elapsed time if multiple threads were running and sampled
concurrently.