Seeing is Perceiving ... Unless You're a Machine

Scott Thibault, PhD

“Data Science” Applications
• Marketing
  • Social media profiling
  • Media coverage
  • In-store behavior
• Digital asset management
  • OCR
  • Keyword tagging
  • Face tagging
• Remote sensing
  • DeepSolar
  • Plant health
  • Military
• Internet of things
  • Surveillance
  • Manufacturing
  • Smart homes

What is a Digital Image? Video?
Abstractly
• 2-d array of pixels
• Pixels
  • Single number (grayscale)
  • Triplet (e.g. RGB)
  • N values
• Channels
Concretely
• File representations
  • PNG, JPG, etc.
  • MP4, MPEG, etc.
• Memory representations
  • Row-major
  • Column-major
  • Integer, float, etc.
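
The abstract and concrete views above can be tied together with a small NumPy sketch. The pixel values and array names here are illustrative assumptions, not taken from the slides; the point is only to show a grayscale array, an RGB array with a channel axis, the two memory orders, and an integer-to-float conversion.

```python
import numpy as np

# A 2x3 grayscale image: one number per pixel, stored as 8-bit integers.
gray = np.array([[0, 128, 255],
                 [64, 192, 32]], dtype=np.uint8)

# A 2x3 RGB image: a triplet per pixel, so the array gains a channel axis.
rgb = np.zeros((2, 3, 3), dtype=np.uint8)
rgb[0, 1] = (255, 0, 0)  # pixel at row 0, column 1 (channel order assumed R, G, B)

# The same pixels read out in the two common memory orders.
row_major = gray.ravel(order="C").tolist()  # row by row
col_major = gray.ravel(order="F").tolist()  # column by column

# Integer pixels converted to floats in [0, 1], a common form for model input.
gray_float = gray.astype(np.float32) / 255.0

print(gray.shape, rgb.shape)
print(row_major)
print(col_major)
```

Note that OpenCV loads color images in B, G, R channel order, so the channel meaning depends on the library in use.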

Image Filtering
Input Image
0 0 0 9 37 82 83 86 173 173
0 1 9 38 89 170 171 172 211 212
0 13 37 90 171 212 210 211 197 199
2 23 85 170 210 198 195 196 133 127
11 41 89 214 197 133 128 125 117 111
30 88 173 213 195 122 111 112 103 100
81 172 214 195 134 115 102 101 100 100
80 176 212 199 123 112 100 100 100 100
Output Image (partially filled)
0 0 4 19 45 60 88 127 129 173
0 4 19 49 104 130 171 191 192 211
6 18 51 104 151
out = f(neighborhood)
for any function f
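
The rule out = f(neighborhood) can be sketched as a small Python function. This is a minimal illustration, not the presenter's code: the name `filter_image`, the 3x3 window size, and the choice to repeat edge pixels at the border are all assumptions. The test image is the top-left corner of the input grid on the slide.

```python
import numpy as np

def filter_image(image, f, size=3):
    """Apply f to every size x size neighborhood of a 2-D image."""
    pad = size // 2
    # Assumption: repeat edge pixels so border neighborhoods are full-sized.
    padded = np.pad(image, pad, mode="edge")
    rows, cols = image.shape
    out = np.empty((rows, cols), dtype=np.float64)
    for r in range(rows):
        for c in range(cols):
            neighborhood = padded[r:r + size, c:c + size]
            out[r, c] = f(neighborhood)
    return out

# Top-left 3x4 corner of the input image from the slide.
image = np.array([[0, 0, 0, 9],
                  [0, 1, 9, 38],
                  [0, 13, 37, 90]])

median_filtered = filter_image(image, np.median)  # f = median
max_filtered = filter_image(image, np.max)        # f = maximum

print(median_filtered)
print(max_filtered)
```

Because f is passed in as an argument, the same loop covers median, maximum, mean, or any other function of the neighborhood.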

Convolution Filter
Input Image
0 0 0 9 37 82 83 86 173 173
0 1 9 38 89 170 171 172 211 212
0 13 37 90 171 212 210 211 197 199
2 23 85 170 210 198 195 196 133 127
11 41 89 214 197 133 128 125 117 111
30 88 173 213 195 122 111 112 103 100
81 172 214 195 134 115 102 101 100 100
80 176 212 199 123 112 100 100 100 100
Neighborhood
38 89 170
90 171 212
170 210 198
Kernel
1 2 1
2 4 2
1 2 1
Neighborhood × Kernel (element-wise)
38 178 170
180 684 424
170 420 198
∑ = 2462
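
The arithmetic on this slide can be reproduced with NumPy. The sketch below uses the image, neighborhood, and kernel values shown above; the final division by the kernel's total weight (16) is an assumption about how the sum would be scaled back into pixel range, since the slide stops at the sum.

```python
import numpy as np

image = np.array([
    [0, 0, 0, 9, 37, 82, 83, 86, 173, 173],
    [0, 1, 9, 38, 89, 170, 171, 172, 211, 212],
    [0, 13, 37, 90, 171, 212, 210, 211, 197, 199],
    [2, 23, 85, 170, 210, 198, 195, 196, 133, 127],
    [11, 41, 89, 214, 197, 133, 128, 125, 117, 111],
    [30, 88, 173, 213, 195, 122, 111, 112, 103, 100],
    [81, 172, 214, 195, 134, 115, 102, 101, 100, 100],
    [80, 176, 212, 199, 123, 112, 100, 100, 100, 100],
])

kernel = np.array([[1, 2, 1],
                   [2, 4, 2],
                   [1, 2, 1]])

# The 3x3 neighborhood from the slide: rows 2-4, columns 4-6 (counting from 1).
neighborhood = image[1:4, 3:6]

# Multiply element-wise by the kernel, then add everything up.
products = neighborhood * kernel
total = int(products.sum())

# Assumption: normalize by the kernel's total weight to stay in pixel range.
normalized = total / int(kernel.sum())

print(products)
print(total, normalized)
```

The kernel is symmetric, so flipping it (as a strict convolution would) gives the same result as this direct multiply-and-sum.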

Convolutional Neural Networks
Input
Expected Output

Deep Learning
Model training
• Learn weights (e.g. convolution coefficients)
• Thousands/millions of examples
• Backpropagation
• Hours to days
Inference
• Model file + weight file
• Few MB to hundreds of MB
• Milliseconds to seconds
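
To make "learn weights" concrete, here is a toy sketch that recovers a 3x3 kernel from input/expected-output pairs by gradient descent. It is a single linear layer on random data, so it only hints at what backpropagation does in a deep network; the data, learning rate, epoch count, and function names are all assumptions chosen for illustration.

```python
import numpy as np

def apply_kernel(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    k = kernel.shape[0]
    rows = image.shape[0] - k + 1
    cols = image.shape[1] - k + 1
    out = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            out[r, c] = np.sum(image[r:r + k, c:c + k] * kernel)
    return out

rng = np.random.default_rng(0)

# The "unknown" kernel that generated the expected outputs.
true_kernel = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16.0

# Training examples: random 8x8 inputs and their expected 6x6 outputs.
images = rng.random((20, 8, 8))
targets = [apply_kernel(img, true_kernel) for img in images]

weights = np.zeros((3, 3))   # start with no knowledge of the kernel
learning_rate = 0.1          # hypothetical value, chosen for this toy problem

for epoch in range(200):
    for img, target in zip(images, targets):
        error = apply_kernel(img, weights) - target
        n = error.shape[0]
        # Gradient of the mean squared error with respect to each weight.
        grad = np.empty((3, 3))
        for i in range(3):
            for j in range(3):
                grad[i, j] = 2.0 * np.mean(error * img[i:i + n, j:j + n])
        weights -= learning_rate * grad

print(np.round(weights * 16, 3))
```

Real training differs mainly in scale: many layers, far more examples, and a framework computing the gradients, which is where the hours-to-days cost comes from.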

Image Classification
Age Classification
Class    Probability
0-2      0.000011
4-6      0.000008
8-13     0.000187
15-20    0.001444
25-32    0.313656
38-43    0.683580
48-53    0.001012
60+      0.000101
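
A classifier's output is typically read by picking the most probable class. The sketch below uses the age-class probabilities from the table above; the variable names are illustrative, and reporting the top two classes is an added assumption about how one might use the scores.

```python
import numpy as np

classes = ["0-2", "4-6", "8-13", "15-20", "25-32", "38-43", "48-53", "60+"]
probabilities = np.array([0.000011, 0.000008, 0.000187, 0.001444,
                          0.313656, 0.683580, 0.001012, 0.000101])

# Indices sorted from most to least probable.
order = np.argsort(probabilities)[::-1]

predicted = classes[order[0]]
top2 = [classes[i] for i in order[:2]]
total = float(probabilities.sum())

print(predicted, top2, total)
```

Here almost all of the probability mass sits on two neighboring age ranges, which is worth surfacing instead of reporting only the single winner.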

Image Classification
• ImageNet – 1000 classes
• COCO – 80 classes
• Pascal VOC – 20 classes (person, animals, vehicles, indoor objects)
• Places365
• Demographics (gender/age/race)
• Emotions (facial expression)
• Activities, e.g. sports
• Plants
• Medical diagnosis
• Marketing profiles
• Security camera event capture
• Quality control
• Policy verification
• Smart home

Object Detection
• Text, Signs
• Cars, Bikes, Trains, …
• Pedestrians, Faces, Hands
• License plates
• Logos
YOLOv3 / COCO

Video Object Tracking
• Smart cameras
• Face tracking (e.g. auto)
• Surveillance event detection
• Pedestrian/vehicle counting
• Drones

People Counting Demo – Take I

Locality Sensitive Hashing
• High-dimensional data to low-dimensional vector
• Euclidean distance ~ similarity
• OpenFace
  • Deep Neural Network
  • Face → 128-d vector
  • Compare to a signature
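
Comparing a face vector to a signature comes down to a distance and a threshold. The sketch below uses random 128-d vectors as stand-ins for real OpenFace output, and the 0.6 threshold is a hypothetical value, not one given in the talk.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(a - b))

def random_unit_vector(rng, dim=128):
    """Stand-in for a face embedding: a random direction of length 1."""
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)

signature = random_unit_vector(rng)                        # known person
same_person = signature + 0.01 * rng.standard_normal(128)  # new photo, small change
other_person = random_unit_vector(rng)                     # unrelated face

THRESHOLD = 0.6  # hypothetical cut-off; a real system would tune this

d_same = euclidean_distance(signature, same_person)
d_other = euclidean_distance(signature, other_person)

print(d_same < THRESHOLD, d_other < THRESHOLD)
```

A small distance is read as "same face" and a large one as "different face", which is the sense of "Euclidean distance ~ similarity" above.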

Clustering
• Machine learning algorithm
• Create groups of similar objects
• Need similarity function

Clustering
• OpenFace + Clustering = Deduplication
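
Deduplication can be sketched by grouping face vectors that lie close together and counting the groups. The greedy threshold grouping below is a simple stand-in for a real clustering algorithm, not the method used in the demo; the random 128-d vectors, the 0.6 threshold, and the function name are assumptions.

```python
import numpy as np

def group_by_distance(embeddings, threshold):
    """Assign each vector to the first group whose representative is close enough."""
    representatives = []
    labels = []
    for vec in embeddings:
        for idx, rep in enumerate(representatives):
            if np.linalg.norm(vec - rep) < threshold:
                labels.append(idx)
                break
        else:
            # No existing group is close enough: start a new one.
            representatives.append(vec)
            labels.append(len(representatives) - 1)
    return labels

rng = np.random.default_rng(1)

# Three hypothetical people, each a random unit vector in 128 dimensions.
people = rng.standard_normal((3, 128))
people /= np.linalg.norm(people, axis=1, keepdims=True)

# Six detections: the same people seen repeatedly, with small variations.
seen = [0, 1, 0, 2, 1, 0]
detections = [people[p] + 0.01 * rng.standard_normal(128) for p in seen]

labels = group_by_distance(detections, threshold=0.6)
unique_people = len(set(labels))

print(labels, unique_people)
```

Counting groups instead of detections is what keeps a person who appears in many frames from being counted many times.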

People Counting Demo – Take II

Programming: OpenCV
Many neural network frameworks
• TensorFlow
• Caffe
• CNTK
• Torch
• MXNet
OpenCV support
• Caffe
• TensorFlow
• Torch
• Darknet

Programming: Python + OpenCV
import numpy as np
import cv2

# load pre-trained model
print("loading model...")
model = "res10_300x300_ssd.prototxt"
weights = "res10_300x300_ssd.caffemodel"
net = cv2.dnn.readNetFromCaffe(model, weights)

# load image and construct an input array
# by resizing to a fixed 300x300 pixels
# and then normalizing
image = cv2.imread("Katherine.jpg")
(h, w) = image.shape[:2]
resized = cv2.resize(image, (300, 300))
blob = cv2.dnn.blobFromImage(resized, 1.0,
                             (300, 300), (104.0, 177.0, 123.0))

print("running model...")
net.setInput(blob)
detections = net.forward()

for i in range(0, detections.shape[2]):
    # confidence score for this box
    confidence = detections[0, 0, i, 2]
    if confidence > 0.5:
        # get coordinates of bounding box, scaled to the original image size
        box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
        print(confidence, box)
