1. The speaker will demonstrate object detection on Android using TensorFlow and the SSD model.
2. SSD is well-suited for mobile as it is faster than other models like Faster R-CNN while maintaining reasonable accuracy.
3. The example will involve gathering image data, labeling objects, training an SSD model in TensorFlow, and integrating it into an Android app for real-time clothes detection on mobile.
DroidCon Cluj 2018 - Hands on machine learning on android
3. ATLANTA | AUSTIN | PHILADELPHIA | BENTONVILLE | ROMANIA | INDIA | AUSTRALIA | BRAZIL | NEPAL | CANADA www.softvision.com
Machine Learning
Speaker:
ANCA CIURTE - AI Team Lead at Softvision
Outline
● Why machine learning on Android?
● Mostly:
○ Some insights about Object Detection algorithms
○ Practical example in Tensorflow
○ Data gathering and labeling
○ Model training
● Hopefully:
○ It will inspire you to dig deeper
○ It won’t confuse you too much :)
Machine learning
Why machine learning on Android?
● Object detection
○ Is a very common Computer Vision problem
○ Identifies the objects in the image and
provides their precise location
● Why is it useful?
○ Street View (e.g. face blurring)
○ Self-driving cars (e.g. pedestrian detection), etc.
● Object detection: impact of deep learning
○ Deep convnets significantly improved both accuracy and processing time
● Why on Android?
○ We live in an era in which mobile has taken over
○ Running on mobile makes it possible to deliver interactive, real-time applications
○ The latest phones have great computing power
Machine learning
Some insights about Object Detection
Intuition about the convolution
[Figure: Input image * Convolution kernel (weights) = output. Alternative names shown in the figure: convolution layer, feature map, network’s parameters.]
[Figure: another way to understand the convolution operation]
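As a rough illustration of the operation in the figure, here is a minimal pure-Python sketch of sliding a kernel over an image to produce a feature map. The toy image and the hand-picked edge kernel are assumptions for the example; in a convnet the kernel weights are learned rather than chosen by hand.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    As in most deep learning frameworks, the kernel is not flipped
    (strictly speaking this is cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the image patch currently under the kernel.
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Toy 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked 2x2 kernel that responds to left-to-right intensity increases.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only where the kernel straddles the edge, which is the sense in which a kernel "detects" a feature.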
Image classification with convnets
● Dataset
○ e.g. CIFAR-10 dataset:
■ consists of 60000 32x32 colour images in 10 classes, with 6000 images per class.
■ There are 50000 training images and 10000 test images.
● Training phase
○ e.g. VGG16 network
○ input: labeled images (x, y)
○ Forward propagation: given the weights w_l, compute predictions
○ Loss function: compares the predictions with the labels y
○ Backward propagation: compute w_{l+1} by minimizing the loss
○ Repeat until convergence => w*
● Testing phase
○ Use the trained model to classify new instances
○ Output: predicted class
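The forward / loss / backward cycle of the training phase can be sketched on a much smaller model than VGG16. The example below is only an illustration under stated assumptions: a one-dimensional logistic classifier, a toy dataset, a cross-entropy loss, and a fixed number of steps standing in for "repeat until convergence".

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, learning_rate=0.5, steps=500):
    """Fit p(y=1|x) = sigmoid(w*x + b) by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in samples:
            p = sigmoid(w * x + b)      # forward propagation: prediction
            # Gradient of the cross-entropy loss -[y log p + (1 - y) log(1 - p)]
            grad_w += (p - y) * x / n   # backward propagation
            grad_b += (p - y) / n
        w -= learning_rate * grad_w     # weight update: w_l -> w_{l+1}
        b -= learning_rate * grad_b
    return w, b


def predict(w, b, x):
    """Testing phase: classify a new instance with the trained weights."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Toy labeled data (x, y): negative inputs are class 0, positive are class 1.
samples = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = train(samples)
predictions = [predict(w, b, x) for x, _ in samples]
print(predictions)  # [0, 0, 1, 1]
```

A real convnet replaces the single weight with millions of parameters and computes the gradients by backpropagation through all layers, but the loop has the same shape.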
Relation between classification and object detection
● We have an accurate way of classifying images
○ e.g.: does this image contain a pedestrian?
● But how can we say WHERE this pedestrian is?
Solution:
● Sliding window
○ strategy:
■ split the image into fragments and classify them independently
○ challenges:
■ how to deal with various object sizes, various aspect ratios, object overlap, or multiple responses
○ problem: the CNN must be applied to a huge number of locations and scales, which is very computationally expensive
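The sliding window strategy can be sketched as follows. The window sizes, the stride, and the `toy_classifier` stand-in for a CNN are all hypothetical choices made for the illustration.

```python
def sliding_windows(image_w, image_h, window_sizes, stride):
    """Yield (x, y, w, h) fragments covering the image for every window size."""
    for win_w, win_h in window_sizes:
        for y in range(0, image_h - win_h + 1, stride):
            for x in range(0, image_w - win_w + 1, stride):
                yield (x, y, win_w, win_h)


def detect(image_w, image_h, classify, window_sizes, stride):
    """Classify every fragment independently and keep the positive ones."""
    return [box
            for box in sliding_windows(image_w, image_h, window_sizes, stride)
            if classify(box)]


def toy_classifier(box):
    # Hypothetical stand-in for a CNN: "recognizes" a pedestrian
    # only in the fragment at (40, 20) of size 20x40.
    return box == (40, 20, 20, 40)


sizes = [(20, 40), (40, 80)]  # two scales of a tall, pedestrian-like window
windows = list(sliding_windows(100, 100, sizes, stride=10))
hits = detect(100, 100, toy_classifier, sizes, stride=10)
print(len(windows), hits)  # 84 [(40, 20, 20, 40)]
```

Even this tiny 100x100 example with two window sizes and a coarse stride needs 84 classifier calls, which hints at why the approach becomes expensive with a CNN on real images.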
R-CNN (Region-based convolutional neural network)
Two steps:
● Select object proposals: Selective Search algorithm
○ its precision is too low for it to be used as an object detector on its own, but it works fine as a first step in the detection pipeline
● Apply a strong CNN classifier to the selected proposals
It outperformed all previous object detection algorithms
Limitations:
● Depends on an external algorithm for the region hypotheses
● Needs to rescale object proposals to a fixed resolution
● Redundant computation: features are computed independently for every proposal, even for overlapping proposal regions
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014
Fast R-CNN
From R-CNN to Fast R-CNN:
● input: image + region proposals
● region of interest (RoI) pooling on the “conv5” feature map for feature extraction
● softmax classifier instead of SVM classifier
● End-to-end multi-task training:
○ the last FC layer branches into two sibling output layers:
■ one that produces softmax probability estimates over K object classes
■ another that outputs the bounding box coordinates for each object
Advantages:
● Higher detection quality (mAP) than R-CNN
● Training is single-stage
● Training can update all network layers at once
● No disk storage is required for feature caching
Girshick, “Fast R-CNN”, ICCV 2015
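To make the region pooling step more concrete, here is a minimal sketch of max pooling a region of a feature map into a fixed-size grid. The integer bin boundaries and the toy feature map are assumptions for the illustration, not the exact scheme of the paper's implementation.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (x0, y0, x1, y1), end-exclusive, into out_h x out_w."""
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Integer bin boundaries; every bin covers at least one cell.
        ys = y0 + (i * roi_h) // out_h
        ye = max(ys + 1, y0 + ((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * roi_w) // out_w
            xe = max(xs + 1, x0 + ((j + 1) * roi_w) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy 4x6 feature map (a real conv5 map has many channels; one is shown here).
fm = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
square = roi_max_pool(fm, (0, 0, 4, 4), 2, 2)  # 4x4 region
wide = roi_max_pool(fm, (1, 0, 6, 3), 2, 2)    # 5x3 region
print(square, wide)  # [[8, 10], [20, 22]] [[3, 6], [15, 18]]
```

Both proposals end up as 2x2 outputs despite their different shapes, which is what lets the following fully connected layers accept proposals of any size.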
Faster R-CNN
Faster R-CNN = Fast R-CNN + RPN (Region Proposal Network)
● RPN
○ removes the dependency on an external ROI generation method
○ is a convolutional network trained end-to-end
○ generates a list of high-quality region proposals (bbox coordinates + objectness scores)
● Then RPN + Fast R-CNN are merged into a single network by sharing their convolutional features
○ predicts the class of the objects + a refined bbox position
○ shared convolutional features enable nearly cost-free region proposals
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
SSD (Single Shot Detector)
● Extra feature layers
○ additional convolutional feature layers of different sizes are placed at the end of the base net
○ each added feature layer produces a set of detection predictions, allowing predictions at multiple scales
○ this design leads to simple end-to-end training
● ROI proposals
○ the output space of region proposals contains a fixed set of default boxes over different aspect ratios and scales per feature map location
○ for each default bounding box, predict:
■ the shape offsets Δ(cx, cy, w, h) and
■ the confidences for all object categories (c1, …, cp)
● Non-maximum suppression
[Figure: default boxes on the 8x8 and 4x4 feature maps]
Wei Liu et al., SSD: Single Shot MultiBox Detector, ECCV 2016
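The default boxes and the predicted offsets can be sketched as below. The scale values and aspect ratios are assumptions chosen for the illustration, and the decoding follows the offset parameterization of the SSD paper without the extra variance scaling that many implementations add.

```python
import math


def default_boxes(fmap_size, scale, aspect_ratios):
    """Return (cx, cy, w, h) default boxes in relative [0, 1] image coordinates."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # One set of boxes centred on every feature map location.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes


def decode(default_box, offsets):
    """Apply predicted offsets (d_cx, d_cy, d_w, d_h) to a default box."""
    cx, cy, w, h = default_box
    t_cx, t_cy, t_w, t_h = offsets
    return (cx + t_cx * w, cy + t_cy * h, w * math.exp(t_w), h * math.exp(t_h))


# Finer map with smaller boxes, coarser map with larger boxes (assumed scales).
boxes_8x8 = default_boxes(8, scale=0.2, aspect_ratios=[1.0, 2.0, 0.5])
boxes_4x4 = default_boxes(4, scale=0.4, aspect_ratios=[1.0, 2.0, 0.5])
print(len(boxes_8x8), len(boxes_4x4))  # 192 48

# Zero offsets leave the default box unchanged.
print(decode(boxes_4x4[0], (0.0, 0.0, 0.0, 0.0)))  # (0.125, 0.125, 0.4, 0.4)
```

Because the set of boxes is fixed in advance, the network only has to output offsets and class confidences per box, with no separate proposal stage.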
Compare modern convolutional object detectors
Speed/accuracy trade-offs
Lots of variables to set up ...
● base net:
○ VGG16
○ ResNet101
○ InceptionV2
○ InceptionV3
○ ResNet
○ MobileNet
● Object detection architecture:
○ R-CNN
○ Fast R-CNN
○ Faster R-CNN
○ SSD
● Input image resolution
● Number of region proposals
● Frozen weights - for fine-tuning
Takeaways:
● Faster R-CNN is slower but more accurate
● SSD is much faster but not as accurate (and is therefore a good choice for mobile apps)
Jonathan Huang et al., Speed/accuracy trade-offs for modern convolutional object detectors, CVPR 2017
Coding time
Problem to solve:
- a mobile app for real-time clothes detection
- class categories: Top, Pants, Shorts, Skirt and Dress
Frameworks:
● TensorFlow Object Detection API
- made by Google
- an open source framework built on top of TensorFlow that makes it easy to construct, train and deploy object detection models
- input: images + labels
- output: inference graph (.pb format)
● LabelImg
- an open source graphical image annotation tool
- annotations are saved as XML files in PASCAL VOC format, the format used by the ImageNet dataset
Coding time: step by step
● Create dataset and split it into: train (70%) and test (30%) folders
● Label images with LabelImg tool (output: .xml files for each image in dataset)
● Convert .xml to .csv (use dataset/xml_to_csv.py script; output: train.csv, test.csv)
● Convert to TFRecord format
○ set paths (from ../models/research):
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/object_detection
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
○ edit generate_tfrecord.py file and change the label map + path to the train/test folder:
○ finally execute the generate_tfrecord.py script in Terminal:
python generate_tfrecord.py --csv_input=data/train_labels.csv --output_path=data/train.record
python generate_tfrecord.py --csv_input=data/test_labels.csv --output_path=data/test.record
○ output: train.record, test.record
● Training
○ create a label map: label_map.pbtxt
○ optional, but recommended :), choose a pretrained model from here
○ prepare the .config file: .../models/research/object_detection/samples/configs/ssd_mobilenet_v2_coco.config
○ run training script (from ../models/research/object_detection):
python legacy/train.py --logtostderr --train_dir=training/ --pipeline_config_path=ssd_mobilenet_v1_pets.config
● Export inference graph:
python export_inference_graph.py --input_type image_tensor --pipeline_config_path pipeline.config --trained_checkpoint_prefix=training/model.ckpt-10750 --output_directory=inference_graph
○ output: the model in .pb format
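The slides do not show the contents of dataset/xml_to_csv.py, so the following is only a sketch of what the XML-to-CSV step could look like, using the Python standard library. The column names and the sample annotation are assumptions; the XML fields are the standard PASCAL VOC ones that LabelImg writes.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column layout; adjust it to whatever generate_tfrecord.py expects.
CSV_COLUMNS = ["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"]


def voc_xml_to_rows(xml_text):
    """Turn one PASCAL VOC annotation into one row per labeled object."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        rows.append([
            filename, width, height, obj.findtext("name"),
            int(box.findtext("xmin")), int(box.findtext("ymin")),
            int(box.findtext("xmax")), int(box.findtext("ymax")),
        ])
    return rows


def rows_to_csv(rows):
    """Serialize the rows, with a header line, to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(CSV_COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical annotation for one image with two labeled clothes items.
SAMPLE_XML = """<annotation>
  <filename>img_001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>Dress</name>
    <bndbox><xmin>120</xmin><ymin>60</ymin><xmax>320</xmax><ymax>440</ymax></bndbox>
  </object>
  <object>
    <name>Top</name>
    <bndbox><xmin>400</xmin><ymin>80</ymin><xmax>560</xmax><ymax>240</ymax></bndbox>
  </object>
</annotation>"""

rows = voc_xml_to_rows(SAMPLE_XML)
print(rows_to_csv(rows))
```

In practice the function would be applied to every .xml file in the train and test folders, writing one CSV per folder.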
e-mail: anca.ciurte@softvision.ro
Q&A
Integrating with Android
Speaker:
MIHALY NAGY - Android Community Influencer at Softvision
Android + TensorFlow
● Model File
● [Labels File]
● tensorflow-android dependency
● Boilerplate
● Integrate TF to process each frame
[Diagram: each Frame → Bitmap → Recognition]
Follow Along:
http://goo.gl/SYHSb7
https://github.com/code-twister/tf_example
Coding time
Thank You!
Editor's Notes
Running on mobile makes it possible to deliver interactive and real time applications in a way that’s not possible when depending on the internet connection
Multiple scales and aspect ratios are handled by search windows of different sizes and aspect ratios, or by image scaling
From R-CNN to Fast R-CNN:
region pooling on “conv5” feature map for feature extraction
softmax classifier instead of SVM classifier
Multitask training:
the last fc layer branches into two sibling output layers:
one that produces softmax probability estimates over K object classes
another layer that outputs the bounding box coordinates for each object.
First, a CNN is applied to the whole original image with several convolutional (conv) and max pooling layers to produce a conv feature map.
Then, for each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map, which is fed into a sequence of fully connected (fc) layers.
The fc layers finally branch into two sibling output layers:
one that produces softmax probability estimates over K object classes
another layer that outputs the bounding box coordinates for each object.
A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.
SSD approach:
produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes
followed by a non-maximum suppression step to produce the final detections.
Network generates scores for each default box
Wei Liu et al., SSD: Single Shot MultiBox Detector, ECCV 2016
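The non-maximum suppression step mentioned above can be sketched as a greedy filter. This is a minimal single-class version with an assumed IoU threshold of 0.5 and made-up boxes and scores; detectors typically run it separately for each class.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one object plus one separate detection.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed as a duplicate response, while the distant third box survives.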
SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location
Wei Liu et al., SSD: Single Shot MultiBox Detector, ECCV 2016
There are several object detection algorithms.
The question is: how do they compare with each other?
We define several meta-parameters that influence detector performance.
Critical points on the curve that can be identified:
mAP = mean average precision
[Huang et al.] measured the influence of these meta-parameters on accuracy and speed
Jonathan Huang et al., Speed/accuracy trade-offs for modern convolutional object detectors, CVPR 2017
Recognition refers to the detected objects, not to the process