Andrii Belas, Data Scientist, SMART business
• Machine learning expert, public speaker.
• Founder and mentor of SMART Data Science Academy; responsible for the technical development of the data science team and the architecture of all data science projects at SMART business.
• Microsoft Certified Professional in:
  • Big Data and Advanced Analytics
  • Cloud Data Science with Azure Machine Learning
  • Developing SQL Data Models
Experience:
• Deep Learning
• Computer Vision
• AI in Forecasting
• AI in Marketing
• Risk management
• Business Intelligence
Agenda
1. Overview
2. Business cases
3. Approaches
4. Frameworks
Image classification
Dogs vs. Cats (Kaggle)
K classes
Task: Assign the correct class label to the whole image
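As a minimal sketch of what "assign one label to the whole image" looks like in code: load a network pretrained on ImageNet and take the arg-max class. The torchvision model choice and the file name below are illustrative assumptions, not something prescribed by the slides.

```python
# Whole-image classification sketch: one label for the entire image.
# Assumes torchvision's ImageNet-pretrained ResNet-50 purely for illustration.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(pretrained=True).eval()

image = Image.open("dog_or_cat.jpg").convert("RGB")   # hypothetical file name
batch = preprocess(image).unsqueeze(0)                # shape: [1, 3, 224, 224]

with torch.no_grad():
    logits = model(batch)                 # one score per ImageNet class
print("predicted class index:", logits.argmax(dim=1).item())
```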
Another good challenge
Classification vs. Detection
Business process
1. Tagging the current assortment of Roshen/competitor SKUs (500 SKUs)
2. Tagging new Roshen/competitor SKUs
Training the recognition neural network (4-5 hours)
Deploying the model to the merchandisers' devices
• Planogram compliance control
• Out-of-shelf (missing display) control
• Audit of competitor prices, promos and planograms
Real-time reports for management
Assessment and forecast of the impact of ROSHEN and competitor planograms on sales
Example metrics for management
• Shelf-share estimation
• Appearance of a new product, price tags and promos across all points of sale
• Shelf-share compliance and own display racks in points of sale
• Compliance control of planograms, price tags and promos
• Correlation and forecast of the impact of planograms, promos and competitors on sales in a point of sale (advanced predictive analytics)
• Ranking of points of sale by these metrics
• Ranking of merchandising teams by these metrics
ImageNet Large Scale Visual Recognition Challenge
Classification
[Figure: the whole image is labeled "keyboard, mouse"]
Object Detection
[Figure: each keyboard and mouse instance gets its own labeled bounding box]
Where to begin
• Data
• Detection algorithm
• Evaluation approach
• Deployment
Tips
• Train on data like what you'll see in production
• Label your data well (don't miss anything)
• Avoid detecting very tiny objects in the image
• Labeling tool: https://github.com/Microsoft/VoTT
Open Data and Benchmarks
• Pascal VOC (20 classes, ~10-13K images)
http://host.robots.ox.ac.uk/pascal/VOC/
• MS COCO (80 classes, 123K images)
http://mscoco.org/
• ImageNet (200 classes, >500K images)
http://image-net.org/
• Cat Annotation Dataset (10K annotated cat images)
http://academictorrents.com/details/145ee4e1fe1acee71b122eab522d14528bbacaf7
Evaluation
• Compute average precision (AP) separately for each class, then average over classes (mAP).
• A detection is a true positive if its IoU (Intersection over Union) with a ground-truth box is greater than some threshold, usually 0.5 (AP@0.5).
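Since the true-positive criterion hinges on IoU, here is a small helper that computes it for two axis-aligned boxes given as (x1, y1, x2, y2); the example boxes are made up just to exercise it.

```python
# Intersection over Union (IoU) for two axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as a true positive at AP@0.5 when IoU > 0.5:
print(iou((10, 10, 50, 50), (30, 30, 70, 70)) > 0.5)   # False for these toy boxes
```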
Evaluation Metrics – Precision vs Recall vs mAP
[Figure: precision-recall curve, with precision (0-100%) on the vertical axis and recall (0-1) on the horizontal axis]
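To connect the curve above to a single AP number, a simplified sketch (rectangle-rule area under the precision-recall curve; the official VOC/COCO protocols additionally interpolate precision). The per-detection TP/FP flags and ground-truth count below are hypothetical.

```python
import numpy as np

# Detections for one class, sorted by descending confidence; True = TP, False = FP.
# These flags and the ground-truth count are hypothetical, just to exercise the math.
is_tp = np.array([True, True, False, True, False])
num_ground_truth = 4

tp_cum = np.cumsum(is_tp)                 # running true-positive count
fp_cum = np.cumsum(~is_tp)                # running false-positive count
precision = tp_cum / (tp_cum + fp_cum)
recall = tp_cum / num_ground_truth

# AP as the area under the precision-recall curve (simple rectangle rule).
ap = float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))
print("precision:", precision, "recall:", recall, "AP:", round(ap, 3))
# mAP is simply the mean of AP over all classes.
```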
The first efficient face detector (Viola-Jones algorithm, 2001)
• Their demo showed faces being detected in real time on a webcam feed; it was the most stunning demonstration of computer vision and its potential at the time.
• It was soon implemented in OpenCV, and face detection became synonymous with the Viola-Jones algorithm.
• Hand-coded features (eyes, nose, their locations and interactions)
• Poor results for non-frontal / non-ideal faces
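Since the Viola-Jones detector still ships with OpenCV as a pretrained Haar cascade, running it takes only a few lines; the input file name below is a placeholder.

```python
# Viola-Jones face detection via the Haar cascade bundled with OpenCV.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

gray = cv2.cvtColor(cv2.imread("people.jpg"), cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:            # each detection is (x, y, width, height)
    print("face at", x, y, w, h)
```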
A much more efficient detection technique (Histograms of Oriented Gradients, 2005)
• Navneet Dalal and Bill Triggs invented HOG for pedestrian detection.
• Their feature descriptor, Histograms of Oriented Gradients (HOG), significantly outperformed existing algorithms on this task.
• Hand-coded features, just like before.
• For every single pixel, we look at the pixels directly surrounding it: the goal is to determine how dark the current pixel is compared to its neighbours.
• We then draw an arrow showing the direction in which the image is getting darker.
• We repeat this process for every single pixel in the image, so every pixel is replaced by an arrow. These arrows are called gradients.
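OpenCV bundles a HOG descriptor together with a pretrained linear-SVM people detector in the spirit of Dalal & Triggs, so the pedestrian-detection pipeline can be sketched as follows; the image name and the detectMultiScale parameters are illustrative choices.

```python
# HOG + linear SVM pedestrian detection, using the pretrained people detector
# that ships with OpenCV.
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("street.jpg")                      # placeholder file name
rects, weights = hog.detectMultiScale(image, winStride=(8, 8), scale=1.05)

for (x, y, w, h), weight in zip(rects, weights):
    print("pedestrian at", x, y, w, h, "weight", weight)
```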
HOG
ResNet and Transfer Learning
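A hedged sketch of what transfer learning on a ResNet typically looks like: freeze the pretrained backbone and train only a new classification head for your own classes. The class count (500, echoing the SKU case above) and the torchvision API are assumptions for illustration, not part of the slides.

```python
# Transfer learning sketch: reuse a pretrained ResNet as a feature extractor
# and train only a new classification head for our own classes.
import torch.nn as nn
from torchvision import models

num_classes = 500                      # e.g. the SKU catalogue mentioned earlier

model = models.resnet50(pretrained=True)
for param in model.parameters():       # freeze the pretrained backbone
    param.requires_grad = False

# Replace the final fully connected layer; only this part will be trained.
model.fc = nn.Linear(model.fc.in_features, num_classes)
```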
Brute-force approach
• We can take a classifier like VGGNet or Inception and turn it into an object detector by sliding a
small window across the image
• At each step you run the classifier to get a prediction of what sort of object is inside the current
window.
• Using a sliding window gives several hundred or thousand predictions for that image, but you only
keep the ones the classifier is the most certain about.
• This approach works but it’s obviously going to be very slow, since you need to run the classifier
many times.
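A sketch of that sliding-window loop; `classify` is a hypothetical stand-in for any image classifier (e.g. VGGNet or Inception) that returns a label and a confidence, and the image is assumed to be an H×W×3 array.

```python
# Sliding-window "brute force" detection sketch.
# classify(crop) is a hypothetical placeholder returning (label, confidence).
def sliding_window_detect(image, classify, window=224, stride=32, keep=0.9):
    height, width = image.shape[:2]
    detections = []
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            crop = image[y:y + window, x:x + window]
            label, confidence = classify(crop)   # one classifier call per window
            if confidence >= keep:               # keep only confident windows
                detections.append((x, y, window, window, label, confidence))
    return detections   # hundreds or thousands of windows, hence very slow
```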
A better approach: R-CNN (2014)
• R-CNN creates bounding boxes, or region proposals, using a process called Selective Search
• At a high level, Selective Search looks at the image through windows of different sizes, and for each
size tries to group together adjacent pixels by texture, color, or intensity to identify objects.
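Selective Search is available in opencv-contrib-python, so generating R-CNN-style region proposals can be sketched like this (the image name is a placeholder):

```python
# Selective Search region proposals, as used by R-CNN.
# Requires opencv-contrib-python for the cv2.ximgproc module.
import cv2

image = cv2.imread("shelf.jpg")                       # placeholder file name
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()                      # "quality" mode is slower
proposals = ss.process()                              # array of (x, y, w, h) boxes

print(len(proposals), "region proposals")
# R-CNN then warps each proposal and runs a CNN classifier on it.
```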
R-CNN
Fast R-CNN (2015)
Faster R-CNN (2016)
YOLO (2016)
• YOLO takes a completely different approach.
• It’s not a traditional classifier that is repurposed to be an object detector.
• YOLO actually looks at the image just once (hence its name: You Only Look Once) but in a clever way.
• YOLO divides up the image into a grid of 13 by 13 cells
YOLO (2016)
• Each of these cells is responsible for predicting 5 bounding boxes.
• A bounding box describes the rectangle that encloses an object.
• YOLO also outputs a confidence score that tells us how certain it is that the predicted bounding box
actually encloses some object.
• This score doesn’t say anything about what kind of object is in the box, just if the shape of the box is
any good.
YOLO (2016)
• For each bounding box, the cell also predicts a class.
• The confidence score for the bounding box and the class prediction are combined into one final score
that tells us the probability that this bounding box contains a specific type of object.
• For example, the big fat yellow box on the left is 85% sure it contains the object “dog”:
YOLO (2016)
• Since there are 13×13 = 169 grid cells and each cell predicts 5 bounding boxes, we end up with 845
bounding boxes in total.
• It turns out that most of these boxes will have very low confidence scores, so we only keep the boxes
whose final score is 30% or more (you can change this threshold depending on how accurate you want
the detector to be).
Non-Maximum Suppression
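Non-maximum suppression keeps the highest-scoring box, discards lower-scoring boxes that overlap it too heavily (by the same IoU measure used in evaluation), then repeats. A minimal greedy sketch with toy boxes:

```python
# Greedy non-maximum suppression over boxes given as (x1, y1, x2, y2).
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)                       # highest-scoring remaining box
        kept.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept                                   # indices of the surviving boxes

# Two heavily overlapping boxes plus one separate box: only two survive.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
print(non_max_suppression(boxes, scores=[0.9, 0.8, 0.7]))   # -> [0, 2]
```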
YOLO (2016)
• You Only Look Once
• So we end up with 125 channels for every grid cell; for each of the 5 boxes these encode (see the decoding sketch below):
• x, y, width, height for the bounding box’s rectangle
• the confidence score
• the probability distribution over the classes
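Why 125: with the 20 Pascal VOC classes, each of the 5 boxes carries 4 coordinates + 1 confidence + 20 class probabilities = 25 values, and 5 × 25 = 125. A hedged decoding sketch follows; the random tensor stands in for real network output, and the 0.3 cut-off is the 30% threshold mentioned earlier.

```python
import numpy as np

# 125 channels per grid cell = 5 boxes * (4 coords + 1 confidence + 20 classes).
GRID, BOXES, CLASSES = 13, 5, 20

# Hypothetical network output, reshaped to [13, 13, 5, 25] for readability.
output = np.random.rand(GRID, GRID, BOXES, 4 + 1 + CLASSES)

detections = []
for row in range(GRID):
    for col in range(GRID):
        for b in range(BOXES):
            x, y, w, h = output[row, col, b, :4]
            confidence = output[row, col, b, 4]
            class_probs = output[row, col, b, 5:]
            score = confidence * class_probs.max()   # final per-box score
            if score >= 0.3:                          # keep boxes scoring 30%+
                detections.append((x, y, w, h, int(class_probs.argmax()), score))

# 13 * 13 * 5 = 845 candidate boxes before this filtering and NMS.
```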
YOLO
YOLOv2 (v3…)
SSD (2016)
Useful links
• https://www.youtube.com/watch?v=NrmMk1Myrxc – Amazon Go
• https://github.com/Microsoft/VoTT - labeling tool
• https://youtu.be/Cgxsv1riJhI - How computers learn to recognize objects instantly | Joseph
Redmon
• https://pjreddie.com/darknet/yolo/ - YOLO Official
• https://youtu.be/VOC3huqHrss - YOLO demo
• https://github.com/thtrieu/darkflow - TensorFlow implementation
• http://cs231n.stanford.edu/ - Convolutional Neural Networks for Visual Recognition (Stanford)
Questions?
Andrii Belas, "Overview of object detection approaches: cases, algorithms and software"