Typical approaches to solving classification problems require collecting a dataset for each new class and retraining the model. Metric learning lets you train a model once and then add new classes easily with 5-10 reference images.
We'll talk about metric learning based on YouScan's experience: the task, the data, different losses and approaches, the metrics we used, pitfalls and peculiarities, and the things that worked and didn't.
1. Yulia Honcharenko, "Application of metric learning for logo recognition"
2. Yulia Honcharenko
Data Scientist, @yuliok
YouScan provides real-time monitoring and analytics of brand mentions on social networks, blogs, forums, review sites and online news.
3.
4. Agenda
1. Logo recognition in YouScan: old and new approach
2. Data
3. How can we measure performance: NMI, Recall@n, F1-score
4. Baseline
5. Metric learning: Triplet loss, Proxy-NCA loss, SoftTriple
6. Small tips & tricks
7. Synthetic data
8. Results
6. Problems
• We need ~200 labelled images for each new class
• We need to retrain two models to add every new class
• F1-score and mAP decrease as new classes are added
• Minimum time to add a new logo is ~3 days
7. Solution: a two-step approach
• Detect all potential logos
• Match each potential logo to the existing logo "standards" from our base
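The second step above can be sketched as a nearest-standard lookup in embedding space. This is a minimal illustration, not the production code: the function names (`cosine`, `match_to_standards`) and the threshold value are my own assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def match_to_standards(crop_emb, standards, threshold=0.5):
    """Return the best-matching brand for a detected crop, or None if no
    standard is close enough. `standards` maps brand -> reference embeddings."""
    best_brand, best_sim = None, threshold
    for brand, refs in standards.items():
        for ref in refs:
            sim = cosine(crop_emb, ref)
            if sim > best_sim:
                best_brand, best_sim = brand, sim
    return best_brand
```

Adding a new brand then means adding 5-10 reference embeddings to `standards`; no retraining is involved.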
11. Data
• Public logo datasets + our own data
• 973 train classes, 43 test classes
• Min count of images per class: 3
12. Metrics: F1-score of the end-to-end approach by IoU threshold
Example with IoU = 0.7 (a detection counts as a true positive only if IoU >= threshold):
• +1 false positive for Apple
• +1 true positive for Instagram (IoU >= threshold)
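For reference, the IoU (Intersection over Union) used to decide true vs false positives can be computed as follows; this is a standard formulation, with the `(x1, y1, x2, y2)` box convention assumed by me.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```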
15. Baseline: learning to fine-tune
Take the weight matrix W = [w1, w2, ..., wc] as an example, where each class has a d-dimensional weight vector. In the training stage, for an input feature f(xi) we compute its cosine similarity to each weight vector and obtain the similarity scores [si,1, si,2, ..., si,c] for all classes, where

si,j = f(xi)^T wj / (||f(xi)|| ||wj||)

We can then obtain the prediction probability for each class by normalizing these similarity scores with a softmax function. Here, the classifier makes a prediction based on the cosine distance between the input feature and the learned weight vectors representing each class.
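The baseline classifier described above can be sketched in a few lines; this is an illustration of the cosine-similarity-plus-softmax idea, not the actual model (function names are mine).

```python
import math

def cosine_softmax_probs(feature, weights):
    """Class probabilities: softmax over cosine similarities between an
    input feature f(x_i) and per-class weight vectors w_1..w_C."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    scores = [cos(feature, w) for w in weights]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```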
16. Distance metric learning
Distance metric learning (or simply, metric learning) aims at automatically constructing task-specific distance metrics from supervised data, in a machine learning manner.
19. Metrics: NMI of test-set k-means clusters
• Divide our test set into N (= number of classes) clusters by k-means
• Compute NMI (Normalized Mutual Information) between the cluster labels and the ground-truth labels
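NMI between the k-means assignment and the ground truth can be computed from scratch as below (in practice a library such as scikit-learn's `normalized_mutual_info_score` would be used; this pure-Python sketch uses the sqrt-normalized variant, which is one of several common normalizations).

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized Mutual Information between two label assignments,
    normalized by sqrt(H(A) * H(B))."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    # Mutual information I(A; B)
    mi = sum((nab / n) * math.log(n * nab / (ca[a] * cb[b]))
             for (a, b), nab in cab.items())
    def entropy(counts):
        return -sum((m / n) * math.log(m / n) for m in counts.values())
    ha, hb = entropy(ca), entropy(cb)
    if ha == 0.0 or hb == 0.0:
        return 1.0 if ha == hb else 0.0
    return mi / math.sqrt(ha * hb)
```

NMI is 1 when clusters match the classes up to a relabeling, and 0 when they are independent.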
21. Triplet loss
The triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity:

L_triplet(xa, xp, xn) = max(0, m + ||f(xa) − f(xp)||2^2 − ||f(xa) − f(xn)||2^2)
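On precomputed embeddings, the formula above is straightforward to evaluate; the sketch below (my own minimal version, with a margin value chosen for illustration) mirrors it term by term.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, m + ||f(xa) - f(xp)||^2 - ||f(xa) - f(xn)||^2)."""
    return max(0.0, margin + sq_dist(anchor, positive) - sq_dist(anchor, negative))
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin m.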
22. Proxies
Intuitively, we would like P to approximate the set of all data points, i.e. for each x there is one element in P which is close to x w.r.t. the distance metric d. We call such an element a proxy for x:

p(x) = argmin_{p in P} d(x, p)

Proxy approximation error: the maximum distance from any x to its proxy, max_x d(x, p(x))
23. NCA loss
The NCA (Neighbourhood Components Analysis) loss tries to make x closer to y than to any element in a set Z, using exponential weighting:

L_NCA(x, y, Z) = −log( exp(−d(x, y)) / Σ_{z in Z} exp(−d(x, z)) )
24. Proxy-NCA loss
Just use proxies instead of the raw elements. So the algorithm does the following steps: sample a triplet, form the proxy triplet for the sample, calculate the loss:

l = −log( exp(−d(x, p(y))) / Σ_{p(z) in p(Z)} exp(−d(x, p(z))) )
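A direct transcription of this loss, assuming Euclidean distance and following the slide's convention that the denominator sums over the negative proxies p(Z) only (function names are mine):

```python
import math

def euclid(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def proxy_nca_loss(x, pos_proxy, neg_proxies):
    """Proxy-NCA: -log( exp(-d(x, p(y))) / sum_{p(z) in p(Z)} exp(-d(x, p(z))) )."""
    num = math.exp(-euclid(x, pos_proxy))
    den = sum(math.exp(-euclid(x, p)) for p in neg_proxies)
    return -math.log(num / den)
```

Pulling x toward its class proxy and away from the other proxies lowers the loss, which is what makes the gradient cheap: only the proxies, not all samples, appear in the sum.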
25. SoftTriple
In the conventional SoftMax loss, each class has a representative center in the last fully connected layer. Examples in the same class will be collapsed to the same center, which may be inappropriate for real-world data as illustrated. In contrast, the SoftTriple loss keeps multiple centers (e.g., 2 centers per class in this example) in the fully connected layer, and each image is assigned to one of them. This is more flexible for modeling intra-class variance in real-world data sets.
26. SoftTriple
Now we assume that each class has K centers. Then the similarity between an example xi and a class c can be defined as

S_{i,c} = max_k (xi^T w_c^k)
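The per-class similarity above is just a max of dot products over that class's K centers; a one-function sketch (my own naming):

```python
def class_similarity(x, centers):
    """S_{i,c} = max_k x^T w_c^k: similarity of example x to a class
    represented by a list of K center vectors."""
    return max(sum(a * b for a, b in zip(x, w)) for w in centers)
```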
27. HardTriple
The HardTriple loss improves the SoftMax loss by providing multiple centers for each class. However, it requires the max operator to obtain the nearest center in each class, while this operator is not smooth and the assignment can be sensitive between multiple centers. Inspired by the SoftMax loss, we can improve the robustness by smoothing the max operator.
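One common way to smooth the max, used in this spirit by SoftTriple, is a softmax-weighted average over the centers; the sketch below is my own minimal version, with the temperature name `gamma` and its default chosen for illustration.

```python
import math

def smoothed_class_similarity(x, centers, gamma=0.1):
    """Smoothed max over a class's centers: sum_k q_k * (x^T w_c^k),
    where q_k = softmax_k((x^T w_c^k) / gamma). As gamma -> 0 this
    approaches the hard max used by HardTriple."""
    dots = [sum(a * b for a, b in zip(x, w)) for w in centers]
    exps = [math.exp(d / gamma) for d in dots]
    z = sum(exps)
    return sum((e / z) * d for e, d in zip(exps, dots))
```

Unlike the hard max, this expression is differentiable everywhere and spreads gradient across nearby centers instead of assigning each example to exactly one.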
28. SoftTriple
Compared with the SoftMax loss, we first increase the dimension of the FC layer to include multiple centers for each class (e.g., 2 centers per class in this example). Then we obtain the similarity for each class by different operators. Finally, the distribution over classes is computed from the similarity obtained for each class.
32. Validation problems
• The validation set was made from corrected old-approach predictions, so it is biased and contains a lot of false positives
• The validation set has no images without a logo, so recall of the new approach is definitely higher on production data, but not on validation data
35. Class "other"
• False positives from the first iterations
• Unseen data: faces, eyes and other non-logo things that come out of our detector as false positives
• ~23k images
36. Class "other": problems
• Every new class brings new "other" samples
• 5-shot learning becomes 50-shot learning
Solution:
• Add the "other" class to the train set
44. Small things that helped
• Remove small images from the dataset
• Add more classes
• Add +5 pixels on each side of the detector prediction
Augmentations:
• Add different blurs to the augmentations
• Randomly add a random number of pixels on a random side
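The two box-padding tricks above (a fixed +5-pixel margin at inference, random per-side padding as augmentation) can be sketched as follows; the function names and the `(x1, y1, x2, y2)` convention are my own assumptions.

```python
import random

def expand_box(box, pad, img_w, img_h):
    """Expand a detector box (x1, y1, x2, y2) by `pad` pixels on each
    side, clipped to the image bounds."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - pad), max(0, y1 - pad),
            min(img_w, x2 + pad), min(img_h, y2 + pad))

def random_expand_box(box, max_pad, img_w, img_h, rng=random):
    """Augmentation: pad each side independently by 0..max_pad pixels."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - rng.randint(0, max_pad)),
            max(0, y1 - rng.randint(0, max_pad)),
            min(img_w, x2 + rng.randint(0, max_pad)),
            min(img_h, y2 + rng.randint(0, max_pad)))
```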
45. Things that didn't help
• Spatial Transformer Network
• Any synthetic data (text/images)
49. Results
• We don't need 100-200 images labeled with bounding boxes anymore; we just need 5-10 crops, a.k.a. standards
• We don't need to retrain the detector and classifier; our models are universal and work with different logos
• It's easier to control things: we can add/delete standards when we see samples/logos our model can't deal with (earlier we had to add these samples to the train set and retrain the model)
50. Please be creative when you create a logo for your startup
Thank you for your attention!