Labels are expensive to obtain, so selecting which products to label is key to making optimal use of any labeling budget, both when training and when evaluating a model. At the same time, if the available labels are not used correctly, the results can be incorrect or suboptimal.

In this talk I will discuss some of the challenges and potential pitfalls of acquiring and using labels for classification in a quickly evolving environment. I will present a system that stores labels and provides a way to select which items to label so as to optimize the budget while providing accurate and unbiased evaluations of the classification models.

Published in:
Technology


- 1. Classification Labels in a Fast Moving Environment. Alessandro Magnani, @WalmartLabs, Walmart Global eCommerce, California, USA. Friday 13th November, 2015
- 2. Classification Model Performance
  [Diagram: items, classifier, editor; N sampled items with true label $y_i$ and estimate $\tilde{y}_i$ feed the accuracy evaluation]
  ◮ correctly evaluating classification models is critical and requires labels
  ◮ labeling products is expensive
  ◮ need to use labels correctly and optimally
- 3. Classification Model Performance: measuring accuracy, the common approach
  [Diagram: same evaluation pipeline as slide 2]
  ◮ sample N items uniformly at random
  ◮ compute accuracy $\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\{\tilde{y}_i = y_i\}$
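The common approach on slide 3 can be sketched in a few lines of Python. This is only an illustration, not the speaker's code: the function name and the `predict` and `true_label` callables are hypothetical stand-ins for the classifier and the human editor.

```python
import random


def uniform_sample_accuracy(items, predict, true_label, n, seed=0):
    """Estimate accuracy from n items drawn uniformly at random.

    predict(item) stands in for the classifier estimate and
    true_label(item) for the label supplied by an editor; both are
    hypothetical callables used only for this sketch.
    """
    rng = random.Random(seed)
    sample = rng.sample(items, n)  # uniform sampling without replacement
    correct = sum(1 for item in sample if predict(item) == true_label(item))
    return correct / n
```

Only the `n` sampled items need a label, which is what makes the labeling budget the limiting resource.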
- 4-6. Practical challenges
  [Diagram: same evaluation pipeline as slide 2]
  ◮ items change over time
  ◮ evaluation required over multiple subsets
  ◮ existing labels potentially hard to reuse
- 7. A motivating example: compute accuracy over 1M items with a budget of 1K labels
  ◮ sample 1K items and get labels $y_i$
  ◮ measure accuracy $\frac{1}{1\text{K}}\sum_{i=1}^{1\text{K}} \mathbb{1}\{\tilde{y}_i = y_i\}$
  [Plot: sampling probability p over the 1M items, at level 1/1K]
- 8. A motivating example: 500K items added, compute accuracy on all 1.5M items
  ◮ use previous accuracy measure
  ◮ most likely inaccurate
  [Plot: sampling probability p over the 1.5M items; axis marks at 1M and 1.5M, level 1/1K]
- 9. A motivating example: 500K items added, compute accuracy on all 1.5M items, 500 labels extra budget
  ◮ sample 500 items from the 1.5M
  ◮ compute accuracy on new 500 labels
  ◮ previous 1K labels "wasted"
  [Plot: sampling probability p over the 1.5M items; axis marks at 1M and 1.5M, level 1/3K]
- 10. A motivating example: 500K items added, compute accuracy on all 1.5M items, 500 labels extra budget, better approach
  ◮ sample 500 items from new items
  ◮ compute accuracy on all 1.5K labels
  ◮ no label "wasted"
  [Plot: sampling probability p over the 1.5M items; axis marks at 1M and 1.5M, level 1/1K]
- 11. A motivating example: 500K items added, compute accuracy on all 1.5M items, only 250 labels extra budget?
  ◮ sample 250 items from new items
  ◮ need to account for difference in sampling
  ◮ accuracy: $\frac{1}{1.5\text{K}}\left(\sum_{i=1}^{1\text{K}} \mathbb{1}\{\tilde{y}_i = y_i\} + 2\sum_{i=1}^{250} \mathbb{1}\{\tilde{y}^{\text{new}}_i = y^{\text{new}}_i\}\right)$
  [Plot: sampling probability p over the 1.5M items; axis marks at 1M and 1.5M, level 1/2K]
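The factor of 2 on slide 11 follows from weighting each label by the inverse of its sampling probability: the original items were sampled at 1K/1M = 1/1000 and the new ones at 250/500K = 1/2000. The sketch below checks that arithmetic. The counts of correct predictions (900 and 150) are made up for illustration and do not come from the talk.

```python
def weighted_accuracy(groups):
    """Accuracy from label groups sampled at different rates.

    groups is a list of (num_correct, num_labeled, sampling_probability);
    every label is weighted by 1 / sampling_probability.
    """
    numerator = sum(correct / p for correct, _, p in groups)
    denominator = sum(labeled / p for _, labeled, p in groups)
    return numerator / denominator


# Hypothetical outcome: 900 of the 1K original labels are predicted
# correctly, and 150 of the 250 new labels.
old_group = (900, 1000, 1 / 1000)  # 1K labels drawn from 1M items
new_group = (150, 250, 1 / 2000)   # 250 labels drawn from 500K items

estimate = weighted_accuracy([old_group, new_group])
slide_formula = (900 + 2 * 150) / 1500  # (sum_old + 2 * sum_new) / 1.5K
```

Both expressions give the same value, since the weights 1000 and 2000 reduce to 1 and 2 after dividing by the common normalizer of 1.5M.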
- 12-15. A motivating example: what are the challenges?
  ◮ sampling new test labels for every measure is generally expensive
  ◮ knowing how previous labels were sampled is required to optimally sample new items for test
  ◮ computing accuracy using all labels requires knowledge of the sampling profile
  ◮ over time, reusing labels can become very tricky
- 16-20. Evaluation framework
  ◮ $p_i$ is the probability that item i is selected for test (Bernoulli)
  ◮ each item carries $p_i$ and is marked if selected (store the sampling profile)
  ◮ accuracy: $\frac{1}{\sum_{i\ \text{selected}} \frac{1}{p_i}} \sum_{i\ \text{selected}} \frac{1}{p_i}\, \mathbb{1}\{\tilde{y}_i = y_i\}$
  ◮ for evaluation to be possible, $p_j > 0$ for all j, labeled or unlabeled
  ◮ all labels are used
  ◮ with uniform sampling this is simply "standard" accuracy
  ◮ very closely related to importance sampling
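A minimal sketch of this framework, assuming each item is stored as a small record: the `Item` class, its field names, and both function names are hypothetical and are not taken from the system described in the talk. Each item carries its selection probability, is marked when selected, and the estimator weights every selected item by the inverse of that probability.

```python
import random
from dataclasses import dataclass


@dataclass
class Item:
    p: float                 # probability of being selected for test
    selected: bool = False   # marked once the item is chosen for labeling
    correct: bool = False    # prediction == label; only known if selected


def select_for_test(items, rng):
    """Independent Bernoulli selection: item i is chosen with probability p_i."""
    for item in items:
        if item.p <= 0:
            raise ValueError("every item needs p > 0 for evaluation to be possible")
        item.selected = rng.random() < item.p


def inverse_probability_accuracy(items):
    """Accuracy with each selected item weighted by 1 / p_i."""
    chosen = [item for item in items if item.selected]
    if not chosen:
        raise ValueError("no selected items to evaluate")
    total_weight = sum(1 / item.p for item in chosen)
    correct_weight = sum(1 / item.p for item in chosen if item.correct)
    return correct_weight / total_weight


# Usage: mark items for labeling with a seeded generator.
catalog = [Item(p=0.2) for _ in range(1000)]
select_for_test(catalog, random.Random(0))
```

When every `p` is equal, the weights cancel and the estimate reduces to the plain fraction of correct predictions among the selected items.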
- 21. Evaluation framework: given existing sampling probabilities $p_i$ and an extra budget, how do we sample?
  ◮ minimize accuracy variance with budget constraint
  ◮ can be formulated as an optimization problem
  ◮ easy to solve
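The slides do not spell out the optimization problem, so the following is only one plausible formulation, stated as an assumption: choose new inclusion probabilities $q_i \ge p_i$ that minimize $\sum_i 1/q_i$ (a simple proxy for the estimator variance when nothing is known about which items are misclassified), subject to an expected number of extra labels $\sum_i (q_i - p_i)$ equal to the budget. Under that assumption the solution has a "water-filling" form, $q_i = \max(p_i, t)$, and the level $t$ can be found by bisection. The function name and tolerance-free bisection loop are illustrative choices.

```python
def water_fill(p, budget, iterations=200):
    """Raise the smallest probabilities to a common level t.

    Returns q with q_i = max(p_i, t), where t is chosen so that the
    expected number of extra labels, sum(q_i - p_i), matches budget.
    This assumes the objective sum(1 / q_i); it is not confirmed to be
    the formulation used in the talk.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if budget >= sum(1 - pi for pi in p):
        return [1.0] * len(p)  # enough budget to label everything
    lo, hi = min(p), 1.0
    for _ in range(iterations):  # bisection on the level t
        t = (lo + hi) / 2
        used = sum(max(t - pi, 0.0) for pi in p)
        if used > budget:
            hi = t
        else:
            lo = t
    return [max(pi, lo) for pi in p]


small = water_fill([0.1, 0.1, 0.4], budget=0.2)  # only the low items are raised
large = water_fill([0.1, 0.1, 0.4], budget=0.9)  # all items reach the same level
```

With the small budget only the two low-probability items are raised (to about 0.2), and with the larger budget all three end up at about 0.5, so the sampling becomes uniform. To realize $q_i$ in a second Bernoulli round, an item not yet selected can be chosen with probability $(q_i - p_i)/(1 - p_i)$, which gives an overall inclusion probability of $q_i$.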
- 22. Evaluation framework: it works as you'd expect as budget grows
  [Plots: sampling probability p for increasing budget]
  ◮ new budget (blue) used more where $p_i$ is smaller
  ◮ given enough budget we obtain uniform sampling
- 23. Extensions
  ◮ framework works more generally for supervised learning
  ◮ framework can work with a wide range of different metrics
  ◮ optimal sampling can use model posterior to reduce variance
  ◮ this framework can be used on the training side together with active learning
