Alessandro Magnani, Data Scientist, @WalmartLabs at MLconf SF - 11/13/15

Classification Labels in a Fast Moving Environment: Classification problems are very common in ecommerce. Collecting and storing labels from different sources is key to training and evaluating such models.

Labels are expensive to obtain, thus selecting which products to get labels for is key to optimally use any available labeling budget, both when training and evaluating a model. At the same time, if available labels are not correctly used, incorrect or suboptimal results can be produced.

In this talk I will discuss some of the challenges and potential pitfalls of acquiring and using labels for classification in a quickly evolving environment. I will present a system that stores labels and provides a way to select labels to optimize budget while providing accurate and unbiased evaluations of the classification models.

  1. Classification Labels in a Fast Moving Environment. Alessandro Magnani, @WalmartLabs, Walmart Global eCommerce, California, USA. Friday 13th November, 2015
  2. Classification Model Performance [Diagram: Items, Classifier, Editor; N sampled items, true label $y_i$, estimate $\tilde{y}_i$, accuracy evaluation] ◮ correctly evaluating classification models is critical and requires labels ◮ labeling products is expensive ◮ need to correctly and optimally use labels
  3. Classification Model Performance [Diagram as on slide 2] Measure accuracy, common approach: ◮ sample $N$ items uniformly at random ◮ compute accuracy $\frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{\tilde{y}_i = y_i\}$
  4. Practical challenges [Diagram as on slide 2] ◮ items change over time
  5. Practical challenges [Diagram as on slide 2] ◮ items change over time ◮ evaluation required over multiple subsets
  6. Practical challenges [Diagram as on slide 2] ◮ items change over time ◮ evaluation required over multiple subsets ◮ existing labels potentially hard to reuse
  7. A motivating example: compute accuracy over 1M items with a budget of 1K labels ◮ sample 1K items and get labels $y_i$ ◮ measure accuracy $\frac{1}{1\text{K}} \sum_{i=1}^{1\text{K}} \mathbf{1}\{\tilde{y}_i = y_i\}$ [Plot: sampling probability p versus items, axis mark 1M, level 1/1K]
  8. A motivating example: 500K items added, compute accuracy on all 1.5M items ◮ use previous accuracy measure ◮ most likely inaccurate [Plot: sampling probability p versus items, axis marks 1M and 1.5M, level 1/1K]
  9. A motivating example: 500K items added, compute accuracy on all 1.5M items, 500 labels extra budget ◮ sample 500 items from the 1.5M ◮ compute accuracy on new 500 labels ◮ previous 1K labels “wasted” [Plot: sampling probability p versus items, axis marks 1M and 1.5M, level 1/3K]
  10. A motivating example: 500K items added, compute accuracy on all 1.5M items, 500 labels extra budget, better approach ◮ sample 500 items from new items ◮ compute accuracy on all 1.5K labels ◮ no label “wasted” [Plot: sampling probability p versus items, axis marks 1M and 1.5M, level 1/1K]
  11. A motivating example: 500K items added, compute accuracy on all 1.5M items, only 250 labels extra budget? ◮ sample 250 items from new items ◮ need to account for difference in sampling ◮ accuracy: $\frac{1}{1.5\text{K}} \left( \sum_{i=1}^{1\text{K}} \mathbf{1}\{\tilde{y}_i = y_i\} + 2 \sum_{i=1}^{250} \mathbf{1}\{\tilde{y}^{\text{new}}_i = y^{\text{new}}_i\} \right)$ [Plot: sampling probability p versus items, axis marks 1M and 1.5M, level 1/2K]
  12. A motivating example: What are the challenges? ◮ sampling new test labels for every measure is generally expensive
  13. A motivating example: What are the challenges? ◮ sampling new test labels for every measure is generally expensive ◮ knowing how previous labels were sampled is required to optimally sample new items for test
  14. A motivating example: What are the challenges? ◮ sampling new test labels for every measure is generally expensive ◮ knowing how previous labels were sampled is required to optimally sample new items for test ◮ computing accuracy using all labels requires knowledge of the sampling profile
  15. A motivating example: What are the challenges? ◮ sampling new test labels for every measure is generally expensive ◮ knowing how previous labels were sampled is required to optimally sample new items for test ◮ computing accuracy using all labels requires knowledge of the sampling profile ◮ over time, reusing labels can become very tricky
  16. Evaluation framework ◮ $p_i$ is the probability that item $i$ is selected for test (Bernoulli) ◮ each item carries $p_i$ and is marked if selected (store the sampling profile) ◮ accuracy: $\frac{1}{\sum_{i\,\text{selected}} \frac{1}{p_i}} \sum_{i\,\text{selected}} \frac{1}{p_i} \mathbf{1}\{\tilde{y}_i = y_i\}$
  17. Evaluation framework ◮ $p_i$ is the probability that item $i$ is selected for test (Bernoulli) ◮ each item carries $p_i$ and is marked if selected (store the sampling profile) ◮ accuracy: $\frac{1}{\sum_{i\,\text{selected}} \frac{1}{p_i}} \sum_{i\,\text{selected}} \frac{1}{p_i} \mathbf{1}\{\tilde{y}_i = y_i\}$ ◮ for evaluation to be possible, $p_j > 0$ for all $j$, labeled or unlabeled
  18. Evaluation framework ◮ $p_i$ is the probability that item $i$ is selected for test (Bernoulli) ◮ each item carries $p_i$ and is marked if selected (store the sampling profile) ◮ accuracy: $\frac{1}{\sum_{i\,\text{selected}} \frac{1}{p_i}} \sum_{i\,\text{selected}} \frac{1}{p_i} \mathbf{1}\{\tilde{y}_i = y_i\}$ ◮ for evaluation to be possible, $p_j > 0$ for all $j$, labeled or unlabeled ◮ all labels are used
  19. Evaluation framework ◮ $p_i$ is the probability that item $i$ is selected for test (Bernoulli) ◮ each item carries $p_i$ and is marked if selected (store the sampling profile) ◮ accuracy: $\frac{1}{\sum_{i\,\text{selected}} \frac{1}{p_i}} \sum_{i\,\text{selected}} \frac{1}{p_i} \mathbf{1}\{\tilde{y}_i = y_i\}$ ◮ for evaluation to be possible, $p_j > 0$ for all $j$, labeled or unlabeled ◮ all labels are used ◮ with uniform sampling this is simply “standard” accuracy
  20. Evaluation framework ◮ $p_i$ is the probability that item $i$ is selected for test (Bernoulli) ◮ each item carries $p_i$ and is marked if selected (store the sampling profile) ◮ accuracy: $\frac{1}{\sum_{i\,\text{selected}} \frac{1}{p_i}} \sum_{i\,\text{selected}} \frac{1}{p_i} \mathbf{1}\{\tilde{y}_i = y_i\}$ ◮ for evaluation to be possible, $p_j > 0$ for all $j$, labeled or unlabeled ◮ all labels are used ◮ with uniform sampling this is simply “standard” accuracy ◮ very closely related to importance sampling
  21. Evaluation framework: given existing sampling $p_i$ and extra budget, how do we sample? ◮ minimize accuracy variance with budget constraint ◮ can be formulated as an optimization problem ◮ easy to solve
  22. Evaluation framework: it works as you’d expect as budget grows [Two plots of sampling probability p] ◮ new budget (blue) used more where $p_i$ is smaller ◮ given enough budget we obtain uniform sampling
  23. Extensions ◮ framework works more generally for supervised learning ◮ framework can work with a wide range of different metrics ◮ optimal sampling can use model posterior to reduce variance ◮ this framework can be used on the training side together with active learning
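
The accuracy estimate from the evaluation framework slides (each labeled item weighted by 1/p_i, normalized by the sum of those weights) can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the talk: the function name `weighted_accuracy` and the example counts of correct predictions (900 and 200) are assumptions made up for the sketch, while the sampling probabilities 1/1K and 1/2K come from the motivating example with 250 extra labels.

```python
from typing import Sequence


def weighted_accuracy(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    p: Sequence[float],
) -> float:
    """Accuracy over labeled items, each weighted by 1 / p_i.

    p[i] is the probability with which labeled item i was selected for
    test. The weighted count of correct predictions is divided by the
    sum of the weights, as in the slide formula.
    """
    if not (len(y_true) == len(y_pred) == len(p)):
        raise ValueError("y_true, y_pred and p must have the same length")
    if len(p) == 0:
        raise ValueError("at least one labeled item is required")
    if any(pi <= 0 for pi in p):
        raise ValueError("every selection probability must be positive")

    weights = [1.0 / pi for pi in p]
    # Sum the weights of the items whose prediction matches the label.
    correct = sum(w for w, t, e in zip(weights, y_true, y_pred) if t == e)
    return correct / sum(weights)


if __name__ == "__main__":
    # Hypothetical outcome: 900 of the 1K old labels are predicted
    # correctly, and 200 of the 250 new labels are predicted correctly.
    old_true, old_pred = [1] * 1000, [1] * 900 + [0] * 100
    new_true, new_pred = [1] * 250, [1] * 200 + [0] * 50

    # Old items were sampled with p = 1/1K, new items with p = 1/2K.
    probs = [1 / 1000] * 1000 + [1 / 2000] * 250

    print(weighted_accuracy(old_true + new_true, old_pred + new_pred, probs))
```

With these made-up counts the result equals (900 + 2 × 200) / 1.5K, matching the form of the accuracy expression on slide 11, and when every p_i is the same the function reduces to the standard fraction of correct predictions.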
