How to evaluate & manage machine learning model #daft

  1. 1. How to Evaluate & Manage Machine Learning Model? #DAFT Shunya UETA @hurutoriya 2019-04-12
  2. 2. $ whoami ● Shunya UETA :: @hurutoriya ● Mercari, inc. Machine Learning Engineer ● Machine Learning Casual Talks Co-Organizer ● https://shunyaueta.com/
  3. 3. Machine Learning Workflow (CRISP-DM) ● Most important steps ○ Business Understanding ○ Evaluation ● What is missing for production Ref: Kenneth Jensen
  4. 4. CRISP-DM for Production Ref: Jan Teichmann
  5. 5. Content Moderation 1. An item is listed 2. If the prob score is greater than the threshold value, the item is hidden and Customer Support is alerted 3. Customer Support check: violating items → delete, normal items → unhide. E.g. content moderation targets: fake brands, game accounts
  6. 6. Assumptions ● The ML service runs on all listed items ● Binary classification ● Precision is more important than recall ● We can simulate online results offline by using the fast Customer Support check system Ref: Rendezvous Architecture for Data Science in Production
  7. 7. Before Deploy to Production [diagram: all listing items (2019/04/11), Cloud Pub/Sub, Current Model, New Model; outputs: 1. prob 2. true or false]
  8. 8. Sad Story of Machine Learning Systems in Production ● Gap between offline & online evaluation → "OK! We can’t know the online result, let’s deploy!" ● Data imbalance problem. High-Speed Continuous Improvement: 1. Easy A/B system 2. Online/offline sanity check
  9. 9. What We Do Before Putting a New Model Online [diagram: all listing items (2019/04/11), Cloud Pub/Sub, Compute Engine, Current Model, New Model; outputs: 1. prob 2. true or false]
  10. 10. Sanity Check Before Deploy to Production (Threshold: 0.95)
        ID       is_delete  Model α  Model β
        1613431  True       0.98     0.999
        5263832  True       0.97     0.43
        7213438  False      0.95     0.45
        3213492  True       0.70     0.98
        9201420  True       0.01     0.97
  11. 11. Sanity Check Before Deploy to Production (same table as slide 10, Threshold: 0.95). Annotations: Success! Cost Sensitive
  12. 12. Sanity Check Before Deploy to Production (same table as slide 10, Threshold: 0.95). Annotations: Success! Cost Sensitive; Fail! Worsen Recall
  13. 13. Sanity Check Before Deploy to Production (same table as slide 10, Threshold: 0.95). Annotations: Success! Cost Sensitive; Fail! Worsen Recall; Success! Cost Sensitive
  14. 14. Sanity Check Before Deploy to Production (same table as slide 10, Threshold: 0.95). Annotations: Success! Cost Sensitive; Fail! Worsen Recall; Success! Cost Sensitive; Success! Improve Recall
  15. 15. Sanity Check Before Deploy to Production (same table as slide 10, Threshold: 0.95). Annotations: Success! Cost Sensitive; Fail! Worsen Recall; Success! Cost Sensitive; Success! Improve Recall; Success! Improve Precision
  16. 16. Sanity Check Before Deploy to Production
        Confidence or Probability:         High ↑         Low ↓
        Deleted items (terms violation)    Improve 👍!!   Bad Model ☠
        Undeleted items (normal items)     Bad Model ☠    Improve 👍!!
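
The pre-deploy sanity check on slides 10 to 16 can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's actual tooling: the `Row` type and the function names are hypothetical, Model α is assumed to be the current model and Model β the candidate, and "flagged" is assumed to mean a score strictly greater than the threshold (following the wording of slide 5). The rows are copied from the table on slide 10.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

THRESHOLD = 0.95  # threshold shown on the slides


@dataclass(frozen=True)
class Row:
    item_id: int
    is_delete: bool   # Customer Support verdict: True means the item was deleted
    old_score: float  # Model α (assumed here to be the current model)
    new_score: float  # Model β (assumed here to be the candidate model)


# Rows copied from the table on slide 10.
ROWS: List[Row] = [
    Row(1613431, True, 0.98, 0.999),
    Row(5263832, True, 0.97, 0.43),
    Row(7213438, False, 0.95, 0.45),
    Row(3213492, True, 0.70, 0.98),
    Row(9201420, True, 0.01, 0.97),
]


def precision_recall(
    rows: List[Row],
    score_of: Callable[[Row], float],
    threshold: float = THRESHOLD,
) -> Tuple[float, float]:
    """Precision and recall when items scoring above the threshold are flagged."""
    flagged = [r for r in rows if score_of(r) > threshold]
    true_positives = sum(r.is_delete for r in flagged)
    positives = sum(r.is_delete for r in rows)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / positives if positives else 0.0
    return precision, recall


def direction(row: Row) -> str:
    """Slide 16 rule: deleted items should score higher, normal items lower."""
    if row.new_score == row.old_score:
        return "same"
    moved_up = row.new_score > row.old_score
    return "improve" if moved_up == row.is_delete else "worse"


if __name__ == "__main__":
    print("alpha:", precision_recall(ROWS, lambda r: r.old_score))
    print("beta: ", precision_recall(ROWS, lambda r: r.new_score))
    for r in ROWS:
        print(r.item_id, direction(r))
```

Reporting the per-item direction next to the aggregate metrics keeps regressions visible: item 5263832 gets worse under the candidate model even though overall recall goes up.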
  17. 17. Traditional Serving-Side Design Ref: Rendezvous Architecture for Data Science in Production MODEL {“name”: “Dog” , “prob”: “92.5” }
  18. 18. With Load Balancer Ref: Rendezvous Architecture for Data Science in Production {“name”: “Dog” , “prob”: “92.5” } Model 3 Load Balancer Model 1 Model 2
  19. 19. Sanity Check After Deploy to Production ● Compare the new model with the old model ○ Number of overlapping items at the same prob score ○ Eyeball (grep) the top 100 and bottom 100 👀 ○ Error analysis (false positive samples) ○ Use false positives for hard negative sampling
  20. 20. Strongly Recommended Articles
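
The post-deploy checks listed on slide 19 could be sketched as follows. This is an illustrative reading rather than the speaker's implementation: all function names are hypothetical, "overlap" is interpreted as the number of items both models flag at the same threshold, and the scores and labels reuse the table from slide 10 in place of real production logs.

```python
from typing import Dict, List, Set, Tuple

THRESHOLD = 0.95  # threshold shown on the slides

# Scores and Customer Support labels reused from the table on slide 10.
LABELS: Dict[int, bool] = {
    1613431: True, 5263832: True, 7213438: False, 3213492: True, 9201420: True,
}
OLD_SCORES: Dict[int, float] = {
    1613431: 0.98, 5263832: 0.97, 7213438: 0.95, 3213492: 0.70, 9201420: 0.01,
}
NEW_SCORES: Dict[int, float] = {
    1613431: 0.999, 5263832: 0.43, 7213438: 0.45, 3213492: 0.98, 9201420: 0.97,
}


def flagged_ids(scores: Dict[int, float], threshold: float = THRESHOLD) -> Set[int]:
    """IDs whose score is strictly greater than the threshold."""
    return {item_id for item_id, score in scores.items() if score > threshold}


def overlap_count(
    old: Dict[int, float], new: Dict[int, float], threshold: float = THRESHOLD
) -> int:
    """Number of items flagged by both models at the same threshold."""
    return len(flagged_ids(old, threshold) & flagged_ids(new, threshold))


def top_and_bottom(scores: Dict[int, float], n: int = 100) -> Tuple[List[int], List[int]]:
    """Highest and lowest scoring IDs, for manual inspection."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n], (ranked[-n:] if n > 0 else [])


def hard_negatives(
    scores: Dict[int, float], labels: Dict[int, bool], threshold: float = THRESHOLD
) -> List[int]:
    """False positives: flagged items that Customer Support judged normal."""
    return sorted(i for i in flagged_ids(scores, threshold) if not labels[i])


if __name__ == "__main__":
    print("overlap:", overlap_count(OLD_SCORES, NEW_SCORES))
    print("top/bottom 2:", top_and_bottom(NEW_SCORES, 2))
    print("hard negatives:", hard_negatives(NEW_SCORES, LABELS))
```

The IDs returned by `hard_negatives` are the candidates to feed back into training as hard negative samples; with the five rows above the list is empty at the 0.95 threshold.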
