
“Machine Learning in Production + Case Studies” by Dmitrijs Lvovs from Epistatica at Machine Learning focused 62nd DevClub.lv

Epistatica is a data science spin-off from VIA SMS R&D SERVICES, seeking its niche in European markets.
Dmitrijs is head of credit risk at VIA SMS R&D SERVICES, a fintech company, and a member of the board at Epistatica. He holds a PhD from the RAS Institute for Information Transmission Problems and has analyzed data for over 12 years.



  1. Machine learning in production + case studies (Dmitrijs Lvovs)
  2. Outline
     • Machine learning, Data Science, Artificial Intelligence
     • Common algorithms
     • Pipeline and common pitfalls
     • Case studies
  3. Machine learning
     • Machine Learning
     • Data Science
     • Artificial Intelligence
  4. Machine learning
     • Machine Learning
     • Data Science
     • Artificial Intelligence
  5. Machine learning
     • A process that enables a machine to perform a task at or above human level
  6. Algorithms (https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice)
     • Linear regression
     • Logistic regression
     • Decision tree
     • Neural networks
     • SVM
     • K-means and other clustering
  7. Algorithms (Lessmann et al., https://doi.org/10.1016/j.ejor.2015.05.030)
  8. Algorithms (Lessmann et al., https://doi.org/10.1016/j.ejor.2015.05.030)
  9. Algorithms
     • In production: credit scoring since the 1950s
  10. Pipeline & Pitfalls
     • Get / clean the data
     • Model & Evaluate
     • Deploy
     • Maintain
  11. Pipeline & Pitfalls
     • Get data: garbage in = garbage out
       – Ensure all data will be available at the time of prediction
       – Use sampling if necessary
       – Use the same code to get data for analysis and prediction
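One way to follow the "same code to get data for analysis and prediction" advice is to keep feature extraction in a single function that both the training job and the scoring service import. A hypothetical sketch (field names and values are invented for illustration, not from the talk):

```python
# Shared feature extraction: imported by BOTH the training job and the
# live scoring service, so the model never sees differently-built features.

def extract_features(record):
    """Turn one raw record (a dict) into a fixed-order feature vector.

    Uses only fields guaranteed to exist at prediction time.
    """
    return [
        float(record.get("age", 0)),
        float(record.get("monthly_income", 0)),
        1.0 if record.get("has_prior_loans") else 0.0,
    ]

# Training side: build the design matrix from historical records.
history = [
    {"age": 30, "monthly_income": 2500, "has_prior_loans": True},
    {"age": 45, "monthly_income": 3100, "has_prior_loans": False},
]
X_train = [extract_features(r) for r in history]

# Prediction side: the exact same function on a live record.
live = {"age": 52, "monthly_income": 1800, "has_prior_loans": True}
x_live = extract_features(live)
```

Keeping one function for both sides removes a whole class of train/serve skew bugs.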
  12. Pipeline and Pitfalls
     • Get data
     • Model & Evaluate
       – Select the target with business in mind
       – Start with simple things and set a benchmark
       – Improve, write a notebook
       – Test out of sample and out of time
  13. Pipeline and Pitfalls
     • Get data
     • Model & Evaluate
       – Select the target with business in mind
       – Start with simple things and set a benchmark
       – Improve, write a notebook
       – Test out of sample and out of time
       [diagram: the whole dataset laid out along the time axis, split into a "train + out of sample" segment followed by an "out of time" segment]
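The out-of-sample / out-of-time testing described above can be sketched as a chronological hold-out (out of time) plus a random split of the earlier data (train vs. out of sample). A minimal sketch; the field name and fractions are illustrative:

```python
import random

def split_dataset(rows, oot_fraction=0.2, oos_fraction=0.25, seed=42):
    """Split dated rows into train / out-of-sample / out-of-time sets.

    The most recent slice of the data is held out entirely (out of time);
    the remainder is randomly split into train and out-of-sample.
    """
    rows = sorted(rows, key=lambda r: r["date"])
    cut = int(len(rows) * (1 - oot_fraction))
    earlier, out_of_time = rows[:cut], rows[cut:]

    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = earlier[:]
    rng.shuffle(shuffled)
    n_oos = int(len(shuffled) * oos_fraction)
    out_of_sample, train = shuffled[:n_oos], shuffled[n_oos:]
    return train, out_of_sample, out_of_time

rows = [{"date": d} for d in range(100)]  # stand-in for dated records
train, oos, oot = split_dataset(rows)
```

The out-of-time set answers a different question than the out-of-sample set: not "does the model generalize?" but "does it survive the passage of time?"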
  14. Pipeline & Pitfalls
     • Get data
     • Model & Evaluate
     • Deploy
       – Simpler algorithm = simpler deployment
       – For regression: only weights for variables
       – For more advanced models, usually a REST API (R shown):
         • https://cran.r-project.org/web/packages/AzureML/index.html
         • https://www.opencpu.org/
         • https://tensorflow.rstudio.com/tools/tfdeploy/articles/introduction.html
         • https://github.com/trestletech/plumber
         • ...
         • https://github.com/danaki/yshanka
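For the "regression = only weights" case, deployment really can be just a few lines that apply the exported coefficients. A minimal sketch; the intercept, weights, and feature names are made up for illustration:

```python
import math

# Illustrative logistic-regression coefficients exported from training;
# in a real deployment these come from the fitted model.
INTERCEPT = -1.2
WEIGHTS = {"age": 0.03, "monthly_income": 0.0004, "has_prior_loans": -0.5}

def score(features):
    """Apply the exported weights: linear term, then the logistic link."""
    z = INTERCEPT + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

p = score({"age": 40, "monthly_income": 2000, "has_prior_loans": 1})
```

Because the whole model is a dict of numbers, it can live in a config file and be re-implemented in any language the production stack uses, with no ML runtime at all.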
  15. Pipeline & Pitfalls
     • Deploy
       – Alternatively, a combined training + hosting tool, if the budget allows it and the cloud is not an issue
       – Budget: in the 10,000s to 100,000s
  16. Pipeline & Pitfalls
     • Get data
     • Model & Evaluate
     • Deploy
     • Maintain
       – Test data -> population must be the same!
       – Test model -> track output, performance
       – Challenge model -> update the model and challenge it
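The "population must be the same" check above can be automated with a drift statistic; one common choice (not named in the talk) is the Population Stability Index over binned score distributions. A pure-Python sketch with illustrative bin values:

```python
import math

def psi(expected, actual):
    """Population Stability Index between two binned distributions
    (lists of fractions summing to 1). Common rule of thumb:
    < 0.1 stable, 0.1-0.25 some drift, > 0.25 significant drift."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected, actual) if e > 0 and a > 0)

# Score distribution at training time vs. what production sees now.
train_bins = [0.25, 0.25, 0.25, 0.25]
prod_bins = [0.30, 0.27, 0.23, 0.20]
drift = psi(train_bins, prod_bins)
```

Running this on a schedule against the live scoring log turns the "maintain" step into an alert instead of a surprise.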
  17. ML in production
     [image: Voyager 1 spacecraft, https://en.wikipedia.org/wiki/Voyager_1#/media/File:Voyager_spacecraft.jpg]
  18. Case studies
  19. Case: a call centre
     Setup:
     • A company that connects short-term employees with employers
     • Data on several thousand calls provided, mainly contact data and an indication of whether the person accepted the employment offer
     • Q: Who do we call?
     Result:
     • A model with AUC 0.8

       model output   took-the-job rate   calls base
       0              8%                  49%
       10             16%                 20%
       20             20%                 12%
       30             28%                 8%
       40             36%                 5%
       50             42%                 3%
       60             50%                 2%
       70             62%                 1%
       80             67%                 0%
       90             71%                 0%
       overall        16%                 100%

       [chart: accepted-offer rate vs. model score]
  20. Case: student performance review
     Setup:
     • A company that records and keeps all student marks throughout the year
     • Data on several thousand marks provided
     • The idea: a model to predict the year's final mark for each subject
     • Q: What will the year-end mark be for each student and subject?
     Result:
     • A very simple model, 5% MAE

       [chart: predicted vs. actual marks]

       month   prediction error
       1       12%
       2       11%
       3       10%
       4       8%
       5       8%
       6       7%
       7       7%
       8       6%
       9       5%
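The 5% figure above is a mean absolute error (MAE). For reference, a sketch of how MAE is computed; the marks below are invented, not the talk's data:

```python
def mae(predicted, actual):
    """Mean absolute error: average absolute gap between
    predicted and actual values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Illustrative year-end marks on a 1-10 scale.
predicted = [7.2, 6.8, 9.1, 5.5]
actual = [7.0, 7.0, 9.0, 6.0]
err = mae(predicted, actual)
```

MAE is easy to explain to the business side ("we miss by a quarter of a mark on average"), which is part of why a very simple model with a clear error number can beat a fancier one in practice.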
  21. Case: credit scoring model for an online lender
     Setup:
     • A company that issues loans in an EU country
     • Data on several thousand loans provided
     • Q: Will a customer default?
     Result:
     • An advanced ensemble machine learning pipeline yielded a mere 2% gain over a logistic regression model

       [chart: AUC of the ensemble machine learning pipeline vs. logistic regression]
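AUC, the metric used to compare the ensemble and the logistic model, can be computed from ranks alone: it is the probability that a randomly chosen positive is scored above a randomly chosen negative. A pure-Python sketch with invented labels and scores:

```python
def auc(labels, scores):
    """AUC via the rank (Mann-Whitney) formulation: fraction of
    positive/negative pairs where the positive outranks the negative,
    counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]  # e.g. an ensemble's scores
scores_b = [0.9, 0.6, 0.4, 0.7, 0.5, 0.2]  # e.g. a logistic model's scores
auc_a = auc(labels, scores_a)
auc_b = auc(labels, scores_b)
```

Because AUC depends only on the ordering of scores, a simpler model that ranks customers almost as well will show almost the same AUC, which is exactly the slide's point.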
  22. Case: Will a customer deposit funds?
     Setup:
     • A company that trades currency
     • Data on several hundred thousand user agent strings provided
     • Q: Will a customer deposit funds to their account?
     Result:
     • An ensemble machine learning model learned to separate those who will deposit:

       Score   Deposited   Count total
       0       3%          7110
       10      13%         800
       20      16%         341
       30      25%         159
       40      25%         80
       50      43%         23
       60      40%         10
       70      67%         3
       80      100%        2
       90      100%        1
  23. Case: Is a transaction fraudulent?
     Setup:
     • A Kaggle dataset with fraudulent transactions from https://www.kaggle.com/dalpozz/creditcardfraud
     • Epistatica's learning pipeline
     • Q: Can we build an unsupervised model?
     Result:
     • AUC 0.75 on the Kaggle data (0.6 hit rate with a 0.3 false-positive rate)
     Validated:
     • On data from one of the top consulting companies (AUC 0.7)
     • On payment provider data (AUC 0.8)

       group size   fraud rate
       67%          0.1%
       33%          0.3%
       (fraud-rate difference: 302%)
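The slides don't say which unsupervised technique was used. As one simple illustration of the idea, unlabeled transactions can be scored by how far they sit from the bulk of the data, with no fraud labels needed; the amounts below are invented:

```python
import math

def fit_stats(values):
    """Mean and (population) standard deviation of a 1-D feature."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var)

def anomaly_score(value, mean, std):
    """Distance from the mean in standard deviations: an unsupervised
    score, since it uses no fraud labels at all."""
    return abs(value - mean) / std if std else 0.0

# Transaction amounts: mostly small, one outlier.
amounts = [12.0, 15.0, 9.0, 14.0, 11.0, 13.0, 500.0]
mean, std = fit_stats(amounts)
scores = [anomaly_score(a, mean, std) for a in amounts]
flagged = [a for a, s in zip(amounts, scores) if s > 2]
```

Production pipelines would use a multivariate method (e.g. an isolation forest or autoencoder) over many features, but the principle is the same: rank transactions by how atypical they look, then review the top of the list.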
