
Having impact using Machine Learning in the real world



Having worked on machine-learning projects for several companies, I could compare the impact of each project and identify the key success factors for an impactful machine learning project. Here's my take on the subject.

Published in: Technology


  1. Having impact using Machine Learning in the real world: a 10-step guide for the data scientist to be successful in a large company
  2. - How many of you work in a large corporation?
     - If you’re a data scientist, how long does it take before your model is used by business users and has a positive ROI?
  3. Bio of a “business” data scientist
     3.5 years in consulting on data science projects, among which:
     - Fraud detection in public services (6 months)
     - Quality anomaly detection in manufacturing (7 months)
     - Online topics monitoring for industry (3 months)
     - Predictive marketing for commercial real estate (3 months)
     - Sales prediction for pharmaceuticals (3 months)
     => A wide spectrum of industries is interested in ML and DS
     => ML projects can be impactful even when short-term; they need not be long-term
     (Slide label: “My most impactful ML project”)
  4. Fraud detection in a large public institution: context
     Amount of fraud in 2013: 41 M€, 100 M€, 400 M€ (chart labels: Total (est.), Detected, Prevented)
     Digital revolution and dematerialization
     New risks:
     - Applications from abroad (VPN…)
     - Feeling of impunity behind the computer
     - Simplicity of counterfeiting
     Opportunities:
     - Combining different databases to cross-check
     - Tracking changes of situations
     - Use of technologies like Machine Learning
     Respecting 2 core values: equity in processing each application, and public service money should be spent correctly
  5. Fraud detection in a large public institution: impact of using Machine Learning
     Confidentiality: cannot communicate on actual numbers
     Summary: developed a Machine Learning model to predict a specific type of fraud
     - 40% precision vs 2-3% for the rule-based benchmark
     - Industrialized; runs each month on 100k new cases
     - 150 inspectors use the alerts produced by the model on a daily basis
     - 3 data scientists were upskilled and now both re-train and upgrade the model; business managers were also upskilled on ML concepts
     - More budget earned for new Machine Learning projects
     Data science is R&D, yes, but some key components drastically increase the chance of success of an ML project
  6. Preamble: 2 types of data scientists
     1. The Deep Learning engineer: research scientist, reads ML books, has enough data, knows his problem, nobody else relies critically on him
     2. The other data scientist: has a project for a business line, builds Tableau dashboards and dreams about doing real ML…
  7. Step 1: Hybrid team around the data scientist(s)
     The DS should be surrounded by 2 functions as a prerequisite:
     1. Data ingestion team/IT: needed to understand the data
     2. Business experts: needed to understand the domain expertise
     3. (Optional, but necessary for long-term projects) IT/dev: needed to put the model in production
     In so many projects, data scientists are isolated from business experts. You cannot discover everything by yourself.
     Data team: the usual problem is that nobody can explain why there is this weird value in the data
  8. Step 2: Clear/simple business problem statement
     Because it will:
     1. Ease data extraction and modelling
     2. Tell you what to expect for the ROI
     3. Clarity = mathematical validation = higher chance to succeed in modelling
     Let’s try: “Predict fraud risk today”
     1. You have more data on applicants who are already receiving benefits than on new ones. Using missing values for the new ones as a solution? Does not seem convincing...
     2. The person got a job or moved to another town: can the person become a fraudster? Has fraud happened before or during these changes? For this type of fraud, it happened before
     3. Clarify/simplify
     Conclude with an adequate problem statement: predicting fraud risk at the moment you receive a new application is a clear business problem.
  9. Step 3: Data understanding and accessibility
     Nobody is going to create a dataset of true negatives for you :) => dig into the data with experts and data engineers and find a way
     Eliminate future spurious correlations and noise, foresee target leakage, understand which features can be relevant and eliminate the others (metadata…)
     It’s certain you will need to extract your data again, again, and again, as you will:
     - change the initial population
     - use additional features, remove redundant features
     => need for relatively easy access to data
  10. Step 4: Data preparation. Be bold
      Transformation of data:
      - Categorical variables with more than 100 modalities (distinct categories) => 100 binary columns? Why not
      - Dates => durations (age, time intervals…)
      - Imputing missing values
      - Time series: flatten by aggregating...
      Data enrichment:
      - Public data: demographic data from regions
      - Web scraping: reference list of bank names and types related to numbers in the SWIFT code
      - Graph features: number of people sharing a phone number, address, bank account, connected to fraudsters...
      Figure captions:
      - Screenshot of the graph of applicants and entities. Red nodes are existing fraud applications
      - Use demographic information to remove region bias
      - Reference list of banks acting in France and the corresponding SWIFT codes they deliver
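Two of the preparation ideas on this slide (dates turned into durations, and a graph-style count of applicants sharing a phone number) can be sketched in a few lines. This is only an illustration: the records, field names, and the helper `age_in_years` are invented for the example and do not come from the project.

```python
from collections import Counter
from datetime import date

# Hypothetical applications; every value here is made up for illustration.
applications = [
    {"id": "A1", "birth_date": date(1980, 5, 17), "filed_on": date(2013, 3, 1), "phone": "0101"},
    {"id": "A2", "birth_date": date(1990, 1, 10), "filed_on": date(2013, 6, 15), "phone": "0202"},
    {"id": "A3", "birth_date": date(1975, 12, 31), "filed_on": date(2013, 12, 31), "phone": "0101"},
]


def age_in_years(birth_date, on_date):
    """Turn a date into a duration: completed years between two dates."""
    years = on_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in on_date's year.
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


# Graph-style feature: how many applications use the same phone number
# (the count includes the application itself).
phone_counts = Counter(app["phone"] for app in applications)

features = [
    {
        "id": app["id"],
        "age_at_filing": age_in_years(app["birth_date"], app["filed_on"]),
        "n_sharing_phone": phone_counts[app["phone"]],
    }
    for app in applications
]

for row in features:
    print(row)
```

The same counting pattern would apply to addresses or bank accounts; a full graph library is only needed once features depend on longer paths, such as being connected to a known fraudster.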
  11. Step 5: Modeling. Be lazy
      Random grid search? Random algos? Please... => Strong algo, good initial parameters, favor early stopping, limit the number of features at first...
      Create a strong benchmark fast, so that you can industrialize a good model. Then, when everything is done, think about potential R&D to improve the model (done is better than perfect)
      Use an adequate business-related metric: precision vs recall. Which is better: 100 predictions of fraud where 90 are correct, or 1000 predictions of fraud where 200 are correct?
      Also: which threshold for the probability of risk? 0.5?
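The precision question on this slide is simple arithmetic, and the threshold question can be explored the same way. The sketch below computes precision for the two scenarios from the slide, then shows how the number of alerts and their precision move with the threshold on a toy set of scores. The scores, labels, and function names are invented for illustration. Recall cannot be computed for the slide's two scenarios because the total number of frauds is not given.

```python
def precision(true_positives, predicted_positives):
    """Share of fraud predictions that turn out to be real frauds."""
    return true_positives / predicted_positives


# The two scenarios from the slide.
print(precision(90, 100))    # 0.9: few alerts, almost all correct
print(precision(200, 1000))  # 0.2: more frauds caught, but 800 wasted inspections


def alerts_at_threshold(scores, labels, threshold):
    """Return (number of alerts, true frauds among them, precision)."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    n_alerts = len(flagged)
    n_true = sum(flagged)
    prec = n_true / n_alerts if n_alerts else 0.0
    return n_alerts, n_true, prec


# Toy model outputs (hypothetical): predicted risk and actual outcome (1 = fraud).
scores = [0.95, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.9, 0.5, 0.25):
    print(threshold, alerts_at_threshold(scores, labels, threshold))
```

Lowering the threshold catches more frauds but sends more false alerts to inspectors, so the right value depends on how much inspection capacity the business has, not on 0.5 being a default.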
  12. Step 6: Predicting - where the mess happens
      You will probably make bad predictions at first, even if your model looks good on the training set
      - Predict on a new sample and share the predictions with experts. They have limited time, so be sure not to waste it
      - Take the sample with the most impact: recent data, large possible frauds
      - Use their feedback to enrich your labelled data
  13. Step 7: Fine-tune, remove biased features
      Beware of data drift problems
      Envision future problems (new modality in a categorical feature…)
      Remove bias in your model and features:
      - Remove region (avoid learning on the past and focus on untimed patterns)
      - Remove financial institution (new players in banking appear each year and fraudsters use them a lot)
      Figure caption: The model learnt from the region feature, and replicates the learnt pattern on new predictions
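One common way to guard against a new modality appearing in a categorical feature is to map anything unseen (or too rare) at training time to a catch-all bucket. The slides do not say this is what the project did (for the financial institution feature, the slide says it was removed), so treat the following as a generic sketch; the bank names, the `__other__` label, and the `min_count` cutoff are all hypothetical.

```python
from collections import Counter

OTHER = "__other__"  # hypothetical catch-all label for unseen or rare values


def fit_known_categories(training_values, min_count=2):
    """Keep only categories seen at least `min_count` times in training."""
    counts = Counter(training_values)
    return {value for value, n in counts.items() if n >= min_count}


def encode_category(value, known):
    """Map unseen or rare values to the catch-all bucket instead of crashing."""
    return value if value in known else OTHER


training_banks = ["BankA", "BankA", "BankB", "BankB", "BankC"]
known = fit_known_categories(training_banks)

# "BankC" was too rare in training; "NeoBankX" did not exist at training time.
encoded = [encode_category(bank, known) for bank in ["BankA", "BankC", "NeoBankX"]]
print(encoded)  # ['BankA', '__other__', '__other__']
```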
  14. Step 8: Encourage adoption through interpretation
      Artificial Intelligence is scary: people don’t understand it, and it is often perceived as replacing jobs rather than enhancing people’s capacities => You will face resistance
      Make AI human-friendly :) Put as much effort as you can into interpreting your model and your predictions
      - Feature importance, partial dependence
      - Interpretation of predictions
      - Rediscover the experts’ actual rules of thumb in the model
      The field is cutting-edge for now; there is no standard way to interpret a model
      Figure captions:
      - Most and least important features
      - 3 most important features for each prediction of high risk of fraud (ordered from most risky to less risky)
  15. Step 9: Don’t forget a clean industrialization
      Befriend the IT folks so your model won’t take a year to be deployed
      - Write clean and reusable code in a pipeline: comment/document
      - Save both the preprocessing code and the model in a file that can run standalone on new data
      - Write code that can re-train using new data; think of a weighting strategy for new data (putting more weight on new data for instance, as fraud practices evolve quickly), and think of when to retrain the model (every x days, every month…)
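As an illustration of "putting more weight on new data", one simple option is an exponential decay on the age of each case. The slides do not specify the weighting scheme used in the project, so the half-life formulation, the 180-day default, and the function name `recency_weight` below are assumptions made for this sketch.

```python
from datetime import date, timedelta


def recency_weight(case_date, today, half_life_days=180):
    """Weight that halves every `half_life_days` of case age (hypothetical scheme)."""
    age_days = (today - case_date).days
    return 0.5 ** (age_days / half_life_days)


today = date(2014, 1, 1)
case_dates = [today, today - timedelta(days=180), today - timedelta(days=360)]

weights = [recency_weight(d, today) for d in case_dates]
print(weights)  # [1.0, 0.5, 0.25]
```

Weights like these could then be passed as per-row sample weights to a training routine that supports them. The half-life itself becomes one more parameter to validate, ideally on the most recent data.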
  16. Step 10: Upskill the data science team
      - When you accept the offer for ML engineer at Google, can the team continue your work?
      - New data, new features (or features that don’t exist anymore), new production environment: these are things that make your model crash. Can somebody understand it?
  17. Thanks
      4 articles directly made from this project on:
      - Web scraping
      - Fine-tuning XGBoost
      - Interpreting ML black box models
      - Graph analysis
      Just joined DataRobot in Hong Kong, connect with me to talk about ML in the real world