
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, MLflow, and Jupyter

This talk describes migrating a large random forest classifier from scikit-learn to Spark's MLlib. We cut training time from 2 days to 2 hours, reduced failed runs, and track experiments better with MLflow.

Kount provides certainty in digital interactions like online credit card transactions. One of our scores uses a random forest classifier with 250 trees and roughly 100,000 nodes per tree. We used scikit-learn to train on 60 million samples, each containing over 150 features. The in-memory requirements exceeded 750 GB, training took 2 days, and the process was not robust to disruptions in our database or training execution.

To migrate the workflow to Spark, we built a 6-node cluster with HDFS, providing 1.35 TB of RAM and 484 cores. Using MLlib and parallelization, training time for our random forests is now under 2 hours. Training data stays in our production environment; previously, a deploy cycle was required to move locally developed code onto our training server. The new implementation uses Jupyter notebooks for remote development with server-side execution. MLflow tracks all input parameters, code, and the git revision number, while performance metrics and the model itself are retained as experiment artifacts.

The new workflow is robust to service disruption. Our training pipeline begins by pulling from a Vertica database. Originally this single connection took over 8 hours to complete, and any problem forced a restart. Using sqoop and multiple connections, we now pull the data in 45 minutes. The old technique used volatile storage and required re-pulling the data for each experiment; now we pull the data from Vertica once and reload it much faster from HDFS. While a significant undertaking, moving to the Spark ecosystem converted an ad hoc, hands-on training process into a fully repeatable pipeline that meets regulatory and business goals for traceability and speed.
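As a rough illustration of the target workflow described above (not the production code), the sketch below trains a Spark ML random forest on data reloaded from HDFS. The HDFS path, column names, and tree depth are assumptions; the 250-tree setting matches the model described in the talk.

```python
# Illustrative sketch only: train the random forest with Spark ML on data
# reloaded from HDFS. The path, column names, and maxDepth are assumptions;
# numTrees=250 matches the model described in the talk.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("fraud-rf-training").getOrCreate()

# Reload the training set from HDFS (pulled once from Vertica, then reused).
df = spark.read.parquet("hdfs:///fraud/training/observations")  # assumed path

label_col = "label"                                        # assumed label column
feature_cols = [c for c in df.columns if c != label_col]   # ~150 features

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
rf = RandomForestClassifier(
    labelCol=label_col,
    featuresCol="features",
    numTrees=250,   # 250 trees, as in the scikit-learn model being replaced
    maxDepth=20,    # illustrative; deep enough for trees with ~100k nodes
)

model = Pipeline(stages=[assembler, rf]).fit(df)
```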

Speaker: Josh Johnston


Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, MLflow, and Jupyter

  1. Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, MLflow, and Jupyter • Josh Johnston, Director of AI Science, josh.johnston@kount.com
  2. Overview • Model lifecycle • Our fraud-detecting model • Initial method with database and scikit-learn • Improved method with HDFS and Spark • Robust model governance
  3. Manage the model lifecycle • Modeling: configuration management, performance (speed), accuracy, validation • Governance questions: Which model are you using? How did you train it? How well does it work? After each answer: Why? • Science is repeatable • Reference: Microsoft. (2017, October 19). What is the Team Data Science Process? Retrieved March 26, 2019, from https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
  4. Our fraud-detecting model
  5. Kount protects digital innovations from… • Fraudulent Account Creation • Transaction/Payment Fraud • Account Takeover Fraud • Authentication Friction
  6. Evaluate transactions for fraud • Substantial throughput: 30-100 transactions per second • Low latency: 250 ms end-to-end system latency, ~15 ms for machine learning features and model
  7. Evaluate transactions for fraud
  8. (image-only slide; no transcribed text)
  9. Boost Technology™ Customer View • Fraud manager feedback: Approve an extra ~3K transactions and $1.2M USD per month • Reduced manual reviews by 200 hours/month • Reduced chargeback rate by 17% • Reduced manual reviews by 20% • Sleep better at night • Don't hear complaints from fraud team about review queue anymore
  10. Boost Technology™ Technical View • Feature engineering: 200 GB of precomputed data • Model: random forest, 250 trees, ~100k nodes per tree, ~1 GB serialized representation • Model training: ~150 features, ~60M observations
  11. Initial training with database and scikit-learn
  12. First approach gets to production (pipeline diagram) • Components: Analytics Database, Model Training Service, Network Storage • Steps: fetch observations, fetch lookups, lookup compute, train model (scikit-learn) • Artifacts: observation and lookup flat files, logging, pickled model • Timings shown: 1 hr, 8 hrs, 12 hrs, 16 hrs, 24 hrs (~2.5 days total) • Resources: 400 GB RAM, 1 TB into swap (a rough scikit-learn sketch of this pattern appears after the slide list)
  13. What works • Trains a high-value model
  14. What doesn’t work • Time-intensive • Errors force restarts since everything is held in memory (and swap) • Burdens production analytics database • Pickled model ties execution environment to training environment • Traceability provided by log files and manual documentation • Ad hoc experiments with little configuration control • Governance questions: Which model are you using? How did you train it? How well does it work? After each answer: Why?
  15. Improved training with HDFS and Spark
  16. Cluster for distributed computing • Dell hardware: 6 nodes, 484 vCores, 1.35 TB RAM • Cloudera Manager, Spark 2.4, mostly Python • HDFS: attached to 3 nodes, 171 TB usable space (an illustrative SparkSession sizing sketch appears after the slide list)
  17. Improved approach through cluster (pipeline diagram) • Components: Spark Cluster, Analytics Database, HDFS • Steps: sqoop data from the analytics database into HDFS, compute lookups, perform lookups, train model (Spark ML) • Supporting pieces: Luigi orchestration, MLflow tracking, logging, observation and lookup data, zipped MLeap model • Timings shown: 45 min, 2 hrs, 8 hrs (< 1/2 day total) (an MLeap export sketch appears after the slide list)
  18. Remote development with Jupyter • Most criticisms of notebooks are things you COULD do, not what you MUST do • Good development practices are independent of tools • Maturity spectrum (research → production): Jupyter notebook → PySpark application → Python packages, with version control (git) and automation
  19. What works • Faster • Failures restart in the middle • Reduces burden on production analytics database • Redesign experiments without penalty • MLeap decouples evaluation environment from training environment
  20. What still doesn’t work • Non-deterministic Spark ML behavior and errors • Spark pipelines rely on configurations that change based on input data
  21. Tools and Processes for Model Governance
  22. Tools and processes for governance • Governance questions: Which model are you using? How did you train it? How well does it work? After each answer: Why? • Solution components: data traceability; experiment, configuration, and accuracy traceability
  23–28. (image-only slides; no transcribed text)
  29. • Data pipelines with error handling • Repeatable and documented data transformations • Document parameters • Trace to code and data used • Record accuracy of selected and not-selected models • Store final model and configurations as artifacts • Governance questions: Which model are you using? How did you train it? How well does it work? After each answer: Why? (an MLflow logging sketch appears after the slide list)
  30. Conclusions
  31. Kount’s benefits from Spark/HDFS, Luigi, and MLflow • Faster • Failures can restart in the middle • Reduce burden on production analytics database • Redesign experiments without penalty • MLeap decouples evaluation environment from training environment • Governance questions: Which model are you using? How did you train it? How well does it work? After each answer: Why? (a Luigi sketch appears after the slide list)
  32. Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, MLflow, and Jupyter • Josh Johnston, Director of AI Science, josh.johnston@kount.com
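Slide 12's initial approach held the full training set in memory and pickled the resulting model. A rough sketch of that pattern, assuming a placeholder data file and column names rather than Kount's actual query or paths:

```python
# Sketch of the original in-memory scikit-learn approach from slide 12.
# The data source, column names, and output path are placeholders.
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# ~60M observations x ~150 features loaded into RAM (hundreds of GB at scale).
observations = pd.read_parquet("observations.parquet")     # placeholder source
X = observations.drop(columns=["label"])
y = observations["label"]

clf = RandomForestClassifier(n_estimators=250, n_jobs=-1)  # 250 trees, all cores
clf.fit(X, y)                                              # days of wall clock at full scale

# Pickling couples the serving environment to this exact Python/scikit-learn stack.
with open("fraud_rf.pkl", "wb") as f:
    pickle.dump(clf, f)
```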
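Slide 16 lists the cluster resources; the sketch below shows one way those figures might translate into Spark session settings. The executor counts and memory values are illustrative assumptions, not Kount's configuration.

```python
# Illustrative only: sizing a Spark session against a 6-node / 484-vCore /
# 1.35 TB cluster. These settings are assumptions, not the production config.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("fraud-rf-training")
    .config("spark.executor.instances", "60")    # ~10 executors per node
    .config("spark.executor.cores", "8")         # 60 x 8 = 480 of the 484 cores
    .config("spark.executor.memory", "16g")      # ~960 GB of the 1.35 TB total RAM
    .getOrCreate()
)
```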
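Slides 17 and 19 mention the zipped MLeap model that decouples the scoring environment from the training environment. A hedged sketch of exporting a fitted Spark ML pipeline as an MLeap bundle; it assumes the `model` and `df` names from the training sketch near the top, and the bundle path is illustrative.

```python
# Hedged sketch: export the fitted PipelineModel as a zipped MLeap bundle so
# scoring does not depend on the Spark training environment. Assumes `model`
# and `df` from the earlier training sketch; the bundle path is illustrative.
import mleap.pyspark                                            # noqa: F401
from mleap.pyspark.spark_support import SimpleSparkSerializer   # noqa: F401

model.serializeToBundle("jar:file:/tmp/fraud_rf.zip", model.transform(df))
```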
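Slide 29's traceability points map directly onto MLflow's tracking API. A minimal sketch, assuming hypothetical parameter names, a placeholder metric value, and illustrative artifact paths:

```python
# Minimal MLflow tracking sketch for the governance points on slide 29.
# Parameter names, the metric value, and artifact paths are illustrative.
import subprocess

import mlflow

git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
validation_auc = 0.0  # placeholder; in practice, computed on a held-out set

with mlflow.start_run():
    mlflow.set_tag("git_sha", git_sha)                    # trace to the exact code revision
    mlflow.log_param("num_trees", 250)                    # document training parameters
    mlflow.log_param("training_table", "observations")    # trace to the data used
    mlflow.log_metric("validation_auc", validation_auc)   # accuracy of this candidate model
    mlflow.log_artifact("fraud_rf.zip")                   # retain the serialized model itself
```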
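Slides 17 and 31 credit Luigi with the "failures can restart in the middle" property: each step declares an output target, and steps whose targets already exist are skipped on rerun. A minimal sketch with placeholder task bodies and file names, not the production pipeline:

```python
# Minimal Luigi sketch of a restartable pipeline (slides 17 and 31).
# Task bodies and file names are placeholders for the real pull/train steps.
import luigi


class PullTrainingData(luigi.Task):
    def output(self):
        return luigi.LocalTarget("observations.done")

    def run(self):
        # Real pipeline: sqoop the observations from Vertica into HDFS.
        with self.output().open("w") as out:
            out.write("pulled\n")


class TrainModel(luigi.Task):
    def requires(self):
        return PullTrainingData()   # skipped on rerun if its output already exists

    def output(self):
        return luigi.LocalTarget("model.done")

    def run(self):
        # Real pipeline: Spark ML training, MLeap export, MLflow logging.
        with self.output().open("w") as out:
            out.write("trained\n")


if __name__ == "__main__":
    luigi.build([TrainModel()], local_scheduler=True)
```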
