Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with ML, MLflow, and Jupyter
Josh Johnston
Director of AI Science
josh.johnston@kount.com
©Kount Inc All Rights Reserved
Overview
Model lifecycle
Our fraud-detecting model
Initial method with database and scikit-learn
Improved method with HDFS and Spark
Robust model governance
Manage the model lifecycle
Microsoft. (2017, October 19). What is the Team Data Science Process? Retrieved March 26, 2019, from
https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
Modeling
• Configuration management
• Performance (speed)
• Accuracy
• Validation
Governance Questions
• Which model are you using?
• How did you train it?
• How well does it work?
After each answer: Why?
Science is repeatable
Our fraud-detecting model
Kount protects digital innovations from…
Fraudulent
Account Creation
Transaction/
Payment Fraud
Account
Takeover Fraud
Authentication
Friction
Evaluate transactions for fraud
• Substantial throughput
• 30-100 transactions per second
• Low latency
• 250 ms end-to-end system latency
• ~15 ms for machine learning features and model
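As a rough back-of-the-envelope check on the figures above, the sketch below computes the share of the end-to-end budget taken by features and model, and the number of transactions in flight via Little's law (mean number in system = arrival rate × time in system). The function names are illustrative, and treating the full 250 ms as the time each transaction spends in the system is an assumption.

```python
def ml_budget_share(ml_ms: float, end_to_end_ms: float) -> float:
    """Fraction of the end-to-end latency budget used by features + model."""
    return ml_ms / end_to_end_ms


def in_flight(tps: float, latency_ms: float) -> float:
    """Little's law: mean transactions in the system = rate * time in system."""
    return tps * (latency_ms / 1000.0)


share = ml_budget_share(15, 250)   # ~15 ms out of a 250 ms budget
low = in_flight(30, 250)           # at 30 transactions per second
high = in_flight(100, 250)         # at 100 transactions per second

print(f"ML share of budget: {share:.0%}")
print(f"Transactions in flight: {low:.1f} to {high:.1f}")
```

Under these assumptions the model and its features use a small slice of the latency budget, and only a few dozen transactions are in the system at once even at the top of the stated range.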
Boost Technology™ Customer View
• Approve an extra ~3K transactions and $1.2M USD per month
• Reduced manual reviews by 200 hours/month
• Reduced chargeback rate by 17%
• Reduced manual reviews by 20%
Fraud Manager Feedback:
• Sleep better at night
• Don’t hear complaints from fraud team about review queue anymore
Boost Technology™ Technical View
Feature Engineering
• 200 GB of precomputed data
Model
• Random forest
• 250 trees
• ~100k nodes per tree
• ~1GB serialized representation
Model Training
• ~150 features
• ~60M observations
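To give a feel for the scale of these numbers, here is a small arithmetic sketch. It assumes "~1GB" means 10**9 bytes and that the training matrix is held as dense 64-bit floats; neither detail is stated on the slide, so treat the results as order-of-magnitude estimates only.

```python
TREES = 250
NODES_PER_TREE = 100_000            # "~100k nodes per tree"
SERIALIZED_BYTES = 1_000_000_000    # "~1GB", taken as 10**9 bytes (assumption)

total_nodes = TREES * NODES_PER_TREE
bytes_per_node = SERIALIZED_BYTES / total_nodes

FEATURES = 150                      # "~150 features"
OBSERVATIONS = 60_000_000           # "~60M observations"
# Dense float64 matrix, ignoring labels and any overhead (assumption).
matrix_gb = FEATURES * OBSERVATIONS * 8 / 1e9

print(f"{total_nodes:,} nodes, ~{bytes_per_node:.0f} bytes per node")
print(f"Dense float64 training matrix: ~{matrix_gb:.0f} GB")
```

Even before any copies made during training, a matrix of this size is too large for a typical single workstation, which is consistent with the memory pressure described for the first approach.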
Initial training with database and scikit-learn
First approach gets to production
[Diagram: training data flow across the Analytics Database, Model Training Service, and Network Storage. Steps: fetch observations, fetch lookups, lookup compute, train model (scikit-learn). Labels: Observation, Lookup, Flat File, Logging, Pickled Model. Step times shown: 16 hrs, 24 hrs, 8 hrs, 1 hr, 12 hrs. Totals shown: 2.5 days, 400 GB RAM, 1 TB into swap.]
What works
• Trains a high value model
What doesn’t work
• Time-intensive
• Errors force restarts since everything is held in memory (and swap)
• Burdens production analytics database
• Pickled model ties execution environment to training environment
• Traceability provided by log files and manual documentation
• Ad hoc experiments with little configuration control
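The pickle coupling mentioned above can be illustrated with a minimal sketch. A pickle stores references to the classes that produced it, so loading generally needs a compatible Python and library stack. The helper names below are hypothetical, and a plain dict stands in for a fitted estimator; the point is only that the training environment has to be recorded and checked by hand.

```python
import io
import pickle
import platform
import sys


def dump_with_env(model, fh) -> None:
    """Pickle the model together with a record of the training environment."""
    env = {
        "python": platform.python_version(),
        "pickle_protocol": pickle.HIGHEST_PROTOCOL,
    }
    pickle.dump({"env": env, "model": model}, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_checked(fh):
    """Load the model, refusing if the major.minor Python version differs."""
    payload = pickle.load(fh)  # only unpickle data from a trusted source
    trained = payload["env"]["python"].split(".")[:2]
    running = [str(sys.version_info.major), str(sys.version_info.minor)]
    if trained != running:
        raise RuntimeError(f"trained on Python {trained}, running {running}")
    return payload["model"]


# A plain dict stands in for a fitted estimator (hypothetical).
buf = io.BytesIO()
dump_with_env({"n_trees": 250}, buf)
buf.seek(0)
restored = load_checked(buf)
print(restored)
```

A real check would also need to cover library versions, which is part of why a pickled model is awkward to serve from a separately maintained environment.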
Governance Questions
• Which model are you using?
• How did you train it?
• How well does it work?
After each answer: Why?
Improved training with HDFS and Spark
Cluster for distributed computing
• Dell hardware
• 6 nodes
• 484 vCores
• 1.35 TB RAM
• Cloudera Manager
• Spark 2.4
• Mostly Python
HDFS
• Attached to 3 nodes
• 171 TB usable space
Improved approach through cluster
[Diagram: data moves from the Analytics Database to HDFS via Sqoop and is processed on the Spark cluster. Steps: compute lookups, perform lookups, train model (Spark ML). Labels: Observation, Lookup, Logging, Zipped MLeap Model, MLflow, Luigi. Step times shown: 45 min, 2 hrs, 8 hrs. Total shown: <1/2 day.]
Remote development with Jupyter
• Most criticisms of notebooks are things you COULD do, not what you MUST do
• Good development practices are independent of tools
[Diagram: a maturity axis running from Research to Production, showing Jupyter Notebook, PySpark Application, Python Packages, Version Control (git), and Automation.]
What works
• Faster
• Failures restart in the middle
• Reduces burden on production analytics database
• Redesign experiments without penalty
• MLeap decouples evaluation environment from training environment
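The "failures restart in the middle" property can be sketched with the standard library alone. Workflow tools such as Luigi treat a task as complete once its output exists, so a rerun skips finished work. The code below imitates that idea with JSON files; it does not use Luigi's API, and the step names and toy data are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def run_step(name, func, workdir: Path, log: list):
    """Run func only if its output file is missing, then return the stored result."""
    target = workdir / f"{name}.json"
    if not target.exists():
        log.append(name)                   # record each step that actually executes
        tmp = target.with_suffix(".tmp")
        tmp.write_text(json.dumps(func()))
        tmp.replace(target)                # publish only a finished output
    return json.loads(target.read_text())


def pipeline(workdir: Path, log: list, fail_in_training: bool = False):
    obs = run_step("observations", lambda: [1, 2, 3], workdir, log)
    lookups = run_step("lookups", lambda: {"scale": 10}, workdir, log)

    def train():
        if fail_in_training:
            raise RuntimeError("simulated training failure")
        return {"weights": [x * lookups["scale"] for x in obs]}

    return run_step("model", train, workdir, log)


with tempfile.TemporaryDirectory() as d:
    workdir, first, second = Path(d), [], []
    try:
        pipeline(workdir, first, fail_in_training=True)
    except RuntimeError:
        pass
    model = pipeline(workdir, second)      # rerun after the failure
    print(first, second, model)
```

On the rerun only the failed training step executes, because the observation and lookup outputs are already on disk. Holding everything in memory, as in the first approach, offers no such resume point.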
What still doesn’t work
• Non-deterministic Spark ML behavior and errors
• Spark pipelines rely on configurations that change based on input data
Tools and Processes for Model Governance
Tools and processes for governance
Governance Questions
• Which model are you using?
• How did you train it?
• How well does it work?
After each answer: Why?
Solution components
• Data traceability
• Experiment, configuration, and accuracy traceability
• Data pipelines with error handling
• Repeatable and documented data transformations
• Document parameters
• Trace to code and data used
• Record accuracy of selected and not selected models
• Store final model and configurations as artifact
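One way to picture the record implied by the bullets above is a single structured entry per training run. The sketch below builds one with the standard library; it is a stand-in for what a tracking tool such as MLflow would store, not MLflow's API. The function names, parameter values, accuracy numbers, and the commit placeholder are all made up for illustration.

```python
import hashlib
import json


def fingerprint(data: bytes) -> str:
    """Content hash so the exact training data can be identified later."""
    return hashlib.sha256(data).hexdigest()


def build_run_record(params, data, code_version, candidates):
    """Collect parameters, data/code identity, and metrics for every candidate."""
    selected = max(candidates, key=lambda c: c["accuracy"])
    return {
        "params": params,
        "data_sha256": fingerprint(data),
        "code_version": code_version,
        "candidates": candidates,           # selected and rejected models alike
        "selected": selected["name"],
    }


record = build_run_record(
    params={"num_trees": 250, "seed": 42},           # illustrative values
    data=b"observation_id,label\n1,0\n2,1\n",        # stand-in for the real dataset
    code_version="<git commit hash>",                 # placeholder, not a real commit
    candidates=[
        {"name": "rf_250", "accuracy": 0.91},         # made-up numbers
        {"name": "rf_100", "accuracy": 0.89},
    ],
)
print(json.dumps(record, indent=2, sort_keys=True))
```

A record like this answers the three governance questions directly: the selected name says which model is in use, the parameters plus code and data identifiers say how it was trained, and the candidate metrics say how well it works and why it was chosen over the alternatives.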
Governance Questions
• Which model are you using?
• How did you train it?
• How well does it work?
After each answer: Why?
Conclusions
Kount’s benefits from Spark/HDFS, Luigi, and MLflow
• Faster
• Failures can restart in the middle
• Reduce burden on production analytics database
• Redesign experiments without penalty
• MLeap decouples evaluation environment from training environment
Governance Questions
• Which model are you using?
• How did you train it?
• How well does it work?
After each answer: Why?

Population Growth in Bataan: The effects of population growth around rural pl...
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 

Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, MLflow, and Jupyter

  • 1. Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with ML, MLflow, and Jupyter. Josh Johnston, Director of AI Science, josh.johnston@kount.com
  • 2. ©Kount Inc All Rights Reserved. Overview: Model lifecycle; Our fraud-detecting model; Initial method with database and scikit-learn; Improved method with HDFS and Spark; Robust model governance
  • 3. Manage the model lifecycle. Modeling: Configuration management; Performance (speed); Accuracy; Validation. Governance questions: Which model are you using? How did you train it? How well does it work? After each answer: Why? Science is repeatable. Source: Microsoft. (2017, October 19). What is the Team Data Science Process? Retrieved March 26, 2019, from https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
  • 5. Kount protects digital innovations from… Fraudulent Account Creation; Transaction/Payment Fraud; Account Takeover Fraud; Authentication Friction
  • 6. Evaluate transactions for fraud. Substantial throughput: 30-100 transactions per second. Low latency: 250 ms end-to-end system latency; ~15 ms for machine learning features and model
  • 7. Evaluate transactions for fraud
  • 9. Boost Technology™ Customer View. Fraud manager feedback: Approve an extra ~3K transactions and $1.2M USD per month; Reduced manual reviews by 200 hours/month; Reduced chargeback rate by 17%; Reduced manual reviews by 20%; Sleep better at night; Don’t hear complaints from fraud team about review queue anymore
  • 10. Boost Technology™ Technical View. Feature engineering: 200 GB of precomputed data. Model: Random forest; 250 trees; ~100k nodes per tree; ~1 GB serialized representation. Model training: ~150 features; ~60M observations
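The approximate figures on slide 10 can be turned into a quick sizing estimate. The sketch below is only back-of-envelope arithmetic: it assumes "~1 GB" means 10^9 bytes and, purely for illustration, that each feature value is stored as an 8-byte float. Neither assumption comes from the slides.

```python
# Back-of-envelope sizing from the approximate figures on slide 10.
TREES = 250
NODES_PER_TREE = 100_000        # "~100k nodes per tree"
SERIALIZED_BYTES = 10**9        # "~1GB", taken as 10^9 bytes (assumption)
FEATURES = 150                  # "~150 features"
OBSERVATIONS = 60_000_000       # "~60M observations"
BYTES_PER_VALUE = 8             # assumption: one float64 per feature value

# Total number of tree nodes in the forest.
total_nodes = TREES * NODES_PER_TREE

# Average serialized size of one node.
bytes_per_node = SERIALIZED_BYTES / total_nodes

# Size of a dense training matrix under the float64 assumption, in GB.
training_matrix_gb = FEATURES * OBSERVATIONS * BYTES_PER_VALUE / 10**9

print(f"total nodes:        {total_nodes:,}")
print(f"bytes per node:     {bytes_per_node:.0f}")
print(f"dense matrix (GB):  {training_matrix_gb:.0f}")
```

Under these assumptions the forest has about 25 million nodes at roughly 40 bytes each, and the raw training matrix alone is on the order of tens of gigabytes before any intermediate copies.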
  • 11. Initial training with database and scikit-learn
  • 12. First approach gets to production. Diagram labels: Analytics Database; Model Training Service; Network Storage; Fetch observations; Fetch lookups; Lookup compute; Observation; Lookup; Flat File Logging; Pickled Model; Train Model (scikit-learn). Time: 16 hrs, 24 hrs, 8 hrs, 1 hr, 12 hrs (2.5 days in total). Memory: 400 GB RAM, 1 TB into swap
  • 13. What works: Trains a high-value model
  • 14. What doesn’t work: Time-intensive; Errors force restarts since everything is held in memory (and swap); Burdens production analytics database; Pickled model ties execution environment to training environment; Traceability provided by log files and manual documentation; Ad hoc experiments with little configuration control. Governance questions: Which model are you using? How did you train it? How well does it work? After each answer: Why?
  • 16. Cluster for distributed computing: Dell hardware; 6 nodes; 484 vCores; 1.35 TB RAM; Cloudera Manager; Spark 2.4; Mostly Python. HDFS: Attached to 3 nodes; 171 TB usable space
  • 17. Improved approach through cluster. Diagram labels: Analytics Database; sqoop data; HDFS; Spark Cluster; Compute lookups; Perform lookups; Observation; Lookup; Logging; Train Model (Spark ML); Zipped MLeap Model; MLflow; Luigi. Time: 45 min, 2 hrs, 8 hrs (<1/2 day in total)
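The durations on slides 12 and 17 can be checked against the totals those slides state. The sketch below assumes the listed durations are sequential stage times that simply add up. The flattened diagrams do not say which duration belongs to which stage, so no such mapping is attempted here.

```python
# Durations as listed on slide 12 (first approach) and slide 17 (improved approach).
# Assumption: the stages run one after another, so their times add up.
FIRST_APPROACH_HOURS = [16, 24, 8, 1, 12]
IMPROVED_APPROACH_HOURS = [0.75, 2, 8]   # 45 min expressed as 0.75 hr

first_total = sum(FIRST_APPROACH_HOURS)
improved_total = sum(IMPROVED_APPROACH_HOURS)

first_days = first_total / 24
improved_days = improved_total / 24
speedup = first_total / improved_total

print(f"first approach:    {first_total} hr ({first_days:.2f} days)")
print(f"improved approach: {improved_total} hr ({improved_days:.2f} days)")
print(f"speedup:           {speedup:.1f}x")
```

Under that assumption the sums agree with the slides' own totals of about 2.5 days and under half a day.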
  • 18. Remote development with Jupyter. Most criticisms of notebooks are things you COULD do, not what you MUST do. Good development practices are independent of tools. Diagram: Jupyter Notebook, PySpark Application, and Python Packages along a maturity axis from Research to Production; Version Control (git); Automation
  • 19. What works: Faster; Failures restart in the middle; Reduces burden on production analytics database; Redesign experiments without penalty; MLeap decouples evaluation environment from training environment
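"Failures restart in the middle" on slide 19 is the behavior a workflow tool gives you when every stage writes its output to a durable target and is skipped on rerun if that target already exists. The sketch below illustrates the idea with the Python standard library only. It is not Luigi's API, and the stage names, toy data, and `fail_at` switch are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_stage(name, func, workdir, log):
    """Run func only if this stage's output file is missing; otherwise reuse it."""
    target = workdir / f"{name}.json"
    if target.exists():
        log.append(f"skip {name}")
        return json.loads(target.read_text())
    result = func()
    # Write to a temporary file and rename, so a crash never leaves a partial target.
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(result))
    tmp.replace(target)
    log.append(f"run {name}")
    return result


def run_pipeline(workdir, log, fail_at=None):
    """Three toy stages; fail_at simulates a crash before the named stage runs."""
    def stage(name, func):
        if name == fail_at:
            raise RuntimeError(f"simulated failure in {name}")
        return run_stage(name, func, workdir, log)

    observations = stage("fetch_observations", lambda: [1, 2, 3])
    lookups = stage("compute_lookups", lambda: {str(x): x * 10 for x in observations})
    return stage(
        "train_model",
        lambda: {"n_rows": len(observations), "n_lookups": len(lookups)},
    )


workdir = Path(tempfile.mkdtemp())
log = []
try:
    run_pipeline(workdir, log, fail_at="train_model")   # first attempt crashes late
except RuntimeError:
    pass
model = run_pipeline(workdir, log)                      # rerun resumes at train_model
print(log)
```

On the second call the two completed stages are skipped and only the failed stage runs, which is the contrast with the first approach, where everything was held in memory and an error forced a full restart.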
  • 20. What still doesn’t work: Non-deterministic Spark ML behavior and errors; Spark pipelines rely on configurations that change based on input data
  • 21. Tools and Processes for Model Governance
  • 22. Tools and processes for governance. Governance questions: Which model are you using? How did you train it? How well does it work? After each answer: Why? Solution components: Data traceability; Experiment, configuration, and accuracy traceability
  • 29. Data pipelines with error handling; Repeatable and documented data transformations; Document parameters; Trace to code and data used; Record accuracy of selected and not selected models; Store final model and configurations as artifact. Governance questions: Which model are you using? How did you train it? How well does it work? After each answer: Why?
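The practices on slide 29 amount to keeping one record per training run that ties parameters, code version, data, and accuracy together. The sketch below shows one possible shape for such a record using only the Python standard library. It is a stand-in for what a tracking tool such as MLflow stores, not MLflow's API. The field names, the `hypothetical-git-sha` value, the toy CSV, and the metric values are all made-up placeholders, not results from the talk.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of_file(path):
    """Fingerprint the training data so a model can be traced to its exact input."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_record(data_path, code_version, candidates):
    """Record every candidate, selected or not, plus the code and data used."""
    selected = max(candidates, key=lambda c: c["metric"])
    return {
        "data_sha256": sha256_of_file(data_path),
        "code_version": code_version,
        "candidates": candidates,
        "selected_params": selected["params"],
    }


workdir = Path(tempfile.mkdtemp())
data_path = workdir / "training.csv"
data_path.write_text("amount,is_fraud\n10.0,0\n999.0,1\n")   # toy placeholder data

# Placeholder candidates: parameters and metrics are illustrative only.
candidates = [
    {"params": {"num_trees": 100, "max_depth": 10}, "metric": 0.91},
    {"params": {"num_trees": 250, "max_depth": 20}, "metric": 0.94},
]

record = build_run_record(data_path, "hypothetical-git-sha", candidates)

# Store the record next to the model as an artifact.
record_path = workdir / "run_record.json"
record_path.write_text(json.dumps(record, indent=2, sort_keys=True))
print(record_path.read_text())
```

A record like this answers the three governance questions directly: which model (selected parameters), how it was trained (code version and data fingerprint), and how well it works (metrics for the selected and the rejected candidates).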
  • 31. Kount’s benefits from Spark/HDFS, Luigi, and MLflow: Faster; Failures can restart in the middle; Reduce burden on production analytics database; Redesign experiments without penalty; MLeap decouples evaluation environment from training environment. Governance questions: Which model are you using? How did you train it? How well does it work? After each answer: Why?