At Intuit, we have a lot of data – and a lot of duplicate data collected over decades. So we built a rule-based, self-serve tool to identify and merge duplicate records. Getting deduplication just right for hundreds of millions of records takes experimentation and iteration, and spreadsheet-based tracking just wasn’t enough. We now use MLflow to automatically capture execution notes, rule settings, weights, key validation metrics, and more, all without requiring end-user action. In this talk, we’ll cover our use case and why MLflow is useful outside its traditional MLOps role.
Misusing MLflow To Help Deduplicate Data At Scale
1. MLflow + analysts:
“misusing” MLflow to help
deduplicate data at scale
Maya Livshits, SWE (@mayalivshits)
Robin Oliva-Kraft, Product Manager (@robinkraft)
Intuit
3. MISSION: Powering Prosperity Around the World
Intuit Confidential and Proprietary
4. AI-DRIVEN EXPERT PLATFORM
ONE ECOSYSTEM
Data and AI help us deliver to customers the benefits of More Money, No Work, and Complete Confidence.
6. Why so many duplicates?
[Diagram: the same company shows up under several records – “Imagination Inc.” (Payroll), “Imagination, Inc.” (QuickBooks), and “Imagination Incorporated” (Vendor), alongside “Cookies Inc.” (QuickBooks) – and the same person shows up as a Mint user, a Payroll user, a company owner, and a Credit Karma member.]
11. Before: 4 data sources, then manual data entry
[Diagram: run information scattered across Notebooks, S3, IMs, and Databricks.]
12. Ad hoc, manual tracking is ...
It’s hard to improve if you don’t have a clear record of what you’ve tried
Incomplete
• Even if you ARE tracking, not every experiment or metric gets captured
Inconsistent
• Different entities get different treatment by different people
Cumbersome
• There are 4 sources of information (S3, output tables, IMs, Databricks notebooks)
Non-discoverable
• New users have to create their own spreadsheet, their own way
13. Automated tracking is ...
Incomplete → Complete
Inconsistent → Consistent
Cumbersome → Effortless
Non-discoverable → Discoverable
Every job is tracked, in the same way, without human intervention, in a tool linked to from Entity Resolution.
18. MLflow: different APIs – different implementations
• Python API – start_run can take a run_id and continue a previous run,
including adding new metrics and new parameters.
• Java API – startRun can take a parentRunId, but it creates a nested run within
an existing run to update a metric or parameter.
Solution
Use the MlflowClient class instead of the MlflowContext class.
19. MLflow Tracking: different data types
[Diagram: sources feeding a single MLflow run –
• ER config API (pre-run): tracks parameters, starts the run
• Entity resolution Spark job: tracks metrics
• Databricks ER notebook (post-run): tracks metrics
• Manual: tracks artifacts]
20. MLflow Tracking: details
[Diagram: what each source logs to the run –
• ER config API (via the MLflow API): starts the run; summary description; pre-run parameters – IM fields, paths, config resources, matchers, input path, output path, job id, ER version, MLflow run id, run type (Job Id --> MLflow Run Id)
• Entity resolution Spark job (post-run): input count, output count, reduction
• Databricks ER notebook (manual): false positives, false negatives]
21. All saved in one place
[Screenshot: an MLflow run showing both the MLflow run id and the internal run id.]
22. Conclusions
A real user slide using the MLflow UI!
“Autotracking with MLflow makes historical comparisons so much easier!”