Mlflow + analysts:
“misusing” MLflow to help
deduplicate data at scale
Maya Livshits, SWE (@mayalivshits)
Robin Oliva-Kraft, Product Manager (@robinkraft)
Intuit
Context
Auto-tracking with MLflow
Demo
Technical deep dive
Conclusions
Agenda
Intuit Confidential and Proprietary 3
Powering Prosperity
Around the World
MISSION
3
Intuit Confidential and Proprietary
AI-DRIVEN EXPERT PLATFORM
More Money
No Work
Complete Confidence
ONE ECOSYSTEM
Data and AI help us deliver
to customers the benefits of
More Money, No Work, and
Complete Confidence
Problem: millions of duplicate records
Why so many duplicates?
Companies
People
Imagination Inc.
(Payroll)
Imagination, Inc.
(QuickBooks)
Cookies Inc.
(QuickBooks)
Imagination
Incorporated
(Vendor)
Mint
User
Payroll
User
Company
Owner
Credit
Karma
member
7
©2021 Intuit Inc. All rights reserved.
Entity Resolution
Generate a unique representation of a real-world “thing”
• Feature generation: pre-process data for matching
• Matching: identify records that are related to the same real thing
• Mastering: collapse related records into one unique representation
Use case #1: Keep customers & CS Agents happy and secure, by ensuring Agents can
quickly and accurately authenticate callers.
This is easier and quicker when there are no duplicates.
Auto-tracking using MLflow
9
©2021 Intuit Inc. All rights reserved.
Problem statement
I am a data analyst
Trying to be productive with the Entity Resolution tool
But it’s hard to build on past work
Because people track quality metrics manually (if at all) & data is missing
Which makes me feel concerned that my output isn’t what it could be.
10
©2021 Intuit Inc. All rights reserved.
Analyst user base is not MLflow's target audience
• This is not their day job
• SQL skills, maybe some Python
• Not software/data/ML engineers or data scientists
• Self-serve UX
4 data sources, then manual data entry
Before
Notebooks S3 IMs
Databricks
Ad hoc, manual tracking is ...
It’s hard to improve if you don’t have a clear record of what you’ve tried
Incomplete
• Even if you ARE tracking, not every experiment or metric gets captured
Inconsistent
• Different entities get different treatment by different people
Cumbersome
• There are 4 sources of information (S3, output tables, IMs, Databricks notebooks)
Non-discoverable
• New users have to create their own spreadsheet, their own way
Incomplete Complete
Inconsistent Consistent
Cumbersome Effortless
Non-discoverable Discoverable
Automated tracking is ...
Every job is tracked, in the same way, without human intervention, in a tool linked to from Entity Resolution.
We’re using this
Demo
Not your typical MLflow architecture
Scala
NodeJS
Python
DBK
notebook
Mlflow different APIs
Scala
NodeJS Mlflow REST API
Mlflow Java API
Mlflow Python API
Python
DBK
notebook
Mlflow Java API
Mlflow different APIs – different implementation
• Python api - start_run can get a run_id and continue a previous run
including adding new metrics and new parameters.
• Java api - startRun can get a parentRunId and creates a nested run within
an existing run to update a metric or parameter.
Solution
Use MlflowClient Class instead of mlflowContext class
MLflow Tracking different data types
entity
resolution
spark
er config api Tracking Parameters
Starts Run
Tracking Metrics
Post-run
Tracking Metrics
Databricks
ER
notebook
Manual
Tracking Artifacts
Pre-run
Tracking Parameters
MLflow Tracking details
entity
resolution
spark
Mlflow api
er config api
Starts Run
Summary Description
Post-run
Input count Output count Reduction
Databricks
ER notebook
Manual
False Positives False Negatives
Pre-run
IM fields
Paths
Config Resources
Matchers
Input Path Output Path Job Id ER version
Mlflow Run Id
Run Type
Job Id --> Mlflow Run Id
All saved in one place 
Mlflow run id
Internal run id
Conclusions
Real user slide using Mlflow UI!
“Autotracking with Mlflow makes historical comparisons so much easier!”
Q&A
Maya Livshits (@mayalivshits)
Robin Oliva-Kraft (@robinkraft)
Intuit
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

Misusing MLflow To Help Deduplicate Data At Scale

  • 1.
    Mlflow + analysts: “misusing”MLflow to help deduplicate data at scale Maya Livshits, SWE (@mayalivshits) Robin Oliva-Kraft, Product Manager (@robinkraft) Intuit
  • 2.
  • 3.
    Intuit Confidential andProprietary 3 Powering Prosperity Around the World MISSION 3 Intuit Confidential and Proprietary
  • 4.
    AI-DRIVEN EXPERT PLATFORM MoreMoney No Work Complete Confidence ONE ECOSYSTEM Data and AI help us deliver to customers the benefits of More Money, No Work, and Complete Confidence
  • 5.
    Problem: millions ofduplicate records
  • 6.
    Why so manyduplicates? Companies People Imagination Inc. (Payroll) Imagination, Inc. (QuickBooks) Cookies Inc. (QuickBooks) Imagination Incorporated (Vendor) Mint User Payroll User Company Owner Credit Karma member
  • 7.
    7 ©2021 Intuit Inc.All rights reserved. Entity Resolution Generate a unique representation of a real-world “thing” • Feature generation: pre-process data for matching • Matching: identify records that are related to the same real thing • Mastering: collapse related records into one unique representation Use case #1: Keep customers & CS Agents happy and secure, by ensuring Agents can quickly and accurately authenticate callers. This is easier and quicker when there are no duplicates.
  • 8.
  • 9.
    9 ©2021 Intuit Inc.All rights reserved. Problem statement I am a data analyst Trying to be productive with the Entity Resolution tool But it’s hard to build on past work Because people track quality metrics manually (if at all) & data is missing Which makes me feel concerned that my output isn’t what it could be.
  • 10.
    10 ©2021 Intuit Inc.All rights reserved. Analyst user base is not MLflow's target audience • This is not their day job • SQL skills, maybe some Python • Not software/data/ML engineers or data scientists • Self-serve UX
  • 11.
    4 data sources,then manual data entry Before Notebooks S3 IMs Databricks
  • 12.
    Ad hoc, manualtracking is ... It’s hard to improve if you don’t have a clear record of what you’ve tried Incomplete • Even if you ARE tracking, not every experiment or metric gets captured Inconsistent • Different entities get different treatment by different people Cumbersome • There are 4 sources of information (S3, output tables, IMs, Databricks notebooks) Non-discoverable • New users have to create their own spreadsheet, their own way
  • 13.
    Incomplete Complete Inconsistent Consistent CumbersomeEffortless Non-discoverable Discoverable Automated tracking is ... Every job is tracked, in the same way, without human intervention, in a tool linked to from Entity Resolution.
  • 14.
  • 15.
  • 16.
    Not your typicalMLflow architecture Scala NodeJS Python DBK notebook
  • 17.
    Mlflow different APIs Scala NodeJSMlflow REST API Mlflow Java API Mlflow Python API Python DBK notebook Mlflow Java API
  • 18.
    Mlflow different APIs– different implementation • Python api - start_run can get a run_id and continue a previous run including adding new metrics and new parameters. • Java api - startRun can get a parentRunId and creates a nested run within an existing run to update a metric or parameter. Solution Use MlflowClient Class instead of mlflowContext class
  • 19.
    MLflow Tracking differentdata types entity resolution spark er config api Tracking Parameters Starts Run Tracking Metrics Post-run Tracking Metrics Databricks ER notebook Manual Tracking Artifacts Pre-run Tracking Parameters
  • 20.
    MLflow Tracking details entity resolution spark Mlflowapi er config api Starts Run Summary Description Post-run Input count Output count Reduction Databricks ER notebook Manual False Positives False Negatives Pre-run IM fields Paths Config Resources Matchers Input Path Output Path Job Id ER version Mlflow Run Id Run Type Job Id --> Mlflow Run Id
  • 21.
    All saved inone place  Mlflow run id Internal run id
  • 22.
    Conclusions Real user slideusing Mlflow UI! “Autotracking with Mlflow makes historical comparisons so much easier!”
  • 23.
    Q&A Maya Livshits (@mayalivshits) RobinOliva-Kraft (@robinkraft) Intuit
  • 24.
    Feedback Your feedback isimportant to us. Don’t forget to rate and review the sessions.