At Intuit, we have a lot of data – and a lot of duplicate data collected over decades. So we built a rule-based, self-serve tool to identify and merge duplicate records. Getting deduplication just right for hundreds of millions of records takes experimentation and iteration, and spreadsheet-based tracking just wasn’t enough. We now use MLflow to automatically capture execution notes, rule settings, weights, key validation metrics, and more, all without requiring end-user action. In this talk, we’ll cover our use case and why MLflow is useful outside its traditional MLOps role.
Misusing MLflow To Help Deduplicate Data At Scale
1. MLflow + analysts:
“misusing” MLflow to help
deduplicate data at scale
Maya Livshits, SWE (@mayalivshits)
Robin Oliva-Kraft, Product Manager (@robinkraft)
Intuit
3. MISSION: Powering Prosperity Around the World
Intuit Confidential and Proprietary
4. AI-DRIVEN EXPERT PLATFORM
ONE ECOSYSTEM
Data and AI help us deliver to customers the benefits of More Money, No Work, and Complete Confidence.
6. Why so many duplicates?
[Diagram: the same company shows up under several records – “Imagination Inc.” (Payroll), “Imagination, Inc.” (QuickBooks), and “Imagination Incorporated” (Vendor), alongside “Cookies Inc.” (QuickBooks) – and the same person shows up as a Mint user, a Payroll user, a company owner, and a Credit Karma member.]
11. Before: 4 data sources, then manual data entry
[Diagram: run information scattered across Notebooks, S3, IMs, and Databricks.]
12. Ad hoc, manual tracking is ...
It’s hard to improve if you don’t have a clear record of what you’ve tried
Incomplete
• Even if you ARE tracking, not every experiment or metric gets captured
Inconsistent
• Different entities get different treatment by different people
Cumbersome
• There are 4 sources of information (S3, output tables, IMs, Databricks notebooks)
Non-discoverable
• New users have to create their own spreadsheet, their own way
13. Automated tracking is ...
Incomplete → Complete
Inconsistent → Consistent
Cumbersome → Effortless
Non-discoverable → Discoverable
Every job is tracked, in the same way, without human intervention, in a tool linked to from Entity Resolution.
18. MLflow: different APIs – different implementations
• Python API – start_run can take a run_id and continue a previous run,
including adding new metrics and new parameters.
• Java API – startRun can take a parentRunId, but it creates a nested run within
an existing run to update a metric or parameter.
Solution
Use the MlflowClient class instead of the MlflowContext class.
19. MLflow Tracking: different data types
[Diagram: sources feeding a single MLflow run –
• ER config API (pre-run): tracks parameters, starts the run
• Entity resolution Spark job: tracks metrics
• Databricks ER notebook (post-run): tracks metrics
• Manual: tracks artifacts]
20. MLflow Tracking: details
[Diagram: what each source logs to the run –
• ER config API (via the MLflow API): starts the run; summary description; pre-run parameters – IM fields, paths, config resources, matchers, input path, output path, job id, ER version, MLflow run id, run type (Job Id --> MLflow Run Id)
• Entity resolution Spark job (post-run): input count, output count, reduction
• Databricks ER notebook (manual): false positives, false negatives]
21. All saved in one place
[Screenshot: an MLflow run showing both the MLflow run id and the internal run id.]
22. Conclusions
A real user slide using the MLflow UI!
“Autotracking with MLflow makes historical comparisons so much easier!”