Misusing MLflow To Help Deduplicate Data At Scale


At Intuit, we have a lot of data, and a lot of duplicate data collected over decades. So we built a rule-based, self-serve tool to identify and merge duplicate records. It takes experimentation and iteration to get deduplication just right for hundreds of millions of records, and spreadsheet-based tracking just wasn’t enough. We now use MLflow to automatically capture execution notes, rule settings, weights, key validation metrics, etc., all without requiring end-user action. In this talk, we discuss our use case and why MLflow is useful outside its traditional MLOps use cases.
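
As a rough illustration of what a rule-based deduplication pass can look like, the sketch below runs the company-name variants from the slides through three stages: feature generation, matching, and mastering. The normalization rules, the `SUFFIXES` table, and the function names are assumptions made for this example, not Intuit's actual rules or tool.

```python
import re
from collections import defaultdict

# Hypothetical rule: treat these legal-form spellings as equivalent.
SUFFIXES = {"incorporated": "inc"}


def features(name):
    """Feature generation: lowercase, drop punctuation, unify legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(SUFFIXES.get(token, token) for token in tokens)


def match(records):
    """Matching: group records whose normalized names are identical."""
    groups = defaultdict(list)
    for record in records:
        groups[features(record["name"])].append(record)
    return list(groups.values())


def master(group):
    """Mastering: collapse a group into one record that lists every source."""
    return {
        "name": group[0]["name"],
        "sources": sorted(record["source"] for record in group),
    }


records = [
    {"name": "Imagination Inc.", "source": "Payroll"},
    {"name": "Imagination, Inc.", "source": "QuickBooks"},
    {"name": "Cookies Inc.", "source": "QuickBooks"},
    {"name": "Imagination Incorporated", "source": "Vendor"},
]
mastered = [master(group) for group in match(records)]
```

Here the three "Imagination" variants collapse into one mastered record and "Cookies Inc." stays on its own. A real tool would need weighted rules and fuzzier matching than exact equality on a normalized key.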


  1. MLflow + analysts: “misusing” MLflow to help deduplicate data at scale. Maya Livshits, SWE (@mayalivshits); Robin Oliva-Kraft, Product Manager (@robinkraft); Intuit
  2. Agenda: Context; Auto-tracking with MLflow; Demo; Technical deep dive; Conclusions
  3. Mission: Powering Prosperity Around the World. Intuit Confidential and Proprietary
  4. AI-driven expert platform, one ecosystem: Data and AI help us deliver to customers the benefits of More Money, No Work, and Complete Confidence
  5. Problem: millions of duplicate records
  6. Why so many duplicates?
     • Companies: Imagination Inc. (Payroll); Imagination, Inc. (QuickBooks); Cookies Inc. (QuickBooks); Imagination Incorporated (Vendor)
     • People: Mint User; Payroll User; Company Owner; Credit Karma member
  7. Entity Resolution: generate a unique representation of a real-world “thing”. ©2021 Intuit Inc. All rights reserved.
     • Feature generation: pre-process data for matching
     • Matching: identify records that are related to the same real thing
     • Mastering: collapse related records into one unique representation
     Use case #1: Keep customers & CS Agents happy and secure, by ensuring Agents can quickly and accurately authenticate callers. This is easier and quicker when there are no duplicates.
  8. Auto-tracking using MLflow
  9. Problem statement: I am a data analyst trying to be productive with the Entity Resolution tool, but it’s hard to build on past work because people track quality metrics manually (if at all) and data is missing, which makes me feel concerned that my output isn’t what it could be.
  10. Analyst user base is not MLflow's target audience
      • This is not their day job
      • SQL skills, maybe some Python
      • Not software/data/ML engineers or data scientists
      • Self-serve UX
  11. Before: 4 data sources, then manual data entry (Notebooks, S3, IMs, Databricks)
  12. Ad hoc, manual tracking is ...
      • Incomplete: even if you ARE tracking, not every experiment or metric gets captured
      • Inconsistent: different entities get different treatment by different people
      • Cumbersome: there are 4 sources of information (S3, output tables, IMs, Databricks notebooks)
      • Non-discoverable: new users have to create their own spreadsheet, their own way
      It’s hard to improve if you don’t have a clear record of what you’ve tried.
  13. Automated tracking is ...
      • Incomplete --> Complete
      • Inconsistent --> Consistent
      • Cumbersome --> Effortless
      • Non-discoverable --> Discoverable
      Every job is tracked, in the same way, without human intervention, in a tool linked to from Entity Resolution.
  14. We’re using this
  15. Demo
  16. Not your typical MLflow architecture: Scala, NodeJS, Python, DBK notebook
  17. MLflow's different APIs (diagram labels, in slide order): Scala; NodeJS; MLflow REST API; MLflow Java API; MLflow Python API; Python; DBK notebook; MLflow Java API
  18. MLflow's different APIs: different implementations
      • Python API: start_run can take a run_id and continue a previous run, including adding new metrics and new parameters.
      • Java API: startRun can take a parentRunId, and it creates a nested run within an existing run to update a metric or parameter.
      Solution: use the MlflowClient class instead of the MlflowContext class.
  19. MLflow Tracking: different data types (diagram labels, in slide order): entity resolution Spark; ER config API; Tracking Parameters; Starts Run; Tracking Metrics; Post-run; Tracking Metrics; Databricks ER notebook; Manual; Tracking Artifacts; Pre-run; Tracking Parameters
  20. MLflow Tracking details (diagram: entity resolution Spark, MLflow API, ER config API, Databricks ER notebook)
      • ER config API: Starts Run
      • Summary; Description
      • Post-run: Input count, Output count, Reduction
      • Manual (Databricks ER notebook): False Positives, False Negatives
      • Pre-run: IM fields, Paths, Config, Resources, Matchers, Input Path, Output Path, Job Id, ER version, MLflow Run Id, Run Type
      • Job Id --> MLflow Run Id
  21. All saved in one place: MLflow run id, internal run id
  22. Conclusions: Real user slide using MLflow UI! “Autotracking with MLflow makes historical comparisons so much easier!”
  23. Q&A. Maya Livshits (@mayalivshits); Robin Oliva-Kraft (@robinkraft); Intuit
  24. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.
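
Slides 18 to 20 describe several components writing to one MLflow run that is addressed by its run id, rather than each call opening a run of its own. A minimal Python sketch of that pattern follows, assuming a client object that exposes the `MlflowClient` methods `create_run`, `log_param`, `log_metric`, and `set_terminated`. The function names, the metric keys, and the definition of reduction as the fraction of records removed are assumptions for illustration; the talk's own services use the Java and REST APIs, not this code.

```python
def start_tracked_run(client, experiment_id, params):
    """Pre-run: create the run, log job parameters, return the MLflow run id."""
    run = client.create_run(experiment_id)
    run_id = run.info.run_id
    for key, value in params.items():
        client.log_param(run_id, key, value)
    return run_id


def log_post_run(client, run_id, input_count, output_count):
    """Post-run: log record counts and the reduction on the same run."""
    # Assumed definition: share of input records removed by deduplication.
    reduction = 1 - output_count / input_count if input_count else 0.0
    client.log_metric(run_id, "input_count", input_count)
    client.log_metric(run_id, "output_count", output_count)
    client.log_metric(run_id, "reduction", reduction)
    client.set_terminated(run_id)
    return reduction


def log_manual_review(client, run_id, false_positives, false_negatives):
    """Manual step: an analyst adds review counts to the same run later."""
    client.log_metric(run_id, "false_positives", false_positives)
    client.log_metric(run_id, "false_negatives", false_negatives)
```

With MLflow installed, `client` would be an `mlflow.tracking.MlflowClient()` instance and `experiment_id` the id of an existing experiment. Because every call takes the run id explicitly, the id returned by `start_tracked_run` can be stored next to the job's internal id and handed to whichever component logs next.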
