Cleanliness is next to Godliness
Deduplicating Your Customer Data
Parts of the talk
Talking about Data Quality
Techniques for Deduplication
Processing, Timing and Mindset
Part 1 Part 2 Par...
Once upon a time...
Age of information
Large amounts of data inputted by humans
Humans make mistakes...
Information is a significant raw material for
businesses around the world.
Making data-based decisions
Wrong information leads to wrong decisions
Information as products
Bad and unimpressive produc...
Gathering Data from Humans
Paper forms
• Spelling mistakes
• Unclear questions
• Bare minimum information
• OCR
Web forms
...
Tainting Existing Data
Changes in procedures
• Didn’t update older data
• Different data structures
• Different ways of ha...
Some Industry Jargon
Single View of Customer
• Marketing Campaigns
Single Version of the Truth
• Strategy
Getting Correct ...
Consider this
You start a direct mail marketing campaign
And this happens...
Dear Mr ----- O’Brien,
We are delighted to inform you
that we have an amazing offer
specifically for you..
Avoiding Embarrassing Mistakes
• Marketing/PR
• Accounting
• Shipping
• Strategy
How much is it worth?
• 30% ROI (big consultancy)
• 10-25% Loss of revenue for bad data quality
• Competitive advantage
• ...
MFI Group
Founded 1964
Upgraded ERP systems early 2000s
Due to issues with data quality in 2004
• £46m in lost sales, £16...
Recap
Data Quality is a big subject
Avoid embarrassing mistakes
Keep company running efficiently
Good for reports
What Deduplication is used for
Increasing data quality
Compressing data
Pre-stage data cleansing needed
Matching
Techniques
• Address
• Name
• Fuzzy
• DOB
Business Rules
Quality Matching
Ask the Data
Address Matching
Databases
• Royal Mail (PAF)
• Council Address Data
• Do Your Own
Fill in missing parts
House Number, Bui...
Name Matching
Name, Full name
Forename, Firstname,
Lastname, Surname
Initial
Middle name(s)
Title, Suffix
Qualification
Lo...
SQL example
SELECT c1.*, c2.*
FROM customers c1 INNER JOIN customers c2
ON c1.address_id = c2.address_id
WHERE c1.surname ...
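The self-join above can also be sketched outside the database. The Python below is only an illustration, not the speaker's implementation: it assumes each customer is a dict with hypothetical keys id, address_id, forename, middlename and surname, and it treats middle names as compatible when they are equal or when at least one is blank, which is what the XOR condition in the full query evaluates to. Unlike the slide's query, it compares each pair once and never pairs a row with itself.

```python
from collections import defaultdict
from itertools import combinations


def middle_names_compatible(a, b):
    """Equal middle names, or at least one blank (assumed reading of the XOR rule)."""
    a, b = a.strip().upper(), b.strip().upper()
    return a == b or a == "" or b == ""


def candidate_pairs(customers):
    """Return (id, id) pairs sharing address, surname and forename."""
    groups = defaultdict(list)
    for c in customers:
        key = (c["address_id"], c["surname"].upper(), c["forename"].upper())
        groups[key].append(c)

    pairs = []
    for rows in groups.values():
        # combinations() yields each unordered pair exactly once
        for c1, c2 in combinations(rows, 2):
            if middle_names_compatible(c1["middlename"], c2["middlename"]):
                pairs.append((c1["id"], c2["id"]))
    return pairs


# Hypothetical sample rows, loosely based on the examples in the deck
customers = [
    {"id": 1, "address_id": 10, "forename": "Ciaran", "middlename": "Gerard", "surname": "O'Neill"},
    {"id": 2, "address_id": 10, "forename": "Ciaran", "middlename": "", "surname": "O'Neill"},
    {"id": 3, "address_id": 10, "forename": "Ciaran", "middlename": "M", "surname": "O'Neill"},
    {"id": 4, "address_id": 11, "forename": "Mark", "middlename": "", "surname": "Madanes"},
]
print(candidate_pairs(customers))  # [(1, 2), (2, 3)]
```

Note that "Gerard" and "M" are not paired here; whether an initial should match a full middle name is a business rule, not something this sketch decides.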
Title  Forename  Middle  Surname   DOB
MR     MARK              MADANES   05/10/1963
MR     MARK              MADANES   04/10/1963
Title  Forename  Middle  Surname   DOB
MR     CIARAN    GERARD  O’NEILL   26/07/1971
MR     CIARAN    M       O’NEILL   26/07/1971
Title  Forename  Middle  Surname   DOB
MS     JAN               PHILMORE  15/10/1954
MR     JAN               PHILMORE  00/00/0000
Title  Forename  Middle  Surname   DOB
MR     ALBERTO           CARLOS    00/00/0000
MR     ALBERT    O       CARLOS    00/00/0000
Fuzzy Matching
Levenshtein
select levenshtein('jonathan','jonathon') -> 1
Download from: http://www.artfulsoftware.com/inf...
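For readers without the MySQL function to hand, the edit distance itself is easy to sketch. The Python below is a plain dynamic-programming version written for illustration; it is not the downloadable function referenced on the slide, and the name levenshtein is reused only for readability.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop

    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("jonathan", "jonathon"))   # 1
print(levenshtein("JOHNSTONE", "JOHNSTON"))  # 1
```

A small distance only suggests a possible duplicate; the threshold that counts as "close enough" is a judgment call for each data set.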
Fuzzy Matching
Soundex
select soundex('jonathan') -> J535
Metaphone
echo metaphone('jonathan') -> JN0N
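To make the phonetic idea concrete, here is a minimal Python sketch of classic American Soundex (first letter plus three digits). It is an illustration only: MySQL's SOUNDEX() can return longer strings and may differ on some edge cases, so treat the details as assumptions rather than a drop-in replacement.

```python
# Consonant groups that share a Soundex digit
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Classic 4-character Soundex code, or "" for input with no letters."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""

    first = letters[0]
    encoded = []
    previous = CODES.get(first, "")  # the first letter's code still counts for runs
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        digit = CODES.get(ch, "")  # vowels map to "" and do break a run
        if digit and digit != previous:
            encoded.append(digit)
        previous = digit
    return (first + "".join(encoded) + "000")[:4]


print(soundex("jonathan"))                          # J535
print(soundex("Johnston") == soundex("Johnstone"))  # True
```

Names that sound alike collapse to the same code, which makes the code useful as a cheap grouping key before a more expensive comparison.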
Title  Forename  Middle  Surname    DOB
       SAMUEL            JOHNSTONE  00/00/0000
MR     SAMUEL            JOHNSTON   00/00/0000
Business Rules
Certain Level of Correctness
Generic Rules and Source Specific Rules
Business Rules
Example
• Middle name: Adam Smith vs. Adam E Smith
• Title: Miss vs. Ms vs. Lady
• Initial: A Smith vs. Ada...
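Rules like these are often applied as a normalization step before comparing records. The Python sketch below shows one possible way to do that for the surname variants in this example (different apostrophe characters; McDonald vs. Mc Donald vs. Mac Donald) and for the initial-vs-forename case. The function names and the exact folding choices are assumptions made for illustration, not rules taken from the talk.

```python
import re

# Characters commonly typed or pasted in place of a plain apostrophe (assumed list)
APOSTROPHE_VARIANTS = "`’‘´ʼ"


def normalize_surname(surname: str) -> str:
    """Return a comparison key for a surname; keep the original for display."""
    key = surname.strip().upper()
    for ch in APOSTROPHE_VARIANTS:
        key = key.replace(ch, "'")
    key = re.sub(r"\s+", " ", key)
    # Fold a detached "MC " / "MAC " prefix into "MC". An attached "MAC" is left
    # alone, because surnames such as MACE or MACK are not prefixed names.
    key = re.sub(r"^(MAC|MC)\s+", "MC", key)
    return key


def forenames_compatible(a: str, b: str) -> bool:
    """An initial matches any forename with the same first letter.

    Intended only for records already known to share an address.
    """
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        return False
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


print(normalize_surname("O`Brien"), normalize_surname("O’Brien"))
print(normalize_surname("Mc Donald"), normalize_surname("Mac Donald"))
print(forenames_compatible("A", "Adam"))
```

Keeping the normalized key separate from the stored value means a rule can be tightened or relaxed later without rewriting customer data.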
Things to Watch Out for
Same father/son or mother/daughter names
Twins with same DOB
Initial for a forename
Mixing of fore...
Quality Matching
Analyze data sources
How recent the data is
Ask the Data
Name popularity
Number of sources
• Example: 4 sources vs. 1 source say this spelling is
right
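The "4 sources vs. 1 source" idea can be sketched as a simple majority count. In the Python below the source names are hypothetical, and returning None on a tie is an assumed design choice that leaves the decision to a business rule or a person.

```python
from collections import Counter


def most_supported_value(values_by_source):
    """Return the value backed by the most sources, or None if there is no clear winner."""
    counts = Counter(v for v in values_by_source.values() if v)
    if not counts:
        return None
    ranked = counts.most_common()
    top_value, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # tie: defer to a business rule or manual review
    return top_value


# Hypothetical source systems for one customer's surname
spellings = {
    "crm": "JOHNSTONE",
    "billing": "JOHNSTONE",
    "web_signup": "JOHNSTONE",
    "support": "JOHNSTONE",
    "legacy_import": "JOHNSTON",
}
print(most_supported_value(spellings))  # JOHNSTONE
```

Name popularity could be folded in the same way, for example by adding weight to spellings that are common across the whole customer base.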
Consider Using a Democratic System
Opposite of hierarchical (if-then-else) system
If rules order is problematic
Business Rul...
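One possible reading of a "democratic" system is that every rule casts a vote and the total decides, so the order of the rules no longer matters. The Python below is a sketch under that assumption; the rules, the +1/0/-1 scoring and the threshold of 2 are all hypothetical, and the sample records are taken from the example pairs in this deck.

```python
UNKNOWN_DOB = "00/00/0000"


def surname_rule(a, b):
    return 1 if a["surname"] == b["surname"] else -1


def forename_rule(a, b):
    return 1 if a["forename"] == b["forename"] else -1


def dob_rule(a, b):
    if UNKNOWN_DOB in (a["dob"], b["dob"]):
        return 0  # abstain when either date is unknown
    return 1 if a["dob"] == b["dob"] else -1


def title_rule(a, b):
    if not a["title"] or not b["title"]:
        return 0  # abstain when a title is missing
    # Simplification: any difference counts against, even Miss vs. Ms
    return 1 if a["title"] == b["title"] else -1


RULES = [surname_rule, forename_rule, dob_rule, title_rule]


def tally(a, b, rules=RULES):
    """Sum of votes: +1 same person, -1 different person, 0 abstain."""
    return sum(rule(a, b) for rule in rules)


def is_duplicate(a, b, threshold=2):
    return tally(a, b) >= threshold


jan_ms = {"title": "MS", "forename": "JAN", "surname": "PHILMORE", "dob": "15/10/1954"}
jan_mr = {"title": "MR", "forename": "JAN", "surname": "PHILMORE", "dob": "00/00/0000"}
mark_a = {"title": "MR", "forename": "MARK", "surname": "MADANES", "dob": "05/10/1963"}
mark_b = {"title": "MR", "forename": "MARK", "surname": "MADANES", "dob": "04/10/1963"}

print(tally(jan_ms, jan_mr), is_duplicate(jan_ms, jan_mr))  # 1 False
print(tally(mark_a, mark_b), is_duplicate(mark_a, mark_b))  # 2 True
```

Because the votes are summed, adding or reordering rules changes the score but never silently short-circuits a later rule, which is the problem with a long if-then-else chain.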
Recap
Find address
Find duplicates
Try to make a decision for deduplication
• Business Rules
• Ask the Data
Processing
CPU/Disk/Memory bound
Sequential or parallel
Processing Data
Extra data
Result table
Temp data
Timing
On insert
A few minutes after insert (events)
Scheduled tasks
Pre-fetch
When user asks for it
New Data User Request...
Using Your Team
DBAs
Database Developers/ETL experts
Data Analysts
Developers
Testers
Mindset
Never 100%
Best Effort
Pareto Principle
Continuous Improvement
Cost
Benefits
Final Recap
Continuous Improvements
Which duplicate is the correct one?
Combine business rules + ask the data
Questions & Answers
Contact Information:
MySQL-related questions about presentation?
Non-profit or Medical?
contact@jonathanlevin.co.uk
Published in: Technology, Economy & Finance

Transcript of "Cleanliness is next to Godliness"

  1. Cleanliness is next to Godliness Deduplicating Your Customer Data
  2. Parts of the talk Talking about Data Quality Techniques for Deduplication Processing, Timing and Mindset Part 1 Part 2 Part 3 Timeline
  3. Once upon a time... Age of information
  4. Large amounts of data inputted by humans
  5. Humans make mistakes...
  6. Information is a significant raw material for businesses around the world.
  7. Making data-based decisions Wrong information leads to wrong decisions Information as products Bad and unimpressive products Information for logistics Company may shut down
  8. Gathering Data from Humans Paper forms • Spelling mistakes • Unclear questions • Bare minimum information • OCR Web forms • Bypassing filters
  9. Tainting Existing Data Changes in procedures • Didn’t update older data • Different data structures • Different ways of handling data Importing sources of (bad) data
  10. Some Industry Jargon Single View of Customer • Marketing Campaigns Single Version of the Truth • Strategy Getting Correct Reports
  11. Consider this You start a direct mail marketing campaign And this happens...
  12. Dear Mr ----- O’Brien, We are delighted to inform you that we have an amazing offer specifically for you..
  13. Avoiding Embarrassing Mistakes • Marketing/PR • Accounting • Shipping • Strategy
  14. How much is it worth? • 30% ROI (big consultancy) • 10-25% Loss of revenue for bad data quality • Competitive advantage • Avoid going out of business
  15. MFI Group Founded 1964 Upgraded ERP systems early 2000s Due to issues with data quality in 2004 • £46m in lost sales, £16m extra deliveries + technical costs and £20m for the actual system. Administration 2008 (Comeback 2010)
  16. Recap Data Quality is a big subject Avoid embarrassing mistakes Keep company running efficiently Good for reports
  17. What Deduplication is used for Increasing data quality Compressing data Pre-stage data cleansing needed
  18. Matching Techniques • Address • Name • Fuzzy • DOB Business Rules Quality Matching Ask the Data
  19. Address Matching Databases • Royal Mail (PAF) • Council Address Data • Do Your Own Fill in missing parts House Number, Building Number, House Name, Flat Number, Company Name, Street, Locality, Town, City, County, Country and Postcode
  20. Name Matching Name, Full name Forename, Firstname, Lastname, Surname Initial Middle name(s) Title, Suffix Qualification Lord James Jonah William Smith 3rd
  21. SQL example SELECT c1.*, c2.* FROM customers c1 INNER JOIN customers c2 ON c1.address_id = c2.address_id WHERE c1.surname = c2.surname AND c1.forename = c2.forename AND (c1.middlename = c2.middlename XOR (c1.middlename = '' XOR c2.middlename = ''));
  22. Title Forename Middle Surname DOB MR MARK MADANES 05/10/1963 MR MARK MADANES 04/10/1963
  23. Title Forename Middle Surname DOB MR CIARAN GERARD O’NEILL 26/07/1971 MR CIARAN M O’NEILL 26/07/1971
  24. Title Forename Middle Surname DOB MS JAN PHILMORE 15/10/1954 MR JAN PHILMORE 00/00/0000
  25. Title Forename Middle Surname DOB MR ALBERTO CARLOS 00/00/0000 MR ALBERT O CARLOS 00/00/0000
  26. Fuzzy Matching Levenshtein select levenshtein('jonathan','jonathon') -> 1 Download from: http://www.artfulsoftware.com/infotree/queries.php?&bw=1280#552
  27. Fuzzy Matching Soundex select soundex('jonathan') -> J535 Metaphone echo metaphone('jonathan') -> JN0N
  28. Title Forename Middle Surname DOB SAMUEL JOHNSTONE 00/00/0000 MR SAMUEL JOHNSTON 00/00/0000
  29. Business Rules Certain Level of Correctness Generic Rules and Source Specific Rules
  30. Business Rules Example • Middle name: Adam Smith vs. Adam E Smith • Title: Miss vs. Ms vs. Lady • Initial: A Smith vs. Adam Smith (same address) • Surnames: O`Brien vs. O'Brien vs. O’Brien • More Surname: McDonald vs. Mc Donald vs. Mac Donald
  31. Things to Watch Out for Same father/son or mother/daughter names Twins with same DOB Initial for a forename Mixing of forename with middle name Changing surname after marriage
  32. Quality Matching Analyze data sources How recent the data is
  33. Ask the Data Name popularity Number of sources • Example: 4 sources vs. 1 source say this spelling is right
  34. Consider Using a Democratic System Opposite of hierarchical (if-then-else) system If rules order is problematic Business Rules + Asking the Data
  35. Recap Find address Find duplicates Try to make a decision for deduplication • Business Rules • Ask the Data
  36. Processing CPU/Disk/Memory bound Sequential or parallel
  37. Processing Data Extra data Result table Temp data
  38. Timing On insert A few minutes after insert (events) Scheduled tasks Pre-fetch When user asks for it New Data User Request Points in Time
  39. Using Your Team DBAs Database Developers/ETL experts Data Analysts Developers Testers
  40. Mindset Never 100% Best Effort Pareto Principle Continuous Improvement Cost Benefits
  41. Final Recap Continuous Improvements Which duplicate is the correct one? Combine business rules + ask the data
  42. Questions & Answers
  43. Contact Information: MySQL-related questions about presentation? Non-profit or Medical? contact@jonathanlevin.co.uk