Cleanliness is next to Godliness


Published in: Technology, Economy & Finance

  1. Cleanliness is next to Godliness: Deduplicating Your Customer Data
  2. Parts of the talk (timeline)
     • Part 1: Talking about Data Quality
     • Part 2: Techniques for Deduplication
     • Part 3: Processing, Timing and Mindset
  3. Once upon a time... Age of information
  4. Large amounts of data inputted by humans
  5. Humans make mistakes...
  6. Information is a significant raw material for businesses around the world.
  7. Making data-based decisions: wrong information leads to wrong decisions
     Information as products: bad and unimpressive products
     Information for logistics: company may shut down
  8. Gathering Data from Humans
     Paper forms
     • Spelling mistakes
     • Unclear questions
     • Bare minimum information
     • OCR
     Web forms
     • Bypassing filters
  9. Tainting Existing Data
     Changes in procedures
     • Didn’t update older data
     • Different data structures
     • Different ways of handling data
     Importing sources of (bad) data
  10. Some Industry Jargon
     Single View of Customer
     • Marketing Campaigns
     Single Version of the Truth
     • Strategy
     Getting Correct Reports
  11. Consider this: you start a direct mail marketing campaign, and this happens...
  12. Dear Mr ----- O’Brien, We are delighted to inform you that we have an amazing offer specifically for you...
  13. Avoiding Embarrassing Mistakes
     • Marketing/PR
     • Accounting
     • Shipping
     • Strategy
  14. How much is it worth?
     • 30% ROI (big consultancy)
     • 10-25% loss of revenue due to bad data quality
     • Competitive advantage
     • Avoid going out of business
  15. MFI Group
     Founded 1964
     Upgraded ERP systems in the early 2000s
     Due to issues with data quality in 2004
     • £46m in lost sales, £16m extra deliveries + technical costs and £20m for the actual system
     Administration 2008 (Comeback 2010)
  16. Recap
     Data Quality is a big subject
     Avoid embarrassing mistakes
     Keep company running efficiently
     Good for reports
  17. What Deduplication is used for
     Increasing data quality
     Compressing data
     Pre-stage data cleansing needed
  18. Matching Techniques
     • Address
     • Name
     • Fuzzy
     • DOB
     Business Rules
     Quality Matching
     Ask the Data
  19. Address Matching
     Databases
     • Royal Mail (PAF)
     • Council Address Data
     • Do Your Own
     Fill in missing parts: House Number, Building Number, House Name, Flat Number, Company Name, Street, Locality, Town, City, County, Country and Postcode
  20. Name Matching
     Name, Full name
     Forename, Firstname, Lastname, Surname
     Initial
     Middle name(s)
     Title, Suffix
     Qualification
     Example: Lord James Jonah William Smith 3rd
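  21. SQL example
        SELECT c1.*, c2.*
        FROM customers c1
        INNER JOIN customers c2 ON c1.address_id = c2.address_id
        WHERE c1.surname = c2.surname
          AND c1.forename = c2.forename
          -- Assumes a unique customers.id column (not on the original slide);
          -- without it every row also matches itself and each pair appears twice.
          AND c1.id < c2.id
          -- Middle names are equal, or exactly one of them is empty.
          AND (c1.middlename = c2.middlename
               XOR (c1.middlename = '' XOR c2.middlename = ''));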
  22. | Title | Forename | Middle | Surname | DOB        |
      | MR    | MARK     |        | MADANES | 05/10/1963 |
      | MR    | MARK     |        | MADANES | 04/10/1963 |
  23. | Title | Forename | Middle | Surname | DOB        |
      | MR    | CIARAN   | GERARD | O’NEILL | 26/07/1971 |
      | MR    | CIARAN   | M      | O’NEILL | 26/07/1971 |
  24. | Title | Forename | Middle | Surname  | DOB        |
      | MS    | JAN      |        | PHILMORE | 15/10/1954 |
      | MR    | JAN      |        | PHILMORE | 00/00/0000 |
  25. | Title | Forename | Middle | Surname | DOB        |
      | MR    | ALBERTO  |        | CARLOS  | 00/00/0000 |
      | MR    | ALBERT   | O      | CARLOS  | 00/00/0000 |
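
Each of the four record pairs on slides 22 to 25 defeats an exact comparison in a different way. One way to surface such pairs for review is to list the fields on which two records disagree while skipping fields where either side is blank or unknown. The following is a minimal sketch in Python, not the presenter's method: the dictionary keys, the `differing_fields` name and the treatment of `00/00/0000` as "unknown" are assumptions made for illustration.

```python
# Values treated as "no information" (an assumption based on the slides,
# where 00/00/0000 appears in place of a missing date of birth).
UNKNOWN_VALUES = {"", "00/00/0000"}

FIELDS = ("title", "forename", "middle", "surname", "dob")


def differing_fields(a, b):
    """Return the fields on which two records disagree.

    A field is skipped when either record has no usable value for it,
    so a missing DOB or middle name never counts as a disagreement.
    """
    diffs = []
    for field in FIELDS:
        x = a.get(field, "").strip().upper()
        y = b.get(field, "").strip().upper()
        if x in UNKNOWN_VALUES or y in UNKNOWN_VALUES:
            continue
        if x != y:
            diffs.append(field)
    return diffs


mark_1 = {"title": "MR", "forename": "MARK", "middle": "",
          "surname": "MADANES", "dob": "05/10/1963"}
mark_2 = {"title": "MR", "forename": "MARK", "middle": "",
          "surname": "MADANES", "dob": "04/10/1963"}
print(differing_fields(mark_1, mark_2))  # ['dob']
```

A pair with a single differing field is a candidate duplicate rather than a confirmed one; deciding whether a one-day DOB difference is a typo or a different person is left to business rules.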
  26. Fuzzy Matching
     Levenshtein
        select levenshtein('jonathan', 'jonathon') -> 1
     Download from:
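
For readers without a `levenshtein` function in their database, the edit distance on slide 26 can be sketched in a few lines. This is a plain dynamic-programming version in Python, written for illustration; it is not the downloadable function the slide refers to.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes or substitutions
    needed to turn string a into string b."""
    # Keep the shorter string as the row to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("jonathan", "jonathon"))  # 1
```

A small distance relative to the name length suggests a spelling variant, but the threshold is a business decision: a distance of 1 is significant for a three-letter name and minor for a twelve-letter one.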
  27. Fuzzy Matching
     Soundex
        select soundex('jonathan') -> J535
     Metaphone
        echo metaphone('jonathan') -> JN0N
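
To show what the Soundex code on slide 27 encodes, here is a minimal Python sketch of the classic four-character American Soundex. It is an illustration only: database built-ins can differ in edge cases (some return codes longer than four characters), so codes from this sketch should not be mixed with codes produced elsewhere.

```python
# Consonant groups of classic Soundex; vowels, h, w and y carry no code.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name):
    """Return the four-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate two letters with the same code
        code = _CODES.get(c, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]


print(soundex("jonathan"))  # J535
print(soundex("jonathon"))  # J535
```

Both spellings map to the same code, which is why phonetic codes work well as a cheap first filter before a more expensive comparison such as edit distance.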
  28. | Title | Forename | Middle | Surname   | DOB        |
      |       | SAMUEL   |        | JOHNSTONE | 00/00/0000 |
      | MR    | SAMUEL   |        | JOHNSTON  | 00/00/0000 |
  29. Business Rules
     Certain Level of Correctness
     Generic Rules and Source-Specific Rules
  30. Business Rules Example
     • Middle name: Adam Smith vs. Adam E Smith
     • Title: Miss vs. Ms vs. Lady
     • Initial: A Smith vs. Adam Smith (same address)
     • Surnames: O`Brien vs. O'Brien vs. O’Brien
     • More Surname: McDonald vs. Mc Donald vs. Mac Donald
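
Rules like the surname and initial examples on slide 30 are often applied by normalising values into a comparison key before matching. The sketch below is one possible reading of those rules in Python, not the presenter's implementation; the function names and the exact set of apostrophe characters are assumptions, and the normalised value is meant only for comparison, never for display.

```python
import re

# Characters commonly typed in place of a plain apostrophe (assumed set).
_APOSTROPHE_VARIANTS = "`\u2018\u2019\u00b4"


def normalise_surname(surname):
    """Build a comparison key: upper case, one apostrophe style,
    and 'Mc Donald' / 'Mac Donald' folded into 'MCDONALD'."""
    key = surname.strip().upper()
    for ch in _APOSTROPHE_VARIANTS:
        key = key.replace(ch, "'")
    key = re.sub(r"\s+", " ", key)
    # Only a prefix followed by a space is folded; 'MACDONALD' is left alone.
    key = re.sub(r"^(?:MC|MAC)\s+", "MC", key)
    return key


def forenames_compatible(a, b):
    """True if the forenames are equal or one is the initial of the other."""
    a, b = a.strip().upper(), b.strip().upper()
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)


print(normalise_surname("O`Brien"))       # O'BRIEN
print(normalise_surname("Mac Donald"))    # MCDONALD
print(forenames_compatible("A", "Adam"))  # True
```

As the slide notes, an initial match such as "A Smith" vs. "Adam Smith" is only meaningful together with other evidence, for example the same address.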
  31. Things to Watch Out for
     Same father/son or mother/daughter names
     Twins with same DOB
     Initial for a forename
     Mixing of forename with middle name
     Changing surname after marriage
  32. Quality Matching
     Analyze data sources
     How recent the data is
  33. Ask the Data
     Name popularity
     Number of sources
     • Example: 4 sources vs. 1 source say this spelling is right
  34. Consider Using a Democratic System
     Opposite of hierarchical (if-then-else) system
     If rules order is problematic
     Business Rules + Asking the Data
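
The "democratic" idea on slides 33 and 34 can be read as a majority vote across sources: each source contributes the value it holds, and the most common value wins. The following Python sketch shows that reading under simple assumptions (one vote per source, blanks ignored, ties left undecided); the `vote` function is hypothetical and not taken from the talk.

```python
from collections import Counter


def vote(values):
    """Return the value most sources agree on, or None when there is
    no usable value or the top two values are tied."""
    counts = Counter(v for v in values if v)  # ignore blank values
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: fall back to business rules or manual review
    return ranked[0][0]


# Four sources say JOHNSTONE, one says JOHNSTON.
spellings = ["JOHNSTONE", "JOHNSTONE", "JOHNSTON", "JOHNSTONE", "JOHNSTONE"]
print(vote(spellings))  # JOHNSTONE
```

Returning `None` on a tie keeps the vote from depending on the order of the sources, which is the problem with ordered if-then-else rules that the slide raises.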
  35. Recap
     Find address
     Find duplicates
     Try to make a decision for deduplication
     • Business Rules
     • Ask the Data
  36. Processing
     CPU/Disk/Memory bound
     Sequential or parallel
  37. Processing
     Data
     Extra data
     Result table
     Temp data
  38. Timing
     On insert
     A few minutes after insert (events)
     Scheduled tasks
     Pre-fetch
     When user asks for it
     (Diagram labels: New Data, User Request, Points in Time)
  39. Using Your Team
     DBAs
     Database Developers/ETL experts
     Data Analysts
     Developers
     Testers
  40. Mindset
     Never 100%
     Best Effort
     Pareto Principle
     Continuous Improvement
     Cost Benefits
  41. Final Recap
     Continuous Improvements
     Which duplicate is the correct one?
     Combine business rules + ask the data
  42. Questions & Answers
  43. Contact Information: MySQL-related questions about presentation? Non-profit or Medical?