1. Data, how to get it clean and keep it clean?
The best way to make money is to stop wasting it!
2. Agenda:
Who are DQ
Setting the scene
Acceptable Quality
Data Defects
Get it Clean
Keep it Clean
Q&A via web chat
Close
3. Setting the scene…
Who are we?
What do we do?
How do we do it?
What’s in it for our clients?
4. UK B2C Data – annual rates of change…
UK Population is 63.23 M
UK Households 26.4 M
• Over 3.25 M (5.1%) people move house
• 0.584 M (0.9%) people pass away
• 0.813 M (1.3%) Births
• 0.290 M (0.5%) Marry
• 0.130 M (0.2%) Divorce
• 0.500 M (1.9%) Changes by Royal Mail
• 0.250 M (1.4%) people sign up to MPS
½ life of B2C data 1 to 1.2 years
5. UK B2B Data – annual rates of change…
4.934 M trading businesses in the UK
• 3.10 M (62.8%) sole proprietorships
• 0.43 M (8.8%) partnerships
• 1.40 M (28.4%) limited companies
• 0.60 M (12.2%) dormant businesses
5.7 M changes to company or individual details:
• 1 moves every 6 minutes
• 1 fails every 4 minutes
On average a person changes jobs 11 times during their career
Over 1.1 M (22.3%) businesses are registered with the CTPS
2.43 M employees of UK businesses:
• 99.9% of businesses employ fewer than 250 staff
• 99.2% of businesses employ fewer than 50 people, and together they employ 59% of total staff
@ 24% p.a. ½ life attrition = 3 years
@ 35% p.a. ½ life attrition = 2 years
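The half-life figures above can be sanity-checked with a short calculation. This is only a sketch: it assumes a constant compound annual decay rate, which the slide does not state, and `half_life_years` is a hypothetical helper name. On that assumption 24% p.a. gives roughly 2.5 years and 35% p.a. roughly 1.6 years, in line with the rounded 3 and 2 years quoted.

```python
import math

def half_life_years(annual_decay_rate: float) -> float:
    """Years until only half the records are still accurate.

    Assumes a constant compound decay rate (an assumption made for
    illustration, not something stated in the slides).
    """
    if not 0 < annual_decay_rate < 1:
        raise ValueError("annual_decay_rate must be between 0 and 1")
    return math.log(0.5) / math.log(1 - annual_decay_rate)

for rate in (0.24, 0.35):
    print(f"{rate:.0%} p.a. -> half-life of about {half_life_years(rate):.1f} years")
```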
6. Data decay – the impacts…
Financial:
• £220 M per-annum wasted on inaccurate mailings
• £95 M per-annum wasted by companies mailing people who have moved addresses
• It costs more to mail a moved or deceased individual than to suppress them
• Increase response rates – the same return with less mail
Brand:
• Duplicates and incorrect details cause a negative perception
• Mailing deceased individuals or bereaved families causes significant distress
• Mailing someone who no longer lives at an address does not impress
Compliance:
• Best practice – comply with Direct Marketing Association guidelines
• Calling a consumer who has registered their objection to receiving direct marketing phone calls is illegal
• Mailing a consumer who has registered their objection to receiving direct mail is bad management, contravenes the DMA Code of Practice and could be illegal
Environment:
• Protect the environment – help cut down on wasteful mailing
8. The Data Quality Delusion
Everyone understands the importance of data quality
Everyone agrees data quality is important
Everyone cares about data quality
Everyone knows what actions to take to improve data quality
10. Johari Window - You don’t know what you don’t know...
Open Area: known to others and known to self
Blind Area: known to others, not known to self
Hidden Area: not known to others and known to self
Unknown Area: unknown to others and unknown to self
Self / Others
Expand the Open Area
Reduce the Blind Area
Reduce the Hidden Area?
11. Acceptable levels of data quality?
All data has some level of quality, the question is at what level is it unacceptable?
How does anyone know?
Who’s responsible?
How much is low quality data actually costing?
Unacceptable / Acceptable
12. All data has some level of quality, the question is at what level is it unacceptable.
Temp < 37°C: Hypothermia
Temp = 37°C: Normal
Temp > 37°C: Abnormal
Temp > 37.8°C: Get help
13. How can we end up with bad data?
A boy's name beginning with the letter J: "Gerald.."
A word beginning with Z: "Xylophone.."
A part of the body beginning with N: "Knee.."
A mode of transport that you can walk in: "Your shoes.."
14. Getting your data clean and keeping it clean
Identify, correct, prevent
15. Get it Clean: the basics
About “CURING” data defects
Batch process automation
Mass defect identification
• Mastering & Merging
• Manual review
Time consuming
More costly than prevention
16. Keep it Clean: the basics
Prevention better than cure
Ongoing process
• People
• Process
• Technology
Costs of prevention many times lower than cure!
17. Waging war on error…
Finding defects
Defining standards
Correcting data
Preventing error
Monitoring defects
Reference data
Internal data
18. Boolean Logic & Dates
DD/MM/YY v MM/DD/YY
• 10/10/09 = 10/10/09
• 99/99/99 was accepted as a valid date structure yet it’s clearly wrong
Is it European format DD/MM/YYYY or US format MM/DD/YYYY?
Precision
• DD/MM/YY or DD/MM/YYYY
OK to Mail = Y
Not OK to Mail = Y
OK to Mail = N
Not OK to Mail = N
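The date and flag problems on this slide can be tested mechanically. The sketch below uses Python's standard `datetime.strptime`. The helper names are hypothetical, and the reading that "OK to Mail" and "Not OK to Mail" are two separate fields that can contradict each other is an assumption about what the slide's grid shows.

```python
from datetime import datetime

def parse_date(text: str, fmt: str):
    """Return a date if text is a real calendar date in fmt, else None."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None

def is_ambiguous(text: str) -> bool:
    """True when DD/MM/YY and MM/DD/YY both parse but give different dates."""
    uk = parse_date(text, "%d/%m/%y")
    us = parse_date(text, "%m/%d/%y")
    return uk is not None and us is not None and uk != us

def flags_consistent(ok_to_mail: str, not_ok_to_mail: str) -> bool:
    """Assumed rule: the two opposite flags must never hold the same value."""
    return ok_to_mail != not_ok_to_mail

print(parse_date("99/99/99", "%d/%m/%y"))  # right shape, not a real date
print(is_ambiguous("10/10/09"))            # same date in both formats
print(is_ambiguous("02/10/09"))            # 2 October or 10 February?
print(flags_consistent("Y", "Y"))          # contradictory flags
```

A structural check (two digits, slash, two digits) would accept 99/99/99; parsing against the calendar is what rejects it.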
19. Numbers in Text and Shared Numbers
Systems contain:
• 0’s and/or O’s
• 1’s and/or I’s
• Tel numbers with 9 x 000 000 000
Same product – different numbers in 2 systems
• Same Part number 99 000 1111
• 99 000 1111 = 1 day’s cold ration pack
• 99 000 1111 = Radio valves
• Leasing Agreement numbers
• ID Counters shared across systems
• SKUs
• Tank & Aircraft Parts
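Letter-for-digit substitutions and all-zero telephone numbers are straightforward to detect. This is a minimal sketch; the look-alike mapping and the function names are assumptions chosen for illustration, and it should only be applied to fields that are meant to be purely numeric.

```python
import re

# Letters commonly typed in place of digits (assumed mapping for illustration).
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})

def clean_number(text: str) -> str:
    """Swap look-alike letters for digits, then keep digits only."""
    return re.sub(r"\D", "", text.translate(LOOKALIKES))

def is_placeholder(text: str) -> bool:
    """True for an empty value or one digit repeated, such as 000 000 000."""
    digits = clean_number(text)
    return len(set(digits)) <= 1

print(clean_number("O2392 9883O3"))
print(is_placeholder("000 000 000"))
print(is_placeholder("02392 988303"))
```

Note that this does nothing for the shared part-number problem: when 99 000 1111 means two different products, the number has to be qualified by its source system.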
20. Misinterpretation & Standards
M = Male in one system and Married in another
S = Single in one system and Separated in another
Gender
• 9 variants in the gender field of a hotel project
Padhraic, Pádraig or Páraic
Lane, LN, Ln, Road, Rd, Rd. etc.
MI or Michigan
US or USA or United States
GB or UK or United Kingdom
Mr. or Mister
Hants or Hampshire
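Variants like these are usually resolved with agreed lookup tables. The sketch below is illustrative only: the tables, the system names `system_a` and `system_b`, and the function names are all hypothetical. It shows two ideas: mapping synonyms to one standard value, and keying an ambiguous code by the system it came from.

```python
# Hypothetical reference tables; a real project would agree these up front.
COUNTY = {"hants": "Hampshire", "hampshire": "Hampshire"}
COUNTRY = {
    "gb": "United Kingdom", "uk": "United Kingdom", "united kingdom": "United Kingdom",
    "us": "United States", "usa": "United States", "united states": "United States",
}
# The same code means different things in different systems,
# so the source system is part of the key.
CODE_MEANING = {
    ("system_a", "M"): "Male", ("system_b", "M"): "Married",
    ("system_a", "S"): "Single", ("system_b", "S"): "Separated",
}

def standardise(value: str, lookup: dict) -> str:
    """Return the standard form, or the trimmed input if it is not in the table."""
    key = value.strip().rstrip(".").lower()
    return lookup.get(key, value.strip())

def decode(system: str, code: str):
    """Return the meaning of a code for a given system, or None if unknown."""
    return CODE_MEANING.get((system, code.strip().upper()))

print(standardise("Hants", COUNTY))
print(standardise("USA", COUNTRY))
print(decode("system_a", "M"), decode("system_b", "M"))
```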
21. Dislocation, misfielding
Address A             Address B
123 Arcasia Avenue    123 Arcasia Ave
Fareham               (blank)
Hampshire             Fareham
PO16 8XT              Hants
(blank)               PO16 8XT
Person A              Person B
Martin                (blank)
P                     Martin P
Doyle                 Doyle
02392 988303          +1 312-253-7873
+1 312-253-7873       02392 988303
22. Anomalies & Congruence
Email does not tally with name parts
Currency does not tally with location
Goods shipped before order
Values not in application pick lists (metadata)
Default values used
Notes (memo) fields used without validation rules
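Several of these congruence checks can be written as simple cross-field rules. The sketch below assumes a record is a dictionary with the hypothetical fields `title`, `surname`, `email`, `ordered` and `shipped`; the pick list is also invented for illustration.

```python
from datetime import date

PICK_LIST = {"Mr", "Mrs", "Ms", "Dr"}  # hypothetical application pick list

def find_anomalies(rec: dict) -> list[str]:
    """Return the cross-field rules that a record breaks."""
    issues = []
    if rec["shipped"] < rec["ordered"]:
        issues.append("shipped before order")
    if rec["title"] not in PICK_LIST:
        issues.append("title not in pick list")
    local_part = rec["email"].split("@")[0].lower()
    if rec["surname"].lower() not in local_part:
        issues.append("email does not tally with name")
    return issues

record = {
    "title": "Sir", "surname": "Doyle", "email": "info@example.com",
    "ordered": date(2009, 2, 10), "shipped": date(2009, 2, 1),
}
print(find_anomalies(record))
```

A failed rule is a prompt for review rather than proof of error: a shared mailbox such as info@ is legitimate, but worth flagging against a named contact.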
23. DQ Studio – identifying and fixing
• Product demonstration by:
• Martin Kerr
• How to connect, identify and
correct defects…
24. DQ Studio
Classify
• Is the data in your database what you think it is?
Compare
• How similar is value A to value B, as a % similarity?
Format
•Email
•I.P.
•Postcode
•Telephone
•URL
Generate:
•phonetic tokens
•pattern tokens
Transform data
•13 Categories
•5 Spoken Languages
Validate
•Email
•I.P. Address
•Postal code
•Telephone
•URL
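To make "% similarity" and "pattern tokens" concrete, here is a small sketch using Python's standard `difflib`. It is not how DQ Studio implements these functions, which the slides do not describe; the function names and the A/9 token convention are assumptions.

```python
from difflib import SequenceMatcher

def similarity_pct(a: str, b: str) -> float:
    """Case-insensitive similarity of two strings as a percentage."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pattern_token(value: str) -> str:
    """Replace letters with A and digits with 9, keeping other characters."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch
        for ch in value
    )

print(round(similarity_pct("Thomson", "Thompson"), 1))
print(pattern_token("PO16 8XT"))
print(pattern_token("P0I6 8XT"))
```

Pattern tokens make format defects visible at a glance: a postcode whose token differs from the expected letter and digit layout is a candidate for review even if every character is individually plausible.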
25. DQ Studio
Derive:
• Job Title
• Role
• Level
• Gender
• Male, female, unknown
• Telephone
• Country
• Location
• Number Type
Parse:
• Email
• I.P. Address
• Telephone
Verify
• Locations (240 Countries)
• Phones
• Businesses
• Contacts
27. Matching – What is it?
• Identification and management of records which:
• Are the same
• Might be the same
• Are not the same
• PAF Batch
• PAF Lookup
• No Way
• Gone Away
• Passed Away
• Append
• Table v Table
• Table v Itself
Dedupe
X-Match
X-Ref API
X-Ref Data
28. How is it done?
Manually
• Internally
• External Bureau service
Automatically
• Software
Using black and white magic...
• Black = Matches
• White = Non Matches
• Grey = Ambiguous
Carefully to avoid:
• Too many matches
• Too few matches
• Errors in matches
29. The grey areas - When is a match a match?
Bob = Bobby = Rob = Robert = Robby = Roberto?
Thomson = Thompson = Tomson = Thomson?
Xerox = Zerocks?
PO16 8XT = P0I6 8XT?
+44 (0) 2392 988303 = O2392 9883O3?
10th Feb 2009 = 10/02/09 = 02/10/2009?
Hants = Hampshire = Hamps?
martin.doyle@dqglobal.com = doyle.martin@dqglobal.com
30. Grey to Black or Grey to White
• Transformations (Synonyms)
• Phonetics
• String comparisons
• Intelligence
• Rules
• Spelling
• Typos
• Logic
• Experience
• Lookups
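Two of the techniques listed, synonym transformation and string comparison, can be combined to push a pair of values towards black or white. This is a sketch under stated assumptions: the synonym table is invented, the thresholds 0.95 and 0.60 are arbitrary illustrations rather than recommended values, and `difflib` stands in for whatever comparison a real matching tool uses.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table mapping name variants to one canonical form.
SYNONYMS = {"bob": "robert", "bobby": "robert", "rob": "robert", "robby": "robert"}

def canonical(name: str) -> str:
    """Lower-case, trim and apply the synonym transformation."""
    key = name.strip().lower()
    return SYNONYMS.get(key, key)

def classify(a: str, b: str, black: float = 0.95, white: float = 0.60) -> str:
    """Return 'black' (match), 'white' (non-match) or 'grey' (needs review)."""
    a, b = canonical(a), canonical(b)
    if a == b:
        return "black"
    score = SequenceMatcher(None, a, b).ratio()
    if score >= black:
        return "black"
    if score < white:
        return "white"
    return "grey"

for pair in [("Bob", "Robert"), ("Thomson", "Thompson"), ("Doyle", "Smith")]:
    print(pair, classify(*pair))
```

Raising the black threshold produces fewer wrong matches and more grey pairs for manual review; lowering it does the opposite. That trade-off is the "too many matches / too few matches" balance described earlier.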
31. Mastering Perfection & merging?
Problems:
• Which data survives?
• Which data gets re-assigned?
• Which data gets stored?
• Which data gets thrown away?
Solutions:
• Define the record master
• Define the field merge rules
• Use technology to automate processes
• Humanise exceptions
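A record master rule and a field merge rule can be expressed in a few lines. The rules below are assumptions chosen for illustration (most recently updated record is the master; empty master fields are filled from the next most recent record); the slides do not prescribe particular rules, and the field names are hypothetical.

```python
from datetime import date

def pick_master(records: list[dict]) -> dict:
    """Record master rule (assumed): the most recently updated record wins."""
    return max(records, key=lambda r: r["updated"])

def merge(records: list[dict]) -> dict:
    """Field merge rule (assumed): keep the master's value, and fill any
    empty field from the next most recently updated record that has one."""
    merged = {}
    for rec in sorted(records, key=lambda r: r["updated"], reverse=True):
        for field, value in rec.items():
            if merged.get(field) in (None, ""):
                merged[field] = value
    return merged

older = {"name": "M Doyle", "phone": "02392 988303", "email": "", "updated": date(2008, 1, 1)}
newer = {"name": "Martin Doyle", "phone": "", "email": "martin.doyle@example.com", "updated": date(2009, 2, 10)}
print(merge([older, newer]))
```

Values that lose out under these rules (here the older name "M Doyle") are the "which data gets thrown away?" question; a cautious process would archive them rather than discard them.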
35. Cleaning up your business systems:
Back-up your data
Define pick lists
Ensure legacy data conforms to picklists
Delete any temporary fields set up for testing that are still in the production system
Delete or archive old data
Identify contacts with no email and/or no telephone #
Identify and correct contacts with bogus phone numbers
Identify records whose email bounces
Identify businesses without contacts
Archive linked documents which are ‘n’ years old; however, take care with legal documents, including invoices and contracts
User admin – delete any users who no longer access systems
Review any prospects, suspects or opportunities not properly closed, i.e. > ‘n’ weeks from opening
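Two items from this checklist, contacts with no email or telephone and opportunities left open for more than 'n' weeks, lend themselves to simple queries. The sketch below works on plain dictionaries with hypothetical field names (`email`, `telephone`, `opened`, `closed`); a real system would run the equivalent against its own schema.

```python
from datetime import date, timedelta

def missing_channels(contact: dict) -> list[str]:
    """List which of email / telephone are empty for a contact."""
    return [f for f in ("email", "telephone") if not (contact.get(f) or "").strip()]

def stale_opportunities(opps: list[dict], today: date, n_weeks: int) -> list[dict]:
    """Opportunities that are still open more than n_weeks after opening."""
    cutoff = today - timedelta(weeks=n_weeks)
    return [o for o in opps if o.get("closed") is None and o["opened"] < cutoff]

opps = [
    {"id": 1, "opened": date(2009, 1, 1), "closed": None},
    {"id": 2, "opened": date(2009, 6, 1), "closed": None},
    {"id": 3, "opened": date(2009, 1, 1), "closed": date(2009, 2, 1)},
]
print(missing_channels({"email": "a@example.com", "telephone": ""}))
print([o["id"] for o in stale_opportunities(opps, date(2009, 6, 15), n_weeks=12)])
```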
36. Actions to consider…
Change attitudes to “ABC” thinking
Think prevention not cure
Apply DQ processes
Verify, Format & Validate
Suppress records
Merge duplicates
Append missing data for segmentation
Govern and Comply
Measure & Manage
Get a CXO sponsor
Prune & Consolidate & Remove competition
Common dictionary of terms
Define customer value, and lifetime?
37. In conclusion…
Identify
• recognise there is a problem?
Qualify
• gather evidence, what, when, where and how large is the problem?
Quantify
• what’s specifically doing the damage?
Accept
• acknowledge the scale of the task?
Define
• the goals and what will be measured?
Perform
• carry out the tasks agreed in the order of significance
38. Questions…
• Build a better business based on trusted data…
• Contact DQ Global
• www.DQGlobal.com
• Talk to a consultant
• sales@DQGlobal.com
• +44 2392 988303 (Europe)
• +1 314-253-7873 (North America)
Editor's Notes
It’s inherently true that at some level everyone understands the importance of data quality. Generally, everyone agrees data quality is important. It is not true that everyone cares about data quality. It is certainly not true that everyone knows what actions to take to improve data quality.
The idea is to maximise the Open Area so that we all know as much as possible... This is why data profiling is critical to success in DQ projects: if you don’t know where you are, you can’t plot a journey to where you’re going.
Without some means of measurement, how does anyone know? Without governance, how does anyone know who’s responsible? Without measurement and governance and an understanding of the downstream impacts of data quality, how does any business know how much low quality data is actually costing?
MD: “It doesn’t matter what the room temperature is, it’s always room temperature” – Steven Wright. In our scenario, if this scale related to body temperature, then too cold and hypothermia could be an issue; too hot and feverish, and all sorts of complications are possible.
Answers from a game show called Family Fortunes, where the hard of thinking gave these answers, which I thought were applicable to our context.
A boy's name beginning with the letter J: "Gerald.."
A word beginning with Z: "Xylophone.."
A part of the body beginning with N: "Knee.."
And now you know why Soundex does not work well as a matching algorithm...
A mode of transport that you can walk in: "Your shoes.." - That’s what happens with free text fields in databases: no validation!
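The Soundex remark can be illustrated with a compact implementation of the classic American Soundex rules, written here as a sketch rather than taken from any particular product. Soundex keeps the first letter of the name unchanged, so names that sound alike but start with different letters can never share a code. The comparison names "Jerald" and "Nee" are made up for the demonstration.

```python
# Consonant groups of classic American Soundex; vowels, h, w and y have no code.
CODES = {
    ch: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for ch in letters
}

def soundex(name: str) -> str:
    """Return the four-character Soundex code of a name ('' if no letters)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()          # first letter is kept as-is
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:    # skip repeats of the same code
            code += digit
        if ch not in "hw":             # h and w do not break a run of repeats
            prev = digit
    return (code + "000")[:4]

for a, b in [("Gerald", "Jerald"), ("Knee", "Nee")]:
    print(a, soundex(a), "|", b, soundex(b))
```

Gerald and Jerald differ only in the leading letter of their codes, and Knee and Nee get entirely different codes, even though each pair sounds the same when spoken.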