Bcs 20080228 Ku

A discussion of the five types of data and information quality defects and the ways in which they can arise. First given at a BCS meeting at Solent University, 200803.

Published in: Technology, Business

  1. 1. The quality of information and data is strained
      Keith Underdown, Convenor, British Community of Practice, International Association for Information and Data Quality
  2. 2. Shameless Plug
      • International Association for Information & Data Quality: www.iaidq.org
        ◦ Student Membership: $25
        ◦ Personal Membership: $85
        ◦ Corporate Membership Available
        ◦ Extensive Conference Discounts
      • www.justgiving.com/keithunderdown
        ◦ My fundraising page
        ◦ Reward me if you enjoy my presentation
  3. 3. Data
      • “Everybody knows what data is”!
        ◦ “Define:data” in a Google search gives 41 results
        ◦ Mix of
          • “data processing” biased
          • Philosophical
          • Irrelevant (Data is an android in Star Trek: TNG)
      • My Preference:
        ◦ A collection of facts held in a formalized manner suitable for processing by automatic or human means.
  4. 4. Fundamental Data Quality
      • The facts in the case can be:
        ◦ Inaccurate
        ◦ Incomplete
        ◦ Inconsistent
        ◦ Invalid
        ◦ Incomprehensible
  5. 5. The Five “I’s”
      • Incomplete Data
        ◦ mandatory fields with null, empty string, etc.
      • Invalid Data
        ◦ values outside the allowed value set, or values that fail tests against rules
      • Inconsistent Data
        ◦ intra-record inconsistency
        ◦ inter-record inconsistency
        ◦ inter-datastore inconsistency
      • Inaccurate Data
        ◦ Statistical outliers & other “sore thumbs”
          • E.g. price 10 times higher than similar models
      • Incomprehensible Data
        ◦ without full and accurate context
  6. 6. Incomplete Data
      • Facts essential to business process are missing
        ◦ Implies that data validation is incorrect
        ◦ Often arises during bulk import of data
          • Data not immediately available so validation relaxed
          • Follow-up not completed
          • Database field cannot be made mandatory
  7. 7. Example
      • Change in law made knowledge of Social Security number mandatory
        ◦ Too expensive to go to customers
        ◦ Populate at need
        ◦ Telephone agents used their own
      • Customer failed to fill in DoB field
        ◦ Data entry clerk guessed!
        ◦ Customer has high value transaction turned down
        ◦ Lots of adverse publicity
  8. 8. How can we avoid these?
      • Plan for their absence
        ◦ When creating new databases, plan to populate fields
        ◦ When bulk updates are required, bite the bullet
        ◦ Ensure agents have time and understand the need to collect data
      • Check for likely “cheats”
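The two checks on this slide, spotting missing mandatory fields and spotting likely "cheats", can be sketched in Python. This is only an illustration under stated assumptions: the field names, the `MANDATORY` tuple and the placeholder values in `SUSPECT_VALUES` are hypothetical and do not come from the presentation.

```python
from datetime import date

# Hypothetical rules for a customer record; real rules come from the business.
MANDATORY = ("name", "date_of_birth", "social_security_number")

# Hypothetical values that often mean "somebody guessed" rather than "somebody asked".
SUSPECT_VALUES = {
    "date_of_birth": {date(1900, 1, 1), date(1970, 1, 1)},
    "social_security_number": {"000000000", "123456789", "999999999"},
}


def missing_fields(record: dict) -> list[str]:
    """Return mandatory fields that are absent, None, or blank strings."""
    missing = []
    for field in MANDATORY:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


def suspect_fields(record: dict) -> list[str]:
    """Return fields whose value matches a known placeholder."""
    return [
        field
        for field, placeholders in SUSPECT_VALUES.items()
        if record.get(field) in placeholders
    ]


record = {
    "name": "A. Customer",
    "date_of_birth": date(1900, 1, 1),
    "social_security_number": "  ",
}
print(missing_fields(record))   # fields that still need collecting
print(suspect_fields(record))   # fields that look guessed
```

A placeholder list like this only catches the cheats somebody has already noticed, so it would need regular review against the actual data.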
  9. 9. Invalid Data
      • Data that fails genuine business rules
      • Or
      • Fails unstated real world validation
        ◦ Company name info spills over into address fields
  10. 10. Examples
      • 01222 535681 looks like a valid phone no.
        ◦ But Cardiff is an exception
          • 029 2053 5681
          • Human being might work it out
          • Power dialler won’t
      • 02/03/08
        ◦ US = 3rd February 08
        ◦ UK = 2nd March 08
        ◦ Which century?
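The date ambiguity can be reproduced with Python's standard `datetime.strptime`: the same string parses cleanly under both conventions and yields two different dates. This is a small sketch, not part of the original talk.

```python
from datetime import datetime

raw = "02/03/08"

# Month-first (US) and day-first (UK) readings of the same text.
us_reading = datetime.strptime(raw, "%m/%d/%y").date()
uk_reading = datetime.strptime(raw, "%d/%m/%y").date()

print(us_reading.isoformat())  # 2008-02-03
print(uk_reading.isoformat())  # 2008-03-02
```

Neither call fails, so the error is silent. The century is also a guess: Python's `%y` follows the POSIX convention of mapping 69 to 99 onto 1969 to 1999 and 00 to 68 onto 2000 to 2068, which may or may not match what the person entering the data meant.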
  11. 11. How do we avoid these?
      • Make field syntax as tight as possible
        ◦ E.g. always use date-stamp fields for dates
        ◦ Use external validation systems
          • E.g. Postal Address File
        ◦ Use masks to validate input patterns
          • Use carefully, still allows cheating
        ◦ Use drop-down lists from reference tables
  12. 12. Inconsistent Data
      • Intra-record inconsistency:
        ◦ Gender=“m”, Marital-Status=“Wife”
      • Inter-record inconsistency:
        ◦ R1: VIN=VF7N1KFXF36772582; Registration Mark=T87BRB
        ◦ R2: VIN=VF7N1KFXF36772582; Registration Mark=CC04PNL
      • Inter-datastore inconsistency:
        ◦ E.g. customer data in many data stores
  13. 13. How do we avoid these?
      • “Common sense validation”
        ◦ Men cannot be wives
      • But: what is the correct value?
      • So: don’t over-specify
        ◦ Marital status?
        ◦ Better: Relationship Status
          • Legally Married
          • In Civil Partnership
          • Unmarried
          • Divorced
  14. 14. Careful of surrogate keys
      • Entities can often be identified in different ways
        ◦ NI Number
        ◦ NHS Number
      • These are surrogate keys
      • All key fields should be unique
      • VIN example could not have arisen if the field had been required to be unique
      • Nor would the SSN example earlier
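The point about uniqueness can be shown with the VIN example and an in-memory SQLite database from Python's standard library. This is a minimal sketch: the table and column names are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: declaring vin UNIQUE makes the database enforce the rule.
conn.execute(
    "CREATE TABLE vehicle ("
    " vin TEXT NOT NULL UNIQUE,"
    " registration_mark TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO vehicle VALUES (?, ?)", ("VF7N1KFXF36772582", "T87BRB")
)

try:
    # Second record with the same VIN, as in R2 on the earlier slide.
    conn.execute(
        "INSERT INTO vehicle VALUES (?, ?)", ("VF7N1KFXF36772582", "CC04PNL")
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM vehicle").fetchone()[0]
print(duplicate_rejected, row_count)
```

The constraint stops the second record from being stored, but it does not say which registration mark is right; that still has to be resolved by a person or against an authoritative source.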
  15. 15. Root Cause
      • Often historically poor data quality
        ◦ NI numbers poorly administered
          • Many-to-many relationships!
      • Keys not unique in practice
        ◦ Allows for new errors in data entry
  16. 16. An Aside: Checksums
      • Checksums are an ancient technique to validate input data
        ◦ Additional digit attached to key
        ◦ Derived from key bytes
        ◦ Mis-keying almost always generates a mismatch
      • Not part of key, so store separately if at all
      • Better to generate key automatically and validate against existing keys
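A check digit of the kind described can be sketched as follows. The slide does not name a scheme, so the Luhn algorithm is used here purely as a familiar example, and the function names are illustrative.

```python
def luhn_check_digit(key: str) -> int:
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk from the right; double every second digit, starting with the last.
    for position, char in enumerate(reversed(key)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def is_valid(key_with_check: str) -> bool:
    """True if the final digit matches the check digit of the rest."""
    key, check = key_with_check[:-1], int(key_with_check[-1])
    return luhn_check_digit(key) == check


print(luhn_check_digit("7992739871"))  # 3
print(is_valid("79927398713"))         # True
print(is_valid("79927398710"))         # False: last digit mis-keyed
print(is_valid("97927398713"))         # False: first two digits swapped
```

Luhn catches every single-digit error and most adjacent swaps, but not all of them: swapping an adjacent 0 and 9 leaves the check digit unchanged, so a single check digit reduces mis-keying rather than eliminating it.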
  17. 17. Inaccurate Data
      • Statistical outliers & other “sore thumbs”
        ◦ E.g. price 10 times higher than similar model
        ◦ River temperature > 100 °C
        ◦ Gas bill orders of magnitude too high
      • Transposed Digits
        ◦ Accountancy packages have lots of tricks to find these
      • Spurious Accuracy
        ◦ Wall length in mm
        ◦ Averages computed to too many places
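Two of these checks are easy to sketch. The first flags values far from the median of comparable items; the second uses a well-known bookkeeping trick, namely that swapping two digits changes a number by a multiple of 9. The threshold `factor=5.0`, the sample prices and the function names are assumptions for illustration only.

```python
from statistics import median


def sore_thumbs(values, factor=5.0):
    """Return values more than `factor` times above or below the median.

    Assumes positive values, such as prices.
    """
    mid = median(values)
    return [v for v in values if v > mid * factor or v < mid / factor]


def may_be_transposition(entered: int, expected: int) -> bool:
    """True if the difference is a non-zero multiple of 9 (a hint, not proof)."""
    diff = abs(entered - expected)
    return diff != 0 and diff % 9 == 0


prices = [9995, 10250, 9800, 99950, 10100]  # hypothetical prices of similar models
print(sore_thumbs(prices))                  # [99950]
print(may_be_transposition(1540, 1450))     # True: 90 is a multiple of 9
print(may_be_transposition(1540, 1451))     # False
```

Both checks only raise a question. A flagged price may be genuine, and a difference divisible by 9 may be a coincidence, so the result should prompt a review rather than an automatic correction.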
  18. 18. Incomprehensible Data
      • The facts could meet all the previous strictures but still be useless
      • They must be put in context
  19. 19. Data in Context
      • 3.142 is a fact
      • Gertie 3.142 2005-02-02 is data
      • With column headings it is becoming “Data in Context”:
          Name    Height  Measurement Date
          Gertie  3.142   2005-02-02
      • Still need
        ◦ units for Height (metres)
        ◦ Date rules (ISO 8601)
        ◦ …
  20. 20. No Context => Expensive errors
      • Mars Climate Orbiter
        ◦ Discrepancies observed in approach but not formally noted
        ◦ Spacecraft vanished during insertion into orbit
        ◦ Engineers specified forces to be applied in pounds-force (lbf), not newtons
        ◦ Factor of 4.45 difference!
        ◦ Mars Polar Lander was lost soon afterwards too, though most probably through a different software fault rather than a units error
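One way to keep the context attached to the number is to store the unit with the value and convert explicitly. The sketch below is illustrative only: the `Force` class and its names are assumptions, and the only fixed fact is the definition of one pound-force as 4.4482216152605 newtons.

```python
from dataclasses import dataclass

NEWTONS_PER_LBF = 4.4482216152605  # definition of the pound-force in newtons


@dataclass(frozen=True)
class Force:
    """A force value that always carries its unit ("N" or "lbf")."""

    value: float
    unit: str

    def in_newtons(self) -> float:
        if self.unit == "N":
            return self.value
        if self.unit == "lbf":
            return self.value * NEWTONS_PER_LBF
        # Refuse to guess: an unknown unit is an error, not a default.
        raise ValueError(f"unknown force unit: {self.unit!r}")


thrust = Force(100.0, "lbf")
print(round(thrust.in_newtons(), 1))  # 444.8
```

A bare `100.0` passed between two teams can be read either way; a `Force(100.0, "lbf")` cannot, and an unrecognised unit fails loudly instead of producing a plausible wrong number.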
  21. 21. More examples
      • Redefining field usage on the fly
        ◦ 2-byte field in database but highest value <256
        ◦ Project team seeks to avoid cost of inserting new field
        ◦ Redefines field in code to be two 1-byte fields
        ◦ Existing reports start giving odd results but nobody notices
        ◦ Wrong business decisions made
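The effect on existing reports can be reproduced with Python's `struct` module. In this sketch the two new 1-byte values (called `status` and `region` here) and the big-endian layout are assumptions made for illustration.

```python
import struct

# New code stores two 1-byte values in the old 2-byte field.
stored = struct.pack("BB", 12, 3)

# New code reads them back correctly.
status, region = struct.unpack("BB", stored)

# An existing report still reads the same bytes as one 16-bit number
# (big-endian assumed), so it sees 12 * 256 + 3.
(legacy_value,) = struct.unpack(">H", stored)

print(status, region)  # 12 3
print(legacy_value)    # 3075
```

Nothing fails and no value is out of range for a 2-byte field, which is why reports built on the old definition can keep running and quietly give odd results.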
  22. 22. Information
      • Information is
        ◦ What sentient beings use to:
          • Facilitate decision-making
          • Communicate
        ◦ Only humans are sentient (so far)
        ◦ Information only exists when humans are in the value chain
      • Machine-machine communication
        ◦ Data in context
  23. 23. What is Quality Information
      • Conveys the right “impression”
        ◦ Trespassing on Conrad’s territory
        ◦ We’ll look at some graphical examples
      • Takes into account cultural differences
        ◦ “Wait while the red light flashes”
  24. 24. Phone Number example again
      • 01222 331988
      • I see that and “know” that it is wrong
      • I could programme the rule to convert an erroneously converted number
        ◦ 01222 => 029
        ◦ Prefix subscriber number with 20
      • But 029 is officially the code for Wales and other prefixes will appear, 21 already in use.
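The conversion rule on this slide (replace 01222 with 029 and prefix the subscriber number with 20) can be written down directly. This is a sketch under the slide's own assumptions; the function name and the output spacing are illustrative, and anything that does not look like an old 01222 number is returned unchanged.

```python
def convert_old_cardiff(number: str) -> str:
    """Rewrite an old 01222 number in the 029 20xx xxxx form."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) == 11 and digits.startswith("01222"):
        subscriber = "20" + digits[5:]  # six digits become eight
        return f"029 {subscriber[:4]} {subscriber[4:]}"
    return number  # not an old-style number: leave it alone


print(convert_old_cardiff("01222 535681"))  # 029 2053 5681
print(convert_old_cardiff("01222 331988"))  # 029 2033 1988
```

The rule only works in this one direction and only for numbers that were valid before the change. As the slide notes, other prefixes such as 21 are already in use, so a number cannot be validated simply by checking that it starts with 029 20.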
  25. 25. Information Presentation
      • Which of these companies would you rather buy into?
      [Figure: company charts, not reproduced in this transcript; axis values 1 to 8]
  26. 26. Illegality
      • US accounting rules now outlaw chart manipulations
      • Money Laundering rules
        ◦ Managers could go to prison
      • Basel II and Sarbanes-Oxley
        ◦ Directors could go to prison
  27. 27. Data Quality is Free
      • Poor Data Quality routinely costs 10-30% of turnover
      • Particular issues can be catastrophic
        ◦ Regulator can fine companies
        ◦ People can sue
        ◦ Officers and directors could go to jail
      • Data Quality is better than Free
      • But needs to be worked at
  28. 28. No IQ without DQ
      • Cannot have good Information Quality
        ◦ Without good quality data
      • Information Quality is a business issue
        ◦ Needs complete commitment
        ◦ Very strong management process
      • Information is the Third Asset
        ◦ It is not a cost centre
        ◦ It is not reflected on the bottom line
        ◦ Yet
  29. 29. Any Questions? keith.underdown@iaidq.org
