Scalable Data Quality

  1. 1. Using Metadata to Drive Data Quality: Hunting the Data Dust Bunnies. John Murphy, Apex Solutions, Inc. NoCOUG, 11-13-2003
  2. 2. Presentation Outline <ul><li>The Cost - It’s always funny when it’s someone else… </li></ul><ul><li>Quality - Quality Principles and The Knowledge Worker </li></ul><ul><li>Data Quality </li></ul><ul><li>Data Development </li></ul><ul><li>Metadata Management </li></ul><ul><li>Profile and Baseline Statistics </li></ul><ul><li>Vendor Tools </li></ul><ul><li>Wrap-up </li></ul><ul><li>Some light reading </li></ul>
  3. 3. The Cost <ul><li>1. The Cost… </li></ul>
  4. 4. The Cost… <ul><ul><li>“ Quality is free. What’s expensive is finding out how to do it right the first time.” Philip Crosby </li></ul></ul><ul><ul><li>A major credit service was ordered to pay $25M for releasing hundreds of customer names from one bank to another bank because a confidentiality indicator was not set. </li></ul></ul><ul><ul><li>The stock value of a major health care insurer dropped 40% because analysts reported the insurer was unable to determine which “members” were active paying policy holders. Stock value dropped $3.7 billion in 72 hours. </li></ul></ul><ul><ul><li>The sale/merger of a major cable provider was delayed 9 months while the target company determined how many households it had under contract. Three separate processes calculated three values with an 80% discrepancy. </li></ul></ul>
  5. 5. The Cost… <ul><ul><li>The US Attorney General determined that 14% of all health care dollars, or approximately $23 billion, were the result of fraud or inaccurate billing. </li></ul></ul><ul><ul><li>The DMA estimates that greater than 20% of the customer information housed in its members’ databases is inaccurate or unusable. </li></ul></ul><ul><ul><li>A major Telco has over 350 CUSTOMER tables, with CUSTOMER data repeated as many as 40 times and 22 separate systems capable of generating or modifying customer data. </li></ul></ul><ul><ul><li>A major communications company could not invoice 2.6% of its customers because the addresses provided were non-deliverable. Total cost > $85M annually. </li></ul></ul><ul><ul><li>A State Government annually sends out 300,000 motor vehicle registration notices with up to 20% undeliverable addresses. </li></ul></ul><ul><ul><li>A B2B office supply company calculated that it saves an average of 70 cents per line item through web sales, based on data entry validation at the source of order entry. </li></ul></ul>
  6. 6. The Cost TDWI - 2002
  7. 7. The Regulatory Challenges <ul><li>No more “Corporate” data </li></ul><ul><ul><li>New Privacy Regulations </li></ul></ul><ul><ul><ul><li>Direct Marketing Access </li></ul></ul></ul><ul><ul><ul><li>Telemarketing Access </li></ul></ul></ul><ul><ul><ul><li>Opt – In / Opt - Out </li></ul></ul></ul><ul><ul><li>New Customer Managed Data Regulations </li></ul></ul><ul><ul><ul><li>HIPAA </li></ul></ul></ul><ul><ul><li>New Security Regulations </li></ul></ul><ul><ul><ul><li>Insurance Black List Validation </li></ul></ul></ul><ul><ul><ul><li>Bank Transfer Validation </li></ul></ul></ul><ul><ul><li>Business Management </li></ul></ul><ul><ul><ul><li>Certification of Financial Statements </li></ul></ul></ul><ul><ul><ul><li>Sarbanes – Oxley </li></ul></ul></ul><ul><ul><li>These have teeth… </li></ul></ul>
  8. 8. Sources of Data Non-Quality TDWI - 2002
  9. 9. Data Development and Data Quality <ul><li>2. Data Quality </li></ul>
  10. 10. Data Quality Process <ul><li>There is a formal process to quantitatively evaluate the quality of corporate data assets. </li></ul><ul><li>The process outlined here is based on Larry English’s Total data Quality Management (TdQM) </li></ul><ul><ul><li>Audit the current data resources </li></ul></ul><ul><ul><li>Assess the Quality </li></ul></ul><ul><ul><li>Measure Non-quality Costs </li></ul></ul><ul><ul><li>Reengineer and Cleanse data assets </li></ul></ul><ul><ul><li>Update Data Quality Process </li></ul></ul>
  11. 11. Determination of Data Quality TDWI -2002
  12. 12. Data Quality Process <ul><li>Develop a Data Quality Process to Quantitatively evaluate the Quality of Corporate Data Assets. </li></ul><ul><li>Establish the Metadata Repository </li></ul><ul><li>Implement Data Development and Standardization Process </li></ul><ul><li>Profile and Baseline your Data </li></ul><ul><li>Use the Metadata to Improve your Data Quality </li></ul><ul><li>Revise The Data Quality Process </li></ul>
  13. 13. Determination of Data Quality TDWI - 2002
  14. 14. Customer Satisfaction Profile <ul><li>Determine who the most consistent users of specific data entities are. Select a sample set of attributes to be reviewed. </li></ul><ul><li>Publish the metadata report for the selected attributes. </li></ul><ul><li>Select representatives from the various business areas and knowledge workers to review the selected attributes and metadata. </li></ul><ul><li>Distribute the questionnaires, then retrieve and score the results. </li></ul><ul><li>Report on and distribute the results. </li></ul>
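The "retrieve and score" step can be sketched as code. This is an illustrative sketch, not from the deck: the rating categories come from the assessment form, but the numeric scale, the 80% acceptability threshold, and the function name are assumptions.

```python
# Hypothetical scoring of one attribute's metadata-quality survey responses.
# Scale values and the 0.8 threshold are invented for illustration.
SCALE = {"Above Expectation": 4, "Meets Expectation": 3,
         "Below Expectation": 2, "Unusable": 1}   # "Not Used" is excluded

def score_attribute(responses, threshold=0.8):
    """Score responses as a fraction of the maximum possible score,
    and report whether the attribute clears the acceptability threshold."""
    rated = [SCALE[r] for r in responses if r in SCALE]
    if not rated:
        return None, False              # nobody rated it; flag for review
    score = sum(rated) / (4 * len(rated))
    return score, score >= threshold

score, ok = score_attribute(["Meets Expectation", "Above Expectation",
                             "Below Expectation", "Not Used"])
print(round(score, 2), ok)              # 0.75 False -> below threshold
```

Attributes that fall below the threshold would feed the "Problem Metadata" bucket shown on the following result slides.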
  15. 15. Data Quality Assessments (assessment form)
Attribute Name: ____  Attribute Mnemonic: ____  Date / Status / Version: ____
Rate each item as Above Expectation, Meets Expectation, Below Expectation, Unusable, or Not Used:
Business Names are clear and understandable
Data definitions conform to standards
Data domain values are correct and complete
Data mnemonics are consistent and understandable
Lists of valid codes are complete and correct
The business rules are correct and complete
The data has value to the business
The refresh frequency is correct
The Data Steward is correct
The example data is correct
  16. 16. Quality Assessment Results Acceptability Threshold Problem Metadata
  17. 17. Quality Assessment Results Improving… Acceptability Threshold
  18. 18. Quality Assessment Results Got it Right! Acceptability Threshold
  19. 19. Data Development and Standardization <ul><li>4. Building Data </li></ul>
  20. 20. Data Standardization Process <ul><li>Data Development and Approval Process </li></ul>Integrated Data Model (Data Elements) Resolved Issues Data Requirements Issue Resolution Stewards Architect Proposal Package Data Architect Data Architect Functional Review Technical Review Data Administrator Issues MDR
  21. 21. Data Standardization Process <ul><li>Proposal Package – Data Model, Descriptive Information, Organization Information, Integration Information, Tool Specific Information </li></ul><ul><li>Technical Review – Model Compliance, Metadata Complete and accurate </li></ul><ul><li>Functional Review – Integration with Enterprise Model </li></ul><ul><li>Issue Resolution – Maintenance and Management </li></ul><ul><li>Total Process < 30 days </li></ul><ul><li>All based on an integrated web accessible application. Results integrated to the Enterprise Metadata Repository. </li></ul>
  22. 22. Data Standardization <ul><li>Getting to a single view of the truth </li></ul><ul><li>Getting to a corporate owned process of data and information management. </li></ul>Describe Existing Data Assets Addition of new business needs
  23. 23. Data Development Process <ul><li>There is a formal process for development, certification, modification and retirement of data. </li></ul><ul><ul><li>Data Requirements lead directly to Physical Data Structures. </li></ul></ul><ul><ul><li>Data Products lead directly to Information Products. </li></ul></ul><ul><li>The Data Standards Evaluation Guide </li></ul><ul><ul><li>Enterprise level not subject area specific </li></ul></ul><ul><ul><ul><li>I can use “customer” throughout the organization </li></ul></ul></ul><ul><ul><ul><li>I can use “Quarterly Earnings” throughout the organization </li></ul></ul></ul><ul><ul><li>All the data objects have a common minimal set of attributes dependent upon their type. </li></ul></ul><ul><ul><ul><li>All data elements have a name, business name, data type, length, size or precision, collection of domain values etc. </li></ul></ul></ul><ul><ul><li>There are clear unambiguous examples of the data use </li></ul></ul><ul><ul><li>The data standards are followed by all development and management teams. </li></ul></ul><ul><ul><li>The same data standards are used to evaluate internally derived data as well as vendor acquired data. </li></ul></ul>
  24. 24. Data Standardization Process <ul><li>Standardization is the basis of modeling – Why Model? </li></ul><ul><ul><li>Find out what you are doing so you can do it better </li></ul></ul><ul><ul><li>Discover data </li></ul></ul><ul><ul><li>Identify sharing partners for processes and data </li></ul></ul><ul><ul><li>Build framework for database that supports business </li></ul></ul><ul><ul><li>Establish data stewards </li></ul></ul><ul><ul><li>Identify and eliminate redundant processes and data </li></ul></ul><ul><li>Check out ICAM / IDEF… </li></ul>
  25. 25. An Example Process Model (IDEF0 diagram): activity A0, Conduct Procurement. Diagram labels: Funding; Statutes, Regulations & Policies; Acquisition Guidance; Requirement Package; Proposed Programs & Procurement Issues; Communication from Contractor; Industry Resource Data; Notification to Vendor; Purchase Solicitation Announcement; Purchase Performance Analysis; Company Support Team; Purchase Officer
  26. 26. Zachman Framework (rows are perspectives; columns are Data, Function, Network, People, Time, Motivation)
Objectives / Scope: Data: list of things important to the enterprise; Function: list of processes the enterprise performs; Network: list of locations where the enterprise operates; People: list of organizational units; Time: list of business events / cycles; Motivation: list of business goals / strategies
Model of the Business: Data: entity relationship diagram (including m:m, n-ary, attributed relationships); Function: business process model (physical data flow diagram); Network: logistics network (nodes and links); People: organization chart, with roles, skill sets, security issues; Time: business master schedule; Motivation: business plan
Model of the Information System: Data: data model (converged entities, fully normalized); Function: essential data flow diagram, application architecture; Network: distributed system architecture; People: human interface architecture (roles, data, access); Time: dependency diagram, entity life history (process structure); Motivation: business rule model
Technology Model: Data: data architecture (tables and columns), map to legacy data; Function: system design (structure chart, pseudo-code); Network: system architecture (hardware, software types); People: user interface (how the system will behave), security design; Time: "control flow" diagram (control structure); Motivation: business rule design
Detailed Representation: Data: data design (denormalized), physical storage design; Function: detailed program design; Network: network architecture; People: screens, security architecture (who can see what?); Time: timing definitions; Motivation: rule specification in program logic
Functioning System (working systems): Data: converted data; Function: executable programs; Network: communications facilities; People: trained people; Time: business events; Motivation: enforced rules
  27. 27. The Data Model… <ul><li>Data Model – A description of the organization of data in a manner that reflects the information structure of an enterprise </li></ul><ul><li>Logical Data Model – User perspective of enterprise information. Independent of target database or database management system </li></ul><ul><li>Entity – Person, Place, Thing or Concept </li></ul><ul><li>Attribute – Detail descriptive information associated with an Entity </li></ul><ul><li>Relation – A business rule applied to one or more entities </li></ul><ul><li>Element – A named identifier of each of the entities and their attributes that are to be represented in a database </li></ul>
  28. 28. Rules to Entities and Attributes <ul><ul><li>There is more than one state . </li></ul></ul><ul><ul><li>Each state may contain multiple cities . </li></ul></ul><ul><ul><li>Each city is always associated with a state. </li></ul></ul><ul><ul><li>Each city has a population . </li></ul></ul><ul><ul><li>Each city may maintain multiple roads . </li></ul></ul><ul><ul><li>Each road has a repair status . </li></ul></ul><ul><ul><li>Each state has a motto . </li></ul></ul><ul><ul><li>Each state adopts a state bird . </li></ul></ul><ul><ul><li>Each state bird has a color . </li></ul></ul>Resulting entities and elements: STATE (STATE Code); CITY Name; CITY POPULATION Quantity; CITY ROAD Name; CITY ROAD REPAIR Status; STATE MOTTO Text; STATE BIRD Name; STATE BIRD COLOR Name
  29. 29. Resulting Populated Tables (3NF)
STATE (State Code | State Motto | State Bird Name): VA, “ “, Cardinal; MD, “ “, Oriole; AZ, “ “, Cactus Wren; IL, “ “, Cardinal; MA, “ “, Chickadee
STATE CITY (State Code | City Name | City Pop.): VA, Alexandria, 200K; MD, Annapolis, 50K; MD, Baltimore, 1500K; AZ, Tucson, 200K; IL, Springfield, 40K; MA, Springfield, 45K
STATE BIRD (State Bird Name | State Bird Color): Cardinal, Red; Oriole, Black; Cactus Wren, Brown; Chickadee, Brown
CITY ROAD (State Code | City Name | City Road Name | City Road Repair Status): VA, Alexandria, Route 1, 2; VA, Alexandria, Franconia, 1; MD, Annapolis, Franklin, 1; MD, Baltimore, Broadway, 3; AZ, Tucson, Houghton, 2; AZ, Tucson, Broadway, 2; IL, Springfield, Main, 3; MA, Springfield, Concord, 1
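Two of these 3NF tables can be sketched in code to show why the composite key matters. This is a sketch in SQLite (not part of the deck); types and a subset of the sample rows are taken from the slide, and the snake_case names are my own.

```python
import sqlite3

# Sketch: the STATE and STATE CITY tables from the slide, in SQLite.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE state (state_code TEXT PRIMARY KEY, state_motto TEXT);
CREATE TABLE state_city (
    state_code TEXT REFERENCES state(state_code),
    city_name  TEXT,
    city_pop   INTEGER,
    PRIMARY KEY (state_code, city_name));
""")
con.executemany("INSERT INTO state VALUES (?, ?)",
                [("VA", ""), ("MD", ""), ("IL", ""), ("MA", "")])
con.executemany("INSERT INTO state_city VALUES (?, ?, ?)",
                [("VA", "Alexandria", 200000), ("MD", "Annapolis", 50000),
                 ("IL", "Springfield", 40000), ("MA", "Springfield", 45000)])

# "Springfield" alone is ambiguous (IL vs. MA); the composite key
# (state_code, city_name) keeps the two rows distinct.
rows = con.execute("""SELECT state_code, city_pop FROM state_city
                      WHERE city_name = 'Springfield'
                      ORDER BY state_code""").fetchall()
print(rows)   # [('IL', 40000), ('MA', 45000)]
```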
  30. 30. Example Entity, Attributes, Relationships (State Model diagram): STATE (STATE Code, STATE MOTTO Text, STATE BIRD Name (FK)); STATE CITY (STATE Code (FK), CITY Name, CITY POPULATION Quantity); CITY ROAD (STATE Code (FK), CITY Name (FK), CITY ROAD Name, CITY ROAD REPAIR Status); STATE BIRD (STATE BIRD Name, STATE BIRD COLOR Name). Relationships: STATE Contains STATE CITY; STATE CITY Maintains CITY ROAD; STATE BIRD Adopted by / Adopts STATE; STATE BIRD Becomes Road Kill on / Kills CITY ROAD
  31. 31. Data Standardization <ul><li>Data Element Standardization – The process of documenting, reviewing, and approving unique names, definitions, characteristics, and representations of data elements according to established procedures and conventions. </li></ul>Standard Data Element Structure (diagram): Prime Word (required, exactly 1) + Property Modifier(s) (0..n) + Class Word Modifier(s) (0..n) + Class Word; the Class Word corresponds to the Generic Element
  32. 32. The Generic Element <ul><li>The Generic Element - The part of a data element that establishes a structure and limits the allowable set of values of a data element. Generic elements classify the domains of data elements. Generic elements may have specific or general domains. </li></ul><ul><li>Examples – Code, Amount, Weight, Identifier </li></ul><ul><li>Domains – The range of values associated with an element. General Domains can be infinite ranges as with an ID number or Fixed as with a State Code. </li></ul>
  33. 33. Standardized Data Element
Element Name: Person Eye Color Code
Access Name: PR-EY-CLR-CD
Definition Text: The code that represents the natural pigmentation of a person’s iris
Domain values (EXAMPLE): BK Black; BL Blue; BR Brown; GR Green; GY Gray; HZ Hazel; VI Violet
Authority Reference Text: U.S. Code title 10, chapter 55
Steward Name: USD (P&R)
  34. 34. Standards <ul><li>Name Standards </li></ul><ul><ul><li>Comply with format </li></ul></ul><ul><ul><li>Single concept, clear, accurate and self explanatory </li></ul></ul><ul><ul><li>According to functional requirements, not physical considerations </li></ul></ul><ul><ul><li>Upper and lower case alphabetic characters, hyphens (-) and spaces ( ) </li></ul></ul><ul><ul><li>No abbreviations or acronyms, conjunctions, plurals, articles, verbs or class words used as modifiers or prime words </li></ul></ul><ul><li>Definition Standards </li></ul><ul><ul><li>What the data is, not HOW, WHERE or WHEN used or WHO uses it </li></ul></ul><ul><ul><li>Add meaning to the name </li></ul></ul><ul><ul><li>One interpretation; no multiple-purpose phrases, unfamiliar technical terms, abbreviations or acronyms </li></ul></ul>
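The mechanical parts of these name standards lend themselves to an automated check. A minimal sketch, assuming the character and class-word rules above; the class-word list is drawn from examples elsewhere in the deck, and the function name and error strings are invented.

```python
import re

# Hypothetical name-standards checker for the rules on the slide.
CLASS_WORDS = {"Code", "Amount", "Weight", "Identifier", "Name",
               "Quantity", "Text", "Status"}   # examples from this deck

def check_element_name(name):
    """Return a list of standards violations for a proposed element name."""
    errors = []
    # Only upper/lower case letters, hyphens and spaces are allowed.
    if not re.fullmatch(r"[A-Za-z][A-Za-z -]*", name):
        errors.append("only letters, hyphens and spaces are allowed")
    words = name.split()
    # A standardized name ends in a class word...
    if words and words[-1] not in CLASS_WORDS:
        errors.append("name must end in a class word")
    # ...and class words may not appear as modifiers or prime words.
    if any(w in CLASS_WORDS for w in words[:-1]):
        errors.append("class words may not be used as modifiers")
    return errors

print(check_element_name("Person Eye Color Code"))   # [] -> compliant
print(check_element_name("Cust_No"))                 # two violations
```

A check like this would sit naturally in the Technical Review step of the data standardization process.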
  35. 35. Integration of the Data Through Metadata <ul><li>Data Integration by Subject area </li></ul>Subject Area 1 Subject Area 2 Subject Area 3
  36. 36. Data Model Integration <ul><li>Brings together (joins) two or more approved Data Model views </li></ul><ul><li>Adds to the scope and usability of the Corporate Data Model (EDM) </li></ul><ul><li>Continues to support the activities of the department that the individual models were intended to support </li></ul><ul><li>Enables the sharing of information between the functional areas or components which the Data Models support </li></ul>
  37. 37. Enterprise Data Model Use of Enterprise Data Model ORGANIZATION System Models Standard Metadata & Schemas Component Views Component Models Functional Views Functional Models SECURITY- LEVEL ORGANIZATION- SECURITY-LEVEL PERSON-SECURITY- LEVEL
  38. 38. Metadata Management <ul><li>5. Metadata Management </li></ul>
  39. 39. Data in Context! Mr. End User Sets Context For His Data.
  40. 40. Metadata <ul><li>Metadata is the data about data… Huh? </li></ul><ul><li>Metadata is the descriptive information used to set the context and limits around a specific piece of data. </li></ul><ul><ul><li>The metadata lets data become discrete and understandable by all communities that come in contact with a data element. </li></ul></ul><ul><ul><li>Metadata is the intersection of certain facts about data that lets the data become unique. </li></ul></ul><ul><ul><li>It makes data unique, understood and unambiguous. </li></ul></ul><ul><ul><li>The accumulation of metadata creates a piece of data. The more characteristics about the data you have, the more unique and discrete the data can be. </li></ul></ul>
  41. 41. Relevant Metadata <ul><ul><li>Technical - Information on the physical warehouse and data. </li></ul></ul><ul><ul><li>Operational / Business - Rules on the data and content </li></ul></ul><ul><ul><li>Administrative - Security, Group identification etc. </li></ul></ul><ul><ul><li>The meta model is the standard content defining the attributes of any given data element in any one of these models. The content should address the needs of each community that comes in contact with the data element. The meta model components make the data element unique to each community and sub community. </li></ul></ul>
  42. 42. Acquiring the Metadata <ul><li>Data Modeling Tools – API and Extract to Repository </li></ul><ul><li>Reverse Engineered RDBMS – Export Extract </li></ul><ul><li>ETL Tools – Data mapping, Source to Target Mapping </li></ul><ul><li>Scheduling Tools – Refresh Rates and Schedules </li></ul><ul><li>Business Intelligence Tools – Retrieval Use </li></ul><ul><li>Current Data Dictionary </li></ul>
  43. 43. Technical Metadata <ul><li>Physical Descriptive Qualities </li></ul><ul><ul><li>Standardized Name </li></ul></ul><ul><ul><li>Mnemonic </li></ul></ul><ul><ul><li>Data type </li></ul></ul><ul><ul><li>Length </li></ul></ul><ul><ul><li>Precision </li></ul></ul><ul><ul><li>Data definition </li></ul></ul><ul><ul><li>Unit of Measure </li></ul></ul><ul><ul><li>Associated Domain Values </li></ul></ul><ul><ul><li>Transformation Rules </li></ul></ul><ul><ul><li>Derivation Rule </li></ul></ul><ul><ul><li>Primary and Alternate Source </li></ul></ul><ul><ul><li>Entity Association </li></ul></ul><ul><ul><li>Security And Stability Control </li></ul></ul>
  44. 44. Administrative and Operational Metadata <ul><li>Relates the Business perspective to the end user and Manages Content </li></ul><ul><ul><li>Retention period </li></ul></ul><ul><ul><li>Update frequency </li></ul></ul><ul><ul><li>Primary and Optional Sources </li></ul></ul><ul><ul><li>Steward for Element </li></ul></ul><ul><ul><li>Associated Process Model </li></ul></ul><ul><ul><li>Modification History </li></ul></ul><ul><ul><li>Associated Requirement Document </li></ul></ul><ul><ul><li>Business relations </li></ul></ul><ul><ul><li>Aggregation Rules </li></ul></ul><ul><ul><li>Subject area oriented to ensure understanding by end user </li></ul></ul>
  45. 45. The Simple Metamodel Entity Entity Alias Individual View Attribute Encoding / Lookup Tables Attribute Alias Attribute Default Relationship Source System Subject Area Individual
  46. 46. The Common Meta Model Based on Tannenbaum
  47. 47. The Common Warehouse Metamodel
  48. 48. Required Data Element Technical Metadata <ul><li>Name </li></ul><ul><li>Mnemonic </li></ul><ul><li>Definition </li></ul><ul><li>Data value source list text </li></ul><ul><li>Decimal place count quantity </li></ul><ul><li>Authority reference text </li></ul><ul><li>Domain definition text </li></ul><ul><li>Domain value identifiers </li></ul><ul><li>Domain value definition text </li></ul><ul><li>High and low range identifiers </li></ul><ul><li>Maximum character count quantity </li></ul><ul><li>Proposed attribute functional data steward </li></ul><ul><li>Functional area identification code </li></ul><ul><li>Unit measure name </li></ul><ul><li>Data type name </li></ul><ul><li>Security classification code </li></ul><ul><li>Creation Date </li></ul>
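A repository loader can enforce this required-metadata list at submission time. The sketch below is illustrative, not the deck's implementation: the snake_case field names are adapted from the slide, only a subset of fields is shown, and the completeness check is my own.

```python
from dataclasses import dataclass, field, fields

# Hypothetical record type for a subset of the required technical metadata,
# so incomplete data element submissions can be rejected before certification.
@dataclass
class DataElement:
    name: str = ""
    mnemonic: str = ""
    definition: str = ""
    data_type_name: str = ""
    unit_measure_name: str = ""
    data_steward: str = ""
    security_classification_code: str = ""
    domain_values: dict = field(default_factory=dict)

    def missing(self):
        """Names of required string fields still left blank."""
        return [f.name for f in fields(self)
                if f.type is str and not getattr(self, f.name)]

# Example reusing the standardized element from slide 33.
e = DataElement(name="Person Eye Color Code", mnemonic="PR-EY-CLR-CD",
                definition="The code that represents the natural "
                           "pigmentation of a person's iris",
                domain_values={"BK": "Black", "BL": "Blue"})
print(e.missing())
```

Here `missing()` reports the four fields the submission still lacks (data type, unit of measure, steward, security classification), which a review process could bounce back to the proposer.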
  49. 49. Use of The Enterprise Tools Enterprise Data Dictionary System (EDDS) Enterprise Data Model (EDM) Migration/New Information systems Prime Words Data Elements Metadata Entities Attributes Relationships (business rules) Database Tables Database Columns Database Rows Enterprise Data Repository (EDR) Database Dictionary Associations and Table Joins
  50. 50. Profile the Data <ul><li>6. Profile and Baseline Data </li></ul>
  51. 51. Audit Data <ul><li>Establish Table Statistics </li></ul><ul><ul><li>Total Size in bytes including Indexes </li></ul></ul><ul><ul><li>When was it last refreshed </li></ul></ul><ul><ul><li>Is referential Integrity applied </li></ul></ul><ul><li>Establish Row Statistics </li></ul><ul><ul><li>How many rows </li></ul></ul><ul><ul><li>How many Columns / Table </li></ul></ul><ul><li>Establish Column Statistics </li></ul><ul><ul><li>How Many Unique Values </li></ul></ul><ul><ul><li>How many Null Values </li></ul></ul><ul><ul><li>How Many Values outside defined domain </li></ul></ul><ul><ul><li>If a Key value, how many duplicates </li></ul></ul>
  52. 52. Some Simple Statistics <ul><li>In Oracle – Run ANALYZE against tables, partitions, indexes and clusters. Allows you to set the sample size as a specific % of the total size or a specific number of rows. The default is a 1,064-row sample. </li></ul><ul><li>Example – 5.7 million rows in the Transaction table </li></ul><ul><li>“ analyze table transaction estimate statistics;” </li></ul><ul><ul><li>Statistics are estimated using a 1,064-row sample </li></ul></ul><ul><li>“ analyze table transaction estimate statistics sample 20 percent” </li></ul><ul><ul><li>Statistics are estimated using 1.14 million rows </li></ul></ul><ul><li>Statistics are stored in several views: </li></ul>View | Column Name | Contents
user_tables | num_rows | total rows when analyzed
user_indexes | distinct_keys | number of distinct values in the indexed column
user_tab_col_statistics | num_distinct | number of distinct values in the column
user_part_col_statistics | num_distinct | number of distinct values in the column
  53. 53. Getting Statistics <ul><li>Get the Statistics… </li></ul><ul><li>SQL > select table_name, num_rows </li></ul><ul><li>from user_tables where num_rows is not null; </li></ul><ul><li>TABLE_NAME NUM_ROWS </li></ul><ul><li>--------------------------------------------------------- </li></ul><ul><li>Transaction 5790230 </li></ul><ul><li>Account 1290211 </li></ul><ul><li>Product 308 </li></ul><ul><li>Location 2187 </li></ul><ul><li>Vendors 4203 </li></ul>Alternatively, you can use SELECT COUNT(*) against each table: SQL > select count(*) from transaction; Count(*) --------------------- 5790230
  54. 54. Getting Statistics <ul><li>To determine unique counts of a column in a table: </li></ul><ul><ul><li>SQL> SELECT COUNT(DISTINCT [COLUMN]) FROM [TABLE]; </li></ul></ul><ul><li>To determine the number of NULL values in a column in a table: </li></ul><ul><ul><li>SQL> SELECT COUNT(*) FROM [TABLE] </li></ul></ul><ul><ul><li>WHERE [COLUMN] IS NULL; </li></ul></ul><ul><li>To determine if there are values outside a domain range: </li></ul><ul><li>SQL> SELECT COUNT(*) </li></ul><ul><li>FROM [TABLE] </li></ul><ul><li> WHERE [COLUMN] NOT IN (‘val1’,’val2’,’val3’); </li></ul>
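The same three profiling queries can be run end to end against a small sample. A runnable sketch in SQLite (the deck's examples are Oracle); the table, column, and sample values are invented for illustration.

```python
import sqlite3

# Hypothetical table with one NULL status and one out-of-domain status.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (id INTEGER, status TEXT)")
con.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(1, "OPEN"), (2, "CLOSED"), (3, None),
                 (4, "OPEN"), (5, "BOGUS")])

# Unique counts: COUNT(DISTINCT ...) ignores NULLs.
uniq = con.execute(
    "SELECT COUNT(DISTINCT status) FROM transactions").fetchone()[0]
# NULL count: must use COUNT(*) with IS NULL.
nulls = con.execute(
    "SELECT COUNT(*) FROM transactions WHERE status IS NULL").fetchone()[0]
# Out-of-domain count: note NULLs escape NOT IN, so they are
# counted separately above rather than folded into this check.
bad = con.execute(
    "SELECT COUNT(*) FROM transactions "
    "WHERE status NOT IN ('OPEN', 'CLOSED')").fetchone()[0]
print(uniq, nulls, bad)   # 3 1 1
```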
  55. 55. Getting Usage Statistics <ul><li>What tables are being used? </li></ul><ul><li>With audit on, audit data is loaded to DBA_AUDIT_OBJECT </li></ul><ul><ul><li>Create a table with columns for object_name, owner and hits. </li></ul></ul><ul><ul><li>Insert the data from DBA_AUDIT_OBJECT into your new table. </li></ul></ul><ul><ul><li>Clear out the data in DBA_AUDIT_OBJECT </li></ul></ul><ul><ul><li>Write the following report: </li></ul></ul><ul><ul><ul><li>col obj_name form a30 </li></ul></ul></ul><ul><ul><ul><li>col owner form a20 </li></ul></ul></ul><ul><ul><ul><li>col hits form 99,990 </li></ul></ul></ul><ul><ul><ul><li>select obj_name, owner, hits from aud_summary; </li></ul></ul></ul><ul><ul><ul><li>OBJ_NAME OWNER HITS </li></ul></ul></ul><ul><ul><ul><li>-------------------------------------------------------------------- </li></ul></ul></ul><ul><ul><ul><li>Region Finance 1,929 </li></ul></ul></ul><ul><ul><ul><li>Transaction Sales 18,916,344 </li></ul></ul></ul><ul><ul><ul><li>Account Sales 4,918,201 </li></ul></ul></ul>
  56. 56. Baseline Statistics <ul><li>Based on the statistics collected </li></ul><ul><ul><li>Use these as a baseline and save to meta data repository operational metadata </li></ul></ul><ul><ul><li>Compare with planned statistics generated with Knowledge Workers </li></ul></ul><ul><ul><li>Generate and publish reports covering the data </li></ul></ul><ul><ul><li>Use these as baseline statistics </li></ul></ul><ul><li>Regenerate the statistics on a fixed period basis. </li></ul><ul><ul><li>Compare and track with time </li></ul></ul>
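The "compare and track with time" step can be sketched as a simple drift check. This is an illustrative sketch, not from the deck: the metric names, sample figures, and the 10% tolerance are assumptions.

```python
# Hypothetical baseline comparison: flag any stored statistic whose
# relative change since the baseline period exceeds a tolerance.
def compare_to_baseline(baseline, current, tolerance=0.10):
    """Return {metric: (baseline, current)} for metrics that drifted."""
    drift = {}
    for metric, base in baseline.items():
        now = current.get(metric, 0)
        if base and abs(now - base) / base > tolerance:
            drift[metric] = (base, now)
    return drift

# Sample figures (invented): row growth is normal, but the NULL count
# in a status column has exploded since the baseline was taken.
baseline = {"row_count": 5_790_230, "null_status": 1_200, "dup_keys": 14}
current  = {"row_count": 5_912_400, "null_status": 9_800, "dup_keys": 15}
print(compare_to_baseline(baseline, current))
```

A jump like the NULL-count one here is exactly the kind of signal to correlate with upstream system or process changes before the bad data propagates.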
  57. 57. Quality Assessment <ul><li>Establish Baseline KPI For Data Quality </li></ul><ul><li>Perform Statistics on Sample Sets </li></ul><ul><li>Compare results </li></ul>Time to Develop and Implement Corrective Measures and push process change upstream
  58. 58. Total Error Tracking over Time <ul><li>Set Error reduction Schedule </li></ul><ul><li>Track Errors over time </li></ul><ul><li>Note when new systems or impact processes are added </li></ul>
  59. 59. Performance Assessment Performance Statistics for Monthly DW Load
  60. 60. Daily Performance Statistics Daily Performance Cycles EOM Performance Cycles Normal Daily Load -Daily On Demand Reporting -Batch Reporting End Of Month Daily Performance Cycle -EOM Loads Impacting Normal Daily Reporting - Additional CPU / Swap / Cache/ DB overhead
  61. 61. Monitoring Your Data <ul><li>Source Statistics – Data Distribution, Row Counts etc. </li></ul><ul><li>Schedule Exceptions – Load Abends, Unavailable source or target </li></ul><ul><li>System Statistics – Configuration Stats, System Performance logs </li></ul><ul><li>Change Control – Model / Process change History </li></ul><ul><li>Test Criteria And Scenarios – Scripts, data statistics, test performance </li></ul><ul><li>Meta Model – Metadata (domain values, operational metadata etc.) </li></ul><ul><li>Load Statistics – Value Distribution, row counts, load rates </li></ul><ul><li>Test / Production Data Statistics – Data Distribution, Row counts, model revisions, refresh history </li></ul><ul><li>Query Performance – End user query performance and statistics </li></ul><ul><li>End User Access – Who’s accessing, when, what query / service requested, when do they access, what business are they associated with </li></ul><ul><li>Web Logs – Monitor External user access and performance </li></ul><ul><li>End User Feedback – Comments, complaints and whines. </li></ul>
  62. 62. Monitoring your Data Monitor Point Typical Analytical Environment
  63. 63. A reason for metadata <ul><li>  </li></ul>Source Systems ODS Staging Data Warehouse Data Marts Analytics End User Access Info. Distribution Metadata
  64. 64. Metadata and Monitoring <ul><li>The Metadata provides an objective, criteria-based evaluation of the data from a quality / integrity standpoint. </li></ul><ul><li>The Metadata provides standards for data use and quality assurance at all levels from the enterprise to the individual. </li></ul><ul><li>The Metadata ensures continuity in the data independent of the applications and users accessing it. Applications come and go but data is forever… </li></ul><ul><li>Metadata forces us to understand the data that we are using prior to its use. </li></ul><ul><li>Metadata promotes corporate development and retention of data assets. </li></ul>
  65. 65. The Leaky Pipe… Existing Processes & Systems Increased Processing Costs Inability to relate customer Data Poor Exception Management Lost Confidence in Analytical Systems Inability to React to Time to Market Pressures Unclear Definitions of the Business Decreased Profits <ul><li>Gets Worse Everyday </li></ul><ul><li>Must Plug the Holes NOW </li></ul><ul><li>Easy ROI Justification </li></ul>
  66. 66. Vendor Tools <ul><li>7. Vendor Tools and </li></ul><ul><li>Metadata </li></ul>
  67. 67. Vendor Metadata <ul><li>CASE Tools – ERWin, Designer 2000, Power Designer </li></ul><ul><ul><li>Technical Metadata </li></ul></ul><ul><li>RDBMS – Oracle, Informix, DB2 </li></ul><ul><ul><li>Technical Metadata </li></ul></ul><ul><ul><li>Operational Statistics – Row Counts, Domain Value Deviation, Utilization Rates, Security </li></ul></ul><ul><li>ETL </li></ul><ul><ul><li>Transformation Mappings </li></ul></ul><ul><ul><li>Exceptions Management </li></ul></ul><ul><ul><li>Recency </li></ul></ul><ul><li>BI </li></ul><ul><ul><li>Utilization </li></ul></ul><ul><li>ERP </li></ul><ul><ul><li>Source of Record </li></ul></ul>
  68. 68. Current Metadata Management <ul><li>Reflects Data After the fact </li></ul><ul><li>Most are only current state views, no history </li></ul><ul><li>No data development and Standardization Process </li></ul><ul><li>No standards for Definitions </li></ul>
  69. 69. Bi-Directional Metadata Management
  70. 70. Vendor Strengths in Data Quality
  71. 71. Wrap-up <ul><li>8. Wrap-up </li></ul>
  72. 72. Wrap up <ul><li>Use metadata as part of your data quality effort </li></ul><ul><ul><li>Incomplete Metadata is a pay me now or pay me later proposition </li></ul></ul><ul><li>Develop statistics around the data distribution, refresh strategy, access etc. </li></ul><ul><ul><li>Know what your data looks like. Know when it changes. </li></ul></ul><ul><li>Use your metadata to answer the who, what, when, where and why about your data. </li></ul><ul><ul><li>Tie your Data Quality Management (DQM) to your Total Quality Management (TQM) to create a TdQM program. </li></ul></ul><ul><li>Understand the data distribution in the production environment. </li></ul><ul><ul><li>Understand the statistics about your data. </li></ul></ul><ul><li>Publish statistics to common repository </li></ul><ul><ul><li>Share your data quality standards and reports about the statistics. </li></ul></ul>
  73. 73. Summary - Implement <ul><li>Implement Validation Routines at data collection points. </li></ul><ul><li>Implement ETL and Data Quality Tools to automate the continuous detection, cleansing, and monitoring of key files and data flows. </li></ul><ul><li>Implement Data Quality Checks. Implement data quality checks or audits at reception points or within ETL processes. Stringent checks should be done at source systems and a data integration hub. </li></ul><ul><li>Consolidate Data Collection Points to minimize divergent data entry practices. </li></ul><ul><li>Consolidate Shared Data. Use a data warehouse or ODS to physically consolidate data used by multiple applications. </li></ul><ul><li>Minimize System Interfaces by (1) backfilling a data warehouse behind multiple independent data marts, (2) merging multiple operational systems or data warehouses, (3) consolidating multiple non-integrated legacy systems by implementing packaged enterprise application software, and/or (4) implementing a data integration hub (see next). </li></ul>
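The first point, validation routines at data collection points, can be sketched as a rule table applied before a record enters the pipeline. This is an illustrative sketch: the field names and rules are invented, and a real system would source them from the metadata repository rather than hard-code them.

```python
import re

# Hypothetical entry-validation rules, one predicate per collected field.
RULES = {
    "state_code": lambda v: bool(re.fullmatch(r"[A-Z]{2}", v or "")),
    "zip_code":   lambda v: bool(re.fullmatch(r"\d{5}(-\d{4})?", v or "")),
    "email":      lambda v: v is not None and "@" in v,
}

def validate(record):
    """Return the fields that fail their entry rule (empty list = accept)."""
    return [f for f, ok in RULES.items() if not ok(record.get(f))]

good = {"state_code": "VA", "zip_code": "22301", "email": "a@b.com"}
bad  = {"state_code": "Virginia", "zip_code": "2230", "email": None}
print(validate(good), validate(bad))
```

Rejecting the bad record at the point of entry is the cheapest fix available, which is the point of the B2B order-entry example earlier in the deck.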
  74. 74. Summary - Implement <ul><li>Implement a Data Integration Hub which can minimize system interfaces and provide a single source of clean, integrated data for multiple applications. This hub uses a variety of middleware (e.g. message queues, object request brokers) and transformation processes (ETL, data quality audits) to prepare and distribute data for use by multiple applications. </li></ul><ul><li>Implement a Meta Data Repository. Create a repository for managing meta data gleaned from all enterprise systems. The repository should provide a single place for systems analysts and business users to look up definitions of data elements, reports, and business views; trace the lineage of data elements from source to targets; identify data owners and custodians; and examine data quality reports. In addition, enterprise applications, such as a data integration hub or ETL tools, can use this meta data to determine how to clean, transform, or process data in its workflow. </li></ul>
  75. 75. Some Light Reading… <ul><li>Metadata Solutions by Adrienne Tannenbaum </li></ul><ul><li>Improving Data Warehouse and Business Information Quality by Larry English </li></ul><ul><li>The DoD 8320.1-M Standard for data creation and management </li></ul><ul><li>Data Warehousing and The Zachman Framework by W.H. Inmon, John Zachman and John Geiger </li></ul><ul><li>Common Warehouse Metamodel (CWM) Specification </li></ul>
  76. 76. Working with complete attributes… A vital piece of previously omitted metadata adversely impacts the outcome of the game…
  77. 77. <ul><li>John Murphy – 303-670-8401 </li></ul><ul><li>[email_address] </li></ul><ul><li>Suzanne Riddell 303-216-9491 </li></ul><ul><li>[email_address] </li></ul>
  78. 78. Touch Points Impact (diagram): Operational Systems, Data Warehouse, Data Marts, and Repositories each add, update, and retrieve the same data in multiple locations through multiple touch points.
  79. 79. Quality Assessment Content <ul><li>Project Information </li></ul><ul><ul><li>Identifier, Name, Manager, Start Date, End Date </li></ul></ul><ul><li>Project Metrics </li></ul><ul><ul><li>Reused Data Object Count </li></ul></ul><ul><ul><li>New Data Object Count </li></ul></ul><ul><ul><li>Objects Modified </li></ul></ul>
  80. 80. Metadata Strategy <ul><li>Build Data Quality Process </li></ul><ul><ul><li>Establish Data Quality Steering Committee </li></ul></ul><ul><ul><li>Establish Data Stewards </li></ul></ul><ul><ul><li>Establish Metadata Management Process </li></ul></ul><ul><ul><li>Establish Data Development and Certification Process </li></ul></ul><ul><li>Audit existing metadata resources </li></ul><ul><ul><li>Data models </li></ul></ul><ul><ul><li>ETL Applications </li></ul></ul><ul><ul><li>RDBMS Schemas </li></ul></ul><ul><ul><li>Collect and Certify existing metadata </li></ul></ul><ul><li>Develop Meta Model </li></ul><ul><ul><li>Determine key metadata sources and alternate sources </li></ul></ul><ul><li>Develop Metadata Repository and Access Strategy </li></ul><ul><ul><li>Implement Meta Model </li></ul></ul><ul><ul><li>Populate with available as-is metadata </li></ul></ul><ul><li>Define Gaps in the Metadata </li></ul>
  81. 81. Using Metadata For Quality <ul><li>Develop The Data Quality Process </li></ul><ul><li>Implement Data Development and Standardization Process </li></ul><ul><li>Establish the Metadata Repository </li></ul><ul><li>Profile and Baseline your Data </li></ul><ul><li>Use the Metadata to Improve your Data Quality </li></ul><ul><li>Revise The Data Quality Process </li></ul>
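The "Profile and Baseline your Data" step can be sketched as a small routine that records per-column statistics; later runs are compared against this baseline to detect drift. This is an assumed minimal approach, not a method from the slides, and the columns are invented for the example.

```python
# Minimal profiling sketch (assumed approach): baseline each column's
# null rate and distinct-value count over a table sample.
def profile(rows: list) -> dict:
    """Return per-column {null_rate, distinct} statistics for a list of dicts."""
    columns = rows[0].keys() if rows else []
    stats = {}
    for col in columns:
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v not in (None, "")]
        stats[col] = {
            "null_rate": 1 - len(non_null) / len(values),
            "distinct": len(set(non_null)),
        }
    return stats

baseline = profile([
    {"id": 1, "state": "CO"},
    {"id": 2, "state": ""},
    {"id": 3, "state": "CO"},
])
```

Storing these numbers alongside the element definitions in the metadata repository is what makes the baseline usable for ongoing quality monitoring.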
  82. 82. Statistical Analysis <ul><li>Determine your Sample Size </li></ul><ul><ul><li>Size needs to be statistically significant </li></ul></ul><ul><ul><li>If in doubt use a true random 1% </li></ul></ul><ul><ul><li>Repeat complete process several times to gain confidence and repeatability </li></ul></ul><ul><ul><li>Example: </li></ul></ul><ul><ul><ul><li>N=((z-value for Confidence Level x Est. Standard Deviation) / Bound)^2 </li></ul></ul></ul><ul><ul><ul><li>N=((2.575 x .330)/.11)^2, with z=2.575 for 99% confidence </li></ul></ul></ul><ul><ul><ul><li>N=60 Rows </li></ul></ul></ul><ul><ul><li>Use as large a meaningful sample set as possible. </li></ul></ul>
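The sample-size formula from the slide can be expressed directly in code; the worked example reproduces the slide's numbers (z = 2.575 for 99% confidence, estimated standard deviation 0.330, bound 0.11).

```python
# The slide's sample-size formula: N = ((z * est. standard deviation) / bound)^2
import math

def sample_size(confidence_z: float, est_std_dev: float, bound: float) -> int:
    """Round up, since a sample cannot contain a fractional row."""
    return math.ceil(((confidence_z * est_std_dev) / bound) ** 2)

# The slide's example values:
n = sample_size(2.575, 0.330, 0.11)  # -> 60 rows
```

Tightening the bound or raising the confidence level grows N quadratically, which is why the slide also advises using as large a meaningful sample as possible.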
  83. 83. What Causes Data Warehouses to Fail <ul><li>Failing to understand the purpose of data warehousing </li></ul><ul><li>Failing to understand who the real “customers” of the data warehouse are </li></ul><ul><li>Assuming the source data is “OK” because the operational systems seem to work just fine </li></ul><ul><li>Not developing an enterprise-focused information architecture, even if only developing a departmental data mart </li></ul><ul><li>Focusing on performance over information quality in data warehousing </li></ul><ul><li>Not solving the information quality problems at the source </li></ul><ul><li>Inappropriate “Ownership” of data correction/cleanup processes </li></ul><ul><li>Not developing effective audit and control processes for the data Extract, Correct, Transform and Load (ECTL) processes </li></ul><ul><li>Misuse of information quality software in the data warehousing processes </li></ul><ul><li>Failing to exploit this opportunity to “correct” some of the wrongs created by the previous 40 years of bad habits </li></ul>
  84. 84. Metadata Tool Vendors <ul><li>Data Advantage – www.dataadvantage.com </li></ul><ul><li>CA Platinum - www.ca.com </li></ul><ul><li>Arkidata – www.arkidata.com </li></ul><ul><li>Sagent- www.sagent.com </li></ul><ul><li>Dataflux – www.dataflux.com </li></ul><ul><li>DataMentors – www.datamentors.com </li></ul><ul><li>Vality – www.vality.com </li></ul><ul><li>Evoke – www.evokesoft.com </li></ul>
  85. 85. Data and Information Quality <ul><li>2. Quality… </li></ul>
  86. 86. Quality – What it is and is not <ul><li>Data and Information Quality is the ability to consistently meet the customer’s expectations and to adapt to those expectations as they change. </li></ul><ul><li>Quality is a process, not an end point. </li></ul><ul><li>Quality is understanding the impact of change and the ability to proactively adapt. </li></ul><ul><li>Quality is building adaptable / survivable processes – The less I have to change while keeping my Knowledge Workers’ satisfaction high, the more successful I’ll be. </li></ul><ul><li>Data and Information Quality is not Data Cleansing or transformations. By then it’s too late. </li></ul><ul><li>Quality impacts the costs associated with Scrap and Rework – Just Like Manufacturing! </li></ul>
  87. 87. The Quality Leaders <ul><li>The Quality Leaders </li></ul><ul><ul><li>W. Edwards Deming – 14 Points of Quality, moving from do it fast to do it right. </li></ul></ul><ul><ul><li>Philip Crosby – 14 Step Quality Program – Determine what is to be delivered, then the timeline. </li></ul></ul><ul><ul><li>Malcolm Baldrige – Determination of excellence, commitment to change </li></ul></ul><ul><ul><li>Masaaki Imai – Kaizen, Continuous Process Improvement </li></ul></ul><ul><li>Quality Frameworks </li></ul><ul><ul><li>Six Sigma – A statistically repeatable approach </li></ul></ul><ul><ul><li>Lean Thinking – Simplify to eliminate waste </li></ul></ul><ul><ul><li>ISO 9000 – Quality measurement process </li></ul></ul>
  88. 88. Quality Tools <ul><li>Six Sigma – A statistically repeatable approach </li></ul><ul><ul><li>Define - Once a project has been selected by management, the team identifies the problem, defines the requirements, and sets an improvement goal. </li></ul></ul><ul><ul><li>Measure - Validates the problem, refines the goal, then establishes a baseline to track results. </li></ul></ul><ul><ul><li>Analyze – Identifies the potential root causes and validates a hypothesis for corrective action. </li></ul></ul><ul><ul><li>Improve – Develops solutions to root causes, tests the solutions, and measures the impact of the corrective action. </li></ul></ul><ul><ul><li>Control - Establishes standard methods and corrects problems as needed. In other words, the corrective action becomes the new requirement, but additional problems may occur that will have to be adjusted for. </li></ul></ul>
  89. 89. Quality Principles – The Knowledge Worker <ul><li>IT has a reason to exist: the Knowledge Worker </li></ul><ul><li>At Toyota the Knowledge Worker is the “Honored Guest” </li></ul><ul><li>It’s all for the knowledge worker. How well do you know them? </li></ul><ul><ul><li>Who are your knowledge workers? </li></ul></ul><ul><ul><li>What data do they need? </li></ul></ul><ul><ul><li>When do they use your data? </li></ul></ul><ul><ul><li>Where do they access it from? </li></ul></ul><ul><ul><li>Why do they need it to do their job? </li></ul></ul><ul><li>Do your KWs feel like Honored Guests or cows in the pasture? </li></ul><ul><li>Building a Profile of the Knowledge Workers </li></ul><ul><ul><li>Classes of Knowledge Workers </li></ul></ul><ul><ul><ul><li>Farmers, Explorers, Inventors </li></ul></ul></ul><ul><ul><li>Determine the distribution of the Knowledge Workers </li></ul></ul><ul><ul><li>Determine their use profile </li></ul></ul>
  90. 90. User Groups By Data Retrieval Needs (diagram): 80% Grazers – Push Reporting; 15% Explorers – Push with Drill Down; 5% Inventors – Any, All and then Some.
  91. 91. Quality Shared – IT and Users <ul><li>Shared Ownership of the data </li></ul><ul><ul><li>What data do I have? </li></ul></ul><ul><ul><li>How do I care for it? </li></ul></ul><ul><ul><li>What do I want to do with it? </li></ul></ul><ul><ul><li>Where do I / my process add value? </li></ul></ul><ul><li>Start with a target </li></ul><ul><ul><li>Build the car while you’re driving </li></ul></ul><ul><ul><li>Everyone owns the process </li></ul></ul><ul><ul><li>Everyone participates </li></ul></ul><ul><ul><li>Break down the barriers </li></ul></ul>
  92. 92. The Barriers to Quality <ul><li>Knowledge Workers’ Gripes about IT </li></ul><ul><ul><li>IT can’t figure out how to get my data on time – I’ll do it in Access </li></ul></ul><ul><ul><li>IT has multiple calculations for the same values – I’ll correct them by hand </li></ul></ul><ul><ul><li>It takes IT forever to build that table for me. I’ll do it in Excel </li></ul></ul><ul><li>IT Gripes about Knowledge Workers </li></ul><ul><ul><li>KWs won’t make the time to give us an answer </li></ul></ul><ul><ul><li>What KWs said last month isn’t the same as this month </li></ul></ul><ul><ul><li>They are unrealistic in their expectations </li></ul></ul><ul><ul><li>We can’t decide that, it’s their decision </li></ul></ul><ul><ul><li>I don’t think they can understand a data model </li></ul></ul>
  93. 93. Quality Tools <ul><li>Lean Thinking – Simplify to Eliminate Waste </li></ul><ul><ul><li>Value - Defining what the customer wants. Any characteristic of the product or service that doesn't align with the customers' perception of value is an opportunity to streamline. </li></ul></ul><ul><ul><li>Value Stream - The value stream is the vehicle for delivering value to the customer. It is the entire chain of processes that develop, produce and deliver the desired outcome. Lean Enterprise tries to streamline the process at every step of the way. </li></ul></ul><ul><ul><li>Flow - Sequencing the value stream (process flow) in such a manner as to eliminate any part of the process that doesn't add value. </li></ul></ul><ul><ul><li>Pull - This is the concept of producing only what is needed, when it's needed. This tries to avoid the stockpiling of products by producing or providing only what the customer wants, when they want it. </li></ul></ul><ul><ul><li>Perfection - The commitment to continually pursue the ideal means creating value while eliminating waste. </li></ul></ul>
  94. 94. Total Quality data Management <ul><li>TQdM is a standard data quality process from Larry English </li></ul><ul><li>Process 1- Assess the Data Definition Information Architecture Quality </li></ul><ul><ul><li>In – Starting Point </li></ul></ul><ul><ul><li>Out – Technical Data Definition Quality </li></ul></ul><ul><ul><li>Out – Information Groups </li></ul></ul><ul><ul><li>Out – Information Architecture </li></ul></ul><ul><ul><li>Out – Customer Satisfaction </li></ul></ul><ul><li>Process 2 - Assess the Information Quality </li></ul><ul><ul><li>In – Technical Data Definition Quality Assessment </li></ul></ul><ul><ul><li>Out – Information Value and Cost Chain </li></ul></ul><ul><ul><li>Out – Information Quality Reports </li></ul></ul><ul><li>Process 3 – Measure Non-quality Information Costs </li></ul><ul><ul><li>In – Outputs from Process 2 </li></ul></ul><ul><ul><li>Out – Information Value and Cost Analysis </li></ul></ul>
  95. 95. Total Quality data Management <ul><li>Process 4 –Re-engineer Data and Data Clean-up </li></ul><ul><ul><li>In – Outputs from Process 3 </li></ul></ul><ul><ul><li>Out – Data Defect identification </li></ul></ul><ul><ul><li>Out – Cleansed Data to Data Warehouse and Marts </li></ul></ul><ul><li>Process 5 – Improve Information Process Quality </li></ul><ul><ul><li>In – Production data, Raw and Clean </li></ul></ul><ul><ul><li>Out – Identified opportunities for Quality Improvement </li></ul></ul><ul><li>Process 6 – Establish Information Quality Environment </li></ul><ul><ul><li>In – All quality issues from Process 1 to 5 </li></ul></ul><ul><ul><li>Out – Management of Process 1 to 5 </li></ul></ul><ul><li>Collects much of the organization’s existing metadata. </li></ul>
  96. 96. Information Quality Improvement Process (diagram): P1 Assess Data Definition and Information Architecture → P2 Assess Information Quality → P3 Measure Non-Quality Information Costs → P4 Re-Engineer and Cleanse Data → P5 Improve Information Process Quality; P6 Establish Information Quality Environment manages Processes 1 through 5.
