Building a Data Quality Program from Scratch


Published in: Technology

  1. 1. Building A Data Quality Program From Scratch DAMA Chicago October 19, 2011 John Grage – Sr. Mgr. Discover Financial Services
  2. 2. Agenda <ul><li>Company Introduction </li></ul><ul><li>Card Acceptance </li></ul><ul><li>Data Quality Defined </li></ul><ul><li>The Six Factors of Data Quality </li></ul><ul><li>Best Practices for Improving Data Quality </li></ul><ul><li>Origins of Poor Data Quality </li></ul><ul><li>Benefits of High Data Quality </li></ul><ul><li>Who is Responsible for Data Quality? </li></ul><ul><li>Let’s Get Started </li></ul><ul><li>Celebrate the Wins </li></ul><ul><li>Recommendations </li></ul><ul><li>Core Functional Requirements of a Data Quality Tool </li></ul><ul><li>Q&A </li></ul>
  3. 3. Company Introduction <ul><li>Discover Financial Services (NYSE: DFS) </li></ul><ul><ul><li>Direct Banking and Payment Services Company </li></ul></ul><ul><ul><li>Founded in 1986 </li></ul></ul><ul><ul><li>We Offer Many Consumer Products </li></ul></ul><ul><ul><ul><li>Credit Card (One of the Largest Credit Card Issuers in the U.S.) </li></ul></ul></ul><ul><ul><ul><li>ATM/Debit Card </li></ul></ul></ul><ul><ul><ul><li>Loans (Student, Credit Card, and Personal) </li></ul></ul></ul><ul><ul><ul><li>Banking (Online Savings Accts, CDs, and Money Market Accts) </li></ul></ul></ul><ul><ul><li>We Own Three Payments Networks </li></ul></ul><ul><ul><ul><li>Discover Network: has millions of merchants and cash access locations </li></ul></ul></ul><ul><ul><ul><li>PULSE: one of the nation’s leading ATM/debit networks </li></ul></ul></ul><ul><ul><ul><li>Diners Club International: a global payments network with acceptance in 185 countries and territories </li></ul></ul></ul><ul><ul><li>Riverwoods, IL Headquarters </li></ul></ul><ul><ul><li>Approximately 10,500 Employees </li></ul></ul><ul><ul><li>Approximately 50 Million Card Holders </li></ul></ul><ul><ul><li>Sites Include: and </li></ul></ul>
  4. 4. Card Acceptance <ul><li>Discover Card </li></ul><ul><ul><li>North America – U.S. / Canada / Mexico </li></ul></ul><ul><ul><li>Central America – Costa Rica / El Salvador / Panama and others </li></ul></ul><ul><ul><li>South America – Brazil / Ecuador </li></ul></ul><ul><ul><li>Caribbean – Bahamas / BVI / Jamaica / Puerto Rico and others </li></ul></ul><ul><ul><li>Europe – Austria / Finland / Poland / Turkey and others </li></ul></ul><ul><ul><li>Asia – Mainland China / Japan / South Korea </li></ul></ul><ul><ul><li>Africa – South Africa </li></ul></ul><ul><ul><li>Many Other Countries Coming Soon </li></ul></ul><ul><ul><li>See and select ‘International Acceptance’ under ‘Help and Support’ for up to date list </li></ul></ul>
  5. 5. Data Quality Defined <ul><li>Many Definitions </li></ul><ul><ul><li>The degree of excellence exhibited by the data in relation to the portrayal of the actual scenario. </li></ul></ul><ul><ul><li>The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use. </li></ul></ul><ul><ul><li>The people, processes and technologies involved in ensuring the conformance of data values to business requirements and acceptance criteria. </li></ul></ul><ul><ul><li>People (must), Process (must), Technology (tools needed at some point) </li></ul></ul><ul><li>Myths and Misconceptions </li></ul><ul><ul><li>More than defect correction </li></ul></ul><ul><ul><li>Not a one time action </li></ul></ul><ul><ul><li>Seldom about perfection </li></ul></ul>
  6. 6. The Six Factors of Data Quality <ul><li>Context </li></ul><ul><ul><li>The purpose for which it is used </li></ul></ul><ul><li>Storage </li></ul><ul><ul><li>Where the data resides </li></ul></ul><ul><li>Data Flow </li></ul><ul><ul><li>How the data enters and moves through the organization </li></ul></ul><ul><li>Work Flow </li></ul><ul><ul><li>How work activities interact with and use the data </li></ul></ul><ul><li>Stewardship </li></ul><ul><ul><li>People responsible for managing the data </li></ul></ul><ul><li>Continuous Monitoring </li></ul><ul><ul><li>Processes for regularly validating the data </li></ul></ul>
  7. 7. Best Practices for Improving Data Quality <ul><li>Every Data Quality Effort Starts with Data Profiling </li></ul><ul><li>Tool-Based Data Profiling is More Effective Than Manual Methods </li></ul><ul><li>Data Profiling is Not a One-Time Task </li></ul><ul><li>Data Profiling, Integration and Quality are Closely Related </li></ul><ul><li>Proactive Order Can Reduce Reactive Chaos </li></ul><ul><li>Improving Data When It’s Created or Changed is Easier Than Fixing It Later </li></ul><ul><ul><li>Garbage in, garbage out </li></ul></ul><ul><ul><li>An ounce of prevention is worth a pound of cure </li></ul></ul><ul><ul><li>Data quality needs to move upstream </li></ul></ul>
  8. 8. Origins of Poor Data Quality <ul><li>Inconsistent Definitions for Common Terms </li></ul><ul><li>Any Manual Intervention in the Data Flow Process (employees/customers) </li></ul><ul><li>Data Migration or Conversion Projects </li></ul><ul><li>External Data </li></ul><ul><li>Customer, Product and Financial Data are More Prone to Data Quality Problems Compared to Other Types of Data </li></ul>
  9. 9. Benefits of High Data Quality <ul><li>Greater Confidence in Analytic Systems </li></ul><ul><li>Less Time Spent Reconciling Data and/or Fixing Problems </li></ul><ul><li>Single Version of the Truth </li></ul><ul><li>Increased Customer Satisfaction </li></ul><ul><li>Reduced Costs </li></ul><ul><li>Increased Revenues </li></ul><ul><li>Compliance </li></ul><ul><ul><li>Compliance can drive your DQ program if you can’t sell the other benefits </li></ul></ul><ul><ul><li>Make friends with your audit staff </li></ul></ul><ul><ul><li>HIPAA, GLBA, SOX, Basel II, FDIC, Federal Reserve and others </li></ul></ul>
  10. 10. Who is Responsible for Data Quality? <ul><li>Information Technology </li></ul><ul><li>Business Analysts </li></ul><ul><li>Business </li></ul><ul><li>Front-Line Workers </li></ul><ul><li>DQ Analysts </li></ul><ul><li>Data Steward </li></ul><ul><li>Corporate Executives </li></ul><ul><li>Board of Directors </li></ul><ul><li>No One </li></ul><ul><li>All of Us Are – We just play different roles </li></ul>
  11. 11. Let’s Get Started (Metadata) <ul><li>Information = Data (content) + Metadata (context) </li></ul><ul><li>Your DQ Program Needs to Address Both Data and Metadata </li></ul><ul><li>“Don’t Boil the Ocean” </li></ul><ul><li>Start with a Focus on Structured Data (get this right before tackling others) </li></ul><ul><li>Start With Selecting a Handful of Business Attributes From: </li></ul><ul><ul><li>Customer </li></ul></ul><ul><ul><li>Product </li></ul></ul><ul><ul><li>Vendor / Supplier </li></ul></ul><ul><ul><li>Employee </li></ul></ul><ul><ul><li>Financial </li></ul></ul><ul><ul><li>Master Reference Data </li></ul></ul><ul><ul><li>or an attribute(s) someone brings to you. Don’t turn away this opportunity </li></ul></ul><ul><li>Find Data Steward / SME / or Someone with Business Knowledge About Attribute Who is Willing to Work With You </li></ul><ul><li>Find Published Metadata About Those Attributes </li></ul><ul><ul><li>Verify Metadata is current and accurate with your SME </li></ul></ul><ul><ul><li>If Metadata does not exist then that is your first step </li></ul></ul>
  12. 12. Let’s Keep Going (Discovery) <ul><li>Update and/or Publish Your Metadata on These Attributes </li></ul><ul><ul><li>Great if you already have a single metadata repository tool </li></ul></ul><ul><ul><li>If not, that should be one goal of your data governance program </li></ul></ul><ul><ul><li>Document and train individuals on how to find and use this metadata </li></ul></ul><ul><ul><li>Enterprise LDM should be in your repository </li></ul></ul><ul><ul><ul><li>Business subject areas, critical entities, attributes and relationships </li></ul></ul></ul><ul><ul><ul><li>Metadata about these attributes is your golden record </li></ul></ul></ul><ul><li>Discovery (where do these attributes reside?) </li></ul><ul><ul><li>Almost impossible to get 100% coverage without a tool </li></ul></ul><ul><ul><li>Could write lots of SQL and interrogate lots of programs and copybooks </li></ul></ul><ul><ul><li>Either way you will have something to work with – just how complete is it? </li></ul></ul>
  13. 13. Let’s Keep Going (POC) <ul><li>Start With a POC Within One LOB </li></ul><ul><ul><li>1-2 week effort </li></ul></ul><ul><ul><li>Examine a small number of attributes </li></ul></ul><ul><ul><li>Gather a small set of business rules </li></ul></ul><ul><ul><li>Profile the data </li></ul></ul><ul><ul><li>Share findings with SME </li></ul></ul><ul><ul><li>This is your chance to show value within a LOB that a DQ program can bring </li></ul></ul>
  14. 14. Let’s Keep Going (Project) <ul><li>Expand to Data Quality Project for That LOB </li></ul><ul><ul><li>1-6 month effort </li></ul></ul><ul><ul><li>Expand to full set of attributes </li></ul></ul><ul><ul><li>Expand to full set of business rules </li></ul></ul><ul><ul><li>Profile the data </li></ul></ul><ul><ul><li>Share findings with SME and LOB </li></ul></ul><ul><ul><li>Build action plan to address DQ issues </li></ul></ul><ul><ul><li>Fix DQ issues </li></ul></ul><ul><ul><li>Build in monitoring and reporting activities </li></ul></ul><ul><ul><li>Start looking upstream </li></ul></ul><ul><ul><li>Publish results – gain corporate awareness of what you have accomplished </li></ul></ul><ul><ul><li>May need to do more than one LOB before proceeding to the next step </li></ul></ul>
  15. 15. Let’s Keep Going (Enterprise) <ul><li>Expand to Data Quality Project Across the Enterprise </li></ul><ul><ul><li>6-12+ month effort </li></ul></ul><ul><ul><li>This is where you start to enter into MDM </li></ul></ul><ul><ul><li>Look at critical business entities / attributes that span the enterprise </li></ul></ul><ul><ul><ul><li>May be some of the same attributes that you looked at individually within their LOB </li></ul></ul></ul><ul><ul><li>Look at full set of business rules across the enterprise </li></ul></ul><ul><ul><li>Profile the data across multiple LOBs </li></ul></ul><ul><ul><li>Share findings with enterprise SME and Data Governance Council </li></ul></ul><ul><ul><li>Work with DGC to prioritize next steps </li></ul></ul><ul><ul><li>Build action plan to address DQ issues </li></ul></ul><ul><ul><li>Fix DQ issues </li></ul></ul><ul><ul><li>Build in monitoring and reporting activities </li></ul></ul><ul><ul><li>Focus upstream - need to address DQ issues in operational systems </li></ul></ul><ul><ul><li>Publish results – gain corporate awareness of what you have accomplished </li></ul></ul>
  16. 16. Let’s Keep Going (6 Key DQ Dimensions) <ul><li>Completeness </li></ul><ul><ul><li>Are data values missing or in an unusable state? </li></ul></ul><ul><ul><li>Nullability </li></ul></ul><ul><li>Conformity </li></ul><ul><ul><li>Does the data conform to specified formats? </li></ul></ul><ul><li>Consistency </li></ul><ul><ul><li>Do distinct data instances provide conflicting information? </li></ul></ul><ul><ul><li>Are values consistent across data sets? </li></ul></ul><ul><li>Accuracy </li></ul><ul><ul><li>Does the data accurately represent the “real-world” values it is expected to model? e.g., incorrect spellings or out-of-date data </li></ul></ul><ul><li>Duplication </li></ul><ul><ul><li>Are there multiple, unnecessary representations of the same data? </li></ul></ul><ul><li>Integrity </li></ul><ul><ul><li>What data is missing important relationship links? The inability to link related records together may introduce duplication across your enterprise </li></ul></ul>
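As a rough illustration of how a few of these dimensions can be turned into measurable checks, here is a minimal Python sketch. The sample records, field names, ZIP-code pattern and state reference list are all hypothetical and chosen only for the example. It covers completeness, conformity, duplication and a simple consistency check against a reference list; accuracy and integrity usually need an external source of truth and are not shown.

```python
import re
from collections import Counter

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "name": "Ann Lee", "postal_code": "60015", "state": "IL"},
    {"id": 2, "name": "", "postal_code": "6001", "state": "IL"},
    {"id": 3, "name": "Ann Lee", "postal_code": "60015", "state": "IL"},
    {"id": 4, "name": "Bob Ray", "postal_code": None, "state": "Illinois"},
]

POSTAL_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")  # assumed format rule (U.S. ZIP)
VALID_STATES = {"IL", "WI", "IN"}                 # assumed reference list


def is_missing(value):
    """Treat None and empty strings as missing."""
    return value is None or value == ""


def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    return sum(1 for r in rows if not is_missing(r.get(field))) / len(rows)


def conformity(rows, field, pattern):
    """Share of non-missing values that match the expected format."""
    values = [r[field] for r in rows if not is_missing(r.get(field))]
    if not values:
        return 1.0
    return sum(1 for v in values if pattern.match(v)) / len(values)


def duplicates(rows, fields):
    """Key combinations that appear more than once, with their counts."""
    counts = Counter(tuple(r.get(f) for f in fields) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}


def inconsistent_with_reference(rows, field, allowed):
    """Ids of rows whose value is not in the agreed reference list."""
    return [r["id"] for r in rows if r.get(field) not in allowed]


print(completeness(records, "name"))                              # 0.75
print(conformity(records, "postal_code", POSTAL_PATTERN))
print(duplicates(records, ["name", "postal_code"]))
print(inconsistent_with_reference(records, "state", VALID_STATES))  # [4]
```

Each function returns a number or a list of exceptions rather than a pass/fail verdict, so the threshold for "good enough" can be agreed with the SME.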
  17. 17. Let’s Keep Going (Profile) <ul><li>Run Data Profiling Against Your Attribute(s) </li></ul><ul><ul><li>A DQ tool makes your life much simpler </li></ul></ul><ul><ul><li>Report on </li></ul></ul><ul><ul><ul><li>Source system </li></ul></ul></ul><ul><ul><ul><li>Entity name </li></ul></ul></ul><ul><ul><ul><li>Attribute name </li></ul></ul></ul><ul><ul><ul><li>Data type and length </li></ul></ul></ul><ul><ul><ul><li>Nullability </li></ul></ul></ul><ul><ul><ul><li>Identify if attribute is a PK or FK </li></ul></ul></ul><ul><ul><ul><li>Total number of rows (or %) examined (may not want/need to look at all rows) </li></ul></ul></ul><ul><ul><ul><li>Cardinality </li></ul></ul></ul><ul><ul><ul><li>Min and max values for the attribute </li></ul></ul></ul><ul><ul><ul><li>Classification (SS#, postal code, name, address, etc.); DQ tools are good at this </li></ul></ul></ul><ul><ul><ul><li>Number of data quality issues (attributes not in line with business rules) </li></ul></ul></ul><ul><ul><ul><li>Provide explanations and examples for each exception </li></ul></ul></ul>
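For teams without a DQ tool yet, part of the profiling report above can be approximated by hand. The following Python sketch profiles a single attribute; the function name `profile_attribute`, the sample credit-limit values and the 0 to 20,000 business rule are assumptions made for illustration, and it does not attempt key detection or classification.

```python
def profile_attribute(values, rule=None):
    """Build a small profiling report for one attribute.

    `values` is the column as a list (None means null).
    `rule` is an optional business rule: a function returning True for valid values.
    """
    non_null = [v for v in values if v is not None]
    report = {
        "rows_examined": len(values),
        "null_count": len(values) - len(non_null),
        "cardinality": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "max_length": max((len(str(v)) for v in non_null), default=0),
        "types": sorted({type(v).__name__ for v in non_null}),
    }
    if rule is not None:
        exceptions = [v for v in non_null if not rule(v)]
        report["issue_count"] = len(exceptions)
        report["examples"] = exceptions[:3]  # a few examples per exception type
    return report


# Hypothetical attribute: credit limits, with an assumed rule of 0..20000.
credit_limits = [500, 1500, None, 1500, -20, 25000]
report = profile_attribute(credit_limits, rule=lambda v: 0 <= v <= 20000)

for name, value in report.items():
    print(f"{name}: {value}")
```

Keeping the exception examples alongside the counts makes the findings easier to review with the SME.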
  18. 18. Let’s Keep Going (Analyze / Fix) <ul><li>Analyze Your Results </li></ul><ul><ul><li>Review the results of your analysis for each DQ dimension examined </li></ul></ul><ul><ul><li>Identify data quality issues </li></ul></ul><ul><ul><li>Determine with the SME the impact these exceptions have on the LOB or company </li></ul></ul><ul><ul><li>$ is the best message to bring </li></ul></ul><ul><ul><li>Compliance is equally effective </li></ul></ul><ul><ul><li>Build action plan to fix </li></ul></ul><ul><ul><li>Determine cost to fix </li></ul></ul><ul><ul><li>Take action to fix if cost effective (remember it’s not about perfection) </li></ul></ul><ul><ul><li>Save results </li></ul></ul>
  19. 19. Let’s Keep Going (Swim Upstream) <ul><li>Trace Data Flow in Reverse from Data Quality Issue </li></ul><ul><li>Data was Corrupted Somewhere Along Data Flow </li></ul><ul><ul><li>Right off the bat – as data entered the company </li></ul></ul><ul><ul><ul><li>Bad vendor file </li></ul></ul></ul><ul><ul><ul><li>Bad data entry from customer service rep (telephone call) </li></ul></ul></ul><ul><ul><ul><li>Bad data entry from customer (online application) </li></ul></ul></ul><ul><ul><li>Programming error in operational system </li></ul></ul><ul><ul><li>Data Transformation processes as data moves along </li></ul></ul><ul><ul><li>??? </li></ul></ul><ul><li>Find Where Corruption is Occurring and Fix It </li></ul><ul><li>Beware: Corruption May be Occurring in Multiple Places </li></ul>
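The reverse trace described above can be sketched in a few lines of Python, assuming you can capture the value of one attribute for one record at each stage of the data flow. The stage names, the date-of-birth values and the ISO date rule below are hypothetical.

```python
import re

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumed business rule: ISO dates


def is_valid(value):
    """True when the value satisfies the assumed format rule."""
    return bool(ISO_DATE.fullmatch(value))


def locate_corruption(snapshots, valid):
    """Walk backward from the last stage of the data flow.

    `snapshots` is an ordered list of (stage_name, value) pairs, from the
    point of entry to the point where the issue was found. Returns the
    earliest stage in the trailing run of invalid values, i.e. the stage
    where the bad value first appeared, or None if the final value is valid.
    """
    culprit = None
    for stage, value in reversed(snapshots):
        if valid(value):
            break
        culprit = stage
    return culprit


# Hypothetical trace of one customer's date of birth through the data flow.
trace = [
    ("online_application", "1985-04-12"),
    ("operational_system", "1985-04-12"),
    ("etl_transform", "12/04/85"),
    ("warehouse", "12/04/85"),
]

print(locate_corruption(trace, is_valid))  # etl_transform
```

This only finds the corruption point closest to where the issue was observed. Since corruption may occur in more than one place, the trace should be repeated after each fix.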
  20. 20. Let’s Keep Going (Monitoring) <ul><li>Build Monitoring Process to Audit Your Fix </li></ul><ul><li>Monitoring Process Should be a Scheduled Automated Process </li></ul><ul><li>Need to Review Results to Determine if Data is No Longer Being Corrupted </li></ul><ul><li>Take Action if Data Quality is Being Compromised </li></ul>
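One possible shape for such a monitoring check is sketched below in Python. The function name `monitor`, the sample state codes and the 5% tolerance are assumptions; scheduling (cron, an enterprise scheduler, or a DQ tool) is left out, and the function is only the piece that would run on each scheduled pass.

```python
def monitor(values, rule, max_failure_rate):
    """Audit one attribute against a business rule.

    Returns the observed failure rate and whether it exceeds the
    agreed tolerance, so that a scheduler can raise an alert.
    """
    failures = sum(1 for v in values if not rule(v))
    rate = failures / len(values) if values else 0.0
    return {
        "rows_checked": len(values),
        "failures": failures,
        "failure_rate": rate,
        "alert": rate > max_failure_rate,
    }


# Hypothetical nightly sample of state codes; empty strings break the rule.
state_codes = ["IL", "WI", "", "IL"]
result = monitor(state_codes, rule=lambda v: v != "", max_failure_rate=0.05)

if result["alert"]:
    print(f"Data quality alert: {result['failure_rate']:.0%} of rows failed")
```

Storing each run's result lets you confirm over time that the fix is holding, and gives you the trend data for a scorecard.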
  21. 21. Let’s Keep Going (Non-Compliance) <ul><li>Use Pie Charts, Bar Graphs, etc. to Pictorially Illustrate Effect of Not Addressing Discovered DQ Issues </li></ul><ul><li>Tie to Regulatory Compliance if Helpful. Refer to HIPAA, Basel II, SOX, FDIC, Federal Reserve. </li></ul><ul><li>Tie to $ </li></ul><ul><ul><li>Increased cost </li></ul></ul><ul><ul><li>Decreased revenue </li></ul></ul><ul><li>Present to Data Governance Council </li></ul>
  22. 22. Celebrate The Wins <ul><li>Celebrate </li></ul><ul><li>Publish Wins on Scorecard </li></ul><ul><li>Show $ Saved or Revenue Increased </li></ul><ul><li>Constantly Remind Enterprise of What You are Doing and Value You are Providing </li></ul>
  23. 23. Recommendations <ul><li>Start Small (POCs) </li></ul><ul><li>Show Some Quick Wins - $ </li></ul><ul><li>Grow From There </li></ul><ul><li>Focus on What You Have to Work With, Not What You Don’t Have to Work With </li></ul><ul><li>Profile Data More Deeply and More Often </li></ul><ul><li>Find Solutions in Tools </li></ul><ul><li>Establish Both Proactive and Reactive Processes </li></ul><ul><li>Take Data Quality Upstream </li></ul><ul><li>Use Regulatory Compliance to Drive Data Quality </li></ul><ul><li>Use Metadata to Drive Quality </li></ul><ul><li>Address Enterprise Data Quality </li></ul><ul><li>Derive EDQ Org Structure and Support Through Data Governance or other Executive Support </li></ul>
  24. 24. Core Functional Requirements of a DQ Tool <ul><li>Profiling </li></ul><ul><ul><li>Capture statistics (metadata) that provide insight into the quality of the data and help identify data quality issues </li></ul></ul><ul><li>Parsing and Standardization </li></ul><ul><ul><li>Decomposition of text fields into component parts and the formatting of values into consistent layouts based on industry standards, local standards, user defined business rules and knowledge bases of values and patterns </li></ul></ul><ul><li>Generalized “Cleansing” </li></ul><ul><ul><li>The modification of data values to meet domain restrictions, integrity constraints or other business rules that define when the quality of data is sufficient for the organization </li></ul></ul><ul><li>Matching </li></ul><ul><ul><li>Identifying, linking or merging related entries within or across sets of data </li></ul></ul><ul><li>Monitoring </li></ul><ul><ul><li>Deploying controls ensuring data continues to conform to business rules that define data quality for the organization </li></ul></ul><ul><li>Enrichment </li></ul><ul><ul><li>Enhancing the value of internally held data by appending related attributes from external sources (e.g., consumer demographic attributes or geographic descriptors) </li></ul></ul>
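To make two of these functions concrete, the Python sketch below shows a very simple form of parsing and standardization (U.S. phone numbers) followed by exact-key matching on the standardized values. The record layout, the function names and the matching rule are assumptions for illustration; commercial DQ tools typically use much richer knowledge bases and fuzzy matching.

```python
import re
from collections import defaultdict


def standardize_phone(raw):
    """Parse a U.S. phone string and return it as NNN-NNN-NNNN, or None."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    if len(digits) != 10:
        return None          # cannot be standardized; report as an exception
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def match_key(name):
    """Normalize a name: lowercase, strip punctuation, collapse spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def match_records(rows):
    """Group record ids that share the same standardized name and phone."""
    groups = defaultdict(list)
    for row in rows:
        key = (match_key(row["name"]), standardize_phone(row["phone"]))
        groups[key].append(row["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical customer records from two source systems.
customers = [
    {"id": 1, "name": "J. Smith", "phone": "(847) 555-0100"},
    {"id": 2, "name": "j  smith", "phone": "1-847-555-0100"},
    {"id": 3, "name": "Jane Smythe", "phone": "847.555.0199"},
]

print(standardize_phone("(847) 555-0100"))  # 847-555-0100
print(match_records(customers))             # [[1, 2]]
```

Standardizing before matching is the design choice that matters here: records 1 and 2 only link up once their names and phone numbers share a consistent layout.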
  25. 25. The End <ul><li>Thank You! </li></ul><ul><li>Questions? </li></ul>