Data Quality Process Design For Analytics And Reporting

An implementation of data auditing for relational and ad-hoc reporting systems. It helps establish a glide path toward a more dimensional solution.


  1. Jim Atwater, Principal Consultant, Management Analytics Practice. McKeel Research LLC, 12/9/2008. All Rights Reserved.
  2. Ad-hoc reporting processes are largely relational, aspiring to a more dimensional model
     • More than one data source is involved, including relational databases, spreadsheets, SharePoint lists, and flat files
     • Source data is staged to a relational database and reported using standard tools like Excel and PowerPoint (see the staging sketch below)
     • Data quality is a concern, guided by the need for "one version of the truth"
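A minimal staging sketch, assuming Python with pandas and SQLite; the file names, sheet names, and staging table names (staging.db, stg_orders, and so on) are hypothetical stand-ins for whatever feeds the reports actually use.

    # Stage heterogeneous feeds into one relational staging database.
    import sqlite3
    import pandas as pd

    staging = sqlite3.connect("staging.db")

    # Relational source: copy a table (or query result) from a source database.
    with sqlite3.connect("source_system.db") as src:
        pd.read_sql_query("SELECT * FROM orders", src) \
          .to_sql("stg_orders", staging, if_exists="replace", index=False)

    # Spreadsheet source.
    pd.read_excel("forecast.xlsx", sheet_name="Q4") \
      .to_sql("stg_forecast", staging, if_exists="replace", index=False)

    # Flat-file source.
    pd.read_csv("headcount.csv") \
      .to_sql("stg_headcount", staging, if_exists="replace", index=False)

    staging.commit()

Once every feed lands in the same staging database, the screens described later can run against one consistent set of tables instead of against each source tool.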
  3. (Diagram slide; no text captured in the transcript.)
  4. No "T" in the "ETL" means:
     • no auditing or error handling
     • no added value until the report is generated
     • analysts end up doing the same work every reporting cycle
     The QA and formatting step actually adds errors:
     • column names
     • user-defined aggregations
     • macros and calculated fields
     This violates the prime directive:
     • "One version of the truth" becomes "no version is the truth"
     • automation only automates the errors
  5. (Diagram slide; no text captured in the transcript.)
  6. Screen definitions
     • A screen is a specific type of automated test case
     • Screens can validate, among other things: physical or logical structure, atomic-level data values, and logical aggregation values
     • Screens enforce data quality: they identify errors, score their severity, and feed the Error Event Fact Table (a minimal sketch follows this slide)
     Screen order
     • Allows screens to be run in parallel for better performance
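A minimal sketch of a screen definition, assuming Python against the SQLite staging database from the earlier sketch; the field names, severity scale, and example rule are illustrative assumptions, not the deck's exact design.

    import sqlite3
    from dataclasses import dataclass

    @dataclass
    class Screen:
        screen_key: int        # key referenced by the Error Event Fact Table
        name: str
        category: str          # e.g. "structure", "atomic value", "aggregate"
        severity: int          # data quality score assigned to each violation
        exception_action: str  # e.g. "flag", "reject row", "halt load"
        test_sql: str          # query that returns the rows violating the rule

        def run(self, conn: sqlite3.Connection):
            """Return the offending rows; an empty list means the screen passed."""
            return conn.execute(self.test_sql).fetchall()

    # Example atomic-level screen: order amounts must be non-negative.
    negative_amounts = Screen(
        screen_key=101,
        name="stg_orders.amount is non-negative",
        category="atomic value",
        severity=3,
        exception_action="flag",
        test_sql="SELECT rowid, amount FROM stg_orders WHERE amount < 0",
    )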
  7. Staged data from source systems
     • Data quality begins and ends with the source systems
     • Composed of feeds from source systems
     • Driven by source-system "data map" analysis
     Run screens
     • Run in groups to optimize performance
     • Consumed by a master screen-processing codebase for supportability (see the processing sketch below)
     Data quality metrics
     • Screen definition
     • Data quality score
     • Exception action
     • Screen type and category
     • Screen metadata (elapsed time, row count, byte count)
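A sketch of a master screen-processing loop and the metrics it captures, reusing the hypothetical Screen class above; the screen_run_metrics table layout is an assumption that simply records the items listed on this slide.

    import sqlite3
    import time

    def run_screen_group(conn: sqlite3.Connection, screens):
        # One metrics row per screen execution: score, action, and run metadata.
        conn.execute("""CREATE TABLE IF NOT EXISTS screen_run_metrics (
                            screen_key       INTEGER,
                            run_ts           TEXT DEFAULT CURRENT_TIMESTAMP,
                            elapsed_seconds  REAL,
                            error_row_count  INTEGER,
                            severity         INTEGER,
                            exception_action TEXT)""")
        for screen in screens:
            start = time.perf_counter()
            failures = screen.run(conn)              # rows violating the screen
            elapsed = time.perf_counter() - start
            conn.execute(
                "INSERT INTO screen_run_metrics "
                "(screen_key, elapsed_seconds, error_row_count, severity, exception_action) "
                "VALUES (?, ?, ?, ?, ?)",
                (screen.screen_key, elapsed, len(failures),
                 screen.severity, screen.exception_action),
            )
        conn.commit()

Groups of independent screens can then be handed to separate worker processes, which is what the screen-order idea on the previous slide makes possible.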
  8. Error events
     • Star-schema data structure (an illustrative schema follows this slide)
     • Source-system feed data rows are at the fact level
     • Main facts are the screen definition key and the severity score
     • Primary consumer is the Information Steward
     • Source of the Data Quality detail report
     Audit dimensions
     • One dimension table for each report data source
     • One row for each class of error each time the report is updated
     • Integrated with the report data deliverables
     • Consumed by users as well as the Information Quality Lead
     • Acts as a physical guarantee of "one version of the truth"
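An illustrative star-schema layout for the error event fact table and one audit dimension, again assuming SQLite; the column names follow the grain described on this slide but are assumptions, not the deck's exact design.

    import sqlite3

    conn = sqlite3.connect("staging.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS error_event_fact (
        error_event_key INTEGER PRIMARY KEY,
        screen_key      INTEGER NOT NULL,   -- which screen fired
        severity_score  INTEGER NOT NULL,   -- data quality score
        source_table    TEXT,               -- staged feed the row came from
        source_row_id   INTEGER,            -- offending row, at the fact grain
        batch_date      TEXT
    );

    -- One audit dimension per report data source; one row per error class per run.
    CREATE TABLE IF NOT EXISTS audit_dim_orders (
        audit_key       INTEGER PRIMARY KEY,
        report_run_date TEXT,
        error_class     TEXT,
        error_count     INTEGER,
        max_severity    INTEGER
    );
    """)
    conn.commit()

Shipping the audit dimension alongside the report data is what lets consumers see, in the report itself, how trustworthy each source was for that run.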
  9. Source system owners
     • Own the data provided by the source systems
     • Work with the Information Steward to define strategy and reports
     • Work with the Information Quality Lead to define and enforce data quality metrics
     Information Steward
     • Owns relationships with report consumers to define reports and select data sources
     • Owns relationships with source system owners to define feeds
     Information Quality Lead
     • Owns the business rules for data quality metrics
     • Owns the relationship with source system owners to define and enforce data quality metrics
  10. Key dependencies
     • Diversity of reports
     • Quantity and complexity of data sources
     • Responsiveness of source systems
     Milestones
     • Identify and automate reports: stack-rank reports by the impact of the decisions they drive, map data from sources to reports, and define data quality screens
     • Establish infrastructure: acquire hardware and staff resources, create the QA infrastructure
     • Data quality process: an iterative process between users and source systems
  11. McKeel Research LLC
     • Affiliated with Allyis, Incorporated
     • Contact: James W. Atwater III, Principal Consultant, Management Analytics Practice
     • Office: (425) 996-0427
     • Cell: (425) 766-0832
     • Email/IM: jamesatwater@hotmail.com
