Data Quality Process Design For Analytics And Reporting


An implementation of data auditing for relational and/or ad-hoc reporting systems. Helps establish a glide path toward a more dimensional solution.


  1. Jim Atwater
     Principal Consultant, Management Analytics Practice
     McKeel Research LLC
     12/9/2008. All Rights Reserved.
  2. Ad-hoc reporting processes are largely relational, aspiring to a more dimensional model
     • More than one data source is involved, including relational databases, spreadsheets, SharePoint lists and flat files
     • Source data is staged to a relational database and reported using standard tools like Excel and PowerPoint (see the staging sketch below)
     • Data quality is a concern, guided by a need for “one version of the truth”
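The deck stays tool-agnostic about the staging step, so what follows is a minimal sketch under assumptions of my own: a CSV feed, a SQLite staging database, and a hypothetical stg_headcount table whose columns I invented for illustration. Values land as text so that typing and validation are deferred to the screens introduced later, rather than hidden in the load.

```python
import csv
import sqlite3

def stage_feed(db_path: str, csv_path: str) -> int:
    """Land one source-system feed in a staging table, row for row."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS stg_headcount (
               feed_file TEXT,   -- provenance: which feed the row came from
               org_unit  TEXT,
               period    TEXT,
               headcount TEXT    -- staged as text; typing is deferred to screens
           )"""
    )
    with open(csv_path, newline="") as f:
        rows = [(csv_path, r["org_unit"], r["period"], r["headcount"])
                for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO stg_headcount VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()
    return len(rows)   # row count feeds the screen metadata discussed later
```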
  3. (Diagram slide; no recoverable transcript text.)
  4. No “T” in the “ETL” means:
     • No auditing or error handling
     • No added value until the report is generated
     • Analysts end up doing the same work every reporting cycle
     The QA and formatting step actually adds errors:
     • Column names
     • User-defined aggregations
     • Macros and calculated fields
     This violates the prime directive:
     • “One version of the truth” -> “None versions are the truth”
     • Automation only automates the errors
  5. (Diagram slide; no recoverable transcript text.)
  6. Screen definitions
     • A screen is a specific type of automated test case
     • Screens can validate (among other things):
       • Physical or logical structure
       • Atomic-level data values
       • Logical aggregation values
     • Screens enforce data quality:
       • Identify errors and score their severity
       • Feed the Error Event Fact Table (see the screen sketch below)
     Screen order
     • Allows screens to be run in parallel for better performance
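The slide defines screens abstractly; as one possible reading, the sketch below writes two screens against the hypothetical staged rows from the earlier sketch: a structure screen and an atomic-level value screen. The screen keys and severity scores are invented, and each screen returns the error events destined for the Error Event Fact Table.

```python
from dataclasses import dataclass

@dataclass
class ErrorEvent:
    screen_key: str  # screen definition key (one of the main facts, slide 8)
    severity: int    # severity score (the other main fact)
    row_id: int      # which staged row tripped the screen

def structure_screen(rows: list[dict]) -> list[ErrorEvent]:
    """Structure screen: every staged row must carry the expected columns."""
    expected = {"org_unit", "period", "headcount"}
    return [ErrorEvent("SCR_STRUCTURE", severity=10, row_id=i)
            for i, row in enumerate(rows) if not expected <= row.keys()]

def atomic_value_screen(rows: list[dict]) -> list[ErrorEvent]:
    """Atomic-level screen: headcount must parse as a non-negative integer."""
    return [ErrorEvent("SCR_HEADCOUNT_NUMERIC", severity=8, row_id=i)
            for i, row in enumerate(rows)
            if not str(row.get("headcount", "")).isdigit()]
```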
  7. Staged data from source systems
     • Data quality begins and ends with the source systems
     • Composed of feeds from source systems
     • Driven by source-system “data map” analysis
     Run screens (see the runner sketch below)
     • Run in groups to optimize performance
     • Consumed by a master screen-processing codebase for supportability
     Data quality metrics
     • Screen definition
     • Data quality score
     • Exception action
     • Screen type and category
     • Screen metadata (elapsed time, row count, byte count)
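A sketch of what the master screen-processing step could look like, reusing the screen functions above. The order-grouping and the captured metrics follow the slide; the threading choice and every function name are assumptions, not the author's implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_screen(screen, rows):
    """Run one screen and capture the metrics the slide lists."""
    start = time.perf_counter()
    events = screen(rows)                 # error events, scored by the screen
    return {
        "screen": screen.__name__,        # screen definition
        "error_events": events,
        "elapsed_s": time.perf_counter() - start,   # screen metadata
        "row_count": len(rows),
    }

def run_screen_group(screens, rows):
    """Run one order-group of independent screens in parallel."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda s: run_screen(s, rows), screens))

# Screens in the same order-group have no dependencies on each other, so
# they fan out in parallel; the groups themselves would run sequentially:
# results = run_screen_group([structure_screen, atomic_value_screen], rows)
```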
  8. Error events
     • Star-schema data structure (see the schema sketch below)
     • Source-system feed data rows are at the fact level
     • Main facts are the screen definition key and the severity score
     • Primary consumer is the Information Steward
     • Source of the Data Quality detail report
     Audit dimensions
     • One dimension table for each report data source
     • One row for each class of error each time the report is updated
     • Integrated with the report data deliverables
     • Consumed by users as well as the Information Quality Lead
     • Acts as a physical guarantee of “one version of the truth”
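One way to realize the star schema this slide describes, again in SQLite for portability. Every identifier here is a guess keyed to the slide's vocabulary, not the author's actual schema: a screen dimension, one audit dimension per report data source, and a fact table at the grain of the staged feed rows.

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS dim_screen (
    screen_key  INTEGER PRIMARY KEY,
    screen_name TEXT,
    screen_type TEXT,   -- structure / atomic value / aggregation
    category    TEXT
);
-- One audit dimension per report data source; one row per error class
-- per report update.
CREATE TABLE IF NOT EXISTS dim_audit_headcount (
    audit_key   INTEGER PRIMARY KEY,
    report_run  TEXT,
    error_class TEXT
);
-- The main facts are the screen definition key and the severity score.
CREATE TABLE IF NOT EXISTS fact_error_event (
    screen_key INTEGER REFERENCES dim_screen (screen_key),
    audit_key  INTEGER REFERENCES dim_audit_headcount (audit_key),
    severity   INTEGER,
    source_row INTEGER   -- staged feed row at the grain of the fact
);
"""

conn = sqlite3.connect("dq_audit.db")
conn.executescript(DDL)
conn.close()
```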
  9. Source System Owner
     • Owns data provided by the source systems
     • Works with the Information Steward to define strategy and reports
     • Works with the Information Quality Lead to define and enforce data quality metrics
     Information Steward
     • Owns relationships with report consumers to define reports and select data sources
     • Owns relationships with source system owners to define feeds
     Information Quality Lead
     • Owns business rules for data quality metrics
     • Owns relationship with source system owners to define and enforce data quality metrics
  10. Key dependencies
      • Diversity of reports
      • Quantity and complexity of data sources
      • Responsiveness of source systems
      Milestones
      • Identify and automate reports
        • Stack-rank reports by the impact of the decisions made from them
        • Map data from sources to reports
        • Define data quality screens
      • Establish infrastructure
        • Acquire hardware and staff resources
        • Create QA infrastructure
      • Data quality process
        • Iterative process between users and source systems
  11. McKeel Research LLC
      • Affiliated with Allyis, Incorporated
      • Contact:
        James W. Atwater, 3rd
        Principal Consultant, Management Analytics Practice
        Office: (425) 996-0427
        Cell: (425) 766-0832
        Email/IM: jamesatwater@hotmail.com
