Using CA ERwin modeling to assure data 09162010
Using CA ERwin modeling to assure data 09162010 Presentation Transcript

  • 1. Data Profiling: using CA ERwin Modeling to assure data and metadata
  • 2. abstract
      • This session explores the use of data profiling to increase the accuracy of critical data assets and their associated data models/metadata. This presentation will include examples of how clients have leveraged data profiling in combination with data modeling for master data management, data warehousing, data governance, and other data-intensive initiatives.
  • 3. biography
      • Antonio C. Amorin, President, Data Innovations, Inc.
          – Eighteen years of data modeling experience and fourteen years of experience using CA ERwin® Data Modeler
          – Ten years of data profiling experience and two years of experience using CA ERwin® Data Profiler
          – Data Innovations, Inc. (CA Partner since 2004)
          – Presented at CA World’08, CBI’s Life Sciences Forum on “Customer Data Quality and Integrity”, ERwin User Groups, webcasts, and at client sites
          – Graduated from Illinois State University with a BA in Computer Science and a minor in Economics
  • 4. agenda
      • Data Profiling
      • Data and Metadata Quality
      • Data Governance and Data Warehousing
      • Real-life Examples
      • Summary
  • 5. data profiling
  • 6. data profiling
      • What is data profiling?
          – The analysis of data content to infer metadata
          – A component of data modeling
      • What are the basic components of the CA ERwin® Data Profiler?
          – Column analysis
          – PF key analysis
          – Data object analysis
          – Overlap analysis
  • 7. data profiling
      • Column analysis
          – Inferred metadata provides intimate knowledge of the data content at the column level
              • Cardinality
              • Range
              • Mode
              • Sparse
              • Null count
  • 8. data profiling
      • Column analysis (continued)
              • Value frequencies
              • Pattern frequencies
              • Length frequencies
      • Identify critical data elements
          – Allows the user to focus analysis on specific attributes
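
The column-level measures on slides 7 and 8 can be illustrated with a minimal sketch in plain Python. This is not how CA ERwin® Data Profiler computes them: the function names `profile_column` and `pattern_of`, the digit-to-`9` and letter-to-`A` pattern convention, and the use of a null ratio as a stand-in for the "sparse" indicator (which the slides do not define) are all assumptions made for illustration.

```python
from collections import Counter
import re


def pattern_of(value):
    """Reduce a value to a character-class pattern (assumed convention:
    every digit becomes 9, every ASCII letter becomes A)."""
    text = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", text)


def profile_column(values):
    """Infer simple column metadata from a list of values (None = null).

    Non-null values are assumed to be mutually comparable, e.g. all strings.
    """
    non_null = [v for v in values if v is not None]
    value_freq = Counter(non_null)
    null_count = len(values) - len(non_null)
    return {
        "row_count": len(values),
        "null_count": null_count,
        # Assumed stand-in for the "sparse" indicator: share of null rows.
        "null_ratio": null_count / len(values) if values else 0.0,
        "cardinality": len(value_freq),  # number of distinct non-null values
        "range": (min(non_null), max(non_null)) if non_null else None,
        "mode": value_freq.most_common(1)[0][0] if non_null else None,
        "value_frequencies": dict(value_freq),
        "pattern_frequencies": dict(Counter(pattern_of(v) for v in non_null)),
        "length_frequencies": dict(Counter(len(str(v)) for v in non_null)),
    }


# Hypothetical postal-code column with one null and one suspicious value.
profile = profile_column(["60601", "60601", "6060A", None, "61761"])
print(profile["pattern_frequencies"])
```

In this hypothetical column the pattern frequencies separate the one value containing a letter from the all-digit values, which is the kind of outlier the slides suggest focusing on.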
  • 9. data profiling
      • PF key analysis
          – Cross-table analysis of primary-foreign key relationships
              • Column matches
              • Classification
                  – Parent-child
                  – Reference
                  – None
  • 10. data profiling
      • PF key analysis (continued)
          – Cross-table analysis of primary-foreign key relationships
              • Expressions
                  – table.column=table.column
              • Row hit rate
              • Value hit rate
              • Selectivity
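
The slides name row hit rate, value hit rate, and selectivity without defining them. The sketch below uses one plausible reading, and the definitions are assumptions rather than the tool's documented formulas: row hit rate is the share of non-null child rows whose key value appears in the parent column, value hit rate is the share of distinct child values that appear in the parent column, and selectivity is the ratio of distinct values to rows in the candidate parent column. The function name `key_match_stats` is hypothetical.

```python
def key_match_stats(parent_keys, child_keys):
    """Compare a candidate primary key column with a candidate foreign key column.

    All three measures follow assumed definitions (see the text above).
    """
    parent_set = set(parent_keys)
    child_non_null = [k for k in child_keys if k is not None]
    child_distinct = set(child_non_null)

    row_hits = sum(1 for k in child_non_null if k in parent_set)
    value_hits = len(child_distinct & parent_set)

    return {
        "row_hit_rate": row_hits / len(child_non_null) if child_non_null else 0.0,
        "value_hit_rate": value_hits / len(child_distinct) if child_distinct else 0.0,
        # 1.0 means every parent row carries a distinct key value.
        "parent_selectivity": len(parent_set) / len(parent_keys) if parent_keys else 0.0,
    }


# Hypothetical expression: customer.id = order.customer_id
stats = key_match_stats(parent_keys=[1, 2, 3, 4], child_keys=[1, 1, 2, 9, None])
print(stats)
```

Under these assumed definitions, a row hit rate below 1.0 points at orphaned child rows, and a parent selectivity below 1.0 means the candidate primary key contains duplicates.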
  • 11. data profiling
      • Data objects
          – Similar to subject areas
          – Groups objects together that contain the same data content
          – Based on the parent-child relationships
          – Creates an object view of related tables or files
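
One way to picture grouping tables into data objects from parent-child relationships is to treat the relationships as edges of a graph and collect its connected components. This is an assumption about the idea, not a description of the profiler's actual algorithm; the function name `group_data_objects` and the table names are hypothetical.

```python
from collections import defaultdict


def group_data_objects(tables, relationships):
    """Group tables connected by parent-child relationships.

    `relationships` is a list of (parent_table, child_table) pairs.
    Each returned group is a sorted list of table names.
    """
    neighbors = defaultdict(set)
    for parent, child in relationships:
        neighbors[parent].add(child)
        neighbors[child].add(parent)

    seen = set()
    groups = []
    for table in tables:
        if table in seen:
            continue
        # Depth-first walk over everything reachable from this table.
        seen.add(table)
        stack = [table]
        group = set()
        while stack:
            current = stack.pop()
            group.add(current)
            for other in neighbors[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        groups.append(sorted(group))
    return groups


groups = group_data_objects(
    tables=["customer", "order", "order_line", "product_code"],
    relationships=[("customer", "order"), ("order", "order_line")],
)
print(groups)  # [['customer', 'order', 'order_line'], ['product_code']]
```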
  • 12. data profiling
      • Overlap analysis
          – Cross-system analysis that identifies data content overlap
          – Data Set Summary
              • Provides graphical overview
                  – Legend identifies color-coded data sources
                  – Allows modeler to visualize data content overlap between data sources
  • 13. data profiling
      • Overlap analysis (continued)
          – Data set overlaps
              • Table compares each data source to the other data sources
              • Allows comparison of two data sources at a time
              • Identifies the number of tables and columns that overlap between each data source
  • 14. data profiling
      • Overlap analysis (continued)
          – Column Summary
              • Identifies each column in the primary data source
              • Identifies value overlap between data sources
              • Allows modeler to use critical data elements to focus analysis
              • Allows modeler to drill into analysis to identify data content overlap
  • 15. data profiling
      • Overlap analysis (continued)
          – Matches data preview
              • Allows the modeler to view hits or misses
              • Identifies specific data content that overlaps or does not overlap between each data source
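
The hits-and-misses view described for overlap analysis can be approximated with set operations on the distinct values of one column in two data sources. This is a minimal sketch under that assumption; the function name `value_overlap`, the `overlap_ratio` measure, and the sample state codes are hypothetical and not taken from the tool.

```python
def value_overlap(primary_values, other_values):
    """Compare distinct non-null values of a column in two data sources.

    Hits are primary-source values also found in the other source;
    misses are primary-source values that the other source lacks.
    """
    primary = {v for v in primary_values if v is not None}
    other = {v for v in other_values if v is not None}
    hits = primary & other
    misses = primary - other
    return {
        "hits": sorted(hits),
        "misses": sorted(misses),
        "overlap_ratio": len(hits) / len(primary) if primary else 0.0,
    }


# Hypothetical state-code columns from two systems.
result = value_overlap(["IL", "WI", "IN", "IL"], ["IL", "IN", "OH"])
print(result["hits"], result["misses"])  # ['IL', 'IN'] ['WI']
```

Swapping the two arguments gives the comparison from the other source's point of view, which matches the slides' note that two data sources are compared at a time.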
  • 16. data and metadata quality
  • 17. data and metadata quality
      • Data
          – Business data: information utilized to operate the business
      • Metadata
          – Information generated during the development of IT solutions
          – Defines both the business and technical understanding of the data
          – Utilized to store, process, and report on business data
  • 18. data and metadata quality
      • Data Quality
          – Accuracy of the business data
          – High/low quality
          – Mission critical
      • Metadata quality
          – Properly represents data content
          – Validate parent-child relationships
  • 19. data and metadata quality
      • Leveraging data profiling
          – Use the cardinality, range, mode, and sparse indicators to identify attributes requiring detailed analysis
          – Identify data quality issues and validate data types using the value and pattern frequencies
          – Leverage the null count and length frequencies to validate column metadata
          – Validate parent-child relationships using the primary-foreign key analysis
          – Leverage the overlap analysis with reference tables containing valid values for data quality assessments
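
Validating column metadata with the null count and observed lengths can be sketched as a comparison between what the model declares and what the data contains. The function name `validate_column_metadata`, its parameters, and the wording of the findings are hypothetical, and the declared nullability and length would come from the data model in practice.

```python
def validate_column_metadata(values, declared_nullable, declared_max_length):
    """Return a list of mismatches between declared metadata and observed data."""
    findings = []
    null_count = sum(1 for v in values if v is None)
    longest = max((len(str(v)) for v in values if v is not None), default=0)

    if null_count and not declared_nullable:
        findings.append(f"{null_count} null value(s) in a column declared NOT NULL")
    if longest > declared_max_length:
        findings.append(
            f"observed length {longest} exceeds declared length {declared_max_length}"
        )
    return findings


# Hypothetical column modeled as NOT NULL with a maximum length of 4.
for finding in validate_column_metadata(["A1", None, "ABCDEF"], False, 4):
    print(finding)
```

An empty result means the observed data is consistent with the declared nullability and length; it does not prove the declaration is the right business rule.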
  • 20. data governance and data warehousing
  • 21. data governance and data warehousing
      Leveraging data profiling for data governance
      • Business Data
          – Standards
          – Master data management
          – Data quality assessments
      • Metadata
          – Standards
          – Model validation
  • 22. data governance and data warehousing
      Leveraging data profiling for data governance (continued)
      • Standards
          – Business data: valid values, data patterns, and standardized values for static data content
          – Metadata: validate that model metadata represents data content properly and validate parent-child relationships
          – Automate the analysis with profiling
          – Develop profiling reports for each standard
          – Define and implement a review process
          – Integrate standards and review process into SDLC
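
A profiling report for a single business-data standard could look like the sketch below, which checks a column against either a list of valid values or a required data pattern. The function name `check_standard`, the report fields, and the sample ZIP-code pattern are assumptions for illustration only.

```python
import re


def check_standard(values, valid_values=None, pattern=None):
    """Check non-null values against a valid-value list or a regex pattern.

    If `valid_values` is given it takes precedence; otherwise `pattern`
    must match the whole value.
    """
    violations = []
    checked = 0
    for value in values:
        if value is None:
            continue
        checked += 1
        text = str(value)
        if valid_values is not None:
            if text not in valid_values:
                violations.append(text)
        elif pattern is not None and re.fullmatch(pattern, text) is None:
            violations.append(text)
    compliance = (checked - len(violations)) / checked if checked else 1.0
    return {"checked": checked, "violations": violations, "compliance": compliance}


# Hypothetical standard: five-digit ZIP code with an optional four-digit suffix.
report = check_standard(["60601", "6060", "61761-1234", None], pattern=r"\d{5}(-\d{4})?")
print(report["violations"])  # ['6060']
```

Running one such check per standard on a schedule is one way to read "automate the analysis with profiling"; the review process itself remains a governance decision.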
  • 23. data governance and data warehousing
      Leveraging data profiling for data governance (continued)
      • Master data management (MDM)
          – Locating reference data
          – Data mapping
          – Harmonizing reference data
          – Establishing validations and syndication rules
          – Identifying hub metadata
          – Data quality assessments
  • 24. data governance and data warehousing
      Leveraging data profiling for data governance (continued)
      • Data quality assessments
          – Comprehensive review at the column level
          – Validation of primary keys
          – Validation of parent-child relationships
          – Point-to-point content validation between systems
          – Standardize analysis methodology
          – Standardize problem notation
          – Standardize reporting
  • 25. data governance and data warehousing
      Leveraging data profiling for data warehousing
      • Data warehouse development
          – Leverage data models and data profiling results to locate and map business data to the data warehouse
          – Eliminate the code-load-explode development methodology for ETL
              • Profile each data source to validate data content
              • Identify accurate requirements for transformations to consolidate data content and correct data quality issues
          – Use profiling results to determine model metadata for target staging databases and the data warehouse
          – Profile the data warehouse regularly to ensure high-quality data content
  • 26. real-life examples
  • 27. real-life examples
      Public computer hardware and software manufacturer
      • Introduced data profiling into ongoing data warehousing project
          – Profiled first data source
              • Found questionable data content in financial data within ten minutes of profiling data
              • Realized that six months were wasted mapping from the data source to the target data warehouse
              • All new data sources were profiled going forward to ensure validity
  • 28. real-life examples
      Large public food manufacturer
      • Introduced data profiling into sales data warehouse project
          – Leveraged data profiling results to create accurate ETL specifications, reducing the overall development time
          – Developers utilized data profiling to validate ETL unit testing
          – Used cross-system analysis to integrate data content from disparate data sources into standardized values in data warehouse
          – Profiled data warehouse regularly to identify data content issues
  • 29. real-life examples
      Public healthcare insurance provider
      • Introduced data profiling into ongoing master data management project
          – Performed data content mapping utilizing profiling results
          – Analyzed IMS extracts and flat files to determine where reference data lived within legacy mainframe data sources
          – Leveraged profiling results to create ETL specifications
          – Harmonized reference data using profiling results
          – Validated reference data loaded into MDM hub
  • 30. real-life examples
      Medium-sized accounting service organization
      • Created data store for reporting purposes
          – Profiled disparate data sources to identify model metadata for new data store
          – Leveraged profiling results to identify data quality issues for each data source
          – Created ETL specifications to consolidate data content from the disparate data sources using the profiling results
          – Validated data content in the loaded data store
  • 31. summary
      • Data Profiling
      • Increases accuracy of data content and metadata
      • Reduces project overrun
      • Increases value of deliverables to the business
      • Valuable for master data management, data warehousing, data governance, and other data-intensive initiatives
  • 32. questions and answers
  • 33. thank you