The document discusses data governance and provides an overview of key concepts including:
- Governance establishes standards and policies for processes and data through a governing committee.
- Metadata is important for data management and describes business, operational, technical, and process attributes.
- Data profiling examines data quality and validates adherence to business rules.
- A hybrid architecture with a persistent staging area, enterprise data warehouse, data marts and OLAP cubes supports governance goals.
2. Today’s Agenda
› What is Governance?
› Why Data Governance?
› Governance Infrastructure
› Program versus Project Management
› Single Version of Truth versus Fit For Purpose
› Metadata - The Heart of Governance
› Master Data Management
› Data Profiling – Enforcing and Developing Business Rules
› Q&A
3. What is Governance?
• Governance is a single authoritative group responsible for creating and enforcing standards and policies around processes and data.
• Governance isn't a short-term project; it is an ongoing, enterprise-focused program.
• The Governance Board is the heart of a governance program. It is a cross-functional organization with executive-level membership.
4. What is Governance?
• Governance provides standards and policies for processes and data in the following areas:
Software Products,
Infrastructure,
Quality,
Security,
Adjudicative Dispute Resolution,
Lifecycle,
Best Practices,
Architecture and Future Roadmaps,
Project Prioritization,
Asset Management,
Version Control,
Evangelizing and Communication,
Vendor Relationship Management, and
Legal & Corporate Compliance.
5. Why Data Governance?
• A strong governance program is vital to the success of any enterprise architecture.
• Compliance – Governance programs allow for compliance with regulatory requirements. We have all heard "I am too pretty to go to jail." Well, it's true without governance: without it, there is no formalized process for proving compliance with regulations such as HIPAA and privacy laws.
• Data governance initiatives may be aimed at achieving a number of objectives, including offering better visibility to internal and external customers, for example in supply chain management.
6. Why Data Governance?
• Harmonizing – Governance provides standard definitions. This allows developers, database administrators, end users, and data stewards to work with the same values.
• Consistent Analysis – Allows the business to roll up (consolidate) values consistently, comparing apples to apples.
• Faster Development – By providing standard definitions and models, we provide the infrastructure for rapid development. The hard part is adding data; if we do not need to add data, we can develop applications in days or hours.
7. Why Data Governance?
• Conflict Resolution – Governance resolves disputes between business groups.
• Asset Management – Harvesting and managing assets to maximize business returns. Ensures prioritization from an enterprise view and reduces costs through the elimination of duplicate efforts.
• Security – Operational metadata answers: who accessed what, when, and how? This supports not only regulatory compliance but also better data warehouse design.
• Better data and process quality through clearly defined business rules, and empowerment through exposing those rules at the speed of business.
8. Governance Infrastructure
• The Governing Committee should be a group of key BI senior sponsors, project sponsors, and IT personnel from each of the business units (or at least those that currently or potentially utilize the BI COE services). IT should serve the governance board as trusted advisors, not as its primary drivers; the lead of the governance board should come from the business side.
• This governing committee is not expected to be involved in the day-to-day management of the BI COE.
• The Governing Committee should meet regularly (e.g., every other month or quarterly).
9. Program versus Project Management
• Program management is what forces data governance to be enterprise focused. It is easy to implement the database components without regard to the enterprise if the outer functions are absent; with them firmly in place, the focus must shift to serving the entire enterprise. Program management ensures that silo applications are not created, that data is sharable and reusable, that corporate standards are adhered to, and that best practices are used throughout.
10. Program versus Project Management
Program                                   Project
Ongoing                                   One Time
Broad Perspective                         Specific Perspective
Requires Architecture                     May Not Require Architecture
Long-Term Focus                           Short-Term Focus
Realizes Benefits of Standards and Reuse  May Not Realize the Benefits of Standards and Reuse
Strategy Is Essential                     Strategy Is Not Essential
11. Single Source of Truth versus Fit for Purpose
• Single source of truth means storing one agreed-upon value for each data element, with data governance rules determining which source is authoritative.
• Fit for purpose means storing any number of data values and relying upon data governance rules to determine the appropriate one for each use.
12. Modeling - Fit for Purpose
• For example, clinical discharge date and billing discharge date.
• With fit for purpose, we could keep both in the EDW:
Enc1  s1    12/1/2010  Clinical
Enc2  s2    12/2/2010  Billing
Enc3  SSOT  12/1/2010  Clinical
Additional modeling for fit for purpose:
Peer Tables
Flex Columns
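The table above can be sketched in code. The following is a minimal, illustrative example (the record layout and the function name `derive_ssot` are assumptions, not from the slides): all source values are kept, and a governance rule derives the single-source-of-truth row.

```python
# Fit-for-purpose sketch: keep every source's discharge date, then apply a
# governance rule to derive the single source of truth (SSOT) row.
records = [
    {"source": "s1", "date": "12/1/2010", "domain": "Clinical"},
    {"source": "s2", "date": "12/2/2010", "domain": "Billing"},
]

def derive_ssot(records):
    """Governance rule (assumed here): the clinical value is authoritative."""
    winner = next(r for r in records if r["domain"] == "Clinical")
    return {"source": "SSOT", "date": winner["date"], "domain": winner["domain"]}

ssot = derive_ssot(records)
# {'source': 'SSOT', 'date': '12/1/2010', 'domain': 'Clinical'}
```

Because both dates remain stored, a different consumer (e.g., billing) can still apply its own rule without losing data.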
13. Metadata is the Heart of Governance
• Metadata is very important for the management and operation of data.
• Metadata provides context for data; it is data about data. For example, every food product carries a list of ingredients and its calorie and vitamin content.
• Metadata can be classified into four areas:
• Business metadata
• Operational metadata
• Technical metadata
• Process metadata
14. Business Metadata
• Business metadata describes the business meaning of data. It includes business
definitions of the objects and metrics, hierarchies, business rules, and
aggregation rules.
• Business Metadata must be
• Searchable and
• Easy to access.
15. Business Metadata - Searchable
• Business metadata must be searchable to determine how many occurrences of a term exist. This searchability is important to support a fit-for-purpose architecture versus a single version of the truth.
• For example, we must be able to determine how many different definitions of patient stay exist. Without a business metadata repository, this is a manual effort.
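A minimal sketch of that search, using an invented toy glossary (the entries and the function name `count_definitions` are illustrative, not from the slides):

```python
# Toy business-metadata repository: a searchable glossary lets us count
# how many competing definitions of a term exist.
glossary = [
    {"term": "Patient Stay", "definition": "Admit to clinical discharge"},
    {"term": "Patient Stay", "definition": "Admit to billing discharge"},
    {"term": "Patient Stay", "definition": "Midnight census count"},
    {"term": "Encounter",    "definition": "Any billable patient contact"},
]

def count_definitions(term):
    """Count how many glossary entries define the given term."""
    return sum(1 for entry in glossary if entry["term"] == term)

count_definitions("Patient Stay")  # 3 definitions in this toy repository
```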
16. Business Metadata – Ease of Access
• Business metadata must be integrated into the application so that it is easily accessible by the end user. For example, an end user must be able to move his cursor over a field on his screen and click for a help screen to appear. This help screen should present the business metadata of the selected element, giving a clear definition of the data element.
19. Operational Metadata
• Operational metadata stores information about who accessed what and when. This metadata is important not only for legal requirements but also for the design of the data warehouse itself.
• For example, we can identify that a particular data mart is not being utilized. This enables us to develop a plan: should we eliminate the data mart? Should we provide better education for the end users? Should we redesign the application or the data mart?
20. Technical Metadata
• Technical metadata describes the data structures and formats, such as table types, data types, indexes, and partitioning method. It also describes the location of the data elements.
• Technical metadata is critical in a federated architecture, where a data model is implemented across many different servers in different locations.
21. Technical Metadata
• For example, on Server A, patient's first name might be a 5-character field; on Server B, it may be a 20-character field. The patient's first name would not be consistent across these two servers because the name could be truncated on Server A: Michael would appear as Micha on Server A and Michael on Server B. The governance board needs to understand these technical data abnormalities. The lack of enforcing consistent technical metadata across servers is the primary reason federated architectures fail.
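The truncation problem above can be demonstrated in a few lines (the `store` helper is a hypothetical stand-in for a fixed-width CHAR column):

```python
# Sketch of the slide's truncation problem: Server A stores a 5-character
# first name, Server B a 20-character one, so matching on name silently fails.
def store(value, width):
    """Mimic a fixed-width column that truncates on insert."""
    return value[:width]

server_a = store("Michael", 5)    # 'Micha'
server_b = store("Michael", 20)   # 'Michael'
server_a == server_b              # False: a federated join on name breaks
```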
23. Process Metadata
• Process metadata describes the data input process. It includes data cleansing rules, source-to-target maps, transformation rules, validation rules, and integration rules. Process metadata also covers versioning of processes (services).
• Process metadata tools need to support two important governance features:
• Data Lineage and
• Impact Analysis.
24. Process Metadata – Data Lineage
• The data lineage feature supports graphical visualization of complex data lineage relationships. We have the ability to track the lineage of data from an end-user report or dashboard back to the original data source elements. Data lineage answers questions such as: Where did the data come from? What business rules and transformations were applied? Which data source is the authoritative source?
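One simple way to picture lineage tracking is as a walk up a chain of upstream sources. The element names below are invented for illustration; real lineage tools store far richer graphs:

```python
# Hypothetical lineage store: each element records its immediate upstream
# source; None marks the original (authoritative) source.
lineage = {
    "dashboard.avg_stay": "mart.stay_fact",
    "mart.stay_fact":     "edw.encounter",
    "edw.encounter":      "staging.adt_feed",
    "staging.adt_feed":   None,
}

def trace(element):
    """Walk from a report element back to its original data source."""
    path = [element]
    while lineage.get(element):
        element = lineage[element]
        path.append(element)
    return path

trace("dashboard.avg_stay")
# ['dashboard.avg_stay', 'mart.stay_fact', 'edw.encounter', 'staging.adt_feed']
```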
26. Process Metadata – Data Lineage
Column Name         Description                                      Data Type
DATASOURCE_NUM_ID   Primary key to uniquely identify a data source.  NUMBER(10) NOT NULL
DATASOURCE_CODE     Stores the code assigned.                        VARCHAR2(80)
DATASOURCE_NAME     Stores the name of the data source.              VARCHAR2(80)
DATASOURCE_DESC     Stores the description of the data source.       VARCHAR2(2000)
27. Process Metadata – Impact Analysis
• The impact analysis feature shows us what the impact would be if we changed a process.
• For example, the governance board asks: what effort would it take to reduce the number of different definitions of patient stay from five to two? The business metadata would confirm that there are five different versions of patient stay. The process metadata tool with impact analysis would tell us that 50 ETL programs would have to be modified: 15 easy, 20 medium, and 15 hard. With this information, we can estimate the resources, the timeframe, and the total cost.
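The estimate above is simple arithmetic once impact analysis has classified the affected programs. In this sketch the hours-per-program figures are invented purely for illustration:

```python
# Turn the impact-analysis counts (from the slide) into an effort estimate.
affected  = {"easy": 15, "medium": 20, "hard": 15}   # ETL programs to modify
hours_per = {"easy": 4,  "medium": 16, "hard": 40}   # assumed effort per program

total_programs = sum(affected.values())  # 50 programs, matching the slide
total_hours = sum(affected[k] * hours_per[k] for k in affected)
total_hours  # 15*4 + 20*16 + 15*40 = 980 hours under these assumptions
```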
28. Master Data Management
• Master Data Management (MDM) provides special data quality processes for a limited number of objects. MDM utilizes metadata to assist in cleansing MDM data; however, it is a separate concept from metadata. Not all objects can be given MDM treatment: usually, MDM scope is limited to between 5 and 10 objects. These objects are critical and need special treatment to ensure a higher data quality standard. The MDM repository contains one true version of each master entity, created from multiple source systems, and this "golden copy" is used by downstream systems.
29. Master Data Management
• MDM tools are designed to identify duplicates, handle variations of a key entity across source systems, and standardize the data.
• The following are three objects to consider for MDM in healthcare:
• Master Patient Index,
• Units of Measure, and
• Enterprise Terminology Services.
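As a flavor of the duplicate identification an MDM tool performs, here is a minimal sketch using the standard library's string similarity; real MDM matching is far more sophisticated (phonetic codes, address parsing, survivorship rules), and the threshold here is an arbitrary assumption:

```python
import difflib

def likely_duplicate(name_a, name_b, threshold=0.8):
    """Flag two source-system names as probable duplicates when their
    case-insensitive similarity ratio meets an assumed threshold."""
    ratio = difflib.SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold

likely_duplicate("Michael Smith", "MICHAEL SMITH")   # True
likely_duplicate("Michael Smith", "Robert Jones")    # False
```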
30. Data Profiling
• Data profiling is a data investigation and quality monitoring process. It allows the business to assess the quality of its data through metrics, to discover or infer rules based on the data, and to monitor historical metrics about data quality such as range of values, frequency, patterns/formats, and sparseness.
• Data profiling is a key enforcement mechanism of data governance. Data profiling examines the data to validate it; often this process leads to the discovery of new business rules.
31. Data Profiling Patterns
• Conversion and Translation Rules Validation – Conversion and translation mean conforming data to a single value for a data element. For example, the state code for Colorado might be 23, C, or CO in three different source systems. We need to agree on one value, say CO, for Colorado. We then convert or translate the 23 and C to CO. This allows us to run queries in which CO represents all of the Colorado data.
• Format Validation – Sometimes data elements need a standard format, such as phone number or Social Security Number (SSN). For example, SSN has the format 999-99-9999. The SSN has historically used this format to carry intelligence: the first three digits identified the region of the country where the SSN was first obtained.
• Range Checking – Range checking is utilized to see if data values fit within a boundary of values. For example, a birth date may be checked to verify that the person is no more than 200 years old. Often people leave off the first two digits of a birth year; for example, a person born in 1959 might say they were born in 59.
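The three patterns above can be sketched as simple checks. This is an illustrative toy, not a profiling tool; the mapping, regex, and cutoff date are all assumed values:

```python
import re
from datetime import date

# 1. Conversion/translation: conform variant state codes to a single value.
STATE_MAP = {"23": "CO", "C": "CO", "CO": "CO"}
def conform_state(code):
    return STATE_MAP.get(code, code)

# 2. Format validation: SSN must match the 999-99-9999 pattern.
def valid_ssn(ssn):
    return re.fullmatch(r"\d{3}-\d{2}-\d{4}", ssn) is not None

# 3. Range checking: the birth year must imply an age under 200.
def plausible_birth_year(year, today=date(2011, 1, 1)):
    return today.year - 200 < year <= today.year

conform_state("23")       # 'CO'
valid_ssn("123-45-6789")  # True
plausible_birth_year(59)  # False: '59' was probably meant as 1959
```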
32. Data Profiling Patterns
• Sparseness – A sparseness check evaluates the percentage of the data elements that are actually populated with meaningful data values. We also check that the system is correctly utilizing null values. A null value is a missing value; some systems mistakenly insert a zero or blank in place of a null value.
• Uniqueness Checking – In a relational database, every table that is normalized to at least 1st normal form needs a primary key. A primary key enforces uniqueness.
• Frequency Distribution – A count of the distinct values in a column. If a value has only one occurrence in a table with millions of rows, it may be an indication that this value should be merged with other values.
• Overloaded Columns – A check to see if a column holds multiple kinds of values. Sometimes different developers store different values in the same column; this may indicate that we need to redesign the column into two or more columns.
• Best Sources – Checking that a column is being populated from the best source. Often a column could be populated from multiple sources, but one source is usually better than the others; this best source is often referred to as a gold source.
• Derivation – If a data value is being derived, check that the calculation is correct.
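Two of the checks above, sparseness and frequency distribution, can be sketched over a single toy column (the sample values are invented for illustration):

```python
from collections import Counter

# A toy column with nulls and a suspected typo ('C0' instead of 'CO').
column = ["CO", "CO", "TX", "CO", None, "C0", "TX", None]

# Sparseness: the share of rows populated with a meaningful (non-null) value.
populated = sum(1 for v in column if v is not None)
sparseness = populated / len(column)        # 6/8 = 0.75

# Frequency distribution: values seen only once are candidates for merging.
freq = Counter(v for v in column if v is not None)
rare = [v for v, n in freq.items() if n == 1]   # ['C0'] - likely a typo for 'CO'
```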
33. Data Profiling Patterns
• Data Enrichment – Adding additional information to a data element; check that the process is correct.
• Aggregation Hierarchies Validation – Validate the hierarchies utilized to roll up or aggregate data values.
• Matching – Validating that the matching process is correct. Matching data values is very complex and is covered more under MDM (Master Data Management).
• Dependency Checking – Validating dependencies between columns.
36. Appendix A – Hybrid Architecture
› PSA – Persistent Staging Area
› EDW – Enterprise Data Warehouse
› Data Marts
› Cubes – OLAP
37. Hybrid Architecture Diagram
[Diagram: source systems (Clinical, Clinical Messaging, Financial, HR, Supply Chain) feed a Persistent Staging Area, which loads an atomic-level 3NF EDW; the EDW feeds star dimensional data marts and an OLAP cube mart, which drive dashboards and reports.]
38. Persistent Staging Area
› Handles Late-Arriving Data.
› Easier and Conformed ETL.
› Supports Incremental Loads.
› SORP (System of Original Record Proxy).
› Supports Data Lineage.
› Flatter 3rd Normal Form; Better Integration than a Star Dimensional Model.
› Data Profiling to Develop Data Governance Rules.
39. Enterprise Data Warehouse
› 3rd normal form. Better Integration than Star Dimensional model.
› Supports both a Single Source of Truth and Fit for Purpose.
› Supports Versioning.
› Data Profiling to support Data Governance Rules.
40. Data Marts and OLAP Cubes
› Data Profiling to support Data Governance Rules.
› Star dimensional modeling is a better (easier) presentation format than 3rd normal form.
› Fast response with partitioning and bitmap indexes.
› Supports Versioning.