- BI Architecture in Support of Data Quality -
BI Architecture in Support of Data Quality
31-OCT-2012
Tom Breur runs consulting firm XLNT Consulting
(www.xlntconsulting.com) dedicated to helping companies make more
money with their data. His fields of interest span data mining, analytics,
data quality, IT governance, Agile development, data warehousing, and
business models. Tom was one of the very first to pass the Information
Quality Certified Professional exam.
Chapter proposal for “Information Quality and Governance for Business
Intelligence”
Topic: Development of DQ/IQ oriented architectures for BI
Chapter proposal
Business intelligence (BI) projects are risky. According to analyst firms such as Gartner and the Standish Group, these projects have alarmingly high failure rates. In large part, these risks are caused by data quality problems.
BI teams merely "hold" corporate data (that is, they provide stewardship); they do not "own" them. However, under organizational pressure, they are sometimes "forced" into a role of ownership of data, and of the accompanying data quality errors, when they attempt to integrate disparate data sources in the data warehouse.
When BI teams inadvertently become “owners” of corporate data and
their interpretation, this creates a fundamental problem in business
alignment: whenever you become the problem holder (the person experiencing the pain) without adequate resources to solve the problem at its root (those rest with the problem owner), you risk getting caught in the (corporate) line of fire. Fundamentally, this problem arises because the BI team
“needs” to decide how to align data from different data providers, without
access to the resources to resolve the underlying discrepancies.
This problem can either be mitigated, or exacerbated, depending on the
architecture chosen for the data warehouse. An optimal architecture for
data warehousing should allow interpretation of data to be left to their
owners. This is not a responsibility of the BI team. They can facilitate, and
shed light on source system characteristics, but the BI team should never
be tasked to “own” interpretation of data quality problems.
 
Therefore, in this chapter we will suggest using an architecture that provides auditable and (easily) traceable data transformations. Immediate data lineage from the output of business intelligence reports back to the data files as provided by source system owners enables a BI team to stay "semantics agnostic" with regard to the interpretation required for data integration in the data warehouse. This also allows the team to focus exclusively on their stewardship role, and to design tightly engineered data flows.
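To make the idea of immediate data lineage concrete, here is a minimal sketch (in Python). The column names `record_source` and `load_dts` follow a common Data Vault convention, but the function name and the example file name are our own illustrative assumptions:

```python
from datetime import datetime, timezone

def stage_records(records, record_source):
    """Wrap each raw source record in lineage metadata, so that any
    figure in a downstream report can be traced back to the exact
    source file and load run it came from."""
    load_dts = datetime.now(timezone.utc).isoformat()
    return [{"record_source": record_source,  # e.g. the delivered file
             "load_dts": load_dts,            # timestamp of this load run
             "payload": rec}                  # source data, untouched
            for rec in records]

staged = stage_records([{"cust_id": 42, "name": "ACME"}],
                       "crm_extract_2012-10-31.csv")
```

Because the payload is stored untouched, any interpretation applied later remains separable from what the source system owner actually delivered.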
In the 1990s an ideological "war" raged between Bill Inmon and Ralph Kimball. It is safe to conclude that Kimball emerged as the "winner" of this
debate. Nowadays, the majority of data warehouse architectures are
based on Kimball’s proposed “Bus architecture”, a two-tiered data
warehouse architecture.
In this chapter, we will recommend a revival of three-tiered data
warehouse architectures, also called hub-and-spoke topologies. But
contrary to Bill Inmon’s (initial) suggestion, we recommend against third
normal form as the modeling paradigm for the central hub. Instead, we
suggest using a hyper normalized data-modeling paradigm that was
specifically designed for data warehousing for the central hub.
A thriving community of BI experts has emerged that embraces an approach to data warehousing often based on some version of the Data Vault modeling paradigm, as first described and advocated by Dan Linstedt. However, the approach we recommend is neither tied nor limited to Data Vault modeling. Instead, we suggest using "a" hyper
normalized paradigm that has been specifically designed for data
warehousing. You need a modeling paradigm that is resilient to change
(as we know BI requirements are often ambiguous), and that safeguards
against a loss of historical information when changes to the data model
need to be made over time.
One of the alternative approaches to Data Vault modeling that has come
available for such a three-tiered architecture for data warehousing is
Anchor Modeling, as described by Lars Rönnbäck. One can make a case
that on theoretical grounds the Anchor Modeling approach is conceptually
superior. Other approaches to hyper normalization are likely to emerge in
the future in this young and burgeoning field.
There are a few reasons why hyper normalized approaches to data
modeling for the central hub in three-tiered data warehouse architecture
have proven superior. First of all, in Agile fashion, BI developers want to
embrace change. Experience has now shown that a special purpose hyper
normalized data modeling paradigm can deal much better with changing
 
requirements because core concepts (business keys) are stored separately
from their (historical) descriptive attributes, and interconnected
relationships.
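As an artificially simplified illustration of this separation, consider a minimal Data Vault-style schema (sketched here in Python with SQLite; all table and column names are our own illustrative choices, not prescribed by the paradigm):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hub: only the business key, nothing else.
cur.execute("""CREATE TABLE hub_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_bk  TEXT NOT NULL UNIQUE)""")

# Satellite: descriptive attributes, historized by load date.
cur.execute("""CREATE TABLE sat_customer_name (
    customer_key INTEGER REFERENCES hub_customer,
    load_dts     TEXT,
    name         TEXT,
    PRIMARY KEY (customer_key, load_dts))""")

# Link: relationships between business keys, kept separate again.
cur.execute("""CREATE TABLE link_customer_order (
    customer_key INTEGER REFERENCES hub_customer,
    order_key    INTEGER)""")

# A new requirement ("track customer segment") only adds a new
# satellite; the hub, the existing satellite, and the link are untouched.
cur.execute("""CREATE TABLE sat_customer_segment (
    customer_key INTEGER REFERENCES hub_customer,
    load_dts     TEXT,
    segment      TEXT,
    PRIMARY KEY (customer_key, load_dts))""")
```

The last statement is the point: a change in requirements arrives as a purely additive change, so it does not ripple through the existing model.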
End-users don’t always have their requirements clear at the outset of a
project. Agile methods suggest making it as easy (and inexpensive) as
possible for end-users to change their requirements. True needs often
only emerge in the course of using BI solutions, which makes it
fundamentally impossible to provide stable requirements upfront. The
architecture should support this goal, and should not make it prohibitively
expensive for end-users to change their mind. This drive has led to the
development of hyper normalized data models specifically geared to data
warehousing.
We will demonstrate (using artificially simplified modeling challenges to illustrate the principle) why third normal form modeling in a data warehouse does not handle change very well. This has to do with the way
Inmon (initially) suggested the time dimension should be modeled into the
data warehouse, by appending a time stamp to the primary key. This
causes serious maintenance problems, as practitioners who tried to
implement this approach have discovered (the hard way...).
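A minimal sketch of the issue (Python with SQLite; the table layout is a simplified stand-in for that early guidance, not a quote of it):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# 3NF with the time stamp appended to the primary key.
cur.execute("""CREATE TABLE customer (
    customer_id INTEGER,
    valid_from  TEXT,
    name        TEXT,
    city        TEXT,
    PRIMARY KEY (customer_id, valid_from))""")

# Every child table must now carry the composite key as well, so each
# new customer version ripples into every referencing table.
cur.execute("""CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER,
    valid_from  TEXT,
    FOREIGN KEY (customer_id, valid_from)
        REFERENCES customer (customer_id, valid_from))""")

cur.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                [(1, "2012-01-01", "ACME", "Boston"),
                 (1, "2012-06-01", "ACME", "Chicago")])

# Even "give me the current row" needs a correlated subquery.
cur.execute("""SELECT city FROM customer AS c
               WHERE customer_id = 1
                 AND valid_from = (SELECT MAX(valid_from) FROM customer
                                   WHERE customer_id = c.customer_id)""")
current_city = cur.fetchone()[0]  # 'Chicago'
```

Every query against current state pays this price, and every referencing table is chained to the versioning of its parent.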
Also, because three different types of entities, namely business keys, descriptive attributes (and their history), and relationships, are stored together in one table, changes ripple through the model. This makes it difficult and expensive to contain the risks of (unforeseen) changes in requirements.
Third normal form modeling is also undesirable for another reason: it makes it hard to determine and scope data model boundaries.
It doesn’t support incremental development very well, which is a
requirement Agile developers have come to appreciate very much. They
want to be able to start very small, and efficiently grow in tiny
increments, without adding significant overhead because of these “baby
steps.”
In the chapter, we will also explain why Dr. Kimball’s approach to data
warehousing (a two-tiered topology, so-called Bus-architecture) has
important drawbacks, and is intrinsically non-agile. The primary reason for
this is that the smallest increment of delivery is a star, which is considered
too big for frequent and early delivery of value to end-users, as specified
by Agile principles. On top of that, the Bus architecture forces developers to commit too early to which attributes to include, and to how dimensions should be conformed.
 
All of these implications of Dr. Kimball's proposed architecture for data warehousing make development teams less Agile, because they defy the Agile principle of delaying decisions to the last responsible moment. Also, changes to the Bus architecture data model tend to be
cumbersome and expensive. They require keeping a full history of all
loaded tables in a persistent staging area, and careful engineering to
avoid additional cost when redoing the initial load in the case of a data
model change.
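Such a persistent staging area can be sketched as an append-only archive of source deliveries (Python with SQLite; table and function names are illustrative assumptions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# One append-only table per source entity: rows are never updated or
# deleted, only added, tagged with the delivery they arrived in.
cur.execute("""CREATE TABLE psa_customer (
    load_dts      TEXT,
    record_source TEXT,
    raw_row       TEXT)""")

def archive(rows, source, load_dts):
    """Store each delivered row verbatim, tagged with its origin."""
    cur.executemany("INSERT INTO psa_customer VALUES (?, ?, ?)",
                    [(load_dts, source, r) for r in rows])

archive(["1;ACME;Boston"],  "crm.csv", "2012-10-01")
archive(["1;ACME;Chicago"], "crm.csv", "2012-11-01")

# After a data model change, the initial load can simply be replayed
# from here, without going back to the source system owners.
history = cur.execute("SELECT COUNT(*) FROM psa_customer").fetchone()[0]
```

Because nothing is ever overwritten, a redesigned target model can be reloaded from staging at any time without asking source systems to redeliver history.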
One of the engineering principles most eloquently described in the Lean Software Development methods is delaying decisions until the last responsible moment. This is used in conjunction with concurrent development. Decisions that haven't been made never need to be reverted, and so carry zero cost of reengineering.
Data Quality
A more fundamental reason why a three-tiered architecture better
supports data quality objectives, has to do with the need to support
company wide decision-making. In order to make a “single version of the
truth” possible, BI teams need to conform dimensions (regardless of
whether they use Kimball or Inmon’s approach). This is where many
purported “data quality” problems arise.
Data quality problems stem largely from one of two sources: either
inaccurate data entry, or misaligned conforming of dimensions.
Conforming dimensions is largely a political corporate discussion. It hinges
on two (or worse: even more) departments agreeing on a “view” on the
facts as have been surfaced by the BI team.
Note that not conforming doesn't provide a solution, either. In that case, two seemingly similar constructs, e.g. "customer" as defined by Marketing versus "customer" as acknowledged by Finance, will each have their own dimension. However, because of (slight) differences in definition, slight differences in counts will result when you query the same data using the two different "Customer" dimensions.
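A toy illustration of how two plausible definitions yield two different counts (Python; the definitions attributed to Marketing and Finance here are hypothetical):

```python
# Hypothetical definitions: Marketing counts anyone with an account,
# Finance counts only parties that have actually been invoiced.
accounts = [
    {"id": 1, "invoiced": True},
    {"id": 2, "invoiced": False},   # a prospect: has an account, never billed
    {"id": 3, "invoiced": True},
]

dim_customer_marketing = [a["id"] for a in accounts]
dim_customer_finance = [a["id"] for a in accounts if a["invoiced"]]

# Same underlying facts, two "Customer" dimensions, two counts.
print(len(dim_customer_marketing), len(dim_customer_finance))  # 3 2
```

Neither count is "wrong"; they simply answer two subtly different questions, which is exactly why the discrepancy reads as a data quality problem to end-users.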
This challenge surfaces a fundamental flaw in the Bus architecture: the BI
team needs to facilitate a discussion between two or more stakeholders of
the data, on how exactly to conform dimensions, i.e., how to define the
meaningful entities as described in dimensions of an enterprise wide data
warehouse. The BI team should facilitate this discussion, not take
ownership of the eventual decision.
But in the early data warehouse project stages, none of the business users have access to data from the data warehouse. The BI team hasn't
 
built their solution, yet. Therefore, end-users cannot (easily) run reports
using one or the other definition to see what impact they will have on the
numbers as they appear in corporate reporting. So quite “naturally”, the
BI team either fails to facilitate an informed discussion between
stakeholders throughout the business, or “has” to take ownership
themselves for how they conform dimensions. Both outcomes are highly
undesirable, and largely the result of a two-tiered data warehouse
architecture.
This ownership for how to conform dimensions, gets the BI team caught
smack in the middle of corporate cross-fire: no matter how you define an
entity in a conformed dimension, “the numbers” (relative to management
performance objectives) will always come out better for one
manager/department than for another. Hopefully you were either very smart or lucky, and happened to choose a definition that favors the more influential stakeholder (or at least disadvantages the one with less power to retaliate).
What effectively happens in most projects using the "traditional" Kimball (Bus) architecture is that the BI team commits to a certain view on how to conform dimensions, but this commitment is made too early. Changing that solution also becomes prohibitively expensive as time goes by. Therefore, the BI team finds itself "locked into" a design, and loses flexibility. This can be even more problematic when (some) data quality
issues don’t surface until much later.
Conforming dimensions is, again, largely a political corporate discussion. These discussions become "risky" when (individual) performance objectives are at stake: whoever ends up with a less favorable number will quite naturally resist that view on how to conform the dimension.
We, as BI professionals, don’t own the data. We merely hold them for the
business. As stewards it is our responsibility to surface discrepancies in
definitions, and quantify where and how the business is losing money as
a result of this confusion.
Data quality problems are rarely truly “technical” in nature. Mostly, data
quality problems are the result of poor (internal) alignment of corporate
objectives. The classic, hackneyed example is rewarding people for speed
of data entry, rather than for its quality. The back office manager
(problem owner) is expected to process as much work as possible, with
the least amount of resources. Downstream information users (problem
holders) bear the brunt of hasty work, and failure to install (“time
consuming”) inspect-and-adapt cycles.
But the grand perspective on this dynamic goes much further. The
majority of data quality issues point to cracks in the corporate value
 
chain. In the chapter I will argue that information loss (either “errors” or
inappropriate reports) is fundamentally caused by a failure to conform
dimensions (politically: align corporate objectives). In that way, BI team
members take on the burden of misalignment between cross-
departmental objectives. Risky business, indeed.
What are we to do about this?
If you store all the (source) facts in your data warehouse, and keep them available, traceable, and auditable, there is a way out of this conundrum.
First of all, this requires you move away from 20th century data
warehouse architectures that have required us to conform dimensions
early in the project. This puts us at a disadvantage, because requirements
are ambiguous, and amenable to change. If the cost of change then
becomes prohibitive, we would need to take irresponsible risks. Instead,
we’d like to mitigate those risks.
In Agile software development practices we like the idea of "deciding at
the last responsible moment", in conjunction with concurrent development
practices. That way, we can show end-users bits of functionality (very)
early on. This gives us credit with our sponsors, and business stakeholders
will gain confidence in our progress. But at least as important: the sooner
reliable feedback about data accuracy is passed back to the development
team, the more efficient they will be able to work.
Imagine you could run your BI projects like this. You inform all
departments involved what the consequences are (counts, frequencies,
amounts, etc.) of choosing a particular definition to conform data across
systems and departments. That way, you empower all stakeholders to
have an informed discussion about what "view" on the facts would best
support their particular (end-to-end) value stream as a holistic business.
All business stakeholders have good reasons to cling to their departmental
specific view. When you apply this approach to managing BI project risk,
stakeholders should be well-placed to explain why the numbers for
"corporate" don't exactly line up with the numbers for the "department."
No problem, there may be excellent, business process specific reasons for
that.
Conversely, you can imagine when you attempt to conform dimensions
before "the business" has agreed on common dimensions. Potentially, you
might bear the brunt of their frustration when they are confronted with
erroneous reports! Risky business, indeed.

More Related Content

What's hot

Business Intelligence
Business IntelligenceBusiness Intelligence
Business IntelligenceAmulya Lohani
 
Business Intelligence Introduction
Business Intelligence IntroductionBusiness Intelligence Introduction
Business Intelligence IntroductionAmr Ali
 
Business Intelligence 3.0 Revolution
Business Intelligence 3.0 RevolutionBusiness Intelligence 3.0 Revolution
Business Intelligence 3.0 Revolutionwww.panorama.com
 
Realizing a multitenant big data infrastructure 3
Realizing a multitenant big data infrastructure 3Realizing a multitenant big data infrastructure 3
Realizing a multitenant big data infrastructure 3Steven Sit
 
All you need to know about banking by IBM
All you need to know about banking by IBMAll you need to know about banking by IBM
All you need to know about banking by IBMSofia Cherradi
 
Advanced Analytics Platform for Big Data Analytics
Advanced Analytics Platform for Big Data AnalyticsAdvanced Analytics Platform for Big Data Analytics
Advanced Analytics Platform for Big Data AnalyticsArvind Sathi
 
Business Intelligence Module 3
Business Intelligence Module 3Business Intelligence Module 3
Business Intelligence Module 3Home
 
Business Intelligence Module 2
Business Intelligence Module 2Business Intelligence Module 2
Business Intelligence Module 2Home
 
Agile BI via Data Vault and Modelstorming
Agile BI via Data Vault and ModelstormingAgile BI via Data Vault and Modelstorming
Agile BI via Data Vault and ModelstormingDaniel Upton
 
Business intelligence and big data
Business intelligence and big dataBusiness intelligence and big data
Business intelligence and big dataShäîl Rûlès
 
Implementing Advanced Analytics Platform
Implementing Advanced Analytics PlatformImplementing Advanced Analytics Platform
Implementing Advanced Analytics PlatformArvind Sathi
 
Datamonitor Decision Matrix Selecting A Business Intelligence Vendor (Competi...
Datamonitor Decision Matrix Selecting A Business Intelligence Vendor (Competi...Datamonitor Decision Matrix Selecting A Business Intelligence Vendor (Competi...
Datamonitor Decision Matrix Selecting A Business Intelligence Vendor (Competi...Cezar Cursaru
 
Industry Analyst Perspective - what does the next generation of MDM look like...
Industry Analyst Perspective - what does the next generation of MDM look like...Industry Analyst Perspective - what does the next generation of MDM look like...
Industry Analyst Perspective - what does the next generation of MDM look like...Aaron Zornes
 
Business Intelligence, Portals, Dashboards and Operational Matrix with ShareP...
Business Intelligence, Portals, Dashboards and Operational Matrix with ShareP...Business Intelligence, Portals, Dashboards and Operational Matrix with ShareP...
Business Intelligence, Portals, Dashboards and Operational Matrix with ShareP...Optimus BT
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemKiran kumar
 

What's hot (20)

Business Intelligence
Business IntelligenceBusiness Intelligence
Business Intelligence
 
Business Intelligence Introduction
Business Intelligence IntroductionBusiness Intelligence Introduction
Business Intelligence Introduction
 
Business Intelligence 3.0 Revolution
Business Intelligence 3.0 RevolutionBusiness Intelligence 3.0 Revolution
Business Intelligence 3.0 Revolution
 
Realizing a multitenant big data infrastructure 3
Realizing a multitenant big data infrastructure 3Realizing a multitenant big data infrastructure 3
Realizing a multitenant big data infrastructure 3
 
Business Intelligence - Conceptual Introduction
Business Intelligence - Conceptual IntroductionBusiness Intelligence - Conceptual Introduction
Business Intelligence - Conceptual Introduction
 
All you need to know about banking by IBM
All you need to know about banking by IBMAll you need to know about banking by IBM
All you need to know about banking by IBM
 
Business intelligence
Business intelligenceBusiness intelligence
Business intelligence
 
Business process based analytics
Business process based analyticsBusiness process based analytics
Business process based analytics
 
Advanced Analytics Platform for Big Data Analytics
Advanced Analytics Platform for Big Data AnalyticsAdvanced Analytics Platform for Big Data Analytics
Advanced Analytics Platform for Big Data Analytics
 
Business Intelligence Module 3
Business Intelligence Module 3Business Intelligence Module 3
Business Intelligence Module 3
 
Business Intelligence Module 2
Business Intelligence Module 2Business Intelligence Module 2
Business Intelligence Module 2
 
Agile BI via Data Vault and Modelstorming
Agile BI via Data Vault and ModelstormingAgile BI via Data Vault and Modelstorming
Agile BI via Data Vault and Modelstorming
 
BI Presentation
BI PresentationBI Presentation
BI Presentation
 
Business intelligence and big data
Business intelligence and big dataBusiness intelligence and big data
Business intelligence and big data
 
Implementing Advanced Analytics Platform
Implementing Advanced Analytics PlatformImplementing Advanced Analytics Platform
Implementing Advanced Analytics Platform
 
Datamonitor Decision Matrix Selecting A Business Intelligence Vendor (Competi...
Datamonitor Decision Matrix Selecting A Business Intelligence Vendor (Competi...Datamonitor Decision Matrix Selecting A Business Intelligence Vendor (Competi...
Datamonitor Decision Matrix Selecting A Business Intelligence Vendor (Competi...
 
National Bank MDM Initiative
National Bank MDM InitiativeNational Bank MDM Initiative
National Bank MDM Initiative
 
Industry Analyst Perspective - what does the next generation of MDM look like...
Industry Analyst Perspective - what does the next generation of MDM look like...Industry Analyst Perspective - what does the next generation of MDM look like...
Industry Analyst Perspective - what does the next generation of MDM look like...
 
Business Intelligence, Portals, Dashboards and Operational Matrix with ShareP...
Business Intelligence, Portals, Dashboards and Operational Matrix with ShareP...Business Intelligence, Portals, Dashboards and Operational Matrix with ShareP...
Business Intelligence, Portals, Dashboards and Operational Matrix with ShareP...
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse System
 

Similar to Bi ar

BI Architecture in support of data quality
BI Architecture in support of data qualityBI Architecture in support of data quality
BI Architecture in support of data qualityTom Breur
 
Search-based BI. Getting ready for the next wave of innovation in Business In...
Search-based BI. Getting ready for the next wave of innovation in Business In...Search-based BI. Getting ready for the next wave of innovation in Business In...
Search-based BI. Getting ready for the next wave of innovation in Business In...grauw
 
Accelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature EngineeringAccelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature EngineeringCognizant
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookJames Serra
 
The Case for Business Modeling
The Case for Business ModelingThe Case for Business Modeling
The Case for Business ModelingNeil Raden
 
Business Intelligence Solution on Windows Azure
Business Intelligence Solution on Windows AzureBusiness Intelligence Solution on Windows Azure
Business Intelligence Solution on Windows AzureInfosys
 
Adapting data warehouse architecture to benefit from agile methodologies
Adapting data warehouse architecture to benefit from agile methodologiesAdapting data warehouse architecture to benefit from agile methodologies
Adapting data warehouse architecture to benefit from agile methodologiesbboyina
 
BR Allen Associates compares Information Infrastructure providers
BR Allen Associates compares Information Infrastructure providersBR Allen Associates compares Information Infrastructure providers
BR Allen Associates compares Information Infrastructure providersIBM India Smarter Computing
 
Multi dimensional modeling
Multi dimensional modelingMulti dimensional modeling
Multi dimensional modelingnoviari sugianto
 
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Daniel Zivkovic
 
Dw hk-white paper
Dw hk-white paperDw hk-white paper
Dw hk-white paperjuly12jana
 
Shareinsights an-end-to-end-implementation-of-the-modern-analytics-archi...
Shareinsights an-end-to-end-implementation-of-the-modern-analytics-archi...Shareinsights an-end-to-end-implementation-of-the-modern-analytics-archi...
Shareinsights an-end-to-end-implementation-of-the-modern-analytics-archi...Accelerite
 
Ch33 - Dim Modelling
Ch33 - Dim ModellingCh33 - Dim Modelling
Ch33 - Dim ModellingRavi S
 
Clabby Analytics case study on Business Analytics Cloud on System z
Clabby Analytics case study on Business Analytics Cloud on System zClabby Analytics case study on Business Analytics Cloud on System z
Clabby Analytics case study on Business Analytics Cloud on System zIBM India Smarter Computing
 
Data Architecture Process in a BI environment
Data Architecture Process in a BI environmentData Architecture Process in a BI environment
Data Architecture Process in a BI environmentSasha Citino
 

Similar to Bi ar (20)

BI Architecture in support of data quality
BI Architecture in support of data qualityBI Architecture in support of data quality
BI Architecture in support of data quality
 
Search-based BI. Getting ready for the next wave of innovation in Business In...
Search-based BI. Getting ready for the next wave of innovation in Business In...Search-based BI. Getting ready for the next wave of innovation in Business In...
Search-based BI. Getting ready for the next wave of innovation in Business In...
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Dw bi
Dw biDw bi
Dw bi
 
Accelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature EngineeringAccelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature Engineering
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
 
The Case for Business Modeling
The Case for Business ModelingThe Case for Business Modeling
The Case for Business Modeling
 
Business Intelligence Solution on Windows Azure
Business Intelligence Solution on Windows AzureBusiness Intelligence Solution on Windows Azure
Business Intelligence Solution on Windows Azure
 
Data warehouse presentation
Data warehouse presentationData warehouse presentation
Data warehouse presentation
 
Adapting data warehouse architecture to benefit from agile methodologies
Adapting data warehouse architecture to benefit from agile methodologiesAdapting data warehouse architecture to benefit from agile methodologies
Adapting data warehouse architecture to benefit from agile methodologies
 
BR Allen Associates compares Information Infrastructure providers
BR Allen Associates compares Information Infrastructure providersBR Allen Associates compares Information Infrastructure providers
BR Allen Associates compares Information Infrastructure providers
 
Multi dimensional modeling
Multi dimensional modelingMulti dimensional modeling
Multi dimensional modeling
 
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
 
Offers bank dss
Offers bank dssOffers bank dss
Offers bank dss
 
Dw hk-white paper
Dw hk-white paperDw hk-white paper
Dw hk-white paper
 
Shareinsights an-end-to-end-implementation-of-the-modern-analytics-archi...
Shareinsights an-end-to-end-implementation-of-the-modern-analytics-archi...Shareinsights an-end-to-end-implementation-of-the-modern-analytics-archi...
Shareinsights an-end-to-end-implementation-of-the-modern-analytics-archi...
 
Ch33 - Dim Modelling
Ch33 - Dim ModellingCh33 - Dim Modelling
Ch33 - Dim Modelling
 
Course Outline Ch 2
Course Outline Ch 2Course Outline Ch 2
Course Outline Ch 2
 
Clabby Analytics case study on Business Analytics Cloud on System z
Clabby Analytics case study on Business Analytics Cloud on System zClabby Analytics case study on Business Analytics Cloud on System z
Clabby Analytics case study on Business Analytics Cloud on System z
 
Data Architecture Process in a BI environment
Data Architecture Process in a BI environmentData Architecture Process in a BI environment
Data Architecture Process in a BI environment
 

Bi ar

  • 1.   - BI Architecture in Support of Data Quality -   1   BI Architecture in Support of Data Quality 31-OKT-2012 Tom Breur runs consulting firm XLNT Consulting (www.xlntconsulting.com) dedicated to helping companies make more money with their data. His fields of interest span data mining, analytics, data quality, IT governance, Agile development, data warehousing, and business models. Tom was one of the very first to pass the Information Quality Certified Professional exam. Chapter proposal for “Information Quality and Governance for Business Intelligence” Topic: Development of DQ/IQ oriented architectures for BI Chapter proposal Business intelligence (BI) projects are risky. According to analyst firms like Gartner, or the Standish Group reports, for instance, these projects have petrifying failure rates. In large part, these risks are caused by data quality problems. BI teams merely "hold" corporate data (as in: provide Stewardship), they do not "own" them. However, under organizational pressure, they are sometimes "forced" into a role of ownership of data, and accompanying data quality errors, when they attempt to integrate disparate data sources in the data warehouse. When BI teams inadvertently become “owners” of corporate data and their interpretation, this creates a fundamental problem in business alignment: every time you become problem holder (the person experiencing pain), without adequate resources to solve the problem at its root (the problem owner), you risk getting caught in the (corporate) line of fire. Fundamentally, this problem is caused because the BI team “needs” to decide how to align data from different data providers, without access to the resources to resolve the underlying discrepancies. This problem can either be mitigated, or exacerbated, depending on the architecture chosen for the data warehouse. An optimal architecture for data warehousing should allow interpretation of data to be left to their owners. 
This is not a responsibility of the BI team. They can facilitate, and shed light on source system characteristics, but the BI team should never be tasked to “own” interpretation of data quality problems.
  • 2.   - BI Architecture in Support of Data Quality -   2   Therefore, in this chapter we will suggest using an architecture that provides auditable and (easily) traceable data transformations. Immediate data lineage from the output of business intelligence reports back to data files as provided by source system owners enables a BI team stay “semantics agnostic” with regards to interpretation as required for data integration in the data warehouse. This also allows the team to focus exclusively on their stewardship role, and design tightly engineered data flows. In the 90’s an “ideological “war” raged between Bill Inmon and Ralph Kimball. It is safe to conclude that Kimball emerged as the “winner” of this debate. Nowadays, the majority of data warehouse architectures are based on Kimball’s proposed “Bus architecture”, a two-tiered data warehouse architecture. In this chapter, we will recommend a revival of three-tiered data warehouse architectures, also called hub-and-spoke topologies. But contrary to Bill Inmon’s (initial) suggestion, we recommend against third normal form as the modeling paradigm for the central hub. Instead, we suggest using a hyper normalized data-modeling paradigm that was specifically designed for data warehousing for the central hub. A thriving community of BI experts have emerged that embrace an approach to data warehousing that is often based on some version of the Data Vault modeling paradigm as first described and advocated by Dan Linstedt. However, the approach we recommend is in no way tied, nor limited to Data Vault modeling. Instead, we suggest using “a” hyper normalized paradigm that has been specifically designed for data warehousing. You need a modeling paradigm that is resilient to change (as we know BI requirements are often ambiguous), and that safeguards against a loss of historical information when changes to the data model need to be made over time. 
One of the alternative approaches to Data Vault modeling that has become available for such a three-tiered data warehouse architecture is Anchor Modeling, as described by Lars Rönnbäck. One can make a case that, on theoretical grounds, the Anchor Modeling approach is conceptually superior. Other approaches to hyper normalization are likely to emerge in this young and burgeoning field.

There are a few reasons why hyper normalized approaches to data modeling for the central hub in a three-tiered data warehouse architecture have proven superior. First of all, in Agile fashion, BI developers want to embrace change. Experience has now shown that a special purpose hyper normalized data modeling paradigm can deal much better with changing
requirements, because core concepts (business keys) are stored separately from their (historical) descriptive attributes and interconnecting relationships. End-users don’t always have their requirements clear at the outset of a project. Agile methods suggest making it as easy (and inexpensive) as possible for end-users to change their requirements. True needs often only emerge in the course of using BI solutions, which makes it fundamentally impossible to provide stable requirements upfront. The architecture should support this goal, and should not make it prohibitively expensive for end-users to change their minds. This drive has led to the development of hyper normalized data models specifically geared to data warehousing.

We will demonstrate (using artificially simplified modeling challenges to illustrate the principle) why third normal form modeling in a data warehouse does not handle change very well. This has to do with the way Inmon (initially) suggested the time dimension should be modeled into the data warehouse: by appending a time stamp to the primary key. This causes serious maintenance problems, as practitioners who tried to implement this approach have discovered (the hard way...). Also, because three different types of entities, notably business keys, descriptive attributes (and their history), and relationships, are stored together in one table, changes ripple through the model. This makes it difficult and expensive to contain the risks of (unforeseen) changes in requirements.

Third normal form modeling is also undesirable for another reason: it makes it hard to determine and scope data model boundaries. It doesn’t support incremental development very well, which is a requirement Agile developers have come to appreciate very much.
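The maintenance problem with a time stamp appended to the primary key can be shown with a deliberately simplified sketch (table and column names are our own illustrative assumptions): every child table must carry the time stamp in its foreign key, so history leaks into every relationship.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")

# 3NF with the time stamp appended to the primary key: every new
# version of a customer is a new primary key value.
con.execute("""
    CREATE TABLE customer (
        customer_id INTEGER,
        valid_from  TEXT,
        city        TEXT,
        PRIMARY KEY (customer_id, valid_from)
    )""")

# Any child table must now carry the time stamp in its foreign key ...
con.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        valid_from  TEXT,
        FOREIGN KEY (customer_id, valid_from)
            REFERENCES customer (customer_id, valid_from)
    )""")

con.execute("INSERT INTO customer VALUES (1, '2012-01-01', 'Boston')")
con.execute("INSERT INTO orders VALUES (100, 1, '2012-01-01')")

# ... so when the customer changes, existing orders keep pointing at
# the old version, and every query joining orders to customer data
# must repeat the pick-the-right-version logic.
con.execute("INSERT INTO customer VALUES (1, '2012-06-01', 'Chicago')")

stale = con.execute("""
    SELECT c.city
    FROM orders o
    JOIN customer c ON c.customer_id = o.customer_id
                   AND c.valid_from  = o.valid_from
""").fetchone()[0]
print(stale)  # the order still joins to the old 'Boston' version
```

In a hyper normalized design, by contrast, child tables reference a stable business key in a hub, and history is isolated in satellites, so changes do not ripple through the relationships.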
They want to be able to start very small, and grow efficiently in tiny increments, without these “baby steps” adding significant overhead. In the chapter, we will also explain why Dr. Kimball’s approach to data warehousing (a two-tiered topology, the so-called Bus architecture) has important drawbacks, and is intrinsically non-agile. The primary reason for this is that the smallest increment of delivery is a star, which is too big for the frequent and early delivery of value to end-users that Agile principles call for. On top of that, the Bus architecture forces developers to commit too early to which attributes to include, and to how to conform dimensions.
All of these implications of the data warehouse architecture that Dr. Kimball has proposed make development teams less Agile, because they defy the Agile principle of delaying decisions to the last responsible moment. Also, changes to the Bus architecture data model tend to be cumbersome and expensive. They require keeping a full history of all loaded tables in a persistent staging area, and careful engineering to avoid additional cost when redoing the initial load in the case of a data model change.

One of the engineering principles most eloquently described in Lean Software Development is delaying decisions until the last responsible moment. This is used in conjunction with concurrent development. Decisions that haven’t been made never need to be reverted, and so carry zero cost for reengineering.

Data Quality

A more fundamental reason why a three-tiered architecture better supports data quality objectives has to do with the need to support company-wide decision-making. In order to make a “single version of the truth” possible, BI teams need to conform dimensions (regardless of whether they use Kimball’s or Inmon’s approach). This is where many purported “data quality” problems arise. Data quality problems stem largely from one of two sources: either inaccurate data entry, or misaligned conforming of dimensions.

Conforming dimensions is largely a political corporate discussion. It hinges on two (or worse: even more) departments agreeing on a “view” of the facts as surfaced by the BI team. Note that not conforming doesn’t provide a solution, either. In that case, two seemingly similar constructs, e.g. “customer” as defined by Marketing versus “customer” as acknowledged by Finance, will each have their own dimension.
However, because of (slight) differences in definition, slight differences in counts will result when you query the same data using the two different “Customer” dimensions. This challenge surfaces a fundamental flaw in the Bus architecture: the BI team needs to facilitate a discussion between two or more stakeholders of the data on how exactly to conform dimensions, i.e., how to define the meaningful entities described in the dimensions of an enterprise-wide data warehouse. The BI team should facilitate this discussion, not take ownership of the eventual decision. But in the early stages of a data warehouse project, none of the business users have access to data from the data warehouse. The BI team hasn’t
built their solution yet. Therefore, end-users cannot (easily) run reports using one or the other definition to see what impact each will have on the numbers as they appear in corporate reporting. So quite “naturally”, the BI team either fails to facilitate an informed discussion between stakeholders throughout the business, or “has” to take ownership itself of how dimensions are conformed. Both outcomes are highly undesirable, and largely the result of a two-tiered data warehouse architecture.

This ownership of how to conform dimensions gets the BI team caught smack in the middle of corporate cross-fire: no matter how you define an entity in a conformed dimension, “the numbers” (relative to management performance objectives) will always come out better for one manager/department than for another. Hopefully you were either very smart or lucky, and happened to choose a definition that favors the more influential stakeholder (or at least the one with less power to retaliate).

What effectively happens in most projects using the “traditional” Kimball (Bus) architecture is that the BI team commits to a certain view on how to conform dimensions, but this commitment is made too early. Changing that solution also becomes prohibitively expensive as time goes by. Therefore, the BI team finds itself “locked into” a design, and loses flexibility. This can be even more problematic when (some) data quality issues don’t surface until much later.

Conforming dimensions is largely a political corporate discussion. These discussions become “risky” when (individual) performance objectives are at stake. He who finds himself with a less favorable number will quite naturally resist that view on how to conform the dimension. We, as BI professionals, don’t own the data. We merely hold them for the business.
As stewards, it is our responsibility to surface discrepancies in definitions, and to quantify where and how the business is losing money as a result of this confusion. Data quality problems are rarely truly “technical” in nature. Mostly, data quality problems are the result of poor (internal) alignment of corporate objectives. The classic, hackneyed example is rewarding people for speed of data entry, rather than for its quality. The back office manager (problem owner) is expected to process as much work as possible with the least amount of resources. Downstream information users (problem holders) bear the brunt of hasty work, and of the failure to install (“time consuming”) inspect-and-adapt cycles. But the grand perspective on this dynamic goes much further. The majority of data quality issues point to cracks in the corporate value
chain. In the chapter I will argue that information loss (either “errors” or inappropriate reports) is fundamentally caused by a failure to conform dimensions (politically: to align corporate objectives). In that way, BI team members take on the burden of misalignment between cross-departmental objectives. Risky business, indeed.

What are we to do about this? If you store all the (source) facts in your data warehouse, and keep them available, traceable, and auditable, there is a way out of this conundrum. First of all, this requires moving away from 20th century data warehouse architectures that have required us to conform dimensions early in the project. That puts us at a disadvantage, because requirements are ambiguous and subject to change. If the cost of change then becomes prohibitive, we would need to take irresponsible risks. Instead, we’d like to mitigate those risks.

In Agile software development practices we like the idea of “deciding at the last responsible moment”, in conjunction with concurrent development practices. That way, we can show end-users bits of functionality (very) early on. This earns us credit with our sponsors, and business stakeholders will gain confidence in our progress. But at least as important: the sooner reliable feedback about data accuracy is passed back to the development team, the more efficiently they will be able to work.

Imagine you could run your BI projects like this. You inform all departments involved what the consequences are (counts, frequencies, amounts, etc.) of choosing a particular definition to conform data across systems and departments. That way, you empower all stakeholders to have an informed discussion about which “view” of the facts would best support their particular (end-to-end) value stream as a holistic business. All business stakeholders have good reasons to cling to their department-specific view.
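The kind of impact analysis described above can be sketched as a simple count comparison per candidate definition. The data and the two definitions below (Marketing counts every registered party, Finance only invoiced parties) are hypothetical, purely to illustrate how differently conformed “Customer” dimensions yield different numbers:

```python
# Hypothetical source rows: (party_id, is_registered, has_invoice)
parties = [
    ("P1", True,  True),
    ("P2", True,  False),  # prospect, never invoiced
    ("P3", True,  True),
    ("P4", False, True),   # invoiced via a partner channel, not registered
    ("P5", True,  False),  # another prospect
]

# Two candidate definitions for the conformed "Customer" dimension.
definitions = {
    "Marketing: registered parties": lambda p: p[1],
    "Finance: invoiced parties":     lambda p: p[2],
}

# Surface the consequences of each candidate definition, so that the
# stakeholders can have an informed discussion instead of the BI team
# deciding on their behalf.
counts = {name: sum(1 for p in parties if rule(p))
          for name, rule in definitions.items()}

for name, n in counts.items():
    print(f"{name}: {n} customers")
```

Presenting such counts (and the specific rows that cause the gap) moves the discussion from “whose number is right” to “which definition serves which business process”, which is exactly the facilitating role argued for above.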
When you apply this approach to managing BI project risk, stakeholders should be well placed to explain why the numbers for “corporate” don’t exactly line up with the numbers for the “department.” No problem: there may be excellent, business-process-specific reasons for that. Conversely, you can imagine what happens when you attempt to conform dimensions before “the business” has agreed on common definitions. You may well bear the brunt of their frustration when they are confronted with erroneous reports! Risky business, indeed.