DW Part 2 The Twins: Data Quality & Business Intelligence
Denise Jeffries
205.747.3301
Star Schema (facts and dimensions)
The facts that the data warehouse helps analyze are classified along different dimensions:
The FACT table houses the main data
Includes a large amount of aggregated data (e.g. price, units sold)
DIMENSION tables off the FACT include attributes that describe the FACT
Star schemas provide simplicity for users
Star Schema example (Sales db)
Connected to multiple DIMENSIONS which are NORMALIZED into related tables
Snowflaking affects the DIMENSION tables and never the FACT table
Used in Data warehouses and data marts when speed is more important than efficiency/ease of data selection
Needed for many BI OLAP tools
Stores less data
Snowflake Schema example (Sales db)
Comparison of SQL Star vs SnowFlake
SELECT Brand, Country, SUM(Units_Sold)
FROM Fact_Sales
INNER JOIN Dim_Date ON Date_FK = Date_PK
INNER JOIN Dim_Store ON Store_FK = Store_PK
INNER JOIN Dim_Product ON Product_FK = Product_PK
WHERE [Year] = 2010
AND Product_Category = 'TV'
GROUP BY Brand, Country
SELECT B.Brand, G.Country, SUM(F.Units_Sold)
FROM Fact_Sales F (NOLOCK)
INNER JOIN Dim_Date D (NOLOCK) ON F.Date_Id = D.Id
INNER JOIN Dim_Store S (NOLOCK) ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G (NOLOCK) ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P (NOLOCK) ON F.Product_Id = P.Id
INNER JOIN Dim_Product_Category C (NOLOCK) ON P.Product_Category_Id = C.Id
INNER JOIN Dim_Brand B (NOLOCK) ON P.Brand_Id = B.Id
WHERE D.Year = 2010
AND C.Product_Category = 'tv'
GROUP BY B.Brand, G.Country
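The comparison above can be tried end to end with a small sketch. It assumes an in-memory SQLite database (not the SQL Server implied by the NOLOCK hints), made-up sample rows, and hypothetical brand and country values. The table and column names follow the snowflake query on the slide. The only point is that both designs answer the same question, with the snowflake version paying for its normalized dimensions with extra joins.

```python
import sqlite3

# Shared fact and date tables (sample rows are invented for illustration).
FACT_AND_DATE = """
CREATE TABLE Dim_Date   (Id INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE Fact_Sales (Date_Id INTEGER, Store_Id INTEGER, Product_Id INTEGER, Units_Sold INTEGER);
INSERT INTO Dim_Date VALUES (1, 2010), (2, 2011);
INSERT INTO Fact_Sales VALUES (1, 1, 1, 10), (1, 1, 2, 5), (1, 2, 1, 7), (2, 1, 1, 99), (1, 1, 3, 50);
"""

# Star: each dimension is one denormalized table.
STAR_DIMS = """
CREATE TABLE Dim_Store   (Id INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE Dim_Product (Id INTEGER PRIMARY KEY, Brand TEXT, Product_Category TEXT);
INSERT INTO Dim_Store VALUES (1, 'USA'), (2, 'Canada');
INSERT INTO Dim_Product VALUES (1, 'Acme', 'TV'), (2, 'Zenith', 'TV'), (3, 'Acme', 'Radio');
"""

# Snowflake: the same dimensions normalized into related tables.
SNOWFLAKE_DIMS = """
CREATE TABLE Dim_Geography        (Id INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE Dim_Store            (Id INTEGER PRIMARY KEY, Geography_Id INTEGER);
CREATE TABLE Dim_Brand            (Id INTEGER PRIMARY KEY, Brand TEXT);
CREATE TABLE Dim_Product_Category (Id INTEGER PRIMARY KEY, Product_Category TEXT);
CREATE TABLE Dim_Product          (Id INTEGER PRIMARY KEY, Brand_Id INTEGER, Product_Category_Id INTEGER);
INSERT INTO Dim_Geography VALUES (1, 'USA'), (2, 'Canada');
INSERT INTO Dim_Store VALUES (1, 1), (2, 2);
INSERT INTO Dim_Brand VALUES (1, 'Acme'), (2, 'Zenith');
INSERT INTO Dim_Product_Category VALUES (1, 'TV'), (2, 'Radio');
INSERT INTO Dim_Product VALUES (1, 1, 1), (2, 2, 1), (3, 1, 2);
"""

STAR_QUERY = """
SELECT P.Brand, S.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
JOIN Dim_Date D    ON F.Date_Id = D.Id
JOIN Dim_Store S   ON F.Store_Id = S.Id
JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.Year = 2010 AND P.Product_Category = 'TV'
GROUP BY P.Brand, S.Country
ORDER BY P.Brand, S.Country
"""

SNOWFLAKE_QUERY = """
SELECT B.Brand, G.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
JOIN Dim_Date D             ON F.Date_Id = D.Id
JOIN Dim_Store S            ON F.Store_Id = S.Id
JOIN Dim_Geography G        ON S.Geography_Id = G.Id
JOIN Dim_Product P          ON F.Product_Id = P.Id
JOIN Dim_Product_Category C ON P.Product_Category_Id = C.Id
JOIN Dim_Brand B            ON P.Brand_Id = B.Id
WHERE D.Year = 2010 AND C.Product_Category = 'TV'
GROUP BY B.Brand, G.Country
ORDER BY B.Brand, G.Country
"""

def run(dimension_ddl, query):
    """Build a throwaway database with the given dimensions and run one query."""
    con = sqlite3.connect(":memory:")
    con.executescript(FACT_AND_DATE + dimension_ddl)
    rows = con.execute(query).fetchall()
    con.close()
    return rows

star_rows = run(STAR_DIMS, STAR_QUERY)
snowflake_rows = run(SNOWFLAKE_DIMS, SNOWFLAKE_QUERY)
print(star_rows)
print(snowflake_rows)
```

In this sketch the star query needs three joins and the snowflake query needs six to reach the same Brand and Country attributes.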
Account, Customer & Address Relationships
[Diagram: Account, Contact, Party and Address entities, joined by a Party Address link and an Account Party link]
Account information is loaded from ALL source systems
The ETL process builds the relationship between Accounts and Customers (Party) based on the relationship file from the customer CRM system
EDW Process State
[Diagram labels: Source System (IMP, RM, OEC, ALS, AFS, ST, RE, DFP, SBA, AFS, V-PR); Cleanse / Pre-process; Staging Area (Data cleansing, Data profiling, Sync & Sort); EDW; DM (CPS, MANTAS, CRDB, MKTG, FIN, SALES); BI; Metadata | Data Governance | Data Management]
Explosion in innovation
BI software can now be deployed on the intranet instead of as hard-to-maintain thick-client apps
Thick client still used for developers
Web server, application server, database server
Allows offloading of processing to correct tier
More power for everyone
Change in Business
Global economy changed needs of organizations worldwide
Mergers and Acquisitions
All increase data needs
More tech-savvy end users (demanding more data, more tools…)
More information-demanding executives facilitate sponsorship of the DW
Single definition of a data element needed for BI
DW brings in the data from multiple sources and conforms it so that it can be viewed together
Multiple systems have individual customers/addresses, but warehouse gives single view of the customer and all the systems they are in
Helping move from product centric systems to customer centric systems
Business view of data
DW is only successful if it provides the view of its data that the business needs
A data warehouse is a structured extensible environment designed for the analysis of non-volatile data, logically and physically transformed from multiple source applications to align with business structure, updated and maintained for a long time period, expressed in simple business terms, and summarized for quick analysis.
Vivek R. Gupta, Senior Consultant, System Services Corporation, Chicago, Illinois http://www.system-services.com
Example of conforming data for business view: http://www.sserve.com/ftp/dwintro.doc
Business use of DW
Business should use data mart created off data warehouse
Business users want to use existing tools/methods (replicated queries, Excel, extracts to Access) against the DW and validate the data between the existing systems and the DW
Over time LoB gains confidence in DW and then begins to explore new possibilities of data use and tool use
EDW Development Project Cycle (New Source to EDW)
DW - Roadmap Management Architecture (Metadata, Data Security, Systems Management)
What is Data Quality
I can’t tell you what’s important, but your users can.
Look for the fields that can identify potential problems with the data
What is Master Data Management (MDM)
Data doesn’t stay the same
Sometimes it does
What happens to the warehouse when the data changes
When needs change
Roadmap to DQ
Design and implement the rules
Deploy the plan
Monitor the results
What’s in the data
Analyze the columns in the tables
Allows for good specifications for programmers
Reduces project risk (as data is now known)
How many rows, number of distinct values in a column, how many null, data type identification
Shows the data pattern
Data Profiling Example
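A minimal profiling sketch for a single column, covering the measures listed above (row count, distinct values, nulls, a rough data type check, and the data pattern). The function names, the pattern notation (9 for a digit, A for a letter) and the sample values are assumptions made for illustration, not the output of any particular profiling tool.

```python
import re
from collections import Counter

def pattern_of(value):
    """Map every digit to 9 and every ASCII letter to A, e.g. '3524B' -> '9999A'."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

def profile_column(values):
    """Profile one column given as a list of strings, with None standing for NULL."""
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        # Crude type identification: every non-null value consists of digits only.
        "looks_numeric": bool(present) and all(v.isdigit() for v in present),
        "patterns": Counter(pattern_of(v) for v in present),
    }

# Hypothetical zip code column pulled from a source system.
zip_codes = ["35244", "35244", "3524B", None, "48108"]
report = profile_column(zip_codes)
for measure, result in report.items():
    print(measure, result)
```

A pattern that appears only once, such as 9999A in a zip code column, is the kind of finding that can go straight into the specification for the programmers.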
Data Quality is measured as the degree of superiority, or excellence, of the various data that we use to create information products.
“Reason #1 for the failure of CRM projects: Data is ignored. Enterprise must have a detailed understanding of the quality of their data. How to clean it up, how to keep it clean, where to source it, and what 3rd-party data is required. Action item: Have a data quality strategy. Devote ½ of the total timeline of the CRM project to data elements.” - Gartner
Data Quality Tools (Gartner Magic Quadrant)
Dimensions of Quality Informatica.com
Data Quality Measures
Conformance: The degree to which data values are consistent with their agreed upon definitions.
A detailed definition must first exist before this can be measured.
Information quality begins with a comprehensive understanding of the data inventory. The information about the data is as important as the data itself.
A Data Dictionary must exist! An organized, authoritative collection of attributes is equivalent to the old “Card Catalog” in a library, or the “Parts and List Description” section of an inventory system. It must contain all the known usage rules and an acceptable list of values. All known caveats and anomalies must be described.
The degree to which a piece of data is correct and believable. The value can be compared to the original source for correctness, but it can still be unbelievable. Conformed values can be compared to lists of reference values.
Zip code 35244 is correct and believable.
Zip code 3524B is incorrect and unbelievable.
Zip code 35290 is incorrect but believable (it looks right, but does not exist).
AL is a correct and believable state code (compared to the list of valid state codes)
A1 is an incorrect and unbelievable state code (compared to the list of valid state codes)
AA is an incorrect but believable state code (compared to the list of valid state codes)
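The two tests in the examples above can be sketched as separate checks: "believable" as a shape test and "correct" as a lookup against a reference list. The reference sets below are deliberately tiny, hypothetical extracts, and the shape rules (five digits, two upper-case letters) are assumptions. A real check would use the organization's full list of valid values.

```python
import re

# Hypothetical, deliberately tiny reference lists (assumption for illustration).
VALID_ZIPS = {"35244", "48108"}
VALID_STATES = {"AL", "MI"}

ZIP_SHAPE = r"\d{5}"        # five digits
STATE_SHAPE = r"[A-Z]{2}"   # two upper-case letters

def classify(value, shape, reference):
    """Return (correctness, believability) for one value."""
    believable = re.fullmatch(shape, value) is not None
    correct = value in reference
    return ("correct" if correct else "incorrect",
            "believable" if believable else "unbelievable")

for zip_code in ("35244", "3524B", "35290"):
    print(zip_code, classify(zip_code, ZIP_SHAPE, VALID_ZIPS))
for state in ("AL", "A1", "AA"):
    print(state, classify(state, STATE_SHAPE, VALID_STATES))
```

Keeping the two tests apart matters for reporting: an unbelievable value can be rejected on sight, while an incorrect but believable one is only caught by the reference list.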
Completeness: The degree to which all expected information is received. This is measured in two ways:
Do we have all the records that were sent to us?
Counts from the provider can be compared against counts of data received.
Did the provider send us all the records that they have or just some of them?
This is difficult to measure without auditing and trending the source.
How would we know that the provider had a ‘glitch’ in their system and records were missing from our feed?
Measures of Completeness
The following questions can be answered for counts:
How many records per batch by provider?
How does this batch’s count compare to the previous month’s average?
How does this batch’s count compare to the same time period last year?
How does this batch’s count compare to a 12-month average?
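The count questions above reduce to two checks: reconcile the provider's count with what was received, then trend the received count against history. A rough sketch follows. The 10% tolerance, the function names and the monthly counts are all invented for illustration.

```python
from statistics import mean

def pct_vs_average(current, history):
    """Percent difference between this batch's count and the average of history."""
    baseline = mean(history)
    return (current - baseline) * 100 / baseline

def is_suspicious(current, history, tolerance_pct=10.0):
    """True when the batch count strays more than tolerance_pct from the average."""
    return abs(pct_vs_average(current, history)) > tolerance_pct

# Hypothetical record counts for one provider over the last 12 months.
last_12_months = [1000, 1020, 980, 1010, 990, 1000, 1005, 995, 1000, 1010, 990, 1000]

provider_sent, we_received = 700, 700
reconciled = provider_sent == we_received           # did we get everything that was sent?
drop = pct_vs_average(we_received, last_12_months)  # did the provider send everything?

print(f"reconciled={reconciled}, change vs 12-month average={drop:.1f}%")
print("flag for follow-up:", is_suspicious(we_received, last_12_months))
```

In this made-up batch the counts reconcile, yet the trend check still flags it. That is the provider "glitch" case, which a simple sent-versus-received comparison cannot see.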
The degree to which all fields are populated with data. Columns of data can be measured for % of missing values and compared to expected % missing.
e.g. Sale Type Code is expected to be populated 100% by all sources for Sales documents.
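Measuring the percentage of missing values and comparing it to the expected percentage could look like the sketch below. The records, the column name sale_type_code and the rule that an empty string counts as missing are assumptions.

```python
def percent_missing(rows, column):
    """Percentage of rows where the column is absent, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return 100.0 * missing / len(rows)

# Hypothetical sales documents from one source.
sales_docs = [
    {"doc_id": 1, "sale_type_code": "R"},
    {"doc_id": 2, "sale_type_code": ""},
    {"doc_id": 3, "sale_type_code": "W"},
    {"doc_id": 4, "sale_type_code": "R"},
]
EXPECTED_MISSING_PCT = 0.0  # "populated 100%" means 0% missing is expected

actual = percent_missing(sales_docs, "sale_type_code")
breach = actual > EXPECTED_MISSING_PCT
print(f"missing {actual:.1f}%, expected {EXPECTED_MISSING_PCT:.1f}%, breach={breach}")
```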
The degree to which provider files are received, processed and made available for assembly into data marts. Expected receipt times are compared to actual receipt times.
Late or missing files are flagged and reported on.
Proactive alerts trigger communication with the provider contact.
Proactive communication can also alert the downstream assembly processes.
Excessive lag times can be reported to providers in order to request delivery sooner.
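Comparing expected and actual receipt times can be sketched as follows. The 30-minute grace period, the status labels and the provider names are hypothetical. A real schedule would come from the agreement with each provider.

```python
from datetime import datetime, timedelta

def file_status(expected, actual, now, grace=timedelta(minutes=30)):
    """Classify one provider file; actual is None if nothing has arrived yet."""
    deadline = expected + grace
    if actual is None:
        return "missing" if now > deadline else "pending"
    return "late" if actual > deadline else "on time"

def lag(expected, actual):
    """How far past the expected time the file arrived (zero if it was early)."""
    return max(actual - expected, timedelta(0))

expected = datetime(2010, 3, 1, 6, 0)
now = datetime(2010, 3, 1, 9, 0)
statuses = {
    "provider_a": file_status(expected, datetime(2010, 3, 1, 5, 50), now),
    "provider_b": file_status(expected, datetime(2010, 3, 1, 8, 15), now),
    "provider_c": file_status(expected, None, now),
}
for provider, status in statuses.items():
    print(provider, status)
print("provider_b lag:", lag(expected, datetime(2010, 3, 1, 8, 15)))
```

Files marked late or missing are the ones to flag and report, and the lag values are what would be sent back to a provider when asking for earlier delivery.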
The degree to which the relationships between different data are valid.
Zip code 48108 is accurate. State code AL is accurate. Zip code 48108 is invalid for the state of AL.
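A cross-field check like the zip and state example can be sketched with a lookup table. The two-entry mapping below is a hypothetical extract of a postal reference table, and the function name is an assumption.

```python
# Hypothetical extract of a zip-to-state reference table.
ZIP_TO_STATE = {"48108": "MI", "35244": "AL"}

def zip_matches_state(zip_code, state):
    """True only when the zip code is known and belongs to the given state."""
    expected = ZIP_TO_STATE.get(zip_code)
    return expected is not None and expected == state

print(zip_matches_state("48108", "AL"))  # each value is fine alone, the pair is not
print(zip_matches_state("35244", "AL"))
```

Each field passes its own accuracy check here, so the problem only surfaces when the relationship between the two fields is tested.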
Data Quality Measures
How do you know if your data is of high quality?
Agree upon the measures that are important to the organization and consistently report them out.
Use the data measures to communicate and inform.
Exercise: Changing the Data (1 of 2)
So, you need to add a new source
Or, you need to receive additional data from an existing source
Could be the data quality is an issue
Could be that the business rules weren’t defined adequately
Brainstorming Group Exercise (2 of 2)
The data changed due to DQ measures – what do we have to do in the DW?
What has to change
Estimate the change
Implement the change
How do we make sure it doesn’t happen again?
What DQ measure can help?
MDM Master Data Management
The newest ‘buzz word’
The recent emphasis on regulatory compliance, SOA, and mergers and acquisitions has made the creating and maintaining of accurate and complete master data a business imperative.
The pain that organizations are experiencing around consistent reporting, regulatory compliance, strong interest in Service-Oriented Architecture (SOA), and Software as a Service (SaaS) has prompted a great deal of interest in Master Data Management (MDM).
What Is Master Data Management?
Master data management is the technology, tools, and processes an organization needs to create and maintain a consistent and accurate inventory of its data.
5 Types of Data for MDM
Unstructured —This is data found in e-mail, white papers like this, magazine articles, corporate intranet portals, product specifications, marketing collateral, and PDF files.
Transactional —This is data related to sales, deliveries, invoices, trouble tickets, claims, and other monetary and non-monetary interactions.
Metadata —This is data about other data and may reside in a formal repository or in various other forms such as XML documents, report definitions, column descriptions in a database, log files, connections, and configuration files.
Hierarchical —Hierarchical data stores the relationships between other data. It may be stored as part of an accounting system or separately as descriptions of real-world relationships, such as company organizational structures or product lines. Hierarchical data is sometimes considered a super MDM domain, because it is critical to understanding and sometimes discovering the relationships between master data.
5 types of data cont’d
Master —Master data are the critical nouns of a business and fall generally into four groupings: people, things, places, and concepts.
Further categorizations within those groupings are called subject areas, domain areas, or entity types.
Within people, there are customer, employee, and salesperson.
Within things, there are product, part, store, and asset.
Within concepts, there are things like contract, warranty, and license.
Within places, there are office locations and geographic divisions.
Some of these domain areas may be further divided. Customer may be further segmented, based on incentives and history. A company may have normal customers, as well as premiere and executive customers. Product may be further segmented by sector and industry. (4)
What processes need to be put in place for MDM
Who needs to be involved
Who owns it
What is business intelligence
What are BI tools
What is a business intelligence competency center (BICC)
What jobs are available
What is business intelligence
Turning raw data into information.
Business intelligence (BI) is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. BI applications include the activities of decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining. (1)
What is BI
It is about making better business decisions easier and quicker.
Data mining is a BI technique used to extract valid, useful and previously unknown information from a company's data sources.
BI solutions examples by industry
Ordering & supply
Space management …..
Claims & premium analysis
BI term coined Sept 1996 by Gartner Group in a report
“By 2000, Information Democracy will emerge in forward-thinking enterprises, with Business Intelligence information and applications available broadly to employees, consultants, customers, suppliers, and the public. The key to thriving in a competitive marketplace is staying ahead of the competition. Making sound business decisions based on accurate and current information takes more than intuition. Data analysis, reporting, and query tools can help business users wade through a sea of data to synthesize valuable information from it - today these tools collectively fall into a category called Business Intelligence.” (1)
Magic Quadrant for BI (Gartner)
BI is a term categorizing a variety of software applications that are used to analyze a business’ raw data.
It is also a discipline categorizing activities that include data quality, data mining, OLAP (online analytical processing), querying and reporting. (2)
What kinds of companies use BI
All kinds: restaurants, sports franchises, retailers… any company.
Examples include: New England Patriots, Walmart, Harrah’s, Amazon, Yahoo, Capital One…..
When are you doing BI?
When looking at your market share or profitability you are doing BI.
When looking for the best area in which to increase your sales, you are doing BI.
Anytime you analyze data and turn it into information you are doing BI.
How do you know if you are really doing BI?
Efforts around changing individual and team work practices arise, from the individual and from the teams
New jobs are posted talking about analyzing data and delivering reports
The CEO and CIO start talking about it
BI Tools & What they do
Oracle (Siebel & Hyperion)
QlikView ….. etc
Querying & Reporting
And its sisters:
A Business Intelligence Competency Center (BICC) is a cross-functional organizational team that has defined tasks, roles, responsibilities and processes for supporting and promoting the effective use of Business Intelligence (BI) across an organization.
As early as 2001, Gartner, an information technology research and advisory company, started advocating that companies need a BICC to develop and focus resources to be successful using business intelligence. Since then, the BICC concept has been further refined through practical implementations in organizations that have implemented BI and analytical software.
Taken directly from Wikipedia
In practice, the term "BICC" is not well integrated into the nomenclature of business or public sector organizations and there are a large degree of variances in the organizational design for BICCs. Nevertheless, the popularity of the BICC concept has caused the creation of units that focus on ensuring the use of the information for decision-making from BI software and increasing the return on investment (ROI) of BI. 
A BICC coordinates the activities and resources to ensure that a fact-based approach to decision making is systematically implemented throughout an organization. It has responsibility for the governance structure for BI and analytical programs, projects, practices, software, and architecture. It is responsible for building the plans, priorities, infrastructure, and competencies that the organization needs to take forward-looking strategic decisions by using the BI and analytical software capabilities.
A BICC’s influence transcends that of a typical business unit, playing a crucial central role in the organizational change and strategic process. Accordingly, the BICC’s purpose is to empower the entire organization to coordinate BI from all units. Through centralization, it "…ensures that information and best practices are communicated and shared through the entire organization so that everyone can benefit from successes and lessons learned." 
The BICC also plays an important organizational role facilitating interaction among the various cultures and units within the organization. Knowledge transfer, enhancement of analytic skills, coaching and training are central to the mandate of the BICC. A BICC should be pivotal in ensuring a high degree of information consumption and a ROI for BI.
Taken directly from Wikipedia
Jobs in Business Intelligence
BI Support Engineer
1000+ jobs on washingtonpost.com
5,357 jobs on indeed.com
Data Management and Integration Topic, Gartner, http://www.gartner.com/it/products/research/asset_137953_2395.jsp
Articles: Key Issues for Implementing an Enterprise wide Data Quality Improvement Project, 2008, Key Issues for Enterprise Information Management Initiatives, 2008, Key Issues for Establishing Information Governance Policies, Processes and Organization, 2008
Data Quality Management, The Most Critical Initiative You Can Implement, J. G. Geiger, http://www2.sas.com/proceedings/sugi29/098-29.pdf
Information Management, How to Measure and Monitor the Quality of Master Data, http://www.information-management.com/issues/2007_58/master_data_management_mdm_quality-10015358-1.html?ET=informationmgmt:e963:2046487a:&st=email
Data Management Assn of Michigan Bits & Bytes, Critical Data Quality Controls, D Jeffries, Fall 2006 http://dama-michigan.org/2%20Newsletter.pdf