This document provides an introduction to data warehousing. It discusses why data warehouses are used, as they allow organizations to store historical data and perform complex analytics across multiple data sources. The document outlines common use cases and decisions in building a data warehouse, such as normalization, dimension modeling, and handling changes over time. It also notes some potential issues like performance bottlenecks and discusses strategies for addressing them, such as indexing and considering alternative data storage options.
2. About Alex
Principal Consultant (Data and Analytics), CSpring Inc.
Business Analytics Adjunct Professor, Wake Tech
MS in Business Intelligence
Passion for developing BI solutions that give end users easy access to the data they need to find the answers they demand (even the ones they don’t know yet!)
Twitter: @OpenDataAlex | LinkedIn: alexmeadows | GitHub: OpenDataAlex | Email: ameadows@cspring.com
5. Agenda
Why Data Warehousing
Use Case Modeling Decisions
Building Your Data Warehouse
Data Warehouse Gotchas
Q&A
Please feel free to ask questions throughout the presentation!
6. Why Data Warehouses?
First discussed in the 1970s
While databases existed, they were not relational/normalized
− Network/hierarchical in nature
− Designed for query patterns, not for the data model
Reporting was hard
− System/application queries were not the same as management reporting queries
8. Bill Inmon
Data warehouses: a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process
12. Business Intelligence Use Cases
Traditional Data Warehousing focuses on Descriptive and Diagnostic questions
Predictive and Prescriptive questions require other types of tooling (e.g. simulation modeling, statistics, etc.)
14. Case 1: Operational Data Store Holds All Data
● Classic use-case
● System bogged down with historic ‘valuable’ data
● Applications may try to take advantage of hosting the data for features
15. Case 2: Adding Unnecessary Objects
● Adding objects that meet more reporting needs than application requirements
● Impacting maintainability through duplication of data
● Database/application handling data movement to these objects
16. Case 3: Complex Relationships Based On Filtering Factors
● Storing data by clear delimiters
– Time – Quarters, Months, etc.
– Geography – Region, State, City, etc.
– Business logic
● Makes querying very complicated
● Can also impact high availability architecture
17. Do You Need A DWH?
● Case 1 – Data volume/historical data
● Case 2/3 – Transactional database not matching reporting/analysis requirements
● If performance isn’t an issue (yet) then you have some time
● If data volume is tiny (under ~50 GB) then maybe not
● We’re going to assume that the DWH is needed ;)
20. Inmon: 3rd Normal Form
● Normalize on objects and relationships
● Focus on all data stores
– Join data sets as necessary
– Look into Master Data Management practices for true store merging
22. Classroom Transaction System

Student First Name | Student Last Name | Student System ID
Bob | Young | 1
Robert | Young | 2
Jennifer | Owens | 3
Andrew | Collins | 4

Student ID | Class ID | Student Grade
1 | 1 | A
2 | 1 | B
3 | 1 | B
4 | 1 | C
2 | 2 | A

Class ID | Class Name | Class Program | Class Credits
1 | Intro to Computer Science | Business Admin | 4
2 | Clay Sculpting 101 | Art | 2
23. Classroom 3NF Data Warehouse

Student First Name | Student Last Name | Student System ID | Student ID
Bob | Young | 1 | 100
Robert | Young | 2 | 101
Jennifer | Owens | 3 | 102
Andrew | Collins | 4 | 103

Class ID | Class System ID | Class Name | Class Program | Class Credits
200 | 1 | Intro to Computer Science | Business Admin | 4
201 | 2 | Clay Sculpting 101 | Art | 2

Student ID | Class ID | Student Grade
100 | 200 | A
101 | 200 | B
102 | 200 | B
103 | 200 | C
101 | 201 | A
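The surrogate-key step above (Student System ID 1 becomes warehouse Student ID 100, and so on) can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation; the function name, field names, and starting offset of 100 are all assumptions chosen to match the example tables.

```python
# Illustrative sketch: assigning warehouse surrogate keys to source rows,
# mirroring the 3NF example above (Student System ID 1 -> Student ID 100).
# Names and starting offsets are assumptions for this example.

def assign_surrogate_keys(rows, system_id_field, start_key):
    """Map each source system ID to a new warehouse surrogate key."""
    key_map = {}
    warehouse_rows = []
    next_key = start_key
    for row in rows:
        system_id = row[system_id_field]
        if system_id not in key_map:
            key_map[system_id] = next_key
            next_key += 1
        warehouse_rows.append({**row, "warehouse_id": key_map[system_id]})
    return warehouse_rows, key_map

students = [
    {"first": "Bob", "last": "Young", "system_id": 1},
    {"first": "Robert", "last": "Young", "system_id": 2},
    {"first": "Jennifer", "last": "Owens", "system_id": 3},
    {"first": "Andrew", "last": "Collins", "system_id": 4},
]
warehouse_students, student_keys = assign_surrogate_keys(students, "system_id", 100)
print(student_keys)  # {1: 100, 2: 101, 3: 102, 4: 103}
```

Keeping the system ID alongside the warehouse key is what lets the warehouse absorb rows from multiple source systems without key collisions.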
24. DWH = Version Control

Class ID | Class System ID | Class Name | Class Program | Class Credits | Create Date | Update Date | Version
200 | 1 | Intro to Computer Science | Business Admin | 4 | 01/01/17 | 02/01/17 | 1
202 | 1 | Intro to Computer Science | Business Admin | 6 | 02/01/17 | 02/01/17 | 2
201 | 2 | Clay Sculpting 101 | Art | 2 | 01/01/17 | 01/01/17 | 1
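The versioning pattern in the table above (credits change from 4 to 6, so a new row 202 with version 2 is inserted rather than updating row 200 in place) can be sketched as follows. This is an illustrative sketch only; the function and field names are assumptions, not part of the presentation.

```python
from datetime import date

# Illustrative sketch of the versioning shown above: when a tracked attribute
# changes (credits 4 -> 6), insert a new row with a new surrogate key and a
# bumped version instead of updating in place. Field names are assumptions.

def apply_change(history, system_id, new_attrs, change_date, next_key):
    """Append a new version row for a changed class; return the new row."""
    versions = [r for r in history if r["class_system_id"] == system_id]
    latest = max(versions, key=lambda r: r["version"])
    new_row = {
        **latest,
        **new_attrs,
        "class_id": next_key,
        "create_date": change_date,
        "update_date": change_date,
        "version": latest["version"] + 1,
    }
    latest["update_date"] = change_date  # record when the old version was superseded
    history.append(new_row)
    return new_row

history = [{
    "class_id": 200, "class_system_id": 1,
    "class_name": "Intro to Computer Science", "credits": 4,
    "create_date": date(2017, 1, 1), "update_date": date(2017, 1, 1),
    "version": 1,
}]
row = apply_change(history, 1, {"credits": 6}, date(2017, 2, 1), next_key=202)
print(row["version"], row["credits"])  # 2 6
```

Because old rows are never deleted, the warehouse can answer "what were the credits for this class in January?" long after the source system has overwritten the value.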
25. Also For Relationships!
● Historical vs current relationships
● Different ways of handling version control (dimensionality)

Student ID | Class ID | Student Grade | Create Date | Update Date
100 | 200 | A | 01/01/17 | 01/01/17
101 | 200 | B | 01/01/17 | 01/01/17
102 | 200 | B | 01/01/17 | 01/01/17
103 | 200 | C | 02/01/17 | 02/01/17
104 | 200 | D | 02/01/17 | 02/01/17
103 | 202 | C | 02/01/17 | 02/01/17
104 | 202 | D | 02/01/17 | 02/01/17
104 | 201 | A | 01/01/17 | 01/01/17
101 | 201 | A | 01/01/17 | 01/01/17
30. Dimensionality
● Concept originated with the star schema
● Store data changes based on what is being done with the data/long-term utilization
● Build models based on objects and relationships
34. Dimension Example

Class ID | Class Name | Class Program | Topic
200 | Intro to Computer Science | Business Admin | Computer Science
202 | Intro to Computer Science | Business Admin | Computer Science
201 | Clay Sculpting 101 | Art | Sculpture
35. Fact Table Example

Student ID | Class ID | Credit Earned | Credit Maximum | Date ID
100 | 200 | 26 | 120 | 20170101
101 | 200 | 37 | 120 | 20170101
102 | 200 | 12 | 120 | 20170101
103 | 200 | 42 | 120 | 20170101
104 | 200 | 16 | 120 | 20170101
103 | 202 | 80 | 120 | 20170101
104 | 202 | 120 | 120 | 20170101
104 | 201 | 90 | 120 | 20170101
101 | 201 | 26 | 120 | 20170101
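A typical star-schema query joins the fact table to a dimension and aggregates. The sketch below does that join-and-group over a subset of the example rows above; the in-memory dictionaries stand in for database tables, and the structure (a dimension lookup keyed by Class ID) is an assumption made for illustration.

```python
# Illustrative star-schema query over the dimension and fact rows above:
# join each fact row to the class dimension and total credits earned per
# program. Column names follow the example tables; data is a small subset.

classes = {  # dimension table: class_id -> descriptive attributes
    200: {"name": "Intro to Computer Science", "program": "Business Admin"},
    202: {"name": "Intro to Computer Science", "program": "Business Admin"},
    201: {"name": "Clay Sculpting 101", "program": "Art"},
}
facts = [  # fact table rows: (student_id, class_id, credit_earned, date_id)
    (100, 200, 26, 20170101),
    (103, 202, 80, 20170101),
    (104, 201, 90, 20170101),
    (101, 201, 26, 20170101),
]

credits_by_program = {}
for student_id, class_id, earned, date_id in facts:
    program = classes[class_id]["program"]  # the dimension lookup is the "join"
    credits_by_program[program] = credits_by_program.get(program, 0) + earned

print(credits_by_program)  # {'Business Admin': 106, 'Art': 116}
```

In SQL terms this is a single join from the fact table to one dimension plus a GROUP BY; the star shape keeps such questions to one join per dimension.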
36. Data Vault
Hybrid between 3NF and star schema
Created by Dan Linstedt
Persistent data layer – keep everything
Bring data over as needed
− Once touching an object, bring it all over
Can be a hybrid between relational databases and Hadoop
Massive parallel loading, eventual consistency (with Hadoop)
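The Data Vault shapes described above (hubs for business keys, links for relationships, satellites for versioned attributes) can be sketched with plain data structures. This is a toy illustration using the classroom example; the hash-key strings, table names, and columns are assumptions, not Data Vault standard output.

```python
# Illustrative sketch of Data Vault table shapes: hubs hold business keys,
# links hold many-to-many relationships between hubs, and satellites hold
# versioned attributes. Keys and column names are assumptions.

hub_student = [{"student_hk": "S100", "student_business_key": 1}]
hub_class = [{"class_hk": "C200", "class_business_key": 1}]

link_enrollment = [  # relationship between the student and class hubs
    {"link_hk": "L1", "student_hk": "S100", "class_hk": "C200"},
]

sat_class = [  # versioned attributes hanging off the class hub
    {"class_hk": "C200", "class_name": "Intro to Computer Science",
     "credits": 4, "load_date": "2017-01-01"},
    {"class_hk": "C200", "class_name": "Intro to Computer Science",
     "credits": 6, "load_date": "2017-02-01"},
]

# The latest satellite row per hub key is the current view of the class.
latest = max(sat_class, key=lambda r: r["load_date"])
print(latest["credits"])  # 6
```

Because changes only ever append satellite rows, loads for different hubs and satellites can run in parallel without coordinating updates.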
40. Resetting The Data Warehouse
● Especially Star Schema
– Business logic changes
– Missing requirements
● What to do?
– Reload from permanent storage (3NF DWH/Data Vault)
41. Performance Issues*
● Hitting the same/similar performance bottlenecks as the transactional system
● What to do?
– Check for proper indexing (can get complicated with star schema)
– Volume too high for the platform? Consider alternatives (other data stores, NoSQL for large static data sets)
– Matching too closely to the transactional model? Look at tuning the model for purpose
– Composite keys in star schema?
– Too many joins?
*This is a huge area, and trying to generalize it is difficult. There are other solutions we can discuss :)
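The "check for proper indexing" step can be demonstrated end to end with SQLite, which ships with Python. This is a minimal sketch, assuming an in-memory database and table/column names borrowed from the fact-table example; it is not a tuning recipe from the presentation.

```python
import sqlite3

# Illustrative indexing check using SQLite: an index on the fact table's
# foreign key lets the planner search by class_id instead of scanning the
# whole table. Table and column names follow the fact-table example.

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_credits (student_id INT, class_id INT, credit_earned INT)"
)
conn.executemany(
    "INSERT INTO fact_credits VALUES (?, ?, ?)",
    [(100, 200, 26), (101, 200, 37), (104, 201, 90)],
)
conn.execute("CREATE INDEX idx_fact_class ON fact_credits (class_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(credit_earned) FROM fact_credits WHERE class_id = 200"
).fetchall()
print(plan)  # the plan should mention idx_fact_class rather than a full scan
```

In a real warehouse the same check applies to every foreign key in the fact table; an unindexed dimension key is one of the most common causes of the bottlenecks listed above.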
Editor's Notes
So here’s a bit about me. There are three things I’m going to ask of you, the first being – please feel free to reach out! I love talking and learning about what folks are using out in the wild and sharing. If you want to know more or chat more about any topic within data science/business intelligence just message me via one of the above methods.
The second thing I’ll ask is to be aware that some of these solutions may fix your particular problems and you’ll iterate on them and we’ll find them super-awesome and maybe you’ll be able to give back and talk about your experiences at a conference or in a trade paper. Note that the business side might not realize the undertaking or super awesome things being done – they are designed to be seamless and make users lives easier.
The final ask before we get fully started is please don’t be the pointy-haired boss! We’re covering a lot of topics at a very high level and a lot of nuances aren’t being discussed (it’s only a 40 minute presentation after all). Please dig further and ask plenty of questions.
By the end of this presentation, you will know where traditional data warehousing is failing and have a basic understanding of what technologies and methodologies are helping to address the needs of more data savvy customer bases.
The concept of data warehouses started in the 1970s and fully came into their own during the late 80s and well into the 90s. Before relational databases, data was stored based on query usage and not necessarily based on the data itself. As a result, reporting was hard. Data would either have to be merged out piece-meal or stored again based on the specific query requirements.
Into that mess, a gentleman named Bill Inmon created the initial concept of separating reporting and analysis needs away from the OLTP layer.
With that said, here is a typical model/workflow. From OLTP systems, Excel files, etc. The data is moved into a 3NF model. From the 3NF model, star schema are built on top to handle all the reporting/analytics requirements. This model has worked very well but there are several problems that have come out with this model. While I don’t have an exact number, a high number of data warehouse projects are considered failures due to these issues. What are they? Glad you asked!
There are distinct groups of requirements that business intelligence tries to answer. Traditional data warehousing can answer the first two – what happened and why it happened. Where it starts to fail is in the predictive analytics space where again, data scientists want data that is not cleansed and conformed, but still easy to access. Then there is prescriptive analytics – applying the predictions found and making automated decisions based on them.
Graph Source: http://www.odoscope.com/technology/prescriptive-analysis/
Here is our basic example that we’ll be using through the rest of this presentation. It’s a simple student/teacher/class model that, while not modeled 100% ‘correct’, will provide a good example going forward.
Of the newer architectures, Data Vault is one of the easier to implement because it is a combination of both the Kimball and Inmon methods. Data is only brought over from source systems as needed as opposed to bringing everything from the source all at once.
The other really cool thing about Data Vault is that data can be offloaded into Hadoop as it ages and becomes non-volatile.
Image Source: https://pixabay.com/en/vault-strongbox-security-container-154023/
Here is that same model in data vault form. Business entities become hub tables. Relationships between hubs get stored in many to many relationship tables called links. Off both hubs and links are dimension-like tables called satellites that store all relative information of their related hub or link. Satellites version data as changes occur.
There’s not a large amount of information publicly available outside the book shown above. The original series of articles can be found on TDAN. There is also certification through the Learn Data Vault website.