Introduction To Data Warehousing
Triangle MySQL Users Group
March 24, 2017
Alex Meadows

About Alex

Principal Consultant (Data and Analytics), CSpring Inc.

Business Analytics Adjunct Professor, Wake Tech

MS in Business Intelligence

Passion for developing BI solutions that provide end users easy access to necessary data to find the answers they demand (even the ones they don’t know yet!)

Twitter: @OpenDataAlex LinkedIn: alexmeadows
GitHub: OpenDataAlex Email: ameadows@cspring.com
Agenda

Why Data Warehousing

Use Case Modeling Decisions

Building Your Data Warehouse

Data Warehouse Gotchas

Q&A
Please feel free to ask questions throughout the presentation!
Why Data Warehouses?

Started being discussed in the 1970s

While databases existed, they were not relational/normalized
− Network/hierarchical in nature
− Designed for query, not for data model

Reporting was hard
− System/application queries were not the same as management reporting queries
Design For Query
Shopping Order
Widgets
Thingys
Odds and Ends
Customers
Country
State
City
Bill Inmon
Data warehouses: subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process
Business
Requirements
DATA
How Many Customers
Like Animals?
Dogs?
Dogs with short hair?
Customers Sales Marketing Employee
Enterprise Data Warehouse
STAR
SCHEMAS
Traditional Model
Business Intelligence Use Cases

Traditional Data Warehousing focuses on Descriptive and
Diagnostic questions

Predictive and Prescriptive questions require other types of tooling (e.g. simulation modeling, statistics)
Use Case Modeling Decisions
-or-
Do You Need A Data Warehouse?
Case 1: Operational Data Store Holds All Data
● Classic use-case
● System bogged down with historic ‘valuable’ data
● Applications may try to take advantage of hosting the data for features
Case 2: Adding Unnecessary Objects
● Adding objects that meet more reporting needs than application requirements
● Impacting maintainability through duplication of data
● Database/application handling data movement to these objects
Case 3: Complex Relationships Based On Filtering Factors
● Storing data by clear delimiters
– Time – Quarters, Months, etc.
– Geography – Region, State, City, etc.
– Business logic
● Makes querying very complicated
● Also can impact high availability architecture
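To illustrate why splitting data by a delimiter such as time makes querying complicated, here is a minimal sketch using Python's built-in sqlite3 module. The per-quarter table names, columns, and rows are hypothetical and not taken from the talk; the point is only that every report must know about, and stitch together, each period table.

```python
import sqlite3

# Hypothetical per-quarter tables; names, columns, and rows are illustrative only.
QUARTER_TABLES = ["orders_2016_q4", "orders_2017_q1"]


def build_split_store():
    """Create one table per quarter, as in the 'store by delimiter' pattern."""
    conn = sqlite3.connect(":memory:")
    for table in QUARTER_TABLES:
        conn.execute(f"CREATE TABLE {table} (customer TEXT, amount INTEGER)")
    conn.execute("INSERT INTO orders_2016_q4 VALUES ('acme', 10), ('zenith', 5)")
    conn.execute("INSERT INTO orders_2017_q1 VALUES ('acme', 7)")
    return conn


def total_by_customer(conn):
    """A simple total already needs a UNION ALL over every period table."""
    union = " UNION ALL ".join(
        f"SELECT customer, amount FROM {table}" for table in QUARTER_TABLES
    )
    sql = (
        f"SELECT customer, SUM(amount) FROM ({union}) "
        "GROUP BY customer ORDER BY customer"
    )
    return conn.execute(sql).fetchall()


totals = total_by_customer(build_split_store())
```

Each new quarter adds another table to the list, so every reporting query (or the code that generates it) has to change along with the storage layout.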
Do You Need A DWH?
● Case 1 – Data volume/historical data
● Case 2/3 – Transactional database not matching reporting/analysis requirements
● If performance isn’t an issue (yet) then you have some time
● If data volume is tiny (under ~50 GB) then maybe not
● We’re going to assume that the DWH is needed ;)
Building Your Data Warehouse
Traditional
Iterations On Existing Architecture
Inmon: 3rd Normal Form
● Normalize on Objects, Relationships
● Focus on all data stores
– Join data sets as necessary
– Look into Master Data Management practices for
true store merging
Classroom Transaction System

Student First Name  Student Last Name  Student System ID
Bob                 Young              1
Robert              Young              2
Jennifer            Owens              3
Andrew              Collins            4

Student ID  Class ID  Student Grade
1           1         A
2           1         B
3           1         B
4           1         C
2           2         A

Class ID  Class Name                 Class Program   Class Credits
1         Intro to Computer Science  Business Admin  4
2         Clay Sculpting 101         Art             2
Classroom 3NF Data Warehouse

Student First Name  Student Last Name  Student System ID  Student ID
Bob                 Young              1                  100
Robert              Young              2                  101
Jennifer            Owens              3                  102
Andrew              Collins            4                  103

Class ID  Class System ID  Class Name                 Class Program   Class Credits
200       1                Intro to Computer Science  Business Admin  4
201       2                Clay Sculpting 101         Art             2

Student ID  Class ID  Student Grade
100         200       A
101         200       B
102         200       B
103         200       C
101         201       A
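The warehouse tables above keep each source system ID and add a warehouse-owned ID alongside it. A minimal sketch of that load step, using Python's built-in sqlite3 module, might look like the following. The table and column names, and the choice to number keys sequentially from 100, are assumptions made to match the slide data, not the author's implementation.

```python
import sqlite3

# (first name, last name, source system ID), copied from the slide.
SOURCE_STUDENTS = [
    ("Bob", "Young", 1),
    ("Robert", "Young", 2),
    ("Jennifer", "Owens", 3),
    ("Andrew", "Collins", 4),
]


def load_students(rows, first_key=100):
    """Load source rows into a warehouse table, assigning warehouse IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE student (
               student_id INTEGER PRIMARY KEY,      -- warehouse key (assumed sequential)
               student_system_id INTEGER NOT NULL,  -- key from the source system
               first_name TEXT,
               last_name TEXT)"""
    )
    for offset, (first, last, system_id) in enumerate(rows):
        conn.execute(
            "INSERT INTO student VALUES (?, ?, ?, ?)",
            (first_key + offset, system_id, first, last),
        )
    return conn


warehouse = load_students(SOURCE_STUDENTS)
# Map each source system ID to the warehouse ID it was given.
key_map = warehouse.execute(
    "SELECT student_system_id, student_id FROM student ORDER BY student_id"
).fetchall()
```

Keeping the source ID next to the warehouse ID is what lets relationship tables, such as the grade table, be re-keyed from source IDs (1, 2, ...) to warehouse IDs (100, 101, ...).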
DWH = Version Control

Class ID  Class System ID  Class Name                 Class Program   Class Credits  Create Date  Update Date  Version
200       1                Intro to Computer Science  Business Admin  4              01/01/17     02/01/17     1
202       1                Intro to Computer Science  Business Admin  6              02/01/17     02/01/17     2
201       2                Clay Sculpting 101         Art             2              01/01/17     01/01/17     1
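One way to read the version-control table: when a class changes, the old row is kept and stamped with the change date, and a new row with a new Class ID and the next version number is added. The sketch below reproduces that behavior with Python's built-in sqlite3 module. The function name, the "max ID plus one" key assignment, and the ISO date strings (reading 02/01/17 as February 1, 2017) are assumptions for illustration.

```python
import sqlite3


def build_class_table():
    """Start from the two original class rows, each at version 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE class (
               class_id INTEGER PRIMARY KEY,
               class_system_id INTEGER,
               class_name TEXT,
               class_program TEXT,
               class_credits INTEGER,
               create_date TEXT,
               update_date TEXT,
               version INTEGER)"""
    )
    conn.executemany(
        "INSERT INTO class VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [
            (200, 1, "Intro to Computer Science", "Business Admin", 4,
             "2017-01-01", "2017-01-01", 1),
            (201, 2, "Clay Sculpting 101", "Art", 2,
             "2017-01-01", "2017-01-01", 1),
        ],
    )
    return conn


def change_credits(conn, class_system_id, new_credits, change_date):
    """Keep the old row, stamp it, and insert a new versioned row."""
    old_id, name, program, version = conn.execute(
        "SELECT class_id, class_name, class_program, version FROM class "
        "WHERE class_system_id = ? ORDER BY version DESC LIMIT 1",
        (class_system_id,),
    ).fetchone()
    # Mark when the previous version stopped being current.
    conn.execute(
        "UPDATE class SET update_date = ? WHERE class_id = ?",
        (change_date, old_id),
    )
    # Assumed key assignment: next unused warehouse ID.
    new_id = conn.execute("SELECT MAX(class_id) + 1 FROM class").fetchone()[0]
    conn.execute(
        "INSERT INTO class VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (new_id, class_system_id, name, program, new_credits,
         change_date, change_date, version + 1),
    )
    return new_id


conn = build_class_table()
new_class_id = change_credits(conn, 1, 6, "2017-02-01")
versions = conn.execute(
    "SELECT class_id, class_credits, update_date, version FROM class "
    "WHERE class_system_id = 1 ORDER BY version"
).fetchall()
```

With the slide's starting data, the new row receives Class ID 202, which matches the table above.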
Also For Relationships!
● Historical vs current relationships
● Different ways of handling version control (dimensionality)

Student ID  Class ID  Student Grade  Create Date  Update Date
100         200       A              01/01/17     01/01/17
101         200       B              01/01/17     01/01/17
102         200       B              01/01/17     01/01/17
103         200       C              02/01/17     02/01/17
104         200       D              02/01/17     02/01/17
103         202       C              02/01/17     02/01/17
104         202       D              02/01/17     02/01/17
104         201       A              01/01/17     01/01/17
101         201       A              01/01/17     01/01/17
Slowly Changing Dimensions/Managing Changes
SCD 1
SCD 2
SCD 3
Dimensionality
● Concept originated with star schema
● Store data changes based on what is being done with the data/long term utilization
● Build models based on objects and relationships
Dimensionality
BUS Architecture
Star Schema Example
Student Dim
Class Dim
Student
Fact
Professor Dim
Dimension Example

Class ID  Class Name                 Class Program   Topic
200       Intro to Computer Science  Business Admin  Computer Science
202       Intro to Computer Science  Business Admin  Computer Science
201       Clay Sculpting 101         Art             Sculpture
Fact Table Example

Student ID  Class ID  Credit Earned  Credit Maximum  Date ID
100         200       26             120             20170101
101         200       37             120             20170101
102         200       12             120             20170101
103         200       42             120             20170101
104         200       16             120             20170101
103         202       80             120             20170101
104         202       120            120             20170101
104         201       90             120             20170101
101         201       26             120             20170101
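To show how a fact table and a dimension work together, the following sketch loads the class dimension and fact rows from the slides into an in-memory SQLite database and totals credits earned per class program. The table and column names (class_dim, student_fact, and so on) are assumed for illustration; only the row values come from the slides.

```python
import sqlite3

# (class_id, class_name, class_program, topic), from the Dimension Example slide.
CLASS_DIM = [
    (200, "Intro to Computer Science", "Business Admin", "Computer Science"),
    (202, "Intro to Computer Science", "Business Admin", "Computer Science"),
    (201, "Clay Sculpting 101", "Art", "Sculpture"),
]

# (student_id, class_id, credit_earned, credit_maximum, date_id), from the fact slide.
STUDENT_FACT = [
    (100, 200, 26, 120, 20170101),
    (101, 200, 37, 120, 20170101),
    (102, 200, 12, 120, 20170101),
    (103, 200, 42, 120, 20170101),
    (104, 200, 16, 120, 20170101),
    (103, 202, 80, 120, 20170101),
    (104, 202, 120, 120, 20170101),
    (104, 201, 90, 120, 20170101),
    (101, 201, 26, 120, 20170101),
]


def build_star():
    """Create a one-dimension star: student_fact pointing at class_dim."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE class_dim (class_id INTEGER PRIMARY KEY, "
        "class_name TEXT, class_program TEXT, topic TEXT)"
    )
    conn.execute(
        "CREATE TABLE student_fact (student_id INTEGER, class_id INTEGER, "
        "credit_earned INTEGER, credit_maximum INTEGER, date_id INTEGER)"
    )
    conn.executemany("INSERT INTO class_dim VALUES (?, ?, ?, ?)", CLASS_DIM)
    conn.executemany("INSERT INTO student_fact VALUES (?, ?, ?, ?, ?)", STUDENT_FACT)
    return conn


def credits_by_program(conn):
    """Typical star-schema query: join fact to dimension, group by an attribute."""
    return conn.execute(
        "SELECT d.class_program, SUM(f.credit_earned) "
        "FROM student_fact f JOIN class_dim d ON d.class_id = f.class_id "
        "GROUP BY d.class_program ORDER BY d.class_program"
    ).fetchall()


program_totals = credits_by_program(build_star())
```

The fact table holds only keys and measures; descriptive attributes such as Class Program live in the dimension, so a report groups by a dimension column with a single join.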
Data Vault

Hybrid between 3NF and star schema

Created by Dan Linstedt

Persistent data layer – keep everything

Bring data over as needed
− Once touching an object, bring it all over

Can be hybrid between relational databases and Hadoop

Massive parallel loading, eventual consistency (with Hadoop)
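The speaker notes describe the Data Vault layout as hubs (business entities), links (many-to-many relationships between hubs), and satellites (versioned descriptive data). A rough sketch of that layout for the classroom example follows, again using sqlite3. All table and column names, including the _hk suffix and load_date, are assumptions for illustration and are not taken from the Data Vault documentation.

```python
import sqlite3

DATA_VAULT_DDL = """
CREATE TABLE hub_student (
    student_hk INTEGER PRIMARY KEY,        -- hub key generated by the warehouse
    student_system_id INTEGER UNIQUE,      -- business key from the source
    load_date TEXT
);
CREATE TABLE hub_class (
    class_hk INTEGER PRIMARY KEY,
    class_system_id INTEGER UNIQUE,
    load_date TEXT
);
CREATE TABLE link_enrollment (             -- many-to-many link between hubs
    enrollment_hk INTEGER PRIMARY KEY,
    student_hk INTEGER REFERENCES hub_student (student_hk),
    class_hk INTEGER REFERENCES hub_class (class_hk),
    load_date TEXT
);
CREATE TABLE sat_class (                   -- satellite: versioned descriptive data
    class_hk INTEGER REFERENCES hub_class (class_hk),
    load_date TEXT,
    class_name TEXT,
    class_credits INTEGER,
    PRIMARY KEY (class_hk, load_date)
);
"""


def build_vault():
    """Create the tables and load one class with two satellite versions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DATA_VAULT_DDL)
    conn.execute("INSERT INTO hub_class VALUES (1, 1, '2017-01-01')")
    conn.executemany(
        "INSERT INTO sat_class VALUES (?, ?, ?, ?)",
        [
            (1, "2017-01-01", "Intro to Computer Science", 4),
            (1, "2017-02-01", "Intro to Computer Science", 6),
        ],
    )
    return conn


def current_credits(conn, class_system_id):
    """Read the newest satellite row for a class, found via its business key."""
    row = conn.execute(
        "SELECT s.class_credits FROM hub_class h "
        "JOIN sat_class s ON s.class_hk = h.class_hk "
        "WHERE h.class_system_id = ? "
        "ORDER BY s.load_date DESC LIMIT 1",
        (class_system_id,),
    ).fetchone()
    return row[0]


vault = build_vault()
latest_credits = current_credits(vault, 1)
satellite_rows = vault.execute("SELECT COUNT(*) FROM sat_class").fetchone()[0]
```

The hub row never changes once loaded; every change to the class lands as a new satellite row, which is how the model keeps everything.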

1.0 documentation found at:

TDAN Article

2.0 documentation ->

Certification/training:

http://learndatavault.com/
Data Warehouse Gotchas
Resetting The Data Warehouse
● Especially Star Schema
– Business Logic changes
– Missing requirements
● What to do?
– Reload from permanent storage (3NF DWH/Data Vault)
Performance Issues*
● Hitting the same/similar performance bottlenecks as transactional system
● What to do?
– Check for proper indexing (can get complicated with star schema)
– Volume too high for platform? Consider alternatives (other data stores, NoSQL for large static data sets)
– Matching too closely to transactional model? Look at tuning the model for purpose
– Composite keys in star schema?
– Too many joins?
*This is a huge area, and trying to generalize it is difficult. There are other solutions we can discuss :)
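As a small illustration of the "check for proper indexing" item, the sketch below compares the query plan for a fact-table filter before and after adding an index. It uses SQLite's EXPLAIN QUERY PLAN because sqlite3 ships with Python; MySQL's equivalent is the EXPLAIN statement, whose output looks different. The table, column, and index names are hypothetical.

```python
import sqlite3


def plan_for(conn, sql):
    """Return SQLite's query plan as one string (last column holds the detail)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student_fact (student_id INTEGER, class_id INTEGER, "
    "credit_earned INTEGER, date_id INTEGER)"
)
query = "SELECT SUM(credit_earned) FROM student_fact WHERE class_id = 200"

# Without an index the filter has to scan the whole fact table.
plan_before = plan_for(conn, query)

# Hypothetical index on the dimension key used in the filter.
conn.execute("CREATE INDEX idx_fact_class ON student_fact (class_id)")
plan_after = plan_for(conn, query)
```

Printing plan_before and plan_after shows whether the optimizer picked up the new index. On a real star schema the choice of which dimension keys to index, and in what combination, depends on the queries actually being run.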

Editor's Notes

  • #3 So here’s a bit about me. There are three things I’m going to ask of you, the first being – please feel free to reach out! I love talking and learning about what folks are using out in the wild and sharing. If you want to know more or chat more about any topic within data science/business intelligence just message me via one of the above methods.
  • #4 The second thing I’ll ask is to be aware that some of these solutions may fix your particular problems. You’ll iterate on them, we’ll find them super-awesome, and maybe you’ll be able to give back and talk about your experiences at a conference or in a trade paper. Note that the business side might not realize the undertaking or the super-awesome things being done – they are designed to be seamless and make users’ lives easier.
  • #5 The final ask before we get fully started is please don’t be the pointy-haired boss! We’re covering a lot of topics at a very high level and a lot of nuances aren’t being discussed (it’s only a 40 minute presentation after all). Please dig further and ask plenty of questions.
  • #6 By the end of this presentation, you will know where traditional data warehousing is failing and have a basic understanding of what technologies and methodologies are helping to address the needs of more data savvy customer bases.
  • #7 The concept of data warehouses started in the 1970s and fully came into its own during the late 80s and well into the 90s. Before relational databases, data was stored based on query usage and not necessarily based on the data itself. As a result, reporting was hard. Data would either have to be merged piecemeal or stored again based on the specific query requirements.
  • #9 Into that mess, a gentleman named Bill Inmon created the initial concept of separating reporting and analysis needs away from the OLTP layer.
  • #12 With that said, here is a typical model/workflow. From OLTP systems, Excel files, etc., the data is moved into a 3NF model. From the 3NF model, star schemas are built on top to handle all the reporting/analytics requirements. This model has worked very well, but several problems have come out of it. While I don’t have an exact number, a high number of data warehouse projects are considered failures due to these issues. What are they? Glad you asked!
  • #13 There are distinct groups of requirements that business intelligence tries to answer. Traditional data warehousing can answer the first two – what happened and why it happened. Where it starts to fail is in the predictive analytics space where, again, data scientists want data that is not cleansed and conformed, but still easy to access. Then there is prescriptive analytics – applying the predictions found and making automated decisions based on them. Graph Source: http://www.odoscope.com/technology/prescriptive-analysis/
  • #22 Here is our basic example that we’ll be using through the rest of this presentation. It’s a simple student/teacher/class model that, while not modeled 100% ‘correct’, will provide a good example going forward.
  • #37 Of the newer architectures, Data Vault is one of the easier to implement because it is a combination of both the Kimball and Inmon methods. Data is only brought over from source systems as needed as opposed to bringing everything from the source all at once. The other really cool thing about Data Vault is that data can be offloaded into Hadoop as it ages and becomes non-volatile. Image Source: https://pixabay.com/en/vault-strongbox-security-container-154023/
  • #38 Here is that same model in data vault form. Business entities become hub tables. Relationships between hubs get stored in many-to-many relationship tables called links. Off both hubs and links are dimension-like tables called satellites that store all relevant information about their related hub or link. Satellites version data as changes occur.
  • #39 There’s not a large amount of information publicly available outside the book, shown above. The original series of articles can be found on TDAN. There is also certification through the learn data vault website.
  • #43 Image Source: https://pixabay.com/p-1014060/?no_redirect