This document provides an introduction to data warehousing. It discusses why data warehouses are used, as they allow organizations to store historical data and perform complex analytics across multiple data sources. The document outlines common use cases and decisions in building a data warehouse, such as normalization, dimension modeling, and handling changes over time. It also notes some potential issues like performance bottlenecks and discusses strategies for addressing them, such as indexing and considering alternative data storage options.
2. About Alex
Principal Consultant (Data and Analytics), CSpring Inc.
Business Analytics Adjunct Professor, Wake Tech
MS in Business Intelligence
Passion for developing BI solutions that give end users easy access to the data they need to find the answers they demand (even the ones they don’t know yet!)
Twitter: @OpenDataAlex | LinkedIn: alexmeadows | GitHub: OpenDataAlex | Email: ameadows@cspring.com
5. Agenda
Why Data Warehousing
Use Case Modeling Decisions
Building Your Data Warehouse
Data Warehouse Gotchas
Q&A
Please feel free to ask questions throughout the presentation!
6. Why Data Warehouses?
First discussed in the 1970s
While databases existed, they were not relational/normalized
− Network/hierarchical in nature
− Designed for query patterns, not for the data model
Reporting was hard
− System/application queries were not the same as management reporting queries
8. Bill Inmon
Data warehouses: a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process
12. Business Intelligence Use Cases
Traditional Data Warehousing focuses on Descriptive and Diagnostic questions
Predictive and Prescriptive questions require other types of tooling (e.g. simulation modeling, statistics, etc.)
14. Case 1: Operational Data Store Holds All Data
● Classic use-case
● System bogged down with historic ‘valuable’ data
● Applications may try to take advantage of hosting the data for features
15. Case 2: Adding Unnecessary Objects
● Adding objects that meet more reporting needs than application requirements
● Impacting maintainability through duplication of data
● Database/application handling data movement to these objects
16. Case 3: Complex Relationships Based On Filtering Factors
● Storing data by clear delimiters
– Time – Quarters, Months, etc.
– Geography – Region, State, City, etc.
– Business logic
● Makes querying very complicated
● Can also impact high availability architecture
17. Do You Need A DWH?
● Case 1 – Data volume/historical data
● Case 2/3 – Transactional database not matching reporting/analysis requirements
● If performance isn’t an issue (yet) then you have some time
● If data volume is tiny (under ~50 GB) then maybe not
● We’re going to assume that the DWH is needed ;)
20. Inmon: 3rd Normal Form
● Normalize on objects and relationships
● Focus on all data stores
– Join data sets as necessary
– Look into Master Data Management practices for true store merging
22. Classroom Transaction System

Student First Name | Student Last Name | Student System ID
Bob | Young | 1
Robert | Young | 2
Jennifer | Owens | 3
Andrew | Collins | 4

Student ID | Class ID | Student Grade
1 | 1 | A
2 | 1 | B
3 | 1 | B
4 | 1 | C
2 | 2 | A

Class ID | Class Name | Class Program | Class Credits
1 | Intro to Computer Science | Business Admin | 4
2 | Clay Sculpting 101 | Art | 2
23. Classroom 3NF Data Warehouse

Student First Name | Student Last Name | Student System ID | Student ID
Bob | Young | 1 | 100
Robert | Young | 2 | 101
Jennifer | Owens | 3 | 102
Andrew | Collins | 4 | 103

Class ID | Class System ID | Class Name | Class Program | Class Credits
200 | 1 | Intro to Computer Science | Business Admin | 4
201 | 2 | Clay Sculpting 101 | Art | 2

Student ID | Class ID | Student Grade
100 | 200 | A
101 | 200 | B
102 | 200 | B
103 | 200 | C
101 | 201 | A
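The surrogate-key step above (Student System ID 1 becomes warehouse Student ID 100, and so on) can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation; the function name, field names, and starting offset of 100 are all assumptions chosen to match the example tables.

```python
# Illustrative sketch: assigning warehouse surrogate keys to source rows,
# mirroring the 3NF example above (Student System ID 1 -> Student ID 100).
# Names and starting offsets are assumptions for this example.

def assign_surrogate_keys(rows, system_id_field, start_key):
    """Map each source system ID to a new warehouse surrogate key."""
    key_map = {}
    warehouse_rows = []
    next_key = start_key
    for row in rows:
        system_id = row[system_id_field]
        if system_id not in key_map:
            key_map[system_id] = next_key
            next_key += 1
        warehouse_rows.append({**row, "warehouse_id": key_map[system_id]})
    return warehouse_rows, key_map

students = [
    {"first": "Bob", "last": "Young", "system_id": 1},
    {"first": "Robert", "last": "Young", "system_id": 2},
    {"first": "Jennifer", "last": "Owens", "system_id": 3},
    {"first": "Andrew", "last": "Collins", "system_id": 4},
]
warehouse_students, student_keys = assign_surrogate_keys(students, "system_id", 100)
print(student_keys)  # {1: 100, 2: 101, 3: 102, 4: 103}
```

Keeping the system ID alongside the warehouse key is what lets the warehouse absorb rows from multiple source systems without key collisions.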
24. DWH = Version Control

Class ID | Class System ID | Class Name | Class Program | Class Credits | Create Date | Update Date | Version
200 | 1 | Intro to Computer Science | Business Admin | 4 | 01/01/17 | 02/01/17 | 1
202 | 1 | Intro to Computer Science | Business Admin | 6 | 02/01/17 | 02/01/17 | 2
201 | 2 | Clay Sculpting 101 | Art | 2 | 01/01/17 | 01/01/17 | 1
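The versioning pattern in the table above (credits change from 4 to 6, so a new row 202 with version 2 is inserted rather than updating row 200 in place) can be sketched as follows. This is an illustrative sketch only; the function and field names are assumptions, not part of the presentation.

```python
from datetime import date

# Illustrative sketch of the versioning shown above: when a tracked attribute
# changes (credits 4 -> 6), insert a new row with a new surrogate key and a
# bumped version instead of updating in place. Field names are assumptions.

def apply_change(history, system_id, new_attrs, change_date, next_key):
    """Append a new version row for a changed class; return the new row."""
    versions = [r for r in history if r["class_system_id"] == system_id]
    latest = max(versions, key=lambda r: r["version"])
    new_row = {
        **latest,
        **new_attrs,
        "class_id": next_key,
        "create_date": change_date,
        "update_date": change_date,
        "version": latest["version"] + 1,
    }
    latest["update_date"] = change_date  # record when the old version was superseded
    history.append(new_row)
    return new_row

history = [{
    "class_id": 200, "class_system_id": 1,
    "class_name": "Intro to Computer Science", "credits": 4,
    "create_date": date(2017, 1, 1), "update_date": date(2017, 1, 1),
    "version": 1,
}]
row = apply_change(history, 1, {"credits": 6}, date(2017, 2, 1), next_key=202)
print(row["version"], row["credits"])  # 2 6
```

Because old rows are never deleted, the warehouse can answer "what were the credits for this class in January?" long after the source system has overwritten the value.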
25. Also For Relationships!
● Historical vs current relationships
● Different ways of handling version control (dimensionality)

Student ID | Class ID | Student Grade | Create Date | Update Date
100 | 200 | A | 01/01/17 | 01/01/17
101 | 200 | B | 01/01/17 | 01/01/17
102 | 200 | B | 01/01/17 | 01/01/17
103 | 200 | C | 02/01/17 | 02/01/17
104 | 200 | D | 02/01/17 | 02/01/17
103 | 202 | C | 02/01/17 | 02/01/17
104 | 202 | D | 02/01/17 | 02/01/17
104 | 201 | A | 01/01/17 | 01/01/17
101 | 201 | A | 01/01/17 | 01/01/17
30. Dimensionality
● Concept originated with the star schema
● Store data changes based on what is being done with the data/long-term utilization
● Build models based on objects and relationships
34. Dimension Example

Class ID | Class Name | Class Program | Topic
200 | Intro to Computer Science | Business Admin | Computer Science
202 | Intro to Computer Science | Business Admin | Computer Science
201 | Clay Sculpting 101 | Art | Sculpture
35. Fact Table Example

Student ID | Class ID | Credit Earned | Credit Maximum | Date ID
100 | 200 | 26 | 120 | 20170101
101 | 200 | 37 | 120 | 20170101
102 | 200 | 12 | 120 | 20170101
103 | 200 | 42 | 120 | 20170101
104 | 200 | 16 | 120 | 20170101
103 | 202 | 80 | 120 | 20170101
104 | 202 | 120 | 120 | 20170101
104 | 201 | 90 | 120 | 20170101
101 | 201 | 26 | 120 | 20170101
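A typical star-schema query joins the fact table to a dimension and aggregates. The sketch below does that join-and-group over a subset of the example rows above; the in-memory dictionaries stand in for database tables, and the structure (a dimension lookup keyed by Class ID) is an assumption made for illustration.

```python
# Illustrative star-schema query over the dimension and fact rows above:
# join each fact row to the class dimension and total credits earned per
# program. Column names follow the example tables; data is a small subset.

classes = {  # dimension table: class_id -> descriptive attributes
    200: {"name": "Intro to Computer Science", "program": "Business Admin"},
    202: {"name": "Intro to Computer Science", "program": "Business Admin"},
    201: {"name": "Clay Sculpting 101", "program": "Art"},
}
facts = [  # fact table rows: (student_id, class_id, credit_earned, date_id)
    (100, 200, 26, 20170101),
    (103, 202, 80, 20170101),
    (104, 201, 90, 20170101),
    (101, 201, 26, 20170101),
]

credits_by_program = {}
for student_id, class_id, earned, date_id in facts:
    program = classes[class_id]["program"]  # the dimension lookup is the "join"
    credits_by_program[program] = credits_by_program.get(program, 0) + earned

print(credits_by_program)  # {'Business Admin': 106, 'Art': 116}
```

In SQL terms this is a single join from the fact table to one dimension plus a GROUP BY; the star shape keeps such questions to one join per dimension.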
36. Data Vault
Hybrid between 3NF and star schema
Created by Dan Linstedt
Persistent data layer – keep everything
Bring data over as needed
− Once touching an object, bring it all over
Can be a hybrid between relational databases and Hadoop
Massive parallel loading, eventual consistency (with Hadoop)
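The Data Vault shapes described above (hubs for business keys, links for relationships, satellites for versioned attributes) can be sketched with plain data structures. This is a toy illustration using the classroom example; the hash-key strings, table names, and columns are assumptions, not Data Vault standard output.

```python
# Illustrative sketch of Data Vault table shapes: hubs hold business keys,
# links hold many-to-many relationships between hubs, and satellites hold
# versioned attributes. Keys and column names are assumptions.

hub_student = [{"student_hk": "S100", "student_business_key": 1}]
hub_class = [{"class_hk": "C200", "class_business_key": 1}]

link_enrollment = [  # relationship between the student and class hubs
    {"link_hk": "L1", "student_hk": "S100", "class_hk": "C200"},
]

sat_class = [  # versioned attributes hanging off the class hub
    {"class_hk": "C200", "class_name": "Intro to Computer Science",
     "credits": 4, "load_date": "2017-01-01"},
    {"class_hk": "C200", "class_name": "Intro to Computer Science",
     "credits": 6, "load_date": "2017-02-01"},
]

# The latest satellite row per hub key is the current view of the class.
latest = max(sat_class, key=lambda r: r["load_date"])
print(latest["credits"])  # 6
```

Because changes only ever append satellite rows, loads for different hubs and satellites can run in parallel without coordinating updates.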
40. Resetting The Data Warehouse
● Especially Star Schema
– Business logic changes
– Missing requirements
● What to do?
– Reload from permanent storage (3NF DWH/Data Vault)
41. Performance Issues*
● Hitting the same/similar performance bottlenecks as the transactional system
● What to do?
– Check for proper indexing (can get complicated with star schema)
– Volume too high for the platform? Consider alternatives (other data stores, NoSQL for large static data sets)
– Matching too closely to the transactional model? Look at tuning the model for purpose
– Composite keys in star schema?
– Too many joins?
*This is a huge area, and trying to generalize it is difficult. There are other solutions we can discuss :)
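The "check for proper indexing" step can be demonstrated end to end with SQLite, which ships with Python. This is a minimal sketch, assuming an in-memory database and table/column names borrowed from the fact-table example; it is not a tuning recipe from the presentation.

```python
import sqlite3

# Illustrative indexing check using SQLite: an index on the fact table's
# foreign key lets the planner search by class_id instead of scanning the
# whole table. Table and column names follow the fact-table example.

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_credits (student_id INT, class_id INT, credit_earned INT)"
)
conn.executemany(
    "INSERT INTO fact_credits VALUES (?, ?, ?)",
    [(100, 200, 26), (101, 200, 37), (104, 201, 90)],
)
conn.execute("CREATE INDEX idx_fact_class ON fact_credits (class_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(credit_earned) FROM fact_credits WHERE class_id = 200"
).fetchall()
print(plan)  # the plan should mention idx_fact_class rather than a full scan
```

In a real warehouse the same check applies to every foreign key in the fact table; an unindexed dimension key is one of the most common causes of the bottlenecks listed above.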
Editor's Notes
So here’s a bit about me. There are three things I’m going to ask of you, the first being – please feel free to reach out! I love talking and learning about what folks are using out in the wild and sharing. If you want to know more or chat more about any topic within data science/business intelligence just message me via one of the above methods.
The second thing I’ll ask is to be aware that some of these solutions may fix your particular problems and you’ll iterate on them and we’ll find them super-awesome and maybe you’ll be able to give back and talk about your experiences at a conference or in a trade paper. Note that the business side might not realize the undertaking or super awesome things being done – they are designed to be seamless and make users lives easier.
The final ask before we get fully started is please don’t be the pointy-haired boss! We’re covering a lot of topics at a very high level and a lot of nuances aren’t being discussed (it’s only a 40 minute presentation after all). Please dig further and ask plenty of questions.
By the end of this presentation, you will know where traditional data warehousing is failing and have a basic understanding of what technologies and methodologies are helping to address the needs of more data savvy customer bases.
The concept of data warehouses started in the 1970s and fully came into their own during the late 80s and well into the 90s. Before relational databases, data was stored based on query usage and not necessarily based on the data itself. As a result, reporting was hard. Data would either have to be merged out piece-meal or stored again based on the specific query requirements.
Into that mess, a gentleman named Bill Inmon created the initial concept of separating reporting and analysis needs away from the OLTP layer.
With that said, here is a typical model/workflow. From OLTP systems, Excel files, etc. The data is moved into a 3NF model. From the 3NF model, star schema are built on top to handle all the reporting/analytics requirements. This model has worked very well but there are several problems that have come out with this model. While I don’t have an exact number, a high number of data warehouse projects are considered failures due to these issues. What are they? Glad you asked!
There are distinct groups of requirements that business intelligence tries to answer. Traditional data warehousing can answer the first two – what happened and why it happened. Where it starts to fail is in the predictive analytics space where again, data scientists want data that is not cleansed and conformed, but still easy to access. Then there is prescriptive analytics – applying the predictions found and making automated decisions based on them.
Graph Source: http://www.odoscope.com/technology/prescriptive-analysis/
Here is our basic example that we’ll be using through the rest of this presentation. It’s a simple student/teacher/class model that, while not modeled 100% ‘correct’, will provide a good example going forward.
Of the newer architectures, Data Vault is one of the easier to implement because it is a combination of both the Kimball and Inmon methods. Data is only brought over from source systems as needed as opposed to bringing everything from the source all at once.
The other really cool thing about Data Vault is that data can be offloaded into Hadoop as it ages and becomes non-volatile.
Image Source: https://pixabay.com/en/vault-strongbox-security-container-154023/
Here is that same model in data vault form. Business entities become hub tables. Relationships between hubs get stored in many to many relationship tables called links. Off both hubs and links are dimension-like tables called satellites that store all relative information of their related hub or link. Satellites version data as changes occur.
There’s not a large amount of information publicly available outside the book shown above. The original series of articles can be found on TDAN. There is also certification through the Learn Data Vault website.