Invited to speak to a group of final-year students on the topic of data warehousing, the author prepared this pack to give a snapshot of the data platform industry
2.
This pack was put together for a 2-hour session for final-year students on the topic of Data Warehousing.
The contents are the author's own views as a practitioner, with no claim to any originality, accuracy or completeness.
The narrative is intended to help the reader gain a brief insight into what goes into the design of a modern data platform, along with its considerations and constraints.
As the author, I am keen to hear your comments. Please send your feedback to pravbs@gmail.com
3.
Let us look at a typical data model for a ticket reservation system
4. As a Traveler, I want to be able to reserve a ticket
3NF data model for ticket reservation
1. The data model is friendly for CREATE, READ, UPDATE and DELETE
2. Most importantly, UPDATEs are limited to, say, one group of tables
3. An event is described through the use of JOINs to merge data from two or more tables
4. INDEXes are used on tables to improve query performance
5. Such systems are bracketed as Online Transaction Processing (OLTP) systems
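The 3NF model above can be sketched with SQLite. The table and column names here are illustrative assumptions, not the deck's actual schema; the point is one entity per table, keys between them, an index, and a JOIN that reassembles the event.

```python
import sqlite3

# Minimal 3NF sketch: each entity in its own table, linked by keys.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE traveler (
    traveler_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE trip (
    trip_id     INTEGER PRIMARY KEY,
    origin      TEXT NOT NULL,
    destination TEXT NOT NULL
);
CREATE TABLE reservation (
    reservation_id INTEGER PRIMARY KEY,
    traveler_id    INTEGER NOT NULL REFERENCES traveler(traveler_id),
    trip_id        INTEGER NOT NULL REFERENCES trip(trip_id),
    seats          INTEGER NOT NULL
);
-- an index to speed up lookups by traveler
CREATE INDEX idx_reservation_traveler ON reservation(traveler_id);
""")
cur.execute("INSERT INTO traveler VALUES (1, 'Asha')")
cur.execute("INSERT INTO trip VALUES (10, 'Pune', 'Mumbai')")
cur.execute("INSERT INTO reservation VALUES (100, 1, 10, 2)")

# a JOIN merges data from all three tables to describe one event
cur.execute("""
    SELECT t.name, tr.origin, tr.destination, r.seats
    FROM reservation r
    JOIN traveler t ON t.traveler_id = r.traveler_id
    JOIN trip tr    ON tr.trip_id = r.trip_id
""")
booking = cur.fetchone()
```

Note that UPDATEs to a booking touch only the reservation table, which is exactly the locality property point 2 describes.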
6. Answering Performance Queries
Which provider has cancelled trips the most?
What is the average number of seats booked by customers for a trip?
Which customer age brackets prefer using my platform?
What is the coverage of cities in a given region?
What is the impact of peak-day pricing on booking cancellations?
1. 3NF data models are non-performant for analysis
2. Aggregation queries are long-running and negatively impact the performance of regular queries
3. It is common to separate customer-facing systems from business-facing ones
4. Data duplication through copying is a common form of segregation
5. Choosing the timing and frequency of copying is a well-thought-out decision
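One of the questions above, the average number of seats booked per trip, can be sketched as an aggregation query. The schema and figures are illustrative assumptions:

```python
import sqlite3

# a toy reservation table standing in for the OLTP data
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE reservation (trip_id INTEGER, seats INTEGER)")
cur.executemany(
    "INSERT INTO reservation VALUES (?, ?)",
    [(10, 2), (10, 4), (11, 1), (11, 3)],
)

# on a live OLTP system, a full-table scan and aggregation like this
# competes with regular booking queries, which motivates a separate copy
cur.execute("""
    SELECT trip_id, AVG(seats) AS avg_seats
    FROM reservation
    GROUP BY trip_id
    ORDER BY trip_id
""")
avg_seats = cur.fetchall()
```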
8. Division of Responsibilities
[Diagram: Ticket Reservation System (customers) → Data Copy → Business Performance Analysis (executives)]
1. Separating workloads ensures optimal response time and/or throughput
2. A variety of means is adopted to copy data between systems
3. However, a time delay is introduced between an event and its analysis
4. Data structures for a performance analysis system are de-normalized
5. Design of the data copy mechanism is a trade-off between acceptable time delay and analysis needs
Online Transaction Processing System: "Do I have tickets for this Friday?" (expected response time in milliseconds)
Decision Support Systems: "Will an additional bus be filled up this Friday?" (inference possibly after a day-long analysis)
10. Dimensions help gain Insights for Decision Making
1. Dimensions help pivot a simple fact or a computed fact / measure
2. Using multiple dimensions allows comparison, for example
3. Charts plotted using dimensions help visualize and gain better insights
4. Good insights in turn allow the business to make better decisions
5. The success of decision support systems depends on the dimensions they support for analysis
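Pivoting a fact over two dimensions can be sketched in a few lines; the cities, months and seat counts below are illustrative, not from the deck.

```python
from collections import defaultdict

# seats booked, described by two dimensions: city and month
bookings = [
    ("Pune",   "Jan", 120),
    ("Pune",   "Feb", 150),
    ("Mumbai", "Jan", 200),
    ("Mumbai", "Feb", 180),
]

# pivot: rows are the city dimension, columns the month dimension,
# and each cell holds the measure (total seats booked)
pivot = defaultdict(dict)
for city, month, seats in bookings:
    pivot[city][month] = pivot[city].get(month, 0) + seats
```

Comparing cells across a row or a column is exactly the comparison that point 2 refers to.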
* All images are courtesy of their original owners. The intent is only to provide examples. The author does not claim any responsibility or credit
12. Decisions Impact Multiple Business Processes
1. Continuous optimization requires continuous analysis and decisions
2. Each department potentially has its own transaction system
3. Common functions across departments might be hosted on a core system, e.g., ERP*
4. An integrated view of data is necessary for analyzing performance
5. A decision support system can hence hold a copy of data from all systems
* Enterprise Resource Planning
An organization can be represented as a value chain, as below
14. Components of a Decision Support System
[Diagram, Business Performance Analysis: source systems → data transfer systems (file transfer, streaming, …) → Extract + Load → Staging → Transform + Load → Data Warehouse → Transform + Load → Data Marts → Reports / Business Intelligence / Data Analytics → Decision Makers]
1. Performance analysis and decision support need data with high fidelity*
2. The decision support ecosystem is an assembly of data flows and data storage / persistence
3. Large enterprises typically use off-the-shelf products to achieve these objectives
4. The data warehouse is usually mandated to store a single version of truth
5. The path to decision making is continuously evolving, influencing the entire chain
* Data fidelity means that as data travels from the point of origination to consumption, it retains its granularity and meaning.
16. Organizing a Data Warehouse
1. Data from processes and events is captured and published into the data warehouse
2. Data from a business process can spawn multiple OLTP databases
3. Physical tables in a data warehouse are organized into subject areas
4. Inter-table relationships are pre-established to enforce data quality
5. For speed of analysis, key facts are computed and stored as well
Business Processes and Events → Subject Areas → Tables and Relationships → Facts and data about processes
Process examples: customer on-boarding, order management, product fulfillment, payment
Subject area examples: Customers, Orders, Products, Sales
18. Facts and Dimensions
1. Replicating the OLTP database schema is sufficient for many query needs
2. The design of the schema is a function of READ needs
3. The example schema did not need any aggregation or inference
4. However, some enrichment of the dimensions improves READ-ability
5. Traceability of facts is an important requirement for data warehouse tables
Following is just one way of arranging facts and describing them with dimensions*
* This is the famous STAR SCHEMA. However, if Trip_Dimension is linked with additional dimension tables, it soon becomes a SNOWFLAKE SCHEMA
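A minimal star schema can be sketched as one fact table surrounded by dimension tables. Trip_Dimension is taken from the footnote; the other table and column names are illustrative assumptions.

```python
import sqlite3

# one central fact table, with dimension tables radiating from it
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Trip_Dimension (
    trip_key    INTEGER PRIMARY KEY,
    origin      TEXT,
    destination TEXT
);
CREATE TABLE Date_Dimension (
    date_key INTEGER PRIMARY KEY,
    day      TEXT,
    month    TEXT
);
CREATE TABLE Booking_Fact (
    trip_key INTEGER REFERENCES Trip_Dimension(trip_key),
    date_key INTEGER REFERENCES Date_Dimension(date_key),
    seats    INTEGER
);
""")
cur.execute("INSERT INTO Trip_Dimension VALUES (1, 'Pune', 'Mumbai')")
cur.execute("INSERT INTO Date_Dimension VALUES (1, 'Fri', 'Jan')")
cur.execute("INSERT INTO Booking_Fact VALUES (1, 1, 4)")

# dimensions describe the facts; here seats are summarized by month
cur.execute("""
    SELECT d.month, SUM(f.seats)
    FROM Booking_Fact f
    JOIN Date_Dimension d ON d.date_key = f.date_key
    GROUP BY d.month
""")
by_month = cur.fetchall()
```

Linking Trip_Dimension to further dimension tables of its own, as the footnote notes, would turn this star into a snowflake.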
20. Analysts Need a way to ROLL-UP or DRILL-DOWN
1. Roll-up and drill-down across dimensions are key capabilities of a DSS
2. The more dimensions, the better for analysis and insight gain
3. The OLTP system supplies facts; some dimension attributes are inferred
4. Aggregations across a hierarchy are either pre-computed and stored or dynamically ascertained
5. Pre-computation improves performance at the cost of data currency
[Diagram: a dimension hierarchy over items i1 to i6, with facts or measures at the leaves; measures pivoted by Place Dimension and Time Dimension, with roll-up and drill-down arrows]
Roll-up: summarizing while traversing up the hierarchy
Drill-down: getting into details while moving down the hierarchy
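Roll-up along a Place hierarchy can be sketched as below; the city-to-region mapping and the figures are illustrative assumptions.

```python
# drill-down level: facts at city granularity
city_region = {"Pune": "West", "Mumbai": "West", "Chennai": "South"}
seats_by_city = {"Pune": 120, "Mumbai": 200, "Chennai": 90}

# roll-up: summarize while traversing up the city -> region hierarchy
seats_by_region = {}
for city, seats in seats_by_city.items():
    region = city_region[city]
    seats_by_region[region] = seats_by_region.get(region, 0) + seats
```

Drilling down is the reverse direction: keeping seats_by_region on screen and expanding a region back into its per-city rows.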
22. Multidimensional Modeling – The Data Cube
Source: Conceptual Modeling Solutions for the Data Warehouse – Stefano Rizzi, DEIS – University of Bologna, Italy
1. Decision making is enabled by cubes, dimensions and measures
2. The decision activity is called online analytical processing – OLAP
3. ROLAP systems store data in relational form and create the cube at runtime
4. MOLAP systems store precomputed data in the form of multi-dimensional cubes
5. HOLAP is a hybrid of ROLAP and MOLAP – just enough aggregation
24. A Play on Storage and Memory Technologies
Accessed https://en.wikipedia.org/wiki/Comparison_of_OLAP_servers on 6 Jul 2019
Continuous innovation is bound to change the product capabilities over time
A snapshot of different products and their *OLAP support
1. Reporting use cases need authentic lineage, thereby insisting on precomputing
2. The trend has been to reduce precomputation to eliminate data staleness
3. For use cases with near-real-time data needs, a dynamic cube is useful
4. Raw facts are held on disk and are aggregated at runtime in-memory
5. Speed of computing is enhanced with parallel processing using clusters
[Diagram: products a to h positioned by cube processing across storage and memory; MOLAP (multi-dimensional data): a, b, c, d; ROLAP (relational data): e, h; HOLAP: a, b, e, g, h]
26. Business Need creates a case for Technology
Re-imagined from Providing OLAP to User-Analysts: an IT Mandate, accessed on 6 July 2019
1. Data warehouses are designed to cater to business users
2. All the processing and storage helps in effective operations, tactics and strategy
3. Most business users prefer data for analysis in Microsoft Excel
4. Data scientists use tools such as SAS and MATLAB to conduct experiments
5. The choice of technology is evolving to satisfy changing business needs
Categories of Business Analysis (from operational through tactical to strategic):
Categorical / Explanatory: How many people opted for a housing loan during summer vacations?
Exegetical / Slice and Dice: Understand the impact of house prices on housing loan take-off
Contemplative / What-if: What is the effect of decreasing interest rates on sales of housing loans?
Formulaic / Goal-seeking: How can I increase sales of housing loans in Tier 2 cities?
Sample technologies used at various levels: SQL, dimensional files, data platform, APIs (indicative only; varies based on needs and organizations)
28. Integration – Data, Transport and Frequencies
1. The primary purpose of integration is to get data to business users
2. Traditional methods sourced data through files and loaded it into databases
3. The need for agility – sense and act – increases the use of data directly from sources
4. Traditional extract-transform-load is giving way to data virtualization
5. All integration – data and application – is converging to reduce latencies as much as possible
Data warehouse use cases*: Traditional Data Warehouse; Real-Time Data Warehouse; Logical Data Warehouse; Context-Independent Data Warehouse
Data sources: flat files; relational databases; message topics; continuously updated log files; change data; pre-processed data; external data
Frequencies: periodic (batch), e.g. once a day; near real time, e.g. once a minute; real time, e.g. on a business event
Data access / transport: direct access to source; data moved into an intermediate location; data enriched and precomputed; data anonymized before use; data made available in-memory
* Definitions accessed here on 6 July 2019
30. It is Important to Understand Data
1. The old adage of the quality of output depending on the quality of input still holds
2. A continuously updated metadata engine ensures authenticity of outcomes
3. Changing dimensions can impact data access performance and accuracy
4. Speed of availability can be a trade-off with accuracy
5. Data governance driven by data architecture is key to managing the sanctity of the data warehouse
Uniqueness and relationships: primary key, business key, surrogate key, foreign key
Changing dimensions:
Type 1 – no change, e.g. date of birth
Type 2 – changes infrequently, e.g. manager, stored with an effective start date
Type 3 – similar to Type 2, but stores both old and new values together in the same row
Historical facts + time series: facts that indicate an event or are influenced by an event, e.g. the open and close price of a stock over a period of time
Computed measures: facts that are aggregated, inferred, derived etc., for making sense of events
Missing values: facts that genuinely indicate the absence of any value, or were erroneously not captured for an event
Partly unstructured: information in textual format requiring parsing, tokenizing and context setting to help understand and derive insights
Graph: datasets having a high degree of relationships – making relationships first-class citizens
Spatial data: datasets with multiple layers describing locations
Unstructured data: audio, video and images; usually accompanied by metadata
Each needs its own way of handling
Data can be held in a variety of formats – relational, XML, JSON, key-value, geospatial, graph
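The Type 2 changing dimension described above can be sketched as row versioning with effective start dates; the employee names and dates are illustrative assumptions.

```python
from datetime import date

# each change to an employee's manager is stored as a new row
# with an effective start date (the deck's Type 2 dimension)
manager_dim = [
    # (employee, manager, effective_start)
    ("Asha", "Ravi",  date(2018, 1, 1)),
    ("Asha", "Meera", date(2019, 4, 1)),  # manager changed
]

def manager_on(employee, as_of):
    """Return the manager effective for `employee` on `as_of`."""
    rows = [r for r in manager_dim
            if r[0] == employee and r[2] <= as_of]
    return max(rows, key=lambda r: r[2])[1] if rows else None
```

Keeping every version lets historical facts be analyzed against the dimension value that was in effect at the time of the event.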
32. Alternative Data Platforms
1. The corporate information factory uses a central integration database as a single version of truth
2. A data lake brings together all sorts of data from within and outside the organization
3. The data vault design ensures availability of all data – both clean and unclean
4. A data archive (not depicted) stores old data for reference purposes
5. Technology choice is dependent on perceived benefits and cost
Corporate Information Factory: data ingested into a central integration database in 3NF; data marts for specific purposes; historical view of all data across operational data stores
Data Lake: data, including external sources, ingested into a central cluster in raw format and accessed on demand using a variety of means; allows insight creation by merging data from multiple sources
Data Vault: data ingested into a central database as-is with a time factor; allows enhanceability while guaranteeing data accuracy; allows data to be viewed as it arrived rather than as it should have arrived
34. Data Platforms are Available as-a-Service
1. On-premise data sources are connected to a cloud service provider
2. Oracle, SAP, Snowflake, Microsoft, Amazon etc. provide data warehouse as a service (DWaaS)
3. The key value proposition is reduced capital expense in procuring and setting up infrastructure
4. An enterprise can quickly acquire a data warehouse software capability
5. Running costs and security need to be monitored
Copied from https://panoply.io/data-warehouse-guide/data-warehouse-architecture-traditional-vs-cloud/