What Is a Data Warehouse? And Why Are So Many Schools Setting Them Up?

      Richard Goerwitz
What Is a Data Warehouse?
   Nobody can agree
   So I’m not actually going to define a DW
   Don’t feel cheated, though
   By the end of this talk, you’ll
    • Understand key concepts that underlie all
      warehouse implementations (“talk the talk”)
    • Understand the various components out of
      which DW architects construct real-world data
      warehouses
    • Understand what a data warehouse project
      looks like
Why Are Schools Setting Up Data Warehouses?
   A data warehouse makes it easier to:
    •   Optimize classroom, computer lab usage
    •   Refine admissions ratings systems
    •   Forecast future demand for courses, majors
    •   Tie private spreadsheet data into central repositories
    •   Correlate admissions and IR data with outcomes such as:
            GPAs
            Placement rates
            Happiness, as measured by alumni surveys
    • Notify advisors when extra help may be needed based on
            Admissions data (student vitals; SAT, etc.)
            Special events: A-student suddenly gets a C in his/her major
            Slower trends: Student’s GPA falls for > 2 semesters/terms
    • (Many other examples could be given!)
   Better information = better decisions
    • Better admission decisions
    • Better retention rates
    • More effective fund raising, etc.
Talking The Talk
   To think and communicate usefully about data warehouses
    you’ll need to understand a set of common terms and
    concepts:
    •   OLTP
    •   ODS
    •   OLAP, ROLAP, MOLAP
    •   ETL
    •   Star schema
    •   Conformed dimension
    •   Data mart
    •   Cube
    •   Metadata
   Even if you’re not an IT person, pay heed:
    • You’ll have to communicate with IT people
    • More importantly:
         Evidence shows that IT will only build a successful warehouse if you
           are intimately involved!
OLTP
   OLTP = online transaction processing
   The process of moving data around to
    handle day-to-day affairs
    •   Scheduling classes
    •   Registering students
    •   Tracking benefits
    •   Recording payments, etc.
   Systems supporting this kind of activity
    are called transactional systems
Transactional Systems
   Transactional systems are optimized primarily for
    the here and now
    •   Can support many simultaneous users
    •   Can support heavy read/write access
    •   Allow for constant change
    •   Are big, ugly, and often don’t give people the data they
        want
            As a result a lot of data ends up in shadow databases
            Some ends up locked away in private spreadsheets
   Transactional systems don’t record all previous
    data states
   Lots of data gets thrown away or archived, e.g.:
    • Admissions data
    • Enrollment data
    • Asset tracking data (“How many computers did we
      support each year, from 1996 to 2006, and where do we
      expect to be in 2010?”)
Simple Transactional Database
   Map of Microsoft Windows Update Service
    (WUS) back-end database
    • Diagrammed using Sybase PowerDesigner
          Each green box is a database “table”
          Arrows are “joins” or foreign keys
          This is simple for an OLTP back end
More Complex Example
   Recruitment Plus back-end database
   Used by many admissions offices
   Note again:
     • Green boxes are tables
     • Lines are foreign key relationships
     • Purple boxes are views
   Considerable expertise is required to report off
    this database!
   Imagine what it’s like for even more complex systems
     • Colleague
     • SCT Banner (over 4,000 tables)
The “Reporting Problem”
   Often we require OLTP data as a snapshot, in a
    spreadsheet or report
   Reports require querying back-end OLTP support
    databases
   But OLTP databases are often very complex, and
    typically
    •   Contain many, often obscure, tables
    •   Utilize cryptic, unintuitive field/column names
    •   Don’t store all necessary historical data
   As a result, reporting becomes a problem –
    • Requires special expertise
    • May require modifications to production OLTP systems
    • Becomes harder and harder for staff to keep up!
Workarounds
   Ways of working around the reporting
    problem include:
    1. Have OLTP system vendors do the work
      •   Provide canned reports
      •   Write reporting GUIs for their products
    2. Hire more specialists
      •   To create simplified views of OLTP data (see the sketch after this list)
      •   To write reports, create snapshots
    3. Periodically copy data from OLTP systems to
       a place where
      •   The data is easier to understand
      •   The data is optimized for reporting
      •   The data is easily pluggable into reporting tools
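
   To make workaround 2 concrete, here is a minimal sketch of the kind of
    simplified view a specialist might layer over cryptic OLTP tables. Every
    identifier (stu_master, enr_detail, crs_catalog, etc.) is hypothetical,
    not taken from any particular vendor’s schema:

    -- Hypothetical simplified view over cryptic OLTP tables;
    -- all names below are illustrative, not a real vendor schema
    CREATE VIEW enrollment_summary AS
    SELECT s.stu_id    AS student_id,
           s.stu_lname AS last_name,
           c.crs_title AS course_title,
           e.enr_term  AS term,
           e.enr_grade AS grade
    FROM   stu_master  s
    JOIN   enr_detail  e ON e.enr_stu_id = s.stu_id
    JOIN   crs_catalog c ON c.crs_id     = e.enr_crs_id;

   Report writers then query enrollment_summary directly, never touching
    the underlying tables.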
ODS
   ODS = operational data store
   ODSs were an early workaround to the “reporting
    problem”
   To create an ODS you
    •   Build a separate/simplified version of an OLTP system
    •   Periodically copy data into it from the live OLTP system
    •   Hook it to operational reporting tools
   An ODS can be an integration point or real-time
    “reporting database” for an operational system
   It’s not enough for full enterprise-level,
    cross-database analytical processing
OLAP
   OLAP = online analytical processing
   OLAP is the process of creating and summarizing
    historical, multidimensional data
     •   To help users understand the data better
     •   To provide a basis for informed decisions
     •   To allow users to manipulate and explore data
         themselves, easily and intuitively
   More than just “reporting”
   Reporting is just one (static) product of
    OLAP
OLAP Support Databases
   OLAP systems require support databases
   These databases typically
    • Support fewer simultaneous users than OLTP
      back ends
    • Are structured simply; i.e., denormalized
    • Can grow large
          Hold snapshots of data in OLTP systems
          Provide history/time depth to our analyses
    • Are optimized for read (not write) access
     • Are updated via periodic batch (e.g., nightly)
       ETL processes
ETL Processes
   ETL = extract, transform, load
    • Extract data from various sources
    • Transform and clean the data from those sources
    • Load the data into databases used for analysis and
      reporting
   ETL processes are coded in various ways
    • By hand in SQL, UniBASIC, etc.
    • Using more general programming languages
    • In semi-automated fashion using specialized ETL tools
      like Cognos Decision Stream
   Most institutions do hand ETL; but note well:
    • Hand ETL is slow
    • Requires specialized knowledge
    • Becomes extremely difficult to maintain as code
      accumulates and databases/personnel change!
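
   To show what “hand ETL” looks like in practice, here is a minimal sketch
    of one nightly load step in plain SQL. The staging, dimension, and fact
    table names (stg_enrollment, dim_student, fact_enrollment) are
    assumptions for illustration:

    -- Transform/clean: standardize codes in the extracted staging rows
    UPDATE stg_enrollment
    SET    term_code = UPPER(TRIM(term_code));

    -- Load: move cleaned rows into the warehouse fact table,
    -- resolving each student's surrogate key via the dimension
    INSERT INTO fact_enrollment (student_key, term_code, credits, gpa)
    SELECT d.student_key, s.term_code, s.credits, s.gpa
    FROM   stg_enrollment s
    JOIN   dim_student    d ON d.student_id = s.student_id;

   Multiply this by dozens of sources and hundreds of columns, and it’s
    clear why hand ETL becomes hard to maintain.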
Where Does the Data Go?
   What sort of a database do the ETL
    processes dump data into?
   Typically, into very simple table
    structures
   These table structures are:
    • Denormalized
    • Minimally branched/hierarchized
    • Structured into star schemas
So What Are Star Schemas?
   Star schemas are collections of data arranged
    into star-like patterns
     • Fact tables in the middle contain amounts and
       measures (like counts, dollar amounts, GPAs)
     • Dimension tables around the outside contain
       labels and classifications (like names, geocodes, majors)
    • For faster processing, aggregate fact tables are
      sometimes also used (e.g., counts pre-averaged for an
      entire term)
   Star schemas should
    •   Have descriptive column/field labels
    •   Be easy for users to understand
    •   Perform well on queries
A Very Simple Star Schema
   Data center UPS power output
   Dimensions:
        Phase
        Time
        Date
   Facts:
        Volts
        Amps
        Etc.
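
   A minimal SQL sketch of the star just described, with assumed names
    throughout: three dimension tables surrounding one fact table that holds
    the measured volts and amps.

    -- Hypothetical DDL for the UPS power-output star
    CREATE TABLE dim_date  (date_key   INTEGER PRIMARY KEY,
                            full_date  DATE,
                            year_num   INTEGER);
    CREATE TABLE dim_time  (time_key   INTEGER PRIMARY KEY,
                            hour_num   INTEGER,
                            minute_num INTEGER);
    CREATE TABLE dim_phase (phase_key  INTEGER PRIMARY KEY,
                            phase_name VARCHAR(10));  -- e.g., 'Phase 1'

    -- The fact table in the middle: one row per phase per reading
    CREATE TABLE fact_ups_output (
        date_key  INTEGER REFERENCES dim_date(date_key),
        time_key  INTEGER REFERENCES dim_time(time_key),
        phase_key INTEGER REFERENCES dim_phase(phase_key),
        volts     NUMERIC(6,1),
        amps      NUMERIC(6,1)
    );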
A More Complex Star Schema
   Freshman survey data (HERI/CIRP)
   Dimensions:
     • Questions
     • Survey years
     • Data about test takers
   Facts:
     • Answer (text)
     • Answer (raw)
     • Count (1)
   Oops – not a star; it’s snowflaked!
     Answers should have been placed in their own
      dimension (creating a “factless fact table”).
      I’ll demo a better version of this star later!
Data Marts
   One definition:
    • One or more star schemas that present data on a single
      or related set of business processes
   Data marts should not be built in isolation
   They need to be connected via dimension tables
    that are
    • The same or subsets of each other
    • Hierarchized the same way internally
   So, e.g., if I construct data marts for…
    • GPA trends, student major trends, enrollments
    • Freshman survey data, senior survey data, etc.
   …I connect these marts via a conformed student
    dimension
    • Makes correlation of data across star schemas intuitive
    • Makes it easier for OLAP tools to use the data
    • Allows nonspecialists to do much of the work
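
   As an illustrative sketch (all names assumed), a conformed dim_student
    is what lets you “drill across” two marts: aggregate each fact table to
    the same dimensional grain, then join the results.

    -- Drill across two marts via the conformed student dimension;
    -- table/column names here are hypothetical
    SELECT g.class_year, g.avg_gpa, s.avg_answer
    FROM  (SELECT d.class_year, AVG(f.gpa) AS avg_gpa
           FROM   fact_gpa_trend f
           JOIN   dim_student d ON d.student_key = f.student_key
           GROUP  BY d.class_year) g
    JOIN  (SELECT d.class_year, AVG(f.answer_raw) AS avg_answer
           FROM   fact_fresh_survey f
           JOIN   dim_student d ON d.student_key = f.student_key
           GROUP  BY d.class_year) s
      ON  g.class_year = s.class_year;

   Without a conformed dimension, the two marts would have no common keys
    and this correlation would require custom matching logic.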
Simple Data Mart Example
   UPS battery star
        By battery: run-time, % charged, current
   Input star
        By phase: voltage, current
   Output star
        By phase: voltage, current
   Sensor star
        By sensor: temp, humidity
   Note conformed date, time dimensions!
CIRP Star/Data Mart
   CIRP Freshman survey data
   Corrected from a previous slide
   Note the CirpAnswer dimension
   Note the student dimension (ties in with other marts)
CIRP Mart in Cognos BI 8
ROLAP, MOLAP
   ROLAP = OLAP via direct relational query
    • E.g., against a (materialized) view
    • Against star schemas in a warehouse
   MOLAP = OLAP via multidimensional
    database (MDB)
    • MDB is a special kind of database
    • Treats data kind of like a big, fast spreadsheet
    • MDBs typically draw data in from a data
      warehouse
          Built to work best with star schemas
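
   ROLAP, in other words, is ordinary SQL run straight against the star.
    A hedged example, reusing the hypothetical UPS star sketched earlier:

    -- ROLAP: slice and dice by querying the star schema directly
    SELECT dd.year_num,
           dp.phase_name,
           AVG(f.volts) AS avg_volts,
           MAX(f.amps)  AS peak_amps
    FROM   fact_ups_output f
    JOIN   dim_date  dd ON dd.date_key  = f.date_key
    JOIN   dim_phase dp ON dp.phase_key = f.phase_key
    GROUP  BY dd.year_num, dp.phase_name;

   A MOLAP tool would instead precompute such aggregates into a
    multidimensional store and answer the same question from there.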
Data Cubes
   The term data cube
    means different things to
    different people
   Various definitions:
    •   A star schema
    •   Any DB view used for
        reporting
    •   A three-dimensional
        array in a MDB
    •   Any multidimensional
        MDB array (really a
        hypercube)
   Which definition do you
    suppose is technically
    correct?
Metadata
   Metadata = data about data
   In a data warehousing context it can mean many
    things
    •   Information on data in source OLTP systems
    •   Information on ETL jobs and what they do to the data
    •   Information on data in marts/star schemas
    •   Documentation in OLAP tools on the data they
        manipulate
   Many institutions make metadata available via
    data malls or warehouse portals, e.g.:
    •   University of New Mexico
    •   UC Davis
     •   Rensselaer Polytechnic Institute
    •   University of Illinois
   Good ETL tools automate the setup of
    malls/portals!
The Data Warehouse
   OK now we’re experts in terms like OLTP, OLAP,
    star schema, metadata, etc.
   Let’s use some of these terms to describe how a
    DW works:
    •   Provides ample metadata – data about the data
    •   Utilizes easy-to-understand column/field names
    •   Feeds multidimensional databases (MDBs)
    •   Is updated via periodic (mainly nightly) ETL jobs
    •   Presents data in a simplified, denormalized form
    •   Utilizes star-like fact/dimension table schemas
    •   Encompasses multiple, smaller data “marts”
    •   Supports OLAP tools (Access/Excel, Safari, Cognos BI)
    •   Derives data from (multiple) back-end OLTP systems
    •   Houses historical data, and can grow very big
A Data Warehouse is Not…
   Vendor and consultant proclamations
    aside, a data warehouse is not:
    • A project
          With a specific end date
    • A product you buy from a vendor
          Like an ODS (such as SCT’s)
          A canned “warehouse” supplied by iStrategy
          Cognos ReportNet
    • A database schema or instance
          Like Oracle
          SQL Server
    • A cut-down version of your live transactional
      database
Kimball & Caserta’s Definition
   According to Ralph Kimball and Joe
    Caserta, a data warehouse is:
         A system that extracts, cleans, conforms, and
         delivers source data into a dimensional data
         store and then supports and implements
         querying and analysis for the purpose of
         decision making.

   Another def.: The union of all the enterprise’s data marts
   Aside: The Kimball model is not without some critics:
     •   E.g., Bill Inmon
Example Data Warehouse (1)
   This one is RPI’s
   Five parts:
     •   Sources
     •   ETL stuff
     •   DW proper
     •   Cubes etc.
     •   OLAP apps
Example Data Warehouse (2)
   Caltech’s DW
   Five parts:
     • Source systems
     • ETL processes
     • Data marts
     • FM/metadata
     • Reporting and analysis tools
   Note: They’re also customers of Cognos!
So Where is Colorado College?
   Phil Goldstein (Educause Center for Applied
    Research fellow) identifies the major deployment
    levels:
    •   Level 1: Transactional systems only
    •   Level 2a: ODS or single data mart; no ETL
    •   Level 2: ODS or single data mart with ETL tools
    •   Level 3a: Warehouse or multiple marts; no ETL; OLAP
    •   Level 3b: Warehouse or multiple marts; ETL; OLAP
    •   Level 3: Enterprise-wide warehouse or multiple marts;
        ETL tools; OLAP tools
   Goldstein’s study was just released in late 2005
   It’s very good; based on real survey data
   Which level is Colorado College at?
Implementing a Data Warehouse
   In many organizations IT people want to huddle and work
    out a warehousing plan, but in fact
    • The purpose of a DW is decision support
    • The primary audience of a DW is therefore College decision
      makers
     • It is therefore College decision makers who must determine
           Scope
           Priority
           Resources
   Decision makers can’t make these determinations without
    an understanding of data warehouses
   It is therefore imperative that key decision makers first be
    educated about data warehouses
    • Once this occurs, it is possible to
           Elicit requirements (a critical step that’s often skipped)
           Determine priorities/scope
           Formulate a budget
           Create a plan and timeline, with real milestones and deliverables!
Is This Really a Good Plan?
   Sure, according to Phil Goldstein (Educause Center for
    Applied Research)
   He’s conducted extensive surveys on “academic analytics”
    (= business intelligence for higher ed)
   His four recommendations for improving analytics:
     •   Key decision makers must lead the way
    •   Technologists must collaborate
        •   Must collect requirements
        •   Must form strong partnerships with functional sponsors
    •   IT must build the needed infrastructure
        •   Carleton violated this rule with Cognos BI
        •   As we discovered, without an ETL/warehouse infrastructure,
            success with OLAP is elusive
    •   Staff must train and develop deep analysis skills
   Goldstein’s findings closely mirror the advice of industry
    heavyweights – Ralph Kimball, Laura Reeves, Margy Ross,
    Warren Thornthwaite, etc.
Isn’t a DW a Huge Undertaking?
   Sure, it can be huge
   Don’t hold on too tightly to the big-
    sounding word, “warehouse”
   Luminaries like Ralph Kimball have shown
    that a data warehouse can be built
    incrementally
    • Can start with just a few data marts
    • Targeted consulting help will ensure proper,
      extensible architecture and tool selection
What Takes Up the Most Time?
   You may be surprised to learn what DW step
    takes the most time
   Try guessing which:
     •   Hardware
     •   Physical database setup
     •   Database design
     •   ETL
     •   OLAP setup
   [Chart: relative time spent on hardware, database,
    ETL, schemas, and OLAP tools]

 Acc. to Kimball & Caserta, ETL will eat up 70% of the time.
 Other analysts give estimates ranging from 50% to 80%. The
 most often underestimated part of the warehouse project!
Eight Month Initial Deployment
              Step                Duration                Step              Duration
Begin educating decision makers   21 days    Secure, configure network     1 day
Collect requirements              14 days    Deploy physical “target” DB   4 days
Decide general DW design          7 days     Learn/deploy ETL tool         28 days
Determine budget                  3 days     Choose/set up modeling tool   21 days
Identify project roles            1 day      Design initial data mart      7 days
Eval/choose ETL tool              21 days    Design ETL processes          28 days
Eval/choose physical DB           14 days    Hook up OLAP tools            7 days
Spec/order, configure server      20 days    Publicize, train, train       21 days
Conclusion
   Information is held in transactional systems
    • But transactional systems are complex
    • They don’t talk to each other well; each is a silo
    • They require specially trained people to report off of
   For normal people to explore institutional data, data in
    transactional systems needs to be
    •   Renormalized as star schemas
    •   Moved to a system optimized for analysis
    •   Merged into a unified whole in a data warehouse
   Note: This process must be led by “customers”
    • Yes, IT people must build the infrastructure
    • But IT people aren’t the main customers
   So who are the customers?
    •   Admissions officers trying to make good admission decisions
    •   Student counselors trying to find/help students at risk
     •   Development officers raising funds that support the College
    •   Alumni affairs people trying to manage volunteers
    •   Faculty deans trying to right-size departments
    •   IT people managing software/hardware assets, etc….
