What is a Data Warehouse?
 And Why Are So Many
Schools Setting Them Up?

      Richard Goerwitz
What Is a Data Warehouse?
   Nobody can agree
   So I’m not actually going to define a DW
   Don’t feel cheated, though...
Why Are Schools Setting Up
           Data Warehouses?
   A data warehouse makes it easier to:
    •   Optimize classroom...
Talking The Talk
   To think and communicate usefully about data warehouses
    you’ll need to understand a set of common...
OLTP
   OLTP = online transaction processing
   The process of moving data around to
    handle day-to-day affairs
    •...
Transactional Systems
   Transactional systems are optimized primarily for
    the here and now
    •   Can support many ...
Simple Transactional Database
                  Map of Microsoft
                   Windows Update
                   Ser...
More Complex Example
   Recruitment Plus back-end
    database
   Used by many admissions
    offices
   Note again:
  ...
The “Reporting Problem”
   Often we require OLTP data as a snapshot, in a
    spreadsheet or report
   Reports require q...
Workarounds
   Ways of working around the reporting
    problem include:
    1. Have OLTP system vendors do the work
    ...
ODS
   ODS = operational data store
   ODSs were an early workaround to the “reporting
    problem”
   To create an ODS...
OLAP
   OLAP = online analytical processing
   OLAP is the process of creating and
    summarizing historical, multidime...
OLAP Support Databases
   OLAP systems require support databases
   These databases typically
    • Support fewer simult...
ETL Processes
   ETL = extract, transform, load
    • Extract data from various sources
    • Transform and clean the dat...
Where Does the Data Go?
   What sort of a database do the ETL
    processes dump data into?
   Typically, into very simp...
So What Are Star Schemas?
   Star schemas are collections of data arranged
    into star-like patterns
    • They have fa...
A Very Simple Star Schema
Data Center UPS
 Power Output

Dimensions:
   Phase
   Time
   Date
Facts:
   Volts
   Amps
   E...
A More Complex Star Schema
                                                       Freshman survey
                       ...
Data Marts
   One definition:
    • One or more star schemas that present data on a single
      or related set of busine...
Simple Data Mart Example
UPS
Battery star
   By battery
        Run-time
        % charged
        Current
Input star
   B...
CIRP Star/Data Mart
                   CIRP
                    Freshman
                    survey data
                ...
CIRP Mart in Cognos BI 8
ROLAP, MOLAP
   ROLAP = OLAP via direct relational query
    • E.g., against a (materialized) view
    • Against star sch...
Data Cubes
   The term data cube
    means different things to
    different people
   Various definitions:
    •   A st...
Metadata
   Metadata = data about data
   In a data warehousing context it can mean many
    things
    •   Information ...
The Data Warehouse
   OK now we’re experts in terms like OLTP, OLAP,
    star schema, metadata, etc.
   Let’s use some o...
A Data Warehouse is Not…
   Vendor and consultant proclamations
    aside, a data warehouse is not:
    • A project
     ...
Kimball & Caserta’s Definition
   According to Ralph Kimball and Joe
    Caserta, a data warehouse is:
         A system ...
Example Data Warehouse (1)
   This one is
    RPI’s
   5 parts:
    •   Sources
    •   ETL stuff
    •   DW proper
    ...
Example Data Warehouse (2)
                     Caltech’s DW
                     Five Parts:
                      • So...
So Where is Colorado College?
   Phil Goldstein (Educause Center for Applied
    Research fellow) identifies the major de...
Implementing a Data Warehouse
   In many organizations IT people want to huddle and work
    out a warehousing plan, but ...
Is This Really a Good Plan?
   Sure, according to Phil Goldstein (Educause Center for
    Applied Research)
   He’s cond...
Isn’t a DW a Huge Undertaking?
   Sure, it can be huge
   Don’t hold on too tightly to the big-
    sounding word, “ware...
What Takes Up the Most Time?
   You may be surprised          90

    to learn what DW step         80
                  ...
Eight Month Initial Deployment
              Step                Duration                Step              Duration
Begin ...
Conclusion
   Information is held in transactional systems
    • But transactional systems are complex
    • They don’t t...
Upcoming SlideShare
Loading in …5
×

Whats A Data Warehouse

2,366 views
2,228 views

Published on

Published in: Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
2,366
On SlideShare
0
From Embeds
0
Number of Embeds
12
Actions
Shares
0
Downloads
121
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Whats A Data Warehouse

  1. 1. What is a Data Warehouse? And Why Are So Many Schools Setting Them Up? Richard Goerwitz
  2. 2. What Is a Data Warehouse?  Nobody can agree  So I’m not actually going to define a DW  Don’t feel cheated, though  By the end of this talk, you’ll • Understand key concepts that underlie all warehouse implementations (“talk the talk”) • Understand the various components out of which DW architects construct real-world data warehouses • Understand what a data warehouse project looks like
  3. 3. Why Are Schools Setting Up Data Warehouses?  A data warehouse makes it easier to: • Optimize classroom, computer lab usage • Refine admissions ratings systems • Forecast future demand for courses, majors • Tie private spreadsheet data into central repositories • Correlate admissions and IR data with outcomes such as:  GPAs  Placement rates  Happiness, as measured by alumni surveys • Notify advisors when extra help may be needed based on  Admissions data (student vitals; SAT, etc.)  Special events: A-student suddenly gets a C in his/her major  Slower trends: Student’s GPA falls for > 2 semesters/terms • (Many other examples could be given!)  Better information = better decisions • Better admission decisions • Better retention rates • More effective fund raising, etc.
  4. 4. Talking The Talk  To think and communicate usefully about data warehouses you’ll need to understand a set of common terms and concepts: • OLTP • ODS • OLAP, ROLAP, MOLAP • ETL • Star schema • Conformed dimension • Data mart • Cube • Metadata  Even if you’re not an IT person, pay heed: • You’ll have to communicate with IT people • More importantly: Evidence shows that IT will only build a successful warehouse if you are intimately involved!
  5. 5. OLTP  OLTP = online transaction processing  The process of moving data around to handle day-to-day affairs • Scheduling classes • Registering students • Tracking benefits • Recording payments, etc.  Systems supporting this kind of activity are called transactional systems
  6. 6. Transactional Systems  Transactional systems are optimized primarily for the here and now • Can support many simultaneous users • Can support heavy read/write access • Allow for constant change • Are big, ugly, and often don’t give people the data they want  As a result a lot of data ends up in shadow databases  Some ends up locked away in private spreadsheets  Transactional systems don’t record all previous data states  Lots of data gets thrown away or archived, e.g.: • Admissions data • Enrollment data • Asset tracking data (“How many computers did we support each year, from 1996 to 2006, and where do we expect to be in 2010?”)
  7. 7. Simple Transactional Database  Map of Microsoft Windows Update Service (WUS) back-end database • Diagrammed using Sybase PowerDesigner  Each green box is a database “table”  Arrows are “joins” or foreign keys  This is simple for an OLTP back end
  8. 8. More Complex Example  Recruitment Plus back-end database  Used by many admissions offices  Note again: • Green boxes are tables • Lines are foreign key relationships • Purple boxes are views  Considerable expertise is required to report off this database!  Imagine what it’s like for even more complex systems • Colleague • SCT Banner (over 4,000 tables)
  9. 9. The “Reporting Problem”  Often we require OLTP data as a snapshot, in a spreadsheet or report  Reports require querying back-end OLTP support databases  But OLTP databases are often very complex, and typically • Contain many, often obscure, tables • Utilize cryptic, unintuitive field/column names • Don’t store all necessary historical data  As a result, reporting becomes a problem – • Requires special expertise • May require modifications to production OLTP systems • Becomes harder and harder for staff to keep up!
  10. 10. Workarounds  Ways of working around the reporting problem include: 1. Have OLTP system vendors do the work • Provide canned reports • Write reporting GUIs for their products 2. Hire more specialists • To create simplified views of OLTP data • To write reports, create snapshots 3. Periodically copy data from OLTP systems to a place where • The data is easier to understand • The data is optimized for reporting • Easily pluggable into reporting tools
  11. 11. ODS  ODS = operational data store  ODSs were an early workaround to the “reporting problem”  To create an ODS you • Build a separate/simplified version of an OLTP system • Periodically copy data into it from the live OLTP system • Hook it to operational reporting tools  An ODS can be an integration point or real-time “reporting database” for an operational system  It’s not enough for full enterprise-level, cross- database analytical processing
  12. 12. OLAP  OLAP = online analytical processing  OLAP is the process of creating and summarizing historical, multidimensional data • To help users understand the data better • Provide a basis for informed decisions • Allow users to manipulate and explore data themselves, easily and intuitively  More than just “reporting”  Reporting is just one (static) product of OLAP
  13. 13. OLAP Support Databases  OLAP systems require support databases  These databases typically • Support fewer simultaneous users than OLTP back ends • Are structured simply; i.e., denormalized • Can grow large  Hold snapshots of data in OLTP systems  Provide history/time depth to our analyses • Are optimized for read (not write) access • Updated via periodic batch (e.g., nightly) ETL processes
  14. 14. ETL Processes  ETL = extract, transform, load • Extract data from various sources • Transform and clean the data from those sources • Load the data into databases used for analysis and reporting  ETL processes are coded in various ways • By hand in SQL, UniBASIC, etc. • Using more general programming languages • In semi-automated fashion using specialized ETL tools like Cognos Decision Stream  Most institutions do hand ETL; but note well: • Hand ETL is slow • Requires specialized knowledge • Becomes extremely difficult to maintain as code accumulates and databases/personnel change!
  15. 15. Where Does the Data Go?  What sort of a database do the ETL processes dump data into?  Typically, into very simple table structures  These table structures are: • Denormalized • Minimally branched/hierarchized • Structured into star schemas
  16. 16. So What Are Star Schemas?  Star schemas are collections of data arranged into star-like patterns • They have fact tables in the middle, which contain amounts, measures (like counts, dollar amounts, GPAs) • Dimension tables around the outside, which contain labels and classifications (like names, geocodes, majors) • For faster processing, aggregate fact tables are sometimes also used (e.g., counts pre-averaged for an entire term)  Star schemas should • Have descriptive column/field labels • Be easy for users to understand • Perform well on queries
  17. 17. A Very Simple Star Schema Data Center UPS Power Output Dimensions: Phase Time Date Facts: Volts Amps Etc.
  18. 18. A More Complex Star Schema  Freshman survey data (HERI/CIRP)  Dimensions: • Questions • Survey years • Data about test takers  Facts: • Answer (text) • Answer (raw) • Count (1)  Oops • Not a star Oops, answers should have been placed in their • Snowflaked! own dimension (creating a “factless fact table”). I’ll demo a better version of this star later!
  19. 19. Data Marts  One definition: • One or more star schemas that present data on a single or related set of business processes  Data marts should not be built in isolation  They need to be connected via dimensional tables that are • The same or subsets of each other • Hierarchized the same way internally  So, e.g., if I construct data marts for… • GPA trends, student major trends, enrollments • Freshman survey data, senior survey data, etc.  …I connect these marts via a conformed student dimension • Makes correlation of data across star schemas intuitive • Makes it easier for OLAP tools to use the data • Allows nonspecialists to do much of the work
  20. 20. Simple Data Mart Example UPS Battery star By battery Run-time % charged Current Input star By phase Voltage Current Output star By phase Voltage Current Sensor star By sensor Temp Humidity Note conformed date, time dimensions!
  21. 21. CIRP Star/Data Mart  CIRP Freshman survey data  Corrected from a previous slide  Note the CirpAnswer dimension  Note student dimension (ties in with other marts)
  22. 22. CIRP Mart in Cognos BI 8
  23. 23. ROLAP, MOLAP  ROLAP = OLAP via direct relational query • E.g., against a (materialized) view • Against star schemas in a warehouse  MOLAP = OLAP via multidimensional database (MDB) • MDB is a special kind of database • Treats data kind of like a big, fast spreadsheet • MDBs typically draw data in from a data warehouse  Built to work best with star schemas
  24. 24. Data Cubes  The term data cube means different things to different people  Various definitions: • A star schema • Any DB view used for reporting • A three-dimensional array in a MDB • Any multidimensional MDB array (really a hypercube)  Which definition do you suppose is technically correct?
  25. 25. Metadata  Metadata = data about data  In a data warehousing context it can mean many things • Information on data in source OLTP systems • Information on ETL jobs and what they do to the data • Information on data in marts/star schemas • Documentation in OLAP tools on the data they manipulate  Many institutions make metadata available via data malls or warehouse portals, e.g.: • University of New Mexico • UC Davis • Rensselear Polytechnic Institute • University of Illinois  Good ETL tools automate the setup of malls/portals!
  26. 26. The Data Warehouse  OK now we’re experts in terms like OLTP, OLAP, star schema, metadata, etc.  Let’s use some of these terms to describe how a DW works: • Provides ample metadata – data about the data • Utilizes easy-to-understand column/field names • Feeds multidimensional databases (MDBs) • Is updated via periodic (mainly nightly) ETL jobs • Presents data in a simplified, denormalized form • Utilizes star-like fact/dimension table schemas • Encompasses multiple, smaller data “marts” • Supports OLAP tools (Access/Excel, Safari, Cognos BI) • Derives data from (multiple) back-end OLTP systems • Houses historical data, and can grow very big
  27. 27. A Data Warehouse is Not…  Vendor and consultant proclamations aside, a data warehouse is not: • A project  With a specific end date • A product you buy from a vendor  Like an ODS (such as SCT’s)  A canned “warehouse” supplied by iStrategy  Cognos ReportNet • A database schema or instance  Like Oracle  SQL Server • A cut-down version of your live transactional database
  28. 28. Kimball & Caserta’s Definition  According to Ralph Kimball and Joe Caserta, a data warehouse is: A system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.  Another def.: The union of all the enterprise’s data marts  Aside: The Kimball model is not without some critics: • E.g., Bill Inmon
  29. 29. Example Data Warehouse (1)  This one is RPI’s  5 parts: • Sources • ETL stuff • DW proper • Cubes etc. • OLAP apps
  30. 30. Example Data Warehouse (2)  Caltech’s DW  Five Parts: • Source systems • ETL processes • Data marts • FM/metadata • Reporting and analysis tools • Note: They’re also customers of Cognos!
  31. 31. So Where is Colorado College?  Phil Goldstein (Educause Center for Applied Research fellow) identifies the major deployment levels: • Level 1: Transactional systems only • Level 2a: ODS or single data mart; no ETL • Level 2: ODS or single data mart with ETL tools • Level 3a: Warehouse or multiple marts; no ETL; OLAP • Level 3b: Warehouse or multiple marts; ETL; OLAP • Level 3: Enterprise-wide warehouse or multiple marts; ETL tools; OLAP tools  Goldstein’s study was just released in late 2005  It’s very good; based on real survey data  Which level is Colorado College at?
  32. 32. Implementing a Data Warehouse  In many organizations IT people want to huddle and work out a warehousing plan, but in fact • The purpose of a DW is decision support • The primary audience of a DW is therefore College decision makers • It is College decision makers therefore who must determine  Scope  Priority  Resources  Decision makers can’t make these determinations without an understanding of data warehouses  It is therefore imperative that key decision makers first be educated about data warehouses • Once this occurs, it is possible to  Elicit requirements (a critical step that’s often skipped)  Determine priorities/scope  Formulate a budget  Create a plan and timeline, with real milestones and deliverables!
  33. 33. Is This Really a Good Plan?  Sure, according to Phil Goldstein (Educause Center for Applied Research)  He’s conducted extensive surveys on “academic analytics” (= business intelligence for higher ed)  His four recommendations for improving analytics: • Key decisionmakers must lead the way • Technologists must collaborate • Must collect requirements • Must form strong partnerships with functional sponsors • IT must build the needed infrastructure • Carleton violated this rule with Cognos BI • As we discovered, without an ETL/warehouse infrastructure, success with OLAP is elusive • Staff must train and develop deep analysis skills  Goldstein’s findings mirror closely the advice of industry heavyweights – Ralph Kimball, Laura Reeves, Margie Ross, Warren Thornthwaite, etc.
  34. 34. Isn’t a DW a Huge Undertaking?  Sure, it can be huge  Don’t hold on too tightly to the big- sounding word, “warehouse”  Luminaries like Ralph Kimball have shown that a data warehouse can be built incrementally • Can start with just a few data marts • Targeted consulting help will ensure proper, extensible architecture and tool selection
  35. 35. What Takes Up the Most Time?  You may be surprised 90 to learn what DW step 80 70 takes the most time 60 Hardware Try guessing which: 50 East Database  ETL 40 West Schemas • Hardware 30 North OLAP tools 20 • Physical database setup 10 • Database design 0 1st Qtr 2nd Qtr 3rd Qtr 4th Qtr • ETL • OLAP setup Acc. to Kimball & Caserta, ETL will eat up 70% of the time. Other analysts give estimates ranging from 50% to 80%. The most often underestimated part of the warehouse project!
  36. 36. Eight Month Initial Deployment Step Duration Step Duration Begin educating decision makers 21 days Secure, configure network 1 day Collect requirements 14 days Deploy physical “target” DB 4 days Decide general DW design 7 days Learn/deploy ETL tool 28 days Determine budget 3 days Choose/set up modeling tool 21 days Identify project roles 1 day Design initial data mart 7 days Eval/choose ETL tool 21 days Design ETL processes 28 days Eval/choose physical DB 14 days Hook up OLAP tools 7 days Spec/order, configure server 20 days Publicize, train, train 21 days
  37. 37. Conclusion  Information is held in transactional systems • But transactional systems are complex • They don’t talk to each other well; each is a silo • They require specially trained people to report off of  For normal people to explore institutional data, data in transactional systems needs to be • Renormalized as star schemas • Moved to a system optimized for analysis • Merged into a unified whole in a data warehouse  Note: This process must be led by “customers” • Yes, IT people must build the infrastructure • But IT people aren’t the main customers  So who are the customers? • Admissions officers trying to make good admission decisions • Student counselors trying to find/help students at risk • Development offers raising funds that support the College • Alumni affairs people trying to manage volunteers • Faculty deans trying to right-size departments • IT people managing software/hardware assets, etc….

×