What's A Data Warehouse
    Presentation Transcript

    • What is a Data Warehouse? And Why Are So Many Schools Setting Them Up? Richard Goerwitz
    • What Is a Data Warehouse?
      • Nobody can agree
      • So I’m not actually going to define a DW
      • Don’t feel cheated, though
      • By the end of this talk, you’ll
        • Understand key concepts that underlie all warehouse implementations (“talk the talk”)
        • Understand the various components out of which DW architects construct real-world data warehouses
        • Understand what a data warehouse project looks like
    • Why Are Schools Setting Up Data Warehouses?
      • A data warehouse makes it easier to:
        • Optimize classroom, computer lab usage
        • Refine admissions ratings systems
        • Forecast future demand for courses, majors
        • Tie private spreadsheet data into central repositories
        • Correlate admissions and IR data with outcomes such as:
          • GPAs
          • Placement rates
          • Happiness, as measured by alumni surveys
        • Notify advisors when extra help may be needed based on
          • Admissions data (student vitals; SAT, etc.)
          • Special events: A-student suddenly gets a C in his/her major
          • Slower trends: Student’s GPA falls for > 2 semesters/terms
        • (Many other examples could be given!)
      • Better information = better decisions
        • Better admission decisions
        • Better retention rates
        • More effective fund raising, etc.
    • Talking The Talk
      • To think and communicate usefully about data warehouses you’ll need to understand a set of common terms and concepts:
        • OLTP
        • ODS
        • OLAP, ROLAP, MOLAP
        • ETL
        • Star schema
        • Conformed dimension
        • Data mart
        • Cube
        • Metadata
      • Even if you’re not an IT person, pay heed:
        • You’ll have to communicate with IT people
        • More importantly:
          • Evidence shows that IT will only build a successful warehouse if you are intimately involved!
    • OLTP
      • OLTP = online transaction processing
      • The process of moving data around to handle day-to-day affairs
        • Scheduling classes
        • Registering students
        • Tracking benefits
        • Recording payments, etc.
      • Systems supporting this kind of activity are called transactional systems
    • Transactional Systems
      • Transactional systems are optimized primarily for the here and now
        • Can support many simultaneous users
        • Can support heavy read/write access
        • Allow for constant change
        • Are big, ugly, and often don’t give people the data they want
          • As a result a lot of data ends up in shadow databases
          • Some ends up locked away in private spreadsheets
      • Transactional systems don’t record all previous data states
      • Lots of data gets thrown away or archived, e.g.:
        • Admissions data
        • Enrollment data
        • Asset tracking data (“How many computers did we support each year, from 1996 to 2006, and where do we expect to be in 2010?”)
    • Simple Transactional Database
      • Map of Microsoft Windows Update Service (WUS) back-end database
        • Diagrammed using Sybase PowerDesigner
          • Each green box is a database “table”
          • Arrows are “joins” or foreign keys
          • This is simple for an OLTP back end
    • More Complex Example
      • Recruitment Plus back-end database
      • Used by many admissions offices
      • Note again:
        • Green boxes are tables
        • Lines are foreign key relationships
        • Purple boxes are views
      • Considerable expertise is required to report off this database!
      • Imagine what it’s like for even more complex systems
        • Colleague
        • SCT Banner (over 4,000 tables)
    • The “Reporting Problem”
      • Often we require OLTP data as a snapshot, in a spreadsheet or report
      • Reports require querying back-end OLTP support databases
      • But OLTP databases are often very complex, and typically
        • Contain many, often obscure, tables
        • Utilize cryptic, unintuitive field/column names
        • Don’t store all necessary historical data
      • As a result, reporting becomes a problem –
        • Requires special expertise
        • May require modifications to production OLTP systems
        • Becomes harder and harder for staff to keep up!
    • Workarounds
      • Ways of working around the reporting problem include:
        • Have OLTP system vendors do the work
          • Provide canned reports
          • Write reporting GUIs for their products
        • Hire more specialists
          • To create simplified views of OLTP data
          • To write reports, create snapshots
        • Periodically copy data from OLTP systems to a place where
          • The data is easier to understand
          • The data is optimized for reporting
          • Easily pluggable into reporting tools
    • ODS
      • ODS = operational data store
      • ODSs were an early workaround to the “reporting problem”
      • To create an ODS you
        • Build a separate/simplified version of an OLTP system
        • Periodically copy data into it from the live OLTP system
        • Hook it to operational reporting tools
      • An ODS can be an integration point or real-time “reporting database” for an operational system
      • It’s not enough for full enterprise-level, cross-database analytical processing
    • OLAP
      • OLAP = online analytical processing
      • OLAP is the process of creating and summarizing historical, multidimensional data
        • To help users understand the data better
        • Provide a basis for informed decisions
        • Allow users to manipulate and explore data themselves, easily and intuitively
      • More than just “reporting”
      • Reporting is just one (static) product of OLAP
    • OLAP Support Databases
      • OLAP systems require support databases
      • These databases typically
        • Support fewer simultaneous users than OLTP back ends
        • Are structured simply; i.e., denormalized
        • Can grow large
          • Hold snapshots of data in OLTP systems
          • Provide history/time depth to our analyses
        • Are optimized for read (not write) access
        • Are updated via periodic batch (e.g., nightly) ETL processes
    • ETL Processes
      • ETL = extract, transform, load
        • Extract data from various sources
        • Transform and clean the data from those sources
        • Load the data into databases used for analysis and reporting
      • ETL processes are coded in various ways
        • By hand in SQL, UniBASIC, etc.
        • Using more general programming languages
        • In semi-automated fashion using specialized ETL tools like Cognos Decision Stream
      • Most institutions do hand ETL; but note well:
        • Hand ETL is slow
        • Requires specialized knowledge
        • Becomes extremely difficult to maintain as code accumulates and databases/personnel change!
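The three ETL stages can be sketched in a few lines of Python. Everything below is hypothetical: the source records, the cryptic field names (ID, LNAME, SAT_V), and the in-memory list standing in for the warehouse load.

```python
# Minimal hand-coded ETL sketch. Source data, field names, and the
# in-memory "warehouse" target are all invented for illustration.

def extract():
    # Stand-in for pulling raw rows from a source OLTP system.
    return [
        {"ID": "001", "LNAME": " smith ", "SAT_V": "640", "SAT_M": "710"},
        {"ID": "002", "LNAME": "JONES",   "SAT_V": "",    "SAT_M": "590"},
    ]

def transform(rows):
    # Clean and conform: readable names, trimmed/cased text, typed numbers.
    out = []
    for r in rows:
        out.append({
            "student_id": r["ID"],
            "last_name":  r["LNAME"].strip().title(),
            "sat_verbal": int(r["SAT_V"]) if r["SAT_V"] else None,
            "sat_math":   int(r["SAT_M"]),
        })
    return out

def load(rows, target):
    # Stand-in for a bulk insert into the warehouse's star schemas.
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse[0]["last_name"])  # Smith
```

The transform stage is where most of the real effort goes: renaming cryptic columns, cleaning text, and coercing types, which is why hand ETL becomes hard to maintain as sources multiply.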
    • Where Does the Data Go?
      • What sort of a database do the ETL processes dump data into?
      • Typically, into very simple table structures
      • These table structures are:
        • Denormalized
        • Minimally branched/hierarchized
        • Structured into star schemas
    • So What Are Star Schemas?
      • Star schemas are collections of data arranged into star-like patterns
        • They have fact tables in the middle, which contain amounts, measures (like counts, dollar amounts, GPAs)
        • Dimension tables around the outside, which contain labels and classifications (like names, geocodes, majors)
        • For faster processing, aggregate fact tables are sometimes also used (e.g., counts pre-averaged for an entire term)
      • Star schemas should
        • Have descriptive column/field labels
        • Be easy for users to understand
        • Perform well on queries
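As a concrete sketch of the fact/dimension split, here is a toy star in SQLite. The table and column names (dim_student, fact_gpa) and the data are invented for illustration, not taken from the talk.

```python
import sqlite3

# In-memory database; table/column names are hypothetical.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension table: descriptive labels and classifications.
cur.execute("""
CREATE TABLE dim_student (
    student_key INTEGER PRIMARY KEY,
    name        TEXT,
    major       TEXT,
    class_year  INTEGER
)""")

# Fact table: numeric measures, keyed to the dimension.
cur.execute("""
CREATE TABLE fact_gpa (
    student_key INTEGER REFERENCES dim_student(student_key),
    term        TEXT,
    gpa         REAL
)""")

cur.executemany("INSERT INTO dim_student VALUES (?,?,?,?)",
                [(1, "Ada", "CS", 2026), (2, "Ben", "History", 2025)])
cur.executemany("INSERT INTO fact_gpa VALUES (?,?,?)",
                [(1, "Fall", 3.9), (1, "Spring", 3.7), (2, "Fall", 3.1)])

# The kind of query stars are built for: join fact to dimension,
# then aggregate a measure by a dimension attribute.
rows = cur.execute("""
SELECT d.major, ROUND(AVG(f.gpa), 2)
FROM fact_gpa f JOIN dim_student d USING (student_key)
GROUP BY d.major ORDER BY d.major
""").fetchall()
print(rows)  # [('CS', 3.8), ('History', 3.1)]
```

Note the pattern: the fact table holds only keys and measures, while all the human-readable labels live in the dimension, which is what keeps the schema easy for users to understand.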
    • A Very Simple Star Schema
      • Data Center UPS
      • Power Output
      • Dimensions:
        • Phase
        • Time
        • Date
      • Facts:
        • Volts
        • Amps
        • Etc.
    • A More Complex Star Schema
      • Freshman survey data (HERI/CIRP)
      • Dimensions:
        • Questions
        • Survey years
        • Data about test takers
      • Facts:
        • Answer (text)
        • Answer (raw)
        • Count (1)
      • Oops
        • Not a star
        • Snowflaked!
      Oops, answers should have been placed in their own dimension (creating a “factless fact table”). I’ll demo a better version of this star later!
    • Data Marts
      • One definition:
        • One or more star schemas that present data on a single or related set of business processes
      • Data marts should not be built in isolation
      • They need to be connected via dimensional tables that are
        • The same or subsets of each other
        • Hierarchized the same way internally
      • So, e.g., if I construct data marts for…
        • GPA trends, student major trends, enrollments
        • Freshman survey data, senior survey data, etc.
      • … I connect these marts via a conformed student dimension
        • Makes correlation of data across star schemas intuitive
        • Makes it easier for OLAP tools to use the data
        • Allows nonspecialists to do much of the work
    • Simple Data Mart Example
      • UPS
      • Battery star
        • By battery
          • Run-time
          • % charged
          • Current
      • Input star
        • By phase
          • Voltage
          • Current
      • Output star
        • By phase
          • Voltage
          • Current
      • Sensor star
        • By sensor
          • Temp
          • Humidity
      Note conformed date, time dimensions!
    • CIRP Star/Data Mart
      • CIRP Freshman survey data
      • Corrected from a previous slide
      • Note the CirpAnswer dimension
      • Note student dimension (ties in with other marts)
    • CIRP Mart in Cognos BI 8
    • ROLAP, MOLAP
      • ROLAP = OLAP via direct relational query
        • E.g., against a (materialized) view
        • Against star schemas in a warehouse
      • MOLAP = OLAP via multidimensional database (MDB)
        • MDB is a special kind of database
        • Treats data kind of like a big, fast spreadsheet
        • MDBs typically draw data in from a data warehouse
          • Built to work best with star schemas
    • Data Cubes
      • The term data cube means different things to different people
      • Various definitions:
        • A star schema
        • Any DB view used for reporting
        • A three-dimensional array in an MDB
        • Any multidimensional MDB array (really a hypercube)
      • Which definition do you suppose is technically correct?
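A cube in the "multidimensional array" sense can be sketched in plain Python: one cell per combination of dimension values, each cell holding an aggregated measure. The dimensions, measures, and data here are invented.

```python
from collections import defaultdict

# Toy fact rows: (term, major, gpa). Hypothetical data.
facts = [
    ("Fall", "CS", 3.9), ("Fall", "CS", 3.5),
    ("Fall", "History", 3.1), ("Spring", "CS", 3.7),
]

# Group measures by their coordinates along the two dimensions.
cells = defaultdict(list)
for term, major, gpa in facts:
    cells[(term, major)].append(gpa)

# The "cube": each (term, major) coordinate maps to an average GPA.
cube = {dims: round(sum(v) / len(v), 2) for dims, v in cells.items()}
print(cube[("Fall", "CS")])  # 3.7
```

Real MDBs precompute and index these cells, which is why cube queries feel like navigating a big, fast spreadsheet.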
    • Metadata
      • Metadata = data about data
      • In a data warehousing context it can mean many things
        • Information on data in source OLTP systems
        • Information on ETL jobs and what they do to the data
        • Information on data in marts/star schemas
        • Documentation in OLAP tools on the data they manipulate
      • Many institutions make metadata available via data malls or warehouse portals, e.g.:
        • University of New Mexico
        • UC Davis
        • Rensselaer Polytechnic Institute
        • University of Illinois
      • Good ETL tools automate the setup of malls/portals!
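A data dictionary is one simple form that warehouse metadata can take. Here is a toy sketch in Python; the table name, source note, and column descriptions are all invented for illustration.

```python
# A tiny data-dictionary sketch: metadata describing one warehouse table.
# Every name and description below is hypothetical.
metadata = {
    "fact_gpa": {
        "description": "One row per student per term, with term GPA",
        "source": "student records OLTP system, nightly ETL",
        "columns": {
            "student_key": "Surrogate key into dim_student",
            "term": "Academic term, e.g. 'Fall 2005'",
            "gpa": "Grade point average, 0.0-4.0",
        },
    },
}

def describe(table):
    # Render one table's metadata the way a warehouse portal might.
    m = metadata[table]
    lines = [f"{table}: {m['description']} (source: {m['source']})"]
    for col, desc in m["columns"].items():
        lines.append(f"  {col}: {desc}")
    return "\n".join(lines)

print(describe("fact_gpa"))
```

Portals like the ones listed above publish exactly this kind of information, so users can find out what a column means without asking IT.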
    • The Data Warehouse
      • OK, now we’re experts in terms like OLTP, OLAP, star schema, metadata, etc.
      • Let’s use some of these terms to describe how a DW works:
        • Provides ample metadata – data about the data
        • Utilizes easy-to-understand column/field names
        • Feeds multidimensional databases (MDBs)
        • Is updated via periodic (mainly nightly) ETL jobs
        • Presents data in a simplified, denormalized form
        • Utilizes star-like fact/dimension table schemas
        • Encompasses multiple, smaller data “marts”
        • Supports OLAP tools (Access/Excel, Safari, Cognos BI)
        • Derives data from (multiple) back-end OLTP systems
        • Houses historical data, and can grow very big
    • A Data Warehouse is Not…
      • Vendor and consultant proclamations aside, a data warehouse is not:
        • A project
          • With a specific end date
        • A product you buy from a vendor
          • Like an ODS (such as SCT’s)
          • A canned “warehouse” supplied by iStrategy
          • Cognos ReportNet
        • A database schema or instance
          • Like Oracle
          • SQL Server
        • A cut-down version of your live transactional database
    • Kimball & Caserta’s Definition
      • According to Ralph Kimball and Joe Caserta, a data warehouse is:
        • A system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.
      • Another def.: The union of all the enterprise’s data marts
      • Aside: The Kimball model is not without some critics:
        • E.g., Bill Inmon
    • Example Data Warehouse (1)
      • This one is RPI’s
      • 5 parts:
        • Sources
        • ETL stuff
        • DW proper
        • Cubes etc.
        • OLAP apps
    • Example Data Warehouse (2)
      • Caltech’s DW
      • Five Parts:
        • Source systems
        • ETL processes
        • Data marts
        • FM/metadata
        • Reporting and analysis tools
      • Note: They’re also customers of Cognos!
    • So Where is Colorado College?
      • Phil Goldstein (Educause Center for Applied Research fellow) identifies the major deployment levels:
        • Level 1: Transactional systems only
        • Level 2a: ODS or single data mart; no ETL tools
        • Level 2b: ODS or single data mart with ETL tools
        • Level 3a: Warehouse or multiple marts; no ETL tools; OLAP
        • Level 3b: Warehouse or multiple marts; ETL tools; OLAP
        • Level 3c: Enterprise-wide warehouse or multiple marts; ETL tools; OLAP tools
      • Goldstein’s study was just released in late 2005
      • It’s very good; based on real survey data
      • Which level is Colorado College at?
    • Implementing a Data Warehouse
      • In many organizations IT people want to huddle and work out a warehousing plan, but in fact
        • The purpose of a DW is decision support
        • The primary audience of a DW is therefore College decision makers
        • It is College decision makers therefore who must determine
          • Scope
          • Priority
          • Resources
      • Decision makers can’t make these determinations without an understanding of data warehouses
      • It is therefore imperative that key decision makers first be educated about data warehouses
        • Once this occurs, it is possible to
          • Elicit requirements (a critical step that’s often skipped)
          • Determine priorities/scope
          • Formulate a budget
          • Create a plan and timeline, with real milestones and deliverables!
    • Is This Really a Good Plan?
      • Sure, according to Phil Goldstein (Educause Center for Applied Research)
      • He’s conducted extensive surveys on “academic analytics” (= business intelligence for higher ed)
      • His four recommendations for improving analytics:
        • Key decision makers must lead the way
        • Technologists must collaborate
          • Must collect requirements
          • Must form strong partnerships with functional sponsors
        • IT must build the needed infrastructure
          • Carleton violated this rule with Cognos BI
          • As we discovered, without an ETL/warehouse infrastructure, success with OLAP is elusive
        • Staff must train and develop deep analysis skills
      • Goldstein’s findings mirror closely the advice of industry heavyweights – Ralph Kimball, Laura Reeves, Margy Ross, Warren Thornthwaite, etc.
    • Isn’t a DW a Huge Undertaking?
      • Sure, it can be huge
      • Don’t hold on too tightly to the big-sounding word, “warehouse”
      • Luminaries like Ralph Kimball have shown that a data warehouse can be built incrementally
        • Can start with just a few data marts
        • Targeted consulting help will ensure proper, extensible architecture and tool selection
    • What Takes Up the Most Time?
      • You may be surprised to learn what DW step takes the most time
      • Try guessing which:
        • Hardware
        • Physical database setup
        • Database design
        • ETL
        • OLAP setup
      According to Kimball & Caserta, ETL will eat up 70% of the time. Other analysts give estimates ranging from 50% to 80%. It is the most often underestimated part of a warehouse project!
    • Eight-Month Initial Deployment
      • Begin educating decision makers: 21 days
      • Collect requirements: 14 days
      • Decide general DW design: 7 days
      • Determine budget: 3 days
      • Identify project roles: 1 day
      • Eval/choose ETL tool: 21 days
      • Eval/choose physical DB: 14 days
      • Spec/order, configure server: 20 days
      • Secure, configure network: 1 day
      • Deploy physical “target” DB: 4 days
      • Learn/deploy ETL tool: 28 days
      • Choose/set up modeling tool: 21 days
      • Design initial data mart: 7 days
      • Design ETL processes: 28 days
      • Hook up OLAP tools: 7 days
      • Publicize, train, train: 21 days
    • Conclusion
      • Information is held in transactional systems
        • But transactional systems are complex
        • They don’t talk to each other well; each is a silo
        • They require specially trained people to report off of
      • For normal people to explore institutional data, data in transactional systems needs to be
        • Renormalized as star schemas
        • Moved to a system optimized for analysis
        • Merged into a unified whole in a data warehouse
      • Note: This process must be led by “customers”
        • Yes, IT people must build the infrastructure
        • But IT people aren’t the main customers
      • So who are the customers?
        • Admissions officers trying to make good admission decisions
        • Student counselors trying to find/help students at risk
        • Development officers raising funds that support the College
        • Alumni affairs people trying to manage volunteers
        • Faculty deans trying to right-size departments
        • IT people managing software/hardware assets, etc….