Data Warehouses: A Whistle-Stop Tour

Presentation first presented to the IEEE Computer Society Chapter, New Orleans Section

Published in: Technology

Transcript

  • 1. Data Warehouses: A Whistle-Stop Tour Cade Roux [email_address]
  • 2. Typical Business "Design"
  • 3. Typical Goal Scenario
  • 4. What success still looks like - version 1
  • 5. What success still looks like - version 2
  • 6. What success should look like
  • 7. Glossary
    • Data warehouse
    • Bill Inmon - "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process" - typically associated with top-down design
    • Ralph Kimball - "A copy of transaction data specifically structured for query and analysis." - Typically associated with bottom-up design
  • 8. Glossary (2)
    • ETL
    • Extract...Transform...Load
    • Shorthand for any number of ways of getting the data into the warehouse.
    • Sometimes it's really transform...extract...load, sometimes it's extract...load...transform...load.
    • Key things are to have a strategy and principles for when data is changed/cleaned/conformed/exceptions reported.
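The point about having a strategy for cleaning and exception reporting can be sketched in a few lines. This is only an illustration, not the presenter's process: the source rows, the field names (`account`, `amount`) and the `EtlResult` container are all hypothetical, and the "load" step is just an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class EtlResult:
    """Rows that loaded cleanly, plus rows rejected with a reason."""
    loaded: list = field(default_factory=list)
    exceptions: list = field(default_factory=list)


def extract():
    # Hypothetical source feed; a real extract would read a file or a table.
    return [
        {"account": "123", "amount": "10.50"},
        {"account": "456", "amount": "n/a"},
        {"account": " 789 ", "amount": "3"},
    ]


def transform(row):
    # Clean and conform a single row; raises ValueError on bad amounts.
    return {"account": row["account"].strip(), "amount": float(row["amount"])}


def run_etl(rows):
    # Load good rows and report bad ones instead of silently dropping them.
    result = EtlResult()
    for row in rows:
        try:
            result.loaded.append(transform(row))
        except ValueError as exc:
            result.exceptions.append((row, str(exc)))
    return result


if __name__ == "__main__":
    outcome = run_etl(extract())
    print(len(outcome.loaded), "loaded,", len(outcome.exceptions), "rejected")
```

Whatever the order of the E, T and L steps, the rejected rows end up somewhere a person can review them.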
  • 9. Glossary (3)
    • Facts
    • Dimensions
    • Conformed Dimensions
    • Slowly Changing Dimensions
    • Granularity
    • Dimensionally modelled data is mostly associated with Kimball.
    • Huge advantages in analyzing large amounts of data.
    • Modelling is problematic, but not nearly as hard as normalizing a non-normalized database.
  • 10. Glossary (4)
    • Business Intelligence
    • Single version of the truth
    • These terms are relatively meaningless, but they point to the problems being solved:
    • Get good decision support information to the business - every business is different, and there isn't a silver bullet
    • Eliminate, as much as possible, the ability for users to generate inconsistent information from the same data
  • 11. Glossary (5)
    • Data Mart
    • Silos
    • Silos are mini-data warehouses that are specialized to a subject area - typically from a bottom-up approach.
    • Data Marts are the components of a data warehouse in a top-down design, and the building blocks of a data warehouse in a bottom-up design.
    • Typically, you cannot really do JUST top-down or JUST bottom-up.  The reality is always hybrid, because you have to look forward to enterprise-level integration.
  • 12. Glossary (6)
    • Operational Data Store
    • Enterprise data warehouse
    • ODS is a place where data is combined before load.  Sometimes there are services performed off this.  Typically, the data model has not changed dramatically from the original operational source systems, but it is (another) copy of the data.
    • EDW is an Inmon term which means that the data warehouse covers the enterprise in an integrated fashion.  It is mainly used to distinguish from a data warehouse which does not cover the entire enterprise.
  • 13. Glossary (7)
    • OLTP
    • OLAP
    • OnLine Transaction Processing: Typical online systems, may maintain coherent temporal history, may overwrite themselves when data is changed, usually modelled in third normal form or better, Entity-Relationship modeling.
    • OnLine Analytical Processing: Fast analysis of multi-dimensional data - generally refers to tools running against dimensional data warehouses because the dimensions are explicit - often precalculated "cubes" are created
  • 14. Normal Form
  • 15.  
  • 16. Facts and Dimensions
  • 17.  
  • 18. Dimensional Modelling
    • Facts:
    • Usually scalar quantities
    • Typically can be aggregated:
    • SUM, AVG, etc.
    • Modelling:
    • View all data as either facts or dimensions
    • Determine the nature of the changes in the dimensions
    • Then divide up dimensions for convenience - based on usage/data patterns
    • Combination of art and science
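As a rough illustration of the fact/dimension split, the sketch below builds one dimension table and one fact table in an in-memory SQLite database and sums a fact by a dimension attribute. The table names (`dim_customer`, `fact_fee`), columns and sample rows are invented for the example and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: descriptive attributes used to slice the facts.
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        region       TEXT NOT NULL
    );
    -- Fact: scalar measures at a chosen grain (one row per fee charged).
    CREATE TABLE fact_fee (
        customer_key INTEGER NOT NULL REFERENCES dim_customer (customer_key),
        fee_date     TEXT NOT NULL,
        amount       REAL NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?)",
    [(1, "Acme", "South"), (2, "Bolt", "South"), (3, "Cog", "North")],
)
conn.executemany(
    "INSERT INTO fact_fee VALUES (?, ?, ?)",
    [(1, "2009-01-05", 25.0), (1, "2009-02-05", 25.0),
     (2, "2009-01-09", 10.0), (3, "2009-01-20", 40.0)],
)

# Facts are aggregated (SUM here); dimensions supply the grouping.
totals = dict(conn.execute("""
    SELECT d.region, SUM(f.amount)
    FROM fact_fee AS f
    JOIN dim_customer AS d ON d.customer_key = f.customer_key
    GROUP BY d.region
"""))
```

Here `totals` maps each region to its summed fee amount; adding another dimension (a date dimension, say) would follow the same join-and-group pattern.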
  • 19. Topics
    • Conformed dimensions
    • NULLs
    • Junk Dimensions
    • Too Few Dimensions
    • Too Many Dimensions
    • Parallel ETL
  • 20. Conformed Dimensions
    • Reduces the learning curve
    • Allows models to be combined
    • Account number padding, e.g.
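The account number padding case might look like the following in an ETL step. The fixed width of 10 and the function name `conform_account_number` are assumptions made for illustration; the real convention would be whatever the warehouse standardizes on.

```python
def conform_account_number(raw, width=10):
    """Return an account number as a zero-padded string of a fixed width.

    The width of 10 is an assumed convention, not one taken from the slides.
    """
    digits = str(raw).strip().lstrip("0")
    return digits.zfill(width)


# Three feeds that format the same account differently now agree.
conformed = {conform_account_number(v) for v in ("123", "000123", 123, " 123 ")}
```

Once every feed passes through the same function, models built from different sources can be joined on the account number without per-query cleanup.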
  • 21. NULLs
    • Usually represent unknowns
    • Big problem for users in the face of model evolution
    • If you have a derived stat/measure like customer.allfees = customer.latefees + customer.nsffees
    • Model starts out like
    • -latefees money NOT NULL
    • -nsffees money NOT NULL
    • Now we branch out into mailbox rental:
    • -customer.rentalfees NULL (or NOT NULL?)
    • customer.allfees = customer.latefees + customer.nsffees + customer.rentalfees
    • Handle with a view, or populate old data with 0
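The fee example can be reproduced with an in-memory SQLite database. The column names follow the slide; the `customer_fees` view name, the `REAL` type (SQLite has no `money` type) and the sample rows are assumptions for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        id         INTEGER PRIMARY KEY,
        latefees   REAL NOT NULL,
        nsffees    REAL NOT NULL,
        rentalfees REAL NULL      -- added later; unknown for old rows
    );
    -- Option 1 from the slide: hide the NULL handling behind a view.
    CREATE VIEW customer_fees AS
    SELECT id,
           latefees + nsffees + COALESCE(rentalfees, 0) AS allfees
    FROM customer;
""")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [(1, 5.0, 10.0, None), (2, 1.0, 2.0, 3.0)],
)

# Naive sum: any NULL operand makes the whole expression NULL.
naive = conn.execute(
    "SELECT id, latefees + nsffees + rentalfees FROM customer ORDER BY id"
).fetchall()

# Through the view, old rows still produce a usable total.
via_view = conn.execute(
    "SELECT id, allfees FROM customer_fees ORDER BY id"
).fetchall()
```

The other option on the slide, populating old rows with 0, would be an `UPDATE customer SET rentalfees = 0 WHERE rentalfees IS NULL`, at the cost of no longer distinguishing "no rental fees" from "not known".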
  • 22. Best Practices
    • There are tons of top ten lists of tips and keys to success in articles and books. I will give you my top two.
    • Incremental Delivery – Show successes early, win people over, prove concepts and approach
    • Proactively Manage Quality - Test thoroughly and automate – Testing is usually considered important, but people don’t approach it systematically. Round-trip the data, know the dimensional behavior with benchmarking, automate exception reporting, and keep false positives low so that the warning system doesn’t get ignored. Build confidence by showing that the tests are working. Add tests as defects are found, documenting expectations.
  • 23. Worst Practices
    • Again, there are plenty of online tips – every one of the best practices has a corresponding anti-practice, but these are my top two.
    • Avoid understanding the data, the business motivations, or the details because there are far too many feeds of data coming into the warehouse. Avoid looking ahead to how the data will be used because you shouldn’t change the ETL process to accommodate expectations or provide services.
    • Handle every model the same way, so the data warehouse is consistent, even if some models are awkward and difficult for users to use and difficult to change over time as the business evolves.
  • 24. Some things to keep in mind
    • Terminology is confusing and inconsistent – only your architecture matters. Keep your eyes open to other approaches, but terminology matters less than choosing conventions that match the environment you want.
    • Overriding concern is practicality – get the information into users' hands; this will drive the need for more information and guide you into managing the data.
    • Decoupling produces a lot of redundancy: Source->Flat File->EDI gateway->Stage->DW – understand where the redundancy can be removed, and where decoupling is the goal.
  • 25. Performance Issues
    • Cleansing/Manipulation:
    • Y/N, M/F, Codes – standardize in the ETL, use data types efficiently
    • Indexes – keep end goals in mind, index according to expected usage
    • SARGable, leading zeros/spaces – look for consistency where possible to avoid data manipulation prior to joins; this helps index usage
    • Partitioning tables – for sliding windows of data retention, partitioning the tables allows old data to be dropped off the end of the fact tables fairly easily
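The SARGable point can be observed with SQLite's `EXPLAIN QUERY PLAN`. In this sketch the `account` table, the `ix_account_no` index and the 10-character zero-padded format are all invented for illustration; other database engines report plans differently, but the principle of not wrapping an indexed column in a function is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, account_no TEXT, balance REAL)"
)
conn.execute("CREATE INDEX ix_account_no ON account (account_no)")
conn.executemany(
    "INSERT INTO account (account_no, balance) VALUES (?, ?)",
    [(str(n).zfill(10), float(n)) for n in range(1, 101)],
)


def plan(sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


# Values were conformed in the ETL, so the predicate compares the bare column.
sargable = plan("SELECT * FROM account WHERE account_no = ?", ("0000000042",))

# Stripping zeros at query time wraps the column in a function,
# which prevents the plain index on account_no from being used.
non_sargable = plan(
    "SELECT * FROM account WHERE ltrim(account_no, '0') = ?", ("42",)
)

uses_index = any("ix_account_no" in detail for detail in sargable)
falls_back = not any("ix_account_no" in detail for detail in non_sargable)
```

Printing `sargable` and `non_sargable` shows the two plans side by side; the exact wording varies between SQLite versions.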
  • 26. Application Logic
    • Shared work should be pushed into ETL when:
    • Not likely to change
    • Expensive
    • Everybody needs it
    • Examples:
    • Trivial - Scaling to convention (rates)
    • Intermediate - Simple calculations (Patient Age)
    • Marginal - Interest rates, risk ratings
    • Alternative to marginal cases - generate additional facts, either in their own fact tables or as late arriving facts
    • When a DW is shared, there are many more applications to worry about
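The "Patient Age" case is a typical calculation that is cheap to get subtly wrong in every consuming application and easy to do once in the ETL. A minimal sketch, assuming age means completed years as of a reference date; the function name `age_on` is hypothetical.

```python
from datetime import date


def age_on(birth_date, as_of):
    """Completed years between birth_date and as_of."""
    years = as_of.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


day_before_birthday = age_on(date(1980, 6, 15), date(2009, 6, 14))
on_birthday = age_on(date(1980, 6, 15), date(2009, 6, 15))
```

Because the result depends on the reference date, a shared ETL version also forces a decision the slide hints at: age as of the fact's date can be stored with the fact, while age "today" changes daily and may be better left to a view.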
  • 27. Links
    • http://delicious.com/caderoux1/ieee-nola-dw
    • http://www.speakerrate.com/caderoux
