An introduction to data warehousing

An old presentation from 2007. Knowledge sharing with coworkers at Central 1 Credit Union

Published in: Technology, Business

  1. Data Warehousing: An Introduction. Presented by Shahed Khalili on June 18th, 2007
  2. Knowledge Management & Business Intelligence
      • Transform data into usable information that makes sense
      • Support business decisions
      • Gain competitive advantage
      • Identify and analyze market and user trends
      • Identify popular and profitable services and products
      • Increase the effectiveness of marketing
      • Create various reports for management and clients quickly and flexibly
      • Prioritize product enhancements based on customers' behavioural trends
      • Detect anomalies (e.g. fraud detection)
      • and more…
  3. Business Intelligence (diagram labelled "Value"): Data → Information → Knowledge → Decision
  4. What is a Data Warehouse
      • A way to store large amounts of operational data so that it can be analyzed and turned into comprehensive and intuitive reports
      • A tool that gives management the ability to access and analyze information about its business
  5. What is a Data Warehouse
      • A data warehouse is a copy of transaction data specifically structured for querying and reporting.
      • Large collection of integrated, non-volatile, time-variant data from multiple sources, processed for storage in a multi-dimensional model
      Source: Ralph Kimball, Margy Ross, “Data Warehouse Toolkit”
  6. Characteristics of a DW
      • Subject-oriented – Data gives information about a particular subject rather than about a company's ongoing operations (e.g. CUSTOMER, FINANCIAL INSTITUTION, VENDOR).
      • Integrated – Data is gathered into the data warehouse from a variety of sources and merged into a coherent whole (standardize encoding, impose consistency in units of measure, e.g. one standard way to record a customer's transaction across all systems).
      • Non-volatile – Data, once loaded into the data warehouse, does not change. Each data record represents a distinct state (event).
      • Time-variant – Data is stored for long durations with a time stamp recording its state (e.g. for sampling, summary or trend analysis).
  7. Typical DW Architecture (diagram). Source: Connolly, Begg
  8. OLTP vs. Data Warehouse
      • OLTP – Online transaction processing is used at the routine operational level and is supported by transactional databases optimized for inserts, updates, deletes and some low-level queries.
      • Data Warehousing – Optimized for data retrieval rather than routine transaction processing; supports decision-support applications.
  9. OLTP vs. Data Warehouse
      OLTP                                         | Data Warehouse
      Current data                                 | Historic data
      Detailed                                     | Lightly and highly summarized data
      Dynamic                                      | Static
      High transaction throughput                  | Low transaction throughput
      Transaction driven                           | Analysis driven
      Serves a large number of users – low volume  | Low number of users – large volume
  10. Designing a DW
       • Top down – Business questions: interview to see what the business needs to know
       • Bottom up – What data sources are available and what data is stored
  11. Reminder
       • What is a Data Warehouse – “Large collection of integrated, non-volatile, time-variant data from multiple sources, processed for storage in a multi-dimensional model”
  12. Dimensional Modeling
       • Every dimensional model (DM) is composed of one table with a composite primary key, called the FACT table, and a set of smaller tables called DIMENSION tables.
  13. Dimensional Modeling (diagram: a cube with axes Customer, Vendor and Time, containing Payments)
       Notes: This is a simple three-dimensional data model (cube) that stores Payment facts. The x, y and z axes represent the dimension tables, and what is inside the cube represents the FACT table. A real dimensional model is never this simple; this is only a visual representation of what it could look like. In real life, many more dimensions are needed to describe the business process behind a FACT.
  14. Dimensional Modeling Rules
       • Each DIMENSION table has a simple (non-composite) primary key that corresponds exactly to one of the components of the composite key in the FACT table.
       • This forms a ‘star-like’ structure, which is called a star schema or star join.
       • All natural keys are replaced with surrogate keys. This means that every join between FACT and DIMENSION tables is based on surrogate keys, not natural keys.
       • Surrogate keys allow data in the warehouse to have some independence from the data used and produced by the OLTP systems (e.g. changing BINs).
  15. Denormalizing
       • Denormalizing – The DW data schema is denormalized or partially denormalized to speed up data retrieval.
       • e.g. in a normalized DB we don't store information that can be calculated from stored information. In DW design we do.
  16. Star Schema (diagram)
       • Payment FACT: Amount, Response Code
       • Customer: Customer ID, Customer Num, Customer Name, Customer Age
       • Vendor: Vendor ID, Vendor Num, Vendor Name, Vendor Account Num, Address
       • Time: Time ID, Date, Day of Week, Quarter, Month, Year, Day of Month
       Note how time is denormalized and stored as a dimension.
  17. Why Denormalizing?
       • Looking at the Time dimension table:
           – We're storing fields that can be calculated (such as day of week).
               • For example, if you are Safeway, you want to see which day of the week has the most customers so you can staff up. The question we ask the DW would be “show me the average number of transactions we process on different days of the week”.
           – If we weren't storing the day of week, our DW would have to go through millions of transactions and calculate the day of week from each datestamp to match and return the results.
           – This calculation is very time consuming and the response time would be unacceptable.
           – We denormalize to reduce the response time by storing more information than [one could argue is] needed.
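The day-of-week question from slide 17 can be sketched against a cut-down version of the star schema on slide 16. This is only an illustration using Python's built-in `sqlite3` module: the table and column names (`dim_time`, `fact_payment`, `avg_payments`) and the payment counts are made up for the example and are not from the presentation. The point is that `day_of_week` is stored in the Time dimension, so the query groups on a stored column instead of computing it from a datestamp for every fact row.

```python
import sqlite3
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_time (
        time_id     INTEGER PRIMARY KEY,  -- surrogate key
        date        TEXT NOT NULL,
        day_of_week TEXT NOT NULL         -- derivable from date, stored anyway
    );
    CREATE TABLE fact_payment (
        time_id     INTEGER NOT NULL REFERENCES dim_time(time_id),
        amount      REAL NOT NULL
    );
""")

# Made-up load: number of payments per day for two weeks,
# starting on Monday 2007-06-04.
payments_per_day = [3, 1, 1, 1, 1, 2, 1,
                    5, 1, 1, 1, 1, 2, 1]
start = date(2007, 6, 4)
for offset, n_payments in enumerate(payments_per_day):
    day = start + timedelta(days=offset)
    time_id = offset + 1
    conn.execute("INSERT INTO dim_time VALUES (?, ?, ?)",
                 (time_id, day.isoformat(), DAY_NAMES[day.weekday()]))
    conn.executemany("INSERT INTO fact_payment VALUES (?, ?)",
                     [(time_id, 10.0)] * n_payments)

# Average number of payments per calendar day, grouped by the stored
# day_of_week attribute. Days with no payments at all would not be counted.
rows = conn.execute("""
    SELECT t.day_of_week,
           COUNT(*) * 1.0 / COUNT(DISTINCT t.time_id) AS avg_payments
    FROM fact_payment AS f
    JOIN dim_time AS t ON t.time_id = f.time_id
    GROUP BY t.day_of_week
    ORDER BY avg_payments DESC, t.day_of_week
""").fetchall()

for day_name, avg_payments in rows:
    print(f"{day_name}: {avg_payments:.1f}")
```

With this sample data the busiest day comes out first in `rows`, which is the answer the staffing question needs.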
  18. Fact Table
       • Consists of measured or observed variables, identified via pointers (keys) to the dimension tables.
       • Best to store facts that are numerical measurements, continuously valued and additive (e.g. in a Payment Fact table: amount, CustomerVendorAcct, traceNo, returnCode, etc.).
       • Each measurement is taken at the intersection of all the dimensions.
       • Queries are made to the fact table, which links to multiple records from the various dimension tables to form the result set behind the report.
       • The fact table is sparse: if there is no value to be added, no row is filled in.
       • Fields in the FACT table should be kept to a minimum.
  19. Dimension Tables
       • Store descriptions of the dimensions of the business.
       • Each textual description (attribute) helps to describe a property of the respective dimension.
       • Best to store attributes that are textual, discrete and used as the source of constraints and row headers in the user's answer set.
           – For an attribute that is a numerical measurement: if it varies continuously every time it is sampled, store it as a fact; otherwise, store it as a dimensional attribute (e.g. the standard cost of a product, if it does not change often, is stored as a dimensional attribute).
  20. FACTless Tables
       • Record that something happened when there is nothing to measure – e.g. to track the customers that are registered to use Mobile Banking
       • Answer business questions like: “How many signed up for this service but never used it?”
       • A FACTless table contains only the keys linking the defined dimension tables
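The Mobile Banking example from slide 20 can be sketched as follows, again with Python's `sqlite3`. All table names, column names and customers here are hypothetical. The registration table holds nothing but keys, which is what makes it factless, and the "signed up but never used it" question becomes a check for registrations with no matching usage rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_id   INTEGER PRIMARY KEY,  -- surrogate key
        customer_name TEXT NOT NULL
    );
    -- Factless fact table: only keys, no measurements.
    CREATE TABLE fact_mobile_registration (
        customer_id INTEGER NOT NULL REFERENCES dim_customer(customer_id),
        time_id     INTEGER NOT NULL
    );
    -- Ordinary fact table recording actual use of the service.
    CREATE TABLE fact_mobile_payment (
        customer_id INTEGER NOT NULL REFERENCES dim_customer(customer_id),
        time_id     INTEGER NOT NULL,
        amount      REAL NOT NULL
    );
""")

conn.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob"), (3, "Carol")])
# All three customers registered; only Alice ever made a payment.
conn.executemany("INSERT INTO fact_mobile_registration VALUES (?, ?)",
                 [(1, 10), (2, 11), (3, 12)])
conn.executemany("INSERT INTO fact_mobile_payment VALUES (?, ?, ?)",
                 [(1, 20, 25.0), (1, 21, 40.0)])

never_used = [name for (name,) in conn.execute("""
    SELECT c.customer_name
    FROM fact_mobile_registration AS r
    JOIN dim_customer AS c ON c.customer_id = r.customer_id
    WHERE NOT EXISTS (
        SELECT 1 FROM fact_mobile_payment AS p
        WHERE p.customer_id = r.customer_id
    )
    ORDER BY c.customer_name
""")]

print(f"{len(never_used)} registered but never used the service: {never_used}")
```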
  21. 9 Steps DW Design Methodology
       1. Choosing the process
       2. Choosing the grain
       3. Identifying and conforming the dimensions
       4. Choosing the facts
       5. Storing pre-calculations in the fact table
       6. Rounding out the dimension tables
       7. Choosing the duration of the database
       8. Tracking slowly changing dimensions
       9. Deciding the query priorities and the query modes
       Source: Ralph Kimball, Margy Ross, “Data Warehouse Toolkit”
  22. 9 Steps DW Design Methodology
       • Step 1: Choosing the process – The chosen process (function) refers to the subject matter of a particular data mart, for example a Bill Payment process.
       • Step 2: Choosing the grain – Decide what a record of the fact table is to represent, i.e. the grain. For example, the grain is a single Payment.
       • Step 3: Identifying and conforming the dimensions – Dimensions set the context for asking questions about the facts in the fact table, e.g. who made the Bill Payment.
       • Step 4: Choosing the facts – Facts should be numeric and additive.
  23. 9 Steps DW Design Methodology
       • Step 5: Storing pre-calculations in the fact table – Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations (denormalization).
       • Step 6: Rounding out the dimension tables – Decide which properties to include in each dimension table to best describe it. They should be intuitive and understandable.
       • Step 7: Choosing the duration of the database – How long to keep the data.
       • Step 8: Tracking slowly changing dimensions
           – Type 1: a changed dimension attribute is overwritten
           – Type 2: a changed dimension attribute causes a new dimension record to be created
           – Type 3: a changed dimension attribute causes an alternate attribute to be created, so that both the old and new values of the attribute are simultaneously accessible in the same dimension record
  24. 9 Steps DW Design Methodology
       • Step 9: Deciding the query priorities and the query modes – Consider physical design issues
           • Indexing for performance, indexed views, partitioning, physical sort order, etc.
           • Storage, backup, security
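The three slowly-changing-dimension types from step 8 can be contrasted with a small in-memory sketch. This is plain Python, not a warehouse implementation: `VendorRow`, the `is_current` flag, the `previous_address` field and the three `apply_type*` helpers are hypothetical names chosen for the illustration, and a vendor's address change is an assumed example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VendorRow:
    vendor_id: int                           # surrogate key
    vendor_num: str                          # natural key from the source system
    address: str
    previous_address: Optional[str] = None   # only used by Type 3
    is_current: bool = True                  # only used by Type 2


def current_row(rows: List[VendorRow], vendor_num: str) -> VendorRow:
    """Return the current dimension row for a natural key."""
    return next(r for r in rows if r.vendor_num == vendor_num and r.is_current)


def apply_type1(rows: List[VendorRow], vendor_num: str, new_address: str) -> None:
    """Type 1: overwrite the attribute; history is lost."""
    current_row(rows, vendor_num).address = new_address


def apply_type2(rows: List[VendorRow], vendor_num: str, new_address: str) -> None:
    """Type 2: retire the old row and add a new row with a new surrogate key."""
    old = current_row(rows, vendor_num)
    old.is_current = False
    new_id = max(r.vendor_id for r in rows) + 1
    rows.append(VendorRow(new_id, vendor_num, new_address))


def apply_type3(rows: List[VendorRow], vendor_num: str, new_address: str) -> None:
    """Type 3: keep old and new values side by side in the same row."""
    row = current_row(rows, vendor_num)
    row.previous_address = row.address
    row.address = new_address


type1_rows = [VendorRow(1, "V-100", "1 Main St")]
type2_rows = [VendorRow(1, "V-100", "1 Main St")]
type3_rows = [VendorRow(1, "V-100", "1 Main St")]

apply_type1(type1_rows, "V-100", "9 Oak Ave")
apply_type2(type2_rows, "V-100", "9 Oak Ave")
apply_type3(type3_rows, "V-100", "9 Oak Ave")

print(type1_rows)
print(type2_rows)
print(type3_rows)
```

Only Type 2 changes the number of rows. Because the new row gets its own surrogate key, fact rows loaded before the change keep pointing at the old address.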
  25. Data Warehouse (diagram): Data Sources 1, 2, 3, … n → ETL → Data Warehouse → Reporting
  26. ETL
       • Extraction, Transformation, Loading
       • The tasks of capturing data by extracting it from source systems, cleansing (transforming) it, and finally loading the results into the target system.
       • Can be carried out either by separate products or by a single integrated solution.
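The three ETL stages can be sketched end to end in a few lines. This is a toy illustration under stated assumptions rather than a description of any ETL product: the CSV layout, the table names and the `extract`, `transform`, `load` and `surrogate_key` helpers are all invented for the example. It also shows the surrogate-key rule from slide 14, since the load step swaps each natural key for a warehouse-generated key.

```python
import csv
import io
import sqlite3

# Made-up source extract: inconsistent formatting and one bad amount.
SOURCE_CSV = """customer_num,vendor_num,amount,paid_on
C-001, v-100 ,25.50,2007-06-18
C-002,V-100,10.00,2007-06-18
C-001,V-200,bad,2007-06-19
"""


def extract(text):
    """Extract: read raw rows from the source system (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: standardize key encoding and reject rows that fail cleansing."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would log or quarantine the row
        clean.append({
            "customer_num": row["customer_num"].strip().upper(),
            "vendor_num": row["vendor_num"].strip().upper(),
            "amount": amount,
            "paid_on": row["paid_on"].strip(),
        })
    return clean


def surrogate_key(conn, table, natural_col, natural_value):
    """Look up the surrogate key for a natural key, creating the row if needed.

    table and natural_col are trusted constants in this sketch.
    """
    found = conn.execute(
        f"SELECT id FROM {table} WHERE {natural_col} = ?", (natural_value,)
    ).fetchone()
    if found:
        return found[0]
    cursor = conn.execute(
        f"INSERT INTO {table} ({natural_col}) VALUES (?)", (natural_value,)
    )
    return cursor.lastrowid


def load(conn, rows):
    """Load: write fact rows that reference dimensions by surrogate key."""
    for row in rows:
        customer_id = surrogate_key(conn, "dim_customer", "customer_num", row["customer_num"])
        vendor_id = surrogate_key(conn, "dim_vendor", "vendor_num", row["vendor_num"])
        conn.execute("INSERT INTO fact_payment VALUES (?, ?, ?, ?)",
                     (customer_id, vendor_id, row["amount"], row["paid_on"]))


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, customer_num TEXT UNIQUE);
    CREATE TABLE dim_vendor   (id INTEGER PRIMARY KEY, vendor_num TEXT UNIQUE);
    CREATE TABLE fact_payment (customer_id INTEGER, vendor_id INTEGER,
                               amount REAL, paid_on TEXT);
""")

load(conn, transform(extract(SOURCE_CSV)))

fact_count = conn.execute("SELECT COUNT(*) FROM fact_payment").fetchone()[0]
vendor_count = conn.execute("SELECT COUNT(*) FROM dim_vendor").fetchone()[0]
total_amount = conn.execute("SELECT SUM(amount) FROM fact_payment").fetchone()[0]
print(fact_count, vendor_count, total_amount)
```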
  27. DW – Technology and DBMS
       • MySQL
           – Scale out, not scale up.
               • MySQL supports clustering, replication, etc. You can distribute the DW across multiple servers.
           – Fast database engine, especially for bulk inserts and selects
           – Lots of open source tools available for ETL
           – MySQL is a cheaper solution, which makes the initial investment more attractive to the business
  28. Thank you. Questions…
