Star schema

A very basic overview of data warehousing, OLTP, OLAP, and the star schema.

Published in: Technology

  1. 1. What is a Data Warehouse?<br /><ul><li>Data warehouses and data marts differ conceptually in scope, but they are built using the exact same methods and procedures.
  2. 2. A data warehouse (or mart) is a way of storing data for later retrieval. This retrieval is almost always used to support decision-making in the organization. That is why many data warehouses are considered to be DSS (Decision-Support Systems).
  3. 3. Both a data warehouse and a data mart are storage mechanisms for read-only, historical, aggregated data.
  5. 5. A data warehouse stores current and historical data.</li></ul>OLTP:<br /><ul><li>OLTP stands for Online Transaction Processing.
  6. 6. This is a standard, normalized database structure.
  7. 7. OLTP is designed for transactions, which means that inserts, updates, and deletes must be fast.</li></ul>OLAP:<br /><ul><li>OLAP stands for Online Analytical Processing.
  8. 8. OLAP is a term that means many things to many people.</li></ul>Difference between OLTP and OLAP:<br />
     <table>
     <tr><th>OLTP</th><th>OLAP</th></tr>
     <tr><td>Current data only</td><td>Current + historical data</td></tr>
     <tr><td>Short database transactions</td><td>Long database transactions</td></tr>
     <tr><td>Online update/insert/delete</td><td>Batch update/insert/delete</td></tr>
     <tr><td>Normalization is promoted</td><td>Denormalization is promoted</td></tr>
     <tr><td>High volume transactions</td><td>Low volume transactions</td></tr>
     <tr><td>Transaction recovery is necessary</td><td>Transaction recovery is not necessary</td></tr>
     <tr><td>Few indexes</td><td>Many indexes</td></tr>
     <tr><td>Many joins</td><td>Few joins</td></tr>
     </table><br /><ul><li>With normalization, we may also have fewer indexes per table. This means that inserts, updates, and deletes run faster, because each insert, update, and delete may affect one or more indexes.
  9. 9. Therefore, with each transaction, these indexes must be updated along with the table. This overhead can significantly decrease our performance.
  10. 10. There are some disadvantages to an OLTP structure, especially when we go to retrieve the data for analysis.
  11. 11. For one, we now must utilize joins and query multiple tables to get all the data we want. Joins tend to be slower than reading from a single table, so we want to minimize the number of tables in any single query.
  12. 12. One of the advantages of OLTP is also a disadvantage: fewer indexes per table.
  13. 13. In general terms, the fewer indexes we have, the faster inserts, updates, and deletes will be.
  14. 14. However, again in general terms, the fewer indexes we have, the slower select queries will run.
  15. 15. Since one of our design goals for speeding up transactions is to minimize the number of indexes, we are limiting ourselves when it comes to data retrieval.
  16. 16. The solution is to create two separate database structures: an OLTP system for transactions, and an OLAP system for data retrieval.</li></ul>Star Schema:<br /><ul><li>A star schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema and contains one or more dimension tables and fact tables.
  17. 17. It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star, where one fact table is connected to multiple dimensions.
  18. 18. The center of the star schema consists of a large fact table, which points towards the dimension tables.</li></ul>Steps in designing Star Schema:<br /><ul><li>Identify a business process for analysis (like sales).
  19. 19. Identify measures or facts (sales dollars).
  20. 20. Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
  21. 21. List the columns that describe each dimension (region name, branch name).
  22. 22. Determine the lowest level of summary in a fact table (sales dollars).</li></ul>Important aspects of Star Schema & Snowflake Schema:<br /><ul><li>In a star schema every dimension will have a primary key.
  23. 23. In a star schema, a dimension table will not have any parent table.
  24. 24. In a snowflake schema, by contrast, a dimension table will have one or more parent tables.
  25. 25. In a star schema, hierarchies for the dimensions are stored in the dimension table itself.
  26. 26. In a snowflake schema, hierarchies are broken into separate tables. These hierarchies help to drill down the data from the topmost level to the lowermost level.</li></ul>Designing a Star Schema:<br /><ul><li>Typical business questions have three things in common. First, there is a time element to each one. Second, they all are looking for aggregated data; they are asking for sums or counts, not individual transactions. Finally, they are looking at data in terms of “by” conditions.
  27. 27. When I talk about “by” conditions, I am referring to looking at data by certain conditions.
  28. 28. For example, if we take the question “On a quarterly and then monthly basis, are Dairy Product sales cyclical?” we can break this down into this: “We want to see total sales by category (just Dairy Products in this case), by quarter or by month.”
  29. 29. Here we are looking at an aggregated value, the sum of sales, by specific criteria.
  30. 30. When we talk about the way we want to look at data, we usually want to see some sort of aggregated data. These data are called measures.
  31. 31. These measures are numeric values that are measurable and additive.
  32. 32. We need to look at our measures using those “by” conditions. These “by” conditions are called dimensions.
  33. 33. When we say we want to know our sales dollars, we almost always mean by day, or by quarter, or by year.
  34. 34. These “by” conditions will map into dimensions: there is almost always a time dimension, and product and geographic dimensions are very common as well.
  35. 35. Therefore, in designing a star schema, our first order of business is usually to determine
  36. 36. what we want to see (our measures) and how we want to see it (our dimensions).</li></ul>Mapping Dimensions into Tables<br /><ul><li>When we start building dimension tables, there are a few rules to keep in mind. First, all dimension tables should have a single-field primary key.
  37. 37. This key is often just an identity column, consisting of an automatically incrementing number.
  38. 38. (The value of the primary key is meaningless; our information is stored in the other fields.)
  39. 39. These other fields contain the full descriptions of what we are after.
  40. 40. For example, if we have a Product dimension (which is common) we have fields in it that contain the description, the category name, the sub-category name, etc.
  41. 41. These fields do not contain codes that link us to other tables. Because the fields are the full descriptions, the dimension tables are often fat; they contain many large fields.
  42. 42. Dimension tables are often short, however. We may have many products, but even so, the dimension table cannot compare in size to a normal fact table.
  44. 44. Our dimension table might look something like this:
  45. 45. Notice that both Category and Subcategory are stored in the table and not linked in through joined tables that store the hierarchy information.
  46. 46. The hierarchies are contained in the individual dimension tables. No additional tables are needed to hold hierarchical information.</li></ul>Fact Table:<br /><ul><li>A table in a star schema that contains facts and is connected to dimensions.
  47. 47. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables.
  48. 48. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
  49. 49. A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables).
  50. 50. A fact table usually contains facts with the same level of aggregation.</li></ul>Steps in designing Fact Table:<br /><ul><li>Identify a business process for analysis (like sales).
  51. 51. Identify measures or facts (sales dollars).
  52. 52. Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
  53. 53. List the columns that describe each dimension (region name, branch name).
  54. 54. Determine the lowest level of summary in a fact table (sales dollars).</li></ul>Building the Fact Table:<br /><ul><li>The Fact Table holds our measures, or facts.
  55. 55. The measures are numeric and additive across some or all of the dimensions.
  56. 56. For example, sales are numeric and we can look at total sales for a product, or category, and we can look at total sales by any time period.
  57. 57. While the dimension tables are short and fat, the fact tables are generally long and skinny.
  58. 58. They are long because they can hold the number of records represented by the product of the counts in all the dimension tables.
  59. 59. In this schema, we have product, time and store dimensions. If we assume we have ten years of daily data, 200 stores, and we sell 500 products, we have a potential of 365,000,000 records (3650 days * 200 stores * 500 products). As you can see, this makes the fact table long.
  60. 60. The fact table is skinny because of the fields it holds. The primary key is made up of foreign keys that have migrated from the dimension tables.
  61. 61. These fields are just some sort of numeric value. In addition, our measures are also numeric. Therefore, the size of each record is generally much smaller than those in our dimension tables.
  62. 62. However, we have many, many more records in our fact table.</li></ul>Measure Types:<br /><ul><li>Additive - Measures that can be added across all dimensions.
  63. 63. Non-Additive - Measures that cannot be added across any dimension.
  64. 64. Semi-Additive - Measures that can be added across some dimensions but not others.</li></ul>
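
The slides above describe the star schema in words only, so a small sketch may help tie the pieces together. The following uses Python's built-in sqlite3 module with an in-memory database. All table names, column names, and sample rows (dim_product, dim_time, dim_store, fact_sales, the products and dollar amounts) are assumptions made up for illustration and do not come from the slides. It shows dimension tables with single-field primary keys and hierarchies stored inline, a fact table whose composite primary key is made up of its foreign keys, a "by category, by quarter" aggregate, and the 365,000,000-row estimate from slide 59.

```python
import sqlite3


def build_star_schema(conn):
    """Create hypothetical product/time/store dimensions and a sales fact table."""
    conn.executescript("""
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,  -- meaningless surrogate key
            description TEXT,
            category    TEXT,                 -- hierarchy kept in the dimension
            subcategory TEXT
        );
        CREATE TABLE dim_time (
            time_key INTEGER PRIMARY KEY,
            day TEXT, month INTEGER, quarter INTEGER, year INTEGER
        );
        CREATE TABLE dim_store (
            store_key INTEGER PRIMARY KEY,
            store_name TEXT, region TEXT
        );
        CREATE TABLE fact_sales (
            product_key   INTEGER REFERENCES dim_product(product_key),
            time_key      INTEGER REFERENCES dim_time(time_key),
            store_key     INTEGER REFERENCES dim_store(store_key),
            sales_dollars REAL,               -- the additive measure
            PRIMARY KEY (product_key, time_key, store_key)
        );
    """)


def load_sample_data(conn):
    """Insert a handful of made-up rows so the query below has data to sum."""
    conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?, ?)", [
        (1, "Whole milk 1L", "Dairy Products", "Milk"),
        (2, "Cheddar 200g", "Dairy Products", "Cheese"),
        (3, "Sourdough loaf", "Bakery", "Bread"),
    ])
    conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?, ?, ?)", [
        (1, "2024-01-15", 1, 1, 2024),
        (2, "2024-02-15", 2, 1, 2024),
        (3, "2024-04-15", 4, 2, 2024),
    ])
    conn.executemany("INSERT INTO dim_store VALUES (?, ?, ?)", [
        (1, "Store A", "North"),
        (2, "Store B", "South"),
    ])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
        (1, 1, 1, 100.0),
        (2, 1, 1, 50.0),
        (1, 2, 2, 70.0),
        (3, 2, 1, 30.0),
        (2, 3, 2, 40.0),
    ])


def sales_by_category_and_quarter(conn, category):
    """Total sales for one category, grouped "by" year and quarter."""
    return conn.execute("""
        SELECT t.year, t.quarter, SUM(f.sales_dollars)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        JOIN dim_time AS t ON t.time_key = f.time_key
        WHERE p.category = ?
        GROUP BY t.year, t.quarter
        ORDER BY t.year, t.quarter
    """, (category,)).fetchall()


def max_fact_rows(days, stores, products):
    """Upper bound on fact rows: the product of the dimension row counts."""
    return days * stores * products


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
load_sample_data(conn)
dairy_by_quarter = sales_by_category_and_quarter(conn, "Dairy Products")
print(dairy_by_quarter)                # [(2024, 1, 220.0), (2024, 2, 40.0)]
print(max_fact_rows(3650, 200, 500))   # 365000000
```

Note that the query needs only one join per dimension, because category and quarter live directly in the dimension tables. In a snowflake schema the same question would need extra joins to reach the parent tables that hold the hierarchy.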
