Business Intelligence Data Warehouse System

What is Business Intelligence Data Warehouse System?



  1. 1. Presentation prepared by: Kiran Kumar, Pentaho BI Consultant
  2. 2. Objective: At the end of this module, you will know: • Trainer introduction • What is data warehousing? • What is data warehouse architecture? • What is dimensional modelling & design? • What is business intelligence?
  3. 3. Personal, Academic & Professional Information • Name: Kiran Kumar • Academic: BE • Companies: Graymatter Software Service Pvt. Ltd., India • BI/DWH Technologies Exposure • Domain Knowledge
  4. 4. What is a Data Warehouse? • Loosely speaking: a database that is maintained separately from an organization’s operational database. • Officially speaking: a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
  5. 5. Data Warehouse Properties • Subject-Oriented • Integrated • Time-Variant • Non-Volatile
  6. 6. Subject-Oriented: Retail Management Systems
  7. 7. Integrated: Retail Management Systems
  8. 8. Time-Variant: Retail Management Systems
  9. 9. Non-Volatile: Retail Management Systems
  10. 10. Goals of Data Warehousing / Business Intelligence • The DW/BI system must make information easily accessible. • The DW/BI system must present information consistently. • The DW/BI system must adapt to change. • The DW/BI system must be a secure bastion that protects the information assets. • The DW/BI system must serve as the authoritative and trustworthy foundation for improved decision making. • The DW/BI system must present information in a timely way. • The business community must accept the DW/BI system for it to be deemed successful.
  11. 11. Strategic uses of Data Warehousing
      Industry | Functional areas of use | Strategic use
      Airline | Operations; marketing | Crew assignment, aircraft development, mix of fares, analysis of route profitability, frequent flyer program promotions
      Banking | Product development; operations; marketing | Customer service, trend analysis, product and service promotions, reduction of IS expenses
      Credit card | Product development; marketing | Customer service, new information service, fraud detection
      Health care | Operations | Reduction of operational expenses
      Investment and insurance | Product development; operations; marketing | Risk management, market movements analysis, customer tendencies analysis, portfolio management
      Retail chain | Distribution; marketing | Trend analysis, buying pattern analysis, pricing policy, inventory control, sales promotions, optimal distribution channel
      Telecommunications | Product development; operations; marketing | New product and service promotions, reduction of IS budget, profitability analysis
      Personal care | Distribution; marketing | Distribution decisions, product promotions, sales decisions, pricing policy
      Public sector | Operations | Intelligence gathering
  12. 12. Evolution in Organizational Use of Data Warehouses • Offline Data Warehouse: Data warehouses at this stage are updated from data in the operational systems on a regular basis, and the data is stored in a structure designed to facilitate reporting. • Real-Time Data Warehouse: Data warehouses at this stage are updated every time an operational system performs a transaction (e.g. an order, a delivery, or a booking).
  13. 13. Data Marts • A data mart is a scaled down version of a data warehouse that focuses on a particular subject area. • A data mart is a subset of an organizational data store, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs. • Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization. • Usually designed to support the unique business requirements of a specified department or business process • Implemented as the first step in proving the usefulness of the technologies to solve business problems Reasons for creating a data mart • Easy access to frequently needed data • Creates collective view by a group of users • Improves end-user response time • Ease of creation in less time • Lower cost than implementing a full Data warehouse • Potential users are more clearly defined than in a full Data warehouse
  14. 14. From the Data Warehouse to Data Marts [Figure: progression from the organizationally structured data warehouse to departmentally structured and individually structured data marts; axis labels: Less / More, History, Normalized, Detailed, Data, Information]
  15. 15. Characteristics of the Departmental Data Mart • Small • Flexible • Customized by department • Source is the departmentally structured data warehouse [Figure: data mart fed from the data warehouse]
  16. 16. Inmon vs. Ralph Kimball: Characteristics
  17. 17. Data Warehousing Integration [Figure: data sources (databases) feed the data warehouse (data organization; storage); the warehouse feeds analytical processing, data mining and data visualization, which generate knowledge; end users (decision making and other tasks: CRM, DSS, EIS) either use the information directly or use the generated knowledge]
  18. 18. Design the BI & DWH Architecture
  19. 19. DWH Architecture Cont.. • Data Source Layer • Data Extraction Layer • Staging Area • ETL Layer • Data Storage Layer • Data Logic Layer • Data Presentation Layer • Metadata Layer
  20. 20. Advantages & Disadvantages of a Data Warehouse Advantages: Data warehouses tend to have a very high query success rate because they have complete control over the four main areas of data management systems: • Clean data • Indexes: multiple types • Query processing: multiple options • Security: data and access Further advantages: • Bottom-up approach • Easy report creation • Enhanced access to data and information Disadvantages: • Preparation may be time consuming • Long initial implementation time and associated high cost • Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data.
  21. 21. OLTP vs. OLAP Systems
  22. 22. To Summarize
  23. 23. Data, Data everywhere yet ... • I can’t find the data I need – data is scattered over the network – many versions, subtle differences • I can’t get the data I need – need an expert to get the data • I can’t understand the data I found – available data poorly documented • I can’t use the data I found – results are unexpected – data needs to be transformed from one form to other
  24. 24. Business Intelligence • One ultimate use of the data gathered and processed in the data life cycle is business intelligence. • Business intelligence generally involves the creation or use of a data warehouse and/or data mart for storage of data, and the use of front-end analytical tools such as Pentaho BI Suite, SAP BO, MSBI, Oracle’s Sales Analyzer and Financial Analyzer, or MicroStrategy Web. • End users can employ such tools to access data, run queries, request ad hoc (special) reports, examine scenarios, create CRM activities, devise pricing strategies, and much more.
  25. 25. A producer wants to know... • Which are our lowest/highest margin customers? • Who are my customers and what products are they buying? • What is the most effective distribution channel? • What product promotions have the biggest impact on revenue? • What impact will new products/services have on revenue and margins? • Which customers are most likely to go to the competition?
  26. 26. How does Business Intelligence work? • The process starts with raw data, which is usually kept in corporate databases. For example, a national retail chain that sells everything from grills and patio furniture to plastic utensils had data about inventory, customer information, data about past promotions, and sales numbers in various databases. • Though all this information may be scattered across multiple systems and may seem unrelated, business intelligence software can bring it together. This is done by using a data warehouse. • In the data warehouse (or mart), tables can be linked and data cubes are formed. For instance, inventory information is linked to sales numbers and customer databases, allowing for deep analysis of information. • Using the business intelligence software, the user can run queries, request ad hoc reports, or conduct any other analysis. • For example, deep analysis can be carried out by performing multilayer queries. Because all the databases are linked, one can search for the products a store has too much of and determine, based on previous sales, which of these products commonly sell with popular items. After planning a promotion to move the excess stock along with the popular products (by bundling them together, for example), one can dig deeper to see where this promotion would be most popular (and most profitable). • The results of the request can be reports, predictions, alerts, and/or graphical presentations. These can be disseminated to decision makers to help them in their decision-making tasks.
  27. 27. Dimension Tables • A dimension table contains the text and descriptive information about the business entities of an enterprise, represented as hierarchical, categorical information such as Customer, Product, Date, Location, Department etc. • It is the "1" side in a 1-M relationship with the fact table • Also called lookup or reference tables • Typically contain the attributes for the SQL answer set.
  28. 28. Types of Dimension Tables • Standard / Common Dimension • Conformed Dimension • Junk Dimension • Degenerate Dimension • Role-Playing Dimension • Denormalized Flattened Dimension • Snowflaked Dimension • Outrigger Dimension • Shrunken Dimension
  29. 29. Slowly Changing Dimensions • Dimension attributes that change slowly and unpredictably over time, rather than on a regular, time-based schedule. • In a data warehouse there is a need to track changes in dimension attributes in order to report historical data. • Ex: A person changing his/her city from Bangalore to Mumbai. Types of SCD: – Type 1: Store only the current value (Overwrite) – Type 2: Maintain the history of changes (Add New Row) – Type 3: Create an attribute in the dimension record for the previous value (Add New Attribute) – Type 4: Use a historical table (Add Mini-Dimension table) – Type 5: Add Mini-Dimension & Type 1 Outrigger
  30. 30. SCD 1 – Overwrite the Old Value
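The slide image is not part of this transcript, so here is a minimal sketch of a Type 1 change, using the Bangalore-to-Mumbai example from the previous slide. The dictionary layout, the key value and the function name `scd1_update` are assumptions made for illustration, not taken from the deck.

```python
# Hypothetical customer dimension keyed by surrogate key (illustrative data).
customer_dim = {
    1: {"customer_id": "C100", "name": "Asha", "city": "Bangalore"},
}


def scd1_update(dim, surrogate_key, attribute, new_value):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    dim[surrogate_key][attribute] = new_value


scd1_update(customer_dim, 1, "city", "Mumbai")
print(customer_dim[1]["city"])
```

After the update the dimension still has one row for the customer and no trace of the old city, which is why Type 1 cannot support historical reporting.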
  31. 31. SCD 2 – Add a New Row
  32. 32. SCD 2 – Add a New Row
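A Type 2 change can be sketched as follows. The column names (`valid_from`, `valid_to`, `is_current`) are a common convention but are assumptions here, as are the sample dates and the function name `scd2_update`.

```python
from datetime import date

# Hypothetical Type 2 dimension: each row is one version of a customer.
customer_dim = [
    {"customer_key": 1, "customer_id": "C100", "city": "Bangalore",
     "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True},
]


def scd2_update(dim, customer_id, new_city, change_date):
    """Type 2: expire the current row, then append a new row with a new surrogate key."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return  # nothing changed, keep the current version
            row["valid_to"] = change_date
            row["is_current"] = False
    next_key = max(row["customer_key"] for row in dim) + 1
    dim.append({"customer_key": next_key, "customer_id": customer_id,
                "city": new_city, "valid_from": change_date,
                "valid_to": None, "is_current": True})


scd2_update(customer_dim, "C100", "Mumbai", date(2023, 6, 1))
for row in customer_dim:
    print(row["customer_key"], row["city"], row["is_current"])
```

Facts loaded before the change keep pointing at surrogate key 1 (Bangalore), while new facts use key 2 (Mumbai), so history is preserved.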
  33. 33. SCD 3 – Add a New Column
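A Type 3 change keeps only the previous value, in an extra column. The sketch below assumes hypothetical column names `current_city` and `previous_city`.

```python
# Hypothetical Type 3 dimension: one extra column holds the previous value.
customer_dim = {
    1: {"customer_id": "C100", "current_city": "Bangalore", "previous_city": None},
}


def scd3_update(dim, surrogate_key, new_city):
    """Type 3: shift the current value into the 'previous' column, then overwrite."""
    row = dim[surrogate_key]
    row["previous_city"] = row["current_city"]
    row["current_city"] = new_city


scd3_update(customer_dim, 1, "Mumbai")
print(customer_dim[1])
```

Only one level of history survives: a second move would overwrite "Bangalore" in `previous_city`.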
  34. 34. SCD 4 • What is a Mini-Dimension? – When a dimension has attributes that change rapidly or at frequent intervals, they are split off to form a separate dimension table called a mini-dimension. Ex: Age of a Customer or Employee, Salary Band, Designation etc. • Design aspects of the Mini-Dimension – The mini-dimension table should have its own surrogate key. – There is no direct connection between the base and mini-dimension tables. – The fact table contains the primary keys of both the base and mini-dimension tables. • What is SCD 4? – Involves the use of two or more dimension tables, in which one acts as the base dimension and the others are mini-dimension tables. • When to use? – Handling rapidly changing attributes
  35. 35. SCD 5 • What is SCD 5? – SCD 5 involves one or more mini-dimension tables and a base dimension table that holds a reference to the mini-dimension key. – This reference key in the base dimension should be Type 1 in nature, so it always reflects the current version of the mini-dimension attributes. – The Type 1 reference key should be updated in all versions of the base dimension records whenever the corresponding mini-dimension attribute values change. • When to use? – When there is a need to access the current values in the mini-dimension directly from the base dimension without joining through a fact table • Design aspects of the Mini-Dimension – The mini-dimension table should have its own surrogate key. – There is a direct connection between the base and mini-dimension tables. – The fact table contains the primary keys of both the base and mini-dimension tables.
  36. 36. Fact Tables • Store the performance measurements resulting from an organization’s business process events • Store the low-level measurement data resulting from a business process in a single dimensional model • The term fact represents a business measure. • Each row in a fact table corresponds to a measurement event • Contain two or more foreign keys • Tend to have huge numbers of records • Useful facts tend to be numeric and additive Types of Fact Table: 1. Transactional Fact Table 2. Factless Fact Table 3. Snapshot Fact Table 4. Accumulating Fact Table 5. Aggregate Fact Table 6. Consolidated Fact Table
  37. 37. Transactional Fact table • These fact tables represent an event that occurred at an instantaneous point in time. A row exists in the fact table for a given customer or product only if a transaction has occurred • Grain is the individual transaction • Mostly Additive Facts
  38. 38. Periodic Snapshot Fact table • The fact table summarizes many measurement events occurring over a standard period such as a day, week, month or quarter • Grain is the period, not the individual transaction • If we have 1000 people living in a region at the end of month 1 and 1500 people living in the same region at the end of month 2, the total number of people is not 2500 • Semi-Additive & Non-Additive Facts
  39. 39. Aggregate Fact table • Fact table contains Aggregated Data • Mostly Additive Facts
  40. 40. Factless Fact table • Factless fact table contains no measures • Only Keys from Dimension tables
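Even without measures, a factless fact table answers questions by counting rows. The sketch below assumes a hypothetical class-attendance table whose rows hold only (date_key, student_key, course_key); none of these names come from the deck.

```python
from collections import Counter

# Hypothetical attendance fact: only dimension keys, no numeric measures.
# Each tuple is (date_key, student_key, course_key).
attendance_fact = [
    (20240101, 1, 10),
    (20240101, 2, 10),
    (20240101, 3, 20),
    (20240102, 1, 10),
]

# "How many attendances were recorded per course?" is answered by counting rows.
attendance_per_course = Counter(course_key for _, _, course_key in attendance_fact)
print(attendance_per_course[10], attendance_per_course[20])
```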
  41. 41. Accumulating Fact table
  42. 42. Consolidated Fact table It is often convenient to combine facts from multiple processes together into a single consolidated fact table if they can be expressed at the same grain. For example, sales actuals can be consolidated with sales forecasts in a single fact table to make the task of analyzing actuals versus forecasts simple and fast, as compared to assembling a drill-across application using separate fact tables. Consolidated fact tables add burden to the ETL processing, but ease the analytic burden on the BI applications. They should be considered for cross-process metrics that are frequently analyzed together.
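The actuals-versus-forecast case can be sketched like this, assuming both processes are already expressed at the same (month, product) grain. The sample figures and field names are made up for illustration, and treating a missing value as 0.0 is a simplification.

```python
# Hypothetical facts from two processes, both at (month, product) grain.
actuals = {("2024-01", "P1"): 120.0, ("2024-01", "P2"): 80.0}
forecasts = {("2024-01", "P1"): 100.0, ("2024-01", "P2"): 90.0,
             ("2024-02", "P1"): 110.0}

# Consolidate into one row per grain value; this work is done once, in ETL.
consolidated = []
for key in sorted(set(actuals) | set(forecasts)):
    month, product = key
    actual = actuals.get(key, 0.0)      # no actual recorded yet
    forecast = forecasts.get(key, 0.0)  # no forecast recorded
    consolidated.append({"month": month, "product": product,
                         "actual": actual, "forecast": forecast,
                         "variance": actual - forecast})

for row in consolidated:
    print(row)
```

The BI layer can now read variance from a single table instead of drilling across two fact tables.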
  43. 43. Types of Fact / Measure • Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table. • Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others. • Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
  44. 44. Type of Fact / Measure Cont.. • Additive The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week. • Semi-Additive & Non-Additive: The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add them up for the account level or the day level.
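The difference between the three kinds of fact can be checked with a few lines of arithmetic. The rows below are invented sample data; the revenue and profit columns used to derive the margin are an assumption, since the slide only names Profit_Margin.

```python
# Hypothetical daily snapshot rows: (date, account, current_balance).
balances = [
    ("2024-01-01", "A", 100), ("2024-01-01", "B", 200),
    ("2024-01-02", "A", 150), ("2024-01-02", "B", 250),
]

# Semi-additive: summing across accounts for one day is meaningful.
total_day2 = sum(bal for day, _, bal in balances if day == "2024-01-02")

# Summing one account across days is not; use the closing value or an average.
rows_a = [(day, bal) for day, acct, bal in balances if acct == "A"]
closing_balance_a = max(rows_a)[1]  # balance on the latest date (ISO dates sort as text)
average_balance_a = sum(bal for _, bal in rows_a) / len(rows_a)

# Non-additive: a margin is a ratio, so recompute it from additive parts.
sales = [(100, 20), (300, 30)]  # hypothetical (revenue, profit) pairs
overall_margin = sum(p for _, p in sales) / sum(r for r, _ in sales)

print(total_day2, closing_balance_a, average_balance_a, overall_margin)
```

Adding the two individual margins (0.2 and 0.1) would give 0.3, which describes nothing; the recomputed overall margin is 0.125.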
  45. 45. Dimensional Models • A denormalized relational model – Made up of tables with attributes – Relationships defined by keys and foreign keys • Organized for understandability and ease of reporting rather than update. • Queried and maintained by SQL or special purpose management tools. • Star Schemas Versus OLAP Cubes – Dimensional models implemented in relational database management systems are referred to as star schemas because of their resemblance to a star-like structure. – Dimensional models implemented in multidimensional database environments are referred to as online analytical processing (OLAP) cubes. – Both stars and cubes have a common logical design with recognizable dimensions; however, the physical implementation differs
  46. 46. OLAP • OLAP stands for On-Line Analytical Processing • For people on the business side, the key feature of OLAP is that it is multidimensional: it gives the ability to analyze metrics across different dimensions such as time, geography, gender, product, etc. For example, sales for the company are up. - What region is most responsible for this increase? - Which store in this region is most responsible for the increase? - What particular product category contributed the most to the increase? Answering these types of questions in order means that you are performing an OLAP analysis. • In the OLAP world, there are three main types: 1. Multidimensional OLAP (MOLAP) 2. Relational OLAP (ROLAP) 3. Hybrid OLAP (HOLAP), which refers to technologies that combine MOLAP and ROLAP.
  47. 47. MOLAP • This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats. Advantages: • Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations. • Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly. Disadvantages: • Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself. • Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, adopting MOLAP technology is likely to require additional investment in human and capital resources.
  48. 48. MOLAP Operation • Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data. • Here is the list of OLAP operations: 1. Roll-up 2. Drill-down 3. Slice and dice 4. Pivot (rotate)
  49. 49. MOLAP Operation – Roll Up • Roll-up performs aggregation on a data cube in either of the following ways: – By climbing up a concept hierarchy for a dimension – By dimension reduction • The following diagram illustrates how roll-up works: – Roll-up is performed by climbing up a concept hierarchy for the dimension location. – Initially the concept hierarchy was "street < city < province < country". – On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country. – The data is grouped into countries rather than cities. – When roll-up is performed by dimension reduction, one or more dimensions are removed from the data cube.
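Both forms of roll-up can be sketched on a tiny cube stored as a dictionary. The cell values, the city-to-country mapping and the function names are illustrative assumptions; only the dimension names follow the slides.

```python
from collections import defaultdict

# Hypothetical cube cells: (city, quarter, item) -> units sold.
cube = {
    ("Toronto", "Q1", "Mobile"): 10,
    ("Vancouver", "Q1", "Mobile"): 20,
    ("Chicago", "Q1", "Mobile"): 30,
    ("Toronto", "Q2", "Modem"): 5,
    ("Vancouver", "Q2", "Modem"): 7,
}
# One step of the location hierarchy (city < country), assumed for this sketch.
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}


def roll_up_location(cells, hierarchy):
    """Climb the location hierarchy: aggregate city cells into country cells."""
    result = defaultdict(int)
    for (city, quarter, item), units in cells.items():
        result[(hierarchy[city], quarter, item)] += units
    return dict(result)


def roll_up_drop_item(cells):
    """Dimension reduction: remove the item dimension by summing over it."""
    result = defaultdict(int)
    for (city, quarter, _item), units in cells.items():
        result[(city, quarter)] += units
    return dict(result)


by_country = roll_up_location(cube, city_to_country)
without_item = roll_up_drop_item(cube)
print(by_country)
print(without_item)
```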
  50. 50. MOLAP Operation – Drill Down • Drill-down is the reverse operation of roll-up. It is performed by either of the following ways: – By stepping down a concept hierarchy for a dimension – By introducing a new dimension. • The following diagram illustrates how drill-down works: – Drill-down is performed by stepping down a concept hierarchy for the dimension time. – Initially the concept hierarchy was "day < month < quarter < year." – On drilling down, the time dimension is descended from the level of quarter to the level of month. – When drill-down is performed, one or more dimensions from the data cube are added. – It navigates the data from less detailed data to highly detailed data.
  51. 51. MOLAP Operation – Slice • The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the following diagram that shows how slice works. – Here slice is performed for the dimension "time" using the criterion time = "Q1". – It forms a new sub-cube by fixing that dimension to a single value, so the result has one dimension fewer than the original cube.
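A slice on time = "Q1" can be sketched as a filter that also drops the fixed dimension. The cube contents and the function name `slice_time` are assumptions for illustration.

```python
# Hypothetical cube cells: (city, quarter, item) -> units sold.
cube = {
    ("Toronto", "Q1", "Mobile"): 10,
    ("Toronto", "Q2", "Mobile"): 12,
    ("Vancouver", "Q1", "Modem"): 7,
    ("Vancouver", "Q3", "Mobile"): 9,
}


def slice_time(cells, quarter):
    """Fix the time dimension to one value; the result is keyed by (city, item) only."""
    return {(city, item): units
            for (city, q, item), units in cells.items() if q == quarter}


q1_slice = slice_time(cube, "Q1")
print(q1_slice)
```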
  52. 52. MOLAP Operation – Dice • Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following diagram that shows the dice operation. • The dice operation on the cube, based on the following selection criteria, involves three dimensions: – (location = "Toronto" or "Vancouver") – (time = "Q1" or "Q2") – (item = "Mobile" or "Modem")
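The three selection criteria above translate directly into a filter over the cube cells. The cell values and the extra city, quarter and item used to show exclusion are invented for this sketch.

```python
# Hypothetical cube cells: (city, quarter, item) -> units sold.
cube = {
    ("Toronto", "Q1", "Mobile"): 10,
    ("Toronto", "Q3", "Mobile"): 8,     # excluded: quarter not selected
    ("Vancouver", "Q2", "Modem"): 7,
    ("Chicago", "Q1", "Mobile"): 30,    # excluded: city not selected
    ("Vancouver", "Q1", "Phone"): 3,    # excluded: item not selected
}


def dice(cells, cities, quarters, items):
    """Keep cells whose value on every dimension is in the selected set."""
    return {key: units for key, units in cells.items()
            if key[0] in cities and key[1] in quarters and key[2] in items}


sub_cube = dice(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"}, {"Mobile", "Modem"})
print(sub_cube)
```

Unlike a slice, the result keeps all three dimensions; each is just restricted to a subset of its values.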
  53. 53. MOLAP Operation – Pivot • The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of data. Consider the following diagram that shows the pivot operation
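For a two-dimensional view, pivoting amounts to swapping the row and column axes. The nested-dictionary layout and the numbers below are assumptions for illustration.

```python
# Hypothetical 2-D view of the cube: rows are cities, columns are items.
view = {
    "Toronto": {"Mobile": 10, "Modem": 4},
    "Vancouver": {"Mobile": 20, "Modem": 7},
}


def pivot(table):
    """Rotate the axes: items become rows and cities become columns."""
    rotated = {}
    for row_label, row in table.items():
        for col_label, value in row.items():
            rotated.setdefault(col_label, {})[row_label] = value
    return rotated


rotated = pivot(view)
print(rotated)
```

The data is unchanged; only its presentation differs, and pivoting twice returns the original view.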
  54. 54. ROLAP • This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement. Advantages: • Can handle large amounts of data: The data size limitation of ROLAP technology is the data size limitation of the underlying relational database. In other words, ROLAP itself places no limit on the amount of data. • Can leverage functionality inherent in the relational database: Relational databases often already come with a host of features. ROLAP technologies, since they sit on top of the relational database, can leverage them. Disadvantages: • Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) against the relational database, the query time can be long if the underlying data size is large. • Limited by SQL functionality: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building complex out-of-the-box functions into their tools, as well as the ability for users to define their own functions.
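The claim that slicing and dicing map to WHERE clauses can be tried with Python's built-in `sqlite3` module. The `sales` table, its columns and its rows are hypothetical; a real ROLAP tool would generate similar SQL against the warehouse tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, quarter TEXT, item TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("Toronto", "Q1", "Mobile", 10), ("Toronto", "Q1", "Modem", 4),
     ("Vancouver", "Q1", "Mobile", 20), ("Toronto", "Q3", "Mobile", 8),
     ("Chicago", "Q1", "Mobile", 30)],
)

# Slice: one condition on one dimension.
slice_rows = conn.execute(
    "SELECT city, item, SUM(amount) FROM sales "
    "WHERE quarter = ? GROUP BY city, item ORDER BY city, item",
    ("Q1",),
).fetchall()

# Dice: conditions on several dimensions at once.
dice_rows = conn.execute(
    "SELECT city, quarter, item, SUM(amount) FROM sales "
    "WHERE city IN ('Toronto', 'Vancouver') "
    "AND quarter IN ('Q1', 'Q2') AND item IN ('Mobile', 'Modem') "
    "GROUP BY city, quarter, item ORDER BY city, quarter, item"
).fetchall()

print(slice_rows)
print(dice_rows)
conn.close()
```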
  55. 55. Types of Relational Dimensional Models • Star Schema • Snowflake Schema • Fact Centipede Schema • Fact Constellation Schema
  56. 56. Star Schema
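The slide image is not in this transcript, so here is a minimal star schema sketch using Python's built-in `sqlite3`: one fact table surrounded by date, store and product dimensions, as in the Sales_Amount example earlier. All table names, column names and rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT, city TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    store_key INTEGER REFERENCES dim_store(store_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    sales_amount REAL
);
INSERT INTO dim_date VALUES (1, '2024-01-01', '2024-01'), (2, '2024-01-02', '2024-01');
INSERT INTO dim_store VALUES (1, 'Store A', 'Bangalore'), (2, 'Store B', 'Mumbai');
INSERT INTO dim_product VALUES (1, 'Grill', 'Outdoor'), (2, 'Patio chair', 'Outdoor'),
                               (3, 'Plastic fork', 'Utensils');
INSERT INTO fact_sales VALUES (1, 1, 1, 200.0), (1, 2, 2, 50.0),
                              (2, 1, 3, 5.0), (2, 2, 1, 180.0);
""")

# A typical star join: filter and group by dimension attributes, sum the fact.
rows = conn.execute("""
    SELECT p.category, s.city, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_store s ON s.store_key = f.store_key
    GROUP BY p.category, s.city
    ORDER BY p.category, s.city
""").fetchall()
print(rows)
conn.close()
```

Every dimension is one join away from the fact table, which is what gives the schema its star shape.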
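  57. 57. Snowflake Schema Same as the star schema, but the dimension tables are normalized (split into multiple related tables)
  58. 58. Fact Centipede Schema Every dimension table is connected directly to the fact table; the name comes from the large number of dimensions hanging off a single fact table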
  59. 59. Fact Constellation Schema • For each star schema it is possible to construct a fact constellation schema (for example, by splitting the original star schema into several star schemas, each of which describes facts at a different level of the dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables. • The main shortcoming of the fact constellation schema is a more complicated design, because many variants for particular kinds of aggregation must be considered and selected. Moreover, the dimension tables are still large.
  60. 60. HOLAP • HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
  61. 61. Difference between ERD & Dimensional Model ERD: • One table per entity • Minimize data redundancy • Optimize update / insert • The transaction processing model Dimensional Model: • One fact table for data organization • Maximize understandability • Optimized for retrieval • The data warehousing model
  62. 62. Choosing the Data Mart / Dimensional Design Process 1. Select the business process 2. Declare the grain 3. Identify the dimensions 4. Identify the facts
  63. 63. Business Process • Business processes are the operational activities performed by your organization, such as taking an order, registering students etc. • It is important to determine the identity of the transaction table and specify exactly what it represents. • Represents a process or reporting environment that is of value to the organization
  64. 64. Grain (unit of analysis) • Atomic grain refers to the lowest level at which data is captured by a given business process • The grain determines what each fact record represents: the level of detail • For example – Individual transactions – Snapshots (points in time) – Line items • Generally better to focus on the smallest grain
  65. 65. Dimensions • A table (or hierarchy of tables) connected with the fact table with keys and foreign keys • Preferably single valued for each fact record (1:m) • Connected with surrogate (generated) keys, not operational keys • Dimension tables contain text or numeric attributes
  66. 66. Facts • Normally numeric keys and additive measures • Measurements associated with fact table records at fact table granularity • Non-key attributes in the fact table • Attributes in dimension tables are constants; facts vary with the granularity of the fact table