2. Overview
• What is Business Intelligence?
• Why Business Intelligence?
• Data analysis problems
• Data Warehouse (DW) introduction
• DW topics
• Multidimensional modeling
• ETL
• Performance optimization
3. What is Business Intelligence (BI)?
From Encyclopedia of Database Systems:
“[BI] refers to a set of tools and techniques that
enable a company to transform its business
data into timely and accurate information for the
decisional process, to be made available to the
right persons in the most suitable form.”
4. BI vs AI?
BI is different from Artificial Intelligence (AI)
• AI systems make decisions for the users
• BI systems help the users make the right decisions, based
on available data
Combination of technologies
• Data Warehousing (DW)
• On-Line Analytical Processing (OLAP)
• Data Mining (DM)
• ……
5. Why is BI Important?
• Worldwide BI revenue in 2005 = US$ 5.7 billion
• 10% growth each year
• A market where players like IBM, Microsoft, Oracle,
and SAP compete and invest
• BI is not only for large enterprises
• Small and medium-sized companies can also benefit
from BI
• The financial crisis has increased the focus on BI
• You cannot afford not to use the “gold” in your data
6. BI and the Web
• The Web makes BI even more useful
• Customers do not appear “physically” in a store; their behaviors cannot be observed by traditional methods
• A website log is used to capture the behavior of each customer, e.g., the sequence of pages seen by a customer and the products viewed
• Idea: understand your customers using data and BI!
• Utilize website logs, analyze customer behavior in more detail than before
(e.g., what was not bought?)
• Combine web data with traditional customer data
8. Case Study of an Enterprise
• Example of a chain (e.g., fashion stores or car dealers)
• Each store maintains its own customer records and sales records
• Hard to answer questions like: “find the total sales of Product X from stores in town
Y”
• The same customer may be viewed as different customers by different stores
• Hard to detect duplicate customer information
• Imprecise or missing data in the addresses of some customers
• Purchase records are maintained in the operational system for a limited time (e.g., 6 months)
• Then they are deleted or archived
• The same “product” may have different prices, or different discounts in different
stores
• Can you see the problems of using those data for business analysis?
9. Data Analysis Problems
• The same data found in many different systems
• Example: customer data across different stores and departments
• The same concept is defined differently
• Heterogeneous sources
• Relational DBMS, On-Line Transaction Processing (OLTP)
• Unstructured data in files (e.g., MS Word)
• Legacy systems
• …
10. Data Analysis Problems (cont’)
• Data is suited for operational systems
• Accounting, billing, etc.
• Do not support analysis across business functions
• Data quality is bad
• Missing data, imprecise data, different use of systems
• Data are “volatile”
• Data deleted in operational systems (6 months)
• Data change over time – no historical information
12. Data Warehousing
• Solution: new analysis environment (DW) where data are
• Subject oriented (versus function oriented)
• Integrated (logically and physically)
• Time variant (data can always be related to time)
• Stable (data not deleted, several versions)
• Supporting management decisions (different organization)
• Data from the operational systems are
• Extracted
• Cleansed
• Transformed
• Aggregated
• Loaded into the DW
• A good DW is a prerequisite for successful BI
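The five steps listed above (extract, cleanse, transform, aggregate, load) can be sketched as a small pipeline. This is only an illustration: the two store sources, their field names, and the dict standing in for the warehouse are assumptions, not part of the slides.

```python
from collections import defaultdict

# Hypothetical raw rows from two store systems (illustrative data only).
store_a = [{"product": "Laptop ", "town": "Jeddah", "amount": "700"},
           {"product": "mobile", "town": "Jeddah", "amount": "100"}]
store_b = [{"product": "LAPTOP", "town": "Riyadh", "amount": "150"},
           {"product": "Mobile", "town": "Riyadh", "amount": None}]  # missing value

def extract(*sources):
    """Extract: pull rows from every operational source."""
    for source in sources:
        yield from source

def cleanse(rows):
    """Cleanse: drop rows whose amount is missing."""
    return (r for r in rows if r["amount"] is not None)

def transform(rows):
    """Transform: unify product spelling and convert amounts to integers."""
    for r in rows:
        yield {"product": r["product"].strip().title(),
               "town": r["town"],
               "amount": int(r["amount"])}

def aggregate(rows):
    """Aggregate: total sales per (product, town)."""
    totals = defaultdict(int)
    for r in rows:
        totals[(r["product"], r["town"])] += r["amount"]
    return dict(totals)

def load(totals, warehouse):
    """Load: write the aggregated facts into the warehouse (a dict here)."""
    warehouse.update(totals)
    return warehouse

warehouse = load(aggregate(transform(cleanse(extract(store_a, store_b)))), {})
```

After the run, `warehouse` holds one total per (product, town) pair, so a question such as "total sales of Product X from stores in town Y" becomes a single lookup.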
13. DW: Purpose and Definition
• DW is a store of information organized in a unified data model
• Data collected from a number of different sources
• Finance, billing, website logs, personnel, …
• Purpose of a data warehouse (DW): support decision making
• Easy to perform advanced analysis
• Ad-hoc analysis and reports
• Data mining: discovery of hidden patterns and trends
14. DB vs DW
• DB: database with a 2-D schema, built for application functions; OLTP: online transaction processing
• DW: database optimised for analysis, with an x-D (multidimensional) schema; OLAP: online analytical processing
17. Process of Query-Driven Approach
This approach was used to build wrappers and integrators on top of
multiple heterogeneous databases. These integrators are also known as
mediators.
1. When a query is issued at the client side, a metadata dictionary translates the query into an appropriate form for the individual heterogeneous sites involved.
2. These queries are mapped and sent to the local query processors.
3. The results from heterogeneous sites are integrated into a global
answer set.
19. Update-Driven Approach
• Today's data warehouse systems follow the update-driven approach.
• In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse.
• This information is available for direct querying and analysis.
• High performance
• Data is copied, integrated, cleaned, annotated, and restructured in a semantic store in advance
• Querying does not require an interface to process data at local sources
21. (Figure from edureka; labels below)
• Data source
• DM: data mart, a mini data warehouse limited in scope
• ETL: Extract/Transform/Load
• Metadata
22. Data Mart
• Data marts contain a subset of organization-wide data that is valuable
to specific groups of people in an organization.
• A data mart contains only the data that are specific to a particular group.
• For example, the marketing data mart may contain only data related
to items, customers, and sales.
• Data marts are confined to subjects.
23. Data Mart (contd.)
• Windows-based or Unix/Linux-based servers are used to implement data
marts.
• They are implemented on low-cost servers.
• The implementation cycle of a data mart is measured in short periods of
time, i.e., in weeks rather than months or years.
• The life cycle of data marts may be complex in the long run, if their
planning and design are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is a departmentally structured data warehouse.
• Data marts are flexible.
24. Metadata
• Metadata is simply defined as data about data.
• Data that are used to represent other data are known as metadata.
• For example, the index of a book serves as metadata for the contents of the book.
• We can say that metadata is the summarized data that leads us to the
detailed data.
25. Metadata
In terms of a data warehouse, we can define metadata as follows:
• Metadata is a roadmap to the data warehouse.
• Metadata in a data warehouse defines the warehouse objects.
• Metadata acts as a directory.
• This directory helps the decision support system to locate the
contents of a data warehouse.
30. What is OLTP?
• Online transaction processing, known as OLTP for short.
• Supports transaction-oriented applications in a 3-tier architecture.
• OLTP administers the day-to-day transactions of an organization.
• The primary objective is data processing and not data analysis.
31. Example: ATM
• Assume that a couple has a joint account with a bank.
• One day both reach different ATMs at precisely the same time, and each wants to withdraw the total amount present in their bank account.
• However, the person who completes the authentication process first will be able to get the money.
• In this case, the OLTP system makes sure that the withdrawn amount will never be more than the amount present in the bank account. The key point to note here is that OLTP systems are optimized for transaction processing rather than data analysis.
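The guarantee in the ATM example can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not how a bank system is built: the `account` table, the balance of 1000, and the `withdraw` helper are assumptions, and the two withdrawals run one after the other rather than truly concurrently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO account VALUES (1, 1000)")  # hypothetical joint account
conn.commit()

def withdraw(conn, account_id, amount):
    """Withdraw inside one transaction; return True only if funds sufficed."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE account SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, account_id, amount),
        )
        # The UPDATE touches a row only when the balance covers the amount.
        return cur.rowcount == 1

first = withdraw(conn, 1, 1000)   # first partner empties the account
second = withdraw(conn, 1, 1000)  # second partner is refused
balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
```

Putting the balance check inside the `UPDATE` itself means the check and the change happen as one atomic step, which is the kind of small, frequent update an OLTP system is tuned for.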
32. What is OLAP?
• Online Analytical Processing.
• A category of software tools which provide analysis of data for
business decisions.
• OLAP systems allow users to analyze database information from
multiple database systems at one time.
• The primary objective is data analysis and not data processing.
33. Example
• Any data warehouse system is an OLAP system.
• Uses of OLAP are as follows:
• A company might compare its mobile phone sales in September with sales in October, then compare those results with another location, which may be stored in a separate database.
• Amazon analyzes purchases by its customers to come up with a personalized homepage with products that are likely to interest each customer.
34. Why not use the existing databases (OLTP) for business analysis?
35. Hard/Infeasible Queries for OLTP
• Business analysis queries
• In the past five years, which product is the most profitable?
• On which public holiday do we have the largest sales?
• In which week do we have the largest sales?
• Do the sales of dairy products increase over time?
• Difficult to express these queries in SQL
• 3rd query: may extract the “week” value using a function
• But the user has to learn many transformation functions …
• 4th query: use a “special” table to store IDs of all dairy products, in advance
• There can be many different dairy products; there can be many other
product types as well …
• The need of multidimensional modeling
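To illustrate the point about the 3rd query, here is a small sketch of what "extracting the week using a function" looks like when only raw sale dates are stored. The sales rows are invented for illustration, and ISO week numbering via `date.isocalendar()` is one possible choice of transformation function.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (sale date, amount) rows as an operational system might store them.
sales = [(date(2009, 1, 1), 500), (date(2009, 1, 2), 700),
         (date(2009, 1, 5), 100), (date(2009, 1, 6), 300),
         (date(2009, 1, 7), 900)]

def sales_per_week(rows):
    """Derive the ISO (year, week) from each date, then total the sales per week."""
    totals = defaultdict(int)
    for day, amount in rows:
        iso_year, iso_week, _weekday = day.isocalendar()
        totals[(iso_year, iso_week)] += amount
    return dict(totals)

weekly = sales_per_week(sales)
best_week = max(weekly, key=weekly.get)  # week with the largest sales
```

Every analysis that needs weeks has to repeat this derivation; storing the week as an attribute of a Time dimension removes that burden from the user.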
36. Multidimensional Modeling
• Example: sales of supermarkets
• Facts and measures
• Each sales record is a fact, and its sales value is a measure
• Dimensions
• Group correlated attributes into the same dimension, which makes analysis tasks easier
• Each sales record is associated with its values of Product, Store, Time
37. Multidimensional Modeling
• How do we model the Time
dimension?
• Hierarchies with multiple levels
• Attributes, e.g., holiday, event
• Advantage of this model?
• Easy for query (more about this
later)
• Disadvantage?
• Data redundancy (but controlled
redundancy is acceptable)
tid | day              | Day # | Week # | Month # | Year | Work day | …
1   | January 1st 2009 | 1     | 1      | 1       | 2009 | No       | …
2   | January 2nd 2009 | 2     | 1      | 1       | 2009 | Yes      | …
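A Time dimension like the one above is usually generated once rather than typed in. The following is a minimal sketch under stated assumptions: the column names, the `HOLIDAYS` set, the Monday-to-Friday working week, and the use of ISO week numbers are all illustrative choices.

```python
from datetime import date, timedelta

# Assumed holiday calendar; a real one would come from the business.
HOLIDAYS = {date(2009, 1, 1)}

def build_time_dimension(start, days):
    """Return one row per calendar day with precomputed hierarchy levels."""
    rows = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        rows.append({
            "tid": offset + 1,               # surrogate key
            "day": d.isoformat(),
            "day_no": d.day,
            "week_no": d.isocalendar()[1],   # ISO week number
            "month_no": d.month,
            "year": d.year,
            # Assumption: Monday to Friday are workdays unless a holiday.
            "workday": d.weekday() < 5 and d not in HOLIDAYS,
        })
    return rows

time_dim = build_time_dimension(date(2009, 1, 1), 4)
```

Because week, month, and workday are stored per row, a query such as "which week has the largest sales" becomes a plain grouping on `week_no`, at the cost of the controlled redundancy mentioned above.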
38. Quick Review: Normalized Database
• Normalized database avoids
• Redundant data
• Modification anomalies
• How to get the original table? (join them)
• No redundancy in OLTP, controlled redundancy in OLAP
39. DW: Not in 3NF!
• Denormalized data
• Data redundancy
• More storage but data more quickly queried
• DW provides aggregate data which is in suitable format for decision
making
40. OLTP vs OLAP
                       | OLTP               | OLAP
Target                 | Operational needs  | Business analysis
Model                  | Normalized         | Denormalized/multidimensional
Data                   | Small, operational | Large, historical
Query language         | SQL                | Not unified, mostly MDX
Query                  | Small              | Large
Updates                | Frequent and small | Infrequent and batch
Transactional recovery | Necessary          | Not necessary
Optimized for          | Updates            | Queries
41. Data Cube
• A data cube helps us represent data in multiple dimensions.
• It is defined by dimensions and facts.
• The dimensions are the entities with respect to which an enterprise
preserves the records.
43. Example
• Suppose Souq wants to keep track of sales records with the help of
sales data warehouse with respect to time, item, branch, and
location.
• These dimensions make it possible to keep track of monthly sales and of the branch at which the items were sold.
• There is a table associated with each dimension. This table is known
as dimension table.
• For example, "item" dimension table may have attributes such as
item_name, item_type, and item_brand.
44. Location = Jeddah
Time | Item type: Entertainment | Laptop | Mobile | Sport
Q1   | 500                      | 700    | 100    | 300
Q2   | 700                      | 700    | 300    | 400
Q3   | 100                      | 800    | 180    | 600
Q4   | 200                      | 900    | 200    | 200
2-D view of sales data for a company with respect to the time and item dimensions, with the location fixed to Jeddah.
How do we represent another dimension, say the location (Riyadh, Taif, etc.)?
45. (3-D view: Time, Item type, Location)
Location | Time | Item type: Entertainment | Laptop | Mobile | Sport
Jeddah   | Q1   | 500                      | 700    | 100    | 300
Jeddah   | Q2   | 700                      | 700    | 300    | 400
Jeddah   | Q3   | 100                      | 800    | 180    | 600
Jeddah   | Q4   | 200                      | 900    | 200    | 200
Riyadh   | Q1   | 100                      | 150    | 100    | 400
Riyadh   | Q2   | 300                      | 300    | 600    | 500
Riyadh   | Q3   | 200                      | 400    | 500    | 300
Riyadh   | Q4   | 100                      | 200    | 100    | 200
Taif     | Q1   | 400                      | 300    | 250    | 120
Taif     | Q2   | 1000                     | 100    | 309    | 345
Taif     | Q3   | 300                      | 200    | 399    | 764
Taif     | Q4   | 400                      | 100    | 400    | 897
A 3-D table can be represented as a 3-D data cube.
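One simple way to hold such a cube in code is a mapping from coordinates to the measure. The sketch below uses the Q1 and Q2 figures from the tables above; the dict layout and the `cell` helper are illustrative assumptions, not a standard cube API.

```python
ITEMS = ("Entertainment", "Laptop", "Mobile", "Sport")

# Sales per (time, location), in ITEMS order; Q1 and Q2 only.
RAW = {
    ("Q1", "Jeddah"): (500, 700, 100, 300),
    ("Q1", "Riyadh"): (100, 150, 100, 400),
    ("Q1", "Taif"):   (400, 300, 250, 120),
    ("Q2", "Jeddah"): (700, 700, 300, 400),
    ("Q2", "Riyadh"): (300, 300, 600, 500),
    ("Q2", "Taif"):   (1000, 100, 309, 345),
}

# The cube: one cell per (time, item type, location) coordinate.
cube = {
    (time, item, location): sales
    for (time, location), values in RAW.items()
    for item, sales in zip(ITEMS, values)
}

def cell(time, item, location):
    """Look up the sales measure at one point of the cube."""
    return cube[(time, item, location)]
```

For instance, `cell("Q1", "Laptop", "Jeddah")` returns the 700 shown in the Jeddah table.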
48. OLAP Data Cube: wrap up
• Data cube
• Useful data analysis tool in DW
• Generalized GROUP BY queries
• Aggregate facts based on chosen
dimensions
• Why data cube?
• Good for visualization (i.e., text results
hard to understand)
• Multidimensional, intuitive
• Support interactive OLAP operations
49. Cube operations
Four types of analytical operations in OLAP are:
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
50. Roll-up
• Roll-up is also known as "consolidation" or "aggregation."
• The Roll-up operation can be performed in 2 ways
• Reducing dimensions
• Climbing up concept hierarchy.
• Concept hierarchy is a system of grouping things based on their order or level.
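Both ways of rolling up can be sketched on a small excerpt of the sales facts. The Laptop figures come from the sales tables in these slides; the quarter-to-half-year hierarchy (`HALF_YEAR`) is a hypothetical concept hierarchy added for illustration.

```python
from collections import defaultdict

# (quarter, item type, city, sales): an excerpt of the sales facts.
facts = [
    ("Q1", "Laptop", "Jeddah", 700), ("Q1", "Laptop", "Riyadh", 150),
    ("Q2", "Laptop", "Jeddah", 700), ("Q2", "Laptop", "Riyadh", 300),
    ("Q3", "Laptop", "Jeddah", 800), ("Q3", "Laptop", "Riyadh", 400),
]

# Hypothetical concept hierarchy: quarter -> half-year.
HALF_YEAR = {"Q1": "H1", "Q2": "H1", "Q3": "H2", "Q4": "H2"}

def roll_up_drop_city(rows):
    """Roll-up by reducing dimensions: aggregate the city dimension away."""
    totals = defaultdict(int)
    for quarter, item, _city, sales in rows:
        totals[(quarter, item)] += sales
    return dict(totals)

def roll_up_to_half_year(rows):
    """Roll-up by climbing the hierarchy: replace quarter by half-year."""
    totals = defaultdict(int)
    for quarter, item, city, sales in rows:
        totals[(HALF_YEAR[quarter], item, city)] += sales
    return dict(totals)

by_quarter = roll_up_drop_city(facts)
by_half_year = roll_up_to_half_year(facts)
```

In both cases the result has fewer, coarser cells than the input, which is what "consolidation" refers to.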
52. Drill-down
• In drill-down, data is broken down into smaller, more detailed parts. It is the opposite of the roll-up process. It can be done via
• Moving down the concept hierarchy
• Adding a dimension
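Drill-down only works if the detailed data is kept. The sketch below assumes monthly Laptop sales for Jeddah are stored; the monthly split is invented for illustration (the first three months are chosen to add up to the 700 reported for Q1 in the sales table), and the month-to-quarter mapping is an assumed hierarchy.

```python
# Hypothetical monthly detail (Laptop, Jeddah); Jan to Mar sum to the Q1 total of 700.
monthly = {"Jan": 200, "Feb": 250, "Mar": 250, "Apr": 300}

# Assumed concept hierarchy: month -> quarter.
QUARTER_OF = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

def quarterly_view(detail):
    """The coarse view the analyst starts from (a roll-up of the months)."""
    view = {}
    for month, sales in detail.items():
        quarter = QUARTER_OF[month]
        view[quarter] = view.get(quarter, 0) + sales
    return view

def drill_down(detail, quarter):
    """Move down the hierarchy: show the months behind one quarter."""
    return {m: s for m, s in detail.items() if QUARTER_OF[m] == quarter}

coarse = quarterly_view(monthly)
q1_detail = drill_down(monthly, "Q1")
```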
53. Slice
• One dimension is selected, and a new sub-cube is created.
• Example: fixing Location = Jeddah in the sales cube gives the 2-D table of Time by Item type.
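A slice can be sketched as a filter that fixes one dimension to a single value and drops it from the coordinates. The cube excerpt uses figures from the sales tables; the `AXES` tuple and the `slice_cube` helper are illustrative assumptions.

```python
AXES = ("time", "item", "location")

# Cube cells keyed by (time, item type, location); an excerpt.
cube = {
    ("Q1", "Laptop", "Jeddah"): 700, ("Q1", "Mobile", "Jeddah"): 100,
    ("Q2", "Laptop", "Jeddah"): 700, ("Q2", "Mobile", "Jeddah"): 300,
    ("Q1", "Laptop", "Riyadh"): 150, ("Q1", "Mobile", "Riyadh"): 100,
    ("Q2", "Laptop", "Riyadh"): 300, ("Q2", "Mobile", "Riyadh"): 600,
}

def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and drop it from the keys."""
    i = AXES.index(axis)
    return {key[:i] + key[i + 1:]: sales
            for key, sales in cube.items() if key[i] == value}

jeddah = slice_cube(cube, "location", "Jeddah")
```

The result has one dimension fewer than the input: `jeddah` is keyed by (time, item type) only.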
55. Dice
• This operation is similar to a slice.
• The difference is that in dice you select two or more dimensions, which results in the creation of a sub-cube.
56. Example: dice for Time = Q1 and Location = Jeddah or Riyadh
[Figure: 3-D sales data cube with axes Location (Jeddah, Riyadh, Taif), Item type (Entertainment, Laptop, Mobile, Sport), and Time (Q1 to Q3)]
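The dice in this example (Time = Q1 and Location = Jeddah or Riyadh) can be sketched as a filter over two dimensions at once. The Laptop figures come from the sales tables; the `AXES` tuple and the `dice` helper are illustrative assumptions.

```python
AXES = ("time", "item", "location")

# Cube cells keyed by (time, item type, location); Laptop excerpt.
cube = {
    ("Q1", "Laptop", "Jeddah"): 700, ("Q1", "Laptop", "Riyadh"): 150,
    ("Q1", "Laptop", "Taif"): 300,   ("Q2", "Laptop", "Jeddah"): 700,
    ("Q2", "Laptop", "Riyadh"): 300, ("Q2", "Laptop", "Taif"): 100,
}

def dice(cube, **allowed):
    """Keep cells whose coordinates fall in the allowed set for every named axis."""
    checks = [(AXES.index(axis), set(values)) for axis, values in allowed.items()]
    return {key: sales for key, sales in cube.items()
            if all(key[i] in values for i, values in checks)}

sub_cube = dice(cube, time={"Q1"}, location={"Jeddah", "Riyadh"})
```

Unlike the slice, the result keeps all three dimensions; it is a smaller cube rather than a lower-dimensional table.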
57. Pivot
• In pivot, you rotate the data axes to provide an alternative presentation of the data.
[Figure: the sales data cube before and after rotation, with axes Time (Q1 to Q3), Item type (Entertainment, Laptop, Mobile, Sport), and Location (Jeddah, Riyadh, Taif)]
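For a 2-D view, pivoting amounts to swapping rows and columns. The sketch below rotates the Jeddah table (quarters by item types) from the earlier slide; the list-of-lists layout and the `pivot` helper are illustrative assumptions.

```python
ITEMS = ["Entertainment", "Laptop", "Mobile", "Sport"]
QUARTERS = ["Q1", "Q2", "Q3", "Q4"]

# Location = Jeddah: one row per quarter, one column per item type.
by_quarter = [
    [500, 700, 100, 300],
    [700, 700, 300, 400],
    [100, 800, 180, 600],
    [200, 900, 200, 200],
]

def pivot(table):
    """Swap the two axes: rows become columns and vice versa."""
    return [list(column) for column in zip(*table)]

by_item = pivot(by_quarter)  # now one row per item type, one column per quarter
laptop_row = dict(zip(QUARTERS, by_item[ITEMS.index("Laptop")]))
```

No value changes; only the presentation does, and pivoting twice restores the original layout.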
58. OLAP Structures
• MOLAP (multidimensional): implements operations on multidimensional data.
• ROLAP (relational): an extended RDBMS that maps multidimensional data onto standard relational operations.
• HOLAP (hybrid): the aggregated totals are stored in a multidimensional database while the detailed data is stored in the relational database. This offers both the data efficiency of the ROLAP model and the performance of the MOLAP model.
59. Advantages of OLAP
• OLAP is a platform for all types of business tasks, including planning, budgeting, reporting, and analysis.
• Information and calculations are consistent in an OLAP cube. This is a crucial benefit.
• Quickly create and analyze "what if" scenarios.
• Easily search the OLAP database for broad or specific terms.
• OLAP provides the building blocks for business modeling tools, data mining tools, and performance reporting tools.
• Allows users to slice and dice cube data by various dimensions, measures, and filters.
• It is good for analyzing time series.
• Finding some clusters and outliers is easy with OLAP.
• It is a powerful online analytical processing system for visualization, which provides faster response times.
60. Disadvantages of OLAP
• OLAP requires organizing data into a star or snowflake schema. These
schemas are complicated to implement and administer
• You cannot have a large number of dimensions in a single OLAP cube
• Transactional data cannot be accessed with OLAP system.
• Any modification in an OLAP cube needs a full update of the cube.
This is a time-consuming process
62. Multidimensional schemas?
• Multidimensional schema is especially designed to model data
warehouse systems.
• The schemas are designed to address the unique needs of very large
databases designed for the analytical purpose (OLAP).
63. Types of Data Warehouse Schema
• Star Schema
• Snowflake Schema
• Galaxy Schema
65. What is a Star Schema?
• The star schema is the simplest type of Data Warehouse schema.
• aka Star Join Schema.
• It is optimized for querying large data sets.
• In the Star Schema:
• the center of the star can have one fact table and a number of associated
dimension tables.
• It is known as star schema as its structure resembles a star.
69. Characteristics of Star Schema
• Every dimension in a star schema is represented by only one dimension table.
• The dimension table should contain the set of attributes.
• The dimension table is joined to the fact table using a foreign key.
• The dimension tables are not joined to each other.
• The fact table contains keys and measures.
• The star schema is easy to understand and provides optimal disk usage.
• The dimension tables are not normalized.
• The schema is widely supported by BI Tools
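The characteristics above can be made concrete with a tiny star schema in SQLite, driven from Python's built-in `sqlite3` module. All table names, column names, item names, and the year are illustrative assumptions; the sales figures echo the Laptop and Mobile values from the earlier tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension; dimension tables are not joined to each other.
CREATE TABLE dim_item  (item_id INTEGER PRIMARY KEY, item_name TEXT, item_type TEXT);
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_time  (time_id INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
-- Fact table: one foreign key per dimension plus the measure.
CREATE TABLE fact_sales (
    item_id  INTEGER REFERENCES dim_item(item_id),
    store_id INTEGER REFERENCES dim_store(store_id),
    time_id  INTEGER REFERENCES dim_time(time_id),
    sales    INTEGER
);
INSERT INTO dim_item  VALUES (1, 'Laptop A', 'Laptop'), (2, 'Phone B', 'Mobile');
INSERT INTO dim_store VALUES (1, 'Jeddah'), (2, 'Riyadh');
INSERT INTO dim_time  VALUES (1, 'Q1', 2009), (2, 'Q2', 2009);
INSERT INTO fact_sales VALUES (1, 1, 1, 700), (2, 1, 1, 100),
                              (1, 2, 1, 150), (1, 1, 2, 700);
""")

# Each dimension is reached from the fact table with a single join.
rows = conn.execute("""
    SELECT s.city, i.item_type, SUM(f.sales)
    FROM fact_sales f
    JOIN dim_item  i ON i.item_id  = f.item_id
    JOIN dim_store s ON s.store_id = f.store_id
    GROUP BY s.city, i.item_type
    ORDER BY s.city, i.item_type
""").fetchall()
```

Note that `item_type` is repeated for every item of the same type inside `dim_item`; this is the denormalization that keeps the query down to one join per dimension.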
72. What is a Snowflake Schema?
• A Snowflake Schema is an extension of a Star Schema, and it adds
additional dimensions.
• It is called snowflake because its diagram resembles a Snowflake.
• The dimension tables are normalized which splits data into additional
tables.
76. Characteristics of Snowflake Schema
• The main benefit of the snowflake schema is that it uses less disk space.
• It is easier to implement when a dimension is added to the schema.
• Query performance is reduced because of the multiple tables.
• The primary challenge with the snowflake schema is that it requires more maintenance effort because of the additional lookup tables.
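The trade-off can be seen in a small sketch where the item dimension is normalized into a lookup table. As before, this uses Python's built-in `sqlite3` module, and all table names, column names, and item names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized dimension: the item type moves to its own lookup table.
CREATE TABLE dim_item_type (type_id INTEGER PRIMARY KEY, type_name TEXT);
CREATE TABLE dim_item (
    item_id   INTEGER PRIMARY KEY,
    item_name TEXT,
    type_id   INTEGER REFERENCES dim_item_type(type_id)
);
CREATE TABLE fact_sales (
    item_id INTEGER REFERENCES dim_item(item_id),
    sales   INTEGER
);
INSERT INTO dim_item_type VALUES (1, 'Laptop'), (2, 'Mobile');
INSERT INTO dim_item VALUES (1, 'Laptop A', 1), (2, 'Laptop B', 1), (3, 'Phone C', 2);
INSERT INTO fact_sales VALUES (1, 700), (2, 150), (3, 100);
""")

# Reaching the type name now takes two joins instead of one.
rows = conn.execute("""
    SELECT t.type_name, SUM(f.sales)
    FROM fact_sales f
    JOIN dim_item      i ON i.item_id = f.item_id
    JOIN dim_item_type t ON t.type_id = i.type_id
    GROUP BY t.type_name
    ORDER BY t.type_name
""").fetchall()
```

Each type name is stored once, which saves space, but every query by type pays for the extra join and there is one more table to maintain.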
77. Star vs snowflake
Star Schema | Snowflake Schema
It contains a fact table surrounded by dimension tables. | It contains a fact table surrounded by dimension tables.
A single join creates the relationship between the fact table and any dimension table. | Requires many joins to fetch the data.
Denormalized data structure; queries run faster. | Normalized data structure (dimensions); queries run slower.
Redundancy; easy maintenance. | No redundancy; difficult maintenance.
Good for data marts. | Good for the warehouse core.
79. What is fact constellation?
• It is a collection of multiple fact tables having some common
dimension tables.
• It can be viewed as a collection of several star schemas and hence,
also known as Galaxy schema.
• It is one of the widely used schemas for data warehouse design, and it is much more complex than the star and snowflake schemas.
• For complex systems, we require fact constellations.
81. Characteristics of Galaxy Schema
• The dimensions in this schema are split into separate dimensions based on the various levels of hierarchy.
• For example, if geography has four levels of hierarchy like region, country, state, and city, then the Galaxy schema should have four dimensions.
• Moreover, it is possible to build this type of schema by splitting one star schema into several star schemas.
• The dimensions in this schema are large, since they are built based on the levels of hierarchy.
• This schema is helpful for aggregating fact tables for better understanding.
82. Wrap up
• A data warehouse is a special DB that integrates data from multiple DBs
• DW is useful for analytical processing
• DW is built through ETL
• DW may contain many data marts
• OLAP is based on a multidimensional schema: the data cube
• There are 3 schema types: star, snowflake, and galaxy