The document discusses topics related to data warehousing. It covers:
1. The key components involved in getting data into a data warehouse, which include extraction, transformation, cleansing, loading, and summarization of data.
2. An overview of the main components of a data warehouse architecture, including source data, data staging, data storage, information delivery, metadata management, and control components.
3. Various topics to be covered related to data warehousing, such as data marts, ERP, knowledge management, and customer relationship management.
What is Data Mining? Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data.
Description of four techniques for Data Cleaning:
1.DWCLEANER Framework
2.Data Mining Techniques include Association Rule and Functional Dependecies
,...
Data Warehousing is a data architecture that separates reporting and analytics needs from operational transaction systems. This presentation is an introduction into traditional data warehousing architectures and how to determine if your environment requires a data warehouse.
● Data Modeling and Data Models.
● Business Rules (Translating Business Rules into Data Model Components).
● Emerging Data Models: Big Data and NoSQL.
● Degrees of Data Abstraction (External, Conceptual, Internal and Physical model).
Understanding the difference between Data, information and knowledgeNeeti Naag
In decision making process it is very important to use past and present data. This presentation will help in understanding what is data, how it is converted to information and how information becomes knowledge.
What is Data Mining? Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data.
Description of four techniques for Data Cleaning:
1.DWCLEANER Framework
2.Data Mining Techniques include Association Rule and Functional Dependecies
,...
Data Warehousing is a data architecture that separates reporting and analytics needs from operational transaction systems. This presentation is an introduction into traditional data warehousing architectures and how to determine if your environment requires a data warehouse.
● Data Modeling and Data Models.
● Business Rules (Translating Business Rules into Data Model Components).
● Emerging Data Models: Big Data and NoSQL.
● Degrees of Data Abstraction (External, Conceptual, Internal and Physical model).
Understanding the difference between Data, information and knowledgeNeeti Naag
In decision making process it is very important to use past and present data. This presentation will help in understanding what is data, how it is converted to information and how information becomes knowledge.
History, definition, need, attributes, applications of data warehousing ; difference between data mining, big data, database and data warehouse ; future scope
Big data is the term for any gathering of information sets, so expensive and complex, that it gets to be hard to process for utilizing customary information handling applications. The difficulties incorporate investigation, catch, duration, inquiry, sharing, stockpiling, Exchange, perception, and protection infringement. To reduce spot business patterns, anticipate diseases, conflict etc., we require bigger data sets when compared with the smaller data sets. Enormous information is hard to work with utilizing most social database administration frameworks and desktop measurements and perception bundles, needing rather enormously parallel programming running on tens, hundreds, or even a large number of servers. In this paper there was an observation on Hadoop architecture, different tools used for big data and its security issues.
Chapter 13The Data WarehouseBLCN-534 Fundamentals o.docxketurahhazelhurst
Chapter 13
The Data Warehouse
BLCN-534: Fundamentals of Database Systems
Chapter ObjectivesCompare the data needs of transaction processing systems with those of decision support systems. Describe the data warehouse concept and list its main features. Compare the enterprise data warehouse with the data mart. Design a data warehouse. Build a data warehouse, including the steps of data extraction, data cleaning, data transformation, and data loading. Describe how to use a data warehouse with online analytic processing and data mining.
13-*
Chapter Objectives
List the types of expertise needed to administer a data warehouse.
List the challenges in data warehousing.
13-*
Application SystemsTransaction Processing Systems (TPS)Everyday application systems that support banking and insurance operations, manage the parts inventory on manufacturing assembly lines, keep track of airline and hotel reservations, support Web-based sales, etc.
Decision Support Systems (DSS)specifically designed to aid managers in decision-making tasks.
13-*
The Data Warehouse Concept
A data warehouse is a broad-based, shared database for management decision making that contains data that has been accumulated over time.
Formally, a database warehouse is, “a subject oriented, integrated, non-volatile, and time variant collection of data in support of management’s decisions.”
13-*
Characteristics of
Data Warehouse DataThe data is subject orientedThe data is integratedThe data is non-volatileThe data is time variantThe data must be high qualityThe data may be aggregatedThe data is often denormalizedThe data is not necessarily absolutely current
13-*
The Data is Subject Oriented
Data warehouses are organized around subjects, really the major entities of concern in the business environment.Sales, customers, orders, claims, accounts, employees, other entities that are central to the company’s business.
13-*
The Data is Integrated
Data about each of the subjects in the data warehouse is typically collected from several of the company’s transactional databases, each of which supports one or more applications that have something to do with the particular subject.
All of the data about a subject must be organized or integrated in such a way that it provides a unified, overall picture of all the important details about the subject over time.
Data from disparate application databases must be transformed into common measurements, codes, data types.
13-*
The Data is Non-Volatile
Once data is added to the data warehouse, it doesn’t change.
It will never change. Changing it would be like going back and rewriting history.
13-*
The Data is Time Variant
Data warehouse data, with its historic nature, always includes some kind of a timestamp.
If we are storing sales data on a weekly or monthly basis and we have accumulated ten years of such historic data, each weekly or monthly sales figure must be accompanied by a timestamp indicating the week or month (and ...
Data Warehouse – Introduction, characteristics, architecture, scheme and modelling, Differences between operational database systems and data warehouse.
Modern Integrated Data Environment - Whitepaper | QuboleVasu S
A whit-paper is about building a modern data platform for data driven organisations with using cloud data warehouse with modern data platform architecture
https://www.qubole.com/resources/white-papers/modern-integrated-data-environment
2. Topics to be covered
1. Getting data into the Data warehouse.
2. Extraction
3. Transformation
4. Cleansing
5. Loading
6. Summarization
7. Meta data
8. Data Marts
9. Data warehousing & ERP
10. Data warehousing & KM
11. Data warehousing & CRM
2/12/2013 2.Data warehouse/D.S.Jagli 2
3. A Sales Manager wants to know….
Which are our
lowest/highest margin
customers ?
Who are my customers
What is the most and what products
effective distribution are they buying?
channel?
What product prom- Which customers
-otions have the biggest are most likely to go
impact on revenue? to the competition ?
What impact will
new products/services
have on revenue
and margins?
2/12/2013 2.Data warehouse/D.S.Jagli 3
4. Data, Data everywhere
yet ... I can’t find the data I need
data is scattered over the network.
many versions, slight differences.
I can’t get the data I need
need an expert to get the data.
I can’t understand the data I found
available data poorly documented.
I can’t use the data I found
results are unexpected.
data needs to be transformed from one
format to other.
2/12/2013 2.Data warehouse/D.S.Jagli 4
5. What is a Data Warehousing?
A process of transforming data
into information and making it
available to users in a timely
enough manner to make a
difference
[Forrester Research, April 1996]
2/12/2013 2.Data warehouse/D.S.Jagli 5
6. What is a Data warehouse
A data warehouse is a repository (collection of resources that
can be accessed to retrieve information) of an organization's
electronically stored data, designed to facilitate reporting and
analysis.
In simple form data warehouse is a collection of large amount
of data.
A single, complete and consistent store of data obtained from a
variety of different sources made available to end users in
what they can understand and use in a business context.
[Barry Devlin]
2/12/2013 2.Data warehouse/D.S.Jagli 6
7. What are the users saying...
Data should be integrated across the enterprise
Summary data has a real value to the
organization
Historical data holds the key to understanding
data over time
2/12/2013 2.Data warehouse/D.S.Jagli 7
8. Data Warehouse
A data warehouse is a
1. Subject-oriented
2. Integrated
3. Time-varying
4. Non-volatile
collection of data that is used primarily in organizational
decision making.
-- Bill Inmon, Building the Data Warehouse
2/12/2013 2.Data warehouse/D.S.Jagli 8
9. Subject Oriented: Data is organized by business topic not
by customer_id
Data is Integrated and Loaded by Subject
Cust 1996
1996
Prod D/W
Data
1997
Order
1998
payt
2/12/2013 2.Data warehouse/D.S.Jagli 9
10. Integrate: Integrated means that data are stored as a single
unit not as a collection of files that may have different structures
.
Eg:
Operatio nal S ystems
Order Pro cessing Order ID = 10
D /W
A cco unts Receivable Order ID = 12
Order ID = 16
Pro duct Management Order ID = 8
HR S y stem S ex = M/F
D /W
Pay ro ll S ex = 1/2
S ex = M/F
Pro duct Management S ex = 0/1
2/12/2013 2.Data warehouse/D.S.Jagli 10
11. Time Variant: it means that a time dimension is explicitly
included in the data so that trends and changes over time can
be studied .
Data elements won’ t change i.e a single ,complete and
consistent store of data obtained from a variety of sources and
made available to end users in a way they can understand and
use in business context.
Eg: Designated Time Frame (3 - 10 Years)
Key Includes Date
2/12/2013 2.Data warehouse/D.S.Jagli 11
12. Non-Volatile: It means that the data don’t keep changing, new
data may be added on a scheduled basic but old data aren’t
discarded
Data Warehouse
• No Data Update
Load Read
Read
Read
Read
2/12/2013 2.Data warehouse/D.S.Jagli 12
13. These definitions are quite different on the surface .yet
several common threads emerge from reading them as
group:
1. A DWH containing a lot of data some of the data comes
from operational source and outside.
2. A DWH is organized so as to facilitate the use of this data for
decision making purpose.
3. The DWH provides tools by which end users can access this
data
2/12/2013 2.Data warehouse/D.S.Jagli 13
14. ODS Vs DW
ODS DATAWARE HOUSE
1. Stores current operational data 1. Stores historic and current data
2. Perform numerous quick and 2. Perform complex queries on
simple queries on small large amounts of data
amounts of data
3. Contains small amount of 3. contains large amounts of static
transactional data data
4. Day To Day Decisions, 4. Long-Term Decisions,
5. Current operational results, 5. Strategic reporting,
6. Tactical reporting, 6. Trend detection
7. Near to Normal form 7. Star Schema
8. Frequency of Load: Twice 8. Frequency of Load: Weekly,
Daily , Daily, Weekly Monthly, Quarterly
14
2/12/2013 2.Data warehouse/D.S.Jagli 14
15. OLTP vs. Data Warehouse
OLTP systems are tuned for known transactions and workloads
while workload is not known a priori in a data warehouse.
Special data organization, access methods and implementation
methods are needed to support data warehouse queries.
e.g. average amount spent on phone calls between 9AM-5PM in Pune
during the month of December.
2/12/2013 2.Data warehouse/D.S.Jagli 15
16. OLTP vs. Data Warehouse
OLTP Warehouse (DSS)
Application Oriented Subject Oriented
Used to run business Used to analyze business
Detailed data Summarized and refined
Current up to date Snapshot data
Isolated Data Integrated Data
Repetitive access Ad-hoc access
Clerical User Knowledge User (Manager)
2/12/2013 2.Data warehouse/D.S.Jagli 16
17. Who uses DWHs?
Ideal DWH user is something like this:
1. This person’s job involves drawing conclusions from ,making decisions
based on, large masses of data
2. This person also doesn’t want to access a database in a highly technical
fashion
3. Any person who wants to analyze data and take a decision
Eg: Tourists: Browse information harvested by farmers
Farmers: Harvest information from known access paths
2/12/2013 2.Data warehouse/D.S.Jagli 17
18. Data Warehouse Architecture
Source Data
Data mining
Data Storage
Data Staging
Relational Information
Databases Delivery
1. EXTRACT Optimized Loader
2. TRANSFORM
3. CLEAN
ERP 4. LOAD
5. REFRESH Report/User Query
Systems
Data Warehouse
Engine
Analyze
Purchased Query
Data OLAP
Meta Data
Legacy
Data Metadata Repository
2/12/2013 2.Data warehouse/D.S.Jagli 18
19. Overview of the Components
1. Source Data Component
2. Data Staging Component
3. Data Storage Component
4. Information Delivery Component
5. Metadata Component
6. Management and Control Component
2/12/2013 2.Data warehouse/D.S.Jagli 19
20. Overview of the Components
1. Source Data Component:
1. Production Data : data comes from operational systems of the
Enterprise.
2. Internal Data: in every organization keeps some private data.
3. Archived Data: it comes from back ups.
2/12/2013 2.Data warehouse/D.S.Jagli 20
21. Topics to be covered
1. Getting data into the Data warehouse.
2. Extraction
3. Transformation
4. Cleansing
5. Loading
6. Summarization
7. Meta data
8. Data Marts
9. Data warehousing & ERP
10. Data warehousing & KM
11. Data warehousing & CRM
2/12/2013 2.Data warehouse/D.S.Jagli 21
22. Overview of the Components: Getting data into
the DWH
2.Data Staging Component
5 steps of data staging
1. Extraction
2. Transformation
3. Cleansing
4. Loading
5. Summarization
2/12/2013 2.Data warehouse/D.S.Jagli 22
23. Overview of the Components: Getting data
into the DWH
Data staging component
1. Extracting
Capture of relevant data from operational source in “as is” status
Sources for data generally in legacy mainframes in, IMS, DB2; more data
today in relational databases on Unix.
2. Transformation:
In the case Multiple input sources to a data warehouse, inconsistency
can sometimes make data unusable.
Transformation is the process of dealing with these inconsistencies
Eg: Use of different names/formats
Mumbai, Bombay
Cust_id, C_id
dd/mm/yy, mm/dd/yy
2/12/2013 2.Data warehouse/D.S.Jagli 23
24. Overview of the Components: Getting data into
the DWH
Data staging component
3. Cleansing
It is necessary to go through data entered into DWH and make it
error free, this process is called Data cleansing.
This include missing data , incorrect data in one source, inconsistent
data and conflicting data when two or more sources are involved.
Clean data is vital for the success of the warehouse
Eg:Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri,
etc. are the same person
2/12/2013 2.Data warehouse/D.S.Jagli 24
25. Overview of the Components: Getting data into the
DWH
Data staging component
4. Loading
• It implies physical movement of data from the computers storing the
source databases to that which will store the data warehouse.
• Most common channel for the data movement process is a high-speed
communication link.
• It is always necessary to close off access to the DWH when the
loading is taking place.
2/12/2013 2.Data warehouse/D.S.Jagli 25
26. Overview of the Components: Getting data into the
DWH
Data staging component
5. Summarization :
In which any desired summaries of the data warehouse data are
precalculated for later use
Once the DWH database has been loaded it is possible to create
summaries.
Eg: base customer (1985-87)
custid, from date, to date, name, phone, dob
base customer (1988-90)
custid, from date, to date, name, credit rating, employer
customer activity (1986-89) -- monthly summary
customer activity detail (1987-89)
custid, activity date, amount, clerk id, order no
2/12/2013 2.Data warehouse/D.S.Jagli 26
27. Overview of the Components: Getting data into the DWH
Data staging component
• Summarized data stored : advantages
1. reduce storage costs
2. reduce CPU usage
3. increases performance since smaller number of records to be
processed
4. design around traditional high level reporting needs
2/12/2013 2.Data warehouse/D.S.Jagli 27
28. Overview of the Components
Data storage Component
The data storage for data warehouse is a separate repository
Propagate updates on source data to the warehouse
periodically (e.g., every night, every week) or after
significant events.
refresh policy set by administrator based on user needs and
traffic.
possibly different policies for different sources.
This function is time consuming.
2/12/2013 2.Data warehouse/D.S.Jagli 28
29. Overview of the Components
Information delivery component
• In order to provide information to the wide community
of data warehouse users the information delivery
component includes different methods.
• E g: Online ,intranet, internet, email.
2/12/2013 2.Data warehouse/D.S.Jagli 29
30. Overview of the Components
Meta data component
Meta data are data that describe data in the data warehouse
Administrative metadata
1. source databases and their contents
2. gateway descriptions
3. warehouse schema, view & derived data definitions
4. dimensions, hierarchies
5. pre-defined queries and reports
6. data mart locations and contents
7. data partitions
8. data extraction, cleansing, transformation rules
9. data refresh
10. user profiles, user groups
11. security: user authorization, access control
2/12/2013 2.Data warehouse/D.S.Jagli 30
31. Meta data component
Business meta data
business terms and definitions
ownership of data
operational metadata
data lineage: history of migrated data and sequence of
transformations applied
currency of data: active, archived,
monitoring information: warehouse usage statistics, error
reports, audit trails.
End- user meta data
2/12/2013 2.Data warehouse/D.S.Jagli 31
32. Overview of the Components
Management and control component
This component of the data warehouse sits on the top of all
other components. it coordinates services and activities with
within the DWH
It interacts with meta data component to perform the
management and control functions
2/12/2013 2.Data warehouse/D.S.Jagli 32
33. Content of DWH Data
Operational data to warehouse data
4 Data levels:
1. Operational data Query processing
2. Atomic data
3. Summary data Summary data
4. Query processing
Atomic data
Operational data
Eg1: My checking account balance right now is “ 1200/-Rs”
Eg2: My checking account balance at the end of January was “25000/-Rs”
Eg3: At the end of January 2011 we have 12500 customers
Eg4: Our customers in Mumbai zone grew by 10% during last three months
2/12/2013 2.Data warehouse/D.S.Jagli 33
35. Data mart
Data mart
Data mart a subset of a data warehouse that supports the requirements of
particular department or business function.
Data mart focuses on only the requirements of users associated with one
department or business function.
Data marts do not normally contain detailed operational data, unlike data
warehouses.
As data marts contain less data compared with data warehouses, data marts
are more easily understood and navigated.
2/12/2013 2.Data warehouse/D.S.Jagli 35
36. From the Data Warehouse to Data Marts
Information
Individually Less
Structured
Departmentally History
Structured Normalized
Detailed
Organizationally More
Structured Data Warehouse
Data 36
37. Reasons for creating a data mart
1. To give users access to the data they need to analyze most often
2. To provide data in a form that matches the collective view of the data by a
group of users in a department or business function
3. To improve end-user response time due to the reduction in the volume of
data to be accessed
4. To provide appropriately structured data as dictated by the requirements of
end-user access tools
5. data cleansing, loading, transformation, and integration are far easier, and
hence implementing and setting up a data mart is simpler.
6. The cost of implementing data marts is normally less
7. The potential users of a data mart are more clearly defined and can be
more easily targeted to obtain support for a data mart project
2/12/2013 2.Data warehouse/D.S.Jagli 37
38. Data Warehouse and Data Marts
OLAP
Data Mart
Lightly summarized
Departmentally structured
Organizationally structured
Atomic
Detailed Data Warehouse Data
38
39. Characteristics of the Departmental Data Mart
1. OLAP
2. Small
3. Flexible
4. Customized by Department
5. Source is departmentally
structured data warehouse
39
40. Techniques for Creating Departmental Data
Mart
OLAP
Sales Finance Mktg. Subset
Summarized
Superset
Indexed
Arrayed
40
44. Bill Inman Vs Ralph Kimball
Top -down Vs Bottom -up w.r.t Data Mart
2/12/2013 2.Data warehouse/D.S.Jagli 44
45. A Comparison DM vs DWH
DATA MART DATAWARE HOUSE
Limited subject areas (one or two) Many subject areas
One data mart for one business Enterprise repository with multiple
theme/ subject area data marts
Has limited dimensions and Has all dimensions and measures
measures depends on the subject area required
Time to build is short Long time activity
Can be built as smaller scale DW Large scale
2/12/2013 2.Data warehouse/D.S.Jagli 45