20IT501 – Data Warehousing and
Data Mining
III Year / VI Semester
OBJECTIVES
Be familiar with the concepts of data warehouse
Know the fundamentals of data mining
Understand the importance of association rule mining
Understand the techniques of classification and
clustering
Be aware of the recent trends of data mining
UNIT I- DATA WAREHOUSING
Data Warehouse – Data warehousing Components – Building a
Data warehouse – Mapping the Data Warehouse to a
Multiprocessor Architecture – DBMS Schemas for Decision
Support. Business Analysis: Reporting and Query tools and
Applications - Online Analytical Processing (OLAP) – Need –
Multidimensional Data Model – OLAP Guidelines -
Categories of OLAP Tools.
Data Warehouse
A data warehouse is a subject-oriented, integrated,
time-variant and non-volatile collection of data
in support of management's decision-making
process.
Data Warehouse
Subject Oriented:
 It provides information organized around a subject rather than around the
organization's ongoing operations.
 These subjects can be product, customers, suppliers, sales,
revenue, etc.
 A data warehouse does not focus on ongoing operations;
rather, it focuses on the modeling and analysis of
data for decision making.
Data Warehouse
Integrated:
 A data warehouse is constructed by integrating
data from heterogeneous sources such as relational
databases, flat files, etc.
 This integration enhances the effective analysis of
data.
Data Warehouse
Time Variant:
 The data collected in a data warehouse is
identified with a particular time period.
 The data in a data warehouse provides information
from the historical point of view.
Data Warehouse
Non-volatile
Non-volatile means the previous data is not erased
when new data is added.
 A data warehouse is kept separate from the operational
database, and therefore frequent changes in the operational
database are not reflected in the data warehouse.
Data Warehouse - Terminologies
 Current detail data: It is organized along
subject lines (customer profile data, sales data).
 Data Mart: Summarized departmental data,
customized to suit the needs of the
particular department that owns the data.
Data Warehouse - Terminologies
 Drill-down: To perform business analysis in a top-
down fashion.
 Metadata: Data about data. It contains the location and
description of warehouse system components; the names,
definitions, structure and content of the data warehouse;
and end-user views.
Data Warehouse
Operational data vs. informational data:
 Content: operational data holds current values; informational data is summarized and derived.
 Organization: operational data is organized by application; informational data by subject.
 Stability: operational data is dynamic; informational data is static until refreshed.
 Data structure: operational data is optimized for transactions; informational data is optimized for complex queries.
 Access: operational data is accessed frequently (high); informational data medium to low.
 Response time: operational data sub-second (<1 s) to 2-3 s; informational data several seconds to minutes.
Data Warehouse Components
Data Warehouse Components
1. Data Warehouse Database:
 It is based on a relational database management system server
that functions as the central repository for informational data.
 This repository is surrounded by a number of key components
designed to make the whole environment functional, manageable and
accessible both by the operational systems that source data into
the warehouse and by end-user query and analysis tools.
Data Warehouse Components
2. Sourcing, Acquisition, Clean up, and
Transformation Tools:
 They perform conversions, summarization, key
changes, structural changes and condensation.
 The data transformation is required so that the
information can be used by decision support tools.
Data Warehouse Components
 Sourcing, Acquisition, Clean up, and
Transformation Tools – Functionalities:
 Removing unwanted data from operational databases
 Converting to common data names and attributes
 Calculating summaries and derived data
 Establishing defaults for missing data
 Accommodating source-data definition changes
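A minimal Python sketch of how such a tool might apply the steps listed above to a batch of operational records. The field names, rename map, and default values are assumptions for illustration only, not from any specific product.

```python
# Illustrative clean-up/transformation sketch; field names, the rename map,
# and the defaults are assumptions for this example only.
RENAME = {"cust_nm": "customer_name", "amt": "amount"}   # common data names
DEFAULTS = {"region": "UNKNOWN"}                          # defaults for missing data
UNWANTED = {"temp_flag"}                                  # unwanted operational fields

def transform(record: dict) -> dict:
    """Drop unwanted fields, rename to common names, fill missing values."""
    cleaned = {RENAME.get(k, k): v for k, v in record.items() if k not in UNWANTED}
    for field, default in DEFAULTS.items():
        cleaned.setdefault(field, default)
    return cleaned

def summarize(records: list[dict]) -> dict:
    """Calculate a simple derived summary (total amount per region)."""
    totals: dict = {}
    for r in map(transform, records):
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return totals

source = [
    {"cust_nm": "A", "amt": 100, "region": "South", "temp_flag": 1},
    {"cust_nm": "B", "amt": 250},  # missing region -> default applied
]
print(summarize(source))  # {'South': 100, 'UNKNOWN': 250}
```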
Data Warehouse Components
 Sourcing, Acquisition, Clean up, and
Transformation Tools – Issues:
 Data heterogeneity: It refers to the different ways
the data is defined and used in different modules.
Data Warehouse Components
Meta data:
 It is data about data. It is used for maintaining,
managing and using the data warehouse.
 Types:
 Technical Meta data
 Business Meta data
Data Warehouse Components
 Meta data - Technical Meta data:
Information about data stores
Transformation descriptions
The rules used to perform clean up, and data enhancement
Data mapping operations
Access authorization, backup history, archive history, information
delivery history, data acquisition history, data access, etc.
Data Warehouse Components
Meta data – Business Meta data:
 It describes the information stored in the data warehouse
in terms that users understand.
 Subject areas and information object types, including queries,
reports, images, video and audio clips, etc.
 Internet home pages
 Information related to the information delivery system
Data Warehouse Components
 Meta data:
 Metadata helps users to understand the content and find
the data.
Metadata is stored in a separate data store known as the
information directory or metadata repository.
It helps to integrate, maintain and view the contents of the
data warehouse.
Data Warehouse Components
 Meta data – Characteristics:
 It is the gateway to the data warehouse environment
 It supports easy distribution and replication of content for high
performance and availability
 It should be searchable by business-oriented keywords
 It should support the sharing of information
 It should support and provide an interface to other applications
Data Warehouse Components
Access Tools:
Data query and reporting tools
Application development tools
Executive info system tools (EIS)
OLAP tools
Data mining tools
Data Warehouse Components
Data Marts:
 Departmental subsets that focus on selected subjects.
They are often independent and used by a dedicated user group.
They are used for rapid delivery of enhanced decision-support
functionality to end users.
Data Warehouse Components
 Data warehouse admin and management:
 Security and priority management
 Monitoring updates from multiple sources
 Data quality checks
 Managing and updating meta data
 Auditing and reporting data warehouse usage and status
 Replicating, subsetting and distributing data
 Backup and recovery
Data Warehouse Components
Information delivery system
It enables users to subscribe to data warehouse
information and have it delivered to one or more destinations
according to a specified scheduling algorithm.
Building a Data Warehouse -
Reason
There are two reasons why organizations consider data
warehousing a critical need:
1. Business factors:
 Decisions need to be made quickly and correctly, using all
available data.
 Users are business domain experts, not computer
professionals.
Building a Data Warehouse -
Reason
2. Technological factors:
 The price of computer processing power continues to
decline.
IT infrastructure is changing rapidly: its capacity is
increasing and its cost is decreasing, which makes building a
data warehouse easier.
Network bandwidth is increasing.
Building a Data Warehouse
1. Business considerations:
Approaches
 Top - Down Approach
 Bottom - Up Approach
Building a Data Warehouse
 Approaches: Top - Down Approach
 The aim is to build a centralized repository to house corporate-
wide business data.
 This repository is called Enterprise Data Warehouse
(EDW).
 The data in the EDW is stored in a normalized form in
order to avoid redundancy.
Building a Data Warehouse
 Approaches: Top - Down Approach
 The data in the EDW is stored at the most detailed level.
Building the EDW at the most detailed level provides
Flexibility to be used by multiple departments.
Flexibility to cater for future requirements.
Building a Data Warehouse
 Approaches: Top - Down Approach
 We should implement the top-down approach when
The business has complete clarity on the data warehouse
requirements for all or most subject areas.
The business is ready to invest considerable time and money.
Building a Data Warehouse
 Approaches: Top - Down Approach –
Advantages:
 The data is reliable and consistent across subject areas,
which is very important.
Building a Data Warehouse
 Approaches: Top - Down Approach –
Disadvantages:
 It requires more time and a larger initial investment.
The business has to wait for the EDW and then the data
marts to be built before it can access its reports.
Building a Data Warehouse
 Approaches: Bottom Up Approach
 It is an incremental approach to build a data
warehouse.
 Here we build the data marts separately at
different points of time as and when the specific
subject area requirements are clear.
Building a Data Warehouse
 Approaches: Bottom Up Approach
 The data marts are integrated or combined
together to form a data warehouse.
 Separate data marts are combined through the use
of conformed dimensions and conformed facts.
Building a Data Warehouse
 Approaches: Bottom Up Approach
 We should implement the bottom up approach
when
 We have initial cost and time constraints.
The complete warehouse requirements are not clear; we
have clarity on only one data mart.
Building a Data Warehouse
 Approaches: Bottom Up Approach -
Advantages
 Data marts do not require high initial costs and have a
faster implementation time.
 Hence the business can start using the marts much
earlier than with the top-down approach.
Building a Data Warehouse
 Approaches: Bottom Up Approach -
Disadvantages
 It stores data in a denormalized format; hence
there is high space usage for detailed data.
 There is a tendency not to keep detailed data.
Building a Data Warehouse
2. Design considerations:
Most successful data warehouses that meet these requirements
have these common characteristics:
 Are based on a dimensional model
 Contain historical and current data
 Include both detailed and summarized data
 Consolidate disparate data from multiple sources while retaining
consistency
Building a Data Warehouse
 A data warehouse is difficult to build for the
following reasons:
 Heterogeneity of data sources
 Use of historical data
 Growing nature of the database
Building a Data Warehouse
In addition to general considerations, the following specific points are relevant to
data warehouse design:
 Data content
 The content and structure of the data warehouse are reflected in its data
model.
 The data model is the template that describes how information will be
organized within the integrated warehouse framework.
 The data warehouse data must be detailed data. It must be formatted,
cleaned up and transformed to fit the warehouse data model.
Building a Data Warehouse
Meta data
It defines the location and contents of data in the
warehouse.
Meta data is searchable by users to find definitions
or subject areas.
Building a Data Warehouse
Data Distribution
 The data can be distributed based on the subject
area, location (geographical region), or time
(current, month, year).
Building a Data Warehouse
 Tools
 These tools provide facilities for defining the
transformation and cleanup rules, data movement,
end user query, reporting and data analysis.
Building a Data Warehouse
 Design steps
 Choosing the subject matter
 Deciding what a fact table represents
 Identifying and conforming the dimensions
 Choosing the facts
 Storing pre-calculations in the fact table
 Rounding out the dimension tables
 Choosing the duration of the database
 The need to track slowly changing dimensions
 Deciding the query priorities and query models
Building a Data Warehouse
3. Implementation considerations:
The following logical steps are needed to implement a data
warehouse:
 Collect and analyze business requirements
 Create a data model and a physical design
 Define data sources
 Choose the database Technology and platform
 Extract the data from operational database, transform it, clean it
up and load it into the warehouse
Building a Data Warehouse
 Implementation considerations
Choose database access and reporting tools
Choose database connectivity software
Choose data analysis and presentation software
Update the data warehouse
Building a Data Warehouse
 Access Tools
Simple tabular form data
Ranking data
Multivariable data
Time series data
Graphing, charting and pivoting data
Building a Data Warehouse
 Access Tools
Statistical analysis data
Predefined repeatable queries
Ad hoc user specified queries
Reporting and analysis data
Building a Data Warehouse
 Data extraction, clean up, transformation and
migration – Selection Criteria:
 Timeliness of data delivery to the warehouse
 The tool must be able to identify the particular
data so that it can be read by the conversion tool.
 The tool must have the capability to merge data from
multiple data stores.
Building a Data Warehouse
 User levels - Casual users:
They are most comfortable retrieving information
from the warehouse in predefined formats and running pre-
existing queries and reports.
These users do not need tools that allow for building
standard and ad hoc reports.
Building a Data Warehouse
 User levels - Power Users:
These users can use predefined as well as user-defined
queries to create simple and ad hoc reports.
These users can engage in drill down operations.
These users may have the experience of using
reporting and query tools.
Building a Data Warehouse
 User levels - Expert users:
These users tend to create their own complex
queries and perform standard analysis on the info
they retrieve.
These users have the knowledge about the use of
query and report tools
Building a Data Warehouse
 Benefits of data warehousing:
Locating the right information
Presentation of information
Testing of hypothesis
Discovery of information
Sharing the analysis
Building a Data Warehouse
 Tangible benefits (quantifiable / measurable): These
include
Improvement in product inventory
Decrease in production costs
Improvement in selection of target markets
Enhancement of asset and liability management
Building a Data Warehouse
 Intangible benefits (not easy to quantify): These
include
Improved productivity by keeping all data in a
single location and eliminating re-keying of data
Reduced redundant processing
Enhanced customer relations
Mapping the Data Warehouse to
a Multiprocessor Architecture
 Parallel Query Processing:
The accessing and processing of portions of the database by
individual threads in parallel can greatly improve the
performance of a query.
 Speed up: The ability to execute the same request on the
same amount of data in less time.
 Scale up: The ability to obtain the same performance on
the same kind of requests as the database size increases.
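The two measures above can be written as simple ratios. The notation below is a conventional formulation consistent with these definitions, not taken from a specific source.

```latex
% Conventional definitions of the two measures above.
\[
\text{Speedup} = \frac{T_{1}}{T_{N}}
\qquad
\text{Scaleup} = \frac{T_{\text{small system, small job}}}{T_{\text{N-times larger system, N-times larger job}}}
\]
% Linear speedup: N processors run the same request N times faster (Speedup = N).
% Linear scaleup: N processors handle an N-times larger database in the same
% elapsed time (Scaleup = 1).
```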
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Types of parallelism:
 Inter-query parallelism: Different server threads
or processes handle multiple requests at the same time.
 Intra-query parallelism: This form of parallelism
decomposes a serial SQL query into lower-level
operations such as scan, join and sort. These lower-level
operations are then executed concurrently, in parallel.
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Types of parallelism: Intra query parallelism can
be done in either of two ways:
 Horizontal parallelism: The database is partitioned
across multiple disks, and parallel
processing occurs within a specific task that is
performed concurrently on different processors against
different sets of data.
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Types of parallelism: Intra query parallelism can
be done in either of two ways:
 Vertical parallelism: This occurs among different
tasks. All query components such as scan, join and sort
are executed in parallel in a pipelined fashion. In other
words, the output of one task becomes the input of
another task.
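A small Python sketch of the two forms of intra-query parallelism described above. Threads are used only to illustrate the structure (a real DBMS would use separate processes or processors), and the row data, column names and partition count are invented for the example.

```python
# Hedged sketch of horizontal vs. vertical (pipelined) intra-query parallelism,
# using only the standard library; rows, columns and partitioning are illustrative.
from concurrent.futures import ThreadPoolExecutor
from queue import Queue
from threading import Thread

rows = [{"region": r, "amount": i} for i, r in enumerate(["N", "S", "E", "W"] * 250)]

# Horizontal parallelism: the same scan/filter task runs concurrently
# on different partitions of the data (here, 4 slices of the row list).
def scan_partition(partition):
    return sum(row["amount"] for row in partition if row["region"] == "N")

partitions = [rows[i::4] for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    horizontal_total = sum(pool.map(scan_partition, partitions))

# Vertical parallelism: scan and aggregate run as separate pipelined stages;
# the scan stage's output queue is the aggregate stage's input.
q: Queue = Queue()

def scan_stage():
    for row in rows:
        if row["region"] == "N":
            q.put(row["amount"])
    q.put(None)  # end-of-stream marker

def aggregate_stage(result):
    total = 0
    while (item := q.get()) is not None:
        total += item
    result.append(total)

result: list = []
producer = Thread(target=scan_stage)
consumer = Thread(target=aggregate_stage, args=(result,))
producer.start(); consumer.start()
producer.join(); consumer.join()
assert horizontal_total == result[0]  # both plans compute the same answer
```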
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Data partitioning
 Data partitioning is the key component for effective
parallel execution of database operations.
It spreads data from database tables across multiple
disks so that I/O operations such as reads and writes can
be performed in parallel.
Partitioning can be done randomly or intelligently.
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Data partitioning
 Random partitioning includes random data striping
across multiple disks on a single server. Another
option for random partitioning is round-robin
partitioning, in which each record is placed on the next
disk assigned to the database.
 Since the DBMS does not know where each record resides,
every disk may have to be searched to retrieve it.
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Data partitioning
 Intelligent partitioning assumes that the DBMS knows
where a specific record is located and does not waste
time searching for it across all disks. The main
intelligent partitioning methods are:
 Hash partitioning, key-range partitioning, schema
partitioning and user-defined partitioning.
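A minimal Python sketch contrasting round-robin (random) partitioning with hash and key-range (intelligent) partitioning. The record layout, key names and number of "disks" are assumptions made for the illustration.

```python
# Hedged sketch of round-robin, hash, and key-range partitioning across
# four hypothetical disks; record layout and key names are illustrative.
from itertools import cycle

records = [{"cust_id": i, "name": f"cust{i}"} for i in range(20)]
NUM_DISKS = 4

# Round-robin (random-style): each record goes to the next disk in turn,
# so the DBMS cannot later predict where a given record lives.
rr_disks = [[] for _ in range(NUM_DISKS)]
for disk, rec in zip(cycle(range(NUM_DISKS)), records):
    rr_disks[disk].append(rec)

# Hash partitioning (intelligent): the disk is a function of the key,
# so a lookup by cust_id touches exactly one disk.
hash_disks = [[] for _ in range(NUM_DISKS)]
for rec in records:
    hash_disks[hash(rec["cust_id"]) % NUM_DISKS].append(rec)

# Key-range partitioning (intelligent): contiguous key ranges per disk.
range_disks = [[] for _ in range(NUM_DISKS)]
for rec in records:
    range_disks[min(rec["cust_id"] // 5, NUM_DISKS - 1)].append(rec)

def lookup_hash(cust_id):
    """Only one disk needs to be searched under hash partitioning."""
    disk = hash(cust_id) % NUM_DISKS
    return next(r for r in hash_disks[disk] if r["cust_id"] == cust_id)

print(lookup_hash(7))
```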
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Database architectures for parallel processing:
Shared memory (shared everything) architecture
Shared disk architecture
Shared nothing architecture
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Shared memory or shared everything Architecture:
 Multiple PUs share memory.
 Each PU has full access to all shared memory through a common
bus.
 Communication between nodes occurs via shared memory.
 Performance is limited by the bandwidth of the memory bus.
 It provides a consistent single-system image to the user.
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Shared memory or shared everything
Architecture:
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Shared memory or shared everything
Architecture:
 Threads provide better resource utilization and faster
context switching, thus providing for better scalability.
 At the same time, threads that are too tightly coupled
with the OS may limit RDBMS portability.
 Scalability of shared-memory architectures is limited.
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Shared disk architecture:
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Shared disk architecture:
 Each node consists of one or more PUs and associated
memory.
Memory is not shared between nodes.
Communication occurs over a common high-speed bus.
Each node has access to the same disks and other resources.
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Shared disk architecture:
 It implements the concept of shared ownership of the
entire database between RDBMS servers, each of which is
running on a node of a distributed memory system.
 Each RDBMS server can read, write, update, and remove
data from the same shared database, which requires
the system to implement a form of distributed lock
manager (DLM).
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Shared disk architecture:
 Since the memory is not shared among the nodes, each
node has its own data cache.
Cache consistency must be maintained across the nodes
and a lock manager is needed to maintain the consistency.
 There is additional overhead in maintaining the locks and
ensuring that the data caches are consistent
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Shared disk architecture – Advantages:
 Shared disk systems permit high availability. All data
is accessible even if one node dies.
 These systems have the concept of one database,
which is an advantage over shared nothing systems.
 Shared disk systems provide for incremental growth.
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Shared disk architecture – Disadvantages:
Inter-node synchronization is required, involving DLM
overhead and greater dependency on high-speed
interconnect.
If the workload is not partitioned well, there may be high
synchronization overhead.
There is operating-system overhead in running the shared disk
software.
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Shared Nothing Architecture
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Shared Nothing Architecture:
 Shared nothing systems are typically loosely coupled.
In shared nothing systems only one CPU is connected
to a given disk.
If a table or database is located on that disk, access
depends entirely on the PU which owns it.
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Shared Nothing Architecture – Advantages:
 Shared nothing systems provide for incremental
growth.
 Failure is local: if one node fails, the others stay
up.
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Shared Nothing Architecture – Disadvantages:
 More coordination is required.
 More overhead is required for a process working
on a disk belonging to another node.
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Parallel DBMS Vendors – Oracle:
Architecture: shared disk architecture
Data partition: Key range, hash, round robin
Parallel operations: hash joins, scan and sort
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Parallel DBMS Vendors – Informix:
Architecture: Shared memory, shared disk and
shared nothing models
Data partition: round robin, hash, schema, key
range and user defined
Parallel operations: INSERT, UPDATE, DELETE
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Parallel DBMS Vendors – IBM:
 Architecture: Shared nothing models
 Data partition: hash
 Parallel operations: INSERT, UPDATE,
DELETE, load, recovery, index creation, backup,
table reorganization
Mapping the Data Warehouse to a
Multiprocessor Architecture
 Parallel DBMS Vendors – SYBASE:
Architecture: Shared nothing models
Data partition: hash, key range, Schema
Parallel operations: Horizontal and vertical
parallelism
DBMS Schemas for Decision
Support
A data cube allows data to be modeled and
viewed in multiple dimensions. It is defined by
dimensions and facts.
 In general terms, dimensions are the perspectives
or entities with respect to which an organization
wants to keep records.
DBMS Schemas for Decision
Support
 Example:
Consider a sales data warehouse that keeps records of the
store's sales with respect to the dimensions time, item,
branch, and location.
 Each dimension may have a table associated with it, called
a dimension table, which further describes the dimension.
A dimension table for item may contain the attributes item
name, brand, and type.
DBMS Schemas for Decision
Support
 Example:
A multidimensional data model is typically organized
around a central theme, like sales, for instance. This theme
is represented by a fact table.
Facts are numerical measures.
Examples of facts for a sales data warehouse include
dollars sold (sales amount in dollars), units sold (number of
units sold), and amount budgeted.
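As a small illustration of the example above, the fact rows and a cube view of the dollars-sold measure can be sketched with pandas (assumed installed). The sample rows and figures are invented for the example.

```python
# Hedged sketch of the sales example above; the rows and numbers are invented.
import pandas as pd

fact = pd.DataFrame([
    # time   item            branch  location   dollars_sold  units_sold
    ("Q1", "home theater", "B1", "Chennai", 605, 82),
    ("Q1", "computer",     "B1", "Vellore", 825, 14),
    ("Q2", "home theater", "B2", "Chennai", 680, 95),
    ("Q2", "computer",     "B2", "Vellore", 952, 31),
], columns=["time", "item", "branch", "location", "dollars_sold", "units_sold"])

# A multidimensional view (time x item x location) of the dollars_sold measure:
cube = fact.pivot_table(index="time", columns=["item", "location"],
                        values="dollars_sold", aggfunc="sum", fill_value=0)
print(cube)
```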
DBMS Schemas for Decision
Support
 Example
DBMS Schemas for Decision
Support
 Star schema
 The most popular data model for a data warehouse is the
multidimensional model.
 The data warehouse contains a large central table (the fact table) containing
the bulk of the data, with no redundancy, and
 a set of smaller associated tables (dimension tables), one for each
dimension.
 The dimension tables are displayed in a radial pattern around the central
fact table.
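A minimal star-schema sketch in Python using the standard-library sqlite3 module. The table and column names follow the sales example in this unit but are otherwise illustrative assumptions, not taken from a real warehouse.

```python
# Minimal star-schema sketch using SQLite; names follow the sales example
# in this unit but are otherwise illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time    (time_key     INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE dim_item    (item_key     INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_branch  (branch_key   INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE dim_location(location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);

-- Central fact table: one foreign key per dimension plus the numeric measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    branch_key   INTEGER REFERENCES dim_branch(branch_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")

# A typical star-schema query joins the fact table with a few dimension tables:
query = """
SELECT t.quarter, l.city, SUM(f.dollars_sold)
FROM sales_fact f
JOIN dim_time t     ON f.time_key = t.time_key
JOIN dim_location l ON f.location_key = l.location_key
GROUP BY t.quarter, l.city;
"""
print(conn.execute(query).fetchall())  # empty until rows are loaded
```

If a second fact table (say, shipping) reused these same dimension tables, the result would be the fact constellation (galaxy) schema described later in this unit.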
DBMS Schemas for Decision
Support
 Star schema
DBMS Schemas for Decision
Support
 Star schema – Characteristics:
Simple structure -> easy-to-understand schema
Good query effectiveness -> only a small number of tables to
join
Relatively long load time for dimension tables ->
denormalization and redundant data can make
the tables large.
DBMS Schemas for Decision
Support
 Snowflake schema:
 The snowflake schema is a variant of the star schema
model, where some dimension tables are normalized,
thereby further splitting the data into additional tables.
 The resulting schema graph forms a shape similar to a
snowflake.
DBMS Schemas for Decision
Support
 Snowflake schema:
DBMS Schemas for Decision
Support
 Snowflake schema:
 The major difference between the snowflake and star
schema models is that the dimension tables of the
snowflake model may be kept in normalized form to
reduce redundancies.
 Such a table is easy to maintain and saves storage
space.
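Continuing the same illustrative SQLite sketch, normalizing one dimension shows the snowflake idea. The split of supplier attributes out of the item dimension is an assumption made for the example.

```python
# Snowflake variant of the star-schema sketch above: the item dimension is
# normalized by splitting supplier attributes into their own table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_supplier (
    supplier_key  INTEGER PRIMARY KEY,
    supplier_name TEXT,
    supplier_type TEXT
);

-- dim_item no longer repeats supplier attributes; it references dim_supplier.
CREATE TABLE dim_item (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    brand        TEXT,
    type         TEXT,
    supplier_key INTEGER REFERENCES dim_supplier(supplier_key)
);
""")
# Queries that need supplier attributes now take one extra join
# (fact -> dim_item -> dim_supplier): the usual trade-off of snowflaking.
```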
DBMS Schemas for Decision
Support
Fact constellation
 Sophisticated applications may require multiple
fact tables to share dimension tables.
 This kind of schema can be viewed as a collection
of stars, and hence is called a galaxy schema or a
fact constellation.
DBMS Schemas for Decision
Support
Fact constellation
Business Analysis: Reporting and
Query Tools and Applications
 The principal purpose of data warehousing is to
provide information to business users for strategic
decision making.
 These users interact with the data warehouse using
front-end tools, or by getting the required information
through the information delivery system.
Business Analysis: Reporting and
Query Tools and Applications
 Tool Categories
 Reporting - Companies generate regular operational reports. Example: calculating
and printing paychecks.
 Managed query - Makes it possible for knowledge workers to access corporate
data without IS intervention.
 Executive information systems - Allow developers to build customized,
graphical decision-support applications.
 OLAP - A natural way to view corporate data.
 Data mining - Uses a variety of statistical and artificial-intelligence
algorithms to analyze the correlation of variables.
Business Analysis: Reporting and
Query Tools and Applications
 Cognos Impromptu:
 Impromptu is an interactive database reporting tool.
 It allows Power Users to query data without
programming knowledge.
 When using the Impromptu tool, no data is written or
changed in the database. It is only capable of reading
the data
Business Analysis: Reporting and
Query Tools and Applications
 Cognos Impromptu – Features:
 Interactive reporting capability
Enterprise-wide scalability
Superior user interface
Fastest time to result
Lowest cost of ownership
Business Analysis: Reporting and
Query Tools and Applications
 Catalogs:
 Impromptu stores metadata in subject related folders
 This metadata will be used to develop a query for a report.
The metadata set is stored in a file called a catalog.
 The catalog does not contain any data.
It just contains information about connecting to the
database and the fields that will be accessible for reports.
Business Analysis: Reporting and
Query Tools and Applications
 Catalogs contains:
 Folders—meaningful groups of information representing columns
from one or more tables
 Columns—individual data elements that can appear in one or more
folders
 Calculations—expressions used to compute required values from
existing data
 Conditions—used to filter information so that only a certain type of
information is displayed
Business Analysis: Reporting and
Query Tools and Applications
Prompts—pre-defined selection criteria prompts
that users can include in reports they create
Other components, such as metadata, a logical
database name, join information, and user classes
Business Analysis: Reporting and
Query Tools and Applications
 Catalogs – Uses:
view, run, and print reports
export reports to other applications
disconnect from and connect to the database
create reports
change the contents of the catalog
Business Analysis: Reporting and
Query Tools and Applications
 Reports:
 Reports are created by choosing fields from the
catalog folders. This process builds a SQL
(Structured Query Language) statement behind the
scenes.
 No SQL knowledge is required to use Impromptu.
Business Analysis: Reporting and
Query Tools and Applications
 Reports:
 The data in the report may be formatted, sorted and/or
grouped as needed.
 Titles, dates, headers and footers and other standard text
formatting features (italics, bolding, and font size) are also
available.
 Once the desired layout is obtained, the report can be
saved to a report file.
Business Analysis: Reporting and
Query Tools and Applications
 Frame-Based Reporting
 Frames are the building blocks of all Impromptu reports and
templates.
 They may contain report objects, such as data, text, pictures, and
charts.
 There are no limits to the number of frames that you can place
within an individual report or template. You can nest frames
within other frames to group report objects within a report.
Business Analysis: Reporting and
Query Tools and Applications
 Frame-Based Reporting
 Form frame: An empty form frame appears.
 List frame: An empty list frame appears.
 Text frame: The flashing I-beam appears where you can begin
inserting text.
 Picture frame: The Source tab (Picture Properties dialog box)
appears. You can use this tab to select the image to include in the
frame.
Business Analysis: Reporting and
Query Tools and Applications
 Frame-Based Reporting
Chart frame: The Data tab (Chart Properties dialog box)
appears. You can use this tab to select the data item to
include in the chart.
OLE Object: The Insert Object dialog box appears where
you can locate and select the file you want to insert, or you
can create a new object using the software listed in the
Object Type box.
Business Analysis: Reporting and
Query Tools and Applications
 Impromptu features
 Unified query and reporting interface: It unifies both
query and reporting in a single user interface.
 Object-oriented architecture: It enables inheritance-
based administration so that more than 1,000 users can be
accommodated as easily as a single user.
Business Analysis: Reporting and
Query Tools and Applications
 Impromptu features
Scalability: It scales from a single user to over
1,000 users.
Security and control: Security is based on user
profiles and their classes.
 Data presented in a business context: It presents
information using the terminology of the business.
Business Analysis: Reporting and
Query Tools and Applications
 Impromptu features
Over 70 predefined report templates: Users can
simply supply the data to create an interactive report.
Frame-based reporting: It offers a number of objects to
create a user-designed report.
Business-relevant reporting: It can be used to generate
business-relevant reports through filters, preconditions and
calculations.
Online Analytical Processing
(OLAP)
 It uses database tables (fact and dimension tables) to
enable multidimensional viewing, analysis and
querying of large amounts of data.
 Online Analytical Processing (OLAP) applications and
tools are those that are designed to ask complex
queries of large multidimensional collections of data.
Online Analytical Processing
(OLAP)
Need:
 The need arises from the multidimensional nature of the business
problem.
 These problems are characterized by retrieving a very
large number of records, which can reach gigabytes and
terabytes, and summarizing this data into
information that can be used by business analysts.
Online Analytical Processing
(OLAP)
The Multidimensional Data Model:
 Because OLAP is on-line, it must provide answers quickly;
analysts pose iterative queries during interactive
sessions.
 A natural way to think of the multidimensional data model
is to view the data as a cube.
Online Analytical Processing
(OLAP)
 The Multidimensional Data Model – Operations:
 Roll-up: The roll-up operation (also called the drill-up
operation by some vendors) performs aggregation on a
data cube, either by climbing up a concept hierarchy
for a dimension or by dimension reduction.
Online Analytical Processing
(OLAP)
 The Multidimensional Data Model –
Operations:
Online Analytical Processing
(OLAP)
 The Multidimensional Data Model – Operations:
 Drill-down: Drill-down is the reverse of roll-up. It
navigates from less detailed data to more detailed data.
 Drill-down can be realized by either stepping down a
concept hierarchy for a dimension or introducing
additional dimensions.
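A small pandas sketch of roll-up and drill-down on sales data like the example used earlier in this unit. The location hierarchy (city below country) and the figures are invented for the illustration.

```python
# Roll-up and drill-down sketched with pandas groupby; data is invented.
import pandas as pd

sales = pd.DataFrame({
    "country": ["India", "India", "India", "India"],
    "city":    ["Chennai", "Chennai", "Vellore", "Vellore"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "dollars_sold": [605, 680, 825, 952],
})

# Detailed level: totals per city and quarter.
by_city = sales.groupby(["city", "quarter"])["dollars_sold"].sum()

# Roll-up: climb the location hierarchy from city to country
# (equivalently, drop the city dimension and aggregate).
rolled_up = sales.groupby(["country", "quarter"])["dollars_sold"].sum()

# Drill-down: the reverse move, from the country level back to city detail.
drilled_down = by_city

print(rolled_up)
print(drilled_down)
```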
Online Analytical Processing
(OLAP)
 The Multidimensional Data Model –
Operations:
Online Analytical Processing
(OLAP)
 The Multidimensional Data Model –
Operations:
 Slice and dice: The slice operation performs a
selection on one dimension of the given cube,
resulting in a subcube.
Online Analytical Processing
(OLAP)
 The Multidimensional Data Model –
Operations:
 The dice operation defines a subcube by
performing a selection on two or more dimensions.
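A short pandas sketch of slice and dice as selections on the same kind of invented sales data; the column names and values are assumptions for the example.

```python
# Slice and dice sketched as selections on invented sales data.
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Chennai", "Chennai", "Vellore", "Vellore"],
    "item":    ["computer", "phone", "computer", "phone"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "dollars_sold": [605, 680, 825, 952],
})

# Slice: a selection on ONE dimension (quarter = 'Q1') gives a subcube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: a selection on TWO OR MORE dimensions (quarter and item).
dice = sales[(sales["quarter"] == "Q1") & (sales["item"] == "computer")]

print(slice_q1)
print(dice)
```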
Online Analytical Processing
(OLAP)
 The Multidimensional Data Model –
Operations:
 Pivot (also called rotate) is a visualization
operation that rotates the data axes in view to
provide an alternative data presentation.
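A brief pandas sketch of the pivot (rotate) operation: the same data is presented with rows and columns swapped. The sample values are invented.

```python
# Pivot (rotate) sketched with pandas: swap which dimensions label the rows
# and columns of the presented view; data is an invented sample.
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Chennai", "Chennai", "Vellore", "Vellore"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "dollars_sold": [605, 680, 825, 952],
})

view = sales.pivot_table(index="city", columns="quarter", values="dollars_sold")
pivoted = view.T  # rotate: quarters now label the rows, cities the columns
print(view)
print(pivoted)
```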
OLAP vs OLTP
 Source of data: OLTP - operational data (OLTP systems are the original source of the data); OLAP - consolidated data that comes from the various OLTP databases.
 Purpose of data: OLTP - to control and run fundamental business tasks; OLAP - to help with planning, problem solving, and decision support.
 What the data reveals: OLTP - a snapshot of ongoing business processes; OLAP - multi-dimensional views of various kinds of business activities.
 Queries: OLTP - relatively standardized and simple queries returning relatively few records; OLAP - often complex queries involving aggregations.
OLAP vs OLTP
 Processing speed: OLTP - typically very fast; OLAP - depends on the amount of data involved.
 Space requirements: OLTP - can be relatively small if historical data is archived; OLAP - larger, due to aggregation structures and historical data.
 Database design: OLTP - highly normalized, with many tables; OLAP - typically de-normalized, with fewer tables, using star and/or snowflake schemas.
 Backup and recovery: OLTP - data loss is likely to entail significant monetary loss and legal liability; OLAP - instead of regular backups, some environments may simply reload the OLTP data as a recovery method.
Online Analytical Processing
(OLAP)
 Categories of OLAP Tools – MOLAP:
 This is the more traditional way of OLAP analysis.
In MOLAP, data is stored in a multidimensional cube.
The storage is not in a relational database but in
proprietary formats.
That is, data is stored in array-based structures.
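A hedged sketch of the array-based storage idea using NumPy (assumed installed): the cube is a dense array indexed by dimension positions rather than a set of relational rows. The dimension values and figures are invented.

```python
# MOLAP-style array storage sketched with NumPy; values are invented.
import numpy as np

times, items, locations = ["Q1", "Q2"], ["computer", "phone"], ["Chennai", "Vellore"]
cube = np.zeros((len(times), len(items), len(locations)))

# Loading a cell is direct array indexing rather than a table insert.
cube[times.index("Q1"), items.index("computer"), locations.index("Chennai")] = 605
cube[times.index("Q2"), items.index("phone"), locations.index("Vellore")] = 952

# Aggregations become reductions along an axis, e.g. roll up over locations:
print(cube.sum(axis=2))  # totals per (time, item)
```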
Online Analytical Processing
(OLAP)
 Categories of OLAP Tools – ROLAP
This methodology relies on manipulating the data
stored in the relational database to give the appearance
of traditional OLAP slicing-and-dicing functionality.
In essence, each slicing-and-dicing action is
equivalent to adding a "WHERE" clause to the SQL
statement. The data is stored in relational tables.
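A minimal Python sketch of the ROLAP idea above: each slice or dice predicate becomes another WHERE condition on a relational fact table. The table and column names (sales_fact, quarter, city) are illustrative assumptions, not from a real product.

```python
# Hedged ROLAP sketch: slice/dice filters translate into WHERE conditions.
def rolap_query(measures, filters):
    """Build a parameterized SQL string for the given slice/dice filters."""
    select = ", ".join(measures)
    where = " AND ".join(f"{col} = :{col}" for col in filters)
    sql = f"SELECT {select} FROM sales_fact"
    if where:
        sql += f" WHERE {where}"
    return sql, filters

# Slicing on one dimension (quarter), then dicing on two (quarter and city):
print(rolap_query(["SUM(dollars_sold)"], {"quarter": "Q1"}))
print(rolap_query(["SUM(dollars_sold)"], {"quarter": "Q1", "city": "Chennai"}))
```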
Online Analytical Processing
(OLAP)
 Categories of OLAP Tools – HOLAP (MQE: Managed Query
Environment)
 HOLAP technologies attempt to combine the advantages of MOLAP
and ROLAP.
 For summary-type information, HOLAP leverages cube technology for
faster performance.
 It stores only the indexes and aggregations in the multidimensional
form while the rest of the data is stored in the relational database.
20IT501_DWDM_PPT_Unit_I.ppt

More Related Content

Similar to 20IT501_DWDM_PPT_Unit_I.ppt

Data warehouse concepts
Data warehouse conceptsData warehouse concepts
Data warehouse conceptsobieefans
 
11667 Bitt I 2008 Lect4
11667 Bitt I 2008 Lect411667 Bitt I 2008 Lect4
11667 Bitt I 2008 Lect4ambujm
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSINGKing Julian
 
Data warehouse
Data warehouseData warehouse
Data warehouseRajThakuri
 
Chapter 2-data-warehousingppt2517 vero
Chapter 2-data-warehousingppt2517 veroChapter 2-data-warehousingppt2517 vero
Chapter 2-data-warehousingppt2517 veroangshuman2387
 
Informatica and datawarehouse Material
Informatica and datawarehouse MaterialInformatica and datawarehouse Material
Informatica and datawarehouse Materialobieefans
 
Dwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDhilsath Fathima
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing conceptspcherukumalla
 
Data Mining & Data Warehousing
Data Mining & Data WarehousingData Mining & Data Warehousing
Data Mining & Data WarehousingAAKANKSHA JAIN
 
Dataware housing
Dataware housingDataware housing
Dataware housingwork
 
Data warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-designData warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-designSarita Kataria
 
DWDM Unit 1 (1).pptx
DWDM Unit 1 (1).pptxDWDM Unit 1 (1).pptx
DWDM Unit 1 (1).pptxSalehaMariyam
 

Similar to 20IT501_DWDM_PPT_Unit_I.ppt (20)

Data warehouse concepts
Data warehouse conceptsData warehouse concepts
Data warehouse concepts
 
11667 Bitt I 2008 Lect4
11667 Bitt I 2008 Lect411667 Bitt I 2008 Lect4
11667 Bitt I 2008 Lect4
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
Unit 3 part 2
Unit  3 part 2Unit  3 part 2
Unit 3 part 2
 
DW 101
DW 101DW 101
DW 101
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
DATA WAREHOUSING.2.pptx
DATA WAREHOUSING.2.pptxDATA WAREHOUSING.2.pptx
DATA WAREHOUSING.2.pptx
 
Chapter 2-data-warehousingppt2517 vero
Chapter 2-data-warehousingppt2517 veroChapter 2-data-warehousingppt2517 vero
Chapter 2-data-warehousingppt2517 vero
 
Informatica and datawarehouse Material
Informatica and datawarehouse MaterialInformatica and datawarehouse Material
Informatica and datawarehouse Material
 
Dwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousing
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing concepts
 
Data mining notes
Data mining notesData mining notes
Data mining notes
 
Data Mining & Data Warehousing
Data Mining & Data WarehousingData Mining & Data Warehousing
Data Mining & Data Warehousing
 
Dataware housing
Dataware housingDataware housing
Dataware housing
 
Data Warehouse
Data WarehouseData Warehouse
Data Warehouse
 
Data wirehouse
Data wirehouseData wirehouse
Data wirehouse
 
Data warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-designData warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-design
 
DWDM Unit 1 (1).pptx
DWDM Unit 1 (1).pptxDWDM Unit 1 (1).pptx
DWDM Unit 1 (1).pptx
 
Chapter 2
Chapter 2Chapter 2
Chapter 2
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 

More from SamPrem3

20IT501_DWDM_U5.ppt
20IT501_DWDM_U5.ppt20IT501_DWDM_U5.ppt
20IT501_DWDM_U5.pptSamPrem3
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.pptSamPrem3
 
20IT501_DWDM_U3.ppt
20IT501_DWDM_U3.ppt20IT501_DWDM_U3.ppt
20IT501_DWDM_U3.pptSamPrem3
 
20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.pptSamPrem3
 
UNIT V DIS.pptx
UNIT V DIS.pptxUNIT V DIS.pptx
UNIT V DIS.pptxSamPrem3
 
UNIT IV DIS.pptx
UNIT IV DIS.pptxUNIT IV DIS.pptx
UNIT IV DIS.pptxSamPrem3
 
UNIT III DIS.pptx
UNIT III DIS.pptxUNIT III DIS.pptx
UNIT III DIS.pptxSamPrem3
 
UNIT II DIS.pptx
UNIT II DIS.pptxUNIT II DIS.pptx
UNIT II DIS.pptxSamPrem3
 
UNIT I DIS.pptx
UNIT I DIS.pptxUNIT I DIS.pptx
UNIT I DIS.pptxSamPrem3
 

More from SamPrem3 (9)

20IT501_DWDM_U5.ppt
20IT501_DWDM_U5.ppt20IT501_DWDM_U5.ppt
20IT501_DWDM_U5.ppt
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt
 
20IT501_DWDM_U3.ppt
20IT501_DWDM_U3.ppt20IT501_DWDM_U3.ppt
20IT501_DWDM_U3.ppt
 
20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt
 
UNIT V DIS.pptx
UNIT V DIS.pptxUNIT V DIS.pptx
UNIT V DIS.pptx
 
UNIT IV DIS.pptx
UNIT IV DIS.pptxUNIT IV DIS.pptx
UNIT IV DIS.pptx
 
UNIT III DIS.pptx
UNIT III DIS.pptxUNIT III DIS.pptx
UNIT III DIS.pptx
 
UNIT II DIS.pptx
UNIT II DIS.pptxUNIT II DIS.pptx
UNIT II DIS.pptx
 
UNIT I DIS.pptx
UNIT I DIS.pptxUNIT I DIS.pptx
UNIT I DIS.pptx
 

Recently uploaded

Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
DATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage exampleDATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage examplePragyanshuParadkar1
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
 
Effects of rheological properties on mixing
Effects of rheological properties on mixingEffects of rheological properties on mixing
Effects of rheological properties on mixingviprabot1
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
EduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AIEduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AIkoyaldeepu123
 

Recently uploaded (20)

Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
DATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage exampleDATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage example
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
Effects of rheological properties on mixing
Effects of rheological properties on mixingEffects of rheological properties on mixing
Effects of rheological properties on mixing
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
EduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AIEduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AI
 

20IT501_DWDM_PPT_Unit_I.ppt

  • 1. 20IT501 – Data Warehousing and Data Mining III Year / VI Semester
  • 2. OBJECTIVES Be familiar with the concepts of data warehouse Know the fundamentals of data mining Understand the importance of association rule mining Understand the techniques of classification and clustering Be aware of the recent trends of data mining
  • 3. UNIT I- DATA WAREHOUSING Data Warehouse – Data warehousing Components – Building a Data warehouse – Mapping the Data Warehouse to a Multiprocessor Architecture – DBMS Schemas for Decision Support. Business Analysis: Reporting and Query tools and Applications - Online Analytical Processing (OLAP) – Need – Multidimensional Data Model – OLAP Guidelines - Categories of OLAP Tools.
  • 4. Data Warehouse A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.
  • 5. Data Warehouse Subject Oriented:  it provides information around a subject rather than the organization's ongoing operations.  These subjects can be product, customers, suppliers, sales, revenue, etc.  A data warehouse does not focus on the ongoing operations, rather it focuses on modeling and analysis of data for decision making.
  • 6. Data Warehouse Integrated:  A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc.  This integration enhances the effective analysis of data.
  • 7. Data Warehouse Time Variant:  The data collected in a data warehouse is identified with a particular time period.  The data in a data warehouse provides information from the historical point of view.
  • 8. Data Warehouse Non-volatile Non-volatile means the previous data is not erased when new data is added to it.  A data warehouse is kept separate from the operational database and therefore frequent changes in operational database is not reflected in the data warehouse.
  • 9. Data Warehouse - Terminologies  Current detail data: It is organized along subject lines(customer profile data, sales data)  Data Mart: Summarized departmental data and is customized to suit the needs of a particular department that owns the data.
  • 10. Data Warehouse - Terminologies  Drill-down: To perform business analysis in a top- down fashion.  Metadata: Data about data. It contains the location and description of warehouse system components; names, definition, structure and content of the data warehouse and end-users views.
  • 11. Data Warehouse Operational Data Information Data Content Current Values Summarized, Derived Organization Application Subject Stability Dynamic Static until refreshed Data Structure Optimized for transactions Optimized for complex queries Access High Medium to low Response time Sub second (<1s) to 2-3s Several seconds to minutes
  • 13. Data Warehouse Components 1.Data Warehouse Database:  It is based on a relational database management system server that functions as the central repository for informational data.  This repository is surrounded by a number of key components designed to make the entire functional, manageable and accessible by both the operational systems that source data into the warehouse and by end-user query and analysis tool.
  • 14. Data Warehouse Components 2. Sourcing, Acquisition, Clean up, and Transformation Tools:  They perform conversions, summarization, key changes, structural changes and condensation.  The data transformation is required so that the information can be used by decision support tools.
  • 15. Data Warehouse Components  Sourcing, Acquisition, Clean up, and Transformation Tools – Functionalities:  To remove unwanted data from operational db  Converting to common data names and attributes  Calculating summaries and derived data  Establishing defaults for missing data  Accommodating source data definition changes
  • 16. Data Warehouse Components  Sourcing, Acquisition, Clean up, and Transformation Tools – Issues:  Data heterogeneity: It refers to the different way the data is defined and used in different modules.
  • 17. Data Warehouse Components Meta data:  It is data about data. It is used for maintaining, managing and using the data warehouse.  Types:  Technical Meta data  Business Meta data
  • 18. Data Warehouse Components  Meta data - Technical Meta data: Information about data stores Transformation descriptions The rules used to perform clean up, and data enhancement Data mapping operations Access authorization, backup history, archive history, info delivery history, data acquisition history, data access etc.,
  • 19. Data Warehouse Components Meta data – Business Meta data:  It contains information that stored in data warehouse to users.  Subject areas, and info object type including queries, reports, images, video, audio clips etc.  Internet home pages  Information related to info delivery system
  • 20. Data Warehouse Components  Meta data:  Meta data helps the users to understand content and find the data. Meta data are stored in a separate data stores which is known as informational directory or Meta data repository. It helps to integrate, maintain and view the contents of the data warehouse
  • 21. Data Warehouse Components  Meta data – Characteristics:  It is the gateway to the data warehouse environment  It supports easy distribution and replication of content for high performance and availability  It should be searchable by business oriented key words  It should support the sharing of information  It should support and provide interface to other applications
  • 22. Data Warehouse Components Access Tools: Data query and reporting tools Application development tools Executive info system tools (EIS) OLAP tools Data mining tools
  • 23. Data Warehouse Components Data Marts:  Departmental subsets that focus on selected subjects. They are independent used by dedicated user group. They are used for rapid delivery of enhanced decision support functionality to end users.
  • 24. Data Warehouse Components  Data warehouse admin and management:  Security and priority management  Monitoring updates from multiple sources  Data quality checks  Managing and updating meta data  Auditing and reporting data warehouse usage and status  Replicating, sub setting and distributing data  Backup and recovery
  • 25. Data Warehouse Components Information delivery system It is used to enable the process of subscribing for data warehouse info. Delivery to one or more destinations according to specified scheduling algorithm
  • 26. Building a Data Warehouse - Reason There are two reasons why organizations consider data warehousing a critical need 1.Business Factor:  Decisions need to be made quickly and correctly, using all available data.  Users are business domain expert, not computer professionals
  • 27. Building a Data Warehouse - Reason 2.Technological factor:  The price of computer processing speed continues to decline. IT infrastructure is changing rapidly. Its capacity is increasing and cost is decreasing so that building a data warehouse is easy. Network bandwidth is increasing.
  • 28. Building a Data Warehouse 1. Business considerations: Approaches  Top - Down Approach  Bottom - Up Approach
  • 29. Building a Data Warehouse  Approaches: Top - Down Approach  To build a centralized repository to house corporate wide business data.  This repository is called Enterprise Data Warehouse (EDW).  The data in the EDW is stored in a normalized form in order to avoid redundancy.
  • 30. Building a Data Warehouse  Approaches: Top - Down Approach  The data in the EDW is stored at the most detail level. The reason to build the EDW on the most detail level is to leverage Flexibility to be used by multiple departments. Flexibility to cater for future requirements.
  • 31. Building a Data Warehouse  Approaches: Top - Down Approach  We should implement the top-down approach when The business has complete clarity on all or multiple subject areas data warehouse requirements. The business is ready to invest considerable time and money.
  • 32. Building a Data Warehouse  Approaches: Top - Down Approach – Advantages:  This is very important for the data to be reliable.  Consistent across subject areas.
• 33. Building a Data Warehouse  Approaches: Top - Down Approach – Disadvantages:  It requires more time and a larger initial investment.  The business has to wait for the EDW to be implemented, followed by building the data marts, before it can access its reports.
  • 34. Building a Data Warehouse  Approaches: Bottom Up Approach  It is an incremental approach to build a data warehouse.  Here we build the data marts separately at different points of time as and when the specific subject area requirements are clear.
  • 35. Building a Data Warehouse  Approaches: Bottom Up Approach  The data marts are integrated or combined together to form a data warehouse.  Separate data marts are combined through the use of conformed dimensions and conformed facts.
• 36. Building a Data Warehouse  Approaches: Bottom Up Approach  We should implement the bottom-up approach when:  we have initial cost and time constraints;  the complete warehouse requirements are not clear;  we have clarity on only one data mart.
  • 37. Building a Data Warehouse  Approaches: Bottom Up Approach - Advantages  They do not require high initial costs and have a faster implementation time;  Hence the business can start using the marts much earlier as compared to the top-down approach.
• 38. Building a Data Warehouse  Approaches: Bottom Up Approach - Disadvantages  It stores data in denormalized format, hence there can be high space usage for detailed data.  There is a tendency not to keep detailed data.
  • 39. Building a Data Warehouse 2. Design considerations: Most successful data warehouses that meet these requirements have these common characteristics:  Are based on a dimensional model  Contain historical and current data  Include both detailed and summarized data  Consolidate disparate data from multiple sources while retaining consistency
• 40. Building a Data Warehouse  A data warehouse is difficult to build due to the following reasons:  Heterogeneity of data sources  Use of historical data  Growing nature of the database
• 41. Building a Data Warehouse  General considerations: the following specific points are relevant to data warehouse design.  Data content  The content and structure of the data warehouse are reflected in its data model.  The data model is the template that describes how information will be organized within the integrated warehouse framework.  The data warehouse data must be detailed data. It must be formatted, cleaned up and transformed to fit the warehouse data model.
  • 42. Building a Data Warehouse Meta data It defines the location and contents of data in the warehouse. Meta data is searchable by users to find definitions or subject areas.
  • 43. Building a Data Warehouse Data Distribution  The data can be distributed based on the subject area, location (geographical region), or time (current, month, year).
  • 44. Building a Data Warehouse  Tools  These tools provide facilities for defining the transformation and cleanup rules, data movement, end user query, reporting and data analysis.
• 45. Building a Data Warehouse  Design steps  Choosing the subject matter  Deciding what a fact table represents  Identifying and conforming the dimensions  Choosing the facts  Storing pre-calculations in the fact table  Rounding out the dimension table  Choosing the duration of the database  The need to track slowly changing dimensions  Deciding the query priorities and query models
  • 46. Building a Data Warehouse 3. Implementation considerations: The following logical steps needed to implement a data warehouse:  Collect and analyze business requirements  Create a data model and a physical design  Define data sources  Choose the database Technology and platform  Extract the data from operational database, transform it, clean it up and load it into the warehouse
• 47. Building a Data Warehouse  Implementation considerations (continued):  choose database access and reporting tools;  choose database connectivity software;  choose data analysis and presentation software;  update the data warehouse;  choose access tools.
• 48. Building a Data Warehouse  Access Tools:  simple tabular form data;  ranking data;  multivariable data;  time series data;  graphing, charting and pivoting data.
• 49. Building a Data Warehouse  Access Tools:  statistical analysis data;  predefined repeatable queries;  ad hoc user-specified queries;  reporting and analysis data.
• 50. Building a Data Warehouse  Data extraction, clean up, transformation and migration – Selection criteria:  Timeliness of data delivery to the warehouse.  The tool must have the ability to identify the particular data in the source environments that can be read by the conversion tool.  The tool must have the capability to merge data from multiple data stores.
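To make the extract / clean-up / load sequence above concrete, here is a minimal ETL sketch in Python using only the standard library; the table names, columns and cleaning rules are invented for illustration and do not come from the slides.

```python
# A minimal ETL sketch (hypothetical table and column names) showing the
# extract / transform-and-clean / load steps, using only the standard
# library. A real warehouse load would use a dedicated ETL tool.
import sqlite3

source = sqlite3.connect(":memory:")      # stands in for an operational database
warehouse = sqlite3.connect(":memory:")   # stands in for the warehouse database

# --- set up a toy operational table (the extraction source) ---
source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount TEXT, ts TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                   [(1, " Alice ", "120.50", "2024-01-05"),
                    (2, "bob",     "80",     "2024-01-06"),
                    (3, "Carol",   None,     "2024-01-07")])

# --- extract ---
rows = source.execute("SELECT id, customer, amount, ts FROM orders").fetchall()

# --- transform / clean up: trim names, standardize case, drop bad amounts ---
clean = [(oid, cust.strip().title(), float(amt), ts)
         for oid, cust, amt, ts in rows if amt is not None]

# --- load into the warehouse table ---
warehouse.execute("CREATE TABLE sales_fact (order_id INTEGER, customer TEXT, amount REAL, order_date TEXT)")
warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", clean)
warehouse.commit()

print(warehouse.execute("SELECT COUNT(*), SUM(amount) FROM sales_fact").fetchone())
```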
• 51. Building a Data Warehouse  User levels - Casual users: They are most comfortable retrieving information from the warehouse in predefined formats and running pre-existing queries and reports. These users do not need tools that allow for building standard and ad hoc reports.
• 52. Building a Data Warehouse  User levels - Power users: These users can use predefined as well as user-defined queries to create simple and ad hoc reports. They can engage in drill-down operations. These users may have experience using reporting and query tools.
  • 53. Building a Data Warehouse  User levels - Expert users: These users tend to create their own complex queries and perform standard analysis on the info they retrieve. These users have the knowledge about the use of query and report tools
• 54. Building a Data Warehouse  Benefits of data warehousing:  locating the right information;  presentation of information;  testing of hypotheses;  discovery of information;  sharing the analysis.
• 55. Building a Data Warehouse  Tangible benefits (quantifiable / measurable) include:  improvement in product inventory;  decrement in production cost;  improvement in selection of target markets;  enhancement in asset and liability management.
• 56. Building a Data Warehouse  Intangible benefits (not easy to quantify) include:  improvement in productivity by keeping all data in a single location and eliminating rekeying of data;  reduced redundant processing;  enhanced customer relations.
• 57. Mapping the Data Warehouse to a Multiprocessor Architecture  Parallel Query Processing: the accessing and processing of portions of the database by individual threads in parallel can greatly improve the performance of a query.  Speed up: the ability to execute the same request on the same amount of data in less time.  Scale up: the ability to obtain the same performance on the same request as the database size increases.
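Stated as formulas (using the commonly cited definitions, not taken from the slides), speed-up and scale-up can be written as:

```latex
\[
\text{Speed-up} = \frac{T_{\text{one processor}}}{T_{\text{N processors}}}
\qquad
\text{Scale-up} = \frac{T_{\text{small system, small workload}}}{T_{\text{N-times-larger system, N-times-larger workload}}}
\]
```

Linear speed-up means N processors answer the same query N times faster; linear scale-up means response time stays constant when the data volume and the hardware grow in the same proportion.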
• 58. Mapping the Data Warehouse to a Multiprocessor Architecture  Types of parallelism:  Inter-query parallelism: different server threads or processes handle multiple requests at the same time.  Intra-query parallelism: this form of parallelism decomposes the serial SQL query into lower-level operations such as scan, join and sort; these lower-level operations are then executed concurrently, in parallel.
• 59. Mapping the Data Warehouse to a Multiprocessor Architecture  Types of parallelism: intra-query parallelism can be done in either of two ways.  Horizontal parallelism: the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.
• 60. Mapping the Data Warehouse to a Multiprocessor Architecture  Types of parallelism: intra-query parallelism can be done in either of two ways.  Vertical parallelism: this occurs among different tasks. All query components such as scan, join and sort are executed in parallel in a pipelined fashion; in other words, the output of one task becomes the input of another task.
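The following sketch illustrates the pipelining idea behind vertical parallelism: each operator consumes the output of the previous one. Python generators run the stages lazily on a single thread, whereas a parallel DBMS would run each stage on its own processor; the operator and column names here are illustrative.

```python
# Conceptual sketch of vertical (pipelined) parallelism: the output of one
# operator feeds the next. In a real parallel DBMS each stage would run on
# its own processor or thread.

def scan(table):
    """Scan operator: stream rows from a table."""
    for row in table:
        yield row

def select(rows, predicate):
    """Selection operator: keep only rows matching the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def aggregate(rows, key, value):
    """Aggregation operator: sum `value` grouped by `key`."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals

sales = [{"region": "North", "amount": 100},
         {"region": "South", "amount": 250},
         {"region": "North", "amount": 75}]

# scan -> select -> aggregate, wired as a pipeline
result = aggregate(select(scan(sales), lambda r: r["amount"] > 50), "region", "amount")
print(result)   # {'North': 175, 'South': 250}
```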
• 61. Mapping the Data Warehouse to a Multiprocessor Architecture  Data partitioning  Data partitioning is the key component for effective parallel execution of database operations. It spreads data from database tables across multiple disks so that I/O operations such as read and write can be performed in parallel.  Partitioning can be done randomly or intelligently.
• 62. Mapping the Data Warehouse to a Multiprocessor Architecture  Data partitioning  Random partitioning includes random data striping across multiple disks on a single server.  Another option for random partitioning is round-robin partitioning, in which each record is placed on the next disk assigned to the database.  Since the DBMS does not know where each record resides, all partitions may have to be scanned to locate a specific record.
• 63. Mapping the Data Warehouse to a Multiprocessor Architecture  Data partitioning  Intelligent partitioning assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks.  The various intelligent partitioning schemes include: hash partitioning, key range partitioning, schema partitioning and user-defined partitioning.
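As a rough illustration of the partitioning strategies just listed (round-robin, hash and key range), the sketch below distributes a handful of records across four "disks"; the record layout and disk count are assumptions made for the example.

```python
# Hypothetical sketch of round-robin, hash and key-range partitioning of
# records across disks (lists stand in for the disks).
from itertools import cycle

records = [{"id": i, "region": r}
           for i, r in enumerate(["North", "South", "East", "West"] * 3)]
NUM_DISKS = 4

# Round-robin: each record goes to the next disk in turn. The DBMS cannot
# predict where a given record lives, so lookups may scan every disk.
rr_disks = [[] for _ in range(NUM_DISKS)]
for rec, disk in zip(records, cycle(range(NUM_DISKS))):
    rr_disks[disk].append(rec)

# Hash partitioning: disk chosen from a hash of the partitioning key, so the
# DBMS can recompute the hash and go straight to the right disk.
hash_disks = [[] for _ in range(NUM_DISKS)]
for rec in records:
    hash_disks[hash(rec["id"]) % NUM_DISKS].append(rec)

# Key-range partitioning: contiguous key ranges map to disks (ids 0-2 on
# disk 0, 3-5 on disk 1, ...), which suits range queries.
range_disks = [[] for _ in range(NUM_DISKS)]
for rec in records:
    range_disks[min(rec["id"] // 3, NUM_DISKS - 1)].append(rec)

print([len(d) for d in rr_disks],
      [len(d) for d in hash_disks],
      [len(d) for d in range_disks])
```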
• 64. Mapping the Data Warehouse to a Multiprocessor Architecture  Database architectures for parallel processing:  shared-memory (shared-everything) architecture;  shared-disk architecture;  shared-nothing architecture.
• 65. Mapping the Data Warehouse to a Multiprocessor Architecture  Shared memory or shared everything architecture:  Multiple PUs share memory.  Each PU has full access to all shared memory through a common bus.  Communication between nodes occurs via shared memory.  Performance is limited by the bandwidth of the memory bus.  It provides a consistent single-system image to the user.
  • 66. Mapping the Data Warehouse to a Multiprocessor Architecture  Shared memory or shared everything Architecture:
  • 67. Mapping the Data Warehouse to a Multiprocessor Architecture  Shared memory or shared everything Architecture:  Threads provide better resource utilization and faster context switching, thus providing for better scalability.  At the same time, threads that are too tightly coupled with the OS may limit RDBMS portability.  Scalability of shared-memory architectures is limited.
  • 68. Mapping the Data Warehouse to a Multiprocessor Architecture  Shared disk architecture:
  • 69. Mapping the Data Warehouse to a Multiprocessor Architecture  Shared disk architecture:  Each node consists of one or more PUs and associated memory. Memory is not shared between nodes. Communication occurs over a common high-speed bus. Each node has access to the same disks and other resources.
• 70. Mapping the Data Warehouse to a Multiprocessor Architecture  Shared disk architecture:  It implements the concept of shared ownership of the entire database between RDBMS servers, each of which is running on a node of a distributed memory system.  Each RDBMS server can read, write, update, and remove data from the same shared database, which requires the system to implement a form of distributed lock manager (DLM).
  • 71. Mapping the Data Warehouse to a Multiprocessor Architecture  Shared disk architecture:  Since the memory is not shared among the nodes, each node has its own data cache. Cache consistency must be maintained across the nodes and a lock manager is needed to maintain the consistency.  There is additional overhead in maintaining the locks and ensuring that the data caches are consistent
  • 72. Mapping the Data Warehouse to a Multiprocessor Architecture  Shared disk architecture – Advantages:  Shared disk systems permit high availability. All data is accessible even if one node dies.  These systems have the concept of one database, which is an advantage over shared nothing systems.  Shared disk systems provide for incremental growth.
• 73. Mapping the Data Warehouse to a Multiprocessor Architecture  Shared disk architecture – Disadvantages:  Inter-node synchronization is required, involving DLM overhead and greater dependency on the high-speed interconnect.  If the workload is not partitioned well, there may be high synchronization overhead.  There is operating system overhead of running the shared disk software.
  • 74. Mapping the Data Warehouse to a Multiprocessor Architecture  Shared Nothing Architecture
  • 75. Mapping the Data Warehouse to a Multiprocessor Architecture  Shared Nothing Architecture:  Shared nothing systems are typically loosely coupled. In shared nothing systems only one CPU is connected to a given disk. If a table or database is located on that disk, access depends entirely on the PU which owns it.
  • 76. Mapping the Data Warehouse to a Multiprocessor Architecture  Shared Nothing Architecture – Advantages:  Shared nothing systems provide for incremental growth.  Failure is local: if one node fails, the others stay up.
• 77. Mapping the Data Warehouse to a Multiprocessor Architecture  Shared Nothing Architecture – Disadvantages:  More coordination is required.  More overhead is required for a process working on a disk belonging to another node.
• 78. Mapping the Data Warehouse to a Multiprocessor Architecture  Parallel DBMS Vendors – Oracle:  Architecture: shared disk;  Data partitioning: key range, hash, round robin;  Parallel operations: hash joins, scan and sort.
• 79. Mapping the Data Warehouse to a Multiprocessor Architecture  Parallel DBMS Vendors – Informix:  Architecture: shared memory, shared disk and shared nothing models;  Data partitioning: round robin, hash, schema, key range and user defined;  Parallel operations: INSERT, UPDATE, DELETE.
• 80. Mapping the Data Warehouse to a Multiprocessor Architecture  Parallel DBMS Vendors – IBM:  Architecture: shared nothing model;  Data partitioning: hash;  Parallel operations: INSERT, UPDATE, DELETE, load, recovery, index creation, backup, table reorganization.
• 81. Mapping the Data Warehouse to a Multiprocessor Architecture  Parallel DBMS Vendors – SYBASE:  Architecture: shared nothing model;  Data partitioning: hash, key range, schema;  Parallel operations: horizontal and vertical parallelism.
  • 82. DBMS Schemas for Decision Support A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.  In general terms, dimensions are the perspectives or entities with respect to which an organization wants to keep records.
• 83. DBMS Schemas for Decision Support  Example: a sales data warehouse keeps records of the store’s sales with respect to the dimensions time, item, branch, and location.  Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. A dimension table for item may contain the attributes item name, brand, and type.
  • 84. DBMS Schemas for Decision Support  Example: A multidimensional data model is typically organized around a central theme, like sales, for instance. This theme is represented by a fact table. Facts are numerical measures. Examples of facts for a sales data warehouse include dollars sold (sales amount in dollars), units sold (number of units sold), and amount budgeted.
  • 85. DBMS Schemas for Decision Support  Example
• 86. DBMS Schemas for Decision Support  Star schema  The most popular data model for a data warehouse is a multidimensional model.  The data warehouse contains a large central table (fact table) containing the bulk of the data, with no redundancy, and  a set of smaller associated tables (dimension tables), one for each dimension.  The dimension tables are displayed in a radial pattern around the central fact table.
  • 87. DBMS Schemas for Decision Support  Star schema
• 88. DBMS Schemas for Decision Support  Star schema – Characteristics:  Simple structure -> easy-to-understand schema.  Great query effectiveness -> small number of tables to join.  Relatively long loading time for dimension tables -> de-normalization and redundant data can make the tables large.
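A minimal star schema for the sales example can be sketched as SQL DDL (run here through sqlite3); the table and column names follow the time/item/branch/location example used earlier but are otherwise illustrative.

```python
# Star-schema sketch: one central fact table referencing denormalized
# dimension tables for time, item, branch and location.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);

CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,      -- facts: numerical measures
    units_sold   INTEGER
);
""")

# A typical star-join query: the fact table in the middle, few joins.
query = """
SELECT t.quarter, l.city, SUM(f.dollars_sold)
FROM sales_fact f
JOIN time_dim t     ON f.time_key = t.time_key
JOIN location_dim l ON f.location_key = l.location_key
GROUP BY t.quarter, l.city;
"""
print(db.execute(query).fetchall())   # empty list until rows are loaded
```

Because every dimension attribute lives in a single denormalized dimension table, a typical report needs only one join per dimension, which is what gives the star schema its query effectiveness.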
  • 89. DBMS Schemas for Decision Support  Snowflake schema:  The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables.  The resulting schema graph forms a shape similar to a snowflake.
  • 90. DBMS Schemas for Decision Support  Snowflake schema:
  • 91. DBMS Schemas for Decision Support  Snowflake schema:  The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancies.  Such a table is easy to maintain and saves storage space.
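Continuing the same example, a snowflaked version of the location dimension might split the city attributes into their own normalized table; the names below are illustrative.

```python
# Snowflake-schema sketch: the city details are split out into a normalized
# table, and location_dim keeps only a foreign key to it.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE city_dim (
    city_key INTEGER PRIMARY KEY,
    city     TEXT,
    state    TEXT,
    country  TEXT
);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    street       TEXT,
    city_key     INTEGER REFERENCES city_dim(city_key)  -- no repeated city/state/country
);
""")
```

The trade-off matches the slide above: less redundancy and storage, at the cost of an extra join when querying location attributes.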
  • 92. DBMS Schemas for Decision Support Fact constellation  Sophisticated applications may require multiple fact tables to share dimension tables.  This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.
  • 93. DBMS Schemas for Decision Support Fact constellation
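Extending the star-schema sketch above (and reusing its `db` connection and dimension tables), a fact constellation simply adds a second fact table that shares those dimensions; the shipping example below follows the classic sales/shipping illustration and its column names are assumptions.

```python
# Fact-constellation (galaxy) sketch: a second fact table, shipping_fact,
# shares the time, item and location dimension tables already defined for
# sales_fact in the star-schema sketch above.
db.executescript("""
CREATE TABLE shipping_fact (
    time_key      INTEGER REFERENCES time_dim(time_key),
    item_key      INTEGER REFERENCES item_dim(item_key),
    from_location INTEGER REFERENCES location_dim(location_key),
    to_location   INTEGER REFERENCES location_dim(location_key),
    dollars_cost  REAL,
    units_shipped INTEGER
);
""")
```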
  • 94. Business Analysis: Reporting and Query Tools and Applications  The principal purpose of data warehousing is to provide information to business users for strategic decision making.  These users interact with the data warehouse using front-end tools, or by getting the required information through the information delivery system.
  • 95. Business Analysis: Reporting and Query Tools and Applications  Tool Categories  Reporting - Companies generate regular operational reports. Ex: Calculating and printing paychecks.  Managed query - make it possible for knowledge workers to access corporate data without IS intervention  Executive information systems - It allows developers to build customized, graphical decision support applications  OLAP - A natural way to view corporate data  Data Mining - It uses a variety of statistical and artificial-intelligence algorithms to analyze the correlation of variables
  • 96. Business Analysis: Reporting and Query Tools and Applications  Cognos Impromptu:  Impromptu is an interactive database reporting tool.  It allows Power Users to query data without programming knowledge.  When using the Impromptu tool, no data is written or changed in the database. It is only capable of reading the data
• 97. Business Analysis: Reporting and Query Tools and Applications  Cognos Impromptu – Features:  interactive reporting capability;  enterprise-wide scalability;  superior user interface;  fastest time to result;  lowest cost of ownership.
  • 98. Business Analysis: Reporting and Query Tools and Applications  Catalogs:  Impromptu stores metadata in subject related folders  This metadata will be used to develop a query for a report. The metadata set is stored in a file called a catalog.  The catalog does not contain any data. It just contains information about connecting to the database and the fields that will be accessible for reports.
  • 99. Business Analysis: Reporting and Query Tools and Applications  Catalogs contains:  Folders—meaningful groups of information representing columns from one or more tables  Columns—individual data elements that can appear in one or more folders  Calculations—expressions used to compute required values from existing data  Conditions—used to filter information so that only a certain type of information is displayed
  • 100. Business Analysis: Reporting and Query Tools and Applications Prompts—pre-defined selection criteria prompts that users can include in reports they create Other components, such as metadata, a logical database name, join information, and user classes
• 101. Business Analysis: Reporting and Query Tools and Applications  Catalogs – Uses:  view, run, and print reports;  export reports to other applications;  disconnect from and connect to the database;  create reports;  change the contents of the catalog.
  • 102. Business Analysis: Reporting and Query Tools and Applications  Reports:  Reports are created by choosing fields from the catalog folders. This process will build a SQL (Structured Query Language) statement behind the scene.  No SQL knowledge is required to use Impromptu.
  • 103. Business Analysis: Reporting and Query Tools and Applications  Reports:  The data in the report may be formatted, sorted and/or grouped as needed.  Titles, dates, headers and footers and other standard text formatting features (italics, bolding, and font size) are also available.  Once the desired layout is obtained, the report can be saved to a report file.
  • 104. Business Analysis: Reporting and Query Tools and Applications  Frame-Based Reporting  Frames are the building blocks of all Impromptu reports and templates.  They may contain report objects, such as data, text, pictures, and charts.  There are no limits to the number of frames that you can place within an individual report or template. You can nest frames within other frames to group report objects within a report.
  • 105. Business Analysis: Reporting and Query Tools and Applications  Frame-Based Reporting  Form frame: An empty form frame appears.  List frame: An empty list frame appears.  Text frame: The flashing I-beam appears where you can begin inserting text.  Picture frame: The Source tab (Picture Properties dialog box) appears. You can use this tab to select the image to include in the frame.
  • 106. Business Analysis: Reporting and Query Tools and Applications  Frame-Based Reporting Chart frame: The Data tab (Chart Properties dialog box) appears. You can use this tab to select the data item to include in the chart. OLE Object: The Insert Object dialog box appears where you can locate and select the file you want to insert, or you can create a new object using the software listed in the Object Type box.
• 107. Business Analysis: Reporting and Query Tools and Applications  Impromptu features  Unified query and reporting interface: it unifies both query and reporting in a single user interface.  Object-oriented architecture: it enables inheritance-based administration so that more than 1,000 users can be accommodated as easily as a single user.
• 108. Business Analysis: Reporting and Query Tools and Applications  Impromptu features  Scalability: it scales from a single user to 1,000 users.  Security and control: security is based on user profiles and their classes.  Data presented in a business context: it presents information using the terminology of the business.
• 109. Business Analysis: Reporting and Query Tools and Applications  Impromptu features  Over 70 predefined report templates: users can simply supply the data to create an interactive report.  Frame-based reporting: it offers a number of objects to create a user-designed report.  Business-relevant reporting: it can be used to generate business-relevant reports through filters, pre-conditions and calculations.
• 110. Online Analytical Processing (OLAP)  It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data.  Online Analytical Processing (OLAP) applications and tools are those that are designed to ask complex queries of large multidimensional collections of data.
• 111. Online Analytical Processing (OLAP)  Need:  It arises from the multidimensional nature of the business problem.  These problems are characterized by retrieving a very large number of records (which can reach gigabytes and terabytes) and summarizing this data into a form of information that can be used by business analysts.
• 112. Online Analytical Processing (OLAP)  The Multidimensional Data Model:  Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during interactive sessions.  The multidimensional data model views the data as a cube.
  • 113. Online Analytical Processing (OLAP)  The Multidimensional Data Model – Operations:  Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
  • 114. Online Analytical Processing (OLAP)  The Multidimensional Data Model – Operations:
  • 115. Online Analytical Processing (OLAP)  The Multidimensional Data Model – Operations:  Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data.  Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions.
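A small pandas sketch of roll-up and drill-down on a toy sales cube (all column names and values are invented for the example):

```python
# Illustrative roll-up / drill-down using pandas on a tiny sales table.
import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Vancouver", "Chicago", "New York"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "dollars_sold": [100, 150, 200, 250],
})

# Roll-up: climb the location hierarchy from city to country
# (aggregation over a coarser level of the concept hierarchy).
rollup = sales.groupby(["country", "quarter"])["dollars_sold"].sum()

# Drill-down: go back to the more detailed city level
# (the reverse operation, exposing finer-grained data).
drilldown = sales.groupby(["country", "city", "quarter"])["dollars_sold"].sum()

print(rollup, drilldown, sep="\n\n")
```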
  • 116. Online Analytical Processing (OLAP)  The Multidimensional Data Model – Operations:
  • 117. Online Analytical Processing (OLAP)  The Multidimensional Data Model – Operations:  Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.
  • 118. Online Analytical Processing (OLAP)  The Multidimensional Data Model – Operations:  The dice operation defines a subcube by performing a selection on two or more dimensions.
  • 119. Online Analytical Processing (OLAP)  The Multidimensional Data Model – Operations:  Pivot (also called rotate) is a visualization operation that rotates the data axes in view to provide an alternative data presentation.
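In the same spirit, a pandas sketch of slice, dice and pivot on a toy sales table (column names and values are invented):

```python
# Illustrative slice, dice and pivot using pandas.
import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["phone", "laptop", "phone", "laptop"],
    "dollars_sold": [100, 150, 200, 250],
})

# Slice: a selection on ONE dimension, yielding a subcube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: a selection on TWO OR MORE dimensions.
dice = sales[(sales["quarter"] == "Q1") & (sales["country"] == "USA")]

# Pivot (rotate): re-orient the axes for an alternative presentation,
# e.g. countries as rows and quarters as columns.
pivoted = sales.pivot_table(index="country", columns="quarter",
                            values="dollars_sold", aggfunc="sum")

print(slice_q1, dice, pivoted, sep="\n\n")
```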
• 120. OLAP vs OLTP
Source of data: OLTP - operational data; OLTP systems are the original source of the data. OLAP - consolidated data; OLAP data comes from the various OLTP databases.
Purpose of data: OLTP - to control and run fundamental business tasks. OLAP - to help with planning, problem solving, and decision support.
What the data reveals: OLTP - a snapshot of ongoing business processes. OLAP - multi-dimensional views of various kinds of business activities.
Queries: OLTP - relatively standardized and simple queries returning relatively few records. OLAP - often complex queries involving aggregations.
• 121. OLAP vs OLTP
Processing speed: OLTP - typically very fast. OLAP - depends on the amount of data involved.
Space requirements: OLTP - can be relatively small if historical data is archived. OLAP - larger, due to the existence of aggregation structures and history data.
Database design: OLTP - highly normalized with many tables. OLAP - typically de-normalized with fewer tables; uses star and/or snowflake schemas.
Backup and recovery: OLTP - data loss is likely to entail significant monetary loss and legal liability. OLAP - instead of regular backups, some environments may simply reload the OLTP data as a recovery method.
• 122. Online Analytical Processing (OLAP)  Categories of OLAP Tools – MOLAP:  This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats; that is, data is stored in array-based structures.
• 124. Online Analytical Processing (OLAP)  Categories of OLAP Tools – ROLAP  This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.  Data is stored in relational tables.
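A small sketch of the ROLAP idea, with sqlite3 standing in for the relational warehouse and invented table/column names: each slice or dice gesture in the front-end tool becomes an extra predicate in the WHERE clause of the generated SQL.

```python
# ROLAP sketch: slicing and dicing are translated into WHERE predicates on
# relational tables.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_fact (quarter TEXT, country TEXT, dollars_sold REAL)")
db.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
               [("Q1", "USA", 200), ("Q1", "Canada", 100), ("Q2", "USA", 250)])

base_query = "SELECT quarter, country, SUM(dollars_sold) FROM sales_fact"

# "Slicing" on quarter = Q1 simply appends one WHERE predicate:
slice_sql = base_query + " WHERE quarter = 'Q1' GROUP BY quarter, country"

# "Dicing" on quarter and country appends one predicate per dimension:
dice_sql = base_query + " WHERE quarter = 'Q1' AND country = 'USA' GROUP BY quarter, country"

print(db.execute(slice_sql).fetchall())
print(db.execute(dice_sql).fetchall())
```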
  • 126. Online Analytical Processing (OLAP)  Categories of OLAP Tools – HOLAP (MQE: Managed Query Environment)  HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP.  For summary-type information, HOLAP leverages cube technology for faster performance.  It stores only the indexes and aggregations in the multidimensional form while the rest of the data is stored in the relational database.