KUMARAGURU COLLEGE OF TECHNOLOGY
DATA WAREHOUSING AND DATA MINING
Contact No: 9788153199
Contact No: 9843286841
Fast, accurate and scalable data analysis techniques are needed to extract useful
information from huge piles of data. A data warehouse is a single, integrated source of
decision support information formed by collecting data from multiple sources, internal to
the organization as well as external, and transforming and summarizing this information
to enable improved decision making. A data warehouse is designed to give users easy access
to large amounts of information, and data access is typically supported by specialized
analytical tools and applications. Typical applications include decision support systems
and executive information systems.
Data mining is the exploration and analysis of large quantities of data in order to
discover valid, novel, potentially useful, and ultimately understandable patterns in data. It
has also been defined as:
“An information extraction activity whose goal is to discover hidden facts contained
in databases.”
The process of extracting valid, previously unknown, comprehensible and
actionable information from large databases and using it to make crucial business
decisions.
Data mining finds patterns and subtle relationships in data and infers rules that allow the
prediction of future results. A data mining model is a description of a specific aspect of a
dataset. It produces output values for an assigned set of input values. Typical applications
include market segmentation, customer profiling, fraud detection, evaluation of retail
promotions, and credit risk analysis.
Organizations are increasingly analyzing current and historical data to identify
useful patterns and support business strategies.
A large amount of the right information is the key to survival in today’s competitive
environment, and this kind of information can be made available only if there is a fully
integrated enterprise data warehouse.
What is data warehousing?
A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant
collection of data in support of management’s decisions.
NEED FOR A DATA WAREHOUSE :
• IT or business staff spending a lot of time developing special reports for decision-
• Lots of PC-based or small server systems obtaining extracts of data incapable of
presenting a holistic view of the entire gamut of information.
• The same data present on different systems and in different departments, with users
possibly unaware of this fact.
• Difficulty in getting meaningful information in a timely manner.
• Multiple systems giving different answers to the same business questions.
• Less analysis by decision makers and policy planners due to the non-availability of
sophisticated tools and of easily decipherable, timely and comprehensive information.
PURPOSE OF A DATA WAREHOUSE :
• Better business intelligence for end users.
• Reduction in time to locate, access and analyze information.
• Consolidation of disparate information sources.
• Replacement of older, less-responsive decision support systems
• Faster time to market for products and services
• Strategic advantage over competitors
Data Warehouse Characteristics:
1. Subject-oriented: the warehouse (WH) is organized around the major subjects of the
enterprise rather than the major application areas. This is reflected in the need to store
decision-support data rather than application-oriented data.
2. Integrated: the source data come together from different enterprise-wide
application systems. The source data is often inconsistent, so the integrated data
source must be made consistent to present a unified view of the data to the users.
3. Time-variant: the data in the WH is only accurate and valid at some point in
time or over some time interval. The time-variance of the data warehouse is also
shown in the extended time that the data is held, the implicit or explicit association of
time with all data, and the fact that the data represents a series of snapshots.
4. Non-volatile: data is not updated in real time but is refreshed from the operational
systems (OS) on a regular basis. New data is always added as a supplement to the database
(DB), rather than as a replacement. The DB continually absorbs this new data,
incrementally integrating it with the previous data.
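The time-variant and non-volatile characteristics can be illustrated with a small sketch. It is only a minimal illustration, not a warehouse implementation: the class name SnapshotStore, its methods, and the sample customer data are all hypothetical. Each load appends a dated snapshot and never overwrites earlier rows, so the value of an item can be looked up as of any date.

```python
from datetime import date


class SnapshotStore:
    """Hypothetical append-only store: every load adds dated rows, nothing is overwritten."""

    def __init__(self):
        self._rows = []  # (snapshot_date, key, value) tuples

    def load(self, snapshot_date, records):
        # Non-volatile: new data supplements what is already stored.
        for key, value in records.items():
            self._rows.append((snapshot_date, key, value))

    def as_of(self, key, when):
        # Time-variant: return the latest value recorded on or before `when`.
        candidates = [(d, v) for d, k, v in self._rows if k == key and d <= when]
        if not candidates:
            return None
        return max(candidates, key=lambda pair: pair[0])[1]

    def history(self, key):
        # The series of snapshots held for one item, oldest first.
        return sorted((d, v) for d, k, v in self._rows if k == key)


store = SnapshotStore()
store.load(date(2024, 1, 31), {"cust-1": "Chennai"})
store.load(date(2024, 2, 29), {"cust-1": "Coimbatore"})
```

Asking for "cust-1" as of mid-February returns the January value, while asking as of March returns the February value; both snapshots remain in the history.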
DATA WAREHOUSE LIFE CYCLE:
Data warehousing is a concept, not a product that can be purchased off the shelf. It is
a set of hardware and software components integrated together that can be used to
analyze massive amounts of stored data efficiently. Building a successful data warehouse
is a process, and the following are the five steps towards it.
4. DEVELOPMENT AND IMPLEMENTATION
1. Operational data sources: data for the DW is supplied from mainframe operational data
held in first-generation hierarchical and network databases, departmental data held in
proprietary file systems, private data held on workstations and private servers, and
external systems such as the Internet, commercially available DBs, or DBs associated
with an organization’s suppliers or customers.
2. Operational data store (ODS): a repository of current and integrated operational
data used for analysis. It is often structured and supplied with data in the same way as
the data warehouse, but may in fact simply act as a staging area for data to be moved
into the warehouse.
3. Load manager: also called the frontend component, it performs all the operations
associated with the extraction and loading of data into the warehouse. These
operations include simple transformations of the data to prepare the data for entry into
the warehouse.
4. Warehouse manager: performs all the operations associated with the management of
the data in the warehouse. The operations performed by this component include
analysis of data to ensure consistency, transformation and merging of source data,
creation of indexes and views, generation of denormalizations and aggregations, and
archiving and backing up data.
5. Query manager: also called the backend component, it performs all the operations
associated with the management of user queries. The operations performed by this
component include directing queries to the appropriate tables and scheduling the
execution of queries.
6. Detailed, lightly and highly summarized data, and archive/backup data.
8. End-user access tools: these can be categorized into five main groups: data reporting and
query tools, application development tools, executive information system (EIS) tools,
online analytical processing (OLAP) tools, and data mining tools.
1. Inflow: the processes associated with the extraction, cleansing, and loading of the
data from the source systems into the data warehouse.
2. Upflow: the processes associated with adding value to the data in the warehouse
through summarizing, packaging, and distribution of the data.
3. Downflow: the processes associated with archiving and backing up of data in the
warehouse.
4. Outflow: the processes associated with making the data available to the end users.
5. Meta-flow: the processes associated with the management of the meta-data.
Tools and Technologies:
1. The critical steps in the construction of a data warehouse:
2. After the critical steps, loading the results into the target system can be carried out
either by separate products or by a single product. Tool categories include:
3. Database data replication tools
4. Dynamic transformation engines
The importance of managing meta-data (integration):
1. The integration of meta-data, that is, “data about data”.
2. Meta-data is used for a variety of purposes, and managing it is a critical
issue in achieving a fully integrated data warehouse.
3. The major purpose of meta-data is to show the pathway back to where the data
began, so that the warehouse administrators know the history of any item in the
warehouse.
4. The meta-data associated with data transformation and loading must describe the
source data and any changes that were made to the data.
5. The meta-data associated with data management describes the data as it is stored in
the warehouse.
6. Meta-data is required by the query manager to generate appropriate queries, and is
also associated with the users of queries.
Data Warehousing Issues
1. Semantic Integration: When getting data from
multiple sources, must eliminate mismatches,
e.g., different currencies, DB schemas.
2. Heterogeneous Sources: Must access data from
a variety of source formats and repositories.
Replication capabilities can be exploited here.
3. Load, Refresh, Purge: Must load data,
periodically refresh it, and purge too-old data.
4. Metadata Management: Must keep track of
source, loading time, and other information for
all data in the warehouse.
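The semantic integration issue can be made concrete with a short sketch. It assumes two hypothetical source systems that name their fields differently and record amounts in different currencies. The field maps, the target schema, and the conversion rates are invented for illustration and are not real exchange rates.

```python
# Hypothetical fixed conversion rates to one common currency (USD).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1, "INR": 0.012}


def unify(record, field_map):
    """Rename source-specific fields to the warehouse schema and convert the amount to USD."""
    row = {target: record[source] for source, target in field_map.items()}
    amount = row.pop("amount")
    currency = row.pop("currency")
    row["amount_usd"] = round(amount * RATES_TO_USD[currency], 2)
    return row


# Two sources describing the same kind of fact with mismatched schemas.
source_a = {"cust": "C1", "amt": 100.0, "cur": "EUR"}
source_b = {"customer_id": "C2", "total": 5000.0, "currency_code": "INR"}

MAP_A = {"cust": "customer_id", "amt": "amount", "cur": "currency"}
MAP_B = {"customer_id": "customer_id", "total": "amount", "currency_code": "currency"}

unified = [unify(source_a, MAP_A), unify(source_b, MAP_B)]
```

After unification both rows carry the same two fields, customer_id and amount_usd, so they can be loaded into one warehouse table.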
Star schema: a logical structure that has a fact table containing factual data in the center,
surrounded by dimension tables containing reference data (which can be denormalized).
Snowflake schema: a variant of the star schema where dimension tables do not contain
denormalized data.
Starflake schema: a hybrid structure that contains a mixture of star and snowflake schemas.
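A fact table surrounded by dimension tables, as described above, can be sketched with Python's built-in sqlite3 module. The table names, column names, and sample rows are hypothetical and chosen only to show the shape of the schema and a typical query that joins the fact table to one dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold reference data.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The central fact table holds the measures plus a key into each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units INTEGER,
    revenue REAL
);
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Furniture');
INSERT INTO dim_date VALUES (10, 2024, 1), (11, 2024, 2);
INSERT INTO fact_sales VALUES (1, 10, 5, 50.0), (1, 11, 3, 30.0), (2, 10, 1, 40.0);
""")

# Summarize the facts along the product dimension.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

Here the category column is stored directly in dim_product, which is the denormalized form; a snowflake variant would move categories into a separate table referenced from dim_product.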
The benefits of data warehousing:
1. Potentially high returns on investment
2. Substantial competitive advantage
3. Increased productivity of corporate decision-makers
4. More cost-effective decision making
5. Better enterprise intelligence
6. Enhanced customer service
7. Better asset/liability management
8. Business process reengineering
9. Empowerment of all employees
On Line Transaction Processing:
OLTP systems are the major kinds of enterprise applications:
Order entry systems, Inventory control systems, Reservation
systems, Point-of-sale systems, Tracking systems, etc.
Executive information system (EIS) :
Present information at the highest level of summarization using corporate business
measures. They are designed for extreme ease-of-use and, in many cases, only a mouse is
required. Graphics are usually generously incorporated to provide at-a-glance indications
Decision Support Systems (DSS) :
They ideally present information in graphical and tabular form, providing the user with
the ability to drill down on selected information. Note the increased detail and data
manipulation options presented.
What is data mining?
Data mining refers to the process of analyzing data from different perspectives
and summarizing it into useful information. Data mining software is one of a number
of tools used for analyzing data. It allows users to analyze data from many different
dimensions or angles, categorize it, and summarize the relationships identified.
1. Data mining is about techniques for finding and describing structural patterns in
data.
2. Data mining is the process of finding correlations or patterns among fields in large
databases.
3. The process of extracting valid, previously unknown, comprehensible, and actionable
information from large databases and using it to make crucial business decisions.
Different Types of Data Mining:
1. Business Data Mining
2. Scientific Data Mining
3. Internet Data Mining
Five major elements of Data Mining:
1. Extract, transform, and load transaction data onto the data warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology professionals.
4. Analyze the data with application software.
5. Present the data in a useful format, such as a graph or table.
Requirements of Data Mining:
1. Handling of different types of data
2. Efficiency and scalability of algorithms
3. Usefulness, certainty and expressiveness of results
4. Expression of various kinds of mining results
5. Interactive mining of knowledge at multiple levels
6. Mining information from different sources of data
7. Protection of privacy and data security
Various kinds of data on which Data Mining is applied :
5. Spatial and temporal data
Data mining applications:
A major application of data mining is Web mining.
What is Web Mining?
“Web mining can be broadly defined as the automated discovery
and analysis of useful information from the Web documents and services using data
mining techniques.”
Web mining is the application of data mining or other information processing
techniques to the WWW to find useful patterns. People can take advantage of these patterns
to access the WWW more efficiently.
NEED FOR WEB MINING:
Nowadays, the World Wide Web is a popular and interactive medium, ideal for
publishing information. It is huge, diverse and dynamic, and thus raises issues of
scalability, multimedia data and temporal data, respectively. As a result, users
are currently “drowning” in an information overload that expands at a rate that far outpaces
the human ability to process and exploit it.
Domains of Web Mining:
There are three domains that pertain to Web mining:
1. Web Content Mining
2. Web Structure Mining
3. Web Usage Mining
1. Web Content Mining
Web content mining is an automatic process that extracts patterns from on-line
information, such as HTML files, images, or e-mails, and it already goes beyond
keyword extraction or simple statistics of words and phrases in documents. Web
content mining is the "process of information or resource discovery from millions of
sources across the World Wide Web". There are two approaches in Web content mining:
the agent-based approach and the database approach.
The agent-based approach involves artificial intelligence systems that can "act
autonomously or semi-autonomously on behalf of a particular user, to discover and
organize Web-based information". Some intelligent Web agents can use a user profile to
search for relevant information, then organize and interpret the discovered information.
The database approach focuses on "integrating and organizing the heterogeneous
and semi-structured data on the Web into more structured and high-level collections of
resources." These "metadata, are organized into structured collections (e.g., relational or
object-oriented databases) and can be analyzed".
2. Web Structure Mining
Web structure data describes the organization of content. Intra-page structure information
includes the arrangement of various HTML or XML tags within a given page. This can
be represented as a tree structure, where the <html> tag becomes the root of the tree. The
principal kind of inter-page structure information is the hyperlinks connecting one page to
another.
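Both kinds of structure information can be extracted with Python's standard html.parser module, as in the following minimal sketch. The class name StructureParser and the sample page are hypothetical, and the depth tracking assumes well-formed markup without void elements such as <br> or <img>.

```python
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Collects the tag tree (intra-page) and the hyperlinks (inter-page) of one page."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tree = []   # (depth, tag) pairs in document order; depth 0 is the root
        self.links = []  # href targets of <a> tags

    def handle_starttag(self, tag, attrs):
        self.tree.append((self.depth, tag))
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


PAGE = "<html><body><h1>Home</h1><p><a href='/about.html'>About</a></p></body></html>"

parser = StructureParser()
parser.feed(PAGE)
```

The html tag appears at depth 0 as the root of the tree, and each collected href is one directed link from this page to another page.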
3. Web Usage Mining
Web servers record and accumulate data about user interactions whenever
requests for resources are received. Analyzing the Web access logs of different Web sites
can help in understanding user behavior and the Web structure, and thereby in improving the
design of this colossal collection of resources.
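A first step in usage mining is turning raw access-log lines into countable records. The sketch below assumes the server writes lines in the Common Log Format. The sample log lines, the regular expression, and the function name page_hits are illustrative assumptions rather than a reference parser.

```python
import re
from collections import Counter

# Assumed Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

SAMPLE_LOG = """\
10.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
10.0.0.2 - - [10/Oct/2024:13:56:01 +0000] "GET /catalog.html HTTP/1.1" 200 512
10.0.0.1 - - [10/Oct/2024:13:57:12 +0000] "GET /catalog.html HTTP/1.1" 200 512
10.0.0.3 - - [10/Oct/2024:13:58:40 +0000] "GET /missing.html HTTP/1.1" 404 0
"""


def page_hits(log_text):
    """Count successful (status 200) requests per path."""
    hits = Counter()
    for line in log_text.splitlines():
        match = LOG_PATTERN.match(line)
        if match and match.group("status") == "200":
            hits[match.group("path")] += 1
    return hits


hits = page_hits(SAMPLE_LOG)
```

Counts like these show which resources are actually used, and failed requests such as the 404 above point to broken links in the site structure.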
Web Mining Techniques
The common techniques for Web mining are:
1. Classification
This technique is used to develop profiles of items with similar characteristics.
This ability enhances the discovery of relationships that are otherwise not obvious. E.g.,
classification of Web access logs allows a company to discover the average age of
customers who order a certain product.
2. Association rules
Rules that govern "databases of transactions where each transaction consists of a
set of items." This technique is used to predict the correlation of items "where the
presence of one set of items in a transaction implies (with a certain degree of confidence)
the presence of other items."
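The "degree of confidence" of a rule can be computed directly from a set of transactions. The following sketch uses the standard definitions of support and confidence. The transactions are made-up sample data and the function names are illustrative.

```python
# Each transaction is the set of items it contains (hypothetical sample data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


bread_to_milk = confidence({"bread"}, {"milk"}, transactions)
```

In this sample, bread appears in three of the four transactions and bread together with milk in two, so the rule "bread implies milk" holds with a confidence of two thirds.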
3. Path analysis
A technique that involves the generation of some form of graph that "represents
relation[s] defined on Web pages." This can be the physical layout of a Web site in which
the Web pages are nodes and the hypertext links between these pages are directed edges.
E.g., what paths do users travel before they go to a particular URL?
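A site layout with pages as nodes and links as directed edges can be held in a plain dictionary, and the paths leading to a particular URL can then be enumerated by a depth-first search. The site map and the function name paths_to below are hypothetical, and the sketch enumerates the paths the link structure allows rather than the paths recorded in real logs.

```python
# Hypothetical site layout: each page maps to the pages it links to (directed edges).
SITE_LINKS = {
    "/home": ["/catalog", "/about"],
    "/catalog": ["/item"],
    "/about": ["/catalog"],
    "/item": [],
}


def paths_to(graph, start, target, path=None):
    """Return every cycle-free path from `start` to `target`, found by depth-first search."""
    path = (path or []) + [start]
    if start == target:
        return [path]
    found = []
    for next_page in graph.get(start, []):
        if next_page not in path:  # skip pages already on this path to avoid cycles
            found.extend(paths_to(graph, next_page, target, path))
    return found


routes = paths_to(SITE_LINKS, "/home", "/item")
```

Comparing such possible routes with the routes users actually take is one way to spot pages that are hard to reach.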
4. Sequential patterns
Applied to Web access server transaction logs. The purpose is to discover
sequential patterns that indicate user visit patterns over a certain period.
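One simple form of sequential pattern is an ordered pair of pages, where the first page is visited at some point before the second in the same session. The sketch below counts such pairs across sessions and keeps those that reach a minimum support. The session data, the threshold, and the function name frequent_sequences are illustrative assumptions, and real sequential pattern miners handle longer sequences and time windows.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: the ordered pages one user visited in one visit.
sessions = [
    ["/home", "/catalog", "/item"],
    ["/home", "/item"],
    ["/catalog", "/home", "/item"],
]


def frequent_sequences(sessions, min_support):
    """Return ordered page pairs (a before b) that occur in at least `min_support` sessions."""
    counts = Counter()
    for visit in sessions:
        # combinations() preserves order, so (a, b) means a was visited before b.
        # The set ensures each pair is counted once per session.
        counts.update(set(combinations(visit, 2)))
    return {pair: n for pair, n in counts.items() if n >= min_support}


patterns = frequent_sequences(sessions, min_support=2)
```

With this sample data, the pair "/home" then "/item" occurs in all three sessions and "/catalog" then "/item" in two, while the remaining pairs fall below the threshold.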
Web mining as a tool:
Web mining can be a promising tool to address the shortcomings of search
engines, such as incomplete indexing and the unverified reliability of retrieved
information. Web mining not only discovers information from mounds of data on the WWW,
but also monitors and predicts user visit habits. This gives designers more reliable
information for structuring and designing a Web site. Web mining technology can help
librarians design Web sites with paths that can be traveled easily by end users, saving
time and effort. E.g., Web mining technology and academic librarianship.
Data warehousing provides the means to change raw data into information for
making effective business decisions; the emphasis is on information, not data. The data
warehouse is the hub for decision support data.
Data mining is a useful tool with multiple algorithms that can be tuned for specific
tasks. It can benefit business, medicine, and science. It needs more efficient algorithms to
speed up the data mining process. Web mining is a huge, interdisciplinary and very
dynamic scientific area, converging from several research communities such as databases,
information retrieval and artificial intelligence, especially machine learning and
natural language processing. This area is so broad today partly due to the interests of
various research communities.