Overview of Data Mining:
Generally, data mining (sometimes called data or knowledge discovery) is the process of
analyzing data from different perspectives and summarizing it into useful information -
information that can be used to increase revenue, cut costs, or both.
Data mining, the extraction of hidden predictive information from large databases, is a
powerful new technology with great potential to help companies focus on the most important
information in their data warehouses. Data mining tools predict future trends and behaviors,
allowing businesses to make proactive, knowledge-driven decisions.
Data mining software is one of a number of analytical tools for analyzing data. It allows
users to analyze data from many different dimensions or angles, categorize it, and summarize
the relationships identified.
DEFINITION OF 'DATA MINING'
A process used by companies to turn raw data into useful information. By using software to
look for patterns in large batches of data, businesses can learn more about their customers
and develop more effective marketing strategies as well as increase sales and decrease costs.
Data mining depends on effective data collection and warehousing as well as computer
Grocery stores are well-known users of data mining techniques. Many supermarkets offer
free loyalty cards to customers that give them access to reduced prices not available to non-
members. The cards make it easy for stores to track who is buying what, when they are
buying it, and at what price. The stores can then use this data, after analyzing it, for multiple
purposes, such as offering customers coupons that are targeted to their buying habits and
deciding when to put items on sale and when to sell them at full price.
Data Mining Engine
The data mining engine is essential to the data mining system. It consists of a set of
modules for the following tasks:
o Association and Correlation Analysis
o Cluster analysis
o Outlier analysis
o Evolution analysis
Purpose and Uses of Data Mining
The purpose of data mining is to identify patterns in order to make predictions from
information contained in databases. It allows the user to be proactive in identifying and
predicting trends with that information.
Common uses of data mining in government include knowledge discovery, fraud detection,
analysis of research, decision support, and website personalization.
The most common federal government uses of data mining as identified by GAO include:
1) Improving service or performance
2) Detecting fraud, waste, and abuse
3) Analyzing scientific and research information
4) Managing human resources
5) Detecting criminal activities or patterns
6) Analyzing intelligence and detecting terrorist activities.
State government data mining efforts include programs to ensure that the proper beneficiaries
of state benefits programs receive the correct amount of benefits. Such uses can save states
substantial amounts of money that otherwise would be erroneously paid out in the form of
Moreover, in a recent report, GAO found that twenty-one states are using data mining
software to look for unusual patterns in claims, provider, and beneficiary information stored
in data warehouses in order to identify potential provider abuse.
Major Data Mining Tasks
The two high-level primary goals of data mining, in practice, are prediction and description.
1) Prediction involves using some variables or fields in the database to predict unknown or
future values of other variables of interest.
2) Description focuses on finding human-interpretable patterns describing the data.
The relative importance of prediction and description can vary considerably between data
mining applications. However, in the context of the knowledge discovery in databases (KDD)
process, description tends to be more important than prediction. This is in contrast to pattern
recognition and machine learning applications (such as speech recognition), where prediction
is often the primary goal.
The goals of prediction and description are achieved by using the following primary data
mining tasks:
1. Classification is learning a function that maps (classifies) a data item into one of several
predefined classes.
2. Regression is learning a function which maps a data item to a real-valued prediction variable.
3. Clustering is a common descriptive task where one seeks to identify a finite set of categories
or clusters to describe the data.
o A cluster is a group of similar objects. Cluster analysis forms groups of objects that are
very similar to each other but highly different from the objects in other clusters.
o Closely related to clustering is the task of probability density estimation which consists of
techniques for estimating, from data, the joint multi-variate probability density function
of all of the variables/fields in the database.
4. Summarization involves methods for finding a compact description for a subset of data.
5. Dependency Modeling consists of finding a model which describes significant dependencies
between variables.
o Dependency models exist at two levels:
i. The structural level of the model specifies (often graphically) which variables are
locally dependent on each other, and
ii. The quantitative level of the model specifies the strengths of the dependencies using
some numerical scale.
6. Change and Deviation Detection focuses on discovering the most significant changes in the
data from previously measured or normative values.
Mining Methodology and User Interaction Issues
This refers to the following kinds of issues:
1. Mining different kinds of knowledge in databases
Different users have different needs and may be interested in different kinds of knowledge.
Therefore data mining must cover a broad range of knowledge discovery tasks.
2. Interactive mining of knowledge at multiple levels of abstraction
The data mining process needs to be interactive so that users can focus the search for
patterns, providing and refining data mining requests based on returned results.
3. Incorporation of background knowledge
Background knowledge can be used to guide the discovery process and to express the discovered
patterns. It allows patterns to be expressed not only in concise terms but also at multiple
levels of abstraction.
4. Data mining query languages and ad hoc data mining
A data mining query language that allows the user to describe ad hoc mining tasks should be
integrated with a data warehouse query language and optimized for efficient and flexible data
mining.
5. Presentation and visualization of data mining results
Once patterns are discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable by the users.
6. Handling noisy or incomplete data
Data cleaning methods are required that can handle noise and incomplete objects while mining
the data regularities. Without data cleaning methods, the accuracy of the discovered patterns
will be poor.
7. Pattern evaluation
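This refers to the interestingness of the discovered patterns. A data mining system can uncover
many patterns, and some of them are not interesting because they represent common knowledge or
lack novelty, so the patterns need to be evaluated for interestingness.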
Performance Issues
This refers to the following issues:
1. Efficiency and scalability of data mining algorithms
To effectively extract information from the huge amount of data in databases, data
mining algorithms must be efficient and scalable.
2. Parallel, distributed, and incremental mining algorithms
Factors such as the huge size of databases, the wide distribution of data, and the complexity of
data mining methods motivate the development of parallel and distributed data mining algorithms.
These algorithms divide the data into partitions, which are then processed in parallel, and the
results from the partitions are merged. Incremental algorithms incorporate database updates
without having to mine the data again from scratch.
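The partition-and-merge idea and the incremental update described above can be sketched in
Python. This is only a minimal illustration, assuming the mining task is simple item counting;
the function names (count_items, mine_in_partitions, update_counts) and the sample transactions
are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Mine one partition: count in how many transactions each item occurs."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def mine_in_partitions(transactions, n_partitions=2):
    """Split the data, process each partition concurrently, then merge the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    merged = Counter()
    for counts in partial_counts:
        merged += counts
    return merged


def update_counts(existing_counts, new_transactions):
    """Incremental step: fold new data into earlier results without re-mining."""
    return existing_counts + count_items(new_transactions)


# Hypothetical sample data.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["milk", "butter", "bread"],
    ["milk"],
]
counts = mine_in_partitions(transactions)
updated = update_counts(counts, [["bread", "jam"]])
```

Threads are used here only to keep the sketch self-contained; a real system would more likely
distribute partitions across processes or machines.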
Diverse Data Types Issues
1. Handling of relational and complex types of data
The database may contain complex data objects, multimedia data objects, spatial data,
temporal data, etc. It is not possible for one system to mine all these kinds of data.
2. Mining information from heterogeneous databases and global information systems
The data is available at different data sources on a LAN or WAN. These data sources may be
structured, semi-structured, or unstructured. Therefore mining knowledge from them adds
challenges to data mining.
Classification and Prediction Issues
The major issue is preparing the data for classification and prediction. Preparing the data
involves the following activities:
1) Data Cleaning
Data cleaning involves removing noise and treating missing values. Noise is removed by
applying smoothing techniques, and the problem of missing values is solved by replacing a
missing value with the most commonly occurring value for that attribute.
2) Relevance Analysis
The database may also contain irrelevant attributes. Correlation analysis is used to determine
whether any two given attributes are related.
3) Data Transformation and reduction
The data can be transformed by either of the following methods:
o Normalization - Normalization involves scaling all values of a given attribute so that they
fall within a small specified range. It is used when the learning step relies on neural
networks or on methods involving distance measurements.
o Generalization - The data can also be transformed by generalizing it to higher-level
concepts. For this purpose we can use concept hierarchies.
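The three preparation steps above can be sketched in a few lines of Python. This is an
illustrative sketch, not a prescribed implementation: missing values are assumed to be
represented by None, relevance is measured with the Pearson correlation coefficient (one common
choice of correlation analysis), normalization is assumed to be min-max scaling, and all names
and sample values are hypothetical.

```python
from collections import Counter
from math import sqrt


def fill_missing_with_mode(values):
    """Data cleaning: replace None with the most commonly occurring value."""
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def pearson(xs, ys):
    """Relevance analysis: Pearson correlation between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: scale values linearly into [new_min, new_max].

    Assumes the attribute is not constant (max(values) > min(values)).
    """
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


# Hypothetical attribute values.
colors = fill_missing_with_mode(["red", None, "blue", "red"])
scaled = min_max_normalize([20000, 35000, 50000, 80000])
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])
```

A correlation close to 1 or -1 suggests that one of the two attributes is redundant for the
learning step.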
Data Mining Applications
Here is the list of areas where data mining is widely used:
o Financial Data Analysis
o Retail Industry
o Telecommunication Industry
o Biological Data Analysis
o Other Scientific Applications
o Intrusion Detection
1. FINANCIAL DATA ANALYSIS
The financial data in the banking and financial industry is generally reliable and of high
quality, which facilitates systematic data analysis and data mining.
Here are a few typical cases:
o Design and construction of data warehouses for multidimensional data analysis and data
mining.
o Loan payment prediction and customer credit policy analysis.
o Classification and clustering of customers for targeted marketing.
o Detection of money laundering and other financial crimes.
2. RETAIL INDUSTRY
Data mining has great application in the retail industry because the industry collects large
amounts of data on sales, customer purchasing history, goods transportation, consumption, and
services. It is natural that the quantity of data collected will continue to expand rapidly
because of the increasing ease, availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends,
which leads to improved quality of customer service and good customer retention and
satisfaction. Here is a list of examples of data mining in the retail industry:
o Design and Construction of data warehouses based on benefits of data mining.
o Multidimensional analysis of sales, customers, products, time and region.
o Analysis of effectiveness of sales campaigns.
o Customer Retention.
o Product recommendation and cross-referencing of items.
3. TELECOMMUNICATION INDUSTRY
Today the telecommunication industry is one of the fastest-emerging industries, providing
various services such as fax, pager, cellular phone, Internet messenger, images, e-mail, web
data transmission, etc.
Due to the development of new computer and communication technologies, the
telecommunication industry is rapidly expanding. This is why data mining has become very
important for helping to understand the business.
Data mining in the telecommunication industry helps in identifying telecommunication
patterns, catching fraudulent activities, making better use of resources, and improving quality
of service.
Here is a list of examples in which data mining improves telecommunication services:
o Multidimensional Analysis of Telecommunication data.
o Fraudulent pattern analysis.
o Identification of unusual patterns.
o Multidimensional association and sequential patterns analysis.
o Mobile Telecommunication services.
o Use of visualization tools in telecommunication data analysis.
4. BIOLOGICAL DATA ANALYSIS
Biology is seeing vast growth in fields such as genomics, proteomics, functional genomics, and
biomedical research. Biological data mining is a very important part
The following are the aspects in which data mining contributes to biological data analysis:
o Semantic integration of heterogeneous, distributed genomic and proteomic databases.
o Alignment, indexing, similarity search and comparative analysis multiple nucleotide
o Discovery of structural patterns and analysis of genetic networks and protein pathways.
o Association and path analysis.
o Visualization tools in genetic data analysis.
5. OTHER SCIENTIFIC APPLICATIONS
The applications discussed above tend to handle relatively small and homogeneous data sets
for which statistical techniques are appropriate. Huge amounts of data have been collected
from scientific domains such as geosciences, astronomy, etc. Large data sets are also being
generated by fast numerical simulations in various fields such as climate and ecosystem
modeling, chemical engineering, fluid dynamics, etc.
The following are applications of data mining in the field of scientific applications:
o Data Warehouses and data preprocessing.
o Graph-based mining.
o Visualization and domain specific knowledge.
6. INTRUSION DETECTION
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become a major
issue. The increased usage of the Internet and the availability of tools and tricks for intruding
and attacking networks have prompted intrusion detection to become a critical component of network
Here is the list of areas in which data mining technology may be applied for intrusion
detection:
o Development of data mining algorithm for intrusion detection.
o Association and correlation analysis, aggregation to help select and build discriminating
o Analysis of stream data.
o Distributed data mining.
o Visualization and query tools.
Data mining Process
Data Mining is an analytic process designed to explore data (usually large amounts of data -
typically business or market related - also known as "big data") in search of consistent
patterns and/or systematic relationships between variables, and then to validate the findings
by applying the detected patterns to new subsets of data.
The ultimate goal of data mining is prediction - and predictive data mining is the most
common type of data mining and one that has the most direct business applications.
The process of data mining consists of three stages:
1. The initial exploration.
2. Model building or pattern identification with validation/verification.
3. Deployment (i.e., the application of the model to new data in order to generate predictions).
Stage 1: Exploration
This stage usually starts with data preparation which may involve cleaning data, data
transformations, selecting subsets of records and - in case of data sets with large numbers
of variables ("fields") - performing some preliminary feature selection operations to bring
the number of variables to a manageable range (depending on the statistical methods
which are being considered).
Then, depending on the nature of the analytic problem, this first stage of the process of
data mining may involve anywhere between a simple choice of straightforward predictors
for a regression model, to elaborate exploratory analyses using a wide variety of
graphical and statistical methods.
Stage 2: Model building and validation
This stage involves considering various models and choosing the best one based on their
predictive performance (i.e., explaining the variability in question and producing stable
results across samples). This may sound like a simple operation, but in fact, it sometimes
involves a very elaborate process. There are a variety of techniques developed to achieve
that goal - many of which are based on so-called "competitive evaluation of models," that
is, applying different models to the same data set and then comparing their performance
to choose the best.
These techniques - which are often considered the core of predictive data mining -
include: Bagging(Voting, Averaging), Boosting, Stacking (Stacked Generalizations),
Stage 3: Deployment
That final stage involves using the model selected as best in the previous stage and applying
it to new data in order to generate predictions or estimates of the expected result or outcome.
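The three stages can be sketched end to end in Python. This is a minimal sketch under stated
assumptions: the data set, the two candidate models (a mean predictor and a least-squares
line), and the use of mean squared error on held-out records as the comparison criterion are
all illustrative choices, not something the process itself prescribes.

```python
def fit_mean(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate 2: least-squares straight line y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x
    return lambda x: a + b * x


def mse(model, xs, ys):
    """Mean squared prediction error of a model on the given records."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Stage 1: exploration and preparation -- drop records with missing values.
raw = [(1, 2.1), (2, 3.9), (3, None), (4, 8.2), (5, 9.8), (6, 12.1), (7, 14.0)]
clean = [(x, y) for x, y in raw if y is not None]

# Stage 2: model building and validation -- compare candidates on held-out data.
train, valid = clean[:4], clean[4:]
train_x, train_y = zip(*train)
valid_x, valid_y = zip(*valid)
candidates = {
    "mean": fit_mean(train_x, train_y),
    "line": fit_line(train_x, train_y),
}
scores = {name: mse(model, valid_x, valid_y) for name, model in candidates.items()}
best_name = min(scores, key=scores.get)

# Stage 3: deployment -- apply the selected model to new data.
prediction = candidates[best_name](8)
```

Comparing candidates on records that were not used for fitting is what makes the evaluation
"competitive" in the sense described in Stage 2.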
The concept of Data Mining is becoming increasingly popular as a business information
management tool where it is expected to reveal knowledge structures that can guide decisions
in conditions of limited certainty or assurance.
In recent times, there has been increased interest in developing new analytic techniques
specifically designed to address the issues relevant to business data
mining (e.g., Classification Trees), but Data Mining is still based on the conceptual principles
of statistics including the traditional Exploratory Data Analysis (EDA) and modeling and it
shares with them both some components of its general approaches and specific techniques.
However, an important general difference in the focus and purpose between Data Mining and
the traditional Exploratory Data Analysis (EDA) is that Data Mining is more oriented
towards applications than the basic nature of the underlying phenomena. In other words, Data
Mining is relatively less concerned with identifying the specific relations between the
involved variables. For example, uncovering the nature of the underlying functions or the
specific types of interactive, multivariate dependencies between variables are not the main
goal of Data Mining. Instead, the focus is on producing a solution that can generate useful
predictions. Therefore, Data Mining accepts among others a "black box" approach to data
exploration or knowledge discovery and uses not only the traditional Exploratory Data
Analysis (EDA) techniques, but also such techniques as Neural Networks which can generate
valid predictions but are not capable of identifying the specific nature of the interrelations
between the variables on which the predictions are based.
The Scope of Data Mining
Data mining derives its name from the similarities between searching for valuable business
information in a large database (for example, finding linked products in gigabytes of store
scanner data) and mining a mountain for a vein of valuable ore. Both processes require
either sifting through an immense amount of material, or intelligently probing it to find
exactly where the value resides. Given databases of sufficient size and quality, data mining
technology can generate new business opportunities by providing these capabilities:
1. Automated prediction
Automated prediction of trends and behaviors. Data mining automates the process of finding
predictive information in large databases. Questions that traditionally required extensive
hands-on analysis can now be answered directly from the data quickly.
A typical example of a predictive problem is targeted marketing. Data mining uses data on
past promotional mailings to identify the targets most likely to maximize return on
investment in future mailings. Other predictive problems include forecasting bankruptcy and
other forms of default, and identifying segments of a population likely to respond similarly to
2. Automated discovery
Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step.
An example of pattern discovery is the analysis of retail sales data to identify seemingly
unrelated products that are often purchased together. Other pattern discovery problems
include detecting fraudulent credit card transactions and identifying anomalous data that
could represent data entry keying errors.
Data mining techniques can yield the benefits of automation on existing software and
hardware platforms, and can be implemented on new systems as existing platforms are
upgraded and new products developed. When data mining tools are implemented on high
performance parallel processing systems, they can analyze massive databases in minutes.
Faster processing means that users can automatically experiment with more models to
understand complex data. High speed makes it practical for users to analyze huge quantities
of data. Larger databases, in turn, yield improved predictions.
Techniques of data mining
1. Decision trees:
Tree-shaped structures that represent sets of decisions. These decisions generate rules for the
classification of a dataset. Specific decision tree methods include Classification and
Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
In the decision tree technique, the root of the decision tree is a simple question or condition
that has multiple answers. Each answer then leads to a further set of questions or conditions
that help us narrow down the data so that we can make the final decision based on it.
For example, consider a decision tree for determining whether or not to play tennis.
Starting at the root node, if the outlook is overcast then we should definitely play tennis. If it
is rainy, we should only play tennis if the wind is weak. And if it is sunny then we should
play tennis only if the humidity is normal.
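The play-tennis tree described above can be written directly as nested conditions. The
function name and the string values used for each attribute are assumptions made for this
sketch.

```python
def play_tennis(outlook, humidity, wind):
    """Walk the tree from the root (outlook) down to a yes/no leaf."""
    if outlook == "overcast":
        return True                  # overcast: always play
    if outlook == "rainy":
        return wind == "weak"        # rainy: play only if the wind is weak
    if outlook == "sunny":
        return humidity == "normal"  # sunny: play only if humidity is normal
    raise ValueError(f"unknown outlook: {outlook!r}")


decision = play_tennis("rainy", humidity="high", wind="weak")
```

Each path from the root to a leaf corresponds to one classification rule, for example
"outlook is rainy and wind is weak, so play".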
2. Association
Association is one of the best-known data mining techniques. In association, a pattern is
discovered based on a relationship between items in the same transaction. That is the reason
why the association technique is also known as the relation technique. The association
technique is used in market basket analysis to identify a set of products that customers
frequently purchase together.
Retailers use the association technique to research customers' buying habits. Based on
historical sales data, retailers might find out that customers always buy crisps when they buy
beer, and therefore they can put beer and crisps next to each other to save time for customers
and increase sales.
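One common way to quantify such a relationship is with support and confidence, two standard
association-rule measures that the text above does not name. The sketch below assumes each
transaction is a set of item names; the baskets are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets.
baskets = [
    {"beer", "crisps"},
    {"beer", "crisps", "bread"},
    {"beer", "crisps", "milk"},
    {"bread", "milk"},
]
pair_support = support(baskets, {"beer", "crisps"})
rule_confidence = confidence(baskets, {"beer"}, {"crisps"})
```

In this made-up data every basket with beer also contains crisps, which is the kind of rule
that would justify placing the two products next to each other.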
3. Classification
Classification is a classic data mining technique based on machine learning. Basically,
classification is used to classify each item in a set of data into one of a predefined set of
classes or groups. Classification methods make use of mathematical techniques such as decision
trees, linear programming, neural networks, and statistics.
In classification, we develop software that can learn how to classify the data items into
groups.
For example, we can apply classification to the problem: “given all records of employees
who left the company, predict who will probably leave the company in a future period.” In
this case, we divide the records of employees into two groups named “leave” and “stay”,
and then we can ask our data mining software to classify the employees into these groups.
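As a rough illustration of the employee example, the sketch below uses a nearest-centroid
classifier, which is just one simple technique chosen for brevity and not one the text names.
The two features (years at the company, satisfaction score) and all records are hypothetical.

```python
def centroid(rows):
    """Component-wise mean of a list of feature tuples."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def train(labelled_records):
    """Learn one centroid per predefined class from labelled history."""
    groups = {}
    for features, label in labelled_records:
        groups.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in groups.items()}


def classify(model, features):
    """Assign the class whose centroid is closest to the given features."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: sq_dist(model[label]))


# Hypothetical history: (years at company, satisfaction score), outcome.
history = [
    ((1.0, 2.0), "leave"),
    ((2.0, 3.0), "leave"),
    ((8.0, 8.0), "stay"),
    ((10.0, 9.0), "stay"),
]
model = train(history)
label = classify(model, (1.5, 3.0))
```

The classes "leave" and "stay" are fixed in advance; the software only learns how to assign
new records to them.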
4. Clustering
Clustering is a data mining technique that automatically forms meaningful or useful clusters of
objects which have similar characteristics.
The clustering technique defines the classes and puts objects in each class, while in
classification techniques, objects are assigned to predefined classes. To make the concept
clearer, we can take book management in a library as an example. In a library, there is a wide
range of books on various topics available. The challenge is how to keep those books in a way
that readers can take several books on a particular topic without hassle. By using the clustering
technique, we can keep books that have some kinds of similarities in one cluster or on one shelf
and label it with a meaningful name. If readers want to grab books on that topic, they would
only have to go to that shelf instead of searching the entire library.
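A minimal sketch of clustering, assuming k-means as the technique (the text does not name a
specific algorithm) and representing each object as a pair of numeric features. The points and
the starting centroids are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical objects described by two numeric features.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
```

Unlike classification, no class labels are supplied; the two groups emerge from the
similarity of the objects alone.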
5. Prediction
Prediction, as its name implies, is a data mining technique that discovers the relationship
between independent variables and the relationship between dependent and independent
variables.
For instance, the prediction analysis technique can be used in sales to predict future profit:
if we consider sales as an independent variable, profit could be a dependent variable.
Then, based on the historical sales and profit data, we can draw a fitted regression curve that is
used for profit prediction.
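The sales and profit example can be sketched with a straight-line fit by ordinary least
squares, assuming a linear relationship is adequate. The figures below are invented for
illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical historical data: sales (independent) and profit (dependent).
sales = [10, 20, 30, 40]
profit = [3, 5, 7, 9]
intercept, slope = fit_line(sales, profit)

# Predict profit for a future sales level.
predicted_profit = intercept + slope * 50
```

With real data the points will not lie exactly on a line, and the fitted line is then the
one that minimizes the sum of squared differences between observed and predicted profit.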
6. Sequential Patterns
Sequential pattern analysis is a data mining technique that seeks to discover or identify
similar patterns, regular events, or trends in transaction data over a business period.
In sales, with historical transaction data, businesses can identify a set of items that customers
buy together at different times in a year. Businesses can then use this information to
recommend that customers buy those items with better deals based on their purchasing frequency
in the past.
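A very small sketch of the idea: count, across customers, how often one item is bought at some
point before another. Real sequential pattern mining handles longer sequences and time
windows; the function name and the purchase histories below are hypothetical.

```python
from collections import Counter


def sequential_pairs(sequences):
    """Count ordered pairs (a, b) where a is bought before b.

    Each customer contributes at most once to each pair.
    """
    counts = Counter()
    for seq in sequences:
        seen_pairs = set()
        for i, first in enumerate(seq):
            for later in seq[i + 1:]:
                if first != later:
                    seen_pairs.add((first, later))
        counts.update(seen_pairs)
    return counts


# Hypothetical purchase histories, one time-ordered list per customer.
histories = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone", "charger"],
]
pair_counts = sequential_pairs(histories)
top_pair, top_count = pair_counts.most_common(1)[0]
```

A pair that occurs for many customers, such as a phone followed later by a charger in this
made-up data, is a candidate for a follow-up recommendation.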
Challenges in Web Mining
The web poses great challenges for resource and knowledge discovery based on the
following observations:
1. The web is too huge
The size of the web is huge and rapidly increasing. It seems that the web is too huge
for data warehousing and data mining.
2. Complexity of Web pages
Web pages do not have a unifying structure. They are very complex compared to traditional
text documents. There is a huge number of documents in the digital library of the web, and
these libraries are not arranged in any particular sorted order.
3. Web is dynamic information source
The information on the web is rapidly updated. Data such as news, stock markets,
weather, sports, shopping, etc. are regularly updated.
4. Diversity of user communities
The user community on the web is rapidly expanding. These users have different
backgrounds, interests, and usage purposes. There are more than 100 million workstations
connected to the Internet, and the number is still rapidly increasing.
5. Relevancy of Information
A particular person is generally interested in only a small portion of the web, while the
rest of the web contains information that is not relevant to the user and may swamp the
desired results.
Advantages of Data Mining
1. Marketing / Retail
Data mining helps marketing companies build models based on historical data to predict who
will respond to new marketing campaigns such as direct mail and online marketing
campaigns. Using the results, marketers can take an appropriate approach to selling profitable
products to targeted customers.
Data mining brings many benefits to retail companies in the same way as to marketing.
Through market basket analysis, a store can arrange its products so that customers can
conveniently buy frequently purchased products together. In addition, it
also helps retail companies offer discounts on particular products that will attract
more customers.
2. Finance / Banking
Data mining gives financial institutions information about loans and credit
reporting. By building a model from historical customer data, banks and financial
institutions can distinguish good loans from bad ones. In addition, data mining helps banks
detect fraudulent credit card transactions to protect credit card owners.
3. Manufacturing
By applying data mining to operational engineering data, manufacturers can detect faulty
equipment and determine optimal control parameters.
For example, semiconductor manufacturers face the challenge that even when the conditions of
manufacturing environments at different wafer production plants are similar, the quality of
the wafers is not the same, and some wafers have defects for unknown reasons. Data mining has
been applied to determine the ranges of control parameters that lead to the production of
golden wafers. Those optimal control parameters are then used to manufacture wafers with
the desired quality.
4. Governments
Data mining helps government agencies by digging through and analyzing records of financial
transactions to build patterns that can detect money laundering or criminal activities.
Disadvantages of Data Mining
1. Privacy Issues
Concerns about personal privacy have been increasing enormously, especially as the
Internet booms with social networks, e-commerce, forums, and blogs.
Because of privacy issues, people are afraid that their personal information will be collected
and used in unethical ways that could cause them a lot of trouble. Businesses collect
information about their customers in many ways to understand their purchasing
behavior.
However, businesses do not last forever; some day they may be acquired by others or go out of
business. At that time, the personal information they own may be sold to others or leaked.
2. Security issues
Security is a big issue. Businesses own information about their employees and customers,
including social security numbers, birthdays, payroll, etc.
However, how properly this information is taken care of is still in question. There have been
many cases in which hackers accessed and stole large amounts of customer data from big
corporations such as Ford Motor Credit Company and Sony. With so much personal and financial
information available, credit card theft and identity theft become a big problem.
3. Misuse of information/inaccurate information
Information collected through data mining for ethical purposes can be
misused. This information may be exploited by unethical people or businesses to take
advantage of vulnerable people or to discriminate against a group of people.
In addition, data mining techniques are not perfectly accurate. Therefore, if inaccurate
information is used for decision-making, it can cause serious consequences.
Data Mining Example: Marketing
In marketing, particularly in advertising campaigns, data mining can often increase
the response and purchase rate by a factor of two to three.
The following describes a typical data mining example:
A company wants to launch an advertising campaign for a product. Among its present
customers the company wants to post product information to those with a high probability of
purchasing the product. The company has data describing the past customer behaviour and
personal data about each of its customers. There are also customers who have already bought
the product, e.g. in a trial period. The customers of the trial period are divided into two
classes: those who have bought the product and those who have not. With this data a
prediction model is created to predict the probability of purchasing the product. After that the
probability of purchasing the product is predicted for all other customers. Only those with a
higher probability are addressed. As a side effect the company learns with this data mining
analysis which are the relevant driver attributes of its customers buying a specific product (or
at least being very interested in it).
The example shows how Data Mining can help in marketing to predict the purchase
probability of customers for a specific product. This reduces cost, because sales activity can
be focused much better (lower cost for mailings and flyers or for cost intensive sales agents’
visits on the spot). The customers benefit at the same time because the average relevance of
the company’s offers increases (or the other way round: the “spam” quota of non-relevant
offers is reduced).
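The campaign example can be sketched as follows. This is a deliberately simple illustration:
the "prediction model" is assumed to be nothing more than the observed purchase rate per
customer segment in the trial period, and the segment names, customers, and threshold are
hypothetical.

```python
from collections import defaultdict


def purchase_rates(trial):
    """Estimate purchase probability per segment from trial-period outcomes.

    `trial` is a list of (segment, bought) pairs.
    """
    totals, buyers = defaultdict(int), defaultdict(int)
    for segment, bought in trial:
        totals[segment] += 1
        buyers[segment] += int(bought)
    return {segment: buyers[segment] / totals[segment] for segment in totals}


def select_targets(customers, rates, threshold=0.5):
    """Keep only customers whose predicted purchase probability is high enough."""
    return [name for name, segment in customers if rates.get(segment, 0.0) >= threshold]


# Hypothetical trial-period results: (segment, bought the product?).
trial = [
    ("student", True), ("student", True), ("student", False),
    ("retired", False), ("retired", False), ("retired", True),
]
rates = purchase_rates(trial)

# Remaining customers who have not yet been offered the product.
customers = [("Ann", "student"), ("Bob", "retired"), ("Cy", "student")]
targets = select_targets(customers, rates)
```

The per-segment rates also show the side effect mentioned above: they indicate which customer
attributes are associated with buying the product.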
A Producer Wants to Know...
1. Which are our lowest/highest margin customers?
2. Who are my customers and what products are they buying?
3. Which customers are most likely to go to the competitors?
4. What impacts will new products/ services have on revenues and margins?
5. What product promotions have biggest impact on revenues?
6. What is the most effective distribution channel?
What is a Data Warehouse?
“A single, complete and consistent store of data obtained from a variety of different
sources made available to end users in a way they can understand and use in a business
context.” - Barry Devlin
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process.”
o The process of constructing and using data warehouses.
Characteristics of Data Warehousing
1. Subject Oriented
Organized around major subjects, such as customer, product, sales.
Focusing on the modeling and analysis of data for decision makers, not on daily operations or
transaction processing.
Provide a simple and concise view around particular subject issues by excluding data that are
not useful in the decision support process.
2. Integrated
Constructed by integrating multiple, heterogeneous data sources.
o Relational databases, flat files, on-line transaction records.
Data cleaning and data integration techniques are applied.
o Ensure consistency in naming conventions, encoding structures, attribute measures, etc.
among different data sources.
E.g., Hotel price: currency, tax, breakfast covered, etc.
o When data is moved to the warehouse, it is converted.
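The hotel price example can be made concrete with a small sketch. It converts records that use different conventions into one warehouse format (price in EUR, tax included, breakfast excluded). The exchange rate, tax rate and breakfast price below are invented for illustration only, as are the field names.

```python
# Illustrative constants only; a real warehouse would read these from reference data.
RATES_TO_EUR = {"EUR": 1.0, "USD": 0.8}
TAX_RATE = 0.25
BREAKFAST_EUR = 10.0

def normalize(record):
    """Convert one source record to: price in EUR, tax included, breakfast excluded."""
    price = record["price"] * RATES_TO_EUR[record["currency"]]
    if not record["tax_included"]:
        price *= 1 + TAX_RATE          # add tax when the source quotes net prices
    if record["breakfast_included"]:
        price -= BREAKFAST_EUR         # strip breakfast so prices are comparable
    return {"hotel": record["hotel"], "price_eur": round(price, 2)}

# Two sources quoting the same kind of room under different conventions.
source_a = {"hotel": "A", "price": 100.0, "currency": "USD",
            "tax_included": False, "breakfast_included": True}
source_b = {"hotel": "B", "price": 90.0, "currency": "EUR",
            "tax_included": True, "breakfast_included": False}

normalized = [normalize(source_a), normalize(source_b)]
for row in normalized:
    print(row)
```

The raw figures (100 and 90) look different, but under these assumed conventions both rooms cost the same once converted, which is why conversion happens when data is moved into the warehouse.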
3. Time Variant
The time horizon for the data warehouse is significantly longer than that of operational
systems.
o Operational database: current value data.
o Data warehouse data: provide information from a historical perspective (e.g., past 5-10
years).
Every key structure in the data warehouse contains an element of time, explicitly or
implicitly, but the key of operational data may or may not contain a “time element”.
4. Nonvolatile
Once data has entered the warehouse, it is not changed or updated.
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
o Does not require transaction processing, recovery, and concurrency control mechanisms.
o Requires only two operations in data accessing: Initial loading of data and access of data.
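The time-variant and nonvolatile properties can be illustrated with a toy table that only supports loading and reading, and whose key always contains a date. The class name `WarehouseTable` and its methods are hypothetical, chosen for this sketch; they do not refer to any particular product.

```python
from datetime import date

class WarehouseTable:
    """Append-only table whose key always contains a time element."""

    def __init__(self):
        self._rows = {}  # (entity_id, as_of_date) -> value

    def load(self, entity_id, as_of, value):
        """Initial loading: add a new row; existing rows are never overwritten."""
        key = (entity_id, as_of)
        if key in self._rows:
            raise ValueError("warehouse rows are not updated in place")
        self._rows[key] = value

    def value_as_of(self, entity_id, when):
        """Access: latest value loaded for entity_id on or before `when`."""
        dates = [d for (e, d) in self._rows if e == entity_id and d <= when]
        if not dates:
            return None
        return self._rows[(entity_id, max(dates))]

    def history(self, entity_id):
        """Access: all values for entity_id in date order."""
        return sorted((d, v) for (e, d), v in self._rows.items() if e == entity_id)

table = WarehouseTable()
table.load("cust-1", date(2013, 1, 1), "Pune")
table.load("cust-1", date(2014, 6, 1), "Mumbai")

print(table.value_as_of("cust-1", date(2014, 1, 1)))
print(table.history("cust-1"))
```

An operational system would simply overwrite the customer's city; here a change arrives as a new dated row, so earlier states remain available for historical analysis.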
Purpose of Data Warehousing (Why Data Warehousing)
As part of a company's business intelligence solution, a data warehouse is integral to the
gathering, processing and use of all the information a business receives daily. A strong
business intelligence plan, coupled with a robust data warehouse, helps give a business
the tools it needs to make the right decisions for today and for the future.
The term "Business Intelligence" describes the process a business uses to gather all its raw
data from multiple sources and process it into practical information that it applies to
determine the effectiveness of business processes, create policy, forecast trends, analyze the
market and much more. Data warehousing is an integral part of any effective business
intelligence endeavour. Data warehousing is more than just a database-like method of storing
information. While a database simply holds data, a well-designed data warehousing system
comprises three segments:
a. Staging: raw data is stored and manipulated by developers. The goal of developers in
this stage is to take raw information from widely disparate sources, standardize and
organize it, readying it for integration.
b. Integration: raw data is further categorized and stored logically according to the
needs of the end user, allowing easier access.
c. Access: data is presented to users in a coherent way that is easy to understand and
use. Clients employ computer applications to both access and analyze the information
the data warehousing system provides.
While many companies are on board with data warehousing and storage and use business
intelligence systems daily, others find the concept of a data warehouse and its benefits hard
to grasp.
Here is a look at some of the pros of employing a data warehousing solution:
1. Improved user access: a standard database can be read and manipulated by programs like
SQL Query Studio or the Oracle client, but there is considerable ramp-up time before end users
can use these apps effectively to get what they need. Business intelligence and data warehouse
end-user access tools are built specifically for the purposes data warehouses are used for:
analysis, benchmarking, prediction and more.
2. Better consistency of data: developers work with data warehousing systems after data has
been received so that all the information contained in the data warehouse is standardized.
Only uniform data can be used efficiently for successful comparisons. Few other solutions
match a data warehouse's level of consistency.
3. Integration of multiple data sources:
o A data warehouse has the ability to receive data from many different sources, meaning
any system in a business can contribute its data.
o Let's face it: different business segments use different applications. A proper data
warehouse solution can receive data from all of them and give a business the "big
picture" view that is needed to analyze the business, make plans, track competitors and
more.
4. Advanced query processing: in most businesses, even the best database systems are bound
to either a single server or a handful of servers in a cluster. A data warehouse is typically a
purpose-built platform optimized for analytical workloads rather than for transaction
processing. As a result, it can often process large analytical queries faster than an
operational database, which improves efficiency and productivity.
5. Retention of data history: end-user applications typically don't have the ability, not to
mention the space, to maintain much transaction history and keep track of multiple changes
to data. Data warehousing solutions can track alterations to data, providing a reliable
history of changes, additions and deletions. This history helps protect the integrity of the
data.
6. Disaster recovery implications: a data warehouse system offers a great deal of security
when it comes to disaster recovery. Since data from disparate systems is all sent to a data
warehouse, that data warehouse essentially acts as another information backup source.
Considering the data warehouse will also be backed up, that's now four places where the
same information will be stored: the original source, its backup, the data warehouse and its
subsequent backup. This redundancy provides strong protection against data loss.
Data Warehouse Tools and Utilities Functions
The following are the functions of Data Warehouse tools and Utilities:
1. Data Extraction
Data Extraction involves gathering the data from multiple heterogeneous sources.
2. Data Cleaning
Data Cleaning involves finding and correcting the errors in data.
3. Data Transformation
Data Transformation involves converting data from legacy format to warehouse format.
4. Data Loading
Data Loading involves sorting, summarizing, consolidating, checking integrity and building
indices and partitions.
5. Refreshing
Refreshing involves propagating updates from the data sources to the warehouse.
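The functions listed above can be sketched as a tiny pipeline. Everything here is an assumption made for illustration: the two sources (a legacy CSV export and a list of online rows), the field names, the legacy date format, and the choice to skip rows with a missing amount. Refreshing is not shown; it would amount to re-running the pipeline on new source data.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical sources: a legacy CSV export and rows from an online system.
LEGACY_CSV = """product,sold_on,amount
tea ,24/12/2014,10.5
coffee,23/12/2014,
tea,23/12/2014,4.5
"""
ONLINE_ROWS = [{"product": "Coffee", "sold_on": "22/12/2014", "amount": "7.0"}]

def extract():
    """Data extraction: gather rows from multiple heterogeneous sources."""
    return list(csv.DictReader(io.StringIO(LEGACY_CSV))) + ONLINE_ROWS

def clean(rows):
    """Data cleaning: trim whitespace, unify case, skip rows with no amount."""
    cleaned = []
    for r in rows:
        if not r["amount"].strip():
            continue  # a real system might repair or log such rows instead
        cleaned.append({"product": r["product"].strip().lower(),
                        "sold_on": r["sold_on"].strip(),
                        "amount": r["amount"].strip()})
    return cleaned

def transform(rows):
    """Data transformation: legacy strings become typed warehouse values."""
    return [{"product": r["product"],
             "sold_on": datetime.strptime(r["sold_on"], "%d/%m/%Y").date(),
             "amount": float(r["amount"])} for r in rows]

def load(rows):
    """Data loading: sort, summarize and build a simple index by product."""
    facts = sorted(rows, key=lambda r: (r["sold_on"], r["product"]))
    totals = defaultdict(float)
    index = defaultdict(list)
    for position, r in enumerate(facts):
        totals[r["product"]] += r["amount"]
        index[r["product"]].append(position)
    return {"facts": facts, "totals": dict(totals), "index": dict(index)}

warehouse = load(transform(clean(extract())))
print(warehouse["totals"])
print(warehouse["index"])
```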
DATA WAREHOUSING APPLICATIONS
In computing, a data warehouse is a system used for reporting and analysis.
These systems store historical as well as current data, which is used to produce trend
reports for senior management, such as annual and quarterly comparisons.
A data warehouse brings all the data together in a central location. The data stored in the
warehouse is uploaded from the operational systems and passes through various operations on
the way. The data warehouse environment consists of the source systems that supply the
warehouse with data and of the data integration technologies that make the data ready to use.
Various architectures, tools and applications are involved in storing data in the warehouse.
A data warehouse is often hosted on a central server such as a mainframe. The data is
extracted and organized there to serve user queries. This gives the advantage of gathering
information and data from diverse sources for easy access and analysis. The applications of
data warehousing are data mining, web mining and decision support systems.
1. Data mining
Data mining is the analysis of data to find new relationships between various types of data.
It is done by sorting and analysing the data to recognize patterns and relationships.
Association patterns are found by relating one event to another. A sequence or path pattern
is identified when one event leads to the occurrence of another. The patterns found are
classified and the data is organized accordingly. Newly discovered patterns are then used for
predictive analysis. Data mining techniques are also used in research areas.
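Association, relating one event to another, is commonly measured with support and confidence. The sketch below computes both for a rule "customers who buy A also buy B" over a handful of invented shopping baskets; the item names are placeholders.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent. Returns 0.0 if the antecedent never occurs."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base

# Invented baskets, one set of items per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]

print("support(bread, butter):", support(baskets, {"bread", "butter"}))
print("confidence(bread -> butter):", confidence(baskets, {"bread"}, {"butter"}))
print("confidence(butter -> bread):", confidence(baskets, {"butter"}, {"bread"}))
```

Note that the rule is not symmetric: in this data every butter buyer also bought bread, but not every bread buyer bought butter.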
2. Web mining
Web mining is becoming important in the field of customer relationship management. It
integrates information gathered by data mining methodologies with information gathered
from across the World Wide Web. In customer relationship management it is used to observe
customer behaviour and needs more closely, which helps a business succeed in the market.
Data mining parameters such as classification, association and clustering are used to
evaluate the data.
3. Decision support system
A decision support system is an application of data warehousing that analyses business data
and presents the results in a way that makes business decisions easier for business users.
It is considered an informational application. A basic purpose is to compare the sales
figures of various weeks. Revenue figures are also forecast based on product sales. Past
experience and past sales are taken into account to support the right decisions. The
information is usually presented graphically. A decision support system may also include an
artificial intelligence component.
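The two basic tasks mentioned here, comparing weekly sales and forecasting revenue from sales, can be sketched as follows. The weekly figures and the unit price are invented, and a moving average is only one simple forecasting choice among many.

```python
def week_over_week_change(sales):
    """Relative change between each week and the one before it."""
    return [(current - previous) / previous
            for previous, current in zip(sales, sales[1:])]

def moving_average_forecast(sales, window=3):
    """Forecast next week's sales as the mean of the last `window` weeks."""
    recent = sales[-window:]
    return sum(recent) / len(recent)

# Invented example: units sold per week and an assumed unit price.
weekly_units = [100.0, 110.0, 121.0, 132.0]
unit_price = 2.5

changes = week_over_week_change(weekly_units)
forecast_units = moving_average_forecast(weekly_units, window=3)
forecast_revenue = forecast_units * unit_price

print("week-over-week changes:", [round(c, 3) for c in changes])
print("forecast units next week:", forecast_units)
print("forecast revenue next week:", forecast_revenue)
```

A real decision support system would present such figures as charts and would normally use a forecasting method that accounts for trend and seasonality.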
From the points above, it is clear that data warehousing has many applications, which are
used in almost every field.
Advantages and disadvantages of data warehouses
Data warehouses are the traditional solution for data integration, and for good reason, but
it is becoming increasingly difficult to scale them and to copy data from multiple data
sources in multiple organizations in multiple locations.
1. A Data Warehouse Delivers Enhanced Business Intelligence
By providing data from various sources, managers and executives will no longer need to
make business decisions based on limited data or their gut. In addition, “data warehouses and
related BI can be applied directly to business processes including marketing segmentation,
inventory management, financial management, and sales.”
2. A Data Warehouse Saves Time
Since business users can quickly access critical data from a number of sources (all in one
place) they can rapidly make informed decisions on key initiatives. They won’t waste
precious time retrieving data from multiple sources.
Not only that, but business executives can query the data themselves with little or no
support from IT, saving more time and more money. That means the business users won’t have
to wait until IT gets around to generating the reports, and those hardworking folks in IT can
do what they do best: keeping the business running.
3. A Data Warehouse Enhances Data Quality and Consistency
A data warehouse implementation includes the conversion of data from numerous source
systems into a common format.
Since the data from the various departments is standardized, each department will produce
results that are in line with all the other departments. So you can have more confidence in the
accuracy of your data. And accurate data is the basis for strong business decisions.
4. A Data Warehouse Provides Historical Intelligence
A data warehouse stores large amounts of historical data so you can analyze different time
periods and trends in order to make future predictions. Such data typically cannot be stored in
a transactional database or used to generate reports from a transactional system.
5. A Data Warehouse Generates a High ROI
Finally, the pièce de résistance: return on investment. Companies that have implemented
data warehouses and complementary BI systems are reported to have generated more revenue
and saved more money than companies that haven’t invested in BI systems and data warehouses.
The Disadvantages of a Data Warehouse
1. Extra Reporting Work
Depending on the size of the organization, a data warehouse can create extra work for
departments. Each type of data that's needed in the warehouse typically has to be generated
by the IT teams in each division of the business. This can be as simple as duplicating data
from an existing database, but at other times, it involves gathering data from customers or
employees that wasn't gathered before.
2. Cost/Benefit Ratio
A commonly cited disadvantage of data warehousing is its cost/benefit ratio. A data
warehouse is a big IT project, and like many big IT projects, it can consume a lot of IT
man-hours and budget to generate a tool that doesn't get used often enough to justify the
implementation expense.
This is completely sidestepping the issue of the expense of maintaining the data warehouse
and updating it as the business grows and adapts to the market.
3. Data Ownership Concerns
Data warehouses are often, but not always, Software as a Service implementations, or cloud
services applications. Your data security in this environment is only as good as your cloud
vendor. Even if implemented locally, there are concerns about data access throughout the
company. Make sure that the people doing the analysis are individuals that your organization
trusts, especially with customers' personal data. A data warehouse that leaks customer data is
a privacy and public relations nightmare.
4. Data Flexibility
Data warehouses tend to have static data sets with minimal ability to "drill down" to specific
solutions. The data is imported and filtered through a schema, and it is often days or weeks
old by the time it's actually used. In addition, data warehouses are usually subject to ad hoc
queries and are thus notoriously difficult to tune for processing speed and query speed. While
the queries are often ad hoc, the queries are limited by what data relations were set when the
aggregation was assembled.
Top 10 Challenges in Building a Data Warehouse for Large Banks
1) Lack of strategic focus to build Enterprise Data Warehouse (EDW)
Building an EDW is a strategic initiative since it requires a shift in culture and a longer
timescale and, more importantly, it is an expensive affair. Hence, it should be one of the top
agenda items of the CXOs, who need to monitor progress closely and provide executive
support to break down any unwanted barriers.
2) Need of considerable Time, Effort & Cost
The typical time taken for a global bank to build an EDW varies from a couple of years to 5
years. It also requires substantial effort and, eventually, a huge amount of money to build a
data warehouse. Also, evidence of successful ROI is very opaque in existing data warehouse
implementations.
3) Lack of cross divisional collaboration
Building an EDW requires constructive collaboration from various teams, such as multiple
business divisions, source system teams, architecture & design teams, project teams and
vendor teams.
4) Technological complexity
Mostly, source data is kept in multiple operating systems & multiple database technologies.
There are plenty of tools for data sourcing, data quality management, data integration, data
warehousing, reporting & analytics.
Choosing the appropriate technology is not simple and is complicated by various emerging
techniques like data virtualization, self-service BI, in-database analytics, columnar
databases, NoSQL databases, massively parallel processing, in-memory computing, etc.
Also, the traditional data warehouse is required to be integrated with big data technologies &
the Internet of Things for gaining business insights.
5) Ill-defined, changing business data requirements & Insensitivity of technical team in
understanding business requirements
Most of the time the business finds it difficult to define the data requirements, since data
requirements keep evolving as the use of data increases. However, the technical team wants
finalised data requirements from the business before designing & building a data warehouse.
6) Lack of clarity on true source of data
Most of the large banks have a great legacy behind them and have been growing over decades
through mergers & acquisitions. They have a widespread footprint across geographies and
various customer segments. In this process, they have acquired many systems which are
poorly integrated and poorly documented, and data is scattered across multiple systems. It is
a nightmare for these banks to identify the true source of their data.
7) Lack of ability to manage data quality issues
Since data is an organisational asset it needs to be acquired & maintained well.
Many front office/customer facing systems don’t capture quality data at its origination. There
is no unified data capturing process across the organisation.
For example, the last name of a personal customer may not have been captured in a front office
system, since it is not a mandatory field there, whereas it may be a mandatory field for another
system.
Sometimes there is a lack of well-defined processes & technologies to curtail data quality
issues.
8) Vested interest of vendors in promoting their own solution
Most of the top data warehousing vendors have their own suite of solutions/products across the
entire data warehousing ecosystem. These vendors tend to promote their own solution
rather than advocating what is best suited for the customer.
9) Comfort of using divisional data marts
Reporting is an indispensable activity in banking. Many banks have built divisional data
marts to fulfil their own divisional needs. Though divisional marts do not provide an
enterprise-wide view, many business users are comfortable using a divisional data mart,
assuming that “Known devil is better than unknown angel”.
10) Under-utilisation of the data warehouse
Business users from various divisions need to use the data warehouse for reporting, business
intelligence, data analytics & advanced analytics to unleash the full potential of the
enterprise data asset. An under-utilised data warehouse will not grow & will not yield the
desired return on investment (ROI).
Data Warehousing Solution for One of Europe's Largest Financial Services Groups
o The client sought a business intelligence solution to consolidate the mortgage administration
processes, provide better sales cycle management, mortgage product performance analysis,
financial forecasting based on sales demands, fraud detection and general mortgage
operational reporting. Infosys delivered a highly scalable solution.
o The client is one of Europe's largest financial services groups in corporate and commercial
banking, retail banking, credit cards and general insurance. The company sells mortgages to
corporate and retail customers through various channels. These mortgage systems run on
different technology platforms and follow different business processes.
o Consolidate the mortgage administration processes for all brands and BI for different brands.
o Support better sales cycle management, mortgage product performance analysis, financial
forecasting based on sales demands, fraud detection and general mortgage operational
reporting.
o The biggest challenge was to provide a scalable architecture for consolidating a huge amount
of data.
o Infosys designed and implemented a data warehouse solution to extract information from the
Mortgage Sales Application and administration systems of different brands and house them
in a single data warehouse database. This resulted in a highly scalable solution that met the
following requirements:
Transaction volume expected: 73 Million per year; annual growth rate of 110%
Size expected: 180 GB at the end of Year 1; annual growth rate of 45%
o Infosys followed an iterative phased approach to implement the solution that included the
following:
Business requirements analysis
Data warehouse dimensional modeling
ETL (Extract, Transform and Load) and business intelligence reporting development and
o Highly scalable solution to meet the following requirements:
Transaction volume expected: 73 Million per year; annual growth rate of 110%
Size expected: 180 GB at the end of Year 1; annual growth rate of 45%
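To get a feel for what these requirements imply, the figures can be projected forward. The sketch assumes compound growth and reads "annual growth rate of 110%" as an increase of 110% over the previous year; the source does not say which interpretation was intended, so the numbers are indicative only.

```python
def project(first_year_value, annual_growth, years):
    """Values for year 1..years under compound growth.

    annual_growth is a fraction: 0.45 means +45% per year (an assumption
    about how the stated growth rates are meant to be read).
    """
    return [first_year_value * (1 + annual_growth) ** n for n in range(years)]

size_gb = project(180, 0.45, 3)          # 180 GB at the end of Year 1
transactions_m = project(73, 1.10, 3)    # 73 million transactions in Year 1

for year, (gb, tx) in enumerate(zip(size_gb, transactions_m), start=1):
    print(f"Year {year}: about {gb:.0f} GB, about {tx:.0f} million transactions")
```

Under these assumptions the database roughly doubles in size within two years while the transaction volume more than quadruples, which is why scalability was the central design requirement.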
Integrating Data Mining System with a Database or Data Warehouse System
The data mining system needs to be integrated with a database or data warehouse system. If
the data mining system is not integrated with any database or data warehouse system, it has
no system to communicate with. This scheme is known as the no-coupling scheme. In this
scheme the main focus is on data mining design and on developing efficient and effective
algorithms for mining the available data sets.
Here is the list of Integration Schemes:
1) No Coupling
In this scheme the data mining system does not utilize any of the database or data warehouse
functions. It fetches the data from a particular source, processes it using data mining
algorithms and stores the mining result in another file.
2) Loose Coupling
In this scheme the data mining system may use some of the functions of the database and data
warehouse system. It fetches the data from a data repository managed by these systems
and performs data mining on that data. It then stores the mining result either in a file or in a
designated place in a database or data warehouse.
3) Semi-tight Coupling
In this scheme the data mining system is linked with a database or data warehouse system
and, in addition, efficient implementations of a few data mining primitives can be provided
in the database or data warehouse system.
4) Tight coupling
In this coupling scheme the data mining system is smoothly integrated into the database or
data warehouse system. The data mining subsystem is treated as one functional component of an
information system.
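The difference between the schemes can be illustrated with Python's built-in `sqlite3` module. In the sketch below, the first part follows the loose coupling pattern: the database only stores and returns rows, the "mining" (here just a frequency count, as a placeholder for a real algorithm) runs in application code, and the result is written back to a designated table. The last query hints at tighter coupling by letting the database compute the same result itself. Table and column names are invented for the example.

```python
import sqlite3
from collections import Counter

# An in-memory database standing in for the database or data warehouse system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("a", "tea"), ("b", "tea"), ("c", "coffee"), ("a", "milk")],
)

# Loose coupling: fetch data through the database, mine it outside the database.
rows = conn.execute("SELECT product FROM purchases").fetchall()
counts = Counter(product for (product,) in rows)

# Store the mining result in a designated place in the database.
conn.execute("CREATE TABLE mining_result (product TEXT, purchase_count INTEGER)")
conn.executemany("INSERT INTO mining_result VALUES (?, ?)", counts.items())
conn.commit()
stored = dict(conn.execute("SELECT product, purchase_count FROM mining_result"))

# Towards tighter coupling: the database itself performs the aggregation.
in_database = dict(
    conn.execute("SELECT product, COUNT(*) FROM purchases GROUP BY product")
)

print(stored)
print(in_database)
```

With no coupling, the same count would be computed from a flat file and written to another file, without involving the database at all.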
DEPARTMENT OF BUSINESS AND INDUSTRIAL
TERM ASSIGNMENT 2014-15
INFORMATION TECHNOLOGY FOR BUSINESS
TOPIC: DATA MINING AND DATA WAREHOUSING
GROUP NUMBER: 9
16: CHAWLA DIVYA
23: GANDHI SANI
43: LADHNI ROMA
SUBMITTED ON: 24TH DECEMBER, 2014
SUBMITTED TO – DR. JAYDEEPCHAUDRY