DATA WAREHOUSING
AND
DATA MINING
UNIT – 2
Prepared by
Mr. P. Nandakumar
Assistant Professor,
Department of IT, SVCET
Data Pre-processing
It refers to the cleaning, transforming, and integrating
of data in order to make it ready for analysis.
The goal of data preprocessing is to improve the
quality of the data and to make it more suitable for the
specific data mining task.
Data preprocessing is a data mining technique used to
transform raw data into a useful and efficient format.
Data Pre-processing
Some common steps in data preprocessing
include:
• Data cleaning: this step involves identifying and
handling missing, inconsistent, or irrelevant data. This
can include removing duplicate records, filling in
missing values, and handling outliers.
• Data integration: this step involves combining data
from multiple sources, such as databases,
spreadsheets, and text files. The goal of integration is
to create a single, consistent view of the data.
Data Pre-processing
Some common steps in data preprocessing include:
• Data transformation: this step involves converting the
data into a format that is more suitable for the data
mining task. This can include normalizing numerical
data, creating dummy variables, and encoding
categorical data.
• Data reduction: this step is used to select a subset of
the data that is relevant to the data mining task. This can
include feature selection (selecting a subset of the
variables) or feature extraction (extracting new variables
from the data).
• Data discretization: this step is used to convert
continuous attributes into discrete intervals or categories (bins).
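These steps can be sketched end-to-end in plain Python; the records, fill strategy, and age bands below are hypothetical illustrations, not data from the slides:

```python
from statistics import mean

# Hypothetical raw records: (age, income); None marks a missing value.
raw = [(25, 50_000), (25, 50_000), (None, 62_000), (47, 58_000)]

# Data cleaning: remove duplicate records, then fill missing ages with the mean.
deduped = list(dict.fromkeys(raw))
known_ages = [age for age, _ in deduped if age is not None]
fill = mean(known_ages)
cleaned = [(age if age is not None else fill, inc) for age, inc in deduped]

# Data transformation: min-max normalize income into [0.0, 1.0].
incomes = [inc for _, inc in cleaned]
lo, hi = min(incomes), max(incomes)
normalized = [(age, (inc - lo) / (hi - lo)) for age, inc in cleaned]

# Data discretization: convert the continuous age into interval levels.
def age_band(age):
    return "young" if age < 30 else "middle" if age < 50 else "senior"

final = [(age_band(age), inc) for age, inc in normalized]
```
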
Major Tasks in Data Preprocessing
Stages of Data Processing
Data Pre-processing
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part,
data cleaning is done. It involves handling of missing data, noisy data etc.
(a). Missing Data:
This situation arises when some values are missing in the data. It can
be handled in various ways.
Some of them are:
 Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
 Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing
values manually, by attribute mean or the most probable value.
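These strategies for missing values can be sketched as follows (the `ages` list and its missing entries are hypothetical):

```python
from statistics import mean, mode

# Hypothetical tuples; None marks missing entries in the "age" attribute.
ages = [21, None, 34, 21, None, 40, 21]

# Strategy 1: ignore the tuples that contain a missing value.
ignored = [a for a in ages if a is not None]

# Strategy 2: fill missing values with the attribute mean.
m = mean(ignored)
filled_mean = [a if a is not None else m for a in ages]

# Strategy 3: fill missing values with the most probable (most frequent) value.
most_probable = mode(ignored)
filled_mode = [a if a is not None else most_probable for a in ages]
```
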
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It
can be generated by faulty data collection, data entry errors, etc. It can be
handled in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size, and then various methods are performed
to complete the task. Each segment is handled separately. One can
replace all data in a segment by its mean, or boundary values can be used to
complete the task.
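A minimal sketch of the binning method, assuming equal-size bins of three values; both smoothing by bin means and by bin boundaries are shown:

```python
# Sorted hypothetical attribute values, split into equal-size bins of 3.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 28])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

# Smoothing by bin means: replace every value with its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value with the closer boundary.
def to_boundary(b):
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

by_boundaries = [to_boundary(b) for b in bins]
```
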
Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
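A small sketch of regression-based smoothing with one independent variable; the noisy readings below are made up for illustration:

```python
# Hypothetical noisy readings: y roughly follows 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

# Fit a simple linear regression y = a*x + b by least squares.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# Smoothed data: replace each noisy y with the value predicted by the fit.
smoothed = [a * x + b for x in xs]
```
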
Clustering:
This approach groups similar data into clusters. Values that fall outside the
clusters may be considered outliers.
Methods of Data Cleaning
Usage of Data Cleaning in Data Mining
Characteristics of Data Cleaning
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for
the mining process. It involves the following ways:
Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or
0.0 to 1.0).
Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes
to help the mining process.
Discretization:
This is done to replace the raw values of a numeric attribute by interval levels
or conceptual levels.
Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy.
For example, the attribute “city” can be converted to “country”.
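Two of these transformations, min-max normalization into [0.0, 1.0] and concept hierarchy generation, can be sketched as follows (the values and the city-to-country table are assumed for illustration):

```python
# Min-max normalization of an attribute into the range [0.0, 1.0].
values = [200, 300, 400, 600, 1000]
lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]

# Concept hierarchy generation: climb the attribute "city" up to "country"
# using a hand-built (assumed) hierarchy table.
hierarchy = {"Chennai": "India", "Mumbai": "India", "Paris": "France"}
cities = ["Chennai", "Paris", "Mumbai"]
countries = [hierarchy[c] for c in cities]
```
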
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and
analysis becomes harder when working with such volumes. To address this,
we use data reduction techniques, which aim to increase storage efficiency
and reduce data storage and analysis costs.
The various steps to data reduction are:
Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data
cube.
Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded.
For attribute selection, one can use the level of significance and the p-value
of each attribute: an attribute whose p-value is greater than the significance
level can be discarded.
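A minimal sketch of this p-value filter; the attribute names and p-values below are hypothetical, and the significance test that produced them is assumed to have been run beforehand:

```python
# Hypothetical p-values for each attribute, from some prior significance test.
p_values = {"age": 0.01, "income": 0.03, "zip_code": 0.42, "favourite_color": 0.77}
alpha = 0.05  # significance level

# Keep attributes whose p-value is at or below the significance level;
# discard the rest.
selected = [attr for attr, p in p_values.items() if p <= alpha]
```
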
Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for
example regression models.
Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from the
compressed data, the reduction is called lossless; otherwise, it is called lossy.
Two effective methods of dimensionality reduction are:
Wavelet transforms and PCA (Principal Component Analysis)
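A small worked sketch of PCA for the two-dimensional case, reducing each point to one dimension by projecting onto the leading eigenvector of the covariance matrix (the data points are illustrative):

```python
import math

# Tiny 2-D dataset; the two attributes are strongly correlated, so one
# principal component captures most of the variance.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
          (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1),
          (1.5, 1.6), (1.1, 0.9)]

# Center the data.
n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
centered = [(x - mx, y - my) for x, y in points]

# 2x2 covariance matrix.
cxx = sum(x * x for x, _ in centered) / (n - 1)
cyy = sum(y * y for _, y in centered) / (n - 1)
cxy = sum(x * y for x, y in centered) / (n - 1)

# Leading eigenvalue and unit eigenvector of [[cxx, cxy], [cxy, cyy]].
lam = (cxx + cyy + math.sqrt((cxx - cyy) ** 2 + 4 * cxy ** 2)) / 2
vx, vy = cxy, lam - cxx
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# Project each point onto the first principal component: 2-D -> 1-D.
reduced = [x * vx + y * vy for x, y in centered]
```

The variance of the projected values equals the leading eigenvalue, which is why projecting onto this direction preserves the most information.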
Tools for Data Cleaning in Data Mining
 OpenRefine
 Trifacta Wrangler
 Drake
 Data Ladder
 Data Cleaner
 Cloudingo
 Reifier
 IBM Infosphere Quality Stage
 TIBCO Clarity
 Winpure
Benefits of Data Cleaning
Having clean data will ultimately increase overall productivity and allow for
the highest-quality information in your decision-making.
 Removal of errors when multiple sources of data are at play.
 Fewer errors make for happier clients and less-frustrated employees.
 Ability to map the different functions and what your data is intended to do.
 Monitoring errors and better reporting to see where errors are coming from,
making it easier to fix incorrect or corrupt data for future applications.
 Using tools for data cleaning will make for more efficient business practices
and quicker decision-making.
Association Rule Mining
 Association rule mining is a popular and well researched method for
discovering interesting relations between variables in large databases.
 It is intended to identify strong rules discovered in databases using different
measures of interestingness.
 Based on the concept of strong rules, Rakesh Agrawal et al. introduced
association rules.
 Association rule mining finds interesting associations and relationships among
large sets of data items.
 This rule shows how frequently an itemset occurs in a transaction.
 A typical example is Market Basket Analysis.
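The support and confidence of a rule can be computed directly from a set of transactions; the baskets below are hypothetical:

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule {diapers} => {beer}: how often the rule holds.
sup = support({"diapers", "beer"})
conf = support({"diapers", "beer"}) / support({"diapers"})
```
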
Types of Association Rules in Data
Mining
1. Multi-relational association rules: Multi-Relation Association Rules (MRAR) are a
class of association rules, usually extracted from multi-relational databases, in
which each rule element consists of one entity but involves several relationships.
2. Generalized association rules: Generalized association rule extraction is a powerful
tool for getting a rough idea of interesting patterns hidden in data.
3. Quantitative association rules: Quantitative association rules are a special type of
association rule. Unlike general association rules, where both the left and right sides
of the rule should be categorical (nominal or discrete) attributes, at least one attribute
(left or right) of a quantitative association rule must contain a numeric attribute.
Uses of Association Rules
1. Medical Diagnosis: Association rules in medical diagnosis
can be used to help doctors cure patients.
2. Market Basket Analysis: It is one of the most popular
examples and uses of association rule mining.
Association Rule Mining – Market Basket
Analysis
 Market Basket Analysis is one of the key techniques used by large retailers to
uncover associations between items.
 It allows retailers to identify relationships between the items that people buy
together frequently.
 Given a set of transactions, we can find rules that will predict the occurrence of
an item based on the occurrences of other items in the transaction.
 This process analyzes customer buying habits by finding associations between
the different items that customers place in their shopping baskets.
Frequent Pattern Mining
 Frequent pattern mining in data mining is the process of identifying patterns or
associations within a dataset that occur frequently.
 This is typically done by analyzing large datasets to find items or sets of items that
appear together frequently.
 There are several different algorithms used for frequent pattern mining, including
1. Apriori algorithm.
2. ECLAT algorithm. (Equivalence Class Clustering and Bottom-Up Lattice Traversal)
3. FP-growth algorithm. (Frequent Pattern)
 Frequent pattern mining has many applications, such as Market Basket Analysis,
Recommender Systems, Fraud Detection, and many more.
Frequent Pattern Mining
Frequent pattern mining can be classified in various ways, based on the following
criteria:
1. Based on the completeness of patterns to be mined.
2. Based on the levels of abstraction involved in the rule set.
3. Based on the number of data dimensions involved in the rule.
4. Based on the types of values handled in the rule.
5. Based on the kinds of rules to be mined.
6. Based on the kinds of patterns to be mined.
Frequent Pattern Mining
 Advantages:
1. It can find useful information which is not visible in simple data browsing.
2. It can find interesting association and correlation among data items.
 Disadvantages:
1. It can generate a large number of patterns.
2. With high dimensionality, the number of patterns can be very large,
making it difficult to interpret the results.
Apriori Algorithm
Finding Frequent Itemsets Using Candidate Generation:
The Apriori Algorithm
 Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994
for mining frequent itemsets for Boolean association rules.
 The name of the algorithm is based on the fact that the algorithm uses prior
knowledge of frequent itemset properties.
 Apriori employs an iterative approach known as a level-wise search, where
k-itemsets are used to explore (k+1)-itemsets.
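This level-wise search can be sketched in a few lines; the transactions and the minimum support count of 2 are assumed for illustration:

```python
from itertools import combinations

# Hypothetical transactions and a minimum support count.
transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C", "D"},
]
min_count = 2

def frequent_itemsets(transactions, min_count):
    # L1: frequent 1-itemsets.
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_count]
    result = list(current)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets, then prune any
        # candidate with an infrequent (k-1)-subset (the Apriori property).
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        candidates = [c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))]
        # Count support and keep candidates meeting the threshold.
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) >= min_count]
        result.extend(current)
        k += 1
    return result

freq = frequent_itemsets(transactions, min_count)
```
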
Approaches For Mining Multilevel Association
Rules
1. Uniform Minimum Support: When a uniform minimum support threshold is
used, the search procedure is simplified.
2. Reduced Minimum Support: Each level of abstraction has its own minimum
support threshold.
3. Group-Based Minimum Support: Because users or experts often have insight
as to which groups are more important than others, it is sometimes more
desirable to set up user-specific, item-based, or group-based minimum support
thresholds when mining multilevel rules.
From Association Mining to Correlation
Analysis
 A correlation measure can be used to augment the support-confidence
framework for association rules.
 This leads to correlation rules of the form:
A=>B [support, confidence, correlation]
 That is, a correlation rule is measured not only by its support and confidence
but also by the correlation between itemsets A and B.
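One simple correlation measure is lift, which compares the observed support of A and B together against what independence would predict; the transactions below are hypothetical:

```python
# Hypothetical transactions over items A, B, and C.
transactions = [
    {"A", "B"}, {"A", "B"}, {"A", "B"}, {"A", "B"},
    {"A"}, {"C"}, {"C"}, {"C"},
]
n = len(transactions)

def p(itemset):
    """Support of an itemset as a probability estimate."""
    return sum(itemset <= t for t in transactions) / n

# Lift: > 1 means A and B are positively correlated, < 1 negatively
# correlated, and == 1 means they are independent.
lift = p({"A", "B"}) / (p({"A"}) * p({"B"}))
```
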