20IT501 – Data Warehousing and
Data Mining
III Year / V Semester
UNIT III - ASSOCIATION RULE
MINING AND CLASSIFICATION
Mining Frequent Patterns, Associations and Correlations –
Mining Methods – Mining various Kinds of Association Rules
– Correlation Analysis – Constraint Based Association Mining
– Classification and Prediction - Basic Concepts - Decision
Tree Induction - Bayesian Classification – Rule Based
Classification – Classification by Back propagation – Support
Vector Machines - Other Classification Methods-Facsimile
extraction classification
Basic Concepts
Market Basket Analysis:
 Frequent itemset mining leads to the discovery of associations and
correlations among items in large transactional or relational data
sets.
 The discovery of interesting correlation relationships among huge
amounts of business transaction records can help in many business
decision-making processes, such as catalog design, cross-marketing,
and customer shopping behavior analysis.
Basic Concepts
 A typical example of frequent itemset mining is market
basket analysis.
 This process analyzes customer buying habits by finding
associations between the different items that customers
place in their “shopping baskets”.
 The discovery of such associations can help retailers
develop marketing strategies by gaining insight into which
items are frequently purchased together by customers.
Basic Concepts
 Each item has a Boolean variable representing the presence
or absence of that item.
 Each basket can then be represented by a Boolean vector of
values assigned to these variables.
 The Boolean vectors can be analyzed for buying patterns
that reflect items that are frequently associated or purchased
together. These patterns can be represented in the form of
association rules.
Basic Concepts
 The information that customers who purchase
computers also tend to buy antivirus software at the
same time is represented in Association Rule.
 computer ⇒ antivirus software [support = 2%,
confidence = 60%]
 Rule support and confidence are two measures of rule
interestingness.
Basic Concepts
 A support of 2% for Association Rule means that 2% of all
the transactions under analysis show that computer and
antivirus software are purchased together.
 A confidence of 60% means that 60% of the customers who
purchased a computer also bought the software.
 Association rules are considered interesting if they satisfy
both a minimum support threshold and a minimum
confidence threshold.
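As a quick illustration of these two measures, the following minimal Python sketch (the transactions are made-up example data, not from the slides) computes the support of an itemset and the confidence of a rule such as computer ⇒ antivirus software:

# Hypothetical transactions used only to illustrate support and confidence.
transactions = [
    {"computer", "antivirus software", "mouse"},
    {"computer", "printer"},
    {"computer", "antivirus software"},
    {"printer", "mouse"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # confidence(A => B) = P(B | A) = support(A u B) / support(A).
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"computer", "antivirus software"}, transactions))      # 0.5 (2 of 4)
print(confidence({"computer"}, {"antivirus software"}, transactions)) # 0.666... (2 of 3)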
Frequent Itemsets, Closed Itemsets,
and Association Rules
 A set of items is referred to as an itemset.
 An itemset that contains k items is a k-itemset.
The set {computer, antivirus software} is a 2-
itemset.
 The occurrence frequency of an itemset is the
number of transactions that contain the itemset.
Frequent Itemsets, Closed Itemsets,
and Association Rules
 In general, association rule mining can be viewed as a
two-step process:
 Find all frequent itemsets: By definition, each of these
itemsets will occur at least as frequently as a predetermined
minimum support count, min sup.
 Generate strong association rules from the frequent
itemsets: By definition, these rules must satisfy minimum
support and minimum confidence.
Frequent Pattern Mining
 Market basket analysis is just one form of
frequent pattern mining.
 There are many kinds of frequent patterns,
association rules, and correlation relationships.
Frequent Pattern Mining -
Classification
 Based on the completeness of patterns to be mined:
 Mine the complete set of frequent itemsets, the closed
frequent itemsets, and the maximal frequent itemsets, given
a minimum support threshold.
 Mine constrained frequent itemsets (i.e., those that satisfy
a set of user-defined constraints), approximate frequent
itemsets (i.e., those that derive only approximate support
counts for the mined frequent itemsets),
Frequent Pattern Mining -
Classification
Based on the completeness of patterns to be
mined:
 Mine near-match frequent itemsets (i.e., those that
tally the support count of the near or almost matching
itemsets),
 Mine top-k frequent itemsets (i.e., the k most frequent
itemsets for a user-specified value, k), and so on.
Frequent Pattern Mining -
Classification
 Based on the levels of abstraction involved in the rule
set:
 suppose that a set of association rules mined includes the
following rules where X is a variable representing a
customer:
buys(X, “computer”) ⇒ buys(X, “HP printer”)
 buys(X, “laptop computer”) ⇒ buys(X, “HP printer”)
computer is a higher-level abstraction of laptop computer
Frequent Pattern Mining -
Classification
Based on the number of data dimensions
involved in the rule:
 If the items or attributes in an association rule
reference only one dimension, then it is a single-
dimensional association rule.
 buys(X, “computer”) ⇒ buys(X, “HP printer”)
Frequent Pattern Mining -
Classification
Based on the number of data dimensions involved
in the rule:
 If a rule references two or more dimensions, such as
the dimensions age, income, and buys, then it is a
multidimensional association rule.
 age(X, “30...39”) ^ income(X, “42K...48K”) ⇒ buys(X,
“high resolution TV”).
Frequent Pattern Mining -
Classification
 Based on the types of values handled in the rule:
 If a rule involves associations between the presence or
absence of items, it is a Boolean association rule.
 If a rule describes associations between quantitative
items or attributes, then it is a quantitative association
rule. Note that the quantitative attributes, age and
income, have been discretized.
Frequent Pattern Mining -
Classification
 Based on the kinds of rules to be mined:
 Association rules are the most popular kind of rules
generated from frequent patterns.
 Based on the kinds of patterns to be mined:
 Frequent itemset mining, that is, the mining of
frequent itemsets (sets of items) from transactional or
relational data sets.
Frequent Pattern Mining -
Classification
 Based on the kinds of patterns to be mined:
 Frequent itemset mining, that is, the mining of
frequent itemsets (sets of items) from transactional or
relational data sets.
 Sequential pattern mining searches for frequent
subsequences in a sequence data set, where a sequence
records an ordering of events.
Frequent Itemset Mining Methods
 Apriori, the basic algorithm for finding
frequent itemsets.
 To generate strong association rules from
frequent itemsets
The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
 Apriori is a seminal algorithm proposed by R.
Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association
rules.
 The algorithm uses prior knowledge of
frequent itemset properties
The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
 To improve the efficiency of the level-wise
generation of frequent itemsets, an important
property called the Apriori property is used to
reduce the search space.
 Apriori property: All nonempty subsets of a
frequent itemset must also be frequent.
The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
 Step 1: In the first iteration of the algorithm, each item is a
member of the set of candidate 1-itemsets, C1. The
algorithm simply scans all of the transactions in order to
count the number of occurrences of each item.
 Step 2: Suppose that the minimum support count required
is 2, that is, min sup = 2. The set of frequent 1-itemsets, L1,
can then be determined. It consists of the candidate 1-
itemsets satisfying minimum support.
The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
TID List of item_IDs
T1 I1, I2, I5
T2 I2, I4
T3 I2, I3
T4 I1, I2, I4
T5 I1, I3
T6 I2, I3
T7 I1, I3
T8 I1, I2, I3, I5
T9 I1, I2, I3
The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
C1 (scan D for the count of each candidate):
Itemset   Support count
I1        6
I2        7
I3        6
I4        2
I5        2
L1 (compare each candidate support count with the minimum support count):
Itemset   Support count
I1        6
I2        7
I3        6
I4        2
I5        2
The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
 Step 3: To discover the set of frequent 2-itemsets, L2, the algorithm
uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2.
 Step 4: Next, the transactions in D are scanned and the support
count of each candidate itemset in C2 is accumulated.
 Step 5: The set of frequent 2-itemsets, L2, is then determined,
consisting of those candidate 2-itemsets in C2 having minimum
support
The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
C2 (generate C2 candidates from L1):
Itemset: {I1, I2}, {I1, I3}, {I1, I4}, {I1, I5}, {I2, I3},
{I2, I4}, {I2, I5}, {I3, I4}, {I3, I5}, {I4, I5}
C2 (scan D for the count of each candidate):
Itemset   Sup. count
I1, I2    4
I1, I3    4
I1, I4    1
I1, I5    2
I2, I3    4
I2, I4    2
I2, I5    2
I3, I4    0
I3, I5    1
I4, I5    0
L2 (compare each candidate support count with the minimum support count):
Itemset   Sup. count
I1, I2    4
I1, I3    4
I1, I5    2
I2, I3    4
I2, I4    2
I2, I5    2
The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
 Step 6: The generation of the set of candidate 3-
itemsets.
 Step 7: The transactions in D are scanned in order to
determine L3, consisting of those candidate 3-itemsets
in C3 having minimum support.
 Step 8: The algorithm uses L3 ⋈ L3 to generate a candidate set
of 4-itemsets, C4. The only candidate, {I1, I2, I3, I5}, is pruned
because its subset {I2, I3, I5} is not frequent, so C4 = ∅ and the
algorithm terminates.
The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
C3 (generate C3 candidates from L2):
Itemset: {I1, I2, I3}, {I1, I2, I5}
C3 (scan D for the count of each candidate):
Itemset      Support count
I1, I2, I3   2
I1, I2, I5   2
L3 (compare each candidate support count with the minimum support count):
Itemset      Support count
I1, I2, I3   2
I1, I2, I5   2
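The level-wise procedure in Steps 1 to 8 can be sketched compactly in Python; the code below is a minimal illustration of the join, prune, and scan steps on the nine-transaction data set above (min_sup = 2), not the textbook pseudocode itself:

from itertools import combinations

# Transaction database D from the example above; min_sup is an absolute count.
D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
     {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
min_sup = 2

def support_count(itemset):
    return sum(itemset <= t for t in D)

# L1: frequent 1-itemsets.
items = sorted({i for t in D for i in t})
frequent = {1: [frozenset([i]) for i in items if support_count(frozenset([i])) >= min_sup]}
k = 1
while frequent[k]:
    # Join step: union pairs of frequent k-itemsets that differ in exactly one item.
    candidates = {a | b for a in frequent[k] for b in frequent[k] if len(a | b) == k + 1}
    # Prune step (Apriori property): every k-subset of a candidate must be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[k] for s in combinations(c, k))}
    # Scan D once per level to keep only candidates meeting minimum support.
    frequent[k + 1] = [c for c in candidates if support_count(c) >= min_sup]
    k += 1

for level, itemsets in frequent.items():
    if itemsets:
        print(level, [sorted(s) for s in itemsets])   # L1 .. L3 as in the example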
Generating Association Rules from
Frequent Itemsets
 Once the frequent itemsets from transactions in a database
D have been found, it is straightforward to generate strong
association rules from them (where strong association rules
satisfy both minimum support and minimum confidence).
 Confidence
(A ⇒ B) = P(B | A) = support_count(A ∪ B) /
support_count(A).
Generating Association Rules from
Frequent Itemsets
Association rules can be generated as follows:
 For each frequent itemset l, generate all nonempty
subsets of l.
 For every nonempty subset s of l, output the rule
“s ⇒ (l - s)” if support_count(l) / support_count(s) ≥ min_conf,
where min_conf is the minimum confidence threshold.
Generating Association Rules from
Frequent Itemsets
 Suppose the data contain the frequent itemset l =
{I1, I2, I5}
 The nonempty subsets of l are {I1, I2}, {I1, I5},
{I2, I5}, {I1}, {I2}, and {I5}.
 The resulting association rules are as shown
below, each listed with its confidence:
 I1^I2 I5, confidence = 2 / 4 = 50%
Improving the Efficiency of Apriori
 Hash-based technique:
 A hash-based technique can be used to reduce the size
of the candidate k-itemsets, Ck, for k > 1.
 Generate all of the itemsets for each transaction, hash
(i.e., map) them into the different buckets of a hash
table structure, and increase the corresponding bucket
counts.
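A rough sketch of the idea, continuing with the D and min_sup defined in the Apriori sketch and using an assumed bucket count: every 2-itemset of every transaction is hashed while D is being scanned for 1-itemsets, and any 2-itemset whose bucket count stays below min_sup can be excluded from C2.

from itertools import combinations

NUM_BUCKETS = 7                     # assumed hash-table size, for illustration only
buckets = [0] * NUM_BUCKETS

def bucket_of(pair):
    # Any deterministic hash of the pair will do here.
    return hash(frozenset(pair)) % NUM_BUCKETS

for t in D:                         # same scan that counts candidate 1-itemsets
    for pair in combinations(sorted(t), 2):
        buckets[bucket_of(pair)] += 1

def may_be_frequent(pair):
    # A 2-itemset whose bucket count is below min_sup cannot be frequent.
    return buckets[bucket_of(pair)] >= min_sup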
Improving the Efficiency of Apriori
 Hash-based technique:
Improving the Efficiency of Apriori
 Transaction reduction:
 A transaction that does not contain any frequent k-
itemsets cannot contain any frequent (k+1)-itemsets.
 Therefore, such a transaction can be marked or
removed from further consideration because
subsequent scans of the database for j-itemsets, where j
> k, will not require it.
Improving the Efficiency of Apriori
 Partitioning:
Improving the Efficiency of Apriori
 Partitioning:
 A partitioning technique can be used that requires just two
database scans to mine the frequent itemsets.
 In Phase I, the algorithm subdivides the transactions of D
into n non overlapping partitions.
 If the minimum support threshold for transactions in D is
min_sup, then the minimum support count for a partition is
min_sup × the number of transactions in that partition.
Improving the Efficiency of Apriori
 Partitioning:
 For each partition, all frequent itemsets within the partition
are found. These are referred to as local frequent itemsets.
 Any itemset that is potentially frequent with respect to D
must occur as a frequent itemset in at least one of the
partitions. Therefore, all local frequent itemsets are
candidate itemsets with respect to D.
Improving the Efficiency of Apriori
 Partitioning:
 In Phase II, a second scan of D is conducted in which
the actual support of each candidate is assessed in
order to determine the global frequent itemsets.
Partition size and the number of partitions are set so
that each partition can fit into main memory and
therefore be read only once in each phase.
Improving the Efficiency of Apriori
 Sampling
 The basic idea of the sampling approach is to pick a
random sample S of the given data D, and then search for
frequent itemsets in S instead of D.
 The sample size of S is such that the search for frequent
itemsets in S can be done in main memory, and so only one
scan of the transactions in S is required overall.
Improving the Efficiency of Apriori
Dynamic itemset counting
 The database is partitioned into blocks marked by start
points.
 New candidate itemsets can be added at any start
point, unlike in Apriori, which determines new
candidate itemsets only immediately before each
complete database scan.
Mining Frequent Itemsets without
Candidate Generation
 The Apriori candidate generate-and-test method
significantly reduces the size of candidate sets, leading
to good performance gain.
 Issues:
 It may need to generate a huge number of candidate sets.
 It may need to repeatedly scan the database and check a
large set of candidates by pattern matching.
Mining Frequent Itemsets without
Candidate Generation
 An interesting method in this attempt is called frequent-pattern
growth, or simply FP-growth, which adopts a divide-and-conquer
strategy as follows.
 First, it compresses the database representing frequent items into a
frequent-pattern tree, or FP-tree, which retains the itemset
association information.
 It then divides the compressed database into a set of conditional
databases, each associated with one frequent item or “pattern
fragment,” and mines each such database separately.
Mining Frequent Itemsets without
Candidate Generation
Mining Frequent Itemsets without
Candidate Generation
 The FP-growth method transforms the problem of
finding long frequent patterns to searching for shorter
ones recursively and then concatenating the suffix.
 It uses the least frequent items as a suffix, offering good
selectivity.
 The method substantially reduces the search costs.
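A complete FP-growth implementation is longer than a slide, but the method can be exercised with an off-the-shelf routine. The sketch below assumes the mlxtend library is available and applies its fpgrowth and association_rules helpers to the same nine transactions:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"], ["I1","I3"],
                ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# min_support = 2/9 corresponds to the absolute min_sup = 2 used earlier.
frequent = fpgrowth(df, min_support=2/9, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])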
Mining Various Kinds of Association Rules
 Mining Multilevel Association Rules
 Strong associations discovered at high levels of
abstraction may represent commonsense knowledge.
 Therefore, data mining systems should provide
capabilities for mining association rules at multiple
levels of abstraction, with sufficient flexibility for easy
traversal among different abstraction spaces.
Mining Various Kinds of Association Rules
 Mining Multilevel Association Rules
 Association rules generated from mining data at
multiple levels of abstraction are called multiple-level
or multilevel association rules.
 Multilevel association rules can be mined efficiently
using concept hierarchies under a support-confidence
framework.
Mining Various Kinds of Association Rules
 Mining Multilevel Association Rules - Using uniform
minimum support for all levels:
 The same minimum support threshold is used when mining
at each level of abstraction.
 When a uniform minimum support threshold is used, the
search procedure is simplified.
 The method is also simple in that users are required to
specify only one minimum support threshold.
Mining Various Kinds of Association Rules
 Mining Multilevel Association Rules - Using
reduced minimum support at lower levels:
 Each level of abstraction has its own minimum
support threshold.
 The deeper the level of abstraction, the smaller the
corresponding threshold is.
Mining Various Kinds of Association Rules
 Mining Multilevel Association Rules
Mining Various Kinds of Association Rules
 Mining Multidimensional Association Rules from
Relational Databases and Data Warehouses:
 Database attributes can be categorical or quantitative.
 Categorical attributes have a finite number of possible
values, with no ordering among the values (e.g.,
occupation, brand, color).
 Categorical attributes are also called nominal attributes,
because their values are “names of things.”
Mining Various Kinds of Association Rules
 Mining Multidimensional Association Rules from
Relational Databases and Data Warehouses:
 Quantitative attributes are numeric and have an
implicit ordering among values (e.g., age, income,
price).
 Quantitative attributes are discretized using
predefined concept hierarchies.
Mining Various Kinds of Association Rules
 Mining Multidimensional Association Rules from
Relational Databases and Data Warehouses:
 A concept hierarchy for income may be used to
replace the original numeric values of this attribute by
interval labels, such as “0...20K”, “21K...30K”,
“31K...40K”, and so on.
Here, discretization is static and predetermined.
Mining Various Kinds of Association Rules
 Mining Multidimensional Association Rules from
Relational Databases and Data Warehouses:
 Quantitative attributes are discretized or clustered into
“bins” based on the distribution of the data.
 The discretization process is dynamic and established
so as to satisfy some mining criteria, such as
maximizing the confidence of the rules mined.
Constraint-Based Association Mining
 A good heuristic is to have the users specify such
intuition or expectations as constraints to confine the
search space. This strategy is known as constraint-
based mining.
 Constraints: Knowledge type constraints, Data
constraints, Dimension/level constraints,
Interestingness constraints and Rule constraints.
Constraint-Based Association Mining
 Meta rule-Guided Mining of Association Rules:
 Metarules allow users to specify the syntactic form of rules
that they are interested in mining.
 The rule forms can be used as constraints to help improve
the efficiency of the mining process.
 Metarules may be based on the analyst’s experience,
expectations, or intuition regarding the data or may be
automatically generated based on the database schema.
Constraint-Based Association Mining
 Constraint Pushing: Mining Guided by Rule Constraints:
 Rule constraints specify expected set/subset relationships of the
variables in the mined rules, constant initiation of variables, and
aggregate functions.
 Rule constraints can be classified into the following five
categories with respect to frequent itemset mining: (1)
antimonotonic, (2) monotonic, (3) succinct, (4) convertible, and
(5) inconvertible.
Classification and Prediction
 The data analysis task is classification, where a
model or classifier is constructed to predict
categorical labels,
“safe” or “risky” for the loan application data;
“yes” or “no” for the marketing data; or
“treatment A,” “treatment B,” or “treatment C” for the
medical data.
Classification and Prediction
 The data analysis task is an example of numeric
prediction, where the model constructed predicts a
continuous-valued function, or ordered value, as
opposed to a categorical label. This model is a
predictor.
 Regression analysis is a statistical methodology that is
most often used for numeric prediction
Classification and Prediction
 How does classification work?
 A classifier is built describing a predetermined set of
data classes or concepts.
 This is the learning step (or training phase), where a
classification algorithm builds the classifier by
analyzing or “learning from” a training set made up of
database tuples and their associated class labels.
Classification and Prediction
 How does classification work?
 Each tuple, X, is assumed to belong to a predefined
class as determined by another database attribute
called the class label attribute.
 The class label attribute is discrete-valued and
unordered. It is categorical in that each value serves as
a category or class.
Classification and Prediction
 How does classification work?
 The class label of each training tuple is provided, this
step is also known as supervised learning.
 Unsupervised learning (or clustering), in which the
class label of each training tuple is not known, and the
number or set of classes to be learned may not be
known in advance.
Classification and Prediction
 What about classification accuracy?
 First, the predictive accuracy of the classifier is estimated.
 A test set is used, made up of test tuples and their
associated class labels.
 The accuracy of a classifier on a given test set is the
percentage of test set tuples that are correctly classified by
the classifier.
Issues Regarding Classification and
Prediction
 Preparing the Data for Classification and Prediction:
 Data cleaning: This refers to the preprocessing of data in
order to remove or reduce noise and the treatment of
missing values.
 Relevance analysis: Many of the attributes in the data may
be redundant. Attribute subset selection can be used in
these cases to find a reduced set of attributes.
Issues Regarding Classification and
Prediction
 Preparing the Data for Classification and Prediction:
 Data transformation and reduction: The data may be
transformed by normalization. Normalization involves scaling all
values for a given attribute so that they fall within a small
specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
 Concept hierarchies may be used for this purpose; for example,
numeric values for the attribute income can be generalized to
discrete ranges, such as low, medium, and high.
Comparing Classification and Prediction
Methods
 Accuracy: The accuracy of the classifier can be referred to
as the ability of the classifier to predict the class label
correctly, and the accuracy of the predictor can be referred
to as how well a given predictor can estimate the unknown
value.
 Speed: The speed of the method depends on the
computational cost of generating and using the classifier or
predictor.
Comparing Classification and Prediction
Methods
 Robustness: Robustness is the ability of the classifier or predictor to
make correct predictions given noisy data or data with missing values.
 Scalability: Scalability refers to the ability to construct the classifier or
predictor efficiently given large amounts of data.
 Interpretability: Interpretability is how readily we can understand
the reasoning behind predictions or classification made by the
predictor or classifier.
Classification by Decision Tree Induction
A decision tree is a flowchart-like tree structure,
where each internal node (nonleaf node) denotes a
test on an attribute, each branch represents an
outcome of the test, and each leaf node (or
terminal node) holds a class label.
 The topmost node in a tree is the root node.
Classification by Decision Tree Induction
Classification by Decision Tree Induction
 How are decision trees used for classification?
 Given a tuple, X, for which the associated class label is
unknown, the attribute values of the tuple are tested against
the decision tree.
 A path is traced from the root to a leaf node, which holds
the class prediction for that tuple.
 Decision trees can easily be converted to classification
rules.
Classification by Decision Tree Induction
 Why are decision tree classifiers so popular?
 The construction of decision tree classifiers does not
require any domain knowledge or parameter setting.
 Decision trees can handle high dimensional data.
 Decision tree induction algorithms have been used for
classification in many application areas, such as medicine,
manufacturing and production, financial analysis,
astronomy, and molecular biology.
Decision Tree Induction
 Algorithm: Generate decision tree. Generate a decision tree from the training
tuples of data partition D.
 Input:
 Data partition, D, which is a set of training tuples and their associated class labels;
 attribute list, the set of candidate attributes;
 Attribute selection method, a procedure to determine the splitting criterion that
“best” partitions the data tuples into individual classes. This criterion consists of a
splitting attribute and, possibly, either a split point or splitting subset.
 Output: A decision tree.
Decision Tree Induction
(1) create a node N;
(2) if tuples in D are all of the same class, C then
(3) return N as a leaf node labeled with the class C;
(4) if attribute list is empty then
(5) return N as a leaf node labeled with the majority class in D; //
majority voting
(6) apply Attribute selection method(D, attribute list) to find the “best”
splitting criterion;
(7) label node N with splitting criterion;
Decision Tree Induction
(8) if splitting attribute is discrete-valued and multiway
splits allowed then // not restricted to binary trees
(9) attribute_list ← attribute_list - splitting_attribute;
// remove splitting attribute
(10) for each outcome j of splitting criterion
//partition the tuples and grow subtrees for each
partition
Decision Tree Induction
(11) let Dj be the set of data tuples in D satisfying outcome j; // a
partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to
node N;
(14) else attach the node returned by Generate decision tree(Dj,
attribute list) to node N;
endfor
(15) return N;
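A minimal recursive Python sketch of the Generate_decision_tree procedure above, with information gain as the assumed Attribute_selection_method; tuples are represented as plain dictionaries and the tree as nested dictionaries:

import math
from collections import Counter

def entropy(tuples, label):
    # Info(D) = -sum(pi * log2(pi)) over the classes present in the partition.
    counts = Counter(t[label] for t in tuples)
    total = len(tuples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(tuples, attr, label):
    # Gain(A) = Info(D) - InfoA(D), where InfoA(D) weights each partition Dj by |Dj|/|D|.
    total = len(tuples)
    info_a = 0.0
    for value in {t[attr] for t in tuples}:
        subset = [t for t in tuples if t[attr] == value]
        info_a += len(subset) / total * entropy(subset, label)
    return entropy(tuples, label) - info_a

def generate_decision_tree(D, attribute_list, label="class"):
    classes = {t[label] for t in D}
    if len(classes) == 1:                               # all tuples of the same class C
        return classes.pop()
    if not attribute_list:                              # majority voting
        return Counter(t[label] for t in D).most_common(1)[0][0]
    best = max(attribute_list, key=lambda a: info_gain(D, a, label))
    remaining = [a for a in attribute_list if a != best]  # remove splitting attribute
    majority = Counter(t[label] for t in D).most_common(1)[0][0]
    node = {best: {}}
    for value in {t[best] for t in D}:                  # one branch per outcome
        Dj = [t for t in D if t[best] == value]
        node[best][value] = majority if not Dj else generate_decision_tree(Dj, remaining, label)
    return node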
Decision Tree Induction
Attribute Selection Measures:
 Information gain (or) Entropy: Info(D) = - Σi pi log2(pi), where pi is the
probability that an arbitrary tuple in D belongs to class Ci.
 Information gain of the attribute: InfoA(D) = Σj (|Dj| / |D|) × Info(Dj),
where attribute A splits D into partitions D1, ..., Dv.
Decision Tree Induction
Attribute Selection Measures:
 Gain: Gain(A) = Info(D) - InfoA(D)
Decision Tree Induction
RID Age Income Student Credit_Rating Class:
buys_computer
1 Youth High No Fair No
2 Youth High No Excellent No
3 Middle_aged High No Fair Yes
4 Senior Medium No Fair Yes
5 Senior low Yes Fair Yes
6 Senior low Yes Excellent No
7 Middle_aged low Yes Excellent Yes
8 Youth Medium No Fair No
9 Youth low Yes Fair Yes
10 Senior Medium Yes Fair Yes
11 Youth Medium Yes Excellent Yes
12 Middle_aged Medium No Excellent Yes
13 Middle_aged High Yes Fair Yes
14 Senior Medium No Excellent No
Decision Tree Induction
Attribute Selection Measures:
 Information gain (or) Entropy:
 Info(D) = - 9/14 log2(9/14) - 5/14 log2(5/14)
= 0.940 bits
Decision Tree Induction
Attribute Selection Measures:
 Information gain of the attribute:
 Info_age(D) = 5/14 × (-2/5 log2(2/5) - 3/5 log2(3/5))
+ 4/14 × (-4/4 log2(4/4) - 0/4 log2(0/4))
+ 5/14 × (-3/5 log2(3/5) - 2/5 log2(2/5)) = 0.694 bits
Decision Tree Induction
Attribute Selection Measures:
 Gain:
 Gain(age) = 0.940 - 0.694 = 0.246 bits
Decision Tree Induction
Attribute Selection Measures:
 Gain(Age) = 0.940 - 0.694 = 0.246 bits
 Gain (Income) = 0.029 bits
 Gain (student) = 0.151 bits
 Gain (credit_rating) = 0.048 bits
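The arithmetic above can be checked with a few lines of Python, reusing the entropy and info_gain helpers from the decision-tree sketch on the 14-tuple buys_computer table:

rows = [
    ("Youth","High","No","Fair","No"), ("Youth","High","No","Excellent","No"),
    ("Middle_aged","High","No","Fair","Yes"), ("Senior","Medium","No","Fair","Yes"),
    ("Senior","Low","Yes","Fair","Yes"), ("Senior","Low","Yes","Excellent","No"),
    ("Middle_aged","Low","Yes","Excellent","Yes"), ("Youth","Medium","No","Fair","No"),
    ("Youth","Low","Yes","Fair","Yes"), ("Senior","Medium","Yes","Fair","Yes"),
    ("Youth","Medium","Yes","Excellent","Yes"), ("Middle_aged","Medium","No","Excellent","Yes"),
    ("Middle_aged","High","Yes","Fair","Yes"), ("Senior","Medium","No","Excellent","No"),
]
attrs = ["age", "income", "student", "credit_rating"]
tuples = [dict(zip(attrs + ["class"], row)) for row in rows]

print(round(entropy(tuples, "class"), 3))               # Info(D) = 0.94 bits
for a in attrs:
    print(a, round(info_gain(tuples, a, "class"), 3))   # age 0.246, income 0.029, ...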
Decision Tree Induction
Decision Tree Induction
 Gain ratio:
 The information gain measure is biased toward tests with
many outcomes.
 For example, consider an attribute that acts as a unique
identifier, such as product ID. A split on product ID would
result in a large number of partitions (as many as there are
values), each one containing just one tuple.
Decision Tree Induction
 Gain ratio:
 C4.5, a successor of ID3, uses an extension to
information gain known as gain ratio, which attempts
to overcome this bias.
It applies a kind of normalization to information gain
using a “split information” value defined analogously
with Info(D): SplitInfoA(D) = - Σj (|Dj| / |D|) × log2(|Dj| / |D|),
and GainRatio(A) = Gain(A) / SplitInfoA(D).
Decision Tree Induction
 Gain ratio:
 The attribute with the maximum gain ratio is selected
as the splitting attribute.
Decision Tree Induction
 Gini index:
 The Gini index is used in CART.
 The Gini index measures the impurity of D, a data
partition or set of training tuples, as Gini(D) = 1 - Σi pi²,
where pi is the probability that a tuple in D belongs to class Ci.
Decision Tree Induction
 Tree Pruning
 When a decision tree is built, many of the
branches will reflect anomalies in the training data
due to noise or outliers.
 Tree pruning methods address this problem of
overfitting the data.
Decision Tree Induction
 How does tree pruning work?
 In the prepruning approach, a tree is “pruned” by halting
its construction early.
 The second and more common approach is postpruning,
which removes subtrees from a “fully grown” tree.
 A subtree at a given node is pruned by removing its
branches and replacing it with a leaf.
Bayesian Classification
 Bayesian classifiers are statistical classifiers.
 They can predict class membership
probabilities, such as the probability that a
given tuple belongs to a particular class.
 Bayesian classification is based on Bayes’
theorem
Bayesian Classification
 Bayes’ Theorem
 Let X be a data tuple.
 Let H be some hypothesis, such as that the data tuple X
belongs to a specified class C.
 For classification problems, we want to determine P(H | X),
that is, the probability that tuple X belongs to
class C, given that we know the attribute description of X.
Bayesian Classification
 Bayes’ Theorem
 P(H | X) is the posterior probability, or a posteriori
probability, of H conditioned on X.
 P(H) is the prior probability, or a priori probability, of H.
 Similarly, P(X | H) is the posterior probability of X
conditioned on H.
 P(X) is the prior probability of X.
Bayesian Classification
 Naïve Bayesian Classification
 Bayesian classifiers have the minimum error rate in
comparison to all other classifiers.
 Bayesian classifiers are also useful in that they
provide a theoretical justification for other classifiers
that do not explicitly use Bayes’ theorem.
Bayesian Classification
 Naïve Bayesian Classification
 Let D be a training set of tuples and their
associated class labels. Each tuple is represented
by an n-dimensional attribute vector, X = (x1, x2, ...,
xn), from n attributes, respectively, A1, A2, ...,
An.
Bayesian Classification
 Naïve Bayesian Classification
 Suppose that there are m classes, C1, C2, ..., Cm.
 Given a tuple, X, the classifier will predict that X belongs to
the class having the highest posterior probability, conditioned
on X.
 That is, the naïve Bayesian classifier predicts that tuple X
belongs to the class Ci if and only if P(Ci | X) > P(Cj | X) for
1 ≤ j ≤ m, j ≠ i.
Bayesian Classification
 Naïve Bayesian Classification
 Suppose that there are m classes, C1, C2, ...,
Cm.
 The class Ci for which P(Ci | X) is maximized is called
the maximum posteriori hypothesis. By Bayes’ theorem,
P(Ci | X) = P(X | Ci) P(Ci) / P(X).
Bayesian Classification
 Naïve Bayesian Classification
 As P(X) is constant for all classes, only P(X | Ci)
P(Ci) need be maximized.
 If the class prior probabilities are not known, then it is
commonly assumed that the classes are equally likely,
that is, P(C1) = P(C2) = …. = P(Cm), and we would
therefore maximize P(X | Ci).
Bayesian Classification
 Naïve Bayesian Classification
 In order to reduce computation in evaluating P(X |
Ci), the naive assumption of class-conditional
independence is made: P(X | Ci) = Πk P(xk | Ci)
= P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci).
Bayesian Classification
 Naïve Bayesian Classification
 In order to predict the class label of X, P(X | Ci)P(Ci) is
evaluated for each class Ci. The classifier predicts that the
class label of tuple X is the class Ci if and only if
P(X | Ci)P(Ci) > P(X | Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
 In other words, the predicted class label is the class Ci for
which P(X | Ci)P(Ci) is the maximum.
Bayesian Classification
 Naïve Bayesian Classification
 Class: C1: buys_computer = ‘yes’
 Class: C2: buys_computer = ‘no’
 Data to be classified:
 X = (age = youth, income = medium, student = yes, credit
rating = fair)
 P(Ci) = P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
Bayesian Classification
 Naïve Bayesian Classification
 To compute P(X | Ci),
 P(age = youth | buys_computer = yes) = 2/9 = 0.222
 P(age = youth | buys_computer = no) = 3/5 = 0.600
 P(income = medium | buys_computer = yes) = 4/9 = 0.444
 P(income = medium | buys_computer = no) = 2/5 = 0.400
 P(student = yes | buys_computer = yes) = 6/9 = 0.667
 P(student = yes | buys_computer = no) = 1/5 = 0.200
 P(credit rating = fair | buys_computer = yes) = 6/9 = 0.667
 P(credit rating = fair | buys_computer = no) = 2/5 = 0.400
Bayesian Classification
 Naïve Bayesian Classification
 X = (age = youth, income = medium, student = yes, credit rating
= fair)
 P(X | buys_computer = yes)
= P(age = youth | buys_computer = yes) * P(income = medium |
buys_computer = yes) * P(student = yes | buys_computer = yes)*
P(credit rating = fair | buys_computer = yes)
= 0.222 * 0.444*0.667*0.667 = 0.044.
 P(X|buys_computer = no) = 0.600*0.400*0.200*0.400 = 0.019.
Bayesian Classification
 Naïve Bayesian Classification
 To find the class, Ci, that maximizes P(X | Ci)P(Ci), we
compute
 P(X | buys_computer = yes)P(buys_computer = yes) =
0.044*0.643 = 0.028
 P(X | buys_computer = no)P(buys_computer = no) =
0.019*0.357 = 0.007
 The naïve Bayesian classifier predicts buys computer = yes for
tuple X.
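A minimal Python sketch (reusing the tuples list built for the decision-tree example, no smoothing) that reproduces the naïve Bayesian computation above and predicts buys_computer = Yes for X:

from collections import Counter

def naive_bayes_predict(tuples, x, label="class"):
    classes = Counter(t[label] for t in tuples)
    n = len(tuples)
    best_class, best_score = None, -1.0
    for c, count in classes.items():
        score = count / n                          # prior P(Ci)
        for attr, value in x.items():              # class-conditional independence
            match = sum(1 for t in tuples if t[label] == c and t[attr] == value)
            score *= match / count                 # P(xk | Ci)
        print(c, round(score, 3))                  # P(X | Ci)P(Ci): 0.007 for No, 0.028 for Yes
        if score > best_score:
            best_class, best_score = c, score
    return best_class

x = {"age": "Youth", "income": "Medium", "student": "Yes", "credit_rating": "Fair"}
print(naive_bayes_predict(tuples, x))              # Yes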
Bayesian Classification
 Bayesian Belief Networks:
 Bayesian belief networks specify joint conditional probability
distributions.
 They allow class conditional independencies to be defined
between subsets of variables.
 They provide a graphical model of causal relationships, on
which learning can be performed.
 Bayesian belief networks are also known as belief networks,
Bayesian networks, and probabilistic networks.
Bayesian Classification
 Bayesian Belief Networks:
 A belief network is defined by two components—
a directed acyclic graph and a set of conditional
probability tables.
Rule-Based Classification
The learned model is represented as a set of IF-
THEN rules.
 Using IF-THEN Rules for Classification:
 Rules are a good way of representing information or
bits of knowledge.
A rule-based classifier uses a set of IF-THEN rules for
classification.
Rule-Based Classification
Using IF-THEN Rules for Classification:
 An IF-THEN rule is an expression of the form
 IF condition THEN conclusion
 An example is rule R1, R1: IF age = youth AND
student = yes THEN buys computer = yes.
Rule-Based Classification
Using IF-THEN Rules for Classification:
 A rule R can be assessed by its coverage and accuracy.
Given a tuple, X, from a class-labeled data set, D, let
ncovers be the number of tuples covered by R, ncorrect be
the number of tuples correctly classified by R, and |D|
be the number of tuples in D.
We can define the coverage and accuracy of R as
coverage(R) = ncovers / |D| and accuracy(R) = ncorrect / ncovers.
Rule-Based Classification
Using IF-THEN Rules for Classification:
 A rule’s coverage is the percentage of tuples that
are covered by the rule.
 For a rule’s accuracy, tuples that it covers and see
what percentage of them the rule can correctly
classify.
Rule-Based Classification
Using IF-THEN Rules for Classification:
An example is rule R1, R1: IF age = youth AND
student = yes THEN buys computer = yes.
 Coverage (R1) = 2 / 14 = 14.28%
 Accuracy (R1) = 2 / 2 = 100%
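The same numbers fall out of a two-line check against the 14-tuple buys_computer table (tuples as defined in the earlier sketch):

covered = [t for t in tuples if t["age"] == "Youth" and t["student"] == "Yes"]
print(len(covered) / len(tuples))                               # coverage(R1) = 2/14
print(sum(t["class"] == "Yes" for t in covered) / len(covered)) # accuracy(R1) = 2/2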
Rule-Based Classification
Using IF-THEN Rules for Classification:
 Rule-based classification to predict the class label of a
given tuple, X.
 If a rule is satisfied by X, the rule is said to be
triggered.
 X= (age = youth, income = medium, student = yes,
credit rating = fair).
Rule-Based Classification
Using IF-THEN Rules for Classification:
 If more than one rule is triggered, we need a
conflict resolution strategy to figure out which rule
gets to fire and assign its class prediction to X.
 size ordering scheme: It assigns the highest priority to
the triggering rule that has the “toughest” requirements.
The triggering rule with the most attribute tests is fired
Rule-Based Classification
 Using IF-THEN Rules for Classification:
 The rule ordering scheme prioritizes the rules beforehand. The
ordering may be class based or rule-based.
 With class-based ordering, the classes are sorted in order of
decreasing “importance,” such as by decreasing order of
prevalence.
 That is, all of the rules for the most prevalent (or most frequent)
class come first, the rules for the next prevalent class come next,
and so on.
Rule-Based Classification
 Using IF-THEN Rules for Classification:
 With rule-based ordering, the rules are organized into one long
priority list, according to some measure of rule quality such as
accuracy, coverage, or size (number of attribute tests in the rule
antecedent), or based on advice from domain experts.
 When rule ordering is used, the rule set is known as a decision
list.
 With rule ordering, the triggering rule that appears earliest in the
list has highest priority, and so it gets to fire its class prediction.
Rule-Based Classification
Rule Extraction from a Decision Tree
 To extract rules from a decision tree, one rule is
created for each path from the root to a leaf node.
 Each splitting criterion along a given path is logically
ANDed to form the rule antecedent (“IF” part).
The leaf node holds the class prediction, forming the
rule consequent (“THEN” part).
Rule-Based Classification
Rule Extraction from a Decision Tree
 R1: IF age = youth AND student = no THEN buys
computer = no
R2: IF age = youth AND student = yes THEN buys
computer = yes
R3: IF age = middle aged THEN buys computer = yes
R4: IF age = senior AND credit rating = excellent
THEN buys computer = yes
R5: IF age = senior AND credit rating = fair THEN
buys computer = no
Rule-Based Classification
 Rule Induction Using a Sequential Covering Algorithm:
 The name comes from the notion that the rules are learned
sequentially (one at a time), where each rule for a given
class will ideally cover many of the tuples of that class (and
hopefully none of the tuples of other classes).
 There are many sequential covering algorithms. Popular
variations include AQ, CN2, and the more recent, RIPPER.
Rule-Based Classification
 Rule Induction Using a Sequential Covering Algorithm:
Rules are learned one at a time.
Each time a rule is learned, the tuples covered by the rule
are removed, and the process repeats on the remaining
tuples.
This sequential learning of rules is in contrast to decision
tree induction.
Classification by Backpropagation
 Backpropagation is a neural network learning algorithm
 A neural network is a set of connected input/output
units in which each connection has a weight associated
with it.
 Neural network learning is also referred to as
connectionist learning due to the connections between
units.
Classification by Backpropagation
A Multilayer Feed-Forward Neural Network:
Classification by Backpropagation
 A Multilayer Feed-Forward Neural Network:
 The backpropagation algorithm performs learning on a
multilayer feed-forward neural network.
 It iteratively learns a set of weights for prediction of the
class label of tuples.
 A multilayer feed-forward neural network consists of an
input layer, one or more hidden layers, and an output layer.
Classification by Backpropagation
 Each layer is made up of units. The inputs to the network
correspond to the attributes measured for each training
tuple.
 The inputs are fed simultaneously into the units making up
the input layer.
 These inputs pass through the input layer and are then
weighted and fed simultaneously to a second layer of
“neuronlike” units, known as a hidden layer.
Classification by Backpropagation
 The outputs of the hidden layer units can be input to
another hidden layer, and so on.
 The number of hidden layers is arbitrary, although in
practice, usually only one is used.
 The weighted outputs of the last hidden layer are input
to units making up the output layer, which emits the
network’s prediction for given tuples.
Classification by Backpropagation
 The units in the input layer are called input units.
 The units in the hidden layers and output layer are
sometimes referred to as neurodes, due to their
symbolic biological basis, or as output units.
 The multilayer neural network shown above has two layers of
output units; therefore, we say that it is a two-layer neural
network (the input layer is not counted because it serves only to
pass the input values to the next layer).
Classification by Backpropagation
 Network Topology:
 The user must decide on the network topology by
specifying the number of units in the input layer, the
number of hidden layers (if more than one), the number of
units in each hidden layer, and the number of units in the
output layer.
 Typically, input values are normalized so as to fall between
0.0 and 1.0.
Classification by Backpropagation
 Network Topology:
 Network design is a trial-and-error process and
may affect the accuracy of the resulting trained
network.
Classification by Backpropagation
 Backpropagation:
 Backpropagation learns by iteratively processing a data set
of training tuples, comparing the network’s prediction for
each tuple with the actual known target value.
 The target value may be the known class label of the
training tuple (for classification problems) or a continuous
value (for prediction).
Classification by Backpropagation
 Backpropagation:
 For each training tuple, the weights are modified so as to
minimize the mean squared error between the network’s
prediction and the actual target value.
 These modifications are made in the “backwards”
direction, that is, from the output layer, through each
hidden layer down to the first hidden layer (hence the name
backpropagation).
Classification by Backpropagation
 Backpropagation:
 Initialize the weights: The weights in the network are
initialized to small random numbers (e.g., ranging
from -1.0 to 1.0, or -0.5 to 0.5).
 Each unit has a bias associated with it, as explained
below. The biases are similarly initialized to small
random numbers.
Classification by Backpropagation
 Backpropagation: Each training tuple, X, is processed
by the following steps.
 Propagate the inputs forward: First, the training tuple is fed
to the input layer of the network. The inputs pass through
the input units, unchanged.
 That is, for an input unit, j, its output, Oj, is equal to its
input value, Ij. Next, the net input and output of each unit
in the hidden and output layers are computed.
Classification by Backpropagation
 Backpropagation: Each training tuple, X, is processed
by the following steps.
 Propagate the inputs forward: The net input to a unit in the
hidden or output layers is computed as a linear combination
of its inputs.
 Each such unit has a number of inputs to it that are, in fact,
the outputs of the units connected to it in the previous layer.
 Each connection has a weight.
Classification by Backpropagation
 Backpropagation:
 To compute the net input to the unit, each input
connected to the unit is multiplied by its
corresponding weight, and this is summed.
Given a unit j in a hidden or output layer, the net
input, Ij, to unit j is Ij = Σi wij Oi + θj.
Classification by Backpropagation
 Backpropagation:
Classification by Backpropagation
 Backpropagation:
 Given a unit j in a hidden or output layer, the net input, Ij,
to unit j is Ij = Σi wij Oi + θj,
 where wij is the weight of the connection from unit i in
the previous layer to unit j; Oi is the output of unit i from
the previous layer; and θj is the bias of the unit.
The bias acts as a threshold in that it serves to vary the
activity of the unit.
Classification by Backpropagation
 Backpropagation:
 Each unit in the hidden and output layers takes its net input
and then applies an activation function to it.
 The function symbolizes the activation of the neuron
represented by the unit. The logistic, or sigmoid, function is
used.
Given the net input Ij to unit j, then Oj, the output of unit j,
is computed as Oj = 1 / (1 + e^(-Ij)).
Classification by Backpropagation
 Backpropagation:
 We compute the output values, Oj, for each hidden
layer, up to and including the output layer, which gives
the network’s prediction.
In practice, it is a good idea to cache (i.e., save) the
intermediate output values at each unit as they are
required again later, when backpropagating the error.
Classification by Backpropagation
 Backpropagation:
 Backpropagate the error: The error is propagated backward
by updating the weights and biases to reflect the error of
the network’s prediction. For a unit j in the output layer, the
error Err j is computed by
Errj = Oj(1-Oj)(Tj - Oj)
 where Oj is the actual output of unit j, and Tj is the known
target value of the given training tuple.
Classification by Backpropagation
 Backpropagation:
 To compute the error of a hidden layer unit j, the weighted
sum of the errors of the units connected to unit j in the next
layer is considered. The error of a hidden layer unit j is
Errj = Oj(1 - Oj) Σk Errk wjk,
 where wjk is the weight of the connection from unit j to a
unit k in the next higher layer, and Errk is the error of unit k.
Classification by Backpropagation
 Backpropagation:
 The computational efficiency depends on the time
spent training the network. Given |D| tuples and w
weights, each epoch requires O(|D| x w) time.
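A minimal NumPy sketch of one forward and backward pass under assumed layer sizes and an assumed learning rate, using exactly the formulas above (sigmoid units, Errj = Oj(1 - Oj)(Tj - Oj) at the output, weighted error sums at hidden units):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, lr = 3, 2, 1, 0.5          # assumed topology and learning rate

# Weights and biases are initialized to small random numbers.
W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden)); b1 = rng.uniform(-0.5, 0.5, n_hidden)
W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_out)); b2 = rng.uniform(-0.5, 0.5, n_out)

def sigmoid(I):
    return 1.0 / (1.0 + np.exp(-I))

x = np.array([1.0, 0.0, 1.0])                     # one training tuple (inputs in [0, 1])
t = np.array([1.0])                               # known target value Tj

# Propagate the inputs forward: Ij = sum_i(wij * Oi) + theta_j, Oj = 1 / (1 + e^(-Ij)).
O1 = sigmoid(x @ W1 + b1)
O2 = sigmoid(O1 @ W2 + b2)

# Backpropagate the error.
err_out = O2 * (1 - O2) * (t - O2)                # output units
err_hid = O1 * (1 - O1) * (W2 @ err_out)          # hidden units: weighted sum of next-layer errors

# Update weights and biases in the "backwards" direction: delta_wij = lr * Errj * Oi.
W2 += lr * np.outer(O1, err_out); b2 += lr * err_out
W1 += lr * np.outer(x, err_hid);  b1 += lr * err_hid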
Support Vector Machines
 Method for the classification of both linear
and nonlinear data.
 It uses a nonlinear mapping to transform the
original training data into a higher dimension.
 Within this new dimension, it searches for the
linear optimal separating hyperplane
Support Vector Machines
 With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane.
 The SVM finds this hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors).
 SVMs can be used for prediction as well as classification.
 Applications: Handwritten digit recognition, object recognition, and
speaker identification, as well as benchmark time-series prediction
tests.
Support Vector Machines
Goal:
The goal of the SVM algorithm is to create the best line or
decision boundary that can segregate n-dimensional
space into classes, so that new data points can easily be
placed in the correct category in the future.
This best decision boundary is called a hyperplane.
Support Vector Machines
 The Case When the Data Are Linearly Separable:
 The 2-D data are linearly separable (or “linear,” for short)
because a straight line can be drawn to separate all of the
tuples of class +1 from all of the tuples of class -1.
 There are an infinite number of separating lines that could
be drawn.
We want to find the “best” one, that is, the one that will have the
minimum classification error on previously unseen tuples.
Support Vector Machines
The Case When the Data Are Linearly Separable:
 We use the term “hyperplane” to refer to the decision boundary
that we are seeking, regardless of the number of input
attributes.
 An SVM approaches this problem by searching for the
maximum marginal hyperplane.
Support Vector Machines
The Case When the Data Are Linearly
Separable:
Support Vector Machines
The Case When the Data Are Linearly
Separable:
Support Vector Machines
The Case When the Data Are Linearly Separable:
 Both hyperplanes can correctly classify all of the
given data tuples.
 We expect the hyperplane with the larger margin to be more
accurate at classifying future data tuples than the
hyperplane with the smaller margin.
Support Vector Machines
The Case When the Data Are Linearly
Inseparable:
Support Vector Machines
The Case When the Data Are Linearly
Inseparable:
 No straight line can be found that would separate the
classes.
 SVMs are capable of finding nonlinear decision
boundaries (i.e., nonlinear hypersurfaces) in input
space.
Support Vector Machines
The Case When the Data Are Linearly
Inseparable:
 Transform the original input data into a higher
dimensional space using a nonlinear mapping.
 Once the data have been transformed into the new
higher space, the second step searches for a linear
separating hyperplane in the new space
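A brief scikit-learn sketch (library availability and the toy data are assumptions) of both cases: a linear kernel when a separating straight line exists, and an RBF kernel, which corresponds to the nonlinear mapping into a higher-dimensional space:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 2))

# Linearly separable case: class +1 above the line x2 = x1, class -1 below it.
y = np.where(X[:, 1] > X[:, 0], 1, -1)
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(len(linear_svm.support_vectors_))             # the "essential" training tuples

# Linearly inseparable case: a circular class boundary; the RBF kernel implicitly
# maps the inputs into a higher-dimensional space where a hyperplane separates them.
y_circle = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1, -1)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y_circle)
print(rbf_svm.predict([[0.0, 0.0], [0.9, 0.9]]))     # expected: [ 1 -1 ]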
Other Classification Methods
 Genetic Algorithms:
 Genetic algorithms attempt to incorporate ideas of
natural evolution.
 An initial population is created consisting of
randomly generated rules.
 Each rule can be represented by a string of bits.
Other Classification Methods
 Genetic Algorithms:
 Based on the notion of survival of the fittest, a new
population is formed to consist of the fittest rules in
the current population, as well as offspring of these
rules.
 The fitness of a rule is assessed by its classification
accuracy on a set of training samples.
Other Classification Methods
 Genetic Algorithms:
 Offspring are created by applying genetic operators such
as crossover and mutation.
 In crossover, substrings from pairs of rules are swapped to
form new pairs of rules.
In mutation, randomly selected bits in a rule’s string are
inverted.
 Genetic algorithms can also be used to evaluate the fitness of other algorithms.
Other Classification Methods
 Rough Set Approach
 Rough set theory can be used for classification to
discover structural relationships within imprecise or
noisy data.
 It applies to discrete-valued attributes. Continuous-
valued attributes must therefore be discretized before
its use.
Other Classification Methods
 Rough Set Approach
 Rough set theory is based on the establishment of
equivalence classes within the given training data.
Rough sets can be used to approximately or “roughly”
define such classes.
 A rough set definition for a given class, C, is approximated
by two sets—a lower approximation of C and an upper
approximation of C.
Other Classification Methods
 Rough Set Approach
 The lower approximation of C consists of all of the data
tuples that, based on the knowledge of the attributes, are
certain to belong to C without ambiguity.
 The upper approximation of C consists of all of the tuples
that, based on the knowledge of the attributes, cannot be
described as not belonging to C.
Other Classification Methods
 Rough Set Approach
 Rough sets can also be used for attribute subset
selection and relevance analysis.
Other Classification Methods
 Fuzzy Set Approaches
 Fuzzy set theory is also known as possibility theory.
 It lets us work at a high level of abstraction and offers a
means for dealing with imprecise measurement of data.
 Fuzzy set theory allows us to deal with vague or inexact
facts.
 Fuzzy set theory is useful for data mining systems
performing rule-based classification.
Prediction
 Numeric prediction is the task of predicting
continuous (or ordered) values for given input.
 For example, to predict the salary of college
graduates with 10 years of work experience, or
the potential sales of a new product given its
price.
Prediction
 Regression analysis can be used to model the
relationship between one or more independent or
predictor variables and a dependent or response
variable (which is continuous-valued).
 In the context of data mining, the predictor variables
are the attributes of interest describing the tuple (i.e.,
making up the attribute vector).
Prediction
 Linear Regression
 Straight-line regression analysis involves a response
variable, y, and a single predictor variable, x.
 It is the simplest form of regression, and models y as a
linear function of x. That is, y = b + wx.
 The regression coefficients, w and b, can also be thought
of as weights, so that we can equivalently write,
y = w0+w1x.
Prediction
 Linear Regression
Prediction
 Linear Regression
 Multiple linear regression is an extension of straight-line
regression so as to involve more than one predictor
variable.
 An example of a multiple linear regression model based on
two predictor attributes or variables, A1 and A2, is
y = w0+w1x1+w2x2.
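A short NumPy sketch of both forms by the method of least squares; the data arrays are illustrative assumptions, not values from the slides:

import numpy as np

# Straight-line regression y = w0 + w1*x.
x = np.array([3.0, 8.0, 9.0, 13.0, 3.0, 6.0])        # e.g., years of experience
y = np.array([30.0, 57.0, 64.0, 72.0, 36.0, 43.0])   # e.g., salary in $1000s
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
print(w0, w1)

# Multiple linear regression y = w0 + w1*x1 + w2*x2 via least squares.
A1 = x
A2 = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])        # a second (made-up) predictor
X = np.column_stack([np.ones(len(A1)), A1, A2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)                                             # [w0, w1, w2]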
Prediction
 Nonlinear Regression:
 Polynomial regression is often of interest when there is
just one predictor variable.
 It can be modeled by adding polynomial terms to the basic
linear model.
 By applying transformations to the variables, we can
convert the nonlinear model into a linear one that can then
be solved by the method of least squares.
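Continuing the regression sketch above (x, y, and NumPy already defined), the transformation trick looks like this: adding x^2 and x^3 columns turns y = w0 + w1 x + w2 x^2 + w3 x^3 into a linear model in the transformed variables, solvable by least squares:

X_poly = np.column_stack([np.ones(len(x)), x, x ** 2, x ** 3])   # polynomial terms as new predictors
w_poly, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(w_poly)                                                    # [w0, w1, w2, w3]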
Industrialised data - the key to AI success.pdfLars Albertsson
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 

Recently uploaded (20)

Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 

20IT501_DWDM_PPT_Unit_III.ppt

  • 11. Frequent Pattern Mining - Classification  Based on the completeness of patterns to be mined:  Mine the complete set of frequent itemsets, the closed frequent itemsets, and the maximal frequent itemsets, given a minimum support threshold.  Mine constrained frequent itemsets (i.e., those that satisfy a set of user-defined constraints), approximate frequent itemsets (i.e., those that derive only approximate support counts for the mined frequent itemsets),
  • 12. Frequent Pattern Mining - Classification Based on the completeness of patterns to be mined:  Mine near-match frequent itemsets (i.e., those that tally the support count of the near or almost matching itemsets),  Mine top-k frequent itemsets (i.e., the k most frequent itemsets for a user-specified value, k), and so on.
  • 13. Frequent Pattern Mining - Classification  Based on the levels of abstraction involved in the rule set:  Suppose that the set of association rules mined includes the following rules, where X is a variable representing a customer: buys(X, “computer”) => buys(X, “HP printer”) and buys(X, “laptop computer”) => buys(X, “HP printer”). Here, “computer” is a higher-level abstraction of “laptop computer”.
  • 14. Frequent Pattern Mining - Classification Based on the number of data dimensions involved in the rule:  If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule.  buys(X, “computer”) => buys(X, “HP printer”)
  • 15. Frequent Pattern Mining - Classification Based on the number of data dimensions involved in the rule:  If a rule references two or more dimensions, such as the dimensions age, income, and buys, then it is a multidimensional association rule.  age(X, “30...39”) ^ income(X, “42K...48K”) => buys(X, “high resolution TV”).
  • 16. Frequent Pattern Mining - Classification  Based on the types of values handled in the rule:  If a rule involves associations between the presence or absence of items, it is a Boolean association rule.  If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. Note that the quantitative attributes, age and income, have been discretized.
  • 17. Frequent Pattern Mining - Classification  Based on the kinds of rules to be mined:  Association rules are the most popular kind of rules generated from frequent patterns.  Based on the kinds of patterns to be mined:  Frequent itemset mining, that is, the mining of frequent itemsets (sets of items) from transactional or relational data sets.
  • 18. Frequent Pattern Mining - Classification  Based on the kinds of patterns to be mined:  Frequent itemset mining, that is, the mining of frequent itemsets (sets of items) from transactional or relational data sets.  Sequential pattern mining searches for frequent subsequences in a sequence data set, where a sequence records an ordering of events.
  • 19. Frequent Itemset Mining Methods  Apriori, the basic algorithm for finding frequent itemsets.  To generate strong association rules from frequent itemsets
  • 20. The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation  Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules.  The algorithm uses prior knowledge of frequent itemset properties
  • 21. The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation  To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.  Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
  • 22. The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation  Step 1: In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.  Step 2: Suppose that the minimum support count required is 2, that is, min sup = 2. The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support.
  • 23. The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation. Transactional data D:
TID  List of item_IDs
T1   I1, I2, I5
T2   I2, I4
T3   I2, I3
T4   I1, I2, I4
T5   I1, I3
T6   I2, I3
T7   I1, I3
T8   I1, I2, I3, I5
T9   I1, I2, I3
  • 24. The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation. Scan D for the count of each candidate to obtain C1, then compare each candidate's support count with the minimum support count to obtain L1:
Itemset  Support Count
I1       6
I2       7
I3       6
I4       2
I5       2
With min sup = 2, every candidate 1-itemset satisfies minimum support, so L1 contains the same five itemsets as C1.
  • 25. The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation  Step 3: To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 on L1 to generate a candidate set of 2-itemsets, C2.  Step 4: Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.  Step 5: The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support
  • 26. The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation. Generate C2 from L1, scan D for the count of each candidate, and compare each candidate support count with the minimum support count (MSC):
C2 itemset   Sup. Count
{I1, I2}     4
{I1, I3}     4
{I1, I4}     1
{I1, I5}     2
{I2, I3}     4
{I2, I4}     2
{I2, I5}     2
{I3, I4}     0
{I3, I5}     1
{I4, I5}     0
L2 (candidates meeting the MSC of 2): {I1, I2} 4, {I1, I3} 4, {I1, I5} 2, {I2, I3} 4, {I2, I4} 2, {I2, I5} 2
  • 27. The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation  Step 6: The set of candidate 3-itemsets, C3, is generated by joining L2 with L2 and pruning any candidate that has an infrequent subset.  Step 7: The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.  Step 8: The algorithm uses the join of L3 with L3 to generate a candidate set of 4-itemsets, C4; every resulting candidate is pruned because it has an infrequent subset, so C4 is empty and the algorithm terminates.
  • 28. The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation. Generate the C3 candidates from L2, scan D for the count of each candidate, and compare each support count with the minimum support count:
C3: {I1, I2, I3}, {I1, I2, I5}
Support counts: {I1, I2, I3} = 2, {I1, I2, I5} = 2
Both candidates satisfy min sup = 2, so L3 = {{I1, I2, I3}, {I1, I2, I5}}.
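The level-wise candidate-generate-and-test procedure of Steps 1 to 8 can be condensed into a short Python sketch over the nine example transactions above; the function and variable names are illustrative and not part of the slides.

```python
from itertools import combinations

# Transactions from the running example (TID table above)
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP = 2  # minimum support count

def count_support(candidates):
    """Scan the database once and count how many transactions contain each candidate."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if set(c) <= t:
                counts[c] += 1
    return counts

def apriori():
    # C1: every individual item is a candidate 1-itemset
    items = sorted({i for t in transactions for i in t})
    Lk = {c: n for c, n in count_support([(i,) for i in items]).items() if n >= MIN_SUP}
    frequent = dict(Lk)
    k = 2
    while Lk:
        # Join step: union pairs of frequent (k-1)-itemsets and keep only k-item unions;
        # prune step (Apriori property): every (k-1)-subset of a candidate must be frequent
        Ck = set()
        for a, b in combinations(sorted(Lk), 2):
            union = tuple(sorted(set(a) | set(b)))
            if len(union) == k and all(s in Lk for s in combinations(union, k - 1)):
                Ck.add(union)
        Lk = {c: n for c, n in count_support(Ck).items() if n >= MIN_SUP}
        frequent.update(Lk)
        k += 1
    return frequent

print(apriori())   # includes ('I1', 'I2', 'I5') -> 2, matching L3 above
```

Running the sketch reproduces the L1, L2, and L3 itemsets tabulated in the preceding slides.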
  • 29. Generating Association Rules from Frequent Itemsets  Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence).  confidence(A => B) = P(B | A) = support_count(A ∪ B) / support_count(A).
  • 30. Generating Association Rules from Frequent Itemsets Association rules can be generated as follows:  For each frequent itemset l, generate all nonempty subsets of l.  For every nonempty subset s of l, output the rule “s => (l - s)” if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
  • 31. Generating Association Rules from Frequent Itemsets  Suppose the data contain the frequent itemset l = {I1, I2, I5}  The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}.  The resulting association rules are as shown below, each listed with its confidence: I1 ^ I2 => I5, confidence = 2/4 = 50%; I1 ^ I5 => I2, confidence = 2/2 = 100%; I2 ^ I5 => I1, confidence = 2/2 = 100%; I1 => I2 ^ I5, confidence = 2/6 = 33%; I2 => I1 ^ I5, confidence = 2/7 = 29%; I5 => I1 ^ I2, confidence = 2/2 = 100%.
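Continuing the earlier sketch (it assumes the apriori() function and MIN_SUP defined above), the rule-generation step with a confidence check can be written as follows; MIN_CONF is an illustrative threshold.

```python
from itertools import combinations

MIN_CONF = 0.5  # illustrative minimum confidence threshold

def generate_rules(frequent):
    """For each frequent itemset l, emit s => (l - s) whenever confidence >= MIN_CONF."""
    rules = []
    for l, sup_l in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in combinations(l, r):
                # Every subset of a frequent itemset is frequent, so its count is available
                conf = sup_l / frequent[tuple(sorted(s))]
                if conf >= MIN_CONF:
                    rules.append((set(s), set(l) - set(s), conf))
    return rules

for antecedent, consequent, conf in generate_rules(apriori()):
    print(f"{antecedent} => {consequent}  (confidence = {conf:.0%})")
```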
  • 32. Improving the Efficiency of Apriori  Hash-based technique:  A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1.  Generate all of the itemsets for each transaction, hash (i.e., map) them into the different buckets of a hash table structure, and increase the corresponding bucket counts.
  • 33. Improving the Efficiency of Apriori  Hash-based technique:
  • 34. Improving the Efficiency of Apriori  Transaction reduction:  A transaction that does not contain any frequent k- itemsets cannot contain any frequent (k+1)-itemsets.  Therefore, such a transaction can be marked or removed from further consideration because subsequent scans of the database for j-itemsets, where j > k, will not require it.
  • 35. Improving the Efficiency of Apriori  Partitioning:
  • 36. Improving the Efficiency of Apriori  Partitioning:  A partitioning technique can be used that requires just two database scans to mine the frequent itemsets.  In Phase I, the algorithm subdivides the transactions of D into n nonoverlapping partitions.  If the minimum support threshold for transactions in D is min sup, then the minimum support count for a partition is min sup × the number of transactions in that partition.
  • 37. Improving the Efficiency of Apriori  Partitioning:  For each partition, all frequent itemsets within the partition are found. These are referred to as local frequent itemsets.  Any itemset that is potentially frequent with respect to D must occur as a frequent itemset in at least one of the partitions. Therefore, all local frequent itemsets are candidate itemsets with respect to D.
  • 38. Improving the Efficiency of Apriori  Partitioning:  In Phase II, a second scan of D is conducted in which the actual support of each candidate is assessed in order to determine the global frequent itemsets. Partition size and the number of partitions are set so that each partition can fit into main memory and therefore be read only once in each phase.
  • 39. Improving the Efficiency of Apriori  Sampling  The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D.  The sample size of S is such that the search for frequent itemsets in S can be done in main memory, and so only one scan of the transactions in S is required overall.
  • 40. Improving the Efficiency of Apriori Dynamic itemset counting  The database is partitioned into blocks marked by start points.  New candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan.
  • 41. Mining Frequent Itemsets without Candidate Generation  The Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gain.  Issues:  It may need to generate a huge number of candidate sets.  It may need to repeatedly scan the database and check a large set of candidates by pattern matching.
  • 42. Mining Frequent Itemsets without Candidate Generation  An interesting method in this attempt is called frequent-pattern growth, or simply FP-growth, which adopts a divide-and-conquer strategy as follows.  First, it compresses the database representing frequent items into a frequent-pattern tree, or FP-tree, which retains the itemset association information.  It then divides the compressed database into a set of conditional databases, each associated with one frequent item or “pattern fragment,” and mines each such database separately.
  • 43. Mining Frequent Itemsets without Candidate Generation
  • 44. Mining Frequent Itemsets without Candidate Generation  The FP-growth method transforms the problem of finding long frequent patterns to searching for shorter ones recursively and then concatenating the suffix.  It uses the least frequent items as a suffix, offering good selectivity.  The method substantially reduces the search costs.
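The slides describe FP-growth conceptually. As a hedged illustration only (the mlxtend library is an outside assumption, not something the slides reference, and its API may differ across versions), mining the nine example transactions with an off-the-shelf FP-growth implementation might look like this:

```python
# pip install mlxtend pandas  (assumed third-party dependencies)
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]

# One-hot encode the transactions into a Boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# min_support is a fraction of transactions; 0.2 of 9 transactions corresponds to a count of 2
freq_itemsets = fpgrowth(df, min_support=0.2, use_colnames=True)
print(freq_itemsets.sort_values("support", ascending=False))
```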
  • 45. Mining Various Kinds of Association Rules  Mining Multilevel Association Rules  Strong associations discovered at high levels of abstraction may represent commonsense knowledge.  Therefore, data mining systems should provide capabilities for mining association rules at multiple levels of abstraction, with sufficient flexibility for easy traversal among different abstraction spaces.
  • 46. Mining Various Kinds of Association Rules  Mining Multilevel Association Rules  Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.  Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
  • 47. Mining Various Kinds of Association Rules  Mining Multilevel Association Rules - Using uniform minimum support for all levels:  The same minimum support threshold is used when mining at each level of abstraction.  When a uniform minimum support threshold is used, the search procedure is simplified.  The method is also simple in that users are required to specify only one minimum support threshold.
  • 48. Mining Various Kinds of Association Rules  Mining Multilevel Association Rules - Using reduced minimum support at lower levels:  Each level of abstraction has its own minimum support threshold.  The deeper the level of abstraction, the smaller the corresponding threshold is.
  • 49. Mining Various Kinds of Association Rules  Mining Multilevel Association Rules
  • 50. Mining Various Kinds of Association Rules  Mining Multidimensional Association Rules from Relational Databases and Data Warehouses:  Database attributes can be categorical or quantitative.  Categorical attributes have a finite number of possible values, with no ordering among the values (e.g., occupation, brand, color).  Categorical attributes are also called nominal attributes, because their values are “names of things.”
  • 51. Mining Various Kinds of Association Rules  Mining Multidimensional Association Rules from Relational Databases and Data Warehouses:  Quantitative attributes are numeric and have an implicit ordering among values (e.g., age, income, price).  Quantitative attributes are discretized using predefined concept hierarchies.
  • 52. Mining Various Kinds of Association Rules  Mining Multidimensional Association Rules from Relational Databases and Data Warehouses:  A concept hierarchy for income may be used to replace the original numeric values of this attribute by interval labels, such as “0. . . 20K”, “21K . . . 30K”, “31K . . . 40K”, and so on. Here, discretization is static and predetermined.
  • 53. Mining Various Kinds of Association Rules  Mining Multidimensional Association Rules from Relational Databases and Data Warehouses:  Quantitative attributes are discretized or clustered into “bins” based on the distribution of the data.  The discretization process is dynamic and established so as to satisfy some mining criteria, such as maximizing the confidence of the rules mined.
  • 54. Constraint-Based Association Mining  A good heuristic is to have the users specify such intuition or expectations as constraints to confine the search space. This strategy is known as constraint- based mining.  Constraints: Knowledge type constraints, Data constraints, Dimension/level constraints, Interestingness constraints and Rule constraints.
  • 55. Constraint-Based Association Mining  Meta rule-Guided Mining of Association Rules:  Metarules allow users to specify the syntactic form of rules that they are interested in mining.  The rule forms can be used as constraints to help improve the efficiency of the mining process.  Metarules may be based on the analyst’s experience, expectations, or intuition regarding the data or may be automatically generated based on the database schema.
  • 56. Constraint-Based Association Mining  Constraint Pushing: Mining Guided by Rule Constraints:  Rule constraints specify expected set/subset relationships of the variables in the mined rules, constant initiation of variables, and aggregate functions.  Rule constraints can be classified into the following five categories with respect to frequent itemset mining: (1) antimonotonic, (2) monotonic, (3) succinct, (4) convertible, and (5) inconvertible.
  • 57. Classification and Prediction  The data analysis task is classification, where a model or classifier is constructed to predict categorical labels, “safe” or “risky” for the loan application data; “yes” or “no” for the marketing data; or “treatment A,” “treatment B,” or “treatment C” for the medical data.
  • 58. Classification and Prediction  The data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a categorical label. This model is a predictor.  Regression analysis is a statistical methodology that is most often used for numeric prediction
  • 59. Classification and Prediction  How does classification work?  A classifier is built describing a predetermined set of data classes or concepts.  This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or “learning from” a training set made up of database tuples and their associated class labels.
  • 60. Classification and Prediction  How does classification work?  Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute.  The class label attribute is discrete-valued and unordered. It is categorical in that each value serves as a category or class.
  • 61. Classification and Prediction  How does classification work?  The class label of each training tuple is provided, this step is also known as supervised learning.  Unsupervised learning (or clustering), in which the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance.
  • 62. Classification and Prediction  What about classification accuracy?  First, the predictive accuracy of the classifier is estimated.  A test set is used, made up of test tuples and their associated class labels.  The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier.
  • 63. Issues Regarding Classification and Prediction  Preparing the Data for Classification and Prediction:  Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise and the treatment of missing values.  Relevance analysis: Many of the attributes in the data may be redundant. Attribute subset selection can be used in these cases to find a reduced set of attributes.
  • 64. Issues Regarding Classification and Prediction  Preparing the Data for Classification and Prediction:  Data transformation and reduction: The data may be transformed by normalization. Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0.  Concept hierarchies may be used for this purpose: numeric values for the attribute income can be generalized to discrete ranges, such as low, medium, and high.
  • 65. Comparing Classification and Prediction Methods  Accuracy: The accuracy of the classifier can be referred to as the ability of the classifier to predict the class label correctly, and the accuracy of the predictor can be referred to as how well a given predictor can estimate the unknown value.  Speed: The speed of the method depends on the computational cost of generating and using the classifier or predictor.
  • 66. Comparing Classification and Prediction Methods  Robustness: Robustness is the ability of the classifier or predictor to make correct predictions or classifications given noisy data or data with missing values.  Scalability: Scalability refers to the ability to construct the classifier or predictor efficiently as the amount of data grows.  Interpretability: Interpretability is how readily we can understand the reasoning behind predictions or classifications made by the predictor or classifier.
  • 67. Classification by Decision Tree Induction A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.  The topmost node in a tree is the root node.
  • 68. Classification by Decision Tree Induction
  • 69. Classification by Decision Tree Induction  How are decision trees used for classification?  Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree.  A path is traced from the root to a leaf node, which holds the class prediction for that tuple.  Decision trees can easily be converted to classification rules.
  • 70. Classification by Decision Tree Induction  Why are decision tree classifiers so popular?  The construction of decision tree classifiers does not require any domain knowledge or parameter setting.  Decision trees can handle high dimensional data.  Decision tree induction algorithms have been used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.
  • 71. Decision Tree Induction  Algorithm: Generate decision tree. Generate a decision tree from the training tuples of data partition D.  Input:  Data partition, D, which is a set of training tuples and their associated class labels;  attribute list, the set of candidate attributes;  Attribute selection method, a procedure to determine the splitting criterion that “best” partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split point or splitting subset.  Output: A decision tree.
  • 72. Decision Tree Induction (1) create a node N; (2) if tuples in D are all of the same class, C then (3) return N as a leaf node labeled with the class C; (4) if attribute list is empty then (5) return N as a leaf node labeled with the majority class in D; // majority voting (6) apply Attribute selection method(D, attribute list) to find the “best” splitting criterion; (7) label node N with splitting criterion;
  • 73. Decision Tree Induction (8) if splitting attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees (9) attribute_list ← attribute_list - splitting_attribute; // remove splitting attribute (10) for each outcome j of splitting criterion // partition the tuples and grow subtrees for each partition
  • 74. Decision Tree Induction (11) let Dj be the set of data tuples in D satisfying outcome j; // a partition (12) if Dj is empty then (13) attach a leaf labeled with the majority class in D to node N; (14) else attach the node returned by Generate decision tree(Dj, attribute list) to node N; endfor (15) return N;
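A minimal runnable Python counterpart of this pseudocode, using information gain as the Attribute_selection_method; the tuple representation (dicts with a "class" key) and all names are illustrative assumptions, not part of the slides.

```python
import math
from collections import Counter

def entropy(D):
    """Info(D) = -sum(p_i * log2(p_i)) over the class distribution of partition D."""
    counts = Counter(t["class"] for t in D)
    return -sum((n / len(D)) * math.log2(n / len(D)) for n in counts.values())

def best_split(D, attribute_list):
    """Attribute selection method: pick the attribute with the highest information gain."""
    base = entropy(D)
    def gain(a):
        values = Counter(t[a] for t in D)
        info_a = sum((n / len(D)) * entropy([t for t in D if t[a] == v])
                     for v, n in values.items())
        return base - info_a
    return max(attribute_list, key=gain)

def generate_decision_tree(D, attribute_list):
    classes = {t["class"] for t in D}
    if len(classes) == 1:                          # all tuples are of the same class
        return classes.pop()
    if not attribute_list:                         # no attributes left: majority voting
        return Counter(t["class"] for t in D).most_common(1)[0][0]
    a = best_split(D, attribute_list)
    node = {a: {}}
    remaining = [x for x in attribute_list if x != a]  # multiway split on a discrete attribute
    for v in {t[a] for t in D}:
        Dj = [t for t in D if t[a] == v]           # partition for outcome v
        node[a][v] = generate_decision_tree(Dj, remaining)
    return node
```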
  • 75. Decision Tree Induction Attribute Selection Measures:  Information gain (or) Entropy: Info(D) = - Σi pi log2(pi), where pi is the probability that an arbitrary tuple in D belongs to class Ci.  Information gain of the attribute A (the expected information still required after partitioning D on A into partitions D1, ..., Dv): InfoA(D) = Σj (|Dj| / |D|) × Info(Dj)
  • 76. Decision Tree Induction Attribute Selection Measures:  Gain: Gain(A) = Info(D) - InfoA(D), the expected reduction in the information requirement obtained by branching on A.
  • 77. Decision Tree Induction. Class-labeled training tuples:
RID  Age          Income  Student  Credit_Rating  Class: buys_computer
1    Youth        High    No       Fair           No
2    Youth        High    No       Excellent      No
3    Middle_aged  High    No       Fair           Yes
4    Senior       Medium  No       Fair           Yes
5    Senior       Low     Yes      Fair           Yes
6    Senior       Low     Yes      Excellent      No
7    Middle_aged  Low     Yes      Excellent      Yes
8    Youth        Medium  No       Fair           No
9    Youth        Low     Yes      Fair           Yes
10   Senior       Medium  Yes      Fair           Yes
11   Youth        Medium  Yes      Excellent      Yes
12   Middle_aged  Medium  No       Excellent      Yes
13   Middle_aged  High    Yes      Fair           Yes
14   Senior       Medium  No       Excellent      No
  • 78. Decision Tree Induction Attribute Selection Measures:  Information gain (or) Entropy:  Info(D) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.940 bits
  • 79. Decision Tree Induction Attribute Selection Measures:  Information gain of the attribute age:  Info age(D) = 5/14 × Info(2 yes, 3 no) + 4/14 × Info(4 yes, 0 no) + 5/14 × Info(3 yes, 2 no) = 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971 = 0.694 bits
  • 80. Decision Tree Induction Attribute Selection Measures:  Gain:  Gain(age) = 0.940 - 0.694 = 0.246 bits
  • 81. Decision Tree Induction Attribute Selection Measures:  Gain(Age) = .940 - .694 = 0.246 bits  Gain (Income) = 0.029 bits  Gain (student) = 0.151 bits  Gain (credit_rating) = 0.048 bits
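These gains can be verified with a short script over the 14-row training table above; the column order and helper names are illustrative choices, not from the slides.

```python
import math
from collections import Counter

# (age, income, student, credit_rating, buys_computer) for the 14 training tuples above
data = [
    ("youth", "high", "no", "fair", "no"), ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"), ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"), ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"), ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"), ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"), ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"), ("senior", "medium", "no", "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit_rating"]

def info(rows):
    """Info(D) = -sum p_i log2 p_i over the class column (last field)."""
    counts = Counter(r[-1] for r in rows)
    return -sum((n / len(rows)) * math.log2(n / len(rows)) for n in counts.values())

def gain(rows, idx):
    """Gain(A) = Info(D) - Info_A(D) for the attribute in column idx."""
    info_a = sum((len(part) / len(rows)) * info(part)
                 for v in {r[idx] for r in rows}
                 for part in [[r for r in rows if r[idx] == v]])
    return info(rows) - info_a

for i, name in enumerate(ATTRS):
    print(f"Gain({name}) = {gain(data, i):.3f} bits")   # age 0.246, income 0.029, student 0.151, credit_rating 0.048
```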
  • 83. Decision Tree Induction  Gain ratio:  The information gain measure is biased toward tests with many outcomes.  For example, consider an attribute that acts as a unique identifier, such as product ID. A split on product ID would result in a large number of partitions (as many as there are values), each one containing just one tuple.
  • 84. Decision Tree Induction  Gain ratio:  C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which attempts to overcome this bias. It applies a kind of normalization to information gain using a “split information” value defined analogously with Info(D).
  • 85. Decision Tree Induction  Gain ratio: SplitInfoA(D) = - Σj (|Dj| / |D|) × log2(|Dj| / |D|), and GainRatio(A) = Gain(A) / SplitInfoA(D).  The attribute with the maximum gain ratio is selected as the splitting attribute.
  • 86. Decision Tree Induction  Gini index:  The Gini index is used in CART.  The Gini index measures the impurity of D, a data partition or set of training tuples, as Gini(D) = 1 - Σi pi^2, where pi is the probability that a tuple in D belongs to class Ci.
  • 87. Decision Tree Induction  Tree Pruning  When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.  Tree pruning methods address this problem of overfitting the data.
  • 88. Decision Tree Induction  How does tree pruning work?  In the prepruning approach, a tree is “pruned” by halting its construction early.  The second and more common approach is postpruning, which removes subtrees from a “fully grown” tree.  A subtree at a given node is pruned by removing its branches and replacing it with a leaf.
  • 89. Bayesian Classification  Bayesian classifiers are statistical classifiers.  They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.  Bayesian classification is based on Bayes’ theorem
  • 90. Bayesian Classification  Bayes’ Theorem  Let X be a data tuple.  Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.  For classification problems, we want to determine P(H | X), the probability that tuple X belongs to class C given that we know the attribute description of X.
  • 91. Bayesian Classification  Bayes’ Theorem  P(H | X) is the posterior probability, or a posteriori probability, of H conditioned on X.  P(H) is the prior probability, or a priori probability, of H.  Similarly, P(X | H) is the posterior probability of X conditioned on H.  P(X) is the prior probability of X.  Bayes’ theorem relates these quantities: P(H | X) = P(X | H) P(H) / P(X).
  • 92. Bayesian Classification  Naïve Bayesian Classification  Bayesian classifiers have the minimum error rate in comparison to all other classifiers.  Bayesian classifiers are also useful in that they provide a theoretical justification for other classifiers that do not explicitly use Bayes’ theorem.
  • 93. Bayesian Classification  Naïve Bayesian Classification  Let D be a training set of tuples and their associated class labels. Each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), from n attributes, respectively, A1, A2, ..., An.
  • 94. Bayesian Classification  Naïve Bayesian Classification  Suppose that there are m classes, C1, C2, ..., Cm.  Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X.  That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if P(Ci | X) > P(Cj | X) for 1 ≤ j ≤ m, j ≠ i.
  • 95. Bayesian Classification  Naïve Bayesian Classification  Suppose that there are m classes, C1, C2, ..., Cm.  The class Ci for which P(Ci | X) is maximized is called the maximum a posteriori hypothesis. By Bayes’ theorem, P(Ci | X) = P(X | Ci) P(Ci) / P(X).
  • 96. Bayesian Classification  Naïve Bayesian Classification  As P(X) is constant for all classes, only P(X | Ci) P(Ci) need be maximized.  If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = …. = P(Cm), and we would therefore maximize P(X | Ci).
  • 97. Bayesian Classification  Naïve Bayesian Classification  In order to reduce computation in evaluating P(X | Ci), the naive assumption of class conditional independence is made: P(X | Ci) = Πk P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci).
  • 98. Bayesian Classification  Naïve Bayesian Classification  In order to predict the class label of X, P(X | Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if P(X | Ci)P(Ci) > P(X | Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.  In other words, the predicted class label is the class Ci for which P(X | Ci)P(Ci) is the maximum.
  • 99. Bayesian Classification  Naïve Bayesian Classification  Class: C1: buys_computer = ‘yes’  Class: C2: buys_computer = ‘no’  Data to be classified:  X = (age = youth, income = medium, student = yes, credit rating = fair)  P(Ci) = P(buys_computer = yes) = 9/14 = 0.643 P(buys_computer = no) = 5/14 = 0.357
  • 100. Bayesian Classification  Naïve Bayesian Classification  To compute P(X | Ci),  P(age = youth | buys_computer = yes) = 2/9 = 0.222  P(age = youth | buys_computer = no) = 3/5 = 0.600  P(income = medium | buys_computer = yes) = 4/9 = 0.444  P(income = medium | buys_computer = no) = 2/5 = 0.400  P(student = yes | buys_computer = yes) = 6/9 = 0.667  P(student = yes | buys_computer = no) = 1/5 = 0.200  P(credit rating = fair | buys_computer = yes) = 6/9 = 0.667  P(credit rating = fair | buys_computer = no) = 2/5 = 0.400
  • 101. Bayesian Classification  Naïve Bayesian Classification  X = (age = youth, income = medium, student = yes, credit rating = fair)  P(X | buys_computer = yes) = P(age = youth | buys_computer = yes) * P(income = medium | buys_computer = yes) * P(student = yes | buys_computer = yes)* P(credit rating = fair | buys_computer = yes) = 0.222 * 0.444*0.667*0.667 = 0.044.  P(X|buys_computer = no) = 0.600*0.400*0.200*0.400 = 0.019.
  • 102. Bayesian Classification  Naïve Bayesian Classification  To find the class, Ci, that maximizes P(X | Ci)P(Ci), we compute  P(X | buys_computer = yes)P(buys_computer = yes) = 0.044*0.643 = 0.028  P(X | buys_computer = no)P(buys_computer = no) = 0.019*0.357 = 0.007  The naïve Bayesian classifier predicts buys_computer = yes for tuple X.
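The whole worked example fits in a few lines of Python. The sketch below assumes the 14-row data list defined in the earlier information-gain sketch and uses illustrative names; it reproduces the 0.028 vs. 0.007 comparison.

```python
from collections import Counter

# data: the 14 (age, income, student, credit_rating, buys_computer) tuples defined earlier
X = ("youth", "medium", "yes", "fair")          # tuple to classify

def naive_bayes(data, x):
    classes = Counter(r[-1] for r in data)      # class counts: yes = 9, no = 5
    scores = {}
    for c, n_c in classes.items():
        prior = n_c / len(data)                 # P(Ci)
        rows_c = [r for r in data if r[-1] == c]
        likelihood = 1.0
        for k, value in enumerate(x):           # class-conditional independence assumption
            matches = sum(1 for r in rows_c if r[k] == value)
            likelihood *= matches / n_c         # P(xk | Ci)
        scores[c] = prior * likelihood          # P(X | Ci) P(Ci)
    return max(scores, key=scores.get), scores

label, scores = naive_bayes(data, X)
print(scores)                                   # roughly {'yes': 0.028, 'no': 0.007}
print("predicted buys_computer =", label)       # yes
```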
  • 103. Bayesian Classification  Bayesian Belief Networks:  Bayesian belief networks specify joint conditional probability distributions.  They allow class conditional independencies to be defined between subsets of variables.  They provide a graphical model of causal relationships, on which learning can be performed.  Bayesian belief networks are also known as belief networks, Bayesian networks, and probabilistic networks.
  • 104. Bayesian Classification  Bayesian Belief Networks:  A belief network is defined by two components— a directed acyclic graph and a set of conditional probability tables.
  • 105. Rule-Based Classification The learned model is represented as a set of IF- THEN rules.  Using IF-THEN Rules for Classification:  Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification.
  • 106. Rule-Based Classification Using IF-THEN Rules for Classification:  An IF-THEN rule is an expression of the form  IF condition THEN conclusion  An example is rule R1, R1: IF age = youth AND student = yes THEN buys computer = yes.
  • 107. Rule-Based Classification Using IF-THEN Rules for Classification:  A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled data set, D, let ncovers be the number of tuples covered by R, ncorrect be the number of tuples correctly classified by R, and |D| be the number of tuples in D. We can define the coverage and accuracy of R as coverage(R) = ncovers / |D| and accuracy(R) = ncorrect / ncovers.
  • 108. Rule-Based Classification Using IF-THEN Rules for Classification:  A rule’s coverage is the percentage of tuples that are covered by the rule.  For a rule’s accuracy, we look at the tuples that it covers and see what percentage of them the rule can correctly classify.
  • 109. Rule-Based Classification Using IF-THEN Rules for Classification: An example is rule R1, R1: IF age = youth AND student = yes THEN buys computer = yes.  Coverage (R1) = 2 / 14 = 14.28%  Accuracy (R1) = 2 / 2 = 100%
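These numbers can be checked directly against the training data, again assuming the data list from the information-gain sketch (the attribute positions 0 = age and 2 = student follow that list's column order, and the helper name is illustrative).

```python
def rule_stats(data, antecedent, predicted_class):
    """coverage = covered / |D|; accuracy = correctly classified / covered."""
    covered = [r for r in data if all(r[i] == v for i, v in antecedent.items())]
    correct = [r for r in covered if r[-1] == predicted_class]
    return len(covered) / len(data), len(correct) / len(covered)

# R1: IF age = youth AND student = yes THEN buys_computer = yes
coverage, accuracy = rule_stats(data, {0: "youth", 2: "yes"}, "yes")
print(f"coverage(R1) = {coverage:.2%}, accuracy(R1) = {accuracy:.0%}")   # 14.29%, 100%
```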
  • 110. Rule-Based Classification Using IF-THEN Rules for Classification:  Rule-based classification to predict the class label of a given tuple, X.  If a rule is satisfied by X, the rule is said to be triggered.  X= (age = youth, income = medium, student = yes, credit rating = fair).
  • 111. Rule-Based Classification Using IF-THEN Rules for Classification:  If more than one rule is triggered, we need a conflict resolution strategy to figure out which rule gets to fire and assign its class prediction to X.  size ordering scheme: It assigns the highest priority to the triggering rule that has the “toughest” requirements. The triggering rule with the most attribute tests is fired
  • 112. Rule-Based Classification  Using IF-THEN Rules for Classification:  The rule ordering scheme prioritizes the rules beforehand. The ordering may be class based or rule-based.  With class-based ordering, the classes are sorted in order of decreasing “importance,” such as by decreasing order of prevalence.  That is, all of the rules for the most prevalent (or most frequent) class come first, the rules for the next prevalent class come next, and so on.
  • 113. Rule-Based Classification  Using IF-THEN Rules for Classification:  With rule-based ordering, the rules are organized into one long priority list, according to some measure of rule quality such as accuracy, coverage, or size (number of attribute tests in the rule antecedent), or based on advice from domain experts.  When rule ordering is used, the rule set is known as a decision list.  With rule ordering, the triggering rule that appears earliest in the list has highest priority, and so it gets to fire its class prediction.
  • 114. Rule-Based Classification Rule Extraction from a Decision Tree  To extract rules from a decision tree, one rule is created for each path from the root to a leaf node.  Each splitting criterion along a given path is logically ANDed to form the rule antecedent (“IF” part). The leaf node holds the class prediction, forming the rule consequent (“THEN” part).
  • 115. Rule-Based Classification Rule Extraction from a Decision Tree  R1: IF age = youth AND student = no THEN buys computer = no R2: IF age = youth AND student = yes THEN buys computer = yes R3: IF age = middle aged THEN buys computer = yes R4: IF age = senior AND credit rating = excellent THEN buys computer = yes R5: IF age = senior AND credit rating = fair THEN buys computer = no
  • 116. Rule-Based Classification  Rule Induction Using a Sequential Covering Algorithm:  The name comes from the notion that the rules are learned sequentially (one at a time), where each rule for a given class will ideally cover many of the tuples of that class (and hopefully none of the tuples of other classes).  There are many sequential covering algorithms. Popular variations include AQ, CN2, and the more recent, RIPPER.
  • 117. Rule-Based Classification  Rule Induction Using a Sequential Covering Algorithm: Rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed, and the process repeats on the remaining tuples. This sequential learning of rules is in contrast to decision tree induction.
  • 118. Classification by Backpropagation  Backpropagation is a neural network learning algorithm  A neural network is a set of connected input/output units in which each connection has a weight associated with it.  Neural network learning is also referred to as connectionist learning due to the connections between units.
  • 120. Classification by Backpropagation A Multilayer Feed-Forward Neural Network:
  • 121. Classification by Backpropagation  A Multilayer Feed-Forward Neural Network:  The backpropagation algorithm performs learning on a multilayer feed-forward neural network.  It iteratively learns a set of weights for prediction of the class label of tuples.  A multilayer feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer.
  • 122. Classification by Backpropagation  Each layer is made up of units. The inputs to the network correspond to the attributes measured for each training tuple.  The inputs are fed simultaneously into the units making up the input layer.  These inputs pass through the input layer and are then weighted and fed simultaneously to a second layer of “neuronlike” units, known as a hidden layer.
  • 123. Classification by Backpropagation  The outputs of the hidden layer units can be input to another hidden layer, and so on.  The number of hidden layers is arbitrary, although in practice, usually only one is used.  The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network’s prediction for given tuples.
  • 124. Classification by Backpropagation  The units in the input layer are called input units.  The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units.  The multilayer neural network has two layers of output units. Therefore, we say that it is a two-layer neural network.
  • 125. Classification by Backpropagation  Network Topology:  The user must decide on the network topology by specifying the number of units in the input layer, the number of hidden layers (if more than one), the number of units in each hidden layer, and the number of units in the output layer.  Typically, input values are normalized so as to fall between 0.0 and 1.0.
  • 126. Classification by Backpropagation  Network Topology:  Network design is a trial-and-error process and may affect the accuracy of the resulting trained network.
  • 127. Classification by Backpropagation  Backpropagation:  Backpropagation learns by iteratively processing a data set of training tuples, comparing the network’s prediction for each tuple with the actual known target value.  The target value may be the known class label of the training tuple (for classification problems) or a continuous value (for prediction).
  • 128. Classification by Backpropagation  Backpropagation:  For each training tuple, the weights are modified so as to minimize the mean squared error between the network’s prediction and the actual target value.  These modifications are made in the “backwards” direction, that is, from the output layer, through each hidden layer down to the first hidden layer (hence the name backpropagation).
  • 129. Classification by Backpropagation  Backpropagation:  Initialize the weights: The weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5).  Each unit has a bias associated with it, as explained below. The biases are similarly initialized to small random numbers.
  • 130. Classification by Backpropagation  Backpropagation: Each training tuple, X, is processed by the following steps.  Propagate the inputs forward: First, the training tuple is fed to the input layer of the network. The inputs pass through the input units, unchanged.  That is, for an input unit, j, its output, Oj, is equal to its input value, Ij. Next, the net input and output of each unit in the hidden and output layers are computed.
  • 131. Classification by Backpropagation  Backpropagation: Each training tuple, X, is processed by the following steps.  Propagate the inputs forward: The net input to a unit in the hidden or output layers is computed as a linear combination of its inputs.  Each such unit has a number of inputs to it that are, in fact, the outputs of the units connected to it in the previous layer.  Each connection has a weight.
  • 132. Classification by Backpropagation  Backpropagation:  To compute the net input to the unit, each input connected to the unit is multiplied by its corresponding weight, and this is summed. Given a unit j in a hidden or output layer, the net input, Ij, to unit j is Ij = Σi (wij × Oi) + θj
  • 134. Classification by Backpropagation  Backpropagation:  Given a unit j in a hidden or output layer, the net input, Ij, to unit j is Ij = Σi (wij × Oi) + θj, where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit i from the previous layer; and θj is the bias of the unit. The bias acts as a threshold in that it serves to vary the activity of the unit.
  • 135. Classification by Backpropagation  Backpropagation:  Each unit in the hidden and output layers takes its net input and then applies an activation function to it.  The function symbolizes the activation of the neuron represented by the unit. The logistic, or sigmoid, function is used. Given the net input Ij to unit j, then Oj, the output of unit j, is computed as Oj = 1 / (1 + e^(-Ij))
  • 136. Classification by Backpropagation  Backpropagation:  We compute the output values, Oj, for each hidden layer, up to and including the output layer, which gives the network’s prediction. In practice, it is a good idea to cache (i.e., save) the intermediate output values at each unit as they are required again later, when backpropagating the error.
  • 137. Classification by Backpropagation  Backpropagation:  Backpropagate the error: The error is propagated backward by updating the weights and biases to reflect the error of the network’s prediction. For a unit j in the output layer, the error Err j is computed by Errj = Oj(1-Oj)(Tj - Oj)  where Oj is the actual output of unit j, and Tj is the known target value of the given training tuple.
  • 138. Classification by Backpropagation  Backpropagation:  To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected to unit j in the next layer are considered. The error of a hidden layer unit j is Errj = Oj (1 - Oj) Σk (Errk × wjk), where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is the error of unit k.
  • 139. Classification by Backpropagation  Backpropagation:  The computational efficiency depends on the time spent training the network. Given |D| tuples and w weights, each epoch requires O(|D| x w) time.
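A minimal numeric sketch of one forward pass and one backward pass for a tiny 2-input, 2-hidden-unit, 1-output network, using exactly the Ij, Oj, and Errj formulas above. The initial weights, biases, and learning rate are made-up illustrative values, and the update rule wij = wij + l × Errj × Oi (with θj = θj + l × Errj) is the standard one assumed here.

```python
import math

def sigmoid(I):
    return 1.0 / (1.0 + math.exp(-I))

# Illustrative network: inputs x1, x2 -> hidden units h1, h2 -> one output unit
x = [1.0, 0.0]
w_h = [[0.2, -0.3], [0.4, 0.1]]   # w_h[j][i]: weight from input i to hidden unit j
b_h = [-0.4, 0.2]                 # hidden biases (theta_j)
w_o = [-0.3, -0.2]                # weights from hidden units to the output unit
b_o = 0.1
target, lr = 1.0, 0.9             # known target value T and learning rate l

# Propagate the inputs forward: I_j = sum_i(w_ij * O_i) + theta_j, then O_j = sigmoid(I_j)
O_h = [sigmoid(sum(w * xi for w, xi in zip(w_h[j], x)) + b_h[j]) for j in range(2)]
O_o = sigmoid(sum(w * o for w, o in zip(w_o, O_h)) + b_o)

# Backpropagate the error: output unit Err = O(1-O)(T-O); hidden unit Err = O(1-O) * sum_k(Err_k * w_jk)
err_o = O_o * (1 - O_o) * (target - O_o)
err_h = [O_h[j] * (1 - O_h[j]) * err_o * w_o[j] for j in range(2)]

# Update weights and biases in the backwards direction (assumed standard update rule)
w_o = [w_o[j] + lr * err_o * O_h[j] for j in range(2)]
b_o += lr * err_o
for j in range(2):
    w_h[j] = [w_h[j][i] + lr * err_h[j] * x[i] for i in range(2)]
    b_h[j] += lr * err_h[j]

print("prediction:", round(O_o, 3), "output error:", round(err_o, 4))
```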
  • 140. Support Vector Machines  Method for the classification of both linear and nonlinear data.  It uses a nonlinear mapping to transform the original training data into a higher dimension.  Within this new dimension, it searches for the linear optimal separating hyperplane
  • 141. Support Vector Machines  With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane.  The SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors).  SVMs can be used for prediction as well as classification.  Applications: Handwritten digit recognition, object recognition, and speaker identification, as well as benchmark time-series prediction tests.
  • 142. Support Vector Machines Goal: The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that new data points can easily be placed in the correct category in the future. This best decision boundary is called a hyperplane.
  • 143. Support Vector Machines  The Case When the Data Are Linearly Separable:  The 2-D data are linearly separable (or “linear,” for short) because a straight line can be drawn to separate all of the tuples of class +1 from all of the tuples of class -1.  There are an infinite number of separating lines that could be drawn. We want to find the “best” one, that is, the one that will have the minimum classification error on previously unseen tuples.
  • 144. Support Vector Machines The Case When the Data Are Linearly Separable:  We use the term “hyperplane” to refer to the decision boundary that we are seeking, regardless of the number of input attributes.  An SVM approaches this problem by searching for the maximum marginal hyperplane.
  • 145–146. Support Vector Machines The Case When the Data Are Linearly Separable: (figure slides comparing candidate separating hyperplanes and their margins)
  • 147. Support Vector Machines The Case When the Data Are Linearly Separable:  Both hyperplanes can correctly classify all of the given data tuples.  We expect the hyperplane with the larger margin to be more accurate at classifying future data tuples than the hyperplane with the smaller margin.
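The following sketch illustrates the linearly separable case using scikit-learn (an assumed library choice; the slides do not name one): a linear-kernel SVM is trained on invented 2-D data and exposes the support vectors that define the maximum marginal hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data that a straight line can separate: class +1 vs class -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.5, 2.5],   # class +1
              [0.5, 1.0], [1.0, 0.5], [0.0, 0.0]])  # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

# A large C approximates the hard-margin behaviour of the separable case.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print(clf.support_vectors_)        # the "essential" training tuples
print(clf.predict([[2.5, 2.0]]))   # classify a previously unseen tuple
```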
  • 148. Support Vector Machines The Case When the Data Are Linearly Inseparable: (figure slide: two classes that no straight line can separate)
  • 149. Support Vector Machines The Case When the Data Are Linearly Inseparable:  No straight line can be found that would separate the classes.  SVMs are capable of finding nonlinear decision boundaries (i.e., nonlinear hypersurfaces) in input space.
  • 150. Support Vector Machines The Case When the Data Are Linearly Inseparable:  Transform the original input data into a higher dimensional space using a nonlinear mapping.  Once the data have been transformed into the new, higher-dimensional space, the second step searches for a linear separating hyperplane in the new space.
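For the linearly inseparable case, a kernel plays the role of the nonlinear mapping. Below is a hedged sketch, again assuming scikit-learn, on XOR-style data that no straight line can separate; the gamma and C values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data: no straight line separates the two classes.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([-1, -1, +1, +1])

# The RBF kernel implicitly performs the nonlinear mapping to a
# higher-dimensional space, where a linear separating hyperplane is sought.
clf = SVC(kernel="rbf", gamma=2.0, C=10.0)
clf.fit(X, y)
print(clf.predict(X))   # should recover the training labels [-1 -1  1  1]
```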
  • 151. Other Classification Methods  Genetic Algorithms:  Genetic algorithms attempt to incorporate ideas of natural evolution.  An initial population is created consisting of randomly generated rules.  Each rule can be represented by a string of bits.
  • 152. Other Classification Methods  Genetic Algorithms:  Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules in the current population, as well as offspring of these rules.  The fitness of a rule is assessed by its classification accuracy on a set of training samples.
  • 153. Other Classification Methods  Genetic Algorithms:  Offspring are created by applying genetic operators such as crossover and mutation.  In crossover, substrings from pairs of rules are swapped to form new pairs of rules. In mutation, randomly selected bits in a rule’s string are inverted.  Genetic algorithms may also be used to evaluate the fitness of other algorithms.
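The sketch below is a self-contained toy illustration of these ideas, not an algorithm from the slides. Rules are bit strings (a care/value bit pair per Boolean attribute plus a class bit, an encoding assumed here for simplicity), fitness is classification accuracy on the training samples, and new rules are produced by crossover and mutation.

```python
import random

K = 3                    # number of Boolean attributes per training tuple
RULE_LEN = 2 * K + 1     # a (care, value) bit pair per attribute + a class bit

def classify(rule, x):
    """Apply a bit-string rule to a Boolean attribute vector x."""
    for i in range(K):
        care, value = rule[2 * i], rule[2 * i + 1]
        if care and x[i] != value:      # antecedent fails
            return 1 - rule[-1]
    return rule[-1]                      # antecedent matches: predict class bit

def fitness(rule, samples):
    """Fitness = classification accuracy of the rule on the training samples."""
    return sum(classify(rule, x) == y for x, y in samples) / len(samples)

def crossover(a, b):
    """Swap substrings of two parent rules at a random cut point."""
    cut = random.randrange(1, RULE_LEN)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(rule, rate=0.05):
    """Invert randomly selected bits in the rule's string."""
    return [bit ^ 1 if random.random() < rate else bit for bit in rule]

# Toy training data: class = (attribute 0 AND attribute 1).
samples = [((a, b, c), int(a and b))
           for a in (0, 1) for b in (0, 1) for c in (0, 1)]

population = [[random.randint(0, 1) for _ in range(RULE_LEN)] for _ in range(20)]
for generation in range(30):
    population.sort(key=lambda r: fitness(r, samples), reverse=True)
    survivors = population[:10]                     # survival of the fittest
    offspring = []
    while len(offspring) < 10:
        p1, p2 = random.sample(survivors, 2)
        c1, c2 = crossover(p1, p2)
        offspring += [mutate(c1), mutate(c2)]
    population = survivors + offspring

best = max(population, key=lambda r: fitness(r, samples))
print("best rule:", best, "accuracy:", fitness(best, samples))
```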
  • 154. Other Classification Methods  Rough Set Approach  Rough set theory can be used for classification to discover structural relationships within imprecise or noisy data.  It applies to discrete-valued attributes. Continuous-valued attributes must therefore be discretized before its use.
  • 155. Other Classification Methods  Rough Set Approach  Rough set theory is based on the establishment of equivalence classes within the given training data. Rough sets can be used to approximately or “roughly” define such classes.  A rough set definition for a given class, C, is approximated by two sets—a lower approximation of C and an upper approximation of C.
  • 156. Other Classification Methods  Rough Set Approach  The lower approximation of C consists of all of the data tuples that, based on the knowledge of the attributes, are certain to belong to C without ambiguity.  The upper approximation of C consists of all of the tuples that, based on the knowledge of the attributes, cannot be described as not belonging to C.
  • 157. Other Classification Methods  Rough Set Approach  Rough sets can also be used for attribute subset selection and relevance analysis.
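A small sketch of the lower and upper approximations, using invented tuples: equivalence classes group tuples with identical attribute values; equivalence classes lying wholly inside the target class C form the lower approximation, and those that merely overlap C form the upper approximation.

```python
from collections import defaultdict

# Invented training tuples: (attribute values, class label). Tuples 2 and 3
# have identical attribute values but different labels, so they can only
# appear in the upper approximation of class C = "yes".
data = [
    (("high", "yes"), "yes"),   # 0
    (("high", "no"),  "yes"),   # 1
    (("low",  "no"),  "yes"),   # 2
    (("low",  "no"),  "no"),    # 3
    (("low",  "yes"), "no"),    # 4
]

# Equivalence classes: indices of tuples with identical attribute values.
equiv = defaultdict(set)
for idx, (attrs, _) in enumerate(data):
    equiv[attrs].add(idx)

target = {i for i, (_, label) in enumerate(data) if label == "yes"}   # class C

lower = {i for e in equiv.values() if e <= target for i in e}   # certainly in C
upper = {i for e in equiv.values() if e & target for i in e}    # possibly in C

print("lower approximation of C:", sorted(lower))   # [0, 1]
print("upper approximation of C:", sorted(upper))   # [0, 1, 2, 3]
```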
  • 158. Other Classification Methods  Fuzzy Set Approaches  Fuzzy set theory is also known as possibility theory.  It lets us work at a high level of abstraction and offers a means for dealing with imprecise measurement of data.  Fuzzy set theory allows us to deal with vague or inexact facts.  Fuzzy set theory is useful for data mining systems performing rule-based classification.
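As a minimal illustration (the income thresholds below are invented for the example), a fuzzy set such as "medium income" assigns each value a degree of membership between 0 and 1 rather than a hard yes/no, which is what lets fuzzy rules handle imprecise measurements.

```python
def medium_income(x):
    """Triangular membership function for the fuzzy set 'medium income'.

    Membership degrees lie in [0, 1] instead of a crisp 0-or-1 decision.
    """
    if 30_000 <= x <= 50_000:
        return (x - 30_000) / 20_000
    if 50_000 < x <= 70_000:
        return (70_000 - x) / 20_000
    return 0.0

print(medium_income(45_000))   # 0.75: the value is 'medium' to degree 0.75
```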
  • 159. Prediction  Numeric prediction is the task of predicting continuous (or ordered) values for given input.  For example, to predict the salary of college graduates with 10 years of work experience, or the potential sales of a new product given its price.
  • 160. Prediction  Regression analysis can be used to model the relationship between one or more independent or predictor variables and a dependent or response variable (which is continuous-valued).  In the context of data mining, the predictor variables are the attributes of interest describing the tuple (i.e., making up the attribute vector).
  • 161. Prediction  Linear Regression  Straight-line regression analysis involves a response variable, y, and a single predictor variable, x.  It is the simplest form of regression, and models y as a linear function of x. That is, y = b + wx.  The regression coefficients, w and b, can also be thought of as weights, so that we can equivalently write, y = w0+w1x.
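A sketch of straight-line regression by the method of least squares, using NumPy and invented (x, y) pairs; the closed-form estimates of w1 and w0 below are the standard least-squares formulas.

```python
import numpy as np

# Invented data: x = years of experience, y = salary (in $1000s).
x = np.array([3.0, 8.0, 9.0, 13.0, 3.0, 6.0, 11.0, 21.0, 1.0, 16.0])
y = np.array([30.0, 57.0, 64.0, 72.0, 36.0, 43.0, 59.0, 90.0, 20.0, 83.0])

# Standard least-squares estimates for y = w0 + w1*x:
#   w1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))**2)
#   w0 = mean(y) - w1 * mean(x)
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print(f"fitted line: y = {w0:.2f} + {w1:.2f} x")
print("predicted salary for x = 10 years:", round(w0 + w1 * 10, 1))
```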
  • 163. Prediction  Linear Regression  Multiple linear regression is an extension of straight-line regression so as to involve more than one predictor variable.  An example of a multiple linear regression model based on two predictor attributes or variables, A1 and A2, is y = w0+w1x1+w2x2.
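A corresponding sketch for two predictor variables, again with made-up data; the intercept is handled by adding a column of 1s to the design matrix and solving the least-squares problem with np.linalg.lstsq.

```python
import numpy as np

# Invented data generated from y = 1 + 2*A1 + 0.5*A2 (two predictor attributes).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([4.0, 5.5, 9.0, 10.5, 13.5])

# Prepend a column of 1s so the intercept w0 is estimated along with w1, w2,
# i.e. the model y = w0 + w1*x1 + w2*x2 becomes a single matrix problem.
A = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print("w0, w1, w2 =", np.round(w, 3))   # recovers [1.0, 2.0, 0.5]
```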
  • 164. Prediction  Nonlinear Regression:  Polynomial regression is often of interest when there is just one predictor variable.  It can be modeled by adding polynomial terms to the basic linear model.  By applying transformations to the variables, we can convert the nonlinear model into a linear one that can then be solved by the method of least squares.
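A sketch of the transformation idea: a cubic model is fitted by treating x, x^2, and x^3 as three separate predictor variables and solving the resulting linear least-squares problem (synthetic, noise-free data invented for the example).

```python
import numpy as np

# Synthetic, noise-free data from the cubic y = 1 + 2x - 0.5x^2 + 0.3x^3.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.3 * x**3

# Treating x1 = x, x2 = x^2, x3 = x^3 as separate predictor variables turns
# the polynomial model into a multiple linear regression solvable by least squares.
A = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print("recovered coefficients:", np.round(w, 3))   # ~ [1.0, 2.0, -0.5, 0.3]
```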