1. 20IT501 – Data Warehousing and
Data Mining
III Year / V Semester
2. UNIT III - ASSOCIATION RULE
MINING AND CLASSIFICATION
Mining Frequent Patterns, Associations and Correlations –
Mining Methods – Mining various Kinds of Association Rules
– Correlation Analysis – Constraint Based Association Mining
– Classification and Prediction - Basic Concepts - Decision
Tree Induction - Bayesian Classification – Rule Based
Classification – Classification by Back propagation – Support
Vector Machines - Other Classification Methods-Facsimile
extraction classification
3. Basic Concepts
Market Basket Analysis:
Frequent itemset mining leads to the discovery of associations and
correlations among items in large transactional or relational data
sets.
The discovery of interesting correlation relationships among huge
amounts of business transaction records can help in many business
decision-making processes, such as catalog design, cross-marketing,
and customer shopping behavior analysis
4. Basic Concepts
A typical example of frequent itemset mining is market
basket analysis.
This process analyzes customer buying habits by finding
associations between the different items that customers
place in their “shopping baskets”.
The discovery of such associations can help retailers
develop marketing strategies by gaining insight into which
items are frequently purchased together by customers.
5. Basic Concepts
Each item has a Boolean variable representing the presence
or absence of that item.
Each basket can then be represented by a Boolean vector of
values assigned to these variables.
The Boolean vectors can be analyzed for buying patterns
that reflect items that are frequently associated or purchased
together. These patterns can be represented in the form of
association rules.
6. Basic Concepts
The information that customers who purchase
computers also tend to buy antivirus software at the
same time is represented by the following association rule:
computer ⇒ antivirus software [support = 2%,
confidence = 60%]
Rule support and confidence are two measures of rule
interestingness.
7. Basic Concepts
A support of 2% for this association rule means that 2% of all
the transactions under analysis show that computer and
antivirus software are purchased together.
A confidence of 60% means that 60% of the customers who
purchased a computer also bought the software.
Association rules are considered interesting if they satisfy
both a minimum support threshold and a minimum
confidence threshold.
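As a quick illustration, support and confidence can be computed directly from a transaction list. The helper functions and the toy transactions below are a sketch, not part of the slides:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t)) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """confidence(A => B) = support(A union B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Toy transactions (hypothetical data)
transactions = [
    {"computer", "antivirus"}, {"computer"}, {"printer"},
    {"computer", "antivirus"}, {"computer", "printer"},
]
sup = support({"computer", "antivirus"}, transactions)        # 2/5 = 0.4
conf = confidence({"computer"}, {"antivirus"}, transactions)  # 2/4 = 0.5
```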
8. Frequent Itemsets, Closed Itemsets,
and Association Rules
A set of items is referred to as an itemset.
An itemset that contains k items is a k-itemset.
The set {computer, antivirus software} is a 2-
itemset.
The occurrence frequency of an itemset is the
number of transactions that contain the itemset.
9. Frequent Itemsets, Closed Itemsets,
and Association Rules
In general, association rule mining can be viewed as a
two-step process:
Find all frequent itemsets: By definition, each of these
itemsets will occur at least as frequently as a predetermined
minimum support count, min sup.
Generate strong association rules from the frequent
itemsets: By definition, these rules must satisfy minimum
support and minimum confidence.
10. Frequent Pattern Mining
Market basket analysis is just one form of
frequent pattern mining.
There are many kinds of frequent patterns,
association rules, and correlation relationships.
11. Frequent Pattern Mining -
Classification
Based on the completeness of patterns to be mined:
Mine the complete set of frequent itemsets, the closed
frequent itemsets, and the maximal frequent itemsets, given
a minimum support threshold.
Mine constrained frequent itemsets (i.e., those that satisfy
a set of user-defined constraints), approximate frequent
itemsets (i.e., those that derive only approximate support
counts for the mined frequent itemsets),
12. Frequent Pattern Mining -
Classification
Based on the completeness of patterns to be
mined:
Mine near-match frequent itemsets (i.e., those that
tally the support count of the near or almost matching
itemsets),
Mine top-k frequent itemsets (i.e., the k most frequent
itemsets for a user-specified value, k), and so on.
13. Frequent Pattern Mining -
Classification
Based on the levels of abstraction involved in the rule
set:
suppose that a set of association rules mined includes the
following rules where X is a variable representing a
customer:
buys(X, “computer”) ⇒ buys(X, “HP printer”)
buys(X, “laptop computer”) ⇒ buys(X, “HP printer”)
Here, computer is a higher-level abstraction of laptop computer.
14. Frequent Pattern Mining -
Classification
Based on the number of data dimensions
involved in the rule:
If the items or attributes in an association rule
reference only one dimension, then it is a single-
dimensional association rule.
buys(X, “computer”) ⇒ buys(X, “HP printer”)
15. Frequent Pattern Mining -
Classification
Based on the number of data dimensions involved
in the rule:
If a rule references two or more dimensions, such as
the dimensions age, income, and buys, then it is a
multidimensional association rule.
age(X, “30...39”) ∧ income(X, “42K...48K”) ⇒
buys(X, “high resolution TV”).
16. Frequent Pattern Mining -
Classification
Based on the types of values handled in the rule:
If a rule involves associations between the presence or
absence of items, it is a Boolean association rule.
If a rule describes associations between quantitative
items or attributes, then it is a quantitative association
rule. Note that the quantitative attributes, age and
income, have been discretized.
17. Frequent Pattern Mining -
Classification
Based on the kinds of rules to be mined:
Association rules are the most popular kind of rules
generated from frequent patterns.
Based on the kinds of patterns to be mined:
Frequent itemset mining, that is, the mining of
frequent itemsets (sets of items) from transactional or
relational data sets.
18. Frequent Pattern Mining -
Classification
Based on the kinds of patterns to be mined:
Frequent itemset mining, that is, the mining of
frequent itemsets (sets of items) from transactional or
relational data sets.
Sequential pattern mining searches for frequent
subsequences in a sequence data set, where a sequence
records an ordering of events.
19. Frequent Itemset Mining Methods
Apriori, the basic algorithm for finding
frequent itemsets.
To generate strong association rules from
frequent itemsets
20. The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
Apriori is a seminal algorithm proposed by R.
Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association
rules.
The algorithm uses prior knowledge of
frequent itemset properties
21. The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
To improve the efficiency of the level-wise
generation of frequent itemsets, an important
property called the Apriori property is used to
reduce the search space.
Apriori property: All nonempty subsets of a
frequent itemset must also be frequent.
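The Apriori property yields a cheap pruning test: a candidate k-itemset can be skipped if any of its (k-1)-subsets is not frequent. A minimal sketch (the function name and the toy L2 set are assumptions for illustration):

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """True if some (k-1)-subset of the candidate k-itemset is missing
    from the frequent (k-1)-itemsets -- such a candidate cannot be frequent."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

# Hypothetical frequent 2-itemsets
L2 = {frozenset(s) for s in ({"A","B"}, {"A","C"}, {"B","C"}, {"B","D"})}
keep = not has_infrequent_subset(frozenset({"A","B","C"}), L2)  # all 2-subsets frequent
prune = has_infrequent_subset(frozenset({"B","C","D"}), L2)     # {C,D} not frequent
```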
22. The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
Step 1: In the first iteration of the algorithm, each item is a
member of the set of candidate 1-itemsets, C1. The
algorithm simply scans all of the transactions in order to
count the number of occurrences of each item.
Step 2: Suppose that the minimum support count required
is 2, that is, min sup = 2. The set of frequent 1-itemsets, L1,
can then be determined. It consists of the candidate 1-
itemsets satisfying minimum support.
23. The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
TID List of item_IDS
T1 I1, I2, I5
T2 I2, I4
T3 I2, I3
T4 I1, I2, I4
T5 I1, I3
T6 I2, I3
T7 I1, I3
T8 I1, I2, I3, I5
T9 I1, I2, I3
24. The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
Scan D for the count of each candidate in C1, then compare each
candidate’s support count with the minimum support count to
obtain L1. With min_sup = 2, every candidate qualifies, so L1 = C1:

Itemset   Support Count
I1        6
I2        7
I3        6
I4        2
I5        2
25. The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
Step 3: To discover the set of frequent 2-itemsets, L2, the algorithm
uses the join L1 on L1 to generate a candidate set of 2-itemsets, C2.
Step 4: Next, the transactions in D are scanned and the support
count of each candidate itemset in C2 is accumulated.
Step 5: The set of frequent 2-itemsets, L2, is then determined,
consisting of those candidate 2-itemsets in C2 having minimum
support
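Steps 1 through 5 can be traced on the nine transactions of the example table. The sketch below (helper names are assumptions) reproduces L1, L2, and L3 with min_sup = 2:

```python
from itertools import combinations

# The nine transactions from the example table
D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
     {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
     {"I1","I2","I3"}]
MIN_SUP = 2

def count(itemset):
    """Number of transactions in D containing the itemset."""
    return sum(1 for t in D if set(itemset) <= t)

# C1 -> L1: count each item, keep those meeting min_sup
items = sorted(set().union(*D))
L1 = {frozenset({i}) for i in items if count({i}) >= MIN_SUP}

# L1 join L1 -> C2, scan D -> L2
C2 = {a | b for a in L1 for b in L1 if len(a | b) == 2}
L2 = {c: count(c) for c in C2 if count(c) >= MIN_SUP}

# L2 join L2 -> C3 (with Apriori pruning of infrequent subsets), scan D -> L3
C3 = {a | b for a in L2 for b in L2 if len(a | b) == 3
      and all(frozenset(s) in L2 for s in combinations(a | b, 2))}
L3 = {c: count(c) for c in C3 if count(c) >= MIN_SUP}
# L3 holds {I1,I2,I3} and {I1,I2,I5}, each with support count 2
```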
27. The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
Step 6: The generation of the set of candidate 3-
itemsets.
Step 7: The transactions in D are scanned in order to
determine L3, consisting of those candidate 3-itemsets
in C3 having minimum support.
Step 8: The algorithm uses to generate a candidate set
of 4-itemsets, C4.
28. The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
Generate the C3 candidates from L2, scan D for the count of each
candidate, then compare each candidate’s support count with the
minimum support count to obtain L3. Both candidates qualify, so L3 = C3:

Itemset       Support Count
I1, I2, I3    2
I1, I2, I5    2
29. Generating Association Rules from
Frequent Itemsets
Once the frequent itemsets from transactions in a database
D have been found, it is straightforward to generate strong
association rules from them (where strong association rules
satisfy both minimum support and minimum confidence).
confidence(A ⇒ B) = P(B | A) = support_count(A ∪ B) /
support_count(A)
30. Generating Association Rules from
Frequent Itemsets
Association rules can be generated as follows:
For each frequent itemset l, generate all nonempty
subsets of l.
For every nonempty subset s of l, output the rule
“s ⇒ (l - s)” if its confidence satisfies the minimum
confidence threshold.
31. Generating Association Rules from
Frequent Itemsets
Suppose the data contain the frequent itemset l =
{I1, I2, I5}
The nonempty subsets of l are {I1, I2}, {I1, I5},
{I2, I5}, {I1}, {I2}, and {I5}.
The resulting association rules are as shown
below, each listed with its confidence:
I1 ∧ I2 ⇒ I5, confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2, confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5, confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5, confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2, confidence = 2/2 = 100%
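For l = {I1, I2, I5}, the rule confidences follow mechanically from the support counts found earlier. A small sketch (variable names are assumptions):

```python
from itertools import combinations

# Support counts taken from the worked Apriori example
sup = {frozenset(k): v for k, v in {
    ("I1",): 6, ("I2",): 7, ("I5",): 2,
    ("I1","I2"): 4, ("I1","I5"): 2, ("I2","I5"): 2,
    ("I1","I2","I5"): 2}.items()}

l = frozenset({"I1", "I2", "I5"})
rules = {}
for r in range(1, len(l)):                 # every nonempty proper subset s
    for s in combinations(sorted(l), r):
        s = frozenset(s)
        rules[(s, l - s)] = sup[l] / sup[s]   # confidence of s => (l - s)
# e.g. I1 ^ I2 => I5 has confidence 2/4 = 50%
```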
32. Improving the Efficiency of Apriori
Hash-based technique:
A hash-based technique can be used to reduce the size
of the candidate k-itemsets, Ck, for k > 1.
Generate all of the itemsets for each transaction, hash
(i.e., map) them into the different buckets of a hash
table structure, and increase the corresponding bucket
counts.
34. Improving the Efficiency of Apriori
Transaction reduction:
A transaction that does not contain any frequent k-
itemsets cannot contain any frequent (k+1)-itemsets.
Therefore, such a transaction can be marked or
removed from further consideration because
subsequent scans of the database for j-itemsets, where j
> k, will not require it.
36. Improving the Efficiency of Apriori
Partitioning:
A partitioning technique can be used that requires just two
database scans to mine the frequent itemsets.
In Phase I, the algorithm subdivides the transactions of D
into n non overlapping partitions.
If the minimum support threshold for transactions in D is
min_sup, then the minimum support count for a partition is
min_sup × the number of transactions in that partition.
37. Improving the Efficiency of Apriori
Partitioning:
For each partition, all frequent itemsets within the partition
are found. These are referred to as local frequent itemsets.
Any itemset that is potentially frequent with respect to D
must occur as a frequent itemset in at least one of the
partitions. Therefore, all local frequent itemsets are
candidate itemsets with respect to D.
38. Improving the Efficiency of Apriori
Partitioning:
In Phase II, a second scan of D is conducted in which
the actual support of each candidate is assessed in
order to determine the global frequent itemsets.
Partition size and the number of partitions are set so
that each partition can fit into main memory and
therefore be read only once in each phase.
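A minimal sketch of the two-phase idea, restricted to 1-itemsets for brevity (all function names are assumptions):

```python
from collections import Counter

def local_frequent(partition, min_sup_count):
    """Phase I: frequent 1-itemsets local to one partition."""
    c = Counter()
    for t in partition:
        for item in t:
            c[frozenset({item})] += 1
    return {s for s, n in c.items() if n >= min_sup_count}

def partition_mine(D, n_parts, min_sup):
    """min_sup is a fraction; local thresholds scale with partition size."""
    parts = [D[i::n_parts] for i in range(n_parts)]
    # Phase I: union of local frequent itemsets = global candidate set
    candidates = set()
    for p in parts:
        candidates |= local_frequent(p, max(1, int(min_sup * len(p))))
    # Phase II: one full scan of D to check actual (global) supports
    return {c for c in candidates
            if sum(1 for t in D if c <= t) >= min_sup * len(D)}

D = [{"A","B"}, {"A"}, {"B"}, {"A"}, {"C"}, {"C"}]
globally_frequent = partition_mine(D, 2, 0.5)   # only {"A"} meets 50% support
```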
39. Improving the Efficiency of Apriori
Sampling
The basic idea of the sampling approach is to pick a
random sample S of the given data D, and then search for
frequent itemsets in S instead of D.
The sample size of S is such that the search for frequent
itemsets in S can be done in main memory, and so only one
scan of the transactions in S is required overall.
40. Improving the Efficiency of Apriori
Dynamic itemset counting
The database is partitioned into blocks marked by start
points.
New candidate itemsets can be added at any start
point, unlike in Apriori, which determines new
candidate itemsets only immediately before each
complete database scan.
41. Mining Frequent Itemsets without
Candidate Generation
The Apriori candidate generate-and-test method
significantly reduces the size of candidate sets, leading
to good performance gain.
Issues:
It may need to generate a huge number of candidate sets.
It may need to repeatedly scan the database and check a
large set of candidates by pattern matching.
42. Mining Frequent Itemsets without
Candidate Generation
An interesting method in this attempt is called frequent-pattern
growth, or simply FP-growth, which adopts a divide-and-conquer
strategy as follows.
First, it compresses the database representing frequent items into a
frequent-pattern tree, or FP-tree, which retains the itemset
association information.
It then divides the compressed database into a set of conditional
databases, each associated with one frequent item or “pattern
fragment,” and mines each such database separately.
44. Mining Frequent Itemsets without
Candidate Generation
The FP-growth method transforms the problem of
finding long frequent patterns to searching for shorter
ones recursively and then concatenating the suffix.
It uses the least frequent items as a suffix, offering good
selectivity.
The method substantially reduces the search costs.
45. Mining Various Kinds of Association Rules
Mining Multilevel Association Rules
Strong associations discovered at high levels of
abstraction may represent commonsense knowledge.
Therefore, data mining systems should provide
capabilities for mining association rules at multiple
levels of abstraction, with sufficient flexibility for easy
traversal among different abstraction spaces.
46. Mining Various Kinds of Association Rules
Mining Multilevel Association Rules
Association rules generated from mining data at
multiple levels of abstraction are called multiple-level
or multilevel association rules.
Multilevel association rules can be mined efficiently
using concept hierarchies under a support-confidence
framework.
47. Mining Various Kinds of Association Rules
Mining Multilevel Association Rules - Using uniform
minimum support for all levels:
The same minimum support threshold is used when mining
at each level of abstraction.
When a uniform minimum support threshold is used, the
search procedure is simplified.
The method is also simple in that users are required to
specify only one minimum support threshold.
48. Mining Various Kinds of Association Rules
Mining Multilevel Association Rules - Using
reduced minimum support at lower levels:
Each level of abstraction has its own minimum
support threshold.
The deeper the level of abstraction, the smaller the
corresponding threshold is.
50. Mining Various Kinds of Association Rules
Mining Multidimensional Association Rules from
Relational Databases and Data Warehouses:
Database attributes can be categorical or quantitative.
Categorical attributes have a finite number of possible
values, with no ordering among the values (e.g.,
occupation, brand, color).
Categorical attributes are also called nominal attributes,
because their values are “names of things.”
51. Mining Various Kinds of Association Rules
Mining Multidimensional Association Rules from
Relational Databases and Data Warehouses:
Quantitative attributes are numeric and have an
implicit ordering among values (e.g., age, income,
price).
Quantitative attributes are discretized using
predefined concept hierarchies.
52. Mining Various Kinds of Association Rules
Mining Multidimensional Association Rules from
Relational Databases and Data Warehouses:
A concept hierarchy for income may be used to
replace the original numeric values of this attribute by
interval labels, such as “0...20K”, “21K...30K”,
“31K...40K”, and so on.
Here, discretization is static and predetermined.
53. Mining Various Kinds of Association Rules
Mining Multidimensional Association Rules from
Relational Databases and Data Warehouses:
Quantitative attributes are discretized or clustered into
“bins” based on the distribution of the data.
The discretization process is dynamic and established
so as to satisfy some mining criteria, such as
maximizing the confidence of the rules mined.
54. Constraint-Based Association Mining
A good heuristic is to have the users specify such
intuition or expectations as constraints to confine the
search space. This strategy is known as constraint-
based mining.
Constraints: Knowledge type constraints, Data
constraints, Dimension/level constraints,
Interestingness constraints and Rule constraints.
55. Constraint-Based Association Mining
Meta rule-Guided Mining of Association Rules:
Metarules allow users to specify the syntactic form of rules
that they are interested in mining.
The rule forms can be used as constraints to help improve
the efficiency of the mining process.
Metarules may be based on the analyst’s experience,
expectations, or intuition regarding the data or may be
automatically generated based on the database schema.
56. Constraint-Based Association Mining
Constraint Pushing: Mining Guided by Rule Constraints:
Rule constraints specify expected set/subset relationships of the
variables in the mined rules, constant initiation of variables, and
aggregate functions.
Rule constraints can be classified into the following five
categories with respect to frequent itemset mining: (1)
antimonotonic, (2) monotonic, (3) succinct, (4) convertible, and
(5) inconvertible.
57. Classification and Prediction
The data analysis task is classification, where a
model or classifier is constructed to predict
categorical labels,
“safe” or “risky” for the loan application data;
“yes” or “no” for the marketing data; or
“treatment A,” “treatment B,” or “treatment C” for the
medical data.
58. Classification and Prediction
The data analysis task is an example of numeric
prediction, where the model constructed predicts a
continuous-valued function, or ordered value, as
opposed to a categorical label. This model is a
predictor.
Regression analysis is a statistical methodology that is
most often used for numeric prediction
59. Classification and Prediction
How does classification work?
A classifier is built describing a predetermined set of
data classes or concepts.
This is the learning step (or training phase), where a
classification algorithm builds the classifier by
analyzing or “learning from” a training set made up of
database tuples and their associated class labels.
60. Classification and Prediction
How does classification work?
Each tuple, X, is assumed to belong to a predefined
class as determined by another database attribute
called the class label attribute.
The class label attribute is discrete-valued and
unordered. It is categorical in that each value serves as
a category or class.
61. Classification and Prediction
How does classification work?
The class label of each training tuple is provided, this
step is also known as supervised learning.
Unsupervised learning (or clustering), in which the
class label of each training tuple is not known, and the
number or set of classes to be learned may not be
known in advance.
62. Classification and Prediction
What about classification accuracy?
First, the predictive accuracy of the classifier is estimated.
A test set is used, made up of test tuples and their
associated class labels.
The accuracy of a classifier on a given test set is the
percentage of test set tuples that are correctly classified by
the classifier.
63. Issues Regarding Classification and
Prediction
Preparing the Data for Classification and Prediction:
Data cleaning: This refers to the preprocessing of data in
order to remove or reduce noise and the treatment of
missing values.
Relevance analysis: Many of the attributes in the data may
be redundant. Attribute subset selection can be used in
these cases to find a reduced set of attributes.
64. Issues Regarding Classification and
Prediction
Preparing the Data for Classification and Prediction:
Data transformation and reduction: The data may be
transformed by normalization. Normalization involves scaling all
values for a given attribute so that they fall within a small
specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
Concept hierarchies may be used for this purpose. Numeric
values for the attribute income can be generalized to discrete
ranges, such as low, medium, and high.
65. Comparing Classification and Prediction
Methods
Accuracy: The accuracy of the classifier can be referred to
as the ability of the classifier to predict the class label
correctly, and the accuracy of the predictor can be referred
to as how well a given predictor can estimate the unknown
value.
Speed: The speed of the method depends on the
computational cost of generating and using the classifier or
predictor.
66. Comparing Classification and Prediction
Methods
Robustness: Robustness is the ability of the classifier or predictor to
make correct predictions given noisy data or data with missing values.
Scalability: Scalability refers to the ability to construct the classifier
or predictor efficiently given large amounts of data.
Interpretability: Interpretability is how readily we can understand
the reasoning behind predictions or classification made by the
predictor or classifier.
67. Classification by Decision Tree Induction
A decision tree is a flowchart-like tree structure,
where each internal node (nonleaf node) denotes a
test on an attribute, each branch represents an
outcome of the test, and each leaf node (or
terminal node) holds a class label.
The topmost node in a tree is the root node.
69. Classification by Decision Tree Induction
How are decision trees used for classification?
Given a tuple, X, for which the associated class label is
unknown, the attribute values of the tuple are tested against
the decision tree.
A path is traced from the root to a leaf node, which holds
the class prediction for that tuple.
Decision trees can easily be converted to classification
rules.
70. Classification by Decision Tree Induction
Why are decision tree classifiers so popular?
The construction of decision tree classifiers does not
require any domain knowledge or parameter setting.
Decision trees can handle high dimensional data.
Decision tree induction algorithms have been used for
classification in many application areas, such as medicine,
manufacturing and production, financial analysis,
astronomy, and molecular biology.
71. Decision Tree Induction
Algorithm: Generate decision tree. Generate a decision tree from the training
tuples of data partition D.
Input:
Data partition, D, which is a set of training tuples and their associated class labels;
attribute list, the set of candidate attributes;
Attribute selection method, a procedure to determine the splitting criterion that
“best” partitions the data tuples into individual classes. This criterion consists of a
splitting attribute and, possibly, either a split point or splitting subset.
Output: A decision tree.
72. Decision Tree Induction
(1) create a node N;
(2) if tuples in D are all of the same class, C then
(3) return N as a leaf node labeled with the class C;
(4) if attribute list is empty then
(5) return N as a leaf node labeled with the majority class in D; //
majority voting
(6) apply Attribute selection method(D, attribute list) to find the “best”
splitting criterion;
(7) label node N with splitting criterion;
73. Decision Tree Induction
(8) if splitting attribute is discrete-valued and multiway
splits allowed then // not restricted to binary trees
(9) attribute_list ← attribute_list - splitting attribute;
// remove splitting attribute
(10) for each outcome j of splitting criterion
//partition the tuples and grow subtrees for each
partition
74. Decision Tree Induction
(11) let Dj be the set of data tuples in D satisfying outcome j; // a
partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to
node N;
(14) else attach the node returned by Generate decision tree(Dj,
attribute list) to node N;
endfor
(15) return N;
77. Decision Tree Induction
RID  Age          Income  Student  Credit_Rating  Class: buys_computer
1    Youth        High    No       Fair           No
2    Youth        High    No       Excellent      No
3    Middle_aged  High    No       Fair           Yes
4    Senior       Medium  No       Fair           Yes
5    Senior       Low     Yes      Fair           Yes
6    Senior       Low     Yes      Excellent      No
7    Middle_aged  Low     Yes      Excellent      Yes
8    Youth        Medium  No       Fair           No
9    Youth        Low     Yes      Fair           Yes
10   Senior       Medium  Yes      Fair           Yes
11   Youth        Medium  Yes      Excellent      Yes
12   Middle_aged  Medium  No       Excellent      Yes
13   Middle_aged  High    Yes      Fair           Yes
14   Senior       Medium  No       Excellent      No
78. Decision Tree Induction
Attribute Selection Measures:
Information gain (or) Entropy:
Info(D) = - Σ pi log2(pi), where pi is the probability that a tuple
in D belongs to class Ci. For the table above, with 9 “yes” and 5
“no” tuples:
Info(D) = - 9/14 log2(9/14) - 5/14 log2(5/14)
= 0.940 bits
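The entropy calculation can be checked in a few lines. Splitting the 14 tuples on age (youth: 2 yes / 3 no, middle_aged: 4 / 0, senior: 3 / 2) also gives the information gain of that attribute:

```python
from math import log2

def info(counts):
    """Expected information (entropy) of a class distribution, in bits."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

# 9 "yes" and 5 "no" tuples overall
info_D = info([9, 5])                          # ≈ 0.940 bits

# Split on age: youth -> 2 yes / 3 no, middle_aged -> 4 / 0, senior -> 3 / 2
info_age = (5/14)*info([2, 3]) + (4/14)*info([4, 0]) + (5/14)*info([3, 2])
gain_age = info_D - info_age                   # ≈ 0.246 bits
```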
83. Decision Tree Induction
Gain ratio:
The information gain measure is biased toward tests with
many outcomes.
For example, consider an attribute that acts as a unique
identifier, such as product ID. A split on product ID would
result in a large number of partitions (as many as there are
values), each one containing just one tuple.
84. Decision Tree Induction
Gain ratio:
C4.5, a successor of ID3, uses an extension to
information gain known as gain ratio, which attempts
to overcome this bias.
It applies a kind of normalization to information gain
using a “split information” value defined analogously
with Info(D).
85. Decision Tree Induction
Gain ratio:
The attribute with the maximum gain ratio is selected
as the splitting attribute.
86. Decision Tree Induction
Gini index:
The Gini index is used in CART.
The Gini index measures the impurity of D, a data
partition or set of training tuples, as
Gini(D) = 1 - Σ pi², where pi is the probability that a
tuple in D belongs to class Ci.
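The same class counts (9 "yes", 5 "no") give the Gini impurity of D; a one-function sketch:

```python
def gini(counts):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Gini(D) = 1 - (9/14)^2 - (5/14)^2
gini_D = gini([9, 5])    # ≈ 0.459
```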
87. Decision Tree Induction
Tree Pruning
When a decision tree is built, many of the
branches will reflect anomalies in the training data
due to noise or outliers.
Tree pruning methods address this problem of
overfitting the data.
88. Decision Tree Induction
How does tree pruning work?
In the prepruning approach, a tree is “pruned” by halting
its construction early.
The second and more common approach is postpruning,
which removes subtrees from a “fully grown” tree.
A subtree at a given node is pruned by removing its
branches and replacing it with a leaf.
89. Bayesian Classification
Bayesian classifiers are statistical classifiers.
They can predict class membership
probabilities, such as the probability that a
given tuple belongs to a particular class.
Bayesian classification is based on Bayes’
theorem
90. Bayesian Classification
Bayes’ Theorem
Let X be a data tuple.
Let H be some hypothesis, such as that the data tuple X
belongs to a specified class C.
For classification problems, we want to determine P(H | X),
the probability that tuple X belongs to
class C, given that we know the attribute description of X.
91. Bayesian Classification
Bayes’ Theorem
P(H | X) is the posterior probability, or a posteriori
probability, of H conditioned on X.
P(H) is the prior probability, or a priori probability, of H.
Similarly, P(X | H) is the posterior probability of X
conditioned on H.
P(X) is the prior probability of X.
92. Bayesian Classification
Naïve Bayesian Classification
In theory, Bayesian classifiers have the minimum error rate in
comparison to all other classifiers.
Bayesian classifiers are also useful in that they
provide a theoretical justification for other classifiers
that do not explicitly use Bayes’ theorem.
93. Bayesian Classification
Naïve Bayesian Classification
Let D be a training set of tuples and their
associated class labels. Each tuple is represented
by an n-dimensional attribute vector, X = (x1, x2,
…, xn), from n attributes, respectively, A1, A2,
…, An.
94. Bayesian Classification
Naïve Bayesian Classification
Suppose that there are m classes, C1, C2, …, Cm.
Given a tuple, X, the classifier will predict that X belongs to
the class having the highest posterior probability, conditioned
on X.
That is, the naïve Bayesian classifier predicts that tuple X
belongs to the class Ci if and only if
95. Bayesian Classification
Naïve Bayesian Classification
Suppose that there are m classes, C1, C2, …,
Cm.
The class Ci for which P(Ci | X) is maximized is called
the maximum a posteriori hypothesis. By Bayes’ theorem,
96. Bayesian Classification
Naïve Bayesian Classification
As P(X) is constant for all classes, only P(X | Ci)
P(Ci) need be maximized.
If the class prior probabilities are not known, then it is
commonly assumed that the classes are equally likely,
that is, P(C1) = P(C2) = …. = P(Cm), and we would
therefore maximize P(X | Ci).
97. Bayesian Classification
Naïve Bayesian Classification
In order to reduce computation in evaluating P(X |
Ci), the naive assumption of class conditional
independence is made.
98. Bayesian Classification
Naïve Bayesian Classification
In order to predict the class label of X, P(X | Ci)P(Ci) is
evaluated for each class Ci. The classifier predicts that the
class label of tuple X is the class Ci if and only if
P(X | Ci)P(Ci) > P(X | Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
In other words, the predicted class label is the class Ci for
which P(X | Ci)P(Ci) is the maximum.
102. Bayesian Classification
Naïve Bayesian Classification
To find the class, Ci, that maximizes P(X | Ci)P(Ci), we
compute
P(X | buys_computer = yes)P(buys_computer = yes) =
0.044*0.643 = 0.028
P(X | buys_computer = no)P(buys_computer = no) =
0.019*0.357 = 0.007
The naïve Bayesian classifier predicts buys computer = yes for
tuple X.
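These numbers can be reproduced from the standard 14-tuple buys_computer training table, inlined below with shortened value spellings ("middle" = middle_aged); the helper name is an assumption:

```python
# The 14 training tuples (age, income, student, credit_rating, class)
data = [
    ("youth","high","no","fair","no"),
    ("youth","high","no","excellent","no"),
    ("middle","high","no","fair","yes"),
    ("senior","medium","no","fair","yes"),
    ("senior","low","yes","fair","yes"),
    ("senior","low","yes","excellent","no"),
    ("middle","low","yes","excellent","yes"),
    ("youth","medium","no","fair","no"),
    ("youth","low","yes","fair","yes"),
    ("senior","medium","yes","fair","yes"),
    ("youth","medium","yes","excellent","yes"),
    ("middle","medium","no","excellent","yes"),
    ("middle","high","yes","fair","yes"),
    ("senior","medium","no","excellent","no"),
]

def naive_bayes_score(x, cls):
    """P(X | Ci) * P(Ci) under class-conditional independence (no smoothing)."""
    rows = [r for r in data if r[-1] == cls]
    p = len(rows) / len(data)          # prior P(Ci)
    for j, v in enumerate(x):          # multiply P(xj | Ci) per attribute
        p *= sum(1 for r in rows if r[j] == v) / len(rows)
    return p

X = ("youth", "medium", "yes", "fair")
score_yes = naive_bayes_score(X, "yes")   # 0.044 * 0.643 ≈ 0.028
score_no = naive_bayes_score(X, "no")     # 0.019 * 0.357 ≈ 0.007
```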
103. Bayesian Classification
Bayesian Belief Networks:
Bayesian belief networks specify joint conditional probability
distributions.
They allow class conditional independencies to be defined
between subsets of variables.
They provide a graphical model of causal relationships, on
which learning can be performed.
Bayesian belief networks are also known as belief networks,
Bayesian networks, and probabilistic networks.
104. Bayesian Classification
Bayesian Belief Networks:
A belief network is defined by two components—
a directed acyclic graph and a set of conditional
probability tables.
105. Rule-Based Classification
The learned model is represented as a set of IF-
THEN rules.
Using IF-THEN Rules for Classification:
Rules are a good way of representing information or
bits of knowledge.
A rule-based classifier uses a set of IF-THEN rules for
classification.
106. Rule-Based Classification
Using IF-THEN Rules for Classification:
An IF-THEN rule is an expression of the form
IF condition THEN conclusion
An example is rule R1, R1: IF age = youth AND
student = yes THEN buys computer = yes.
107. Rule-Based Classification
Using IF-THEN Rules for Classification:
A rule R can be assessed by its coverage and accuracy.
Given a tuple, X, from a class labeled data set, D, let
ncovers be the number of tuples covered by R; ncorrect be
the number of tuples correctly classified by R; and |D|
be the number of tuples in D.
We can define the coverage and accuracy of R as
coverage(R) = ncovers / |D| and accuracy(R) = ncorrect / ncovers.
108. Rule-Based Classification
Using IF-THEN Rules for Classification:
A rule’s coverage is the percentage of tuples that
are covered by the rule.
For a rule’s accuracy, we look at the tuples that it covers
and see what percentage of them the rule can correctly
classify.
109. Rule-Based Classification
Using IF-THEN Rules for Classification:
An example is rule R1, R1: IF age = youth AND
student = yes THEN buys computer = yes.
Coverage (R1) = 2 / 14 = 14.28%
Accuracy (R1) = 2 / 2 = 100%
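Coverage and accuracy of R1 can be checked against the training table; the sketch below keeps only the age, student, and class columns:

```python
# (age, student, buys_computer) for the 14 training tuples
data = [("youth","no","no"), ("youth","no","no"), ("middle","no","yes"),
        ("senior","no","yes"), ("senior","yes","yes"), ("senior","yes","no"),
        ("middle","yes","yes"), ("youth","no","no"), ("youth","yes","yes"),
        ("senior","yes","yes"), ("youth","yes","yes"), ("middle","no","yes"),
        ("middle","yes","yes"), ("senior","no","no")]

# R1: IF age = youth AND student = yes THEN buys_computer = yes
covered = [t for t in data if t[0] == "youth" and t[1] == "yes"]
coverage = len(covered) / len(data)                            # 2/14 ≈ 14.28%
accuracy = sum(t[2] == "yes" for t in covered) / len(covered)  # 2/2 = 100%
```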
110. Rule-Based Classification
Using IF-THEN Rules for Classification:
Rule-based classification can be used to predict the
class label of a given tuple, X.
If a rule is satisfied by X, the rule is said to be
triggered.
X= (age = youth, income = medium, student = yes,
credit rating = fair).
111. Rule-Based Classification
Using IF-THEN Rules for Classification:
If more than one rule is triggered, we need a
conflict resolution strategy to figure out which rule
gets to fire and assign its class prediction to X.
The size ordering scheme assigns the highest priority to
the triggering rule that has the “toughest” requirements,
that is, the triggering rule with the most attribute tests is fired.
112. Rule-Based Classification
Using IF-THEN Rules for Classification:
The rule ordering scheme prioritizes the rules beforehand. The
ordering may be class based or rule-based.
With class-based ordering, the classes are sorted in order of
decreasing “importance,” such as by decreasing order of
prevalence.
That is, all of the rules for the most prevalent (or most frequent)
class come first, the rules for the next prevalent class come next,
and so on.
113. Rule-Based Classification
Using IF-THEN Rules for Classification:
With rule-based ordering, the rules are organized into one long
priority list, according to some measure of rule quality such as
accuracy, coverage, or size (number of attribute tests in the rule
antecedent), or based on advice from domain experts.
When rule ordering is used, the rule set is known as a decision
list.
With rule ordering, the triggering rule that appears earliest in the
list has highest priority, and so it gets to fire its class prediction.
114. Rule-Based Classification
Rule Extraction from a Decision Tree
To extract rules from a decision tree, one rule is
created for each path from the root to a leaf node.
Each splitting criterion along a given path is logically
ANDed to form the rule antecedent (“IF” part).
The leaf node holds the class prediction, forming the
rule consequent (“THEN” part).
115. Rule-Based Classification
Rule Extraction from a Decision Tree
R1: IF age = youth AND student = no THEN buys
computer = no
R2: IF age = youth AND student = yes THEN buys
computer = yes
R3: IF age = middle aged THEN buys computer = yes
R4: IF age = senior AND credit rating = excellent
THEN buys computer = no
R5: IF age = senior AND credit rating = fair THEN
buys computer = yes
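Firing these extracted rules can be sketched as below: one (condition, class) pair per root-to-leaf path. Because the rules come from a tree, they are mutually exclusive, so at most one rule triggers for any tuple; the senior-branch labels follow the textbook's AllElectronics tree.

```python
# Rules extracted from the decision tree, one per root-to-leaf path.
rules = [
    (lambda t: t["age"] == "youth" and t["student"] == "no",  "no"),   # R1
    (lambda t: t["age"] == "youth" and t["student"] == "yes", "yes"),  # R2
    (lambda t: t["age"] == "middle_aged",                     "yes"),  # R3
    (lambda t: t["age"] == "senior"
               and t["credit_rating"] == "excellent",         "no"),   # R4
    (lambda t: t["age"] == "senior"
               and t["credit_rating"] == "fair",              "yes"),  # R5
]

def classify(t):
    for condition, label in rules:
        if condition(t):
            return label     # the rule is triggered and fires
    return None              # no rule triggered

X = {"age": "youth", "income": "medium",
     "student": "yes", "credit_rating": "fair"}
print(classify(X))           # R2 triggers -> "yes"
```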
116. Rule-Based Classification
Rule Induction Using a Sequential Covering Algorithm:
The name comes from the notion that the rules are learned
sequentially (one at a time), where each rule for a given
class will ideally cover many of the tuples of that class (and
hopefully none of the tuples of other classes).
There are many sequential covering algorithms. Popular
variations include AQ, CN2, and the more recent, RIPPER.
117. Rule-Based Classification
Rule Induction Using a Sequential Covering Algorithm:
Rules are learned one at a time.
Each time a rule is learned, the tuples covered by the rule
are removed, and the process repeats on the remaining
tuples.
This sequential learning of rules is in contrast to decision
tree induction.
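The learn-then-remove loop can be sketched as follows. This is a highly simplified illustration, not AQ, CN2, or RIPPER themselves: each "rule" here is just the single attribute-value test with the best accuracy on the remaining data, and the dataset is hypothetical.

```python
# Simplified sequential covering: learn one rule for the target class,
# remove the tuples it covers, and repeat while positives remain.

def learn_one_rule(data, target):
    best, best_acc = None, -1.0
    attrs = [a for a in data[0] if a != "class"]
    for a in attrs:
        for v in {t[a] for t in data}:
            covered = [t for t in data if t[a] == v]
            acc = sum(t["class"] == target for t in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (a, v), acc
    return best

def sequential_covering(data, target):
    rules, remaining = [], list(data)
    while any(t["class"] == target for t in remaining):
        rule = learn_one_rule(remaining, target)
        rules.append(rule)
        a, v = rule
        remaining = [t for t in remaining if t[a] != v]  # remove covered tuples
    return rules

data = [
    {"age": "youth",  "student": "yes", "class": "yes"},
    {"age": "youth",  "student": "no",  "class": "no"},
    {"age": "senior", "student": "no",  "class": "yes"},
]
print(sequential_covering(data, "yes"))
```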
118. Classification by Backpropagation
Backpropagation is a neural network learning algorithm
A neural network is a set of connected input/output
units in which each connection has a weight associated
with it.
Neural network learning is also referred to as
connectionist learning due to the connections between
units.
121. Classification by Backpropagation
A Multilayer Feed-Forward Neural Network:
The backpropagation algorithm performs learning on a
multilayer feed-forward neural network.
It iteratively learns a set of weights for prediction of the
class label of tuples.
A multilayer feed-forward neural network consists of an
input layer, one or more hidden layers, and an output layer.
122. Classification by Backpropagation
Each layer is made up of units. The inputs to the network
correspond to the attributes measured for each training
tuple.
The inputs are fed simultaneously into the units making up
the input layer.
These inputs pass through the input layer and are then
weighted and fed simultaneously to a second layer of
“neuronlike” units, known as a hidden layer.
123. Classification by Backpropagation
The outputs of the hidden layer units can be input to
another hidden layer, and so on.
The number of hidden layers is arbitrary, although in
practice, usually only one is used.
The weighted outputs of the last hidden layer are input
to units making up the output layer, which emits the
network’s prediction for given tuples.
124. Classification by Backpropagation
The units in the input layer are called input units.
The units in the hidden layers and output layer are
sometimes referred to as neurodes, due to their
symbolic biological basis, or as output units.
A multilayer neural network with one hidden layer and
one output layer has two layers of units (the input layer
is not counted, since it performs no computation).
Therefore, we say that it is a two-layer neural network.
125. Classification by Backpropagation
Network Topology:
The user must decide on the network topology by
specifying the number of units in the input layer, the
number of hidden layers (if more than one), the number of
units in each hidden layer, and the number of units in the
output layer.
Typically, input values are normalized so as to fall between
0.0 and 1.0.
126. Classification by Backpropagation
Network Topology:
Network design is a trial-and-error process and
may affect the accuracy of the resulting trained
network.
127. Classification by Backpropagation
Backpropagation:
Backpropagation learns by iteratively processing a data set
of training tuples, comparing the network’s prediction for
each tuple with the actual known target value.
The target value may be the known class label of the
training tuple (for classification problems) or a continuous
value (for prediction).
128. Classification by Backpropagation
Backpropagation:
For each training tuple, the weights are modified so as to
minimize the mean squared error between the network’s
prediction and the actual target value.
These modifications are made in the “backwards”
direction, that is, from the output layer, through each
hidden layer down to the first hidden layer (hence the name
backpropagation).
129. Classification by Backpropagation
Backpropagation:
Initialize the weights: The weights in the network are
initialized to small random numbers (e.g., ranging
from -1.0 to 1.0, or -0.5 to 0.5).
Each unit has a bias associated with it, as explained
below. The biases are similarly initialized to small
random numbers.
130. Classification by Backpropagation
Backpropagation: Each training tuple, X, is processed
by the following steps.
Propagate the inputs forward: First, the training tuple is fed
to the input layer of the network. The inputs pass through
the input units, unchanged.
That is, for an input unit, j, its output, Oj, is equal to its
input value, Ij. Next, the net input and output of each unit
in the hidden and output layers are computed.
131. Classification by Backpropagation
Backpropagation: Each training tuple, X, is processed
by the following steps.
Propagate the inputs forward: The net input to a unit in the
hidden or output layers is computed as a linear combination
of its inputs.
Each such unit has a number of inputs to it that are, in fact,
the outputs of the units connected to it in the previous layer.
Each connection has a weight.
132. Classification by Backpropagation
Backpropagation:
To compute the net input to the unit, each input
connected to the unit is multiplied by its
corresponding weight, and this is summed.
Given a unit j in a hidden or output layer, the net
input, Ij, to unit j is
Ij = Σi wij Oi + θj
134. Classification by Backpropagation
Backpropagation:
Given a unit j in a hidden or output layer, the net input, Ij,
to unit j is
Ij = Σi wij Oi + θj
where wij is the weight of the connection from unit i in
the previous layer to unit j; Oi is the output of unit i from
the previous layer; and θj is the bias of the unit.
The bias acts as a threshold in that it serves to vary the
activity of the unit.
135. Classification by Backpropagation
Backpropagation:
Each unit in the hidden and output layers takes its net input
and then applies an activation function to it.
The function symbolizes the activation of the neuron
represented by the unit. The logistic, or sigmoid, function is
used.
Given the net input Ij to unit j, then Oj, the output of unit j,
is computed as
Oj = 1 / (1 + e^(-Ij))
136. Classification by Backpropagation
Backpropagation:
We compute the output values, Oj, for each hidden
layer, up to and including the output layer, which gives
the network’s prediction.
In practice, it is a good idea to cache (i.e., save) the
intermediate output values at each unit as they are
required again later, when backpropagating the error.
137. Classification by Backpropagation
Backpropagation:
Backpropagate the error: The error is propagated backward
by updating the weights and biases to reflect the error of
the network’s prediction. For a unit j in the output layer, the
error Err j is computed by
Errj = Oj(1-Oj)(Tj - Oj)
where Oj is the actual output of unit j, and Tj is the known
target value of the given training tuple.
138. Classification by Backpropagation
Backpropagation:
To compute the error of a hidden layer unit j, the weighted
sum of the errors of the units connected to unit j in the next
layer is considered. The error of a hidden layer unit j is
Errj = Oj(1 - Oj) Σk Errk wjk
where wjk is the weight of the connection from unit j to a
unit k in the next higher layer, and Errk is the error of unit
k.
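One full forward-and-backward pass can be worked through numerically. The sketch below uses one hidden layer and one training tuple; the weight and bias updates (new wij = wij + l * Errj * Oi, new θj = θj + l * Errj, with learning rate l) are the standard textbook rules, and all initial weights, biases, and the learning rate are arbitrary illustrative values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = [1.0, 0.0]          # inputs O_i emitted by the input units
w_h = [[0.2, -0.3],     # w_h[j][i]: weight from input unit i to hidden unit j
       [0.4, 0.1]]
b_h = [-0.4, 0.2]       # hidden biases theta_j
w_o = [-0.3, -0.2]      # weights from hidden units to the single output unit
b_o = 0.1               # output bias
target, lr = 1.0, 0.9   # known target value T and learning rate l

# Propagate the inputs forward: I_j = sum_i w_ij O_i + theta_j, O_j = sigmoid(I_j).
o_h = [sigmoid(sum(w * xi for w, xi in zip(w_h[j], x)) + b_h[j])
       for j in range(2)]
o_out = sigmoid(sum(w * o for w, o in zip(w_o, o_h)) + b_o)

# Backpropagate the error.
err_out = o_out * (1 - o_out) * (target - o_out)                      # output unit
err_h = [o_h[j] * (1 - o_h[j]) * err_out * w_o[j] for j in range(2)]  # hidden units

# Update weights and biases in the "backwards" direction.
w_o = [w_o[j] + lr * err_out * o_h[j] for j in range(2)]
b_o += lr * err_out
for j in range(2):
    w_h[j] = [w_h[j][i] + lr * err_h[j] * x[i] for i in range(2)]
    b_h[j] += lr * err_h[j]
```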
139. Classification by Backpropagation
Backpropagation:
The computational efficiency depends on the time
spent training the network. Given |D| tuples and w
weights, each epoch requires O(|D| x w) time.
140. Support Vector Machines
Method for the classification of both linear
and nonlinear data.
It uses a nonlinear mapping to transform the
original training data into a higher dimension.
Within this new dimension, it searches for the
linear optimal separating hyperplane
141. Support Vector Machines
With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane.
The SVM finds this hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors).
SVMs can be used for prediction as well as classification.
Applications: Handwritten digit recognition, object recognition, and
speaker identification, as well as benchmark time-series prediction
tests.
142. Support Vector Machines
Goal:
The goal of the SVM algorithm is to create the best line
or decision boundary that can segregate n-dimensional
space into classes so that we can easily place a new
data point in the correct category in the future.
This best decision boundary is called a hyperplane.
143. Support Vector Machines
The Case When the Data Are Linearly Separable:
The 2-D data are linearly separable (or “linear,” for short)
because a straight line can be drawn to separate all of the
tuples of class +1 from all of the tuples of class -1.
There are an infinite number of separating lines that could
be drawn.
We want to find the “best” one, that is, the one that will have
the minimum classification error on previously unseen tuples.
144. Support Vector Machines
The Case When the Data Are Linearly Separable:
We use the term “hyperplane” to refer to the decision
boundary that we are seeking, regardless of the number of
input attributes.
An SVM approaches this problem by searching for the
maximum marginal hyperplane.
147. Support Vector Machines
The Case When the Data Are Linearly Separable:
Both hyperplanes can correctly classify all of the
given data tuples.
We expect the hyperplane with the larger margin to be more
accurate at classifying future data tuples than the
hyperplane with the smaller margin.
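The margin comparison can be sketched numerically. A hyperplane w . x + b = 0 lies at distance |w . x + b| / ||w|| from a point x, and its margin is the smallest such distance over the training tuples; the two candidate hyperplanes and the points below are arbitrary illustrative values.

```python
import math

def margin(w, b, points):
    """Smallest distance |w . x + b| / ||w|| from the points to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
               for x in points) / norm

points = [(2.0, 2.0), (3.0, 3.0),   # class +1
          (0.0, 0.0), (1.0, 0.0)]   # class -1

m1 = margin([1.0, 1.0], -3.0, points)   # hyperplane x1 + x2 = 3
m2 = margin([1.0, 0.0], -1.5, points)   # hyperplane x1 = 1.5
print(m1, m2)   # both separate the classes; prefer the larger margin (m1)
```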
149. Support Vector Machines
The Case When the Data Are Linearly
Inseparable:
No straight line can be found that would separate the
classes.
SVMs are capable of finding nonlinear decision
boundaries (i.e., nonlinear hypersurfaces) in input
space.
150. Support Vector Machines
The Case When the Data Are Linearly
Inseparable:
Transform the original input data into a higher
dimensional space using a nonlinear mapping.
Once the data have been transformed into the new
higher space, the second step searches for a linear
separating hyperplane in the new space
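The two-step idea can be illustrated with hypothetical 1-D data: the points -2 and 2 (class +1) cannot be separated from -0.5 and 0.5 (class -1) by any single threshold on the line, but after a nonlinear mapping phi(x) = (x, x^2) a linear boundary x^2 = 2 separates them in the new space.

```python
# Step 1: nonlinear mapping to a higher-dimensional space.
# Step 2: a linear separating hyperplane in the new space.

data = [(-2.0, +1), (2.0, +1), (-0.5, -1), (0.5, -1)]

def phi(x):
    return (x, x * x)                    # map 1-D input to 2-D

def linear_classifier(z):
    # A separating hyperplane in the transformed space: sign of x^2 - 2.
    return +1 if z[1] - 2.0 > 0 else -1

predictions = [linear_classifier(phi(x)) for x, _ in data]
print(predictions)                       # matches the labels: [1, 1, -1, -1]
```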
151. Other Classification Methods
Genetic Algorithms:
Genetic algorithms attempt to incorporate ideas of
natural evolution.
An initial population is created consisting of
randomly generated rules.
Each rule can be represented by a string of bits.
152. Other Classification Methods
Genetic Algorithms:
Based on the notion of survival of the fittest, a new
population is formed to consist of the fittest rules in
the current population, as well as offspring of these
rules.
The fitness of a rule is assessed by its classification
accuracy on a set of training samples.
153. Other Classification Methods
Genetic Algorithms:
Offspring are created by applying genetic operators such
as crossover and mutation.
In crossover, substrings from pairs of rules are swapped to
form new pairs of rules.
In mutation, randomly selected bits in a rule’s string are
inverted.
Genetic algorithms can also be used to evaluate the
fitness of other algorithms.
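The two genetic operators can be sketched on rules encoded as bit strings. The encodings below are hypothetical; a full GA classifier would also need a fitness function (e.g., classification accuracy on training samples) and a selection step.

```python
import random

def crossover(rule1, rule2, point):
    # Swap the substrings after the crossover point to form two new rules.
    return rule1[:point] + rule2[point:], rule2[:point] + rule1[point:]

def mutate(rule, rng, p=0.1):
    # Invert each bit independently with probability p.
    return "".join(b if rng.random() > p else ("1" if b == "0" else "0")
                   for b in rule)

child1, child2 = crossover("1001100", "0110011", point=3)
print(child1, child2)            # "1000011" "0111100"
mutated = mutate("1111111", random.Random(0))
print(mutated)                   # still a 7-bit rule string
```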
154. Other Classification Methods
Rough Set Approach
Rough set theory can be used for classification to
discover structural relationships within imprecise or
noisy data.
It applies to discrete-valued attributes. Continuous-
valued attributes must therefore be discretized before
rough set theory can be applied.
155. Other Classification Methods
Rough Set Approach
Rough set theory is based on the establishment of
equivalence classes within the given training data.
Rough sets can be used to approximately or “roughly”
define such classes.
A rough set definition for a given class, C, is approximated
by two sets—a lower approximation of C and an upper
approximation of C.
156. Other Classification Methods
Rough Set Approach
The lower approximation of C consists of all of the data
tuples that, based on the knowledge of the attributes, are
certain to belong to C without ambiguity.
The upper approximation of C consists of all of the tuples
that, based on the knowledge of the attributes, cannot be
described as not belonging to C.
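The two approximations can be sketched directly from the definitions: tuples that agree on all (discrete) attributes form an equivalence class; a class goes into the lower approximation of C when all its tuples are labeled C, and into the upper approximation when it cannot be ruled out of C. The data below are hypothetical, with two indiscernible tuples labeled differently.

```python
# Hypothetical class-labeled data: (attribute values, class label).
data = [
    ({"a": 1, "b": 0}, "C"),
    ({"a": 1, "b": 0}, "not C"),   # indiscernible from the tuple above
    ({"a": 0, "b": 1}, "C"),
    ({"a": 0, "b": 0}, "not C"),
]

# Group tuple indices into equivalence classes by their attribute values.
classes = {}
for i, (attrs, _) in enumerate(data):
    classes.setdefault(tuple(sorted(attrs.items())), []).append(i)

lower, upper = set(), set()
for members in classes.values():
    labels = {data[i][1] for i in members}
    if labels == {"C"}:          # certain to belong to C, without ambiguity
        lower.update(members)
    if "C" in labels:            # cannot be described as not belonging to C
        upper.update(members)

print(sorted(lower), sorted(upper))   # the lower approximation is a subset
```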
157. Other Classification Methods
Rough Set Approach
Rough sets can also be used for attribute subset
selection and relevance analysis.
158. Other Classification Methods
Fuzzy Set Approaches
Fuzzy set theory is also known as possibility theory.
It lets us work at a high level of abstraction and offers a
means for dealing with imprecise measurement of data.
Fuzzy set theory allows us to deal with vague or inexact
facts.
Fuzzy set theory is useful for data mining systems
performing rule-based classification.
159. Prediction
Numeric prediction is the task of predicting
continuous (or ordered) values for given input.
For example, to predict the salary of college
graduates with 10 years of work experience, or
the potential sales of a new product given its
price.
160. Prediction
Regression analysis can be used to model the
relationship between one or more independent or
predictor variables and a dependent or response
variable (which is continuous-valued).
In the context of data mining, the predictor variables
are the attributes of interest describing the tuple (i.e.,
making up the attribute vector).
161. Prediction
Linear Regression
Straight-line regression analysis involves a response
variable, y, and a single predictor variable, x.
It is the simplest form of regression, and models y as a
linear function of x. That is, y = b + wx.
The regression coefficients, w and b, can also be thought
of as weights, so that we can equivalently write,
y = w0+w1x.
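Fitting y = w0 + w1x can be sketched with the closed-form least-squares estimates w1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and w0 = ȳ - w1·x̄. The (x, y) pairs below are hypothetical and lie exactly on a line, so the fit recovers it.

```python
# Straight-line regression by the method of least squares.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # exactly y = 1 + 2x

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar
print(w0, w1)   # 1.0 2.0
```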
163. Prediction
Linear Regression
Multiple linear regression is an extension of straight-line
regression so as to involve more than one predictor
variable.
An example of a multiple linear regression model based on
two predictor attributes or variables, A1 and A2, is
y = w0 + w1x1 + w2x2, where x1 and x2 are the values of
A1 and A2, respectively.
164. Prediction
Nonlinear Regression:
Polynomial regression is often of interest when there is
just one predictor variable.
It can be modeled by adding polynomial terms to the basic
linear model.
By applying transformations to the variables, we can
convert the nonlinear model into a linear one that can then
be solved by the method of least squares.
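The transformation can be sketched as follows: a cubic polynomial model y = w0 + w1x + w2x² + w3x³ becomes linear in the new variables x1 = x, x2 = x², x3 = x³, after which least squares applies as usual. The degree, weights, and sample point below are illustrative.

```python
# Convert a one-variable polynomial model into a linear one.

def to_linear_terms(x, degree=3):
    return [x ** k for k in range(1, degree + 1)]   # [x, x^2, x^3]

def predict(x, weights):
    # weights = [w0, w1, w2, w3] for the transformed linear model
    w0, rest = weights[0], weights[1:]
    return w0 + sum(w * t
                    for w, t in zip(rest, to_linear_terms(x, len(rest))))

print(to_linear_terms(2.0))                 # [2.0, 4.0, 8.0]
print(predict(2.0, [1.0, 0.0, 0.5, 0.0]))   # 1 + 0.5 * 2^2 = 3.0
```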