Frequent Item Set Extraction Using
Generalized Item Set Mining Algorithms
Research Proposal Presentation
Presented by G.RAJASEKARAN
Department of Computer Science and Engineering
PONDICHERRY ENGINEERING COLLEGE
(Promoted and fully funded by Govt. of Puducherry, Affiliated to Pondicherry University)
Objective
A promising approach to Bayesian classification is based on exploiting
frequent patterns, i.e., patterns that frequently occur in the training data
set, to estimate the Bayesian probability. An entropy-based Bayesian
classifier, namely EnBay, is used for this purpose. This work will address
the domain of noisy data, for which the reliability of the probability
estimate becomes particularly relevant, and the integration of generalized
itemset mining algorithms to further enhance classification accuracy.
This proposal extends that work with a boosting method to achieve higher
accuracy than the previous approach.
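As a rough illustration of the frequent-pattern idea behind this objective (this is not EnBay's actual entropy-based estimation; the toy data, minimum support, and scoring rule below are invented purely for the sketch):

```python
from collections import Counter
from itertools import combinations

# Invented toy data: each transaction is (set of items, class label).
train = [
    ({"a", "b"}, "yes"), ({"a", "c"}, "yes"),
    ({"b", "c"}, "no"),  ({"a", "b"}, "no"), ({"a", "b", "c"}, "yes"),
]

def pattern_supports(data, min_sup=2):
    """Count how often each itemset occurs per class; keep the frequent ones."""
    counts = Counter()
    for items, label in data:
        for r in range(1, len(items) + 1):
            for pat in combinations(sorted(items), r):
                counts[(pat, label)] += 1
    return {k: v for k, v in counts.items() if v >= min_sup}

def classify(items, data, supports):
    """Score each class by prior * support of its best matching frequent pattern."""
    class_counts = Counter(label for _, label in data)
    scores = {}
    for label, n in class_counts.items():
        best = max((v for (pat, l), v in supports.items()
                    if l == label and set(pat) <= items), default=0)
        scores[label] = (n / len(data)) * (best / n)
    return max(scores, key=scores.get)

sup = pattern_supports(train)
print(classify({"a", "b"}, train, sup))  # prints "yes"
```

Frequent patterns act here as more reliable evidence than single attributes: a pattern seen at least `min_sup` times yields a less noisy probability estimate, which is the motivation the objective gives for the noisy-data setting.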
Literature Survey
GenIO (GeNeralized Itemset DiscOverer) supports data analysis by
automatically extracting higher-level, more abstract correlations
from data. GenIO extends the concept of multi-level itemsets (Han
and Kamber, 2006) by performing an opportunistic extraction of
generalized itemsets.
GenIO is implemented by customizing the Apriori algorithm (Agrawal
and Srikant, 1994) to efficiently extract generalized itemsets, with
variable support thresholds, from structured data.
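GenIO itself adds taxonomies and variable support thresholds on top of Apriori; as background, a minimal sketch of the plain level-wise Apriori search it customizes (the transactions are invented for illustration):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent itemset mining (Agrawal and Srikant, 1994):
    every subset of a frequent itemset must itself be frequent."""
    items = sorted({i for t in transactions for i in t})
    freq = {}
    k_sets = [frozenset([i]) for i in items]
    k = 1
    while k_sets:
        # Count support of each candidate against the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in k_sets}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        freq.update(level)
        # Join step: frequent k-itemsets sharing k-1 items give (k+1)-candidates.
        cands = {a | b for a, b in combinations(list(level), 2)
                 if len(a | b) == k + 1}
        # Prune step: drop candidates with any infrequent k-subset.
        k_sets = [c for c in cands
                  if all(frozenset(s) in level for s in combinations(c, k))]
        k += 1
    return freq

tx = [frozenset("ab"), frozenset("ac"), frozenset("abc"), frozenset("bc")]
result = apriori(tx, min_sup=2)
# Singletons and all 2-itemsets are frequent here; "abc" (support 1) is not.
```

GenIO's customization replaces the single `min_sup` with per-level thresholds and, when a low-level itemset is infrequent, opportunistically generalizes its items upward in the taxonomy instead of discarding it.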
Problem Statement
A promising approach to Bayesian classification is based on exploiting
frequent patterns, i.e., patterns that frequently occur in the training data set, to
estimate the Bayesian probability. An entropy-based Bayesian classifier, namely
EnBay, is used. This work will address the domain of noisy data, for which the
reliability of the probability estimate becomes particularly relevant, and the
integration of generalized itemset mining algorithms to further enhance
classification accuracy.
Introduction
Classification:
predicts categorical class labels
constructs a model from the training set and the values (class labels) of a
classifying attribute, then uses the model to classify new data
Typical applications:
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Large data sets: disk-resident rather than memory-resident data
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple is assumed to belong to a predefined class, as determined
by the class label attribute (supervised learning)
The set of tuples used for model construction is the training set
Model usage: classifying previously unseen objects
Estimate the accuracy of the model using a test set
The known label of each test sample is compared with the classification
produced by the model
The accuracy rate is the percentage of test set samples that are correctly
classified by the model
The test set must be independent of the training set; otherwise over-fitting
will occur
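The accuracy estimate in the second step can be sketched directly (the threshold model and held-out samples below are invented for illustration):

```python
def accuracy(model, test_set):
    """Accuracy rate: fraction of test samples whose predicted label
    matches the known label."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

# Invented held-out test set: (years, known tenure label) pairs,
# kept separate from whatever data trained the model.
test = [(3, "no"), (7, "yes"), (2, "no"), (8, "yes"), (5, "yes")]
model = lambda years: "yes" if years > 6 else "no"
print(accuracy(model, test))  # 4 of 5 correct -> 0.8
```

Because `test` was never used to build `model`, the 0.8 is an honest estimate; scoring on the training set instead would reward over-fitting, as the slide warns.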
Classification Process: Model Construction
Training data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no
The classification algorithm constructs a classifier (model) from this data, e.g.:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
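A minimal sketch checking the induced rule back against the training table (the rule happens to cover every training tuple here):

```python
# The training data from the slide, as (name, rank, years, tenured) tuples.
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def tenured(rank, years):
    """The induced model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

for name, rank, years, label in training:
    assert tenured(rank, years) == label  # the rule fits every training tuple
```

Perfect fit on training data says nothing about new data, which is why the second step evaluates the model on an independent test set.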
Proposed Methodology
Entropy-based Bayesian classification
Boosting method
Product approximation
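Two ingredients named above can be sketched in isolation (illustrative only, not the proposal's exact formulation): Shannon entropy, which EnBay minimizes when selecting patterns, and an AdaBoost-style reweighting step, the usual core of a boosting method.

```python
import math

def entropy(probs):
    """Shannon entropy H(p) = -sum p_i * log2(p_i), with 0*log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def reweight(weights, correct, error):
    """AdaBoost-style update: alpha = 0.5 * ln((1 - error) / error);
    misclassified samples are up-weighted so the next weak learner
    concentrates on them, then weights are renormalized."""
    alpha = 0.5 * math.log((1 - error) / error)
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    z = sum(new)
    return [w / z for w in new]

print(entropy([0.5, 0.5]))  # a fair coin has 1 bit of entropy
# One weak learner misclassifies the 4th of 4 equally weighted samples:
w = reweight([0.25] * 4, [True, True, True, False], error=0.25)
# The misclassified sample now carries half the total weight.
```

A low-entropy class distribution under a pattern means the pattern is a confident predictor; boosting then combines several such weak predictors into a stronger ensemble, which is how the proposal aims to beat the previous approach's accuracy.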
References
Agrawal, R., Imielinski, T., and Swami, A. (1993). Mining association rules between
sets of items in large databases. ACM SIGMOD Record, 22(2):207–216.
Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules. In
VLDB'94, Santiago, Chile.
Agrawal, R. and Srikant, R. (1997). Mining generalized association rules. Future
Generation Computer Systems, 13(2–3):161–180.
Antonie, M. L., Zaiane, O. R., and Coman, A. (2001). Application of data mining
techniques for medical image classification. In MDM/KDD 2001.
Han, J. and Fu, Y. (1995). Discovery of multiple-level association rules from large
databases. In Proceedings of the 21st International Conference on Very Large Data
Bases, pages 420–431.
Thank You
