MIT ACADEMY OF ENGINEERING
A LITERATURE SURVEY ON :-
“FREQUENT ITEMSET MINING ON BIGDATA”
PROJECT MEMBERS :-
RAJU GUPTA
PURUSHOTAM SINGH
AKSHAY KUMAR
SHIVANI
MAHESHWARI TEGAMPURE
UNDER THE GUIDANCE OF :-
Mrs. Prajakta Ugale (Asst. Prof.)
Big Data
Big data usually includes data sets with sizes
beyond the ability of commonly used software
tools to capture, curate, manage, and process
the data within a tolerable elapsed time.
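Introduction :-
• Frequent Itemset Mining (FIM)
Finds the sets of items (itemsets) that occur together in at least a
minimum proportion of the transactions.
• Support
The support supp(X) of an itemset X is defined as the proportion of transactions
in the data set which contain the itemset.
supp(X) = (no. of transactions which contain the itemset X) / (total no. of
transactions)
• Confidence
The confidence conf(X->Y) of a rule X->Y is the proportion of the transactions
containing X which also contain Y.
conf(X->Y) = supp(X ∪ Y) / supp(X)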
Fig:- Example for support and confidence
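The two definitions above can be checked with a small sketch. The transactions below are hypothetical and invented for illustration only; they are not the data from the figure. The helper names `supp` and `conf` simply mirror the notation used on the slide.

```python
from fractions import Fraction

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def supp(itemset, transactions):
    """Proportion of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def conf(x, y, transactions):
    """conf(X -> Y) = supp(X U Y) / supp(X)."""
    return supp(set(x) | set(y), transactions) / supp(x, transactions)

print(supp({"bread"}, transactions))            # 3/4
print(supp({"bread", "milk"}, transactions))    # 1/2
print(conf({"bread"}, {"milk"}, transactions))  # 2/3
```

Exact fractions are used instead of floats so that the values can be compared directly with a hand calculation.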
Hadoop Framework :-
• Apache Hadoop is an open-source software framework for storage
and large-scale processing of data sets on clusters of commodity
hardware. Its two core components are:
• Hadoop Distributed File System (HDFS), for storage.
• Hadoop MapReduce, for processing.
MapReduce :-
• Map :-
A mapper processes a part of the
data and generates key-value pairs.
• Reduce :-
The key-value pairs are grouped by key
and fed to reducers, which process
these groups and give the output.
Fig:- MapReduce flow: Map (key-value pair generation), then Reduce (gives output)
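The map and reduce steps can be imitated in a single Python process. This is only a sketch of the idea, not Hadoop code: the function names `mapper`, `shuffle` and `reducer` and the input splits are assumptions made for illustration, and the grouping step stands in for what the framework does between the two phases.

```python
from collections import defaultdict

def mapper(transaction):
    """Map: emit one (item, 1) key-value pair per item in a transaction."""
    for item in transaction:
        yield item, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reduce: combine all values of one key into a single output pair."""
    return key, sum(values)

# Hypothetical input, split into two parts (one per simulated mapper).
splits = [
    [["a", "b"], ["a", "c"]],
    [["a", "b", "c"], ["b"]],
]

pairs = [kv for split in splits for t in split for kv in mapper(t)]
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'a': 3, 'b': 3, 'c': 2}
```

Counting single items this way is the first step of frequent itemset mining: items whose count falls below the minimum support can be discarded before longer itemsets are considered.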
EXAMPLE1
EXAMPLE2
• MapReduce is a programming model and an associated
implementation for processing and generating
large data sets with a parallel, distributed algorithm
on a cluster.
• Single pass counting utilizes one MapReduce phase
for each candidate generation and frequency
counting step.
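A rough single-process sketch of single pass counting follows, assuming an Apriori-style level-wise search in which each candidate length gets its own counting phase. The function names, the `min_count` parameter and the sample data are hypothetical, and the MapReduce phase is only simulated by a loop over transactions.

```python
from collections import Counter
from itertools import combinations

def count_phase(transactions, candidates):
    """One simulated MapReduce phase: emit (candidate, 1) pairs and sum them."""
    counts = Counter()
    for t in transactions:                # map side: one transaction at a time
        for c in candidates:
            if c <= t:
                counts[c] += 1            # reduce side: sum values per key
    return counts

def generate_candidates(frequent, k):
    """Build k-itemsets whose (k-1)-subsets are all frequent."""
    items = sorted({i for s in frequent for i in s})
    return {
        frozenset(c)
        for c in combinations(items, k)
        if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
    }

def single_pass_counting(transactions, min_count):
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([i]) for t in transactions for i in t}
    result, k = {}, 1
    while candidates:
        counts = count_phase(transactions, candidates)   # one phase per length
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)
        k += 1
        candidates = generate_candidates(set(frequent), k)
    return result

# Hypothetical data: every pair is frequent, the triple is not.
data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]]
result = single_pass_counting(data, min_count=2)
print(len(result))                  # 6
print(result[frozenset("ab")])      # 2
```

Each pass through the `while` loop corresponds to one database scan, which is why this scheme needs as many phases as the length of the longest candidate.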
• Fixed passes combined counting starts to generate
candidates of n different lengths after p phases
and counts their frequencies in one database
scan.
• Dynamic passes counting is similar to fixed passes
combined counting; however, n and p are
determined dynamically at each phase by the
number of generated candidates.
o Parallel FP-Growth (PFP) is a parallel version of the well-known
FP-Growth algorithm. PFP groups the items and distributes their
conditional databases to the mappers.
o The PARMA algorithm finds approximate collections of
frequent itemsets.
o TWISTER improves the performance between MapReduce
cycles, and NIMBLE provides better programming
tools for data mining jobs.
Search space distribution :-
Distributing the search space is the main challenge in adapting
algorithms to the MapReduce framework.
Tasks are defined at start-up.
Prefix tree:
o Tree structure where each path represents an itemset.
o Divided into independent groups.
o Eclat traverses the tree in a depth-first (DFS) manner to find the frequent itemsets (FIs).
Running Time in Eclat.
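A minimal sketch of a depth-first Eclat traversal is given here, assuming the usual vertical layout in which each item keeps the set of transaction ids (tid-list) that contain it. The function name `eclat`, the `min_count` parameter and the sample data are illustrative assumptions, not taken from the slides.

```python
def eclat(transactions, min_count):
    """Depth-first Eclat over tid-lists; returns {itemset tuple: support count}."""
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    frequent = {}

    def dfs(prefix, siblings):
        # `siblings` holds (item, tids) pairs that can extend `prefix`.
        for i, (item, tids) in enumerate(siblings):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Children: extend by later siblings, intersecting the tid-lists.
            children = [
                (other, tids & other_tids)
                for other, other_tids in siblings[i + 1:]
                if len(tids & other_tids) >= min_count
            ]
            dfs(itemset, children)

    roots = sorted(
        ((i, t) for i, t in tidlists.items() if len(t) >= min_count),
        key=lambda pair: pair[0],
    )
    dfs((), roots)
    return frequent

# Hypothetical data: every pair is frequent, the triple is not.
data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]]
found = eclat(data, min_count=2)
print(found[("a", "b")])   # 2
print(len(found))          # 6
```

Each path from the root of the recursion corresponds to one itemset, and the subtree below a prefix uses only the tid-lists passed into it, so different subtrees can be handed to different workers.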
Search space distribution (cont.) :-
• To estimate the computation time of a subtree, consider:
o Total number of items.
o Order of frequency of items.
o Total frequency of items.
• Balanced partitioning of the prefix tree.
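One possible way to balance the partitions, once a cost estimate per subtree is available, is a greedy assignment that always gives the most expensive remaining subtree to the least-loaded worker. This is an assumed heuristic chosen for illustration, not necessarily the method used in the surveyed work, and the cost values below are hypothetical placeholders for whatever estimate is derived from the factors listed above.

```python
import heapq

def balanced_partition(subtree_costs, n_workers):
    """Greedy: give the costliest remaining subtree to the least-loaded worker."""
    heap = [(0, w, []) for w in range(n_workers)]   # (load, worker id, prefixes)
    heapq.heapify(heap)
    for prefix, cost in sorted(subtree_costs.items(), key=lambda kv: -kv[1]):
        load, w, assigned = heapq.heappop(heap)     # least-loaded worker so far
        assigned.append(prefix)
        heapq.heappush(heap, (load + cost, w, assigned))
    return {w: (load, assigned) for load, w, assigned in heap}

# Hypothetical estimated costs for the subtrees rooted at each item.
costs = {"a": 8, "b": 5, "c": 4, "d": 3, "e": 2}
plan = balanced_partition(costs, n_workers=2)
print(plan[0])  # (11, ['a', 'd'])
print(plan[1])  # (11, ['b', 'c', 'e'])
```

The quality of the split depends entirely on how well the estimated costs predict the real running time of each subtree.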