Frequent Itemset Mining on BigData

MIT ACADEMY OF ENGINEERING
A LITERATURE SURVEY ON :-
“FREQUENT ITEMSET MINING ON BIGDATA”
PROJECT MEMBERS :-
RAJU GUPTA
PURUSHOTAM SINGH
AKSHAY KUMAR
SHIVANI
MAHESHWARI TEGAMPURE
UNDER THE GUIDANCE OF :-
Mrs. Prajakta Ugale (Asst. Prof.)

Big Data
Big data usually includes data sets with sizes
beyond the ability of commonly used software
tools to capture, curate, manage, and process
the data within a tolerable elapsed time.

Introduction :-
• Frequent Itemset Mining (FIM)
• Support
The support supp(X) of an itemset X is defined as the proportion of transactions
in the data set which contain the itemset.
supp(X) = (no. of transactions which contain the itemset X) / (total no. of transactions)
• Confidence
The confidence of a rule X -> Y is the proportion of transactions containing X
that also contain Y.
conf(X -> Y) = supp(X ∪ Y) / supp(X)

Fig:- Example for support and confidence
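
The two measures can be checked on a toy data set. The sketch below is only an illustration: the four transactions and the function names `support` and `confidence` are made up for this example and do not come from the surveyed work.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """conf(X -> Y) = supp(X u Y) / supp(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Hypothetical toy data set: four shopping baskets.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

supp_bread = support({"bread"}, transactions)                # 3 of 4 baskets
supp_bread_milk = support({"bread", "milk"}, transactions)   # 2 of 4 baskets
conf_bread_milk = confidence({"bread"}, {"milk"}, transactions)
```

Here bread appears in 3 of 4 baskets and bread with milk in 2 of 4, so the rule bread -> milk has confidence (2/4) / (3/4) = 2/3.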

Hadoop Framework :-
• Apache Hadoop is an open-source software framework for storage
and large-scale processing of data sets on clusters of commodity
hardware.
• Hadoop Distributed File System (HDFS).
• Hadoop MapReduce.

Map Reduce :-
• Map :-
A mapper processes a part of the
data and generates key-value pairs.
• Reduce :-
The key-value pairs are combined
and fed to a reducer, which processes
them and gives the output.
Fig:- MapReduce flow: Map (key-value pair generation) -> Reduce (gives output)
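
The map and reduce roles can be imitated in a single Python process. This is a minimal sketch that counts item frequencies and does not use Hadoop at all; the names `mapper`, `reducer`, and `run_mapreduce` and the input splits are assumptions made for illustration.

```python
from collections import defaultdict


def mapper(transaction):
    """Map: emit one (item, 1) key-value pair per item in a transaction."""
    for item in transaction:
        yield item, 1


def reducer(key, values):
    """Reduce: combine all values that share a key into one output pair."""
    return key, sum(values)


def run_mapreduce(splits):
    """Simulate the flow: map every split, group pairs by key, then reduce."""
    grouped = defaultdict(list)
    for split in splits:  # on a cluster, each split would go to its own mapper
        for transaction in split:
            for key, value in mapper(transaction):
                grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical input: two splits of a small transaction database.
splits = [
    [["a", "b"], ["a", "c"]],
    [["a", "b", "c"], ["b"]],
]
item_counts = run_mapreduce(splits)
```

The grouping step stands in for the shuffle that a real framework performs between the map and reduce phases.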

• MapReduce is a programming model and an associated
implementation for processing and generating
large data sets with a parallel, distributed algorithm
on a cluster.
• Single pass counting utilizes a MapReduce phase
for each candidate generation and frequency
counting step.

• Fixed passes combined counting starts to generate
candidates with n different lengths after p phases
and counts their frequencies in one database
scan.
• Dynamic passes counting is similar to fixed passes
combined counting; however, n and p are
determined dynamically at each phase by the
number of generated candidates.
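
The single pass counting idea, one counting phase per candidate length, can be sketched as a level-wise loop. This is a single-process illustration, assuming Apriori-style candidate generation; the helper names, the toy database, and the threshold `min_count` are hypothetical, and a real implementation would run `count_phase` as a MapReduce job.

```python
from collections import Counter
from itertools import combinations


def count_phase(transactions, candidates):
    """Stand-in for one MapReduce phase: count each candidate's occurrences."""
    counts = Counter()
    for t in transactions:
        for c in candidates:
            if c <= t:  # the "map" side would emit (c, 1); "reduce" sums them
                counts[c] += 1
    return counts


def generate_candidates(frequent_k, k):
    """Build (k+1)-itemsets whose k-subsets are all frequent."""
    items = sorted({item for itemset in frequent_k for item in itemset})
    candidates = set()
    for combo in combinations(items, k + 1):
        if all(frozenset(sub) in frequent_k for sub in combinations(combo, k)):
            candidates.add(frozenset(combo))
    return candidates


def single_pass_counting(transactions, min_count):
    """Run one counting phase per candidate length until no candidates remain."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        counts = count_phase(transactions, candidates)
        frequent_k = {c for c, n in counts.items() if n >= min_count}
        frequent.update({c: counts[c] for c in frequent_k})
        candidates = generate_candidates(frequent_k, k)
        k += 1
    return frequent


# Hypothetical toy database and absolute support threshold.
database = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
frequent_itemsets = single_pass_counting(database, min_count=3)
```

The combined-counting variants described above would differ only in how many candidate lengths are counted in one phase, which this sketch does not attempt.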

o Parallel FP-Growth (PFP) is a parallel version of the well-known
FP-Growth algorithm. PFP groups the items and distributes their
conditional databases to the mappers.
o The PARMA algorithm finds approximate collections of
frequent itemsets.
o TWISTER improves the performance between MapReduce
cycles, and NIMBLE provides better programming
tools for data mining jobs.

Search space distribution :-
• The main challenge in adapting algorithms to the
MapReduce framework.
• Tasks are defined at start-up.
• Prefix tree:
o Tree structure where each path represents an itemset.
o Divided into independent groups.
o Eclat traverses the tree in DFS manner to find FIs.
• Running time in Eclat.
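
The depth-first traversal can be sketched as follows. This is a minimal single-machine illustration of Eclat, which stores for each item the set of transaction ids containing it and intersects these sets while descending the prefix tree; the function name, the toy database, and `min_count` are assumptions for this example.

```python
def eclat(transactions, min_count):
    """Depth-first frequent itemset search over a prefix tree of tid sets."""
    # Vertical layout: item -> set of ids of the transactions that contain it.
    tidlists = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def dfs(prefix, candidates):
        # `candidates` holds (item, tids) pairs that extend `prefix` and are frequent.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            extensions = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids  # transactions containing both
                if len(shared) >= min_count:
                    extensions.append((other, shared))
            dfs(itemset, extensions)  # descend into the subtree of `itemset`

    roots = sorted(
        ((item, tids) for item, tids in tidlists.items() if len(tids) >= min_count),
        key=lambda pair: pair[0],
    )
    dfs((), roots)
    return frequent


# Hypothetical toy database and absolute support threshold.
database = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
frequent_itemsets = eclat(database, min_count=3)
```

Each subtree under a given prefix depends only on the tid sets of that prefix, which is why the subtrees can be mined independently.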

Search space distribution (cont..) :-
• To estimate the computation time of a subtree:
o Total number of items.
o Order of frequency of items.
o Total frequency of items.
• Balanced partitioning of the prefix tree.
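
One possible way to balance the partitions is a greedy assignment: take the subtrees in order of decreasing estimated cost and give each to the currently least-loaded group. This is a sketch under stated assumptions, not necessarily the scheme used in the surveyed work; the weights below are invented stand-ins for whichever estimate (for example, total item frequency) is chosen, and `balanced_partition` is a hypothetical name.

```python
import heapq


def balanced_partition(weights, n_groups):
    """Greedily assign prefixes to groups so that estimated loads stay even."""
    heap = [(0, group) for group in range(n_groups)]  # (current load, group id)
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    # Heaviest subtrees first; ties are broken by prefix name for determinism.
    for prefix, weight in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        load, group = heapq.heappop(heap)  # least-loaded group so far
        groups[group].append(prefix)
        heapq.heappush(heap, (load + weight, group))
    loads = [sum(weights[p] for p in members) for members in groups]
    return groups, loads


# Hypothetical estimated costs of the subtrees rooted at five prefixes.
estimated_cost = {"a": 9, "b": 7, "c": 6, "d": 5, "e": 3}
groups, loads = balanced_partition(estimated_cost, n_groups=2)
```

With these invented weights the two groups end up with estimated loads of 14 and 16, against a total of 30.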
