Fi doopdp data partitioning in frequent itemset

CONTACT: PRAVEEN KUMAR. L (,+91 – 9791938249)
MAIL ID: sunsid1989@gmail.com, praveen@nexgenproject.com
Web: www.nexgenproject.com, www.finalyear-ieeeprojects.com
FiDoop-DP: Data Partitioning in Frequent Itemset Mining on
Hadoop Clusters
Abstract:
Traditional parallel algorithms for mining frequent itemsets aim to balance
load by equally partitioning data among a group of computing nodes. We start
this study by discovering a serious performanceproblem of the existing parallel
Frequent Itemset Mining algorithms. Given a large dataset, data partitioning
strategies in the existing solutions suffer high communication and mining
overhead induced by redundant transactions transmitted among computing
nodes. We address this problem by developing a data partitioning approach
called FiDoop-DP using the MapReduce programming model. The overarching
goal of FiDoop-DP is to boost the performance of parallel Frequent Itemset
Mining on Hadoop clusters. At the heart of FiDoop-DP is the Voronoi diagram-
based data partitioning technique, which exploits correlations among
transactions. Incorporating the similarity metric and the Locality-Sensitive
Hashing technique, FiDoop-DP places highly similar transactions into a data
partition to improve locality without creating an excessive number of
redundant transactions. We implement FiDoop-DP on a 24-node Hadoop
cluster, driven by a wide range of datasets created by IBM Quest Market-
Basket Synthetic Data Generator. Experimental results reveal that FiDoop-DP is
conducive to reducing network and computing loads by the virtue of
eliminating redundant transactions on Hadoop nodes. FiDoop-DP significantly

improves the performance of the existing parallel frequent-pattern scheme by
up to31% with an average of 18%
CONCLUSIONS AND FUTURE WORK
To mitigate high communication and reduce computing cost in MapReduce-
based FIM algorithms, we developed FiDoop-DP, which exploits correlation
among transactions to partition a large dataset across data nodes in a
Hadoopcluster. FiDoop-DP is able to (1) partition transactions with high
similarity together and (2) group highly correlated frequent items into a list.
One of the salient features of FiDoop-DP lies in its capability of lowering
network traffic and computing load through reducing the number of
redundant transactions, which are transmitted among Hadoop nodes. FiDoop-
DP applies the Voronoidiagrambaseddata partitioning technique to accomplish
data partition,in which LSH is incorporated to offer an analysisof correlation
among transactions. At the heart of FiDoop-DP is the second MapReduce job,
which (1) partitions alarge database to form a complete dataset for item
groupsand (2) conducts FP-Growth processing in parallel on localpartitions to
generate all frequent patterns. Our experimentalresults reveal that FiDoop-DP
significantly improves theFIMperformance of the existing Pfp solution by up to
31%with an average of 18%.We introduced in this study a similarity metric to
facilitatedata-aware partitioning. As a future research direction,we will apply
this metric to investigate advanced loadbalancingstrategies on a

heterogeneous Hadoop cluster.In one of our earlier studies (see [32] for
details), we addressedthe data-placement issue in heterogeneous
Hadoopclusters, where data are placed across nodes in a way thateach node
has a balanced data processing load. Our dataplacement scheme [32] can
balance the amount of datastored in heterogeneous nodes to achieve
improved dataprocessingperformance. Such a scheme implemented at
thelevel of Hadoop distributed file system (HDFS) is unawareof correlations
among application data. To further improveload balancing mechanisms
implemented in HDFS, we planto integrate FiDoop-DP with a data-placement
mechanismin HDFS on heterogeneous clusters. In addition to
performanceissues, energy efficiency of parallel FIM systems willbe an
intriguing research direction.
REFERENCES
[1] M. J. Zaki, “Parallel and distributed association mining: A
survey,”Concurrency, IEEE, vol. 7, no. 4, pp. 14–25, 1999.
[2] I. Pramudiono and M. Kitsuregawa, “Fp-tax: Tree structure
basedgeneralized association rule mining,” in Proceedings of the 9th
ACMSIGMOD workshop on Research issues in data mining and
knowledgediscovery. ACM, 2004, pp. 60–63.
[3] J. Dean and S. Ghemawat, “Mapreduce: simplified data processingon large
clusters,” Communications of the ACM, vol. 51, no. 1, pp.107–113, 2008.

[4] S. Sakr, A. Liu, and A. G. Fayoumi, “The family of mapreduceand large-scale
data processing systems,” ACMComputing Surveys(CSUR), vol. 46, no. 1, p. 11,
2013.
[5] M.-Y. Lin, P.-Y.Lee, and S.-C. Hsueh, “Apriori-based frequent itemsetmining
algorithms on mapreduce,” in Proceedings of the 6th InternationalConference
on Ubiquitous Information Management and Communication,ser. ICUIMC ’12.
New York, NY, USA: ACM, 2012, pp.76:1–76:8.
[6] X. Lin, “Mr-apriori: Association rules algorithm based on mapreduce,”in
Software Engineering and Service Science (ICSESS), 2014 5thIEEE International
Conference on. IEEE, 2014, pp. 141–144.
[7] L. Zhou, Z. Zhong, J. Chang, J. Li, J. Huang, and S. Feng, “Balancedparallel fp-
growth with mapreduce,” in Information Computing andTelecommunications
(YC-ICT), 2010 IEEE Youth Conference on. IEEE,2010, pp. 243–246.
[8] S. Hong, Z. Huaxuan, C. Shiping, and H. Chunyan, “The study ofimproved fp-
growth algorithm in mapreduce,” in 1st InternationalWorkshop on Cloud
Computing and Information Security. Atlantis Press,2013.
[9] M. Riondato, J. A. DeBrabant, R. Fonseca, and E. Upfal, “Parma:a parallel
randomized algorithm for approximate association rulesmining in mapreduce,”
in Proceedings of the 21st ACM internationalconference on Information and
knowledge management. ACM, 2012, pp.85–94.
[10] C. Lam, Hadoop in action. Manning Publications Co., 2010.

Fi doopdp data partitioning in frequent itemset

Recommended

Recommended

More Related Content

More from Nexgen Technology

More from Nexgen Technology (20)

Recently uploaded

Recently uploaded (20)

Fi doopdp data partitioning in frequent itemset