IEEE ISPA 2013

Large volumes of data are being produced by various modern applications at an ever-increasing rate. These applications range from wireless sensor networks to social networks. The automatic analysis of such huge data volumes is a challenging task, since a large amount of interesting knowledge can be extracted. Association rule mining is an exploratory data analysis method able to discover interesting and hidden correlations among data. Since this data mining process is characterized by computationally intensive tasks, efficient distributed approaches are needed to increase its scalability. This paper proposes a novel cloud-based service, named SEARUM, to efficiently mine association rules on a distributed computing model. SEARUM consists of a series of distributed MapReduce jobs run in the cloud. Each job performs a different step in the association rule mining process. As a case study, the proposed approach has been applied to the network data scenario. The experimental validation, performed on two real network datasets, shows the effectiveness and the efficiency of SEARUM in mining association rules on a distributed computing model.

Published in: Technology, Business

Transcript

  • 1. SeARuM: A cloud-based Service for Association Rule Mining
      Daniele Apiletti, Elena Baralis, Tania Cerquitelli, Silvia Chiusano and Luigi Grimaudo
      The 11th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA-13), Melbourne, Australia, 16-18 July 2013
  • 2. Outline
      • Motivations
      • Association Rules
      • Application Scenario
      • SeARuM Architecture
      • Experimental Validation
      • Conclusions and Future Work
      SeARuM: A cloud-based Service for Association Rule Mining – Luigi Grimaudo
  • 3. Motivations
      • Large volumes of data are being produced by modern applications
          • On-line social network data
          • Network traffic data
              • Passive probes continuously collect flow-level measurements
      • Exploit data mining techniques to extract interesting knowledge automatically
  • 4. Association Rule Mining
      • Exploratory data analysis technique to discover interesting and hidden correlations among data
      • Input: a set of transactions including data items
          • E.g., {PORT=80, RTT=10, CLASS=HTTP}
      • Output: a set of rules revealing co-occurrence relationships among data items
          • E.g., {PORT=80} is correlated with {CLASS=HTTP}
  • 5. Association Rule Mining
      • Itemset: a collection of one or more items
      • Support: fraction of transactions that contain an itemset
          • E.g., 5% of transactions contain {PORT=80, CLASS=HTTP}
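
The support definition on this slide can be sketched in a few lines of Python. This is only an illustration: the toy transactions and the `support` helper are assumptions made for the example, not part of SeARuM.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


# Hypothetical toy transactions, written in the item syntax used on the slides.
transactions = [
    frozenset({"PORT=80", "RTT=10", "CLASS=HTTP"}),
    frozenset({"PORT=80", "RTT=15", "CLASS=HTTP"}),
    frozenset({"PORT=443", "RTT=10", "CLASS=SSL"}),
    frozenset({"PORT=80", "RTT=20", "CLASS=P2P"}),
]

# Two of the four transactions contain both items.
print(support({"PORT=80", "CLASS=HTTP"}, transactions))  # 0.5
```
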
  • 6. Association Rule Mining
      • Association rule: a rule in the form X → Y
          • X and Y are itemsets
          • → means co-occurrence, not causality!
      • Support: percentage of transactions containing both X and Y
      • Confidence: conditional probability of finding Y given X
      • Lift: measures the correlation between X and Y
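
The slide names the three quality measures without giving formulas. Assuming the commonly used definitions (the slides do not confirm which variant SeARuM implements), and writing supp(·) for the fraction of transactions containing an itemset, they can be stated as:

```latex
\[
\mathrm{supp}(X \rightarrow Y) = \mathrm{supp}(X \cup Y), \qquad
\mathrm{conf}(X \rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}, \qquad
\mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{conf}(X \rightarrow Y)}{\mathrm{supp}(Y)}
  = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)}
\]
```

Under these definitions a lift of 1 means X and Y occur independently, a lift above 1 indicates positive correlation, and a lift below 1 indicates negative correlation.
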
  • 7. Association Rule Mining
      • Large volume of data
      • Huge number of itemsets
      • Computationally intensive tasks
      • Horizontally-scalable distributed approaches
  • 8. Application Scenario
      [Figure: network monitoring scenario. Diagram labels: Private Network, Border router, Rest of the world, Passive Analyzer [1], Statistics, Store, Analysis, Request]
      [1] A. Finamore, M. Mellia, M. Meo, M. Munafo, D. Rossi, "Experiences of Internet traffic monitoring with tstat", IEEE Network, 2011.
  • 9. SeARuM Architecture
      • Pipeline: Pre-processing → Item frequency → Itemset mining → Rule extraction → Rule aggregation and sorting
      • Association rule mining framework as a SaaS
          • Data analytics service for cloud users
      • Distributed jobs running in the cloud
          • Hadoop cluster
          • HDFS file system
      • Each job performs one of the steps required by the ARM process
  • 10. SeARuM - Pre-processing
      • Map-only job
      • Input: flow-level HTTP logs
  • 11. SeARuM - Pre-processing
      • Map-only job
      • Input: flow-level HTTP logs
      • Feature selection
          • RTT, Hop, P{reord}, P{dup}, NumPkt, DataPkt, DataByte, WinMax, WinMin, WinScale, Port, Class
  • 12. SeARuM - Pre-processing
      • Map-only job
      • Input: flow-level HTTP logs
      • Feature selection
      • Value discretization (domain-expert driven)
          • RTT: one bin every 5 ms
          • NumPkt, DataPkt, and DataBytes: logarithmic bins
          • WinMax and WinMin: one bin for each multiple N of 4 Kb
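
The three binning rules on this slide can be sketched as follows. Several details are assumptions made for illustration only: base-10 logarithmic bins labeled by exponent range, window sizes expressed in bytes with a 4096-byte step, and all function names. The slides do not state the log base, the units, or the bin labels SeARuM actually uses.

```python
def rtt_bin(rtt_ms, width_ms=5):
    """Fixed-width bin for RTT: one bin every `width_ms` milliseconds."""
    low = int(rtt_ms // width_ms) * width_ms
    return f"{low}-{low + width_ms}"


def log_bin(value, base=10):
    """Logarithmic bin labeled by exponent range (base 10 is an assumption)."""
    if value < 1:
        return "0"
    exponent, bound = 0, base
    # Integer loop instead of math.log to avoid floating-point edge cases.
    while value >= bound:
        bound *= base
        exponent += 1
    return f"{exponent}-{exponent + 1}"


def window_bin(window_bytes, step=4096):
    """Index N of the 4 Kb multiple the window size falls into (units assumed)."""
    return window_bytes // step


def discretize(flow):
    """Apply the per-feature binning rules to one flow record (a dict)."""
    return {
        "RTT": rtt_bin(flow["RTT"]),
        "NumPkt": log_bin(flow["NumPkt"]),
        "DataPkt": log_bin(flow["DataPkt"]),
        "DataBytes": log_bin(flow["DataBytes"]),
        "WinMax": window_bin(flow["WinMax"]),
        "WinMin": window_bin(flow["WinMin"]),
    }


# Hypothetical flow record, not taken from the datasets in the talk.
flow = {"RTT": 12.3, "NumPkt": 42, "DataPkt": 17,
        "DataBytes": 25000, "WinMax": 65535, "WinMin": 8192}
print(discretize(flow))
```
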
  • 13. SeARuM - Pre-processing
      • Map-only job
      • Input: flow-level HTTP logs
      • Feature selection
      • Value discretization
      • Transactional format conversion
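
As a rough idea of what the transactional format conversion might look like, the sketch below turns one discretized flow record into a list of `Feature=value` items. The record layout, the `to_transaction` name, and the space-separated output line are all assumptions; the slides only say that a conversion takes place.

```python
def to_transaction(record):
    """Turn {'Port': 80, ...} into a sorted list of 'Feature=value' items.

    Features with a missing value (None) are skipped.
    """
    return sorted(
        f"{name}={value}" for name, value in record.items() if value is not None
    )


# Hypothetical discretized record.
record = {"Port": 80, "RTT": "10-15", "Class": "HTTP", "Hop": None}
line = " ".join(to_transaction(record))
print(line)  # Class=HTTP Port=80 RTT=10-15
```
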
  • 14. SeARuM - Item Frequency
      • Counting the support value of all items
      • Discover the items’ vocabulary
      • Mapper
          • Input: a transaction
          • Output: pair<k=item, v=1>
      • Reducer
          • Input: pair<k=item, v=set(1’s)>
          • Output: pair<k=item, v=sum(1’s)>
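
The mapper and reducer described on this slide follow the classic counting pattern. The sketch below simulates it in plain Python on a single machine; `run_job` is a hypothetical stand-in for the Hadoop shuffle phase, and none of this is SeARuM's actual code.

```python
from collections import defaultdict


def mapper(transaction):
    """Emit <item, 1> for every item of one transaction."""
    for item in transaction:
        yield item, 1


def reducer(item, ones):
    """Sum the 1's collected for one item to get its absolute support."""
    return item, sum(ones)


def run_job(transactions):
    """Simulate map, shuffle (group by key), and reduce in memory."""
    grouped = defaultdict(list)
    for transaction in transactions:
        for key, value in mapper(transaction):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical toy input.
transactions = [
    ["PORT=80", "CLASS=HTTP"],
    ["PORT=80", "CLASS=HTTP"],
    ["PORT=443", "CLASS=SSL"],
]
counts = run_job(transactions)
print(counts)
```

The keys of the result double as the item vocabulary mentioned on the slide.
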
  • 15. SeARuM - Itemset Mining
      • Exploiting the parallel FP-growth algorithm [2]
      • All the items are split into groups
      • Mapper
          • Generating group-dependent transactions
      • Reducer
          • FP-Growth on group-dependent shards
      [2] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, "PFP: parallel FP-growth for query recommendation", in Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys '08).
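
The mapper step, generating group-dependent transactions, can be sketched along the lines of the PFP paper [2]: items are ordered by descending frequency, and for each group the mapper emits the transaction prefix ending at that group's right-most item. The `rank` and `group_of` tables below are hypothetical inputs (in practice they would come from the item frequency job and a grouping scheme that the slides do not specify), and the reducer's FP-Growth step is not shown.

```python
def group_dependent_transactions(transaction, rank, group_of):
    """Return (group_id, prefix) pairs for one transaction.

    rank:     item -> position in the frequency-sorted item list (0 = most frequent)
    group_of: item -> id of the group the item was assigned to
    """
    # Drop infrequent items and order the rest from most to least frequent.
    items = sorted((i for i in transaction if i in rank), key=rank.__getitem__)
    emitted = set()
    shards = []
    # Scan from the least frequent item; emit at most one prefix per group.
    for j in range(len(items) - 1, -1, -1):
        gid = group_of[items[j]]
        if gid not in emitted:
            emitted.add(gid)
            shards.append((gid, items[: j + 1]))
    return shards


# Hypothetical frequency ranks and a two-group split.
rank = {"a": 0, "b": 1, "c": 2, "d": 3}
group_of = {"a": 0, "b": 0, "c": 1, "d": 1}
print(group_dependent_transactions({"d", "a", "c", "z"}, rank, group_of))
# [(1, ['a', 'c', 'd']), (0, ['a'])]
```

Each reducer then receives all prefixes for one group and can mine them independently of the other groups.
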
  • 16. SeARuM - Rule Extraction
      • Rules with a consequent made of a single item
      • Mapper
          • For each l-itemset emits <k=l-itemset, v=support>
          • For each (l-1)-itemset emits <k=(l-1)-itemset, v=(l-itemset, support)>
      • Reducer
          • Generating association rules
          • Computing support, confidence and lift
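
One way to read the mapper and reducer on this slide is sketched below. Each frequent itemset is emitted once under its own key and once under each of its (l-1)-subsets; the reducer for an antecedent then sees the antecedent's own support together with every superset that extends it by one item. The `"self"`/`"super"` tags, the `item_support` side table used for lift, and the in-memory `run_job` driver are assumptions made for the example, not details confirmed by the slides.

```python
from collections import defaultdict


def mapper(itemset, support):
    """Emit the itemset itself, plus one record per (l-1)-subset."""
    yield itemset, ("self", support)
    if len(itemset) > 1:
        for item in itemset:
            yield itemset - {item}, ("super", (itemset, support))


def reducer(antecedent, values, item_support):
    """Build rules antecedent -> single item with support, confidence, lift."""
    own_support, supersets = None, []
    for tag, payload in values:
        if tag == "self":
            own_support = payload
        else:
            supersets.append(payload)
    rules = []
    if own_support is None:
        return rules
    for superset, rule_support in supersets:
        (consequent,) = superset - antecedent
        confidence = rule_support / own_support
        lift = confidence / item_support[consequent]
        rules.append((antecedent, consequent, rule_support, confidence, lift))
    return rules


def run_job(frequent, item_support):
    """Simulate map, shuffle, and reduce in memory."""
    grouped = defaultdict(list)
    for itemset, support in frequent.items():
        for key, value in mapper(itemset, support):
            grouped[key].append(value)
    rules = []
    for key, values in grouped.items():
        rules.extend(reducer(key, values, item_support))
    return rules


# Hypothetical frequent itemsets with relative supports.
frequent = {
    frozenset({"a"}): 0.6,
    frozenset({"b"}): 0.5,
    frozenset({"a", "b"}): 0.4,
}
item_support = {"a": 0.6, "b": 0.5}
for rule in run_job(frequent, item_support):
    print(rule)
```
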
  • 17. SeARuM - Rule Aggregation and Sorting
      • To help in analyzing the extracted rules:
          • Aggregation by rule consequent
          • Sorting by quality measures
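
A minimal sketch of this last step: group rules by consequent, then sort each group by quality. The slide does not say which measures drive the ordering, so sorting by confidence and then lift (both descending) is an assumption, as are the rule dictionaries and their values.

```python
from collections import defaultdict


def aggregate_and_sort(rules):
    """Group rules by consequent; sort each group by confidence, then lift."""
    by_consequent = defaultdict(list)
    for rule in rules:
        by_consequent[rule["consequent"]].append(rule)
    for group in by_consequent.values():
        group.sort(key=lambda r: (r["confidence"], r["lift"]), reverse=True)
    return dict(by_consequent)


# Hypothetical rules with made-up quality values.
rules = [
    {"antecedent": ("Port=80",), "consequent": "Class=HTTP", "confidence": 0.90, "lift": 1.5},
    {"antecedent": ("Port=80", "DataPkt=1-2"), "consequent": "Class=HTTP", "confidence": 0.99, "lift": 1.7},
    {"antecedent": ("Class=SSL",), "consequent": "Port=443", "confidence": 0.95, "lift": 4.8},
]
for consequent, group in aggregate_and_sort(rules).items():
    print(consequent, [r["antecedent"] for r in group])
```
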
  • 18. Experimental Validation
      • Cluster of 5 nodes running Cloudera’s distribution of Apache Hadoop
          • 2.67 GHz six-core Intel(R) Xeon(R) X5650
          • 32 Gbyte of main memory
          • Ubuntu 12.04 server
      • Real network traffic traces

      Dataset   Number of TCP flows   Size [Gbyte]
      D1                 11,325,006           5.28
      D2                413,012,989         192.56
  • 19. Experimental Validation
      [Figure: Execution time distribution]
          • Job 1 (data pre-processing): time 46%, volume 413M
          • Job 2 (item frequency computation): time 27%, volume 151M
          • Job 3 (itemset mining): time 25%, volume 151M
          • Job 4 (rule extraction) and Job 5 (rule aggregation and sorting): the two remaining slices, labeled 1% each
      [Figure: Speedup. x-axis: number of nodes (1, 3, 5); y-axis: speedup]
  • 20. Experimental Validation
      [Figure: Extracted itemsets (D2). x-axis: MinSup (%), 30 to 50; y-axis: #itemsets]
      [Figure: Number of rules (D2). x-axis: MinConf (%), 50 to 90; y-axis: #rules; one series per MinSup = 30%, 35%, 40%, 45%, 50%]
  • 21. Experimental Validation
      [Figure: Extracted itemsets (D1). x-axis: MinSup (%), 30 to 50; y-axis: #itemsets]
      [Figure: Number of rules (D1). x-axis: MinConf (%), 50 to 90; y-axis: #rules; one series per MinSup = 30%, 35%, 40%, 45%, 50%]
  • 22. Experimental Validation
      • {Port = 80, P{reord} = 0−0.1, DataPkt = 1−2, DataBytes = 4−5} → {Class = HTTP}
        support = 31.3%, confidence = 99.9%, lift = 1.765
      • {P{dup} = 0−0.1, NumPkt <= 1, DataPkt <= 1, Class = SSL} → {Port = 443}
        support = 1.3%, confidence = 99.3%, lift = 4.944
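
The figures on this slide allow a quick back-of-the-envelope check. Assuming lift is defined as confidence divided by the support of the consequent (the usual definition, though the slides do not state it), the consequent's support can be recovered from the two published numbers. The helper name is hypothetical and the results are only as precise as the rounded values on the slide.

```python
def implied_consequent_support(confidence, lift):
    """Consequent support implied by lift = confidence / supp(consequent)."""
    return confidence / lift


# Rule 1: ... -> {Class = HTTP}, confidence 99.9%, lift 1.765
print(round(implied_consequent_support(0.999, 1.765), 3))  # 0.566

# Rule 2: ... -> {Port = 443}, confidence 99.3%, lift 4.944
print(round(implied_consequent_support(0.993, 4.944), 3))  # 0.201
```

Read this way, roughly 57% of the flows would be classified as HTTP and about 20% would use port 443. That explains why the second rule, despite its much lower support, has the higher lift.
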
  • 23. Conclusions and Future Work
      • Cloud-based service for association rule mining
          • Horizontal scalability
          • Meaningfulness of the extracted knowledge
      • Ongoing/future work (mPlane Project)
          • Optimizing the MapReduce workflow
          • Supporting the domain expert
  • 24. Thanks for your attention! Questions?
      Luigi Grimaudo – luigi.grimaudo@polito.it