Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing


Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing presented at PDPTA 2011 (http://www.world-academy-of-science.org/worldcomp11/ws/conferences/pdpta11)


  1. 1. Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing 2011 PDPTA Jongwook Woo, PhD [email_address] High-Performance Internet Computing Center (HiPIC) Computer Information Systems Department California State University, Los Angeles
  2. 2. Contents <ul><li>Map/Reduce Brief Introduction </li></ul><ul><li>Market Basket Analysis </li></ul><ul><li>Map/Reduce Algorithm for MBA </li></ul><ul><li>Experimental Result </li></ul><ul><li>Conclusion </li></ul>
  3. 3. What is Map/Reduce Cloud Computing Cloudera HortonWorks AWS Parallel Computing
  4. 4. Have you heard about Cloud Computing? <ul><li>First Impression </li></ul><ul><ul><li>In late 2007, the New York Times wanted to make its entire archive of articles available over the web, </li></ul></ul><ul><ul><ul><li>11 million in all, dating back to 1851. </li></ul></ul></ul><ul><ul><ul><li>a four-terabyte pile of images in TIFF format. </li></ul></ul></ul><ul><ul><ul><li>needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files. </li></ul></ul></ul><ul><ul><ul><ul><li>not a particularly complicated computing chore, but a large one, </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>requiring a whole lot of computer processing time. </li></ul></ul></ul></ul></ul><ul><ul><ul><li>a software programmer at the Times, Derek Gottfrid, </li></ul></ul></ul><ul><ul><ul><ul><li>playing around with Amazon Web Services, Elastic Compute Cloud (EC2), </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3) </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>In less than 24 hours, 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site. </li></ul></ul></ul></ul></ul><ul><ul><ul><li>The total cost for the computing job? $240 </li></ul></ul></ul><ul><ul><ul><ul><li>10 cents per computer-hour times 100 computers times 24 hours </li></ul></ul></ul></ul>
  5. 5. What is MapReduce <ul><li>Functions borrowed from functional programming languages (e.g., Lisp) </li></ul><ul><li>Provides a restricted parallel programming model </li></ul><ul><ul><li>User implements Map() and Reduce() </li></ul></ul><ul><ul><li>Libraries (Hadoop) take care of EVERYTHING else </li></ul></ul><ul><ul><ul><li>Parallelization </li></ul></ul></ul><ul><ul><ul><li>Fault Tolerance </li></ul></ul></ul><ul><ul><ul><li>Data Distribution </li></ul></ul></ul><ul><ul><ul><li>Load Balancing </li></ul></ul></ul><ul><li>Useful for huge (peta- or terabyte-scale) but structurally simple data </li></ul><ul><ul><li>New York Times case </li></ul></ul><ul><ul><li>Log files for web companies </li></ul></ul>
  6. 6. Map <ul><li>Convert data to (key, value) pairs </li></ul><ul><li>map() functions run in parallel, </li></ul><ul><ul><li>creating different intermediate values from different input data sets </li></ul></ul>
  7. 7. Reduce <ul><li>reduce() combines those intermediate values into one or more final values for that same output key </li></ul><ul><li>reduce() functions also run in parallel, </li></ul><ul><ul><li>each working on a different output key </li></ul></ul><ul><li>Bottleneck: </li></ul><ul><ul><li>reduce phase can’t start until map phase is completely finished. </li></ul></ul>
  8. 8. Example: Sort URLs by hit count, largest first <ul><li>Map() </li></ul><ul><ul><li>Input <logFilename, file text> </li></ul></ul><ul><ul><li>Parses the file and emits <url, hit count> pairs </li></ul></ul><ul><ul><ul><li>e.g., <http://hello.com, 1> </li></ul></ul></ul><ul><li>Reduce() </li></ul><ul><ul><li>Sums all values for the same key and emits <url, TotalCount> </li></ul></ul><ul><ul><ul><li>e.g., <http://hello.com, (3 5 2 7)> => <http://hello.com, 17> </li></ul></ul></ul>
  9. 9. Market Basket Analysis (MBA) <ul><li>Collect the list of item pairs that most frequently occur together in transactions at one or more stores </li></ul><ul><li>Traditional Business Intelligence Analysis </li></ul><ul><ul><li>much better opportunity to make a profit by controlling the order of products and marketing </li></ul></ul><ul><ul><ul><li>control the stocks more intelligently </li></ul></ul></ul><ul><ul><ul><li>arrange items on shelves </li></ul></ul></ul><ul><ul><ul><li>promote items together etc. </li></ul></ul></ul>
  10. 10. Market Basket Analysis (MBA) <ul><li>Transactions in Store A: Input data </li></ul><ul><ul><li>Transaction 1: cracker, icecream, beer </li></ul></ul><ul><ul><li>Transaction 2: chicken, pizza, coke, bread </li></ul></ul><ul><ul><li>Transaction 3: baguette, soda, herring, cracker, beer </li></ul></ul><ul><ul><li>Transaction 4: bourbon, coke, turkey </li></ul></ul><ul><ul><li>Transaction 5: sardines, beer, chicken, coke </li></ul></ul><ul><ul><li>Transaction 6: apples, peppers, avocado, steak </li></ul></ul><ul><ul><li>Transaction 7: sardines, apples, peppers, avocado, steak </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Which pairs of items do people frequently buy together at Store A? </li></ul>
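The URL example on this slide can be sketched in plain Python. This is only an in-memory illustration of the map, group, and reduce steps, not Hadoop code: the function names `map_hits`, `reduce_hits`, and `run_job`, the sample log contents, and the assumption that each log line holds exactly one URL are all hypothetical.

```python
from collections import defaultdict

def map_hits(log_filename, file_text):
    """Map: emit a (url, 1) pair for every request line in one log file."""
    for line in file_text.splitlines():
        url = line.strip()
        if url:  # assumption: each non-empty line holds exactly one requested URL
            yield (url, 1)

def reduce_hits(url, counts):
    """Reduce: sum all counts that share the same URL key."""
    return (url, sum(counts))

def run_job(logs):
    """In-memory stand-in for the framework: map, group by key, reduce, sort."""
    grouped = defaultdict(list)
    for log_filename, file_text in logs.items():
        for url, count in map_hits(log_filename, file_text):
            grouped[url].append(count)
    totals = [reduce_hits(url, counts) for url, counts in grouped.items()]
    # Largest hit count first, as in the slide title.
    return sorted(totals, key=lambda pair: pair[1], reverse=True)

# Hypothetical sample input: two small log files.
logs = {
    "a.log": "http://hello.com\nhttp://world.org\nhttp://hello.com\n",
    "b.log": "http://hello.com\n",
}
ranked = run_job(logs)
print(ranked)
```

In a real cluster the grouping inside `run_job` is done by the framework between the map and reduce phases; only the two small functions would be user code.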
  11. 11. Map Algorithm <ul><li>1: Read each transaction of the input file and generate the data set of its items: </li></ul><ul><ul><li>(<V_1>, <V_2>, …, <V_n>) where <V_n>: (v_n1, v_n2, …, v_nm) </li></ul></ul><ul><li>2: Sort each data set <V_n> to generate the sorted data set <U_n>: </li></ul><ul><ul><li>(<U_1>, <U_2>, …, <U_n>) where <U_n>: (u_n1, u_n2, …, u_nm) </li></ul></ul><ul><li>3: Loop over each item from u_n1 to u_nm of <U_n> </li></ul><ul><ul><li>3.a: generate the data set <Y_n>: (y_n1, y_n2, …, y_nl); y_nl: (u_nx, u_ny) where u_nx ≠ u_ny </li></ul></ul><ul><ul><li>3.b: increment the occurrence of y_nl; </li></ul></ul><ul><ul><li>note: (key, value) = (y_nl, number of occurrences) </li></ul></ul><ul><li>4: The data set is created as input to the Reducer: </li></ul><ul><ul><li>(key, <value>) = (y_nl, <number of occurrences>) </li></ul></ul>
  12. 12. Reduce Algorithm <ul><li>1: Take (y_nl, <number of occurrences>) as input data from multiple Map nodes </li></ul><ul><li>2: Add the values for y_nl to produce (y_nl, total number of occurrences) as output </li></ul>
  13. 13. Market Basket Analysis (Cont’d) <ul><li>Transactions in Store A </li></ul><ul><ul><li>Transaction 1: cracker, icecream, beer </li></ul></ul><ul><ul><li>Transaction 2: chicken, pizza, coke, bread </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Distribute Transaction data to Map nodes </li></ul><ul><li>Items restructured into pairs in each Map node </li></ul><ul><ul><li>Transaction 1: <(cracker, icecream), (cracker, beer), (beer, icecream)> </li></ul></ul><ul><ul><li>Transaction 2: <(chicken, pizza), (chicken, coke), (chicken, bread), (coke, pizza), (bread, pizza), (coke, bread)> </li></ul></ul><ul><ul><li>… </li></ul></ul>
  14. 14. Market Basket Analysis (Cont’d) <ul><li>Note: the items within each pair should be sorted, because the pair becomes a key </li></ul><ul><ul><li>For example, both (cracker, icecream) and (icecream, cracker) should become (cracker, icecream) </li></ul></ul><ul><li>Pairs of Items sorted in MBA </li></ul><ul><ul><li>Transaction 1: <(cracker, icecream), (beer, cracker), (beer, icecream)> </li></ul></ul><ul><ul><li>Transaction 2: <(chicken, pizza), (chicken, coke), (bread, chicken), (coke, pizza), (bread, pizza), (bread, coke)> </li></ul></ul><ul><ul><li>… </li></ul></ul>
  15. 15. Market Basket Analysis (Cont’d) <ul><li>Output of Map node </li></ul><ul><ul><li>Pairs of Items in (key, value) structure in each Map node </li></ul></ul><ul><ul><li>(key, value): (pair of items, number of occurrences) </li></ul></ul><ul><ul><li>((cracker, icecream), 1) </li></ul></ul><ul><ul><li>((beer, cracker), 1) </li></ul></ul><ul><ul><li>((beer, icecream), 1) </li></ul></ul><ul><ul><li>((chicken, pizza), 1) </li></ul></ul><ul><ul><li>((chicken, coke), 1) </li></ul></ul><ul><ul><li>((bread, chicken), 1) </li></ul></ul><ul><ul><li>((coke, pizza), 1) </li></ul></ul><ul><ul><li>((bread, pizza), 1) </li></ul></ul><ul><ul><li>((bread, coke), 1) </li></ul></ul><ul><ul><li>… </li></ul></ul>
  16. 16. Market Basket Analysis (Cont’d) <ul><li>Data Aggregation/Combine </li></ul><ul><ul><li>(key, <value>): (pair of items, list of occurrence counts) </li></ul></ul><ul><ul><li>((cracker, icecream), <1, 1, …, 1>) </li></ul></ul><ul><ul><li>((beer, cracker), <1, 1, …, 1>) </li></ul></ul><ul><ul><li>((beer, icecream), <1, 1, …, 1>) </li></ul></ul><ul><ul><li>((chicken, pizza), <1, 1, …, 1>) </li></ul></ul><ul><ul><li>… </li></ul></ul>
  17. 17. Market Basket Analysis (Cont’d) <ul><li>Reduce nodes </li></ul><ul><ul><li>(key, value): (pair of items, total number of occurrences) </li></ul></ul><ul><ul><li>((cracker, icecream), 421) </li></ul></ul><ul><ul><li>((beer, cracker), 341) </li></ul></ul><ul><ul><li>((beer, icecream), 231) </li></ul></ul><ul><ul><li>((chicken, pizza), 111) </li></ul></ul><ul><ul><li>… </li></ul></ul>
  18. 18. Map/Reduce for MBA [Diagram: Input Trax Data is distributed to Map 1 () … Map m (), which emit pairs such as ((coke, pizza), 1), ((beer, corn), 1), and ((ham, juice), 1); Data Aggregation/Combine groups them into ((coke, pizza), <1, 1, …, 1>) and ((ham, juice), <1, 1, …, 1>); Reduce 1 () … Reduce l () output ((coke, pizza), 3,421) and ((ham, juice), 2,346)]
  19. 19. Experimental Result <ul><li>5 transaction files for the experiment: </li></ul><ul><ul><li>400 MB (6.7M transactions), 800 MB (13M transactions), 1.6 GB (26M transactions). </li></ul></ul><ul><li>run on small instances of AWS EC2 </li></ul><ul><ul><li>each node has a 1.0-1.2 GHz 2007 Opteron or Xeon processor </li></ul></ul><ul><ul><li>1.7 GB memory </li></ul></ul><ul><ul><li>160 GB storage on a 32-bit platform. </li></ul></ul><ul><li>The data are processed on 2, 5, 10, 15, and 20 nodes </li></ul>
  20. 20. Experimental Result <ul><li>Execution time (sec) </li></ul><table><tr><th>Nodes</th><th>6.7M (400MB)</th><th>13M (800MB)</th><th>26M (1.6GB)</th></tr><tr><td>2</td><td>9,133</td><td>NA</td><td>NA</td></tr><tr><td>5</td><td>5,442</td><td>8,717</td><td>15,963</td></tr><tr><td>10</td><td>2,910</td><td>5,998</td><td>8,845</td></tr><tr><td>15</td><td>2,792</td><td>2,917</td><td>5,898</td></tr><tr><td>20</td><td>2,868</td><td>2,911</td><td>5,671</td></tr></table>
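The Map and Reduce steps above can be sketched as two small Python functions. This is a minimal illustration under stated assumptions, not the presenter's Hadoop implementation: `map_transaction` and `reduce_pair` are hypothetical names, and dropping repeated items within a transaction via `set()` is an assumption the slides do not state.

```python
from itertools import combinations

def map_transaction(transaction):
    """Map steps 1-3: read one transaction, sort its items, emit each pair once."""
    items = sorted(set(transaction))      # step 2; set() drops repeated items (assumption)
    for pair in combinations(items, 2):   # step 3.a: every pair (u_x, u_y) with u_x != u_y
        yield (pair, 1)                   # step 3.b: one occurrence per transaction

def reduce_pair(pair, occurrences):
    """Reduce steps 1-2: add up the occurrences collected for one pair."""
    return (pair, sum(occurrences))

# Transaction 1 from the Store A example.
emitted = list(map_transaction(["cracker", "icecream", "beer"]))
print(emitted)
print(reduce_pair(("beer", "cracker"), [1, 1, 1]))
```

Because the items are sorted before pairing, every emitted key is already in a canonical order, so (beer, cracker) and (cracker, beer) can never appear as two different keys.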
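The three stages walked through on the preceding slides (Map output, Data Aggregation/Combine, Reduce) can be traced end to end on the seven Store A transactions. The sketch below runs everything in one process purely for illustration; the threshold of two occurrences used to pick "frequent" pairs is an assumption, and the totals it produces are for this tiny sample, not the counts shown on the slides.

```python
from collections import defaultdict
from itertools import combinations

# The seven sample transactions from the Store A slide.
transactions = [
    ["cracker", "icecream", "beer"],
    ["chicken", "pizza", "coke", "bread"],
    ["baguette", "soda", "herring", "cracker", "beer"],
    ["bourbon", "coke", "turkey"],
    ["sardines", "beer", "chicken", "coke"],
    ["apples", "peppers", "avocado", "steak"],
    ["sardines", "apples", "peppers", "avocado", "steak"],
]

# Map: each transaction emits ((item_a, item_b), 1) with the pair sorted.
map_output = [
    (pair, 1)
    for transaction in transactions
    for pair in combinations(sorted(set(transaction)), 2)
]

# Data Aggregation/Combine: collect the values of identical keys into one list.
aggregated = defaultdict(list)
for pair, count in map_output:
    aggregated[pair].append(count)

# Reduce: sum each list to get the total number of occurrences per pair.
totals = {pair: sum(counts) for pair, counts in aggregated.items()}

# Pairs bought together at least twice (threshold chosen for illustration only).
frequent = sorted(pair for pair, total in totals.items() if total >= 2)
print(frequent)
```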
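One way to read these timings is as speedup relative to the smallest cluster that completed each data set. The sketch below assumes the flattened numbers on this slide map to node counts 2, 5, 10, 15, and 20 and to the three file sizes in the order implied by the trailing labels; the `speedups` helper is a hypothetical name, and `None` stands for the "NA" entries.

```python
# Execution times in seconds, copied from the slide; None marks the "NA" entries.
times = {
    "6.7M (400MB)": {2: 9133, 5: 5442, 10: 2910, 15: 2792, 20: 2868},
    "13M (800MB)": {2: None, 5: 8717, 10: 5998, 15: 2917, 20: 2911},
    "26M (1.6GB)": {2: None, 5: 15963, 10: 8845, 15: 5898, 20: 5671},
}

def speedups(by_nodes):
    """Speedup of each run relative to the smallest node count that has a timing."""
    measured = {n: t for n, t in by_nodes.items() if t is not None}
    baseline = measured[min(measured)]
    return {n: round(baseline / t, 2) for n, t in sorted(measured.items())}

for dataset, by_nodes in times.items():
    print(dataset, speedups(by_nodes))
```

Under that reading, the gain from 15 to 20 nodes is small for every data set, which is consistent with the bottleneck noted in the conclusion.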
  21. 21. Experimental Result <ul><li>Execution time (sec) </li></ul>
  22. 22. Conclusion <ul><li>The Market Basket Analysis Algorithm on Map/Reduce is presented </li></ul><ul><ul><li>a data mining analysis to find the most frequently occurring pairs of products in baskets at a store. </li></ul></ul><ul><li>The associated items can be paired with the Map/Reduce approach. </li></ul><ul><ul><li>Once we have the paired items, they can be used for further studies by statistically analyzing them, even sequentially, which is beyond the scope of this paper </li></ul></ul><ul><li>There is a bottleneck in distributing, aggregating, and reducing the data set among nodes </li></ul>