Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing
Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing presented at PDPTA 2011 (http://www.world-academy-of-science.org/worldcomp11/ws/conferences/pdpta11)

Presentation Transcript

  • Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing
    • 2011 PDPTA
    • Jongwook Woo, PhD, [email_address]
    • High-Performance Internet Computing Center (HiPIC)
    • Computer Information Systems Department
    • California State University, Los Angeles
  • Contents
    • Map/Reduce Brief Introduction
    • Market Basket Analysis
    • Map/Reduce Algorithm for MBA
    • Experimental Result
    • Conclusion
  • What is Map/Reduce
    • Cloud Computing
    • Cloudera
    • HortonWorks
    • AWS
    • Parallel Computing
  • Have you heard about Cloud Computing?
    • First Impression
      • In late 2007, the New York Times wanted to make its entire archive of articles available over the web:
        • 11 million articles in all, dating back to 1851
        • stored as a four-terabyte pile of images in TIFF format
        • the four terabytes of TIFFs needed to be converted into more web-friendly PDF files
          • not a particularly complicated computing chore, but a large one,
            • requiring a whole lot of computer processing time
        • A software programmer at the Times, Derek Gottfrid,
          • was playing around with Amazon Web Services' Elastic Compute Cloud (EC2)
            • He uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3)
            • In less than 24 hours, he had 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site
        • The total cost for the computing job? $240
          • 10 cents per computer-hour times 100 computers times 24 hours
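The cost figure is plain arithmetic. A minimal sketch in Python, using only the numbers quoted on the slide (the variable names are illustrative, and the rate is kept in integer cents to avoid floating-point rounding):

```python
# Figures quoted on the slide; variable names are illustrative.
rate_cents_per_computer_hour = 10
computers = 100
hours = 24

total_cents = rate_cents_per_computer_hour * computers * hours
total_cost_usd = total_cents / 100

print(f"Total cost: ${total_cost_usd:.0f}")
```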
  • What is MapReduce
    • Functions borrowed from functional programming languages (e.g. Lisp)
    • Provides a restricted parallel programming model
      • User implements Map() and Reduce()
      • Libraries (Hadoop) take care of EVERYTHING else
        • Parallelization
        • Fault Tolerance
        • Data Distribution
        • Load Balancing
    • Useful for huge (petabyte- or terabyte-scale) but structurally simple data
      • New York Times case
      • Log file for web companies
  • Map
    • Convert data to (key, value) pairs
    • map() functions run in parallel,
      • creating different intermediate values from different input data sets
  • Reduce
    • reduce() combines those intermediate values into one or more final values for that same output key
    • reduce() functions also run in parallel,
      • each working on a different output key
    • Bottleneck:
      • reduce phase can’t start until map phase is completely finished.
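The two phases described above can be imitated on a single machine. The following is only a sequential sketch of the programming model, not Hadoop's implementation; `run_mapreduce`, `map_len`, and `reduce_sum` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Sequentially simulate the map, group-by-key, and reduce phases."""
    # Map phase: each record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Grouping: collect all values that share a key. This can only be
    # completed once every map call has finished, which is the bottleneck
    # noted above.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: each key is reduced independently of the others.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical example: count how many words have each length.
def map_len(word):
    yield (len(word), 1)


def reduce_sum(key, values):
    return sum(values)


length_counts = run_mapreduce(["map", "reduce", "key", "value"], map_len, reduce_sum)
print(length_counts)
```

In a real framework the map calls and the per-key reduce calls would run on different nodes; here they run in a loop so that the data flow is easy to follow.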
  • Example: Sort URLs by hit count, largest first
    • Map()
      • Input <logFilename, file text>
      • Parses file and emits <url, hit counts> pairs
        • e.g. <http://hello.com, 1>
    • Reduce()
      • Sums all values for the same key and emits <url, TotalCount>
        • eg. <http://hello.com, (3 5 2 7)> => <http://hello.com, 17>
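A small sketch of this example in Python follows. It assumes, purely for illustration, that each log file lists one URL per line; the file names, the log contents, and the function names `map_log` and `reduce_hits` are all made up. The final sort by total count is done after the reduce step.

```python
from collections import defaultdict


def map_log(log_filename, file_text):
    """Emit a (url, 1) pair for every non-empty line of the log text."""
    for line in file_text.splitlines():
        url = line.strip()
        if url:
            yield (url, 1)


def reduce_hits(url, counts):
    """Sum all counts emitted for one URL."""
    return (url, sum(counts))


# Made-up input: two tiny log files with one URL per line.
logs = {
    "a.log": "http://hello.com\nhttp://example.org\nhttp://hello.com\n",
    "b.log": "http://hello.com\nhttp://example.org\n",
}

# Group the mapper output by URL, as the framework would do.
grouped = defaultdict(list)
for name, text in logs.items():
    for url, count in map_log(name, text):
        grouped[url].append(count)

totals = [reduce_hits(url, counts) for url, counts in grouped.items()]
ranked = sorted(totals, key=lambda pair: pair[1], reverse=True)
print(ranked)
```

The slide's example values (3 5 2 7) suggest that each mapper may already emit partial counts per file; the reducer above works unchanged in that case because it only sums whatever values it receives.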
  • Market Basket Analysis (MBA)
    • Find the pairs of items that most frequently occur together in transactions at one or more stores
    • Traditional Business Intelligence Analysis
      • a much better opportunity to make a profit by controlling product ordering and marketing
        • control stock more intelligently
        • arrange items on shelves
        • promote items together, etc.
  • Market Basket Analysis (MBA)
    • Transactions in Store A: Input data
      • Transaction 1: cracker, icecream, beer
      • Transaction 2: chicken, pizza, coke, bread
      • Transaction 3: baguette, soda, herring, cracker, beer
      • Transaction 4: bourbon, coke, turkey
      • Transaction 5: sardines, beer, chicken, coke
      • Transaction 6: apples, peppers, avocado, steak
      • Transaction 7: sardines, apples, peppers, avocado, steak
    • Which pairs of items do people frequently buy together at Store A?
  • Map Algorithm
    • 1: Read each transaction from the input file and generate the data sets of items:
      • $(V_1, V_2, \ldots, V_n)$, where $V_n = (v_{n1}, v_{n2}, \ldots, v_{nm})$
    • 2: Sort each data set $V_n$ to generate the sorted data set $U_n$:
      • $(U_1, U_2, \ldots, U_n)$, where $U_n = (u_{n1}, u_{n2}, \ldots, u_{nm})$
    • 3: Loop over each item from $u_{n1}$ to $u_{nm}$ of $U_n$:
      • 3.a: generate the data set $Y_n = (y_{n1}, y_{n2}, \ldots, y_{nl})$, where each $y_{nl} = (u_{nx}, u_{ny})$ with $u_{nx} \neq u_{ny}$
      • 3.b: increment the number of occurrences of $y_{nl}$
      • note: (key, value) = ($y_{nl}$, number of occurrences)
    • 4: The resulting data set becomes the input of the Reducer:
      • (key, <value>) = ($y_{nl}$, <number of occurrences>)
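The four map steps can be sketched for a single transaction as follows. This is an illustrative reading of the steps, not the author's code: the comma-separated input format and the name `map_transaction` are assumptions.

```python
from collections import Counter
from itertools import combinations


def map_transaction(line):
    """Turn one transaction line into ((item_a, item_b), count) pairs."""
    # Step 1: read the items of the transaction
    # (a comma-separated line is an assumption about the input format).
    items = {item.strip() for item in line.split(",") if item.strip()}
    # Step 2: sort the items.
    ordered = sorted(items)
    # Step 3: pair each item with every later item and count occurrences.
    counts = Counter(combinations(ordered, 2))
    # Step 4: emit (key, value) pairs as input for the reducer.
    return list(counts.items())


emitted = map_transaction("cracker, icecream, beer")
print(emitted)
```

Because the items are sorted before pairing, every emitted key already has its two items in alphabetical order, so the same pair always produces the same key regardless of the order in which the items were scanned.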
  • Reduce Algorithm
    • 1: Take ($y_{nl}$, <number of occurrences>) as input data from multiple Map nodes
    • 2: Add the values for $y_{nl}$ to produce ($y_{nl}$, total number of occurrences) as output
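The reduce step amounts to a sum per key. A minimal sketch, with the hypothetical name `reduce_pair` and made-up input values standing in for the counts received from several Map nodes:

```python
def reduce_pair(pair, occurrences):
    """Add up all occurrence counts received for one item pair."""
    return (pair, sum(occurrences))


# Made-up counts for one key, as they might arrive from three Map nodes.
result = reduce_pair(("beer", "cracker"), [1, 1, 1])
print(result)
```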
  • Market Basket Analysis (Cont’d)
    • Transactions in Store A
      • Transaction 1: cracker, icecream, beer
      • Transaction 2: chicken, pizza, coke, bread
    • Distribute Transaction data to Map nodes
    • Pair of Items restructured in each Map node
      • Transaction 1: <(cracker, icecream), (cracker, beer), (beer, icecream)>
      • Transaction 2: <(chicken, pizza), (chicken, coke), (chicken, bread), (coke, pizza), (bread, pizza), (coke, bread)>
  • Market Basket Analysis (Cont’d)
    • Note: the two items within each pair should be sorted, because the pair becomes a key
      • For example, (cracker, icecream) and (icecream, cracker) should both become (cracker, icecream)
    • Pairs of items sorted in MBA
      • Transaction 1: <(cracker, icecream), (beer, cracker), (beer, icecream)>
      • Transaction 2: <(chicken, pizza), (chicken, coke), (bread, chicken), (coke, pizza), (bread, pizza), (bread, coke)>
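The reason for sorting within a pair is that the pair is used as a key, and two keys must compare equal to be counted together. A minimal sketch, with `make_key` as a hypothetical helper name:

```python
def make_key(item_a, item_b):
    """Return the pair with its two items in alphabetical order."""
    return tuple(sorted((item_a, item_b)))


key_1 = make_key("icecream", "cracker")
key_2 = make_key("cracker", "icecream")
same_key = key_1 == key_2
print(key_1, key_2, same_key)
```

Without this normalization, (cracker, icecream) and (icecream, cracker) would be reduced separately and each would get only part of the true count.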
  • Market Basket Analysis (Cont’d)
    • Output of Map node
      • Pair of Items in (key, value) structure in each Map node
      • (key, value): (pair of items, number of occurrences)
      • ((cracker, icecream), 1)
      • ((beer, cracker), 1)
      • ((beer, icecream), 1)
      • ((chicken, pizza), 1)
      • ((chicken, coke), 1)
      • ((bread, chicken), 1)
      • ((coke, pizza), 1)
      • ((bread, pizza), 1)
      • ((bread, coke), 1)
  • Market Basket Analysis (Cont’d)
    • Data Aggregation/Combine
      • (key, <value>): (pair of items, list of occurrence counts)
      • ((cracker, icecream), <1, 1, …, 1>)
      • ((beer, cracker), <1, 1, …, 1>)
      • ((beer, icecream), <1, 1, …, 1>)
      • ((chicken, pizza), <1, 1, …, 1>)
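The aggregation step collects, for every key, the values emitted by all Map nodes. In Hadoop the framework does this between the map and reduce phases; the sketch below only imitates the result in memory, and `group_by_key`, `node_1`, and `node_2` are hypothetical names with made-up data.

```python
from collections import defaultdict


def group_by_key(map_outputs):
    """Merge the outputs of several Map nodes into (key, <values>) form."""
    grouped = defaultdict(list)
    for output in map_outputs:  # one list of (key, value) pairs per Map node
        for key, value in output:
            grouped[key].append(value)
    return dict(grouped)


# Made-up outputs from two Map nodes.
node_1 = [(("beer", "cracker"), 1), (("cracker", "icecream"), 1)]
node_2 = [(("beer", "cracker"), 1)]

grouped = group_by_key([node_1, node_2])
print(grouped)
```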
  • Market Basket Analysis (Cont’d)
    • Reduce nodes
      • (key, value): (pair of items, total number of occurrences)
      • ((cracker, icecream), 421)
      • ((beer, cracker), 341)
      • ((beer, icecream), 231)
      • ((chicken, pizza), 111)
  • Map/Reduce for MBA
    • Input transaction data is distributed to Map 1 (), Map 2 (), …, Map m ()
    • Map output: ((coke, pizza), 1), ((bear, corn), 1), …, ((ham, juice), 1), ((coke, pizza), 1), …
    • Data Aggregation/Combine: ((coke, pizza), <1, 1, …, 1>), ((ham, juice), <1, 1, …, 1>)
    • Reduce 1 (), Reduce 2 (), …, Reduce l (): ((coke, pizza), 3,421), ((ham, juice), 2,346)
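One detail the diagram leaves implicit is how a key reaches a particular Reduce node. A common approach is to hash the key and take the result modulo the number of reducers, so that all values for one key land on the same node. The sketch below illustrates that idea under stated assumptions: the CRC32-based `reducer_index` and the `partition` helper are hypothetical and do not reproduce Hadoop's partitioner.

```python
import zlib
from collections import defaultdict


def reducer_index(key, num_reducers):
    """Pick a reducer for a pair key with a stable hash (CRC32 here)."""
    text = ",".join(key).encode("utf-8")
    return zlib.crc32(text) % num_reducers


def partition(grouped, num_reducers):
    """Split (key, <values>) data into one shard per reducer."""
    shards = defaultdict(dict)
    for key, values in grouped.items():
        shards[reducer_index(key, num_reducers)][key] = values
    return shards


# Made-up aggregated data in (key, <values>) form.
grouped = {("coke", "pizza"): [1, 1, 1], ("ham", "juice"): [1, 1]}
shards = partition(grouped, num_reducers=2)

# Each reducer sums only the keys in its own shard.
totals = {}
for shard in shards.values():
    for key, values in shard.items():
        totals[key] = sum(values)
print(totals)
```

A stable hash is used instead of Python's built-in `hash` because string hashes in Python vary between interpreter runs, which would send the same key to different reducers in different processes.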
  • Experimental Result
    • 5 transaction files for the experiment:
      • 400MB (6.7M transactions), 800MB (13M transactions), 1.6GB (26M transactions).
    • run on small instances of AWS EC2
      • each node has a 1.0-1.2 GHz 2007 Opteron or Xeon processor
      • 1.7GB memory
      • 160GB storage on a 32-bit platform
    • The jobs are run on 2, 5, 10, 15, and 20 nodes
  • Experimental Result
    • Execution time (sec)
      Nodes   6.7M (400MB)   13M (800MB)   26M (1.6GB)
      2       9,133          NA            NA
      5       5,442          8,717         15,963
      10      2,910          5,998         8,845
      15      2,792          2,917         5,898
      20      2,868          2,911         5,671
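One way to read these timings is as speedup relative to the smallest cluster. The sketch below does this for the 6.7M-transaction (400MB) file only, using the execution times as read from the figures above; the choice of the 2-node run as the baseline is an assumption made for illustration.

```python
# Execution times in seconds for the 6.7M-transaction (400MB) file,
# keyed by number of nodes, as read from the figures above.
times_400mb = {2: 9133, 5: 5442, 10: 2910, 15: 2792, 20: 2868}

baseline_nodes = 2  # assumption: measure speedup against the 2-node run
speedup = {
    nodes: times_400mb[baseline_nodes] / seconds
    for nodes, seconds in times_400mb.items()
}

for nodes, factor in sorted(speedup.items()):
    print(f"{nodes:>2} nodes: {factor:.2f}x")
```

For this file the time stops improving beyond 15 nodes (the 20-node run is slightly slower), which is consistent with the bottleneck mentioned in the conclusion.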
  • Experimental Result
    • [Figure: Execution time (sec)]
  • Conclusion
    • The Market Basket Analysis Algorithm on Map/Reduce is presented
      • a data mining analysis that finds the most frequently occurring pairs of products in baskets at a store
    • The associated items can be paired with the Map/Reduce approach
      • Once we have the paired items, they can be used for further studies by statistically analyzing them, even sequentially, which is beyond the scope of this paper
    • Distributing, aggregating, and reducing the data set among nodes is a bottleneck