Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop


Draft Slide for EDB 2011
(Songdo Park Hotel, Incheon, Korea Aug. 25-27, 2011)


  1. Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop, EDB 2011 (Songdo Park Hotel, Incheon, Korea, Aug. 25-27, 2011). Seon Ho Kim, PhD, Integrated Media Systems Center, USC; Jongwook Woo, PhD, Siddharth Basopia, and Yuhang Xu, High-Performance Internet Computing Center (HiPIC), Computer Information Systems Department, California State University, Los Angeles
  2. Contents <ul><li>Map/Reduce Brief Introduction </li></ul><ul><li>Market Basket Analysis </li></ul><ul><li>Map/Reduce Algorithm for MBA </li></ul><ul><li>NoSQL HBase </li></ul><ul><li>Experimental Result </li></ul><ul><li>Conclusion </li></ul>
  3. What is Map/Reduce and NoSQL DB on Cloud Computing [Slide of logos: Cloudera, Hortonworks, AWS, NoSQL DBs]
  4. Big Data for RDBMS <ul><li>Issues in RDBMS </li></ul><ul><ul><li>Hard to scale </li></ul></ul><ul><ul><ul><li>Relations get broken </li></ul></ul></ul><ul><ul><ul><ul><li>Partitioning for scalability </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Replication for availability </li></ul></ul></ul></ul><ul><ul><li>Speed </li></ul></ul><ul><ul><ul><li>Transfer and seek times of physical storage </li></ul></ul></ul><ul><ul><ul><ul><li>Slower than N/W speed </li></ul></ul></ul></ul><ul><ul><ul><ul><li>1 TB disk at a 10 MB/s transfer rate </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>100K sec => 27.7 hrs </li></ul></ul></ul></ul></ul><ul><ul><ul><li>Multiple data sources at different places </li></ul></ul></ul><ul><ul><ul><ul><li>100 disks of 10 GB each, read in parallel at 10 MB/s apiece </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>1,000 sec => about 16.7 min </li></ul></ul></ul></ul></ul>
  5. Big Data for RDBMS (Cont’d) <ul><li>Issues in RDBMS (Cont’d) </li></ul><ul><ul><li>Data Integration </li></ul></ul><ul><ul><ul><li>Much unstructured data </li></ul></ul></ul><ul><ul><ul><ul><li>Web data </li></ul></ul></ul></ul><ul><li>Solution </li></ul><ul><ul><li>Map/Reduce </li></ul></ul><ul><ul><ul><li>(Key, Value) parallel computing </li></ul></ul></ul><ul><ul><ul><li>Apache Hadoop </li></ul></ul></ul><ul><ul><li>Big Data => Data Cleansing by Hadoop => Data Repositories (Pig, Hive, Mahout, HBase, Cassandra, MongoDB) => Business Intelligence (Data Mining, OLAP, Data Visualization, Reporting) </li></ul></ul>
  6. What is MapReduce <ul><li>Functions borrowed from functional programming languages (e.g., Lisp) </li></ul><ul><li>Provides a restricted parallel programming model: Hadoop </li></ul><ul><ul><li>User implements Map() and Reduce() </li></ul></ul><ul><ul><li>Libraries (Hadoop) take care of EVERYTHING else </li></ul></ul><ul><ul><ul><li>Parallelization </li></ul></ul></ul><ul><ul><ul><li>Fault Tolerance </li></ul></ul></ul><ul><ul><ul><li>Data Distribution </li></ul></ul></ul><ul><ul><ul><li>Load Balancing </li></ul></ul></ul><ul><li>Useful for huge (peta- or tera-byte) but non-complicated data </li></ul><ul><ul><li>Log files for web companies </li></ul></ul><ul><ul><li>New York Times case </li></ul></ul>
  7. Map <ul><li>Convert input data to (key, value) pairs </li></ul><ul><li>map() functions run in parallel, </li></ul><ul><ul><li>creating different intermediate (key, value) values from different input data sets </li></ul></ul>
  8. Reduce <ul><li>reduce() combines those intermediate values into one or more final values for that same key </li></ul><ul><li>reduce() functions also run in parallel, </li></ul><ul><ul><li>each working on a different output key </li></ul></ul><ul><li>Bottleneck: </li></ul><ul><ul><li>the reduce phase can’t start until the map phase is completely finished. </li></ul></ul>
  9. Example: Sort URLs in the largest hit order <ul><li>Map() </li></ul><ul><ul><li>Input: <logFilename, file text> </li></ul></ul><ul><ul><li>Parses the file and emits <url, hit count> pairs </li></ul></ul><ul><ul><ul><li>e.g. <url, 1> </li></ul></ul></ul><ul><li>Reduce() </li></ul><ul><ul><li>Sums all values for the same key and emits <url, TotalCount> </li></ul></ul><ul><ul><ul><li>e.g. <url, (3 5 2 7)> => <url, 17> </li></ul></ul></ul>
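The URL-counting slide above can be sketched as plain Python outside Hadoop. This is a minimal simulation, not the Hadoop API: `map_fn`, `reduce_fn`, and the sample log lines are hypothetical, and the grouping dict stands in for Hadoop's shuffle phase.

```python
from collections import defaultdict

def map_fn(log_filename, file_text):
    # Emit an intermediate (url, 1) pair for every hit in the log text.
    for line in file_text.splitlines():
        url = line.split()[0]   # assume the URL is the first field of each line
        yield (url, 1)

def reduce_fn(url, counts):
    # Sum all intermediate values emitted for the same key.
    return (url, sum(counts))

# Shuffle step: group intermediate (key, value) pairs by key.
log = "http://a.com GET\nhttp://b.com GET\nhttp://a.com GET"
groups = defaultdict(list)
for url, one in map_fn("access.log", log):
    groups[url].append(one)

totals = dict(reduce_fn(u, c) for u, c in groups.items())
# Sort URLs by descending total hit count, as the slide title asks.
ranking = sorted(totals.items(), key=lambda kv: -kv[1])
```

In a real Hadoop job the grouping and sorting between the two functions are done by the framework; only the two functions are user code.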
  10. Legacy Example <ul><li>In late 2007, the New York Times wanted to make its entire archive of articles available over the web, </li></ul><ul><ul><li>11 million in all, dating back to 1851, </li></ul></ul><ul><ul><li>a four-terabyte pile of images in TIFF format. </li></ul></ul><ul><ul><li>It needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files, </li></ul></ul><ul><ul><ul><li>not a particularly complicated but a large computing chore, </li></ul></ul></ul><ul><ul><ul><ul><li>requiring a whole lot of computer processing time. </li></ul></ul></ul></ul><ul><ul><li>Derek Gottfrid, a software programmer at the Times, </li></ul></ul><ul><ul><ul><li>playing around with Amazon Web Services’ Elastic Compute Cloud (EC2), </li></ul></ul></ul><ul><ul><ul><ul><li>uploaded the four terabytes of TIFF data into Amazon’s Simple Storage Service (S3). </li></ul></ul></ul></ul><ul><ul><ul><ul><li>In less than 24 hours: 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site. </li></ul></ul></ul></ul><ul><ul><li>The total cost for the computing job? $240 </li></ul></ul><ul><ul><ul><li>10 cents per computer-hour times 100 computers times 24 hours </li></ul></ul></ul>
  11. noSQL DBs <ul><li>Key/Value, Column, Document </li></ul><ul><li>Column-Oriented DB </li></ul><ul><ul><li>HBase </li></ul></ul><ul><li>Fast index on large amounts of data </li></ul><ul><ul><li>Lookup by one or more keys (key/value) </li></ul></ul><ul><li>NoSQL normally supports MapReduce </li></ul>
  12. Data Store of noSQL DB <ul><li>Key/Value store </li></ul><ul><ul><li>(Key, Value) </li></ul></ul><ul><ul><li>Index, versioning, sorting, locking, transaction, replication </li></ul></ul><ul><ul><li>Memcached </li></ul></ul><ul><li>Document Store </li></ul><ul><ul><li>Search Engine/Repository </li></ul></ul><ul><ul><li>Multiple indexes to store indexed documents </li></ul></ul><ul><ul><li>No locking or transactions; replication </li></ul></ul><ul><ul><li>MongoDB, CouchDB, ThruDB, SimpleDB </li></ul></ul><ul><li>Column-Oriented Stores (Extensible Record Stores) </li></ul><ul><ul><li>Extensible records horizontally and vertically partitioned across nodes </li></ul></ul><ul><ul><li>Rows and columns are distributed over multiple nodes </li></ul></ul><ul><ul><li>BigTable, HBase, Cassandra, Hypertable </li></ul></ul>
  13. Market Basket Analysis (MBA) <ul><li>Collect the list of pairs of transaction items that most frequently occur together at a store </li></ul><ul><li>Traditional Business Intelligence analysis </li></ul><ul><ul><li>a much better opportunity to make a profit by controlling the order of products and marketing: </li></ul></ul><ul><ul><ul><li>control the stocks more intelligently </li></ul></ul></ul><ul><ul><ul><li>arrange items on shelves </li></ul></ul></ul><ul><ul><ul><li>promote items together, etc. </li></ul></ul></ul>
  14. Market Basket Analysis (MBA) <ul><li>Transactions in Store A: input data </li></ul><ul><ul><li>Transaction 1: cracker, icecream, beer </li></ul></ul><ul><ul><li>Transaction 2: chicken, pizza, coke, bread </li></ul></ul><ul><ul><li>Transaction 3: baguette, soda, herring, cracker, beer </li></ul></ul><ul><ul><li>Transaction 4: bourbon, coke, turkey </li></ul></ul><ul><ul><li>Transaction 5: sardines, beer, chicken, coke </li></ul></ul><ul><ul><li>Transaction 6: apples, peppers, avocado, steak </li></ul></ul><ul><ul><li>Transaction 7: sardines, apples, peppers, avocado, steak </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>What is a pair of items that people frequently buy at Store A? </li></ul>
  15. Map Algorithm <ul><li>1. Read each transaction of the input file and generate the data set of the items: </li></ul><ul><ul><li>(<V1>, <V2>, …, <Vn>) where <Vn>: (vn1, vn2, …, vnm) </li></ul></ul><ul><li>2. Loop for each item from vn1 to vnm of <Vn> </li></ul><ul><ul><li>2.a: generate the data set <Yn>: (yn1, yn2, …, ynl); ynl: (unx, uny) where unx ≠ uny </li></ul></ul><ul><ul><li>2.b: sort each pair (unx, uny) </li></ul></ul><ul><ul><li>note: (key, value) = (ynl, number of occurrences) </li></ul></ul><ul><ul><li>2.c: increment the occurrence count of ynl </li></ul></ul><ul><li>3. The data set is created as input of the Reducer: </li></ul><ul><ul><li>(key, <value>) = (ynl, <number of occurrences>) </li></ul></ul>
  16. Reduce Algorithm <ul><li>1. Take (ynl, <number of occurrences>) as input data from multiple Map nodes </li></ul><ul><li>2. Add the values for ynl to produce (ynl, total number of occurrences) as output </li></ul>
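The Map and Reduce steps above can be sketched in a few lines of Python. This is an illustrative simulation under assumed names (`mba_map`, `mba_reduce`, and the `grouped` dict standing in for Hadoop's shuffle/sort); the paper's actual jobs implement Hadoop's Mapper/Reducer interfaces.

```python
from itertools import combinations
from collections import defaultdict

def mba_map(transaction):
    # Steps 2.a-2.c: generate every pair of distinct items, with the two
    # items sorted so that (icecream, cracker) and (cracker, icecream)
    # collapse into the same key, then count each pair's occurrences.
    counts = defaultdict(int)
    for pair in combinations(sorted(set(transaction)), 2):
        counts[pair] += 1
    return counts.items()      # (key, value) = (pair, number of occurrences)

def mba_reduce(pair, occurrence_counts):
    # Sum the per-mapper counts for the same pair key.
    return (pair, sum(occurrence_counts))

transactions = [["cracker", "icecream", "beer"],
                ["chicken", "pizza", "coke", "bread"]]
grouped = defaultdict(list)            # stands in for Hadoop's shuffle/sort
for t in transactions:
    for pair, n in mba_map(t):
        grouped[pair].append(n)

result = dict(mba_reduce(p, ns) for p, ns in grouped.items())
```

A transaction of m distinct items yields m(m-1)/2 pairs, which is why the pair-generation step dominates the Map phase.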
  17. Market Basket Analysis (Cont’d) <ul><li>Transactions in Store A </li></ul><ul><ul><li>Transaction 1: cracker, icecream, beer </li></ul></ul><ul><ul><li>Transaction 2: chicken, pizza, coke, bread </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Distribute Transaction data to Map nodes </li></ul><ul><li>Pairs of items restructured in each Map node </li></ul><ul><ul><li>Transaction 1: < (cracker, icecream), (cracker, beer), (beer, icecream) > </li></ul></ul><ul><ul><li>Transaction 2: < (chicken, pizza), (chicken, coke), (chicken, bread), (coke, pizza), (bread, pizza), (coke, bread) > </li></ul></ul><ul><ul><li>… </li></ul></ul>
  18. Market Basket Analysis (Cont’d) <ul><li>Note: the items in each pair should be sorted, as the pair becomes a key </li></ul><ul><ul><li>For example, (cracker, icecream) and (icecream, cracker) should both be (cracker, icecream) </li></ul></ul><ul><li>Pairs of items sorted in MBA </li></ul><ul><ul><li>Transaction 1: < (cracker, icecream), (beer, cracker), (beer, icecream) > </li></ul></ul><ul><ul><li>Transaction 2: < (chicken, pizza), (chicken, coke), (bread, chicken), (coke, pizza), (bread, pizza), (bread, coke) > </li></ul></ul><ul><ul><li>… </li></ul></ul>
  19. Market Basket Analysis (Cont’d) <ul><li>Output of Map node </li></ul><ul><ul><li>Pairs of items in (key, value) structure in each Map node </li></ul></ul><ul><ul><li>(key, value): (pair of items, number of occurrences) </li></ul></ul><ul><ul><li>((cracker, icecream), 1) </li></ul></ul><ul><ul><li>((beer, cracker), 1) </li></ul></ul><ul><ul><li>((beer, icecream), 1) </li></ul></ul><ul><ul><li>((chicken, pizza), 1) </li></ul></ul><ul><ul><li>((chicken, coke), 1) </li></ul></ul><ul><ul><li>((bread, chicken), 1) </li></ul></ul><ul><ul><li>((coke, pizza), 1) </li></ul></ul><ul><ul><li>((bread, pizza), 1) </li></ul></ul><ul><ul><li>((bread, coke), 1) </li></ul></ul><ul><ul><li>… </li></ul></ul>
  20. Market Basket Analysis (Cont’d) <ul><li>Data Aggregation/Combine </li></ul><ul><ul><li>(key, <value>): (pair of items, list of occurrence counts) </li></ul></ul><ul><ul><li>((cracker, icecream), <1, 1, …, 1>) </li></ul></ul><ul><ul><li>((beer, cracker), <1, 1, …, 1>) </li></ul></ul><ul><ul><li>((beer, icecream), <1, 1, …, 1>) </li></ul></ul><ul><ul><li>((chicken, pizza), <1, 1, …, 1>) </li></ul></ul><ul><ul><li>… </li></ul></ul>
  21. Market Basket Analysis (Cont’d) <ul><li>Reduce nodes </li></ul><ul><ul><li>(key, value): (pair of items, total number of occurrences) </li></ul></ul><ul><ul><li>((cracker, icecream), 421) </li></ul></ul><ul><ul><li>((beer, cracker), 341) </li></ul></ul><ul><ul><li>((beer, icecream), 231) </li></ul></ul><ul><ul><li>((chicken, pizza), 111) </li></ul></ul><ul><ul><li>… </li></ul></ul>
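The whole pair-and-count pipeline can be run end to end on the seven sample transactions from slide 14. The totals here are small because the sample is tiny; the large totals on the slides come from the full transaction files. `Counter` plays both the combine and reduce roles in this sketch.

```python
from itertools import combinations
from collections import Counter

# The seven sample transactions from the Market Basket Analysis slide.
transactions = [
    ["cracker", "icecream", "beer"],
    ["chicken", "pizza", "coke", "bread"],
    ["baguette", "soda", "herring", "cracker", "beer"],
    ["bourbon", "coke", "turkey"],
    ["sardines", "beer", "chicken", "coke"],
    ["apples", "peppers", "avocado", "steak"],
    ["sardines", "apples", "peppers", "avocado", "steak"],
]

totals = Counter()
for t in transactions:
    # sorted() canonicalizes each pair so purchase order does not matter;
    # Counter.update aggregates counts across all transactions.
    totals.update(combinations(sorted(set(t)), 2))
```

For example, (beer, cracker) appears in transactions 1 and 3, so its total is 2, while (coke, pizza) appears only in transaction 2.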
  22. Map/Reduce for MBA [Diagram: input transaction data is distributed to Map1()…Mapm(); each Map node emits pairs such as ((coke, pizza), 1), ((beer, corn), 1), ((ham, juice), 1); the data aggregation/combine step groups values per key, e.g. ((coke, pizza), <1, 1, …, 1>) and ((ham, juice), <1, 1, …, 1>); Reduce1()…Reducel() then emit totals such as ((coke, pizza), 3,421) and ((ham, juice), 2,346)]
  23. HBase Schema <ul><li>Input Data: (Key, [Column Family, Column Item]) </li></ul><ul><ul><li>(Transaction #, [Items, Items:List]) </li></ul></ul>

  Key     | Items:List
  Trax 1  | cracker, icecream, beer
  Trax 2  | chicken, pizza, coke, bread
  …       | …
  24. HBase Schema <ul><li>Output Item Pairs after Map/Reduce computing </li></ul><ul><ul><li>(Key, [Column Family, Column Item]) </li></ul></ul><ul><ul><ul><li>(Item Pair, [Items, Items:Count]) </li></ul></ul></ul>

  Key            | Items:Count
  (coke, pizza)  | 3,421
  (ham, juice)   | 2,346
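For illustration, the two schemas above could be set up from the HBase shell. The table names `transactions` and `pair_counts` are hypothetical; the slides do not name the tables, and a real run would load data through the HBase client API rather than by hand.

```
hbase> create 'transactions', 'Items'
hbase> put 'transactions', 'Trax1', 'Items:List', 'cracker,icecream,beer'
hbase> put 'transactions', 'Trax2', 'Items:List', 'chicken,pizza,coke,bread'

hbase> create 'pair_counts', 'Items'
hbase> put 'pair_counts', '(coke,pizza)', 'Items:Count', '3421'
hbase> get 'pair_counts', '(coke,pizza)'
```

Each row key is the transaction number (input) or the sorted item pair (output), with a single column family `Items`, matching the (Key, [Column Family, Column Item]) layout on the slides.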
  25. Experimental Result <ul><li>Four transaction files for the experiment: </li></ul><ul><ul><li>64MB (1.1M transactions), 128MB (2.2M transactions), 400MB (6.7M transactions), 800MB (13M transactions) </li></ul></ul><ul><li>Run on small instances of AWS EC2 </li></ul><ul><ul><li>each node has a 1.0-1.2 GHz 2007 Opteron or Xeon processor </li></ul></ul><ul><ul><li>1.7GB memory </li></ul></ul><ul><ul><li>350GB storage on a 32-bit platform </li></ul></ul><ul><ul><li>Hadoop and HBase installed and run via Apache Whirr </li></ul></ul><ul><ul><ul><li>Blog: </li></ul></ul></ul><ul><ul><ul><ul><li>“Market Basket Analysis Example in Hadoop”, Jongwook Woo, March 2011 </li></ul></ul></ul></ul><ul><li>The jobs are executed on 5, 10, and 15 nodes </li></ul>
  26. Experimental Result on HDFS <ul><li>Execution time (msec) </li></ul> [Chart: execution time (roughly 100,000-350,000 msec) vs. number of instances (5, 10, 15) for the 64MB, 128MB, 400MB, and 800MB data sets]
  27. Experimental Result on HDFS and HBase <ul><li>Execution time (msec) </li></ul>
  28. Conclusion <ul><li>The Market Basket Analysis algorithm on Map/Reduce and HBase is presented </li></ul><ul><ul><li>a data mining analysis to find the most frequently occurring pairs of products in baskets at a store. </li></ul></ul><ul><li>The associated items can be paired with the Map/Reduce approach. </li></ul><ul><ul><li>Shows the feasibility and performance of using HBase for Market Basket Analysis data </li></ul></ul><ul><ul><li>Once we have the paired items, they can be used for further studies by analyzing them statistically, even sequentially, which is beyond this paper: </li></ul></ul><ul><ul><ul><li>Support </li></ul></ul></ul><ul><ul><ul><li>Confidence </li></ul></ul></ul><ul><li>Parallelism is negatively affected by </li></ul><ul><ul><li>a bottleneck in distributing, aggregating, and reducing the data set among nodes </li></ul></ul>