
Applications of Apache Mahout in E-Commerce

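Apache Mahout is an algorithm library built on top of MapReduce. It ships with many classic algorithms and uses MapReduce's parallel processing architecture to make the analysis of massive data sets easier. E-commerce is becoming increasingly personalized, which creates demand for user behavior analysis. Personalized services such as precise recommendation, targeted ad delivery, personalized product recommendation, and personalized content recommendation keep emerging, making Apache Mahout's role in the Hadoop ecosystem increasingly important. The combination of MapReduce's parallel computing power and machine learning helps e-commerce operators extract valuable user behavior data from massive website logs. In this session, Etu introduces the principles behind Mahout's built-in product recommendation algorithms and shows, step by step, how to build an end-to-end product recommendation system.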

Published in: Technology, Education

  1. Applications of Apache Mahout in E-Commerce. James Chen, Etu Solution. Hadoop in TW 2013, Sep 28, 2013
  2. Taiwan Hadoop 2013 status survey: fill out the questionnaire for a chance to win two movie tickets. Closes 2013/10/7
  3. Agenda • Apache Mahout Introduction • Machine Learning Use Cases • Building a Recommendation System • Collaborative Filtering • System Architecture • Performance • Future Roadmap
  4. Apache Mahout • ASF project to create scalable machine learning libraries – http://mahout.apache.org • Why Mahout? – Many open source machine learning libraries either: • Lack Community • Lack Documentation and Examples • Lack Scalability • Lack the Apache License
  5. Algorithms in Mahout: Regression, Recommenders, Clustering, Classification, Frequent Pattern Mining, Vector Similarity, Dimension Reduction, Evolution, Non-MR Algorithms, Examples. See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
  6. Algorithms in Mahout • Classification – Logistic Regression – Bayesian – Support Vector Machines – Perceptron and Winnow – Neural Network – Random Forests – Restricted Boltzmann Machines – Online Passive Aggressive – Boosting – Hidden Markov Models • Clustering – Canopy Clustering – K-Means – Fuzzy K-Means – Expectation Maximization – Mean Shift – Hierarchical Clustering – Dirichlet Process Clustering – Latent Dirichlet Allocation – Spectral Clustering – Minhash Clustering – Top Down Clustering
  7. Algorithms in Mahout – Cont. • Pattern Mining – Parallel FP Growth • Regression – Locally Weighted Linear Regression • Dimension Reduction – SVD – Stochastic SVD with PCA – PCA – Independent Component Analysis – Gaussian Discriminant Analysis • Evolutionary Algorithms – Genetic Algorithms • Recommenders – Non-distributed recommenders ("Taste") – Distributed Item-Based Collaborative Filtering – Collaborative Filtering using a parallel matrix factorization – Slope One
  8. Algorithms in Mahout – Cont. • Vector Similarity – RowSimilarityJob (MR) – VectorDistanceJob (MR) • Other – Collocations • Non-MapReduce algorithms. See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
  9. Mahout Focus on Scalability • Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm – Some algorithms won't scale to massive machine clusters – Others fit logically on a MapReduce framework like Apache Hadoop – Still others will need alternative distributed programming models – Be pragmatic • Most Mahout implementations are MapReduce enabled • (Always a) Work in Progress
  10. Prepare Data from Raw Content • Lucene integration – bin/mahout lucenevector … • Document Vectorizer – bin/mahout seqdirectory … – bin/mahout seq2sparse … • Programmatically – See the Utils module in Mahout • Database (JDBC) • File System (HDFS)
  11. Machine Learning • "Machine Learning is programming computers to optimize a performance criterion using example data or past experience" – Introduction to Machine Learning by E. Alpaydin • Subset of Artificial Intelligence • Lots of related fields: – Information Retrieval – Stats – Biology – Linear algebra – Many more
  12. Use Case: Recommendation
  13. Use Case: Classification
  14. Use Case: Clustering
  15. More use cases • Recommend products/books/friends … • Classify content into predefined groups • Find similar content based on object properties • Find associations/patterns in actions/behaviors • Identify key topics in large collections of text • Detect anomalies in machine output • Rank search results (PageRank) • Others
  16. Building a recommendation system • Help users find items they might like based on historical preferences
  17. Approach • Collect user preferences -> User vs Item matrix • Find similar users or items (neighborhood-based approach) • Works by finding similarly rated items in the user-item matrix (e.g. cosine, Pearson correlation, Tanimoto coefficient) • Estimates a user's preference towards an item by looking at his/her preferences towards similar items
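  18. Collaborative Filtering – User Based: Find User Similarity 1. How do we predict User 1's preference for Item 4? 2. Find the n users who are similar to User 1 and have purchased Item 4 (ratings derived from purchase records); call them User n 3. Fill in the result from User n's ratings of Item 4, weighted by similarity 4. Repeat steps 1–3 for every user combination until all empty cells are filled. (Figure: Users × Items matrix; "?" marks User 1's unknown preference, filled in from User n)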
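Steps 2 and 3 of the user-based approach can be sketched in a few lines of Python. This is a minimal illustration, not Mahout's implementation: the function name `predict_user_based`, the `ratings` layout, and the `similarity` callback are all hypothetical, and normalizing by the sum of absolute similarities is an assumption.

```python
def predict_user_based(target, item, ratings, similarity, n=2):
    """Predict `target`'s preference for `item` from the n most similar raters.

    ratings:    dict user -> {item: rating}      (hypothetical layout)
    similarity: callable (user_a, user_b) -> float (supplied by the caller)
    """
    # Step 2: keep only other users who have rated the item.
    candidates = [
        (similarity(target, other), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != target and item in other_ratings
    ]
    # Take the n most similar of them as the neighborhood.
    neighbors = sorted(candidates, reverse=True)[:n]
    weight_sum = sum(abs(sim) for sim, _ in neighbors)
    if weight_sum == 0:
        return None  # no usable neighbors, nothing to fill in
    # Step 3: similarity-weighted average of the neighbors' ratings.
    return sum(sim * rating for sim, rating in neighbors) / weight_sum
```

Step 4 would then call this function for every empty cell of the user-item matrix.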
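  19. Collaborative Filtering – Item Based: Find Item Similarity (Amazon) 1. How do we predict User 1's preference for Item 4? 2. From User 1's history, compute the similarity between each Item n and Item 4 (using other users' histories) 3. Fill in the result from User 1's ratings of Item n, weighted by item similarity 4. Repeat steps 1–3 for every item combination until all empty cells are filled. (Figure: Users × Items matrix; "?" marks the cell being filled in)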
  20. Test Drive of Mahout Recommender • GroupLens dataset: http://www.grouplens.org/node/12 • 1,000,209 anonymous ratings of 3,900 movies made by 6,040 MovieLens users • movies.dat (movie ids with title and category) • ratings.dat (ratings of movies) • users.dat (user information)
  21. Ratings File • Each line of the ratings file has the format UserID::MovieID::Rating::Timestamp • Mahout requires the following CSV format: UserID,ItemID,Value • tr -s ':' ',' < ratings.dat | cut -f1-3 -d, > rating.csv
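The same conversion can be done in Python when `tr` and `cut` are not available. This is a small sketch, assuming the `UserID::MovieID::Rating::Timestamp` layout described on the slide; the function names and the sample row are illustrative.

```python
def to_mahout_csv(line: str) -> str:
    """Convert 'UserID::MovieID::Rating::Timestamp' to 'UserID,ItemID,Value'."""
    user_id, movie_id, rating, _timestamp = line.strip().split("::")
    return f"{user_id},{movie_id},{rating}"


def convert_file(src_path: str, dst_path: str) -> None:
    """Stream the ratings file into a three-column CSV, dropping the timestamp."""
    with open(src_path, encoding="latin-1") as src, open(dst_path, "w") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(to_mahout_csv(line) + "\n")


if __name__ == "__main__":
    # Illustrative sample row in the ratings.dat format.
    print(to_mahout_csv("1::1193::5::978300760"))  # 1,1193,5
```

Calling `convert_file("ratings.dat", "rating.csv")` produces the same three-column file as the shell pipeline.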
  22. Run Recommendation Job • $ mahout recommenditembased -i [input-hdfs-path] -o [output-hdfs-path] --usersFile [File listing users] --tempDir
  23. Recommendation Result • The recommendation result will look like UserID [ItemID:Weight, ItemID:Weight,…] • Each line represents a UserID with its associated recommended ItemIDs
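A result line in that shape is easy to load back into a program. The parser below is a sketch that assumes the user ID and the bracketed list are separated by whitespace (a tab or a space); the function name and sample values are hypothetical.

```python
def parse_recommendation(line):
    """Parse 'UserID [ItemID:Weight, ItemID:Weight]' into (user_id, [(item_id, weight), ...])."""
    user_part, _, items_part = line.strip().partition("[")
    user_id = user_part.strip()
    recommendations = []
    for pair in items_part.rstrip("]").split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate an empty list such as "42 []"
        item_id, weight = pair.split(":")
        recommendations.append((item_id, float(weight)))
    return user_id, recommendations


if __name__ == "__main__":
    print(parse_recommendation("42\t[101:4.5, 205:3.9]"))
```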
  24. Collect User Behavior Events • Implicit (easy to collect): View; Shopping Cart (0 or 1); Order or Buy (0 or 1); Duration Time (noisy) • Explicit (hard to collect): Rating (0~5); Voting (0 or 1); Forward or Share (0 or 1); Add favorite (0 or 1); Tag (text analysis); Comments (text analysis)
  25. Process Event into Preference • Group by event type, and calculate similarity per event type, e.g. Also View, Also Buy • Weighting: – Explicit Event > Implicit Event – Order, Cart > View • Noise Reduction • Normalization
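The grouping, weighting, and normalization steps might look like the sketch below. Only the ordering (order and cart above view) comes from the slide; the concrete weights, the function names, and the per-user max normalization are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical weights: the slide only states that Order and Cart outrank View.
EVENT_WEIGHTS = {"view": 1.0, "cart": 3.0, "order": 5.0}


def group_by_event_type(events):
    """Split (user, item, event_type) records so each type can be analysed alone
    (e.g. "also viewed" from views, "also bought" from orders)."""
    grouped = defaultdict(set)
    for user_id, item_id, event_type in events:
        grouped[event_type].add((user_id, item_id))
    return dict(grouped)


def events_to_preferences(events):
    """Combine events into one weighted score per (user, item), scaled to 0..1 per user."""
    raw = defaultdict(float)
    for user_id, item_id, event_type in events:
        raw[(user_id, item_id)] += EVENT_WEIGHTS.get(event_type, 0.0)
    # Normalization (assumed scheme): divide by each user's largest score.
    max_per_user = defaultdict(float)
    for (user_id, _item), score in raw.items():
        max_per_user[user_id] = max(max_per_user[user_id], score)
    return {
        key: score / max_per_user[key[0]]
        for key, score in raw.items()
        if max_per_user[key[0]] > 0
    }
```

Noise reduction (for example, discarding very short view durations) would be applied before these steps and is not shown.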
  26. Similarity (Vector Similarity) • Euclidean Distance • Pearson Correlation Coefficient (-1 ~ +1) • Cosine Similarity • Tanimoto Coefficient
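For reference, the four measures can be written out in plain Python. These are textbook definitions, not Mahout's code; the function names are illustrative, and the vectors are assumed to be equal-length lists of ratings over the same items (Tanimoto works on sets of items instead).

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation in [-1, +1]; returns 0.0 if either vector is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def cosine(a, b):
    """Cosine of the angle between the two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def tanimoto(set_a, set_b):
    """Overlap of two item sets: |A ∩ B| / |A ∪ B| (suits 0/1 preferences)."""
    union = len(set_a | set_b)
    return len(set_a & set_b) / union if union else 0.0
```

Tanimoto ignores rating values entirely, which makes it a natural fit for the 0-or-1 events (cart, order) collected earlier in the deck.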
  27. Complementary Approaches • Sometimes CF cannot generate enough recommendations for all users • Cold start problem: new users and new items • Some statistical approaches can complement CF • Ranking is very easy to implement with MapReduce: it is essentially Word Count
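The Word Count analogy can be made concrete with a small in-memory simulation. This sketch imitates the map and reduce steps with plain Python functions instead of a real Hadoop job; the event layout `(user_id, item_id)` and the function names are hypothetical.

```python
from collections import Counter


def map_phase(events):
    """Map: emit (item_id, 1) for every event, exactly like Word Count emits (word, 1)."""
    for _user_id, item_id in events:
        yield item_id, 1


def reduce_phase(pairs):
    """Reduce: sum the counts per item."""
    counts = Counter()
    for item_id, one in pairs:
        counts[item_id] += one
    return counts


def top_n(events, n):
    """Most popular items, usable as a fallback list for cold-start users."""
    return [item_id for item_id, _count in reduce_phase(map_phase(events)).most_common(n)]
```

A popularity list built this way can be served whenever collaborative filtering returns too few items for a user.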
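  28. The Whole System (Etu Recommender Application) • Data sources: – Transaction Info: historical order data, product purchase records – Web interaction data: browse, click, search, cart, check-out, rating – Mobile interaction data: download, click, check-in, payment, location – Social Media (3rd party feed) • Etu Recommender on the Etu Appliance: data capture -> recommendation engine -> recommendation lists – Collaborative Filtering analysis: customer similarity analysis, product association analysis – Conversion rate analysis • Recommendation types: – Personalized recommendations for the user – Customers who viewed this item also viewed – Customers who bought this item also bought – Recommendations for cart items – Items bought together – Based on your browsing, you may like – Customers who viewed this item ultimately bought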
  29. Data Process Flow (diagram) • Front-end JavaScript sends requests to the Event Collector (Nginx) • The access log goes to the Log Parser (Preprocess & Dispatch) and on to HDFS • The Core Engine (Schedule & Flow Control) runs Mahout Jobs (User Based, Item Based) and MR Jobs (Ranking & Stats.) • HBase backs the Rec API and the Item Mgmt. API • Dashboard & Mgmt Console • Layers: Front End, Backend, Admin
  30. System Components • Nginx – Event Collector & Request Forwarder • Log Parser – Preprocess collected log and dispatch log to HDFS • HDFS – Fundamental storage of the system • Core Engine – Scheduling & Workflow Control – Job Driver • Management Console – Dashboard (PV, UV, Conv. Rate) – Scheduling, Log Viewer, System Configuration
  31. System Components – Cont. • Recommendation Jobs – Mahout jobs for CF – MR jobs for Ranking • HBase – Recommendation Result for query • Recommendation API – API wrapper for frontend to query result from HBase table – Handle business logic and policy here • Item Management API – API interface for frontend item management – Allow List, Exception List
  32. HBase Table (Table; Rowkey; Columns) • CATEGORY; rowkey CategoryID; f:id (category ID), f:rank (ranking by view), f:rank_cart (ranking by cart), f:rank_order (ranking by order), f:rank_view (ranking by view) • ERUID_USER; rowkey ErUid; f:uid (ERUID/UID mapping) • USER_ERUID; rowkey uid; f:eruid (UID/ERUID mapping) • SEARCH; rowkey Keyword; f:id (search ID), f:rec (item list)
  33. HBase Table – Cont. (Table; Rowkey; Columns) • STATS; rowkey date (e.g. 2013-06-25); f:amount (site-wide transaction amount), f:item (site-wide number of items sold), f:order (site-wide number of orders), f:pv (site-wide PV), f:uv (site-wide UV), f:erAmount (transaction amount from recommendations), f:erItem (items sold through recommendations), f:erOrder (orders through recommendations), f:erPv (recommendation slot PV), f:erUv (recommendation slot UV)
  34. HBase Table – Item Table (Table; Rowkey; Columns) • ITEM; rowkey PID; f:avl_cat (whether this category is currently recommendable), f:avl_item (whether this item is currently recommendable), f:cat (category the item belongs to), f:id (item ID), f:pry (priority: recommendation priority weight), f:rec_view (items recommended for viewing, computed by Mahout), f:rec_cart (items recommended for adding to cart, computed by Mahout), f:rec_order (items recommended for purchase, computed by Mahout), f:rec_order_co_occurrence (items frequently bought together with this item)
  35. HBase Table – User Table (Table; Rowkey; Columns) • USER; rowkey UID; f:id (user ID), f:rec_cart (items recommended for adding to cart, computed by Mahout), f:rec_view (items recommended for viewing, computed by Mahout), f:rec_order (items recommended for purchase, computed by Mahout), f:rec_view_last_item_views (most-viewed items recommended to the user)
  36. Tracking Code Snippet
      <script id="etu-recommender" type="text/javascript">
        var erHostname = '${erHostname}';
        var _qevents = _qevents || [];
        _qevents.push({ ${paramName} : '${paramValue}', ... });
        var erUrlPrefix = ('https:' == document.location.protocol ? 'https://' : 'http://') + erHostname + '/';
        (function() {
          // Load er.js asynchronously, with a timestamp to defeat caching
          var er = document.createElement('script');
          er.type = 'text/javascript';
          er.async = true;
          // erUrlPrefix already ends with '/', so no leading slash is added here
          er.src = erUrlPrefix + 'er.js?' + (new Date().getTime());
          var currentJs = document.getElementById('etu-recommender');
          currentJs.parentNode.insertBefore(er, currentJs);
        })();
      </script>
  37. Sample parameters for tracking a "view" action (# / Parameter Name / Parameter Type / Sample Value / Required) • 1 / cid / String / "www.etusolution.com" / Yes • 2 / uid / String / "johnny_nien" / Yes • 3 / act / String / "view" / Yes • 4 / pid / String / "P00001" / Yes • 5 / cat / String Array / [ "C", "C00001" ] / No, but please take it as a yes • 6 / avl * / Boolean (0 or 1) / 1 / No • Note: Explanation about "avl" will be available later
  38. Query Recommendations
      <script id="etu-recommender" type="text/javascript">
        // erUrlPrefix is supplied by the template (erHostname is not defined in this snippet)
        var erUrlPrefix = '${erUrlPrefix}';
        var _qquery = _qquery || [];
        _qquery.push({ ${paramName} : '${paramValue}', …… });
        function etuRecQueryCallBack(queryParams, queryResult) {
          // Implement Your Logic Here!!!
        }
        (function() {
          // Load er.js asynchronously, with a timestamp to defeat caching
          var er = document.createElement('script');
          er.type = 'text/javascript';
          er.async = true;
          er.src = erUrlPrefix + '/er.js?' + (new Date().getTime());
          var currentJs = document.getElementById('etu-recommender');
          currentJs.parentNode.insertBefore(er, currentJs);
        })();
      </script>
  39. Sample parameters for Also Buy … (Item based) (# / Parameter Name / Parameter Type / Sample Value / Required) • 1 / cid / String / "www.etusolution.com" / Yes • 2 / type / String / "item" / Yes • 3 / act / String / "order" / Yes • 4 / pid / String / "P001" / Yes • 5 / cat / String / "C001" / No, but highly recommended
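  40. Network Topology (diagram labels) • Internet, Router, Public IP (22, 443, 8888) • Web Server Farm: Web Servers on the User LAN • Recommender Cluster: Nginx, Master IP (22, 443, 8888), NN (NameNode), HM (HBase Master), DN (DataNode) / RS (RegionServer) nodes, L2 switches, isolated Private LAN
  41. User Event Collection • Online Behavior: embed JavaScript to capture customers' online behavior – Relevant pages: logged-in user's home page, search page, product detail page, add-to-cart page, order and payment page, payment-complete page – Customer behavior (Event): browse/click, search, add to cart, check-out, rating • Online / Offline Records: transaction data (historical orders, product data), loaded into the Recommender by Batch Import
  42. Generate Recommendation Result • Personalized recommendations for the user – Data source: browsing + cart + purchase records – Effect: strengthens personalization, improves the user experience, raises the order conversion rate • Customers who viewed this item also viewed – Data source: browsing records – Effect: lowers the bounce rate, raises the order conversion rate • Customers who bought this item also bought – Data source: purchase records – Effect: strengthens cross-selling, motivates customers to order again • Customers who viewed this item ultimately bought – Data source: browsing + purchase records – Effect: lowers the bounce rate, helps users decide, raises the order conversion rate • Recommendations for cart items – Data source: cart + purchase records – Effect: applies up-selling; after the customer's basic needs are met, guides them to buy more items of interest, lifting sales volume and gross margin • Items bought together – Data source: cart + purchase records – Effect: product bundles that improve cross-selling and sales volume • Based on your browsing, you may like – Data source: browsing records – Effect: minimizes the bounce rate, increases customer loyalty and the repeat-purchase rate
  43. How the Recommender Applies to E-Commerce (Etu Recommender Application) • Inputs: historical orders, product data, real-time orders, browse/click, add to cart, check-out, online reviews, search • Etu Recommender on the Etu Appliance: data capture -> recommendation engine (Collaborative Filtering analysis, conversion rate analysis) -> recommendation lists • Placements: Product Pages, Category Pages, Search Results, Cart Pages, Email Confirmation, EDM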
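  44. Conversion Rate Analysis for the Recommender: Online Performance Tracking • Item A reached via the home page or any other page: PV1, UV1, U-Cart 1 • Item A reached by clicking the recommendation list: PV2, UV2, U-Cart 2 • Recommended-item click-through rate = (PV2 or UV2) / (PV1 or UV1) • Recommended-item conversion rate (via the recommendation list) = U-Cart 2 / (UV2 or PV2) • PV: page view • UV: unique visitor • U-Cart: added to cart by UV • Algorithm Benchmark – Train vs Test (80-20) – A/B test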
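The two ratios are simple enough to compute directly from the counters in the STATS table. The helper names below are hypothetical, and the sample numbers are made up purely to show the arithmetic.

```python
def click_through_rate(rec_views, all_views):
    """Recommended-item click-through rate: PV2 / PV1 (or UV2 / UV1)."""
    return rec_views / all_views if all_views else 0.0


def conversion_rate(rec_cart_adds, rec_views):
    """Recommended-item conversion rate: U-Cart 2 / UV2 (or PV2)."""
    return rec_cart_adds / rec_views if rec_views else 0.0


if __name__ == "__main__":
    # Made-up counts: 1,000 visitors overall, 50 arriving via the recommendation list,
    # 5 of whom added the item to their cart.
    print(click_through_rate(50, 1000))  # 0.05
    print(conversion_rate(5, 50))        # 0.1
```

Both helpers return 0.0 for an empty denominator so that a day without traffic does not break a dashboard.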
  45. Summary • Mahout is very useful if you would like to build a machine learning application on top of Hadoop • BUT, a recommendation system is more than the algorithm alone • DON'T reinvent the wheel. Leverage Mahout and Hadoop • Put most of your effort into integration, performance tuning, and business logic
  46. Future Roadmap • Offline to online integration -> Offline User Event Collection • 360 Degree CRM -> CRM Connector • Social Recommendation -> Social Connector • Retargeting -> Customer Behavior Data Warehouse • Go real-time!
  47. WE'RE HIRING!
  48. Contact • www.etusolution.com • info@etusolution.com • Taipei, Taiwan: 318, Rueiguang Rd., Taipei 114, Taiwan. T: +886 2 7720 1888 F: +886 2 8798 6069 • Beijing, China: Room B-26, Landgent Center, No. 24, East Third Ring Middle Rd., Beijing, China 100022. T: +86 10 8441 7988 F: +86 10 8441 7227
  49. Recommendation (rating matrix; columns: The Matrix, Alien, Inception) • Alice: 5, 1, 4 • Bob: ?, 2, 5 • Peter: 4, 3, 2
  50. Algorithms Examples – Recommendation • Prediction: Estimate Bob's preference towards "The Matrix" 1. Look at all items that – a) are similar to "The Matrix" – b) have been rated by Bob => "Alien", "Inception" 2. Estimate the unknown preference with a weighted sum
  51. Algorithms Examples – Recommendation • MapReduce phase 1 – Map – Make user the key • Input: (Alice, Matrix, 5) (Alice, Alien, 1) (Alice, Inception, 4) (Bob, Alien, 2) (Bob, Inception, 5) (Peter, Matrix, 4) (Peter, Alien, 3) (Peter, Inception, 2) • Output: Alice (Matrix, 5); Alice (Alien, 1); Alice (Inception, 4); Bob (Alien, 2); Bob (Inception, 5); Peter (Matrix, 4); Peter (Alien, 3); Peter (Inception, 2)
  52. Algorithms Examples – Recommendation • MapReduce phase 1 – Reduce – Create inverted index • Input: Alice (Matrix, 5); Alice (Alien, 1); Alice (Inception, 4); Bob (Alien, 2); Bob (Inception, 5); Peter (Matrix, 4); Peter (Alien, 3); Peter (Inception, 2) • Output: Alice (Matrix, 5) (Alien, 1) (Inception, 4); Bob (Alien, 2) (Inception, 5); Peter (Matrix, 4) (Alien, 3) (Inception, 2)
  53. Algorithms Examples – Recommendation • MapReduce phase 2 – Map – Isolate all co-occurred ratings (all cases where a user rated both items) • Input: Alice (Matrix, 5) (Alien, 1) (Inception, 4); Bob (Alien, 2) (Inception, 5); Peter (Matrix, 4) (Alien, 3) (Inception, 2) • Output: Matrix, Alien (5,1); Matrix, Alien (4,3); Alien, Inception (1,4); Alien, Inception (2,5); Alien, Inception (3,2); Matrix, Inception (4,2); Matrix, Inception (5,4)
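Phase 1 and the phase 2 map step can be simulated in memory to see how the co-occurred rating pairs arise. This is a plain-Python sketch of the data flow on the slides' sample ratings, not Mahout's job code; the function names are illustrative, and item pairs are keyed in alphabetical order, so the slide's "Matrix, Alien (5,1)" appears here as `("Alien", "Matrix"): (1, 5)`.

```python
from collections import defaultdict
from itertools import combinations

# Sample ratings from the slides: (user, item, rating).
RATINGS = [
    ("Alice", "Matrix", 5), ("Alice", "Alien", 1), ("Alice", "Inception", 4),
    ("Bob", "Alien", 2), ("Bob", "Inception", 5),
    ("Peter", "Matrix", 4), ("Peter", "Alien", 3), ("Peter", "Inception", 2),
]


def phase1(ratings):
    """Map: key each rating by user. Reduce: collect every (item, rating) per user."""
    by_user = defaultdict(list)
    for user, item, rating in ratings:
        by_user[user].append((item, rating))
    return dict(by_user)


def phase2_map(by_user):
    """For each user, emit every pair of items that user rated, with both ratings."""
    co_ratings = defaultdict(list)
    for item_ratings in by_user.values():
        # Sorting gives each item pair one canonical (alphabetical) key.
        for (item_a, r_a), (item_b, r_b) in combinations(sorted(item_ratings), 2):
            co_ratings[(item_a, item_b)].append((r_a, r_b))
    return dict(co_ratings)


if __name__ == "__main__":
    for pair, values in phase2_map(phase1(RATINGS)).items():
        print(pair, values)
```

The phase 2 reduce step then receives one list of rating pairs per item pair and turns it into a single similarity value.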
  54. Algorithms Examples – Recommendation • MapReduce phase 2 – Reduce – Compute similarities • Input: Matrix, Alien (5,1); Matrix, Alien (4,3); Alien, Inception (1,4); Alien, Inception (2,5); Alien, Inception (3,2); Matrix, Inception (4,2); Matrix, Inception (5,4) • Output: Matrix, Alien (-0.47); Matrix, Inception (0.47); Alien, Inception (-0.63)
  55. Recommendation (rating matrix with the prediction filled in; columns: The Matrix, Alien, Inception) • Alice: 5, 1, 4 • Bob: 1.5, 2, 5 • Peter: 4, 3, 2
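The final weighted sum can be checked by hand with the similarities from slide 54 and Bob's two known ratings. The sketch below assumes the sum is normalized by the absolute similarities, which the slides do not state explicitly; under that assumption it reproduces the 1.5 shown for Bob and "The Matrix". The function name is illustrative.

```python
def predict(similarities, ratings):
    """Weighted sum of a user's ratings, weighted by each item's similarity to the target.

    similarities: item -> similarity to the target item
    ratings:      item -> the user's rating of that item
    Normalizing by the sum of absolute similarities is an assumption.
    """
    numerator = sum(similarities[item] * rating for item, rating in ratings.items())
    denominator = sum(abs(similarities[item]) for item in ratings)
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    sim_to_matrix = {"Alien": -0.47, "Inception": 0.47}  # from slide 54
    bob = {"Alien": 2, "Inception": 5}                   # Bob's known ratings
    # (-0.47 * 2 + 0.47 * 5) / (0.47 + 0.47) = 1.41 / 0.94
    print(round(predict(sim_to_matrix, bob), 2))  # 1.5
```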
