Clickstream Data Warehouse - Turning clicks into customers

As the web becomes a main channel for reaching customers and prospects, the Clickstream data generated by websites has become another important enterprise data source, alongside traditional business data sources such as store transactions, CRM data and call center logs. Recording every click a customer makes sounds simple, yet Clickstream data offers a wide range of opportunities for modelling user behaviour and gaining valuable customer insights. It remains an underutilized data source. However, the benefits come with a problem. Amazon records 5 billion clicks a day and the whole US generates 400 billion clicks a day, equivalent to 3.4 petabytes. This immense volume gives enterprises and their IT professionals a big data problem to solve before they can fully utilize this insight-rich data source.

This presentation uses big data technology to help solve this big data problem. The presenter covers Clickstream data end to end: its benefits, its challenges and the solution. The end-to-end solution includes a proposed data architecture, ETL and various machine learning algorithms. A successful real-world example is presented to help the audience grasp the concept and its applications, along with sample code and a demo that the audience can apply in their own areas.

Published in: Technology

Clickstream Data Warehouse - Turning clicks into customers

  1. 1. Clickstream Data Warehouse – turning clicks into customers. Albert Hui
  2. 2. About Me • Associate Director with EPAM Canada • Over 12 years with Business Intelligence/Data Warehousing • Over 7 years with Java and web technologies • BIDW Architect, Big Data Evangelist • Conference Speaker at IOUG, TOUG Collaborate 2011, 2012 and 2013 • Technical editor on Oracle 12c Book • Master of Engineering in the area of Artificial Intelligence – Fuzzy logic • MBA, University of Toronto • Toronto based • Twitter: @dataeconomist • Father of twin boys (4/19/2013)
  3. 3. Agenda • Objective of this Session • What is Clickstream data? • How to collect Clickstream data? • Use Cases • Challenges – what are we trying to solve? • Solutions • Live Demo • How to Start? • Concluding Thoughts • Q&A
  4. 4. Some Leaders Who Chose EPAM.
  5. 5. Objective of this session • Introduction of Clickstream Data • Start thinking how to fully utilize Clickstream Data, individually and as an organization – a sample demo • Get started: solutions and available technologies
  6. 6. Movie – A Beautiful Mind
  7. 7. Sales – how to sell a lobster www.bishopbigideas.com
  8. 8. Let’s have a quick quiz
  9. 9. Quick Quiz • In the US, a 45-year-old male with 3 children, an income of around 150-180K and a postgraduate education wants to buy a car. Which brand?
  10. 10. Quick Quiz • In the US, a 45-year-old male with 3 children, a 180K income and a graduate school education wants to buy a car. He lives in Texas. Which brand?
  11. 11. Quick Quiz • In the US, a 45-year-old male with 3 children and a graduate school education wants to buy a car. He lives in Texas, he is a single parent, his income is <Unknown>, but he is looking to travel to Florida ONLY. Which brand?
  12. 12. Quick Quiz • But would these preferences change over time (evolving behaviours)? How do we catch up?
  13. 13. @Miami Parking lot
  14. 14. What is Clickstream Data?
  15. 15. Clickstream Data • A Clickstream is the recording of the parts of the screen a computer user clicks on while web browsing or using another software application. As the user clicks anywhere in the webpage or application, the action is logged on a client or inside the web server, as well as possibly the web browser, router, proxy server or ad server. Clickstream analysis is useful for web activity analysis, software testing, market research, and for analyzing employee productivity. Source: Wikipedia
  16. 16. Clickstream Data • Clickstream is not just weblogs. • It can be essentially every interaction that you transact with any electronic device. – TV PVRs. – Smart phones. – Game consoles. – Sensors: security systems, highways. – E-payment cards, loyalty cards. – Geolocation. – Maybe more: • Alarm clocks. • Printers. • Parking, etc.
  17. 17. Clickstream Data • Clickstream Data is not new. – Clickstream Data Warehousing, by Mark Sweiger, was published in January 2002. • There are essentially two types of Clickstream data: – An individual site’s Clickstream (click path) – Internet Clickstream data • Server weblogs account for 75% of daily data generation, according to Gartner. • Facebook alone captures 1.5PB of weblog data daily. • Amazon captures 200TB of weblog data daily.
  18. 18. Sample of Clickstream Data • Web logs
      204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 "http://metacrawler.com/crawler?general=dimensional+modeling" "Mozilla/4.5 [en] (Win98; I)"
      204.243.130.5 - - [26/Feb/2001:15:34:53 -0600] "GET /logo1.gif HTTP/1.0" 200 1900 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"
      204.243.130.5 - - [26/Feb/2001:15:35:26 -0600] "GET /articles.html HTTP/1.0" 200 7363 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"
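The sample lines above follow the common "combined" web server log layout (host, identity, user, timestamp, request, status, bytes, referrer, user agent). A minimal sketch of how such a line could be split into fields is shown here; the regular expression and the `parse_log_line` helper are illustrative assumptions, not part of the original deck.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the text fields that have a natural typed form.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

sample = ('204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 '
          '"http://metacrawler.com/crawler?general=dimensional+modeling" '
          '"Mozilla/4.5 [en] (Win98; I)"')
record = parse_log_line(sample)
print(record["host"], record["path"], record["status"], record["referrer"])
```

The referrer field is what reveals the search phrase ("dimensional modeling") that brought this visitor to the site.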
  19. 19. Clickstream – Click-path Analytics • A click path is the sequence of links a site visitor follows.
  20. 20. Clickstream – Click-path Analytics • A click path is the sequence of links a site visitor follows.
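As a rough illustration of the click-path definition, the sketch below groups raw click events by visitor and orders them by time. The `(visitor_id, timestamp, page)` event shape and the `build_click_paths` name are assumptions made for this example.

```python
from collections import defaultdict

def build_click_paths(events):
    """Group (visitor_id, timestamp, page) events into one time-ordered page list per visitor."""
    by_visitor = defaultdict(list)
    for visitor_id, timestamp, page in events:
        by_visitor[visitor_id].append((timestamp, page))
    # Sorting the (timestamp, page) pairs puts each visitor's clicks in order.
    return {visitor: [page for _, page in sorted(hits)]
            for visitor, hits in by_visitor.items()}

# Hypothetical events, deliberately out of order.
events = [
    ("visitor-1", 3, "/checkout"),
    ("visitor-1", 1, "/"),
    ("visitor-2", 1, "/"),
    ("visitor-1", 2, "/articles.html"),
    ("visitor-2", 2, "/logo1.gif"),
]
paths = build_click_paths(events)
print(paths["visitor-1"])
```

In practice the visitor key might be a cookie, a session ID or a MAC address, depending on what the logs actually contain.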
  21. 21. Let’s take another quick quiz
  22. 22. Quiz 2: Which one is a more frustrated customer? Customer A, Customer B
  23. 23. Quiz 2: Which one is a more frustrated customer? What if I tell you the customer is a deal finder?
  24. 24. How is Clickstream Data collected?
  25. 25. Clickstream – how to collect • Web Logs – No JavaScript tracking code is needed. The data is collected by the web server independently of a visitor’s browser. It captures all the requests made to your web server, including pages, images and PDFs.
  26. 26. Clickstream – how to collect • Page Tagging – Google Analytics is implemented with "page tags". A page tag, in this case called the Google Analytics Tracking Code (GATC), is a snippet of JavaScript code that the website owner adds to every page of the website. The GATC runs in the client browser when the client browses the page (if JavaScript is enabled in the browser), collects visitor data and sends it to a Google data collection server as part of a request for a web beacon.
  27. 27. What about some Use Cases for Clickstream?
  28. 28. Clickstream – Use Cases • Internet Traffic Analytics is another type of Clickstream data, e.g. – Google Analytics – Yandex – Kontagent
  29. 29. Clickstream – Use Case – Google Analytics • Google Analytics measures how your site is performing – Competitor Analytics – Social Mobile analytics – Advertising Analytics
  30. 30. Clickstream – Use Case – Yandex • Yandex is another big one, based in Russia
  31. 31. Clickstream – Use Cases – make money • Advertising on the Internet 1. Banner Ads 2. Paid Search 3. Email Campaign
  32. 32. Clickstream – Use Cases – make money • Personalized Advertising: Minority Report-style shopping? The billboard that profiles you and then flashes up ads tailored to your tastes
  33. 33. Clickstream – Use Cases – medical field • Medical Science – electronic clicks
  34. 34. Clickstream – Use Cases – games • Kontagent is the user analytics platform for developers, marketers, product managers, and strategic partners across the social and mobile web. The platform kSuite provides social data pattern visualization and analysis that delivers actionable insights via an on-demand service. • San Francisco/Toronto based. • It focuses on the gaming industry and records every click of the gamers. • It tries to make gaming sites more sticky. • Raised US$50M+ in the last 3 years.
  35. 35. Quiz #3
  36. 36. Clickstream – Quiz #3 1. What is the main focus of these analytics? 2. What are they missing?
  37. 37. YOU
  38. 38. SIMILARITY BETWEEN all of YOU
  39. 39. Collective Intelligence, Crowd Sourcing
  40. 40. What are we trying to solve?
  41. 41. Clickstream - Challenges • Yes, you are right! We have too much data.
  42. 42. Clickstream - Challenges • User demographics data is hard to get, due to localized privacy laws. • Users’ sense of privacy. • User preferences change constantly; there are no one-size-fits-all rules.
  43. 43. Clickstream – What are we trying to solve? Rules inside the data
  44. 44. Clickstream - Challenges

      | Gender | Age | Marital status | Occupation | No. of kids | Income | Region | Race | Own a house | Car brand | Like sport | Like politics | Like business | ... | Click path | Buy |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | M | 25-35 | M | Engineer | 3 | 80-90K | Toronto | Caucasian | Y | BMW | - | N | Y | ... | ABACBCDE... | Y |
      | M | 25-35 | S | Chemist | 1 | 50-60K | NY | Asian | Y | N/A | N | - | N | ... | AABEBFGHIGSJBA.. | Y |
      | F | 35-45 | D | Chemist | 0 | 50-60K | Toronto | Caucasian | N | TOYOTA | N | N | - | ... | ABAEBFGHIGFSBA... | N |
      | F | 50-60K | M | Doctor | 6 | - | Minsk | Caucasian | Y | BMW | N | Y | Y | ... | ABAEBFGHIGFSBA... | Y |
      | F | 35-45 | D | Researcher | 0 | 50-60K | Toronto | Caucasian | Y | N/A | N | Y | N | ... | ABAEBFGHIGFSBA.. | N |
      | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
  45. 45. Clickstream - Challenges (the same feature table as on slide 44)
  46. 46. Clickstream – What are we trying to solve? Prediction
  47. 47. Solutions here.
  48. 48. Clickstream – Solutions – Clickstream Data Warehouse • Problems and solutions: Too much data, Rules inside the data, Data Vectorization, Prediction, Architecture and Schema
  49. 49. Clickstream – Solutions – handling too much data (Apache Hadoop) • Top level Apache project • Open source • Software framework – Java • Inspired by Google’s white papers on Map/Reduce (MR), Google File System (GFS) and Big Table • Originally developed to support Apache Nutch • Designed for – Large scale data processing – Batch processing – Sophisticated analysis – Dealing with structured and unstructured data
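To make the Map/Reduce idea concrete, here is a single-process sketch of the three phases (map, shuffle, reduce) counting clicks per page. It only imitates the programming model in plain Python; it is not the Hadoop API, and the simplified `host page` input lines and function names are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(log_lines):
    """Map: emit a (page, 1) pair for every request line of the form 'host page'."""
    for line in log_lines:
        _host, page = line.split()
        yield page, 1

def shuffle(pairs):
    """Shuffle: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical, already simplified log lines.
log_lines = ["204.243.130.5 /", "204.243.130.5 /articles.html", "10.0.0.7 /"]
clicks_per_page = reduce_phase(shuffle(map_phase(log_lines)))
print(clicks_per_page)
```

On a real cluster the map and reduce steps run in parallel over blocks of the log files, which is what makes the batch approach workable at the volumes quoted earlier.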
  50. 50. Clickstream – Solutions – Data Vectorization • Clustering: understanding data as vectors • Mahout Vector implementations: 1. DenseVector 2. RandomAccessSparseVector 3. SequentialAccessSparseVector (the sparse implementations store only non-zero values in memory) • Vectors must be serializable (java.io.Serializable, Mahout’s VectorWritable) • Example: X=5, Y=3 gives the point (5, 3) • The vector denoted by point (5, 3) is simply Array([5, 3]) or HashMap([0 => 5], [1 => 3])
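The Array versus HashMap remark can be sketched in a few lines. This is plain Python standing in for the dense and sparse ideas, not Mahout's classes; the helper names `to_sparse` and `to_dense` are made up for the example.

```python
# The point (5, 3) in its two representations.
dense = [5.0, 3.0]          # array: position is the dimension index
sparse = {0: 5.0, 1: 3.0}   # hash map: dimension index -> value

def to_sparse(dense_vector):
    """Keep only the non-zero entries, keyed by their index."""
    return {i: v for i, v in enumerate(dense_vector) if v != 0}

def to_dense(sparse_vector, size):
    """Expand a sparse vector back to a fixed-length list, filling gaps with 0.0."""
    return [sparse_vector.get(i, 0.0) for i in range(size)]

# A mostly-zero vector is where the sparse form saves memory.
mostly_zero = [0.0, 0.0, 7.0, 0.0, 0.0, 0.0, 1.0, 0.0]
compact = to_sparse(mostly_zero)
print(compact)
```

Clickstream feature vectors tend to be mostly zeros (a visitor touches only a few pages or answers), which is why the sparse form matters.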
  51. 51. Clickstream – Solutions – Data as n-dimensional vectors • Clustering: understanding data as vectors • Imagine one dimension for each feature of user, product, geography, time, etc. • Each dimension is also called a feature or label • Support Vector Machine (SVM) • Example axes: age, occupation, income
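One possible way to turn the three example axes (age, occupation, income) into a numeric vector is sketched below. The one-hot treatment of occupation, the fixed `OCCUPATIONS` list and the `encode_user` helper are assumptions chosen for illustration; the deck does not say how the features were encoded.

```python
# Hypothetical, fixed list of known occupations; one dimension per entry.
OCCUPATIONS = ["Engineer", "Chemist", "Doctor", "Researcher"]

def encode_user(age, occupation, income_k):
    """Return [age, income in thousands, one-hot occupation...] as floats."""
    one_hot = [1.0 if occupation == name else 0.0 for name in OCCUPATIONS]
    return [float(age), float(income_k)] + one_hot

vector = encode_user(age=30, occupation="Chemist", income_k=55)
print(vector)
```

A categorical field such as occupation has no natural ordering, so giving each value its own dimension avoids implying that one occupation is "larger" than another.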
  52. 52. Clickstream – Solutions – Predictive Algorithms • Four major steps: 1. Collect and model the data 2. Select/build a model 3. Train/test the model 4. Then predict what is to happen
  53. 53. Clickstream – Solutions – Predictive Algorithms (Apache Mahout) • An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License • http://mahout.apache.org • The Hindi word “mahout” stands for elephant driver • Why Mahout? – Many open source ML libraries either: • Lack community • Lack documentation and examples • Lack scalability • Lack the Apache License • Or are research-oriented
  54. 54. Clickstream – Solutions – Algorithms • Algorithms and Applications: Freq. Pattern Mining, Classification, Clustering, Recommenders • Math, Utilities, Statistics, Probability, Vectors/Matrices/SVD, Lucene/Solr • Apache Hadoop • See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
  55. 55. Clickstream – Solutions – Algorithms – Mahout command line launcher • bin/mahout list (this shows the list of algorithms) • Valid program names are:
      1. canopy: Canopy clustering
      2. cleansvd: Cleanup and verification of SVD output
      3. clusterdump: Dump cluster output to text
      4. dirichlet: Dirichlet Clustering
      5. fkmeans: Fuzzy K-means clustering
      6. fpg: Frequent Pattern Growth
      7. itemsimilarity: Compute the item-item-similarities for item-based collaborative filtering
      8. kmeans: K-means clustering
      9. lda: Latent Dirichlet Allocation
      10. ldatopics: LDA Print Topics
      11. lucene.vector: Generate Vectors from a Lucene index
      12. matrixmult: Take the product of two matrices
      13. meanshift: Mean Shift clustering
      14. recommenditembased: Compute recommendations using item-based collaborative filtering
      …
  56. 56. Clickstream – Solutions – Algorithms – build a model • Learn a model from a manually labelled training dataset • Predict the class of an unseen object based on its features • E.g. use features of the user profile, product and click path to predict users’ preferences.
  57. 57. Clickstream – Solutions – Algorithms – build a model • Learn a model from a manually labelled training dataset • Predict the class of an unseen object based on its features • E.g. use features of the user profile, product and click path to predict users’ preferences.
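The learn-then-predict loop can be shown with a deliberately simple stand-in: a nearest-centroid classifier in plain Python. This is not one of Mahout's algorithms and not the model used in the demo; the feature vectors, labels and function names are invented for the sketch.

```python
def train_centroids(examples):
    """Learn one mean feature vector per label from (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def predict(centroids, features):
    """Predict the label whose centroid is closest (squared Euclidean distance)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: squared_distance(centroids[label], features))

# Hypothetical labelled training set: [owns_house, likes_sport, pages_viewed] -> outcome.
training = [
    ([1.0, 0.0, 5.0], "buy"),
    ([1.0, 1.0, 4.0], "buy"),
    ([0.0, 0.0, 1.0], "no-buy"),
    ([0.0, 1.0, 0.0], "no-buy"),
]
model = train_centroids(training)
prediction = predict(model, [1.0, 0.0, 4.0])  # an unseen visitor
print(prediction)
```

Any classifier follows the same two calls: fit on labelled examples, then score objects it has not seen before.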
  58. 58. Clickstream – Solutions – Clickstream Data Warehouse • Traditional Clickstream Data Warehouse Schema • Common Dimensions: 1. Customer 2. Product 3. Time 4. Geography 5. Page 6. Content (meta-data) 7. User • Facts: 1. Sales 2. User Activities • Design: schema design depends on the data we have and the measures we have
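A cut-down star schema with three of the listed dimensions (user, page, time) and a user-activity fact table might look like the sketch below, using an in-memory SQLite database. All table names, column names and sample rows are assumptions for illustration; the schemas shown in the deck are images and are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per user, page, and time bucket.
CREATE TABLE dim_user (user_key INTEGER PRIMARY KEY, mac_addr TEXT);
CREATE TABLE dim_page (page_key INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day TEXT, hour INTEGER);

-- Fact table: one row per user/page/time combination with a click measure.
CREATE TABLE fact_user_activity (
    user_key INTEGER REFERENCES dim_user(user_key),
    page_key INTEGER REFERENCES dim_page(page_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    clicks   INTEGER
);

INSERT INTO dim_user VALUES (1, '00:00:00:00:00:01');
INSERT INTO dim_page VALUES (1, '/'), (2, '/articles.html');
INSERT INTO dim_time VALUES (1, '2001-02-26', 15);
INSERT INTO fact_user_activity VALUES (1, 1, 1, 2), (1, 2, 1, 1);
""")

# A typical warehouse query: total clicks per page.
rows = conn.execute("""
    SELECT p.url, SUM(f.clicks)
    FROM fact_user_activity AS f
    JOIN dim_page AS p ON p.page_key = f.page_key
    GROUP BY p.url
    ORDER BY p.url
""").fetchall()
print(rows)
```

Which dimensions and measures to keep depends, as the slide says, on the data actually available.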
  59. 59. Clickstream – Solutions – Clickstream Data Warehouse • Source: Clickstream Data Warehousing by Mark Sweiger
  60. 60. Clickstream – Solutions – Clickstream Data Warehouse • Source: Clickstream Data Warehousing by Mark Sweiger
  61. 61. Clickstream – Solutions – Clickstream Data Warehouse • Source: Clickstream Data warehouse by Albert H
  62. 62. Clickstream – Solutions – technology stack • Reporting: reports, BI tool, application/web app • RDBMS (Oracle, MySQL), ZooKeeper, model hosting • Data movement: ETL (INFA, Talend) • Data warehouse: Apache Hive, HBase • Statistical algorithms: Mahout • Apache Hadoop • Clickstream logs
  63. 63. What about a case study – demo?
  64. 64. Clickstream – Case Demo • An Asia-based hotspot Wi-Fi provider with wireless routers throughout China/Hong Kong. • Revenue model: advertising – advertisers place ads when users browse the Net. • Data – Survey data: users are required to fill in a survey before logging in. – Click logs, including ad click-throughs. • Data size: – 12GB+ compressed a day. – 150M+ clicks and 2.4M click-throughs a day. • Problem definition: the click-through rate is too low.
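The quoted volumes give a rough sense of how low the click-through rate is. The arithmetic below treats the "150M+" and "2.4M" figures as exact, which is an assumption, so the result is only an approximation.

```python
# Daily figures from the slide, taken at face value.
clicks_per_day = 150_000_000
click_throughs_per_day = 2_400_000

# Click-through rate = click-throughs / total clicks.
ctr = click_throughs_per_day / clicks_per_day
ctr_percent = round(ctr * 100, 2)
print(f"CTR is roughly {ctr_percent}%")
```

That is about 16 click-throughs per 1,000 clicks, the baseline the case study sets out to improve.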
  65. 65. Clickstream – Case Demo • Hadoop – running Cloudera CDH4
  66. 66. Clickstream – Case Demo • Meet the Clickstream logs: when the click is recorded, router location, ad site clicked, MAC addr
  67. 67. Clickstream – Case Demo • Meet the survey questions: some samples of survey questions
  68. 68. Clickstream – Case Demo • Meet the answers and survey results: options for survey answers
  69. 69. Clickstream – Case Demo • Vectorize the data for users who click weibo.com: MAC addr, data vectors
  70. 70. Clickstream – Case Demo • Training data set • Resultant vector • “Cosine distance” value
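For reference, cosine similarity and the corresponding cosine distance between two vectors can be computed as in this sketch. It is a generic formula in plain Python, not the demo's actual code; treating an all-zero vector as having similarity 0 is a convention chosen here.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance form: 0 for vectors pointing the same way, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(a, b)

print(cosine_distance([5, 3], [10, 6]))  # same direction, different length
print(cosine_distance([1, 0], [0, 1]))   # orthogonal
```

Because only the angle matters, a user who answered the survey the same way as the training vector scores as close even if the raw magnitudes differ.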
  71. 71. Clickstream – Case Demo • Test data set • Area under the curve (AUC) • A confusion matrix is a table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives.
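Both evaluation ideas on this slide can be sketched in plain Python: a 2x2 confusion matrix at a fixed threshold, and AUC computed as the share of (positive, negative) pairs that the scores rank correctly. The sample labels, scores and the 0.5 threshold are made up for the example.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1
        elif not a and p:
            counts["fp"] += 1
        elif a and not p:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

def auc(actual, scores):
    """Fraction of positive/negative pairs where the positive scores higher (ties count half)."""
    positives = [s for a, s in zip(actual, scores) if a]
    negatives = [s for a, s in zip(actual, scores) if not a]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical test set: did the user actually click, and the model's score.
actual = [True, False, True, False]
scores = [0.9, 0.6, 0.4, 0.2]
predicted = [s > 0.5 for s in scores]

print(confusion_matrix(actual, predicted))
print(auc(actual, scores))
```

An AUC of 0.5 corresponds to random ranking, which is one common reason to read values above 0.5 as better than chance.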
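  72. 72. Clickstream – Case Demo • Test Results: an AUC value > 0.5 is good; 2 out of 6 are predicted right

      | macaddr | q16 | q17 | q18 | q19 | q20 | q21 | q22 | q23 | q24 | q25 | AUC Value | Actually clicked |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | 00:22:5f:34:54:3e | 116 | 166 | 135 | 146 | 157 | 169 | 172 | 177 | 183 | 193 | 0.76 | Y |
      | 00:1f:5b:b3:26:6d | 117 | 125 | 136 | 144 | 162 | 0 | 0 | 0 | 0 | 197 | 0.65 | N |
      | 00:1a:73:e8:56:c6 | 117 | 122 | 137 | 152 | 159 | 169 | 172 | 177 | 190 | 195 | 0.65 | N |
      | 00:18:de:1f:fe:c0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 193 | 0.61 | Y |
      | 00:1e:65:51:34:80 | 0 | 0 | 137 | 141 | 157 | 0 | 0 | 0 | 0 | 210 | 0.59 | N |
      | 00:17:c4:a9:16:6c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.53 | N |
      | 00:1f:3b:06:87:3d | 118 | 131 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 201 | 0.41 | Y |
      | 00:21:19:a4:8d:ea | 0 | 0 | 134 | 151 | 157 | 170 | 172 | 177 | 184 | 211 | 0.32 | N |
      | 00:1e:65:7d:2d:d2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.29 | N |
      | 00:16:44:c7:80:35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.24 | Y |
      | 00:16:44:d4:11:9a | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.22 | Y |
      | 00:13:02:a4:33:9c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2 | Y |
      | 00:21:19:9a:64:ad | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.18 | N |
      | 00:1f:df:75:0a:8e | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.16 | N |
      | 00:25:d3:50:37:92 | 118 | 127 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | Y |
      | 00:21:00:d6:98:2c | 118 | 123 | 0 | 0 | 0 | 169 | 172 | 176 | 187 | 192 | 0.11 | Y |
      | 00:17:c4:9b:2c:e2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.11 | N |
      | 00:0d:f0:6d:fc:47 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.11 | Y |
      | 00:1e:65:3f:e1:6c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 177 | 188 | 0 | 0.1 | N |
      | 00:21:00:e3:a5:f1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.08 | N |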
  73. 73. Clickstream – Case Demo • Meet the ETL process with Talend BD V 5.2
  74. 74. Clickstream – Case Demo • Meet some sample reports
  75. 75. Clickstream – Case Demo • Meet some sample reports
  76. 76. Objective of this session • Introduction of Clickstream Data • Start thinking how to fully utilize Clickstream Data, individually and as an organization – a sample demo • Get started: solutions and available technologies
  78. 78. Thank you! Albert Hui, MBA, MASc., P.Eng, CSM • EPAM Canada, Associate Director • Email: albert_hui@epam.com • Follow me on Twitter: @dataeconomist • Please help fill in an evaluation form: www.ioug.org/eval, Session # 353
