Clickstream Data Warehouse - Turning clicks into customers

As the web becomes a main channel for reaching customers and prospects, the Clickstream data generated by websites has become an important enterprise data source, alongside traditional business data sources such as store transactions, CRM data, and call center logs. Recording every click a customer makes sounds simple, but Clickstream data offers a wide range of opportunities for modelling user behaviour and gaining valuable customer insights. It is a data source that has been underutilized. However, the benefits come with a problem. Amazon records 5 billion clicks a day, and the whole US generates 400 billion clicks, equivalent to 3.4 petabytes a day. This immense volume gives enterprises and their IT professionals a big data problem to solve before they can fully utilize this insight-rich data source.

This presentation uses big data technology to help solve this big data problem. The presenter will explain Clickstream data: its benefits, its challenges, and the solution. The end-to-end solution includes a proposed data architecture, ETL, and various machine learning algorithms. A successful real-world example will be presented to help the audience grasp the concept and its applications, along with sample code and a demo that attendees can apply in their own areas.

    Presentation Transcript

    • Clickstream Data Warehouse – turning clicks into customers. Albert Hui
    • About Me • Associate Director with EPAM Canada • Over 12 years with Business Intelligence/Data Warehousing • Over 7 years with Java and web technologies • BIDW Architect, Big Data Evangelist • Conference Speaker at IOUG, TOUG Collaborate 2011, 2012 and 2013 • Technical editor on Oracle 12c Book • Master of Engineering in the area of Artificial Intelligence – Fuzzy logic • MBA, University of Toronto • Toronto based • Twitter: @dataeconomist • Father of twin boys • 4/19/2013
    • Agenda • Objective of this Session • What is Clickstream data? • How to collect Clickstream data? • Use Cases • Challenges – what are we trying to solve? • Solutions • Live Demo • How to Start? • Concluding Thoughts • Q&A
    • Some Leaders Who Chose EPAM.
    • Objective of this session • Introduction of Clickstream Data • Start thinking how to fully utilize Clickstream Data • Get started individually and as an organization - a sample demo • Solutions and available technologies
    • Movie – A Beautiful Mind
    • Sales – how to sell a lobster www.bishopbigideas.com
    • Let’s have a quick quiz
    • Quick Quiz • In the US, a 45-year-old male with 3 children, an income of around 150-180K, and a postgraduate education wants to buy a car. Which brand?
    • Quick Quiz • In the US, a 45-year-old male with 3 children, a 180K income, and a graduate school education wants to buy a car, and he lives in Texas. Which brand?
    • Quick Quiz • In the US, a 45-year-old male with 3 children and a graduate school education wants to buy a car. He lives in Texas, is a single parent with <Unknown> income, and is looking to travel to Florida ONLY. Which brand?
    • Quick Quiz • But would these preferences change over time (evolving behaviours)? How do we catch up?
    • @Miami parking lot
    • What is Clickstream Data?
    • Clickstream Data • A Clickstream is the recording of the parts of the screen a computer user clicks on while web browsing or using another software application. As the user clicks anywhere in the webpage or application, the action is logged on a client or inside the web server, as well as possibly the web browser, router, proxy server or ad server. Clickstream analysis is useful for web activity analysis, software testing, market research, and for analyzing employee productivity. Source: Wikipedia
    • Clickstream Data • Clickstream is not just weblogs. • It can be essentially every interaction you have with any electronic device: – TV PVRs – Smartphones – Game consoles – Sensors: security systems, highways – E-payment cards, loyalty cards – Geolocation – Maybe more: alarm clocks, printers, parking, etc.
    • Clickstream Data • Clickstream Data is not new. – Clickstream Data Warehousing, by Mark Sweiger, was published in January 2002. • There are essentially two types of Clickstream data: – An individual site’s Clickstream (click path) – Internet Clickstream data • Server weblogs account for 75% of daily data generation, according to Gartner. • Facebook alone captures 1.5PB of weblog data daily. • Amazon captures 200TB of weblog data daily.
    • Sample of Clickstream Data • Web logs:
      204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 "http://metacrawler.com/crawler?general=dimensional+modeling" "Mozilla/4.5 [en] (Win98; I)"
      204.243.130.5 - - [26/Feb/2001:15:34:53 -0600] "GET /logo1.gif HTTP/1.0" 200 1900 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"
      204.243.130.5 - - [26/Feb/2001:15:35:26 -0600] "GET /articles.html HTTP/1.0" 200 7363 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"
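The sample log lines appear to follow the common "combined" web server log layout (host, identity, user, timestamp, request, status, bytes, referrer, user agent). A minimal Python sketch of turning one such line into named fields is shown here. The regular expression, the field names, and the `parse_log_line` helper are assumptions made for illustration and are not taken from the presentation.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the text fields that have an obvious numeric or date type.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# First sample line from the slide.
sample = (
    '204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 '
    '"http://metacrawler.com/crawler?general=dimensional+modeling" '
    '"Mozilla/4.5 [en] (Win98; I)"'
)
record = parse_log_line(sample)
print(record["host"], record["path"], record["status"], record["size"])
```

Parsing into named fields early makes later steps (sessionizing, joining with survey data) easier than working on raw strings.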
    • Clickstream – Click-path Analytics • A click path is the sequence of links a site visitor follows.
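As a rough illustration of click-path construction, the Python sketch below groups page hits by visitor, orders them by time, and starts a new path whenever the gap between two hits exceeds a timeout. The 30-minute timeout, the `click_paths` helper, and the third hit in the sample data are assumptions for illustration. The first two hits come from the sample web log.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed cut-off between two visits


def click_paths(hits):
    """hits: iterable of (visitor_id, timestamp, page).

    Returns {visitor_id: [path, ...]} where each path is a list of pages
    in the order they were requested.
    """
    by_visitor = defaultdict(list)
    for visitor, timestamp, page in hits:
        by_visitor[visitor].append((timestamp, page))

    paths = {}
    for visitor, events in by_visitor.items():
        events.sort()  # chronological order
        sessions = []
        last_timestamp = None
        for timestamp, page in events:
            # Start a new path on the first hit or after a long pause.
            if last_timestamp is None or timestamp - last_timestamp > SESSION_TIMEOUT:
                sessions.append([])
            sessions[-1].append(page)
            last_timestamp = timestamp
        paths[visitor] = sessions
    return paths


hits = [
    ("204.243.130.5", datetime(2001, 2, 26, 15, 34, 52), "/"),
    ("204.243.130.5", datetime(2001, 2, 26, 15, 35, 26), "/articles.html"),
    ("204.243.130.5", datetime(2001, 2, 26, 17, 0, 0), "/"),  # hypothetical later visit
]
paths = click_paths(hits)
print(paths)
```

Identifying a visitor by IP address alone is a simplification. Real deployments usually combine cookies or other identifiers.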
    • Let’s take another quick quiz
    • Quiz 2: Which one is the more frustrated customer? • Customer A • Customer B
    • Quiz 2: Which one is the more frustrated customer? What if I tell you the customer is a deal finder?
    • How is Clickstream Data collected?
    • Clickstream – how to collect • Web Logs – No JavaScript code is needed for tracking. The data is collected by the web server independently of the visitor’s browser. It captures all the requests made to your web server, including pages, images and PDFs.
    • Clickstream – how to collect • Page Tagging – Google Analytics is implemented with "page tags". A page tag, in this case called the Google Analytics Tracking Code (GATC), is a snippet of JavaScript code that the website owner adds to every page of the website. The GATC code runs in the client browser when the client browses the page (if JavaScript is enabled in the browser), collects visitor data, and sends it to a Google data collection server as part of a request for a web beacon.
    • What about some Use Cases for Clickstream?
    • Clickstream – Use Cases • Internet Traffic Analytics is another type of Clickstream data, e.g. – Google Analytics – Yandex – Kontagent
    • Clickstream – Use Case – Google Analytics • Google Analytics measures how your site is performing – Competitor Analytics – Social Mobile Analytics – Advertising Analytics
    • Clickstream – Use Case – Yandex • Yandex is another big one, based in Russia
    • Clickstream – Use Cases – make money • Advertising on the Internet 1. Banner Ads 2. Paid Search 3. Email Campaign
    • Clickstream – Use Cases – make money • Personalized Advertising • Minority Report-style shopping? The billboard that profiles you and then flashes up ads tailored to your tastes
    • Clickstream – Use Cases – medical field • Medical Science – electronic clicks
    • Clickstream – Use Cases – games • Kontagent is the user analytics platform for developers, marketers, product managers, and strategic partners across the social and mobile web. The platform, kSuite, provides social data pattern visualization and analysis that delivers actionable insights via an on-demand service. • San Francisco/Toronto based. • It focuses on the gaming industry and records every click of the gamers. • It tries to make gaming sites more sticky. • Raised $50M+ US in the last 3 years.
    • Quiz #3
    • Clickstream – Quiz #3 1. What is the main focus of these analytics? 2. What are they missing?
    • YOU
    • SIMILARITY BETWEEN all of YOU
    • Collective Intelligence • Crowd Sourcing
    • What are we trying to solve?
    • Clickstream – Challenges • Yes, you are right! We have too much data. • Yes, we have a lot of data.
    • Clickstream – Challenges • User demographics data is hard to get, due to local privacy laws. • Users’ sense of privacy. • User preferences change constantly; there are no one-size-fits-all rules.
    • Clickstream – What are we trying to solve? Rules inside the data
    • Clickstream – Challenges
      Gender | Age | Marital status | Occupation | No. of kids | Income | Region | Race | Own a house | Car brand | Like sport | Like politics | Like business | ... | Click path | Buy
      M | 25-35 | M | Engineer | 3 | 80-90K | Toronto | Caucasian | Y | BMW | - | N | Y | ... | ABACBCDE... | Y
      M | 25-35 | S | Chemist | 1 | 50-60K | NY | Asian | Y | N/A | N | - | N | ... | AABEBFGHIGSJBA... | Y
      F | 35-45 | D | Chemist | 0 | 50-60K | Toronto | Caucasian | N | TOYOTA | N | N | - | ... | ABAEBFGHIGFSBA... | N
      F | 50-60K | M | Doctor | 6 | - | Minsk | Caucasian | Y | BMW | N | Y | Y | ... | ABAEBFGHIGFSBA... | Y
      F | 35-45 | D | Researcher | 0 | 50-60K | Toronto | Caucasian | Y | N/A | N | Y | N | ... | ABAEBFGHIGFSBA... | N
      ...
    • Clickstream – What are we trying to solve? Prediction
    • Solutions here.
    • Clickstream – Solutions – Clickstream Data Warehouse • Problems: Too much data; Rules inside the data; Prediction • Solutions: Data Vectorization; Architecture and Schema
    • Clickstream – Solutions – handling too much data • Apache Hadoop • Top level Apache project • Open source • Software Framework - Java • Inspired by Google’s white papers on: – Map/Reduce (MR) – Google File System (GFS) – Big Table • Originally developed to support Apache Nutch • Designed for: – Large scale data processing – Batch processing – Sophisticated analysis – Dealing with structured and unstructured data
    • Clickstream – Solutions – Data Vectorization • Clustering: understanding data as vectors • Mahout Vector implementations: 1. DenseVector 2. RandomAccessSparseVector 3. SequentialAccessSparseVector (storing non-zero values in memory) • Vectors must implement the Java interface java.io.Serializable and work with Mahout’s VectorWritable • Example: the point X=5, Y=3 is the vector (5, 3) • The vector denoted by point (5, 3) is simply Array([5, 3]) or HashMap([0 => 5], [1 => 3])
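To make the Array versus HashMap distinction concrete, here is a small Python sketch of a dense and a sparse representation of the same vector. It is an analogy only and does not use Mahout. The helper names `to_sparse` and `to_dense` are made up for illustration.

```python
def to_sparse(dense):
    """Keep only the non-zero entries, keyed by their index."""
    return {index: value for index, value in enumerate(dense) if value != 0}


def to_dense(sparse, size):
    """Expand an index-to-value mapping back into a full list of the given size."""
    return [sparse.get(index, 0) for index in range(size)]


point = [5, 3]                      # dense: like Array([5, 3])
print(to_sparse(point))             # sparse: like HashMap([0 => 5], [1 => 3])

mostly_zero = [0, 0, 7, 0]
compact = to_sparse(mostly_zero)    # only one entry is stored
print(compact, to_dense(compact, len(mostly_zero)))
```

The sparse form pays off when most dimensions are zero, which is typical for click and survey features.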
    • Clickstream – Solutions – Data as n-dimensional vectors • Clustering: understanding data as vectors • Imagine one dimension for each feature for user, product, geography, time etc. • Each dimension is also called a feature or label • Support Vector Machine (SVM) • Example axes: age, occupation, income
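One common way to place categorical attributes such as age band, occupation, and income band on numeric axes is one-hot encoding. The Python sketch below assumes that approach. The presentation does not say which encoding was used, and the category lists and the `encode_user` helper are illustrative only.

```python
# Assumed category lists, loosely based on the values shown in the challenges table.
AGE_BANDS = ["25-35", "35-45"]
OCCUPATIONS = ["Engineer", "Chemist", "Doctor", "Researcher"]
INCOME_BANDS = ["50-60K", "80-90K"]


def one_hot(value, categories):
    """Return a list with 1.0 in the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == category else 0.0 for category in categories]


def encode_user(age, occupation, income):
    """Concatenate the one-hot blocks into a single feature vector."""
    return (
        one_hot(age, AGE_BANDS)
        + one_hot(occupation, OCCUPATIONS)
        + one_hot(income, INCOME_BANDS)
    )


vector = encode_user("25-35", "Engineer", "80-90K")
print(vector)
```

An unknown value (for example a missing income) simply yields an all-zero block under this scheme.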
    • Clickstream – Solutions – Predictive Algorithms • Four major steps: 1. Collection and modelling of the data 2. Select/build a model 3. Train/test the model 4. Then predict what is going to happen
    • Clickstream – Solutions – Predictive Algorithms • Mahout: an Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License • http://mahout.apache.org • A Hindi word that stands for “elephant driver” • Why Mahout? – Many open source ML libraries either: lack community; lack documentation and examples; lack scalability; lack the Apache License; or are research-oriented
    • Clickstream – Solutions – Algorithms • Algorithms and Applications: Freq. Pattern Mining, Classification, Clustering, Recommenders • Math: Vectors/Matrices/SVD • Utilities: Lucene/Solr • Statistics: Probability • Apache Hadoop • See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
    • Clickstream – Solutions – Algorithms – Mahout • Command line launcher: bin/mahout list (this shows the list of algorithms) • Valid program names are:
      1. canopy: : Canopy clustering
      2. cleansvd: : Cleanup and verification of SVD output
      3. clusterdump: : Dump cluster output to text
      4. dirichlet: : Dirichlet Clustering
      5. fkmeans: : Fuzzy K-means clustering
      6. fpg: : Frequent Pattern Growth
      7. itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
      8. kmeans: : K-means clustering
      9. lda: : Latent Dirchlet Allocation
      10. ldatopics: : LDA Print Topics
      11. lucene.vector: : Generate Vectors from a Lucene index
      12. matrixmult: : Take the product of two matrices
      13. meanshift: : Mean Shift clustering
      14. recommenditembased: : Compute recommendations using item-based collaborative filtering
      …..
    • Clickstream – Solutions – Algorithms – build a model • Learn a model from a manually labeled training dataset • Predict the class of an unseen object based on its features • E.g. use features of the user profile, product, and click path to predict users’ preferences.
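The learn-then-predict idea can be sketched with a deliberately simple classifier. The Python example below uses a nearest-centroid rule, chosen here only because it fits in a few lines. It is not the algorithm used in the presentation, and the two features (pages viewed, ad clicks) and all data values are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def train(examples):
    """examples: list of (feature_vector, label). Returns {label: centroid}."""
    by_label = {}
    for vector, label in examples:
        by_label.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(model, vector):
    """Return the label whose centroid is closest (Euclidean distance)."""
    return min(model, key=lambda label: math.dist(model[label], vector))


# Hypothetical labeled data: [pages_viewed, ad_clicks] -> did the user buy?
training_set = [
    ([1, 0], "N"),
    ([2, 0], "N"),
    ([8, 3], "Y"),
    ([9, 2], "Y"),
]
model = train(training_set)
print(predict(model, [7, 2]), predict(model, [1, 1]))
```

In practice part of the labeled data is held back as a test set, so the model is scored on examples it has not seen.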
    • Clickstream – Solutions – Clickstream Data Warehouse • Traditional Clickstream Data Warehouse Schema • Common Dimensions: 1. Customer 2. Product 3. Time 4. Geography 5. Page 6. Content (meta-data) 7. User • Facts: 1. Sales 2. User Activities • Design: schema design depends on the data we have and the measures we have
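A dimensional layout like the one listed above can be tried out quickly with Python's built-in `sqlite3` module. The sketch below builds a tiny star schema with three of the dimensions (user, page, time) and a user-activity fact table. All table names, column names, and rows are hypothetical and are not the schema from the presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_user (user_key INTEGER PRIMARY KEY, user_name TEXT);
CREATE TABLE dim_page (page_key INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day TEXT);

-- The fact table holds one measure (clicks) plus a key into each dimension.
CREATE TABLE fact_user_activity (
    user_key INTEGER REFERENCES dim_user(user_key),
    page_key INTEGER REFERENCES dim_page(page_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    clicks   INTEGER
);

INSERT INTO dim_user VALUES (1, 'visitor-a'), (2, 'visitor-b');
INSERT INTO dim_page VALUES (1, '/'), (2, '/articles.html');
INSERT INTO dim_time VALUES (1, '2001-02-26');
INSERT INTO fact_user_activity VALUES (1, 1, 1, 3), (1, 2, 1, 1), (2, 1, 1, 2);
""")

# Typical star-schema query: aggregate the fact table, grouped by a dimension attribute.
clicks_per_page = conn.execute("""
    SELECT p.url, SUM(f.clicks)
    FROM fact_user_activity AS f
    JOIN dim_page AS p ON p.page_key = f.page_key
    GROUP BY p.url
    ORDER BY p.url
""").fetchall()
print(clicks_per_page)
```

Which dimensions and measures to include depends on the data actually collected, as the slide notes.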
    • Clickstream – Solutions – Clickstream Data Warehouse • Source: Clickstream Data Warehousing, by Mark Sweiger
    • Clickstream – Solutions – Clickstream Data Warehouse • Source: Clickstream Data Warehouse, by Albert H
    • Clickstream – Solutions – technology stack • Application: Reports (BI tool), Web App • Reporting: RMDB (Oracle, MySQL), hosting models • Data movement: ETL (INFA, Talend), Model • Data warehouse: Apache Hive, HBase • Statistical algorithms: Mahout • ZooKeeper • Apache Hadoop • Clickstream logs
    • What about a Case study - demo?
    • Clickstream – Case Demo • An Asia-based Wi-Fi hotspot provider, with wireless routers throughout China/Hong Kong. • Revenue model: advertising – advertisers place ads when users browse the Net. • Data: – Survey data: users are required to fill in a survey before logging in. – Click logs, including ad click-throughs • Data size: – 12GB+ compressed a day. – 150M+ clicks and 2.4M click-throughs a day. • Problem definition: the click-through rate is too low
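The click-through rate implied by those daily figures can be worked out directly. The short Python calculation below treats the "150M+" and "2.4M" values as round numbers, so the result is approximate.

```python
clicks_per_day = 150_000_000          # "150M+ clicks" from the slide, rounded
click_throughs_per_day = 2_400_000    # "2.4M click-throughs" from the slide

click_through_rate = click_throughs_per_day / clicks_per_day
print(f"{click_through_rate:.1%}")    # roughly 1.6% of clicks are ad click-throughs
```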
    • Clickstream – Case Demo • Hadoop – running Cloudera CDH4
    • Clickstream – Case Demo • Meet the Clickstream logs • Fields: when the click is recorded, router location, ad site clicked, MAC address
    • Clickstream – Case Demo • Meet the survey questions • Some sample survey questions
    • Clickstream – Case Demo • Meet the answers and survey results • Options for survey answers
    • Clickstream – Case Demo • Vectorize the data for users who click weibo.com • MAC Addr • Data Vectors
    • Clickstream – Case Demo • Training data set • Resultant vector: “cosine distance” value
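For readers unfamiliar with the measure, cosine distance compares the direction of two vectors and ignores their length. The Python sketch below shows the standard formula. It is a generic illustration, not the demo's actual code, and the function names are made up.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """0 means same direction, 1 means orthogonal."""
    return 1.0 - cosine_similarity(a, b)


print(cosine_distance([1, 2], [2, 4]))   # same direction, so close to 0
print(cosine_distance([1, 0], [0, 1]))   # orthogonal, so 1.0
```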
    • Clickstream – Case Demo • Test data set • Area under the curve (AUC) • A confusion matrix is a table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives.
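Both evaluation ideas named on this slide can be computed in a few lines. The Python sketch below counts the four confusion-matrix cells at a score threshold and computes AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The toy scores and labels are hypothetical, and the helper names are assumptions.

```python
def confusion_matrix(scores, labels, threshold=0.5):
    """Count true/false positives/negatives when predicting 'positive' above the threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, actual in zip(scores, labels):
        predicted = score > threshold
        if predicted and actual:
            counts["tp"] += 1
        elif predicted and not actual:
            counts["fp"] += 1
        elif not predicted and actual:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def auc(scores, labels):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    positives = [s for s, actual in zip(scores, labels) if actual]
    negatives = [s for s, actual in zip(scores, labels) if not actual]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [True, False, True, False]
print(confusion_matrix(toy_scores, toy_labels))
print(auc(toy_scores, toy_labels))
```

The pairwise AUC loop is quadratic and only suitable for small samples. Library implementations sort the scores instead.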
    • Clickstream – Case Demo • Test Results • > 0.5 is good • 2 out of 6 are predicted right
      macaddr | q16 | q17 | q18 | q19 | q20 | q21 | q22 | q23 | q24 | q25 | AUC value | Actually clicked
      00:22:5f:34:54:3e | 116 | 166 | 135 | 146 | 157 | 169 | 172 | 177 | 183 | 193 | 0.76 | Y
      00:1f:5b:b3:26:6d | 117 | 125 | 136 | 144 | 162 | 0 | 0 | 0 | 0 | 197 | 0.65 | N
      00:1a:73:e8:56:c6 | 117 | 122 | 137 | 152 | 159 | 169 | 172 | 177 | 190 | 195 | 0.65 | N
      00:18:de:1f:fe:c0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 193 | 0.61 | Y
      00:1e:65:51:34:80 | 0 | 0 | 137 | 141 | 157 | 0 | 0 | 0 | 0 | 210 | 0.59 | N
      00:17:c4:a9:16:6c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.53 | N
      00:1f:3b:06:87:3d | 118 | 131 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 201 | 0.41 | Y
      00:21:19:a4:8d:ea | 0 | 0 | 134 | 151 | 157 | 170 | 172 | 177 | 184 | 211 | 0.32 | N
      00:1e:65:7d:2d:d2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.29 | N
      00:16:44:c7:80:35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.24 | Y
      00:16:44:d4:11:9a | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.22 | Y
      00:13:02:a4:33:9c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2 | Y
      00:21:19:9a:64:ad | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.18 | N
      00:1f:df:75:0a:8e | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.16 | N
      00:25:d3:50:37:92 | 118 | 127 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | Y
      00:21:00:d6:98:2c | 118 | 123 | 0 | 0 | 0 | 169 | 172 | 176 | 187 | 192 | 0.11 | Y
      00:17:c4:9b:2c:e2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.11 | N
      00:0d:f0:6d:fc:47 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.11 | Y
      00:1e:65:3f:e1:6c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 177 | 188 | 0 | 0.1 | N
      00:21:00:e3:a5:f1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.08 | N
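The "2 out of 6" note on this slide can be reproduced from the score and "actually clicked" columns. The Python sketch below assumes the flattened table is read so that the first four rows pair the scores 0.76, 0.65, 0.65, 0.61 with Y, N, N, Y in that order. That reading is an interpretation of the transcript, and the variable names are illustrative.

```python
# (score, actually_clicked) pairs, in the order the rows appear on the slide.
rows = [
    (0.76, "Y"), (0.65, "N"), (0.65, "N"), (0.61, "Y"), (0.59, "N"),
    (0.53, "N"), (0.41, "Y"), (0.32, "N"), (0.29, "N"), (0.24, "Y"),
    (0.22, "Y"), (0.20, "Y"), (0.18, "N"), (0.16, "N"), (0.13, "Y"),
    (0.11, "Y"), (0.11, "N"), (0.11, "Y"), (0.10, "N"), (0.08, "N"),
]

THRESHOLD = 0.5  # "> 0.5 is good" on the slide

predicted_clickers = [row for row in rows if row[0] > THRESHOLD]
correct = [row for row in predicted_clickers if row[1] == "Y"]
print(len(correct), "out of", len(predicted_clickers))
```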
    • Clickstream – Case Demo • Meet the ETL process with Talend BD V 5.2
    • Clickstream – Case Demo • Meet some sample reports
    • Objective of this session • Introduction of Clickstream Data • Start thinking how to fully utilize Clickstream Data • Get started individually and as an organization - a sample demo • Solutions and available technologies
    • Thank you! Albert Hui, MBA, MASc., P.Eng, CSM • EPAM Canada, Associate Director • Email: albert_hui@epam.com • Follow me on Twitter: @dataeconomist • Please help fill in an evaluation form: www.ioug.org/eval • Session # 353