_____________________________________________




  Clickstream Data
  Warehouse – turning
  clicks into customers

  Albert Hui




About Me
            •   Associate Director with EPAM Canada
            •   Over 12 years with Business Intelligence/Data Warehousing
            •   Over 7 years with Java and web technologies
            •   BIDW Architect, Big Data Evangelist
            •   Conference Speaker at IOUG, TOUG Collaborate 2011, 2012 and 2013
            •   Technical editor on Oracle 12c Book.
            •   Master in Engineering in the area of Artificial Intelligence – Fuzzy logic
            •   MBA, University of Toronto
            •   Toronto based
            •   Twitter: @dataeconomist
            •   Father of twin boys


4/19/2013                                                                                    2
Agenda
            Objective of this Session
            What is Clickstream data?
            How to collect Clickstream data?
            Use Cases
            Challenges – what are we trying to solve?
            Solutions
            Live Demo
            How to Start?
            Concluding Thoughts
            Q&A
Some Leaders Who Chose EPAM.




Objective of this session



                          Introduction of Clickstream Data




                          Start thinking how to fully utilize Clickstream Data

                          Solutions and available technologies

                          Get started individually and as an organization - a sample demo
Movie – A Beautiful Mind




Sales – how to sell a lobster




                                www.bishopbigideas.com




Let’s have a quick quiz




Quick Quiz
            • In the US: a 45-year-old male with 3 children, an income of
              around 150-180K, and a postgraduate education wants to buy a
              car. Which brand?




Quick Quiz
            • In the US: a 45-year-old male with 3 children, a 180K income,
              and a graduate school education wants to buy a car. He
              lives in Texas. Which brand now?




Quick Quiz
            • In the US: a 45-year-old male with 3 children and a graduate
              school education wants to buy a car. He lives in Texas, is a
              single parent, has an <Unknown> income, and is looking to
              travel to Florida ONLY. Which brand then?




Quick Quiz
            • But would these preferences change over time (evolving
              behaviours)? How do we catch up?




@Miami
Parking lot




What is Clickstream Data?




      What is Clickstream?

Clickstream Data
                         • A Clickstream is the recording of the parts of the screen
                           a computer user clicks on while web browsing or using
                           another software application. As the user clicks
                           anywhere in the webpage or application, the action is
                           logged on a client or inside the web server, as well as
                           possibly the web browser, router, proxy server or ad
                           server. Clickstream analysis is useful for web activity
                           analysis, software testing, market research, and for
                           analyzing employee productivity.

                         Source: Wikipedia

Clickstream Data
                         •   Clickstream is not just weblogs.
                         •   It can be essentially every interaction that you have
                             with any electronic device.
                              –   TV PVRs.
                              –   Smart phones.
                              –   Game consoles.
                              –   Sensors: security systems, highways.
                              –   E-payment cards, loyalty cards.
                              –   Geolocation.
                              –   Maybe more:
                                   • Alarm clocks.
                                   • Printers.
                                   • Parking etc.

Clickstream Data
                         • Clickstream Data is not new.
                             – Clickstream Data Warehousing, by Mark Sweiger,
                               was published in January 2002.
                         • There are essentially two types of Clickstream data
                             – An individual site’s Clickstream (click path)
                             – Internet Clickstream Data
                         • Server weblogs account for 75% of daily data
                           generation according to Gartner.
                         • Facebook alone captures 1.5PB of weblog data daily.
                         • Amazon captures 200TB of weblog data daily.


Sample of Clickstream Data
                         • Web logs
                         204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 "http://metacrawler.com/crawler?general=dimensional+modeling" "Mozilla/4.5 [en] (Win98; I)"
                         204.243.130.5 - - [26/Feb/2001:15:34:53 -0600] "GET /logo1.gif HTTP/1.0" 200 1900 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"
                         204.243.130.5 - - [26/Feb/2001:15:35:26 -0600] "GET /articles.html HTTP/1.0" 200 7363 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"
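The sample entries follow the Combined Log Format (client host, timestamp, request line, status, response size, referrer, user agent). A minimal sketch of turning one such line into named fields is shown here; the field names and the `parse_log_line` helper are illustrative choices, not something the deck prescribes.

```python
import re
from datetime import datetime

# Pattern for one Combined Log Format entry; the group names are our own labels.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = ('204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 '
          '"http://metacrawler.com/crawler?general=dimensional+modeling" '
          '"Mozilla/4.5 [en] (Win98; I)"')
record = parse_log_line(sample)
print(record["host"], record["path"], record["status"], record["referrer"])
```

Once each line is a record like this, the referrer field shows where the visitor came from and the host plus timestamp let requests be ordered per visitor.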




Clickstream – Click-path Analytics
                         • A click path is the sequence of links a site visitor
                           follows.
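As a rough sketch of how a click path can be derived, the snippet below groups page requests by visitor and orders them by time. The visitor IDs, timestamps and single-letter page codes are made-up sample data, and `build_click_paths` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical, already-parsed page requests: (visitor_id, timestamp_in_seconds, page).
requests = [
    ("visitor-1", 10, "A"),
    ("visitor-2", 12, "A"),
    ("visitor-1", 25, "B"),
    ("visitor-1", 40, "A"),
    ("visitor-2", 50, "C"),
    ("visitor-1", 61, "C"),
]

def build_click_paths(rows):
    """Group requests by visitor and order them by time to get each click path."""
    pages_by_visitor = defaultdict(list)
    for visitor, _timestamp, page in sorted(rows, key=lambda r: (r[0], r[1])):
        pages_by_visitor[visitor].append(page)
    return {visitor: "".join(pages) for visitor, pages in pages_by_visitor.items()}

click_paths = build_click_paths(requests)
print(click_paths)  # {'visitor-1': 'ABAC', 'visitor-2': 'AC'}
```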
 What is Clickstream?




Let’s take another quick
quiz



Quiz 2: Which one is a more frustrated customer?


                               Customer A

 What is Clickstream?




                                                    Customer B




Quiz 2: Which one is a more frustrated customer?




 What is Clickstream?
                        What if I told you the
                        customer is a deal finder?



How is Clickstream Data
collected?




Clickstream – how to collect
            •   Web Logs
                 – No JavaScript tracking code is needed. The data is
                   collected by the web server independently of a visitor’s
                   browser. It captures all the requests made to your web
                   server, including pages, images and PDFs.




Clickstream – how to collect
             •   Page Tagging
                  – Google Analytics is implemented with "page tags". A
                     page tag, in this case called the Google Analytics Tracking
                     Code (GATC), is a snippet of JavaScript code that the
                     website owner adds to every page of the website.
                     The GATC runs in the client browser when the client
                     browses the page (if JavaScript is enabled in the browser),
                     collects visitor data and sends it to a Google data
                     collection server as part of a request for a web beacon.
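To make the beacon idea concrete, here is a small sketch of both halves of the exchange: packing visitor data into the query string of a beacon request, and reading it back on the collection side. Real page tags are JavaScript running in the browser; this is written in Python purely for illustration, and the collector URL and the parameter names (`page`, `ref`, `vid`) are hypothetical, not Google Analytics' actual ones.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical collector endpoint; not a real Google Analytics URL.
COLLECTOR = "https://collector.example.com/beacon.gif"

def build_beacon_url(page, referrer, visitor_id):
    """What a page tag does in the browser: pack visitor data into a beacon request."""
    params = {"page": page, "ref": referrer, "vid": visitor_id}
    return COLLECTOR + "?" + urlencode(params)

def read_beacon(url):
    """What the collection server does: recover the visitor data from the request."""
    query = parse_qs(urlsplit(url).query)
    return {name: values[0] for name, values in query.items()}

url = build_beacon_url("/articles.html", "http://www.clickstreamconsulting.com/", "v-123")
hit = read_beacon(url)
print(hit)
```

The design point is that the data travels in the request for a tiny image, so the collection server only needs to log or parse incoming requests.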




What about some
Use Cases for Clickstream?




Clickstream – Use Cases
                         • Internet Traffic Analytics is another type of
                           Clickstream data, e.g.
                             – Google Analytics
                             – Yandex
                             – Kontagent




Clickstream – Use case – Google Analytics
                         •   Google Analytics measures how your site is performing
                              – Competitor Analytics
                              – Social Mobile analytics
                              – Advertising Analytics




Clickstream – Use Case - Yandex
                         • Yandex is another big one based in Russia


 What is Clickstream?




Clickstream – Use Cases – make money
              Advertising on the Internet
              1. Banner Ads
              2. Paid Search
              3. Email Campaign


  Use cases




Clickstream – Use Cases – make money
             Personalized Advertising

              Minority Report-style
              shopping? The billboard
              that profiles you and
              then flashes up ads
              tailored to your tastes




Clickstream – Use Cases – medical field
            Medical Science – electronic clicks



Use cases




Clickstream – Use Cases - games
             •   Kontagent is the user analytics platform for
                 developers, marketers, product managers, and
                 strategic partners across the social and mobile
                 web. Its kSuite platform provides social data
                 pattern visualization and analysis that delivers
                 actionable insights via an on-demand service.
             •   San Francisco/Toronto based.
             •   It focuses on the gaming industry and records every
                 click of the gamers.
             •   It tries to make gaming sites stickier.
             •   Raised $50M+ US in the last 3 years.



Quiz #3




Clickstream – Quiz #3

                          1. What is the main focus of these
                             analytics?




                          2. What are they missing?




YOU

SIMILARITY
            BETWEEN all of
            YOU
Collective Intelligence

            Crowd Sourcing

What are we trying to solve?




Clickstream - Challenges
Challenges    •   Yes, you are right! We have too much data.




Yes, we have a lot of data
Clickstream - Challenges
             • User demographics data is
               hard to get, due to local
               privacy laws.
             • Users’ sense of privacy.
             • User preferences change
               constantly; there are no
               one-size-fits-all rules.


Clickstream – What are we trying to solve?




                      Rules inside the data
Clickstream - Challenges
Challenges
  Gender  Age     Marital status  Occupation  No. of kids  Income  Region   Race       Own a house  Car brand  Like sport  Like politics  Like business  ...  Click path         Buy
  M       25-35   M               Engineer    3            80-90K  Toronto  Caucasian  Y            BMW        -           N              Y              ...  ABACBCDE...        Y
  M       25-35   S               Chemist     1            50-60K  NY       Asian      Y            N/A        N           -              N              ...  AABEBFGHIGSJBA...  Y
  F       35-45   D               Chemist     0            50-60K  Toronto  Caucasian  N            TOYOTA     N           N              -              ...  ABAEBFGHIGFSBA...  N
  F       50-60K  M               Doctor      6            -       Minsk    Caucasian  Y            BMW        N           Y              Y              ...  ABAEBFGHIGFSBA...  Y
  F       35-45   D               Researcher  0            50-60K  Toronto  Caucasian  Y            N/A        N           Y              N              ...  ABAEBFGHIGFSBA...  N
  ...     ...     ...             ...         ...          ...     ...      ...        ...          ...        ...         ...            ...            ...  ...                ...
Clickstream – What are we trying to solve?




                 Prediction




Solutions here.




Clickstream – Solutions – Clickstream Data
               Warehouse
             Problems                     Solutions

             Too much Data
             Rules inside the data        Data Vectorization
             Prediction
             Architecture and Schema
Clickstream – Solutions – handling too much data
             •    Apache Hadoop
             •    Top-level Apache project
             •    Open source
             •    Software framework written in Java
             •    Inspired by Google’s white papers on
                   – Map/Reduce (MR)
                   – Google File System (GFS)
                   – Bigtable
             •    Originally developed to support Apache Nutch
             •    Designed for
                   – Large-scale data processing
                   – Batch processing
                   – Sophisticated analysis
                   – Dealing with structured and unstructured data
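The Map/Reduce model named in the list can be illustrated without a cluster. The sketch below simulates the three phases (map, shuffle, reduce) in plain Python to count hits per page; it is not Hadoop code, and the input records and function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical input records: (client_ip, requested_page).
log_records = [
    ("204.243.130.5", "/"),
    ("204.243.130.5", "/articles.html"),
    ("10.0.0.7", "/"),
    ("10.0.0.9", "/"),
]

def map_phase(record):
    """Map: emit a (key, value) pair for each input record."""
    _ip, page = record
    yield page, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

pairs = [pair for record in log_records for pair in map_phase(record)]
hits_per_page = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(hits_per_page)  # {'/': 3, '/articles.html': 1}
```

Because each map call looks at one record and each reduce call at one key, both phases can be spread over many machines, which is what makes the model suit very large weblogs.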



Clickstream – Solutions – Data Vectorization
             Clustering: Understanding data as vectors

             Mahout Vector Implementation
             1. DenseVector
             2. RandomAccessSparseVector
             3. SequentialAccessSparseVector
             The sparse implementations store only non-zero values in memory.

             Vectors must implement the Java interface java.io.Serializable;
             for Hadoop I/O they are wrapped in org.apache.mahout.math.VectorWritable.

             [Figure: the point X=5, Y=3, i.e. (5, 3), plotted on X and Y axes]

             • The vector denoted by point (5, 3) is simply
               Array([5, 3]) or HashMap([0 => 5], [1 => 3])
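The array-versus-map distinction can be sketched in a few lines. This is plain Python rather than the Mahout classes, and the helper names are illustrative; it only shows why a map of non-zero entries is attractive when most dimensions are empty.

```python
# The point (5, 3) as a dense vector: one slot per dimension.
dense = [5.0, 3.0]

# The same point as a sparse vector: only non-zero entries, keyed by index.
sparse = {0: 5.0, 1: 3.0}

def sparse_to_dense(vector, size):
    """Expand an index-to-value mapping into a full list of the given size."""
    return [vector.get(i, 0.0) for i in range(size)]

def sparse_dot(a, b):
    """Dot product that touches only indices present in both sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[i] for i, value in a.items() if i in b)

# A mostly-empty 1000-dimensional vector needs just two stored entries.
wide = {0: 5.0, 999: 3.0}
print(sparse_to_dense(sparse, 2) == dense)  # True
print(sparse_dot(sparse, wide))             # 25.0
```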
Clickstream – Solutions – Data as n-dimensional
             vectors

               Clustering: Understanding data as vectors

• Imagine one dimension for each feature of the user,
  product, geography, time etc.
• Each dimension is also called a feature; the value to be predicted is the label
• Support Vector Machine (SVM)

[Figure: three axes labelled age, occupation and income]
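A minimal sketch of turning one user profile into such an n-dimensional vector follows. The feature layout (age, income in thousands, then one slot per occupation) and the `encode_user` helper are assumptions made for this example; categorical values are one-hot encoded so that each occupation gets its own dimension.

```python
# Hypothetical feature layout: numeric features first, then one slot per occupation.
OCCUPATIONS = ["Engineer", "Chemist", "Doctor", "Researcher"]

def encode_user(age, income_thousands, occupation):
    """Turn one user profile into a numeric vector, one dimension per feature."""
    one_hot = [1.0 if occupation == name else 0.0 for name in OCCUPATIONS]
    return [float(age), float(income_thousands)] + one_hot

vector = encode_user(age=30, income_thousands=85, occupation="Engineer")
print(vector)  # [30.0, 85.0, 1.0, 0.0, 0.0, 0.0]
```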
Clickstream – Solutions – Predictive Algorithms
                Four major steps

                1. Collect and model the data
                2. Select/build a model
                3. Train/test the model
                4. Then predict what will happen
Clickstream – Solutions – Predictive Algorithms

                      • An Apache Software Foundation project to create
                        scalable machine learning libraries under the Apache
                        Software License
                      • http://mahout.apache.org
                      • Why Mahout?
                         – Many Open Source ML libraries either:
                              • Lack Community
                              • Lack Documentation and Examples
                              • Lack Scalability
                              • Lack the Apache License
                              • Or are research-oriented
                      • "Mahout" is a Hindi word for elephant driver
Clickstream – Solutions – Algorithms

                                            Algorithms and Applications

               Freq. Pattern Mining    Classification    Clustering    Recommenders

               Utilities (Lucene/Solr)    Apache Hadoop    Math (Vectors/Matrices/SVD)    Statistics/Probability

             See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms


Clickstream – Solutions – Algorithms – Mahout

                Command line launcher
                     bin/mahout list (This shows the list of algorithms)
                     Valid program names are:
                1.    canopy: : Canopy clustering
                2.    cleansvd: : Cleanup and verification of SVD output
                3.    clusterdump: : Dump cluster output to text
                4.    dirichlet: : Dirichlet Clustering
                5.    fkmeans: : Fuzzy K-means clustering
                6.    fpg: : Frequent Pattern Growth
                7.    itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
                8.    kmeans: : K-means clustering
                9.    lda: : Latent Dirchlet Allocation
                10.   ldatopics: : LDA Print Topics
                11.   lucene.vector: : Generate Vectors from a Lucene index
                12.   matrixmult: : Take the product of two matrices
                13.   meanshift: : Mean Shift clustering
                14.   recommenditembased: : Compute recommendations using item-based collaborative filtering
                 …..

Clickstream – Solutions – Algorithms – build a model

                • Learn a model from a manually labeled training dataset
                • Predict the class of an unseen object based on its features
                • E.g. use features of the user profile, product and click path
                  to predict users’ preferences.
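The learn-then-predict loop can be sketched with a deliberately simple stand-in classifier. The snippet below uses a nearest-centroid rule in plain Python; it is not a Mahout algorithm and not necessarily what the talk's demo used, and the feature vectors and labels are invented sample data.

```python
import math

# Hypothetical labeled examples: (feature vector, label), where the label says
# whether the user bought. The features are illustrative only.
training_set = [
    ([1.0, 0.0, 3.0], "buy"),
    ([1.0, 1.0, 2.0], "buy"),
    ([0.0, 1.0, 0.0], "no-buy"),
    ([0.0, 0.0, 1.0], "no-buy"),
]
test_set = [
    ([1.0, 0.0, 2.0], "buy"),
    ([0.0, 1.0, 1.0], "no-buy"),
]

def train(examples):
    """Learn one centroid (mean vector) per class from labeled examples."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}

def predict(model, features):
    """Predict the class whose centroid is closest to the unseen object."""
    return min(model, key=lambda label: math.dist(model[label], features))

model = train(training_set)
correct = sum(predict(model, features) == label for features, label in test_set)
print(f"accuracy on held-out test set: {correct}/{len(test_set)}")
```

Keeping a separate test set, as here, is what lets the quality of the model be judged on objects it has not seen during training.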
 Solutions




Clickstream – Solutions – Clickstream Data
             Warehouse
             Traditional Clickstream Data Warehouse Schema
             Common Dimensions:
                    1. Customer
                    2. Product
                    3. Time
                    4. Geography
                    5. Page
                    6. Content (meta-data)
                    7. User
             Facts:
                    1. Sales
                    2. User Activities
             Design:
             Schema design depends on the data we have and the measures we have
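As one possible shape for such a schema, the sketch below builds a tiny star schema in an in-memory SQLite database: three of the listed dimensions (customer, page, time) around a user-activity fact table. All table and column names are hypothetical, and a real design would follow the data and measures actually available.

```python
import sqlite3

# Hypothetical table and column names; a real schema depends on the data available.
SCHEMA = """
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, gender TEXT, age_band TEXT);
CREATE TABLE dim_page     (page_key INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, click_date TEXT, click_hour INTEGER);
CREATE TABLE fact_user_activity (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    page_key     INTEGER REFERENCES dim_page(page_key),
    time_key     INTEGER REFERENCES dim_time(time_key),
    clicks       INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO dim_customer VALUES (1, 'M', '25-35')")
conn.executemany("INSERT INTO dim_page VALUES (?, ?)", [(1, "/"), (2, "/articles.html")])
conn.execute("INSERT INTO dim_time VALUES (1, '2001-02-26', 15)")
conn.executemany(
    "INSERT INTO fact_user_activity VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 2), (1, 2, 1, 1)],
)

# A typical dimensional query: total clicks per page.
rows = conn.execute(
    "SELECT p.url, SUM(f.clicks) FROM fact_user_activity f "
    "JOIN dim_page p ON p.page_key = f.page_key GROUP BY p.url ORDER BY p.url"
).fetchall()
print(rows)  # [('/', 2), ('/articles.html', 1)]
```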

Clickstream – Solutions – Clickstream Data
             Warehouse




 Solutions




                Source: Clickstream Data warehouse By Mark Sweiger


Clickstream – Solutions – Clickstream Data
             Warehouse




 Solutions




                Source: Clickstream Data warehouse By Mark Sweiger



Clickstream – Solutions – Clickstream Data
             Warehouse




 Solutions




              Source: Clickstream Data warehouse by Albert H
Clickstream – Solutions – technology stack
Solutions


            Reports        BI TOOL                    Application        Web App

Reporting             RDBMS (Oracle, MySQL)           ZooKeeper          Hosting Models
                           ETL (INFA,                   Model
Data Movement
                            Talend)
Data-           APACHE HIVE,            STATISTICAL
warehouse                                                   Algorithms
                   HBASE                 MAHOUT

                         APACHE HADOOP                          Clickstream logs


What about a Case
study - demo?




Clickstream – Case Demo
            •   An Asia-based Wi-Fi hotspot provider with wireless routers
                throughout China/Hong Kong.
            •   Revenue model: advertising
                 – Advertisers place ads when users browse the Net.
            •   Data
                 – Survey data: users are required to fill in a survey before logging in.
                 – Click logs, including ad click-throughs
            •   Data size:
                 – 12GB+ compressed a day.
                 – 150M+ clicks and 2.4M click-throughs a day.
            •   Problem definition: the click-through rate is too low
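For a sense of scale, the daily figures on this slide imply a rate of roughly 1.6%. The calculation below assumes the rate is measured as click-throughs divided by all recorded clicks; the usual definition divides by ad impressions, which the slide does not give.

```python
clicks_per_day = 150_000_000        # "150M+ clicks" from the slide
click_throughs_per_day = 2_400_000  # "2.4M click through" from the slide

# Ratio of ad click-throughs to all recorded clicks (an assumed definition).
click_through_rate = click_throughs_per_day / clicks_per_day
print(f"{click_through_rate:.1%}")  # 1.6%
```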



Clickstream – Case Demo




               Hadoop – running Cloudera CDH4




     Demo
Clickstream – Case Demo
                   •     Meet the Clickstream logs

     Log fields: when the click is recorded, router location, ad site clicked, MAC address
Clickstream – Case Demo
              •   Meet the survey questions
 Demo




            Sample survey questions
Clickstream – Case Demo
            •   Meet the answers and survey results

            [Callout: options for survey answers]




Clickstream – Case Demo
             •   Vectorize the data for users who click weibo.com

             [Figure labels: MAC Addr, Data Vectors]
Clickstream – Case Demo
                               Training data set

             [Figure labels: Resultant Vector, "cosine value distance"]
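The "cosine value distance" label refers to comparing vectors by the angle between them. A minimal sketch of that measure follows, assuming the common convention that cosine distance is one minus cosine similarity; the two-dimensional sample vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance form: 0 for the same direction, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical answer vectors for two users.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # 1.0
print(cosine_distance([5.0, 3.0], [10.0, 6.0]))  # approximately 0
```

Because only direction matters, a user who answered every question with doubled values is treated as identical to the original, which is often what is wanted when comparing preference profiles.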




Clickstream – Case Demo
                                 Test data set

                Area under the curve (AUC)

                A confusion matrix is a table with two rows and two columns that
                reports the number of false positives, false negatives, true
                positives, and true negatives.
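Both evaluation measures can be computed directly from a list of true outcomes and model scores. The sketch below counts the four confusion-matrix cells at a 0.5 threshold and computes AUC as the fraction of (positive, negative) pairs that the scores rank correctly. The outcomes and scores are made-up sample values, not the demo's data.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = clicked)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 1 and p == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

def auc(actual, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical test set: true outcomes and model scores.
actual = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.7, 0.2]
predicted = [1 if s >= 0.5 else 0 for s in scores]

print(confusion_matrix(actual, predicted))  # {'tp': 2, 'fp': 1, 'fn': 0, 'tn': 2}
print(auc(actual, scores))
```

An AUC of 0.5 corresponds to random ranking, which is why values above 0.5 are read as better than chance.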

Clickstream – Case Demo
                              Test Results
    Demo
    AUC value > 0.5 is good; 2 out of 6 are predicted right.

    macaddr             q16    q17   q18   q19   q20   q21   q22   q23   q24    q25   AUC Value   Actually clicked
    00:22:5f:34:54:3e   116    166   135   146   157   169   172   177   183    193     0.76         Y
    00:1f:5b:b3:26:6d   117    125   136   144   162    0     0     0     0     197     0.65         N
    00:1a:73:e8:56:c6   117    122   137   152   159   169   172   177   190    195     0.65         N
    00:18:de:1f:fe:c0    0      0     0     0     0     0     0     0     0     193     0.61         Y
    00:1e:65:51:34:80    0      0    137   141   157    0     0     0     0     210     0.59         N
    00:17:c4:a9:16:6c    0      0     0     0     0     0     0     0     0      0      0.53         N
    00:1f:3b:06:87:3d   118    131    0     0     0     0     0     0     0     201     0.41         Y
    00:21:19:a4:8d:ea    0      0    134   151   157   170   172   177   184    211     0.32         N
    00:1e:65:7d:2d:d2    0      0     0     0     0     0     0     0     0      0      0.29         N
    00:16:44:c7:80:35    0      0     0     0     0     0     0     0     0      0      0.24         Y
    00:16:44:d4:11:9a    0      0     0     0     0     0     0     0     0      0      0.22         Y
    00:13:02:a4:33:9c    0      0     0     0     0     0     0     0     0      0       0.2         Y
    00:21:19:9a:64:ad    0      0     0     0     0     0     0     0     0      0      0.18         N
    00:1f:df:75:0a:8e    0      0     0     0     0     0     0     0     0      0      0.16         N
    00:25:d3:50:37:92   118    127    0     0     0     0     0     0     0      0      0.13         Y
    00:21:00:d6:98:2c   118    123    0     0     0    169   172   176   187    192     0.11         Y
    00:17:c4:9b:2c:e2    0      0     0     0     0     0     0     0     0      0      0.11         N
    00:0d:f0:6d:fc:47    0      0     0     0     0     0     0     0     0      0      0.11         Y
    00:1e:65:3f:e1:6c    0      0     0     0     0     0     0    177   188     0       0.1         N
    00:21:00:e3:a5:f1    0      0     0     0     0     0     0     0     0      0      0.08         N
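
The "2 out of 6" reading of the test results can be checked with a short sketch. It assumes a click is predicted whenever the score in the "AUC Value" column is above 0.5, and it builds the two-by-two table of true/false positives and negatives from the 20 rows shown. The names `ROWS` and `confusion_matrix` are illustrative only and are not part of the original demo.

```python
# (score, actually_clicked) pairs transcribed from the Test Results table.
# "score" is the column the slide labels "AUC Value".
ROWS = [
    (0.76, "Y"), (0.65, "N"), (0.65, "N"), (0.61, "Y"), (0.59, "N"),
    (0.53, "N"), (0.41, "Y"), (0.32, "N"), (0.29, "N"), (0.24, "Y"),
    (0.22, "Y"), (0.20, "Y"), (0.18, "N"), (0.16, "N"), (0.13, "Y"),
    (0.11, "Y"), (0.11, "N"), (0.11, "Y"), (0.10, "N"), (0.08, "N"),
]


def confusion_matrix(rows, threshold=0.5):
    """Count TP, FP, FN, TN, predicting a click when score > threshold.

    The 0.5 cut-off is an assumption taken from the "> 0.5 is good" note.
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, actual in rows:
        predicted = score > threshold
        clicked = actual == "Y"
        if predicted and clicked:
            counts["tp"] += 1
        elif predicted and not clicked:
            counts["fp"] += 1
        elif not predicted and clicked:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


if __name__ == "__main__":
    cm = confusion_matrix(ROWS)
    predicted_clicks = cm["tp"] + cm["fp"]
    print(f"{cm['tp']} out of {predicted_clicks} predicted clicks were right")
    print(cm)
```

Under that assumption, six users score above 0.5 and two of them actually clicked, which matches the note on the slide.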


Clickstream – Case Demo
            •   Meet the ETL process with Talend Big Data (BD) v5.2



  Demo




Clickstream – Case Demo
            •   Meet some sample reports



  Demo




Objective of this session



            •   Introduction of Clickstream Data
            •   Start thinking how to fully utilize Clickstream Data
            •   Solutions and available technologies
            •   Get started individually and as an organization - a sample demo
Thank you!

            Albert Hui, MBA, MASc., P.Eng, CSM
            EPAM Canada, Associate Director
            Email: albert_hui@epam.com
            Follow me on Twitter: @dataeconomist


             Please help by filling out an evaluation form

             www.ioug.org/eval

             Session # 353


