Web analytics with R

From niche bloggers to multinational corporations, everyone is interested in monitoring their web traffic and its patterns over time.

Google Analytics is the most widely used solution to keep track of this type of data. It provides a UI for a wide range of reports and possibilities for various types of visualizations.

Moreover, the availability of the Analytics API coupled with the corresponding R packages can now give more options for custom web analyses.

The plan for this talk is to cover the following:

• What is web analytics? How does it work?

• Interfacing with the Analytics Reporting API via an R package (RGA)

• Practical analytics applications with R

• Discussion

Published in: Data & Analytics

  1. 9/21/2015 Web Analytics with R file:///D:/Projects/R_project_temp/RGA/SimGAR.html 1/38
     Web Analytics with R
     DublinR, September 2015
     Alexandros Papageorgiou
     analyst@alex-papageo.com
  2. About me
     • Started @ Google Ireland
     • Career break
     • Web analyst @ WhatClinic.com
  3. About the talk
     • Intro
     • Analytics Live Demo
     • Practical Applications x 3
     • Discussion
  4. Part I: Intro
  5. Web analytics now and then…
  6. Getting started overview
     1. Get some web data for a start
     2. Get the right / accurate / relevant data ***
     3. Analyse the data
  7. Google Analytics API + R
     Why?
     • Freedom from the limits of the GA user interface
     • Automation, reproducibility, applications
     • Richer datasets: up to 7 dimensions and 10 metrics
     Large queries?
     • Handle queries of 10K - 1M records
     • Mitigate the effect of query sampling
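One common way to keep each query small, and so reduce the chance of sampling, is to split the reporting period into shorter windows and query them one at a time. The sketch below only builds the windows. `split_date_range` is a hypothetical helper (not part of RGA), and the commented loop assumes `get_ga()` is called once per window with arguments like those used later in the talk.

```r
# Hypothetical helper: split [start, end] into consecutive windows of `days` days.
split_date_range <- function(start, end, days = 7) {
  start <- as.Date(start)
  end <- as.Date(end)
  starts <- seq(start, end, by = days)   # a numeric `by` is taken as days
  ends <- pmin(starts + days - 1, end)   # clip the last window to `end`
  data.frame(start.date = starts, end.date = ends)
}

windows <- split_date_range("2015-03-01", "2015-03-31", days = 7)
print(windows)

# Assumed usage (needs an authorised RGA session, so it is left commented out):
# chunks <- lapply(seq_len(nrow(windows)), function(i) {
#   get_ga(profile.id = "12345678",
#          start.date = as.character(windows$start.date[i]),
#          end.date   = as.character(windows$end.date[i]),
#          metrics = "ga:sessions", dimensions = "ga:date")
# })
# all_data <- do.call(rbind, chunks)
```

Shorter windows mean more requests, so the window length is a trade-off between sampling risk and API quota.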
  8. The package: RGA

         install.packages("RGA")

     • Author: Artem Klevtsov
     • Access to multiple GA APIs
     • Shiny app to explore dimensions and metrics
     • Actively developed + good documentation
  9. Part II: Demo
  10. Part III: Applications
  11. Practical applications
      Ecommerce website (simulated data)
      • Advertising campaign effectiveness (Key Ratios)
      • Adgroup performance (Clustering)
      • Key factors leading to conversion (Decision Tree)
  12. Libraries

          library(RGA)
          library(dplyr)
          library(tidyr)
          library(ggplot2)
  13. 1. Key Performance Ratios
      • Commonly used in business and finance analysis
      • Good for data exploration in context
  14. Key Ratios: Getting the data

          by_medium <- get_ga(profile.id = 106368203,
                              start.date = "2014-11-01",   # must precede end.date; the output starts on 2014-11-01
                              end.date = "2015-08-21",
                              metrics = "ga:transactions, ga:sessions",
                              dimensions = "ga:date, ga:medium",
                              sort = NULL,
                              filters = NULL,
                              segment = NULL,
                              sampling.level = NULL,
                              start.index = NULL,
                              max.results = NULL)
  15. Sessions and Transactions by medium

          head(by_medium)
          ##         date   medium transactions sessions
          ## 1 2014-11-01   (none)            0       57
          ## 2 2014-11-01   search            0       10
          ## 3 2014-11-01  display            3      422
          ## 4 2014-11-01  organic            0       30
          ## 5 2014-11-01 referral            1       40
          ## 6 2014-11-02   (none)            0       63
  16. Calculating the ratios
      ConversionQualityIndex = (% of transactions per medium) / (% of sessions per medium)

          by_medium_ratios <- by_medium %>%
              group_by(date) %>%   # sum sessions & transactions by date
              mutate(tot.sess = sum(sessions), tot.trans = sum(transactions)) %>%
              mutate(pct.sessions = 100*sessions/tot.sess,       # calculate % sessions by medium
                     pct.trans = 100*transactions/tot.trans,     # calculate % transactions by medium
                     conv.rate = 100*transactions/sessions) %>%  # conversion rate by medium
              mutate(ConvQualityIndex = pct.trans/pct.sessions) %>%   # conv quality index
              filter(medium %in% c("search", "display", "referral"))  # the top 3 channels
  17. Ratios table

          columns <- c(1, 2, 7:10)
          head(by_medium_ratios[columns])  # display selected columns
          ## Source: local data frame [6 x 6]
          ## Groups: date
          ##
          ##         date   medium pct.sessions pct.trans conv.rate ConvQualityIndex
          ## 1 2014-11-01   search    1.7889088   0.00000 0.0000000        0.0000000
          ## 2 2014-11-01  display   75.4919499  75.00000 0.7109005        0.9934834
          ## 3 2014-11-01 referral    7.1556351  25.00000 2.5000000        3.4937500
          ## 4 2014-11-02   search    0.5995204   0.00000 0.0000000        0.0000000
          ## 5 2014-11-02  display   79.1366906  66.66667 0.3030303        0.8424242
          ## 6 2014-11-02 referral    9.5923261  33.33333 1.2500000        3.4750000
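The first rows of the ratios table can be checked by hand. The sketch below recomputes the index for 2014-11-01 in base R, using the sessions and transactions printed by `head(by_medium)`. The data frame name `day1` is only illustrative.

```r
# Sessions and transactions for 2014-11-01, copied from the head(by_medium) output.
day1 <- data.frame(
  medium       = c("(none)", "search", "display", "organic", "referral"),
  transactions = c(0, 0, 3, 0, 1),
  sessions     = c(57, 10, 422, 30, 40)
)

# Share of the day's sessions and transactions captured by each medium (in %).
day1$pct.sessions <- 100 * day1$sessions / sum(day1$sessions)
day1$pct.trans    <- 100 * day1$transactions / sum(day1$transactions)

# Index > 1: the medium converts better than its share of traffic suggests.
day1$ConvQualityIndex <- day1$pct.trans / day1$pct.sessions

print(day1[day1$medium %in% c("search", "display", "referral"), ])
```

Display brings in about 75% of both sessions and transactions, so its index sits just under 1, while referral turns roughly 7% of sessions into 25% of transactions.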
  18. Sessions % by medium

          ggplot(by_medium_ratios, aes(date, pct.sessions, color = medium)) +
              geom_point() + geom_jitter() + geom_smooth() + ylim(0, 100)
  19. Transactions % by medium

          ggplot(by_medium_ratios, aes(date, pct.trans, color = medium)) +
              geom_point() + geom_jitter() + geom_smooth()
  20. Conversion Quality Index by medium

          ggplot(by_medium_ratios, aes(date, ConvQualityIndex, color = medium)) +
              geom_point(aes(size = tot.trans)) + geom_jitter() + geom_smooth() + ylim(0, 5) +
              geom_hline(yintercept = 1, linetype = "dashed", size = 1, color = "white")
  21. 2. Clustering for Ad groups
      • Unsupervised learning
      • Discovers structure in data
      • Based on a similarity criterion
  22. Ad Group Clustering: Getting the Data

          profile.id = "12345678"
          start.date = "2015-01-01"
          end.date = "2015-03-31"
          metrics = "ga:sessions, ga:transactions,
                     ga:adCost, ga:transactionRevenue,
                     ga:pageviewsPerSession"
          dimensions = "ga:adGroup"

          adgroup_data <- get_ga(profile.id = profile.id,
                                 start.date = start.date,
                                 end.date = end.date,
                                 metrics = metrics,
                                 dimensions = dimensions)
  23. Hierarchical Clustering

          top_adgroups <- adgroup_data %>%
              filter(transactions > 10) %>%    # keeping only where transactions > 10
              filter(ad.group != "(not set)")

          n <- nrow(top_adgroups)
          rownames(top_adgroups) <- paste("adG", 1:n)       # short codes for adgroups
          top_adgroups <- select(top_adgroups, -ad.group)   # remove long adgroup names

          scaled_adgroups <- scale(top_adgroups)   # scale the values
  24. Matrix: Scaled adgroup values.

          ##         sessions transactions    ad.cost transaction.revenue
          ## adG 1  0.3790902   2.72602456 -0.7040545           3.7397620
          ## adG 2 -0.6137714   0.05134068 -0.7111664           0.2086295
          ## adG 3  0.3207199   0.30473179  0.3411098           0.5346303
          ## adG 4  0.9956617   0.78335943  0.9139105           1.1769897
          ## adG 5 -0.2261473  -0.65252350 -0.1688845          -0.6691330
          ## adG 6 -0.6863092  -0.59621436 -0.5979007          -0.4800614
          ##       pageviews.per.session
          ## adG 1            1.93262389
          ## adG 2            1.05619885
          ## adG 3           -0.74163568
          ## adG 4           -0.32186999
          ## adG 5           -1.01079490
          ## adG 6            0.01538598
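`scale()` centres each column on its mean and divides by its standard deviation, so metrics in different units (sessions, cost, revenue) weigh comparably in the distance calculation. A minimal sketch on invented numbers (the `toy` values are illustrative and not from the talk):

```r
# Illustrative ad group metrics (invented values, not from the slides).
toy <- data.frame(
  sessions     = c(1200, 300, 1100, 1900, 700),
  transactions = c(45, 14, 18, 25, 11),
  ad.cost      = c(150, 140, 900, 1400, 520)
)

scaled_toy <- scale(toy)   # column-wise z-scores: (x - mean) / sd

# The same result computed by hand for one column.
manual <- (toy$sessions - mean(toy$sessions)) / sd(toy$sessions)

print(round(scaled_toy, 3))
print(all.equal(as.numeric(scaled_toy[, "sessions"]), manual))
```

Without this step the column with the largest raw values would dominate the Euclidean distances used for clustering.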
  25. Dendrogram

          hc <- hclust(dist(scaled_adgroups))
          plot(hc, hang = -1)
          rect.hclust(hc, k = 3, border = "red")
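The rectangles drawn by `rect.hclust()` are only visual. To get the cluster label of each ad group as data, `cutree()` can be applied to the same tree. The sketch below uses just the six rows printed on the scaled-values slide, so the grouping it produces is illustrative and need not match the clusters found on the full data set.

```r
# First six rows of the scaled ad group matrix, copied from the slide above.
scaled_head <- matrix(
  c( 0.3790902,  2.72602456, -0.7040545,  3.7397620,  1.93262389,
    -0.6137714,  0.05134068, -0.7111664,  0.2086295,  1.05619885,
     0.3207199,  0.30473179,  0.3411098,  0.5346303, -0.74163568,
     0.9956617,  0.78335943,  0.9139105,  1.1769897, -0.32186999,
    -0.2261473, -0.65252350, -0.1688845, -0.6691330, -1.01079490,
    -0.6863092, -0.59621436, -0.5979007, -0.4800614,  0.01538598),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste("adG", 1:6),
                  c("sessions", "transactions", "ad.cost",
                    "transaction.revenue", "pageviews.per.session"))
)

hc_head  <- hclust(dist(scaled_head))   # complete linkage on Euclidean distances
clusters <- cutree(hc_head, k = 3)      # same k as rect.hclust() in the slide

print(clusters)
print(table(clusters))
```

The resulting vector can be joined back to the ad group names to profile each cluster.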
  26. Heatmap.2

          library(gplots); library(RColorBrewer)
          my_palette <- colorRampPalette(c('white', 'yellow', 'green'))(256)
          heatmap.2(scaled_adgroups,
                    cexRow = 0.7,
                    cexCol = 0.7,
                    col = my_palette,
                    rowsep = c(1, 5, 10, 14),
                    lwid = c(lcm(8), lcm(8)),
                    srtCol = 45,
                    adjCol = c(1, 1),
                    colsep = c(1, 2, 3, 4),
                    sepcolor = "white",
                    sepwidth = c(0.01, 0.01),
                    scale = "none",
                    dendrogram = "row",
                    offsetRow = 0,
                    offsetCol = 0,
                    trace = "none")
  27. Clusters based on 5 key values
  28. 3. Decision Trees
      • Handle categorical + numerical variables
      • Mimic the human decision-making process
      • Greedy approach
  29. 3. Pushing the API

          profile.id = "12345678"
          start.date = "2015-03-01"
          end.date = "2015-03-31"
          dimensions = "ga:dateHour, ga:minute, ga:sourceMedium, ga:operatingSystem,
                        ga:subContinent, ga:pageDepth, ga:daysSinceLastSession"
          metrics = "ga:sessions, ga:percentNewSessions, ga:transactions,
                     ga:transactionRevenue, ga:bounceRate, ga:avgSessionDuration,
                     ga:pageviewsPerSession, ga:bounces, ga:hits"

          ga_data <- get_ga(profile.id = profile.id,
                            start.date = start.date,
                            end.date = end.date,
                            metrics = metrics,
                            dimensions = dimensions)
  30. The Data

          ## Source: local data frame [6 x 16]
          ##
          ##     dateHour minute       sourceMedium operatingSystem    subContinent
          ## 1 2015030100     00 facebook / display         Windows Southern Europe
          ## 2 2015030100     01 facebook / display       Macintosh Southern Europe
          ## 3 2015030100     01       google / cpc         Windows Northern Europe
          ## 4 2015030100     01       google / cpc             iOS Southern Europe
          ## 5 2015030100     02 facebook / display       Macintosh Southern Europe
          ## 6 2015030100     02 facebook / display         Windows  Western Europe
          ## Variables not shown: pageDepth (chr), daysSinceLastSession (chr), sessions
          ##   (dbl), percentNewSessions (dbl), transactions (dbl), transactionRevenue
          ##   (dbl), bounceRate (dbl), avgSessionDuration (dbl), pageviewsPerSession
          ##   (dbl), hits (dbl), Visitor (chr)
  31. Imbalanced class
      Approach: page depth > 5 set as a proxy for conversion
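A minimal sketch of how such a proxy target could be derived, assuming `pageDepth` arrives as a character column as in the data preview. The sample values and the labels "shallow" / "deep" are invented for illustration.

```r
# Illustrative sessions; pageDepth is character, as in the data preview.
sessions <- data.frame(pageDepth = c("1", "2", "7", "3", "12", "1", "6", "4"),
                       stringsAsFactors = FALSE)

# Binary target: "deep" when the session went beyond 5 pages, else "shallow".
sessions$deep <- factor(as.integer(sessions$pageDepth) > 5,
                        levels = c(FALSE, TRUE),
                        labels = c("shallow", "deep"))

print(table(sessions$deep))
print(round(prop.table(table(sessions$deep)), 2))
```

Checking the class proportions this way shows whether the proxy is less imbalanced than the raw transaction flag would be.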
  32. Data preparation
      • Session data made "almost" granular
      • Removed invalid sessions
      • Extra dimension added (user type)
      • Removed highly correlated vars
      • Data split into train and test
      • Day of the week extracted from date
      • Days since last session placed in buckets
      • Date converted to weekday or weekend
      • Datehour split in two component variables
      • Geography split between top sub-continents and Other
      • Hour converted to AM or PM
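Several of the preparation steps above can be sketched in base R. The rows, bucket break points, and list of "top" sub-continents below are assumptions made for illustration; the talk does not state the actual choices.

```r
# Illustrative rows in the shape returned by the API query (values invented).
raw <- data.frame(
  dateHour             = c("2015030100", "2015030614", "2015031109", "2015032823"),
  daysSinceLastSession = c("0", "3", "12", "45"),
  subContinent         = c("Southern Europe", "Northern Europe",
                           "Western Europe", "Eastern Asia"),
  stringsAsFactors = FALSE
)

# dateHour (YYYYMMDDHH) split into its two components.
raw$date <- as.Date(substr(raw$dateHour, 1, 8), format = "%Y%m%d")
raw$hour <- as.integer(substr(raw$dateHour, 9, 10))

# Weekday vs weekend; POSIXlt wday runs from 0 (Sunday) to 6 (Saturday).
wday <- as.POSIXlt(raw$date)$wday
raw$dayType <- ifelse(wday %in% c(0, 6), "weekend", "weekday")

# Hour collapsed to AM / PM.
raw$amPm <- ifelse(raw$hour < 12, "AM", "PM")

# Days since last session placed in buckets (break points are an assumption).
raw$recency <- cut(as.integer(raw$daysSinceLastSession),
                   breaks = c(-Inf, 0, 7, 30, Inf),
                   labels = c("0", "1-7", "8-30", "31+"))

# Top sub-continents kept, everything else grouped as "Other" (list is an assumption).
top_regions <- c("Southern Europe", "Northern Europe", "Western Europe")
raw$region <- ifelse(raw$subContinent %in% top_regions, raw$subContinent, "Other")

print(raw[, c("date", "hour", "dayType", "amPm", "recency", "region")])
```

Collapsing high-cardinality fields into a few levels keeps the later tree small enough to read.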
  33. Decision Tree with rpart

          library(rpart)
          fit <- rpart(pageDepth ~ ., data = Train,   # pageDepth is a binary variable
                       method = 'class',
                       control = rpart.control(minsplit = 10, cp = 0.001, xval = 10))
          # printcp(fit)
          fit <- prune(fit, cp = 1.7083e-03)   # prune the tree based on chosen param value
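The slide hard-codes the pruning value after inspecting `printcp(fit)`. One common rule for choosing it programmatically is to take the complexity parameter with the lowest cross-validated error. The sketch below shows that rule on the `kyphosis` data set that ships with rpart, which stands in for the talk's `Train` data; it is not necessarily how the value on the slide was chosen.

```r
library(rpart)

set.seed(42)   # cross-validation folds are random
# kyphosis ships with rpart; it stands in for the talk's Train data.
fit_demo <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                  control = rpart.control(minsplit = 10, cp = 0.001, xval = 10))

cp_table <- fit_demo$cptable                  # the table that printcp() displays
best_row <- which.min(cp_table[, "xerror"])   # lowest cross-validated error
best_cp  <- cp_table[best_row, "CP"]

pruned <- prune(fit_demo, cp = best_cp)

print(cp_table)
print(best_cp)
```

A more conservative alternative is the one-standard-error rule, which picks the simplest tree whose `xerror` is within one `xstd` of the minimum.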
  34. The Tree
  35. VarImp … Possible Actions?

          dotchart(fit$variable.importance)
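Raw importance scores from rpart are hard to compare across models, so it can help to rescale them to shares of the total before plotting. A minimal sketch, again using rpart's bundled `kyphosis` data as a stand-in for the talk's model:

```r
library(rpart)

# kyphosis ships with rpart and stands in for the talk's training data.
fit_demo <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")

imp <- sort(fit_demo$variable.importance, decreasing = TRUE)   # named numeric vector
imp_pct <- round(100 * imp / sum(imp), 1)                      # rescaled to percentages

print(imp_pct)
```

Passing `rev(imp_pct)` to `dotchart()` draws the same chart as on the slide, with the most important variable at the top.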
  36. Takeaways
      • Web analytics not just for marketers!
      • But neither a magic bullet… (misses the wealth of atomic level data)
      • Solutions?
      • What's coming next?
  37. Thank you!