Enhancing statistics: Google Analytics and Visualization APIs

A look at how the Google Analytics and Visualization APIs have opened up new possibilities in generating statistical information for digital repositories.




Usage Rights

© All Rights Reserved

  • Additional information is contained within the presenter notes - please use the download link above to retrieve the original PowerPoint file.
  • Many packages have been provided with or for institutional repositories to show usage information.
    This slide shows just a handful of them: the statistics included with DSpace (prior to 1.6), ePrintsStats (developed as an add-on to EPrints), and IRStats (developed for EPrints, but also tested with DSpace). Other notable packages include the one developed by the University of Minho for DSpace.
    The packages highlighted here all use log analysis (either of what is essentially a debug log, or of the web server usage logs).
  • DSpace 1.6 introduced a new method of logging and reporting on usage, which uses the internal event mechanism of DSpace to send event information to a SOLR instance.
    The introduction of this statistics package in 1.6, and the interesting new information that it reports, initiated and informed the project to extend the way Google Analytics could be used and integrated into a repository. This work was conducted against a DSpace repository, but the techniques can be applied to any other repository technology (even ones that don’t present a web-based user interface).
  • Google Analytics provides a number of advantages to repository administrators. It’s easy to set up, involving no additional server-side application or scheduled task. It has a number of interesting tools, allowing you to segment the data in various ways, and providing alerting and scheduled email reports.
    Being Javascript based, it can also track accesses to pages that are served entirely out of inline caches, where the server does not receive even a HEAD request.
    Importantly, it is both highly scalable (it can track sites with millions of page hits per day, without you having to worry about how you manage that data) and completely free to use.
  • However, it is not without its negatives. Because it requires Javascript, it doesn’t track accesses from devices that don’t support it, which is potentially an increasing problem with mobile devices. It also requires a separate login to reach the report interface.
    The two most significant issues, though, are these. First, it reports on URLs, and has no notion of how some of those URLs may relate to the structure of a repository (i.e. that a number of items may be grouped into a collection). Second, it only reports on HTML pageviews; it doesn’t, as standard, track file accesses (downloading PDFs or documents, playing movies, etc.).
  • As mentioned among the negatives of Google Analytics, file downloads are not tracked as standard. In this example, we can see the usual pageTracker Javascript that is included in the footer of the page. This causes the currently accessed URL (as it appears in the browser bar) to be logged in Google Analytics.
    The link to the PDF is shown in the box through the centre of the page. Clicking on the link will result in the PDF being downloaded, but this will not be logged in Google Analytics.
  • The simplest (and for a long time, only) solution to this is to add an onclick Javascript event to the link to the PDF, which calls the pageTracker object that is created in the footer, explicitly passing the URL of the PDF as a parameter. Clicking on this link causes the Javascript event to be executed, and the URL that was passed as a parameter to be logged as an access.
    This is not an ideal solution though. You need to add the onclick event to EVERY download link, significantly increasing the HTML size, and even then it won’t track accesses that have come directly from outside (such as where Google has indexed a PDF and it appears in web search results).
  • As long as we rely on Javascript / HTML page integration, download links, and particularly those from outside our domains, will remain a problem. However, with the increasing need to track mobile devices that did not have Javascript enabled, Google released a mechanism to track accesses using code that sits on the server side. It was released as PHP, Perl, JSP and ASPX scripts, but it is easy to convert into Java (or any other language) to work with whatever platform you are using, and to call at any level.
    Below is some Java code that demonstrates its use. It is an excerpt from a tracker class: gaProfile, request and log are fields, and getRandomNumber, getIP, getVisitorId and isEmpty are helpers that are not shown.
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.UnsupportedEncodingException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLConnection;
    import java.net.URLEncoder;
    import java.util.HashMap;
    import java.util.Map;
    import javax.servlet.http.Cookie;
    import javax.servlet.http.HttpServletRequest;

    public void trackPageView(String pagePath) {
        try {
            // utmp carries the page path; utmdt (the page title) is not set here
            Map<String, String> params = new HashMap<String, String>();
            params.put("utmp", URLEncoder.encode(pagePath, "UTF-8"));
            dispatch(params);
        } catch (UnsupportedEncodingException uee) {
            // UTF-8 is always supported, so this cannot happen
        }
    }

    private void dispatch(Map<String, String> params) {
        try {
            String userAgent = null;
            StringBuilder gaUrlBuffer = new StringBuilder();
            gaUrlBuffer.append("http://www.google-analytics.com/__utm.gif?");
            gaUrlBuffer.append("utmwv=4.4sj");
            gaUrlBuffer.append("&utmac=").append(gaProfile);
            gaUrlBuffer.append("&utmn=").append(getRandomNumber());
            gaUrlBuffer.append("&utmcn=1");
            gaUrlBuffer.append("&utmhn=").append(URLEncoder.encode(request.getServerName(), "UTF-8")); // host name
            gaUrlBuffer.append("&utmip=").append(getIP(request.getRemoteAddr())); // zero last part

            if (request instanceof HttpServletRequest) {
                StringBuilder utmcc = new StringBuilder();
                HttpServletRequest hReq = (HttpServletRequest) request;

                // utmcc: forward any existing Google Analytics cookies
                if (hReq.getCookies() != null) {
                    for (Cookie cookie : hReq.getCookies()) {
                        if (cookie.getName().contains("__utm")) {
                            if (utmcc.length() > 0) {
                                utmcc.append("%20");
                            }
                            utmcc.append(cookie.getName()).append("%3D").append(URLEncoder.encode(cookie.getValue(), "UTF-8"));
                        }
                    }
                }

                if (utmcc.length() > 0) {
                    gaUrlBuffer.append("&utmcc=").append(utmcc.toString());
                } else {
                    gaUrlBuffer.append("&utmcc=__utma%3D999.999.999.999.999.1%3B");
                    gaUrlBuffer.append("&utmvid=").append(getVisitorId(hReq.getHeader("X-DCMGUID"), gaProfile, hReq.getHeader("User-Agent"), hReq.getRemoteAddr()));
                }

                if (hReq.getHeader("Referer") != null) {
                    gaUrlBuffer.append("&utmr=").append(URLEncoder.encode(hReq.getHeader("Referer"), "UTF-8"));
                }

                userAgent = hReq.getHeader("User-Agent");
            }

            // append the caller's parameters (values are expected to be encoded already)
            for (String name : params.keySet()) {
                gaUrlBuffer.append("&").append(name).append("=").append(params.get(name));
            }

            String gaUrlString = gaUrlBuffer.toString();
            URLConnection gaUrl = null;
            byte[] buf = new byte[8192];
            try {
                gaUrl = new URL(gaUrlString).openConnection();
                if (!isEmpty(userAgent)) {
                    gaUrl.setRequestProperty("User-Agent", userAgent);
                }
                gaUrl.connect();

                // read and discard the response so that the connection can be reused
                InputStream is = ((HttpURLConnection) gaUrl).getInputStream();
                if (is != null) {
                    int ret = 0;
                    do {
                        ret = is.read(buf);
                    } while (ret > 0);
                    is.close();
                }
            } catch (IOException ioe) {
                log.error("IO Error dispatching GA request", ioe);
                if (gaUrl != null && gaUrl instanceof HttpURLConnection) {
                    InputStream es = ((HttpURLConnection) gaUrl).getErrorStream();
                    if (es != null) {
                        int ret = 0;
                        do {
                            ret = es.read(buf);
                        } while (ret > 0);
                        es.close();
                    }
                }
            }
        } catch (Exception e) {
            log.error("Error dispatching GA request", e);
        }
    }
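The dispatch code above calls a getIP helper that the notes do not show; its inline comment says only that it zeroes the last part of the address. The following is a minimal sketch of what such a helper might look like for dotted IPv4 addresses. The class name IpMasker and the handling of anything that is not a plain IPv4 address are assumptions for illustration, not the author's implementation.

```java
/** Hypothetical stand-in for the getIP helper referenced in the notes. */
final class IpMasker {
    private IpMasker() {}

    /**
     * Replaces the final octet of a dotted IPv4 address with 0.
     * Input that is not four numeric parts yields an empty string
     * (an assumption: the original's behaviour for such input is not documented).
     */
    static String getIP(String remoteAddress) {
        if (remoteAddress == null) {
            return "";
        }
        String[] parts = remoteAddress.trim().split("\\.");
        if (parts.length != 4) {
            return "";
        }
        for (String part : parts) {
            if (part.isEmpty() || !part.chars().allMatch(Character::isDigit)) {
                return "";
            }
        }
        return parts[0] + "." + parts[1] + "." + parts[2] + ".0";
    }

    public static void main(String[] args) {
        System.out.println(getIP("192.168.10.57"));
    }
}
```

Masking the last octet before the address leaves the server means the full client address is never sent in the tracking request, while the remaining three octets are still available to Google Analytics.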
  • With the possibility of using server-side code, we can log requests for downloads, even if they originated from outside domains.
    In the case of our DSpace-based implementation, we create a servlet filter that can be configured in the web.xml to process requests to the BitstreamServlet. This filter logs the request with Google Analytics, before allowing the BitstreamServlet to deliver the file.
    Note that when using server-side processing, it is important to look at the request and determine whether it is from a robot before logging it. By examining a number of requests, it was found that applying a few simple rules (checking whether the user agent contains specific words, or an email address or URL) spots the vast majority of robots with good accuracy.
    Below is an example of the Java code used to log an access.
    private void logAccessToGA(ServletRequest request, ServletResponse response) {
        if (StringUtils.isEmpty(gaProfile)) {
            return;
        }
        if (request instanceof HttpServletRequest) {
            HttpServletRequest hrq = (HttpServletRequest) request;
            String userAgent = hrq.getHeader("User-Agent");
            String serverName = hrq.getServerName();
            if (serverName != null) {
                serverName = serverName.toLowerCase();
                // only log requests that do not look like robots
                if (RequestUtils.isRealUser(userAgent)) {
                    GATracker gaTracker = GATracker.getInstance();
                    gaTracker.start(gaProfile, request, response);
                    gaTracker.trackPageView(hrq.getRequestURI());
                }
            }
        }
    }
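The RequestUtils.isRealUser check used above is not included in the notes. As a rough sketch of the rules described (specific words in the user agent, or an email address or URL), something like the following could work. The word list is taken from the "Identifying robots" slide; the class name, the two regular expressions and the treatment of a missing user agent are assumptions.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical sketch of the RequestUtils.isRealUser check named in the notes. */
final class RobotDetector {
    // Substrings listed on the "Identifying robots" slide.
    private static final String[] ROBOT_WORDS = {
        "bot", "crawl", "fetch", "ndex", "nutch", "spider",
        "teoma", "bing", "msnbot", "slurp"
    };

    // Assumed patterns for "contains email address or url".
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");
    private static final Pattern URL = Pattern.compile("https?://|www\\.");

    private RobotDetector() {}

    /** Returns true when the user agent does not match any of the robot rules. */
    static boolean isRealUser(String userAgent) {
        if (userAgent == null || userAgent.trim().isEmpty()) {
            return false; // assumption: a missing user agent is treated as a robot
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String word : ROBOT_WORDS) {
            if (ua.contains(word)) {
                return false;
            }
        }
        return !EMAIL.matcher(ua).find() && !URL.matcher(ua).find();
    }

    public static void main(String[] args) {
        System.out.println(isRealUser("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"));
    }
}
```

Simple substring rules like these will occasionally misclassify a browser whose user agent happens to contain one of the words, so it is worth reviewing the rules against real request logs.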
  • As an aside from the discussion of known issues, we have a powerful means of tracking additional information in Google Analytics, through something they call events. These were created for tracking activity inside things like Flash-based movie players, and are designed to be called through the Javascript tracker. However, you can encode a request to track information from the server side too.
    Events are not URLs; instead they consist of a category (an overall grouping) and an action (like pressing play in a movie player). You can give the event a label (e.g. the movie name) and a value.
    In the screenshot, we have a real-life example of using event tracking, where it is called from a servlet filter that intercepts every request to the application. The logging uses the category Traffic, and splits the action into either User or Robot. The label is the URL being requested, and the value represents the number of bytes that have been served in that request.
    Or, to put it simply, this example shows outbound bandwidth usage being tracked, broken down both by whether it is user or robot initiated, and by the URL being requested. Google Analytics totals the event values (bytes) for us over the period being reported on.
    Below is some Java code demonstrating how to encode events to be passed to Google Analytics.
    public void trackEvent(String category, String action, String label, Long value) {
        try {
            // utme format: 5(category*action[*label])[(value)]
            StringBuilder eventString = new StringBuilder();
            eventString.append("5(").append(URLEncoder.encode(category, "UTF-8"));
            eventString.append("*").append(URLEncoder.encode(action, "UTF-8"));
            if (!isEmpty(label)) {
                eventString.append("*").append(URLEncoder.encode(label, "UTF-8"));
            }
            eventString.append(")");
            if (value != null) {
                eventString.append("(").append(value).append(")");
            }
            Map<String, String> params = new HashMap<String, String>();
            params.put("utmt", "event");
            params.put("utme", eventString.toString());
            dispatch(params);
        } catch (UnsupportedEncodingException uee) {
            // UTF-8 is always supported, so this cannot happen
        }
    }
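To make the event encoding concrete, here is a self-contained sketch that builds the same utme string as trackEvent, but returns it instead of dispatching it. The class name EventEncoder and the sample category, action, label and value are illustrative assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Standalone sketch of the event string built by trackEvent in the notes. */
final class EventEncoder {
    private EventEncoder() {}

    /** Builds 5(category*action[*label])[(value)] with each text part URL-encoded. */
    static String encodeEvent(String category, String action, String label, Long value) {
        try {
            StringBuilder sb = new StringBuilder("5(");
            sb.append(URLEncoder.encode(category, "UTF-8"));
            sb.append('*').append(URLEncoder.encode(action, "UTF-8"));
            if (label != null && !label.isEmpty()) {
                sb.append('*').append(URLEncoder.encode(label, "UTF-8"));
            }
            sb.append(')');
            if (value != null) {
                sb.append('(').append(value).append(')');
            }
            return sb.toString();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        // prints 5(Traffic*User*%2Fhandle%2F2173%2F3518)(20480)
        System.out.println(encodeEvent("Traffic", "User", "/handle/2173/3518", 20480L));
    }
}
```

In the bandwidth example from the notes, the value would be the number of bytes served for the request, so that Google Analytics can total it per URL and per User or Robot action.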
  • Returning to the stated problems with Google Analytics, here we demonstrate the issue with relating the reports to your repository.
    On the right, the Google Analytics report just consists of a list of URLs, some of which are item views, some are downloads, and some are other pages (logins, browse, search, etc.).
    The left shows that the repository we are reporting on has structure: items are grouped into collections, and collections are grouped into communities. How can we see which items and files were popular for a particular collection or community? You could create a segment in Google Analytics, but that’s a lot of work, and it would need maintaining every time you added items to, or removed items from, the collection(s).
  • Thankfully, Google Analytics now provides an API to retrieve data about your URLs and Events. It is built on the same protocols that Google uses for products like Calendar. Clients are provided for Javascript and Java, and there is extensive documentation of the authentication and HTTP / XML formats, so you can write clients in any other language.
    By pulling the data back from Google Analytics, you can do additional filtering and accumulation within the client, where you also have access to information about the repository and its structure. For instance, if a URL contains a particular handle identifier, you know which item it relates to.
    The API performance is good, but the latency will add up if you make lots of calls. There are also limits on the number of queries that can be made, in order to guarantee the service level.
    As a result, whilst you should be as specific as possible in your queries to reduce the number of rows returned, this should be balanced against favouring queries that return more data than the immediate requirement needs, but which can be cached and filtered client side for reuse.
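The client-side filtering and totalling described above can be sketched as follows. The sketch assumes the Analytics results have already been fetched and cached as a map of page path to pageview count, and that the repository can supply the set of handles belonging to a collection; the class and method names, and the sample paths, are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: filter cached Analytics rows down to one collection. */
final class CollectionStats {
    // Item pages are assumed to contain /handle/<prefix>/<suffix> in their path.
    private static final Pattern HANDLE = Pattern.compile("/handle/([0-9]+/[0-9]+)");

    private CollectionStats() {}

    /** Sums pageviews per handle, keeping only handles that belong to the collection. */
    static Map<String, Long> viewsForCollection(Map<String, Long> viewsByPath,
                                                Set<String> collectionHandles) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> row : viewsByPath.entrySet()) {
            Matcher m = HANDLE.matcher(row.getKey());
            if (m.find() && collectionHandles.contains(m.group(1))) {
                // several URLs (e.g. with query strings) may map to the same item
                result.merge(m.group(1), row.getValue(), Long::sum);
            }
        }
        return result;
    }

    /** Totals the per-item counts, giving a figure for the collection as a whole. */
    static long total(Map<String, Long> views) {
        long sum = 0;
        for (long v : views.values()) {
            sum += v;
        }
        return sum;
    }
}
```

Because the same cached map can be filtered against any collection's handle list, one Analytics query can serve every collection and item report until the cache expires.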
  • Here we see two examples of the Analytics API in use with a repository. On the left-hand side are statistics about an entire collection. Note that for the Total Visits, Top Item Views and Top File Downloads sections, just two queries were made of the Analytics API: one to retrieve the view counts of every item in the repository, and one to retrieve the counts of all the file downloads. A list of all the items in the collection was determined from the repository, and used to filter the Analytics values down to the relevant entries. The six-month view required an Analytics call for each of the six periods.
    The right-hand side shows an example of the statistics for an individual item. Note that the Total Visits, File Downloads and Per Month breakdowns use the same information retrieved and cached for the collection view (or vice versa). The Country and City information can also be retrieved from Google Analytics, and this is done on a per-item basis.
    Below is an example of some of the code used to retrieve the information in these examples.
    The parameters passed to getDataFeed are:
    • Dimension: the value to report (group) on
    • Metric: the value to count
    • Filter: only report on a subset of the data (a tilde signifies a regular expression)
    • Order: the order in which to return rows (a minus sign signifies descending order)
    • Start date: first date to include in the report
    • End date: last date to include in the report
    • Start index: offset to the first row to return (used when iterating over a large dataset)
    • Max rows: maximum number of rows
    // Excerpts only: each method shows the query it makes. The iteration over
    // startIndex and the conversion of the feed into the returned List are not shown.
    public List getViewsForAllHandles(String startDate, String endDate) {
        dataFeed = getDataFeed("ga:pagePath", "ga:pageviews",
                "ga:pagePath=~/handle/[0-9]+/[0-9]+;ga:pagePath!@statistics",
                "-ga:pageviews", startDate, endDate, startIndex, 0);
    }

    public List getDownloadsForAllBitstreams(String startDate, String endDate) {
        dataFeed = getDataFeed("ga:pagePath", "ga:pageviews",
                "ga:pagePath=~/bitstream/[0-9]+/[0-9]+/.*;ga:pagePath!@statistics",
                "-ga:pageviews", startDate, endDate, startIndex, 0);
    }

    public List getViewsByCountry(String startDate, String endDate, int maxRows) {
        String gaFilter = "ga:pagePath=~/(handle|bitstream)/[0-9]+/[0-9]+;ga:pagePath!@statistics";
        DataFeed dataFeed = getDataFeed("ga:country", "ga:pageviews", gaFilter,
                "-ga:pageviews", startDate, endDate, 0, maxRows);
    }

    public List getViewsByCountry(String handle, String startDate, String endDate, int maxRows) {
        String gaFilterItem = "ga:pagePath=~/(handle|bitstream)/" + handle + ";ga:pagePath!@statistics";
        DataFeed dataFeed = getDataFeed("ga:country", "ga:pageviews", gaFilterItem,
                "-ga:pageviews", startDate, endDate, 0, maxRows);
    }

    public List getViewsByCity(String startDate, String endDate, int maxRows) {
        String gaFilter = "ga:pagePath=~/(handle|bitstream)/[0-9]+/[0-9]+;ga:pagePath!@statistics";
        DataFeed dataFeed = getDataFeed("ga:city,ga:country", "ga:pageviews", gaFilter,
                "-ga:pageviews", startDate, endDate, 0, maxRows);
    }

    public List getViewsByCity(String handle, String startDate, String endDate, int maxRows) {
        String gaFilterItem = "ga:pagePath=~/(handle|bitstream)/" + handle + ";ga:pagePath!@statistics";
        DataFeed dataFeed = getDataFeed("ga:city,ga:country", "ga:pageviews", gaFilterItem,
                "-ga:pageviews", startDate, endDate, 0, maxRows);
    }
  • So far we have simple statistics reports that just provide the raw numbers. However, it is often easier to interpret the reports with a visual representation.
    Google also provides a free API that is very easy to use and produces useful and attractive visuals. There are two modes of operation.
    Static images: some simple charts are available as static images. You simply create an <img> tag where the src attribute is the API URL, and pass the data that you want to chart as parameters. This is the most portable representation, but there are limits on the amount of data that you can pass.
    Dynamic images: you can also use Javascript / AJAX to load images on to your page. Simply place a <div> tag (with an id attribute) where you want the image to appear, and use Javascript to initialize the API, point it at the div, and prepare the data you want to chart. Note that most charts produced this way are interactive SVG elements, although some are rendered in Flash.
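As an illustration of the static image mode, the sketch below builds the src URL for a pie chart from a map of labels to counts. It assumes the Google Image Charts endpoint and its cht (chart type), chs (size), chd (data) and chl (labels) parameters; that service has since been deprecated by Google, so treat the URL as illustrative. The class and method names are hypothetical, and values are scaled so the largest becomes 100, to fit the 0 to 100 range of the text data format.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: build a static chart image URL from label/count pairs. */
final class StaticChartUrl {
    private StaticChartUrl() {}

    static String pieChartUrl(Map<String, Long> values, int width, int height) {
        long max = 0;
        for (long v : values.values()) {
            max = Math.max(max, v);
        }
        StringBuilder data = new StringBuilder();
        StringBuilder labels = new StringBuilder();
        try {
            for (Map.Entry<String, Long> e : values.entrySet()) {
                if (data.length() > 0) {
                    data.append(',');
                    labels.append('|');
                }
                // scale each value relative to the largest, which becomes 100
                long scaled = max == 0 ? 0 : Math.round(100.0 * e.getValue() / max);
                data.append(scaled);
                labels.append(URLEncoder.encode(e.getKey(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 is always available", ex);
        }
        return "https://chart.googleapis.com/chart?cht=p&chs=" + width + "x" + height
                + "&chd=t:" + data + "&chl=" + labels;
    }

    public static void main(String[] args) {
        Map<String, Long> views = new LinkedHashMap<>();
        views.put("United Kingdom", 50L);
        views.put("Germany", 25L);
        System.out.println(pieChartUrl(views, 300, 150));
    }
}
```

Because all the data travels in the URL, a long list of items can exceed the request size limits mentioned on the slides, which is one reason to chart only the top few rows this way.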
  • The Google Visualization API provides many different charts: line, bar, pie, heat maps, gauges, and geo maps (with country heat maps for the entire world, or city dots for specific regions).
    This is where the integration with Google Analytics works really well. For example, the geo map takes actual country and city names, and the text values it recognises are the same as the ones returned by the Analytics API.
  • And here are the same examples of using the Analytics API from earlier, this time integrating the Visualization API to provide a visual representation of the numbers.
    The line and bar charts are clickable: clicking on a node in the charts produces a label that shows the numeric value.
    The map cannot be clicked on to zoom in to specific regions; however, you can hover over each country to see how many views originated there.
    These are real-world examples, which you can see by following the URLs provided.

Presentation Transcript

  • Enhancing Statistics
    Using the Google Analytics and Visualisation APIs
  • Existing Solutions
    Statistics packages shipped with or added on to IRs
  • Log Analysis
    DSpace 1.3-1.5
  • Event Driven
    DSpace 1.6
    • Internal DSpace events
    • External SOLR app
    • Ad hoc queries
  • Google Analytics - Overview
    Its strengths, and its weaknesses
  • Google Analytics: Pros
    Easy Setup
    Powerful tools
    Tracks cache hits
    Email reports
  • Google Analytics: Cons
    Tracks page views
    (no downloads)
    Requires Javascript
    Separate login
    Reports URL path/part
    Hard to locate specific items
  • Google Analytics – Tracking Issues
    Shortcomings of the Google tracker, and how to overcome them
  • Problem: Downloads
    • Page tracked in footer
    • Download link will return PDF, not html – so no Analytics tracker
    <a target="_blank" href="/e-space/bitstream/2173/3518/3/williams%20-%20specificity%20of%20acceleration.pdf" >williams - specificity of acceleration.pdf</a>
    <script type="text/javascript">
    var pageTracker =
  • Downloads – Solution 1
    • OnClick tracks url of download
    • Every download link needs to be changed
    • Does not track links from outside
    <a target="_blank" href="/e-space/bitstream/2173/3518/3/williams%20-%20specificity%20of%20acceleration.pdf" onclick="javascript:pageTracker._trackPageview('/e-space/bitstream/2173/3518/3/williams%20-%20specificity%20of%20acceleration.pdf');">williams - specificity of acceleration.pdf</a>
    <script type="text/javascript">
    var pageTracker =
  • Tracking without Javascript
    Mobile site code
    • Introduced Oct/Nov 2009
    • Server side code retrieves image from Google Analytics to log request
    • Supports page views and events
    • PHP/Perl/JSP/ASPX code provided
  • Downloads – Solution 2
    Identifying robots
    Contains: bot, crawl, fetch, ndex, nutch, spider
    Major engines: teoma, bing, msnbot, slurp
    Contains email address or url
    • Intercept download request
    • Test user agent for robots
    • Call http://www.google-analytics.com/__utm.gif from server
    Documentation for the __utm.gif parameters:
  • Event Tracking
    • Events have Category and Action, e.g. ‘Videos’ and ‘Play’
    • Optional label, e.g. name of video
    • Optional value, e.g. time to load
    • Log via Javascript, or Mobile site code
    Event tracking guide:
  • Google Analytics – Reporting Issues
    Relating the statistics to repository contents
  • Relating to the Repository
    • No structure
    • URLs, not items/files
  • Analytics API
    • Launched April 2009
    • Same API as Calendar, Finance, etc.
    • Javascript and Java clients provided
    • Protocol uses HTTP and XML
    • Choose dimensions, metrics, filters, sorting
    Usage Tips
    • Prefer fewer distinct queries
    • Prefer queries that return fewer rows
    • Do some filtering client side
    • Do totalling client side
  • Analytics Examples
  • Google Visualizations
    Add charts and maps to your reports
  • Chart Tools / Interactive Tools
    • Simple charts are static images
    • Interactive charts use Javascript / SVG / Flash
    • Simple charts have data limits (2K GET / 16K POST)
    • Can be used with any data source
    • Works well with Analytics exports
    For more information on how to use the visualization API:
  • Useful Charts
    See more at the gallery:
  • Analytics and Charts
  • Thank You
    Any questions?
    “Whenever you can, count.”
    -- Sir Francis Galton
    Graham Triggs
    Technical Architect
    Open Repository
    W: http://www.openrepository.com/
    E: graham@biomedcentral.com