Why use big data tools to do web analytics? And how to do it using Snowplow and Qubole

There are a number of mature web analytics products that have been on the market for ~20 years. Big data tools have only really taken off in the last 5 years. So why use big data tools to mine web analytics data?

In this presentation, I explore the limitations of traditional approaches to web analytics, and explain how big data tools can be used to address those limitations and drive more value from the underlying data. I then explain how a combination of Snowplow and Qubole can be used to do this in practice.

  1. Using big data tools to analyse web analytics data
     • Why use big data tools to analyse web analytics data?
     • How would you use big data tools to analyse web analytics data (with Snowplow and Qubole)?
  2. Web event data is incredibly valuable
     • It tells you how your customers actually behave (in lots of detail), and how that varies
       • Between different customers
       • For the same customers over time (seasonality, progress in the customer journey)
       • How behaviour drives value
     • It tells you how customers engage with you via your website / webapp
       • How that varies by different versions of your product
       • How improvements to your product drive increased customer satisfaction and lifetime value
     • It tells you how customers and prospective customers engage with your different marketing campaigns, and how that drives subsequent behaviour
     Web analytics data should be essential to driving customer development, product development and marketing decisions.
  3. Deriving value from web analytics data often involves very bespoke analytics
     • The web is a rich and varied space! E.g.:
       • Bank
       • Newspaper
       • Social network
       • Analytics application
       • Government organisation (e.g. tax office)
       • Retailer
       • Marketplace
     • For each type of business you’d expect different:
       • Types of events, with different types of associated data
       • Ecosystems of customers / partners with different types of relationships
       • Product development cycles (and approaches to product development)
       • Types of business questions / priorities to inform how the data is analysed
  4. Web analytics tools are good at delivering the standard reports that are common across different business types…
     • Where does your traffic come from? E.g.:
       • Sessions by marketing campaign / referrer
       • Sessions by landing page
     • Understanding events common across business types (page views, transactions, ‘goals’), e.g.:
       • Page views per session
       • Page views per web page
       • Conversion rate by traffic source
       • Transaction value by traffic source
     • Capturing contextual data common to people browsing the web:
       • Timestamps
       • Referer data
       • Web page data (e.g. page title, URL)
       • Browser data (e.g. type, plugins, language)
       • Operating system (e.g. type, timezone)
       • Hardware (e.g. mobile / tablet / desktop, screen resolution, colour depth)
  5. …but not at enabling the high-value bespoke analytics
     • What is the impact of different ad campaigns and creative on the way users behave subsequently? What is the return on that ad spend?
     • How do visitors use social channels (Facebook / Twitter) to interact around video content? How can we predict which content will “go viral”?
     • How do updates to our product change the “stickiness” of our service? ARPU? Does that vary by customer segment?
  6. That is because there are significant limitations in the way traditional web analytics programmes handle data collection, data processing and data access:
     Data collection
     • Sample-based (e.g. Google Analytics)
     • Limited set of events e.g. page views, goals, transactions
     • Limited set of ways of describing events (custom dim 1, custom dim 2…)
     Data processing
     • Data is processed ‘once’
     • No validation
     • No opportunity to reprocess e.g. following an update to business rules
     • Data is aggregated prematurely
     Data access
     • Data is either aggregated (e.g. Google Analytics), or available as a complete log file for a fee (e.g. Adobe SiteCatalyst)
     • Only particular combinations of metrics / dimensions can be pivoted together (Google Analytics)
     • Only particular types of analysis are possible on different types of dimension (e.g. sProps, eVars, conversion goals in SiteCatalyst)
     • As a result, data is siloed: hard to join with other data sets
  7. We built Snowplow to address those limitations and enable high-value, bespoke analytics on web event data
     [Diagram: data pipeline feeding a big data store]
     Snowplow is a data pipeline:
     • Captures data from your website via Javascript tags
     • Validates, cleans, and enriches the incoming data (using Hadoop)
     • Loads the cleaned / enriched data into a big data store (e.g. S3) for analysis, where it can be analysed using big data tools (e.g. Qubole)
  8. Understanding the technology that powers the Snowplow data pipeline
     The Snowplow data pipeline consists of five loosely coupled modules:
  9. Understanding the technology that powers the Snowplow data pipeline
     The Snowplow data pipeline consists of five loosely coupled modules:
     Trackers generate event data
     • Javascript tracker for collecting data client-side
     • No-JS / pixel tracker (e.g. for email marketing)
     • Server-side trackers (e.g. Lua tracker; Python / Ruby / Java / Scala on the roadmap)
     • Mobile trackers (iOS, with Android on the roadmap…)
     • Internet of things (e.g. Arduino tracker)
  10. Understanding the technology that powers the Snowplow data pipeline
     The Snowplow data pipeline consists of five loosely coupled modules:
     Collectors receive data and write it to a queue for processing
     • Cloudfront collector writes data to S3
     • Clojure collector sets a 3rd-party cookie and writes to S3
     • Scala RT collector sets a 3rd-party cookie and writes to S3 AND Kinesis
  11. Understanding the technology that powers the Snowplow data pipeline
     The Snowplow data pipeline consists of five loosely coupled modules:
     Enrichment validates and enriches the data
     • Validates e.g. checks that the expected fields are set for each event type
     • Enriches e.g. categorising referrers (search / social), inferring location from IP address
     • Hadoop-based enrichment module (easy reprocessing of data)
     • Kinesis-based enrichment module (real-time processing) in development
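To make the enrichment step concrete, here is a minimal, hypothetical HiveQL sketch of the kind of downstream query those enrichments enable once the data is loaded into the events table described later in the deck (slide 13). The column names refr_medium and geo_country are assumptions based on Snowplow's canonical event model; they do not appear in this presentation.

```sql
-- Hypothetical sketch only: querying enrichment-derived fields.
-- Assumes an `events` table (see slide 13) with refr_medium (referrer category
-- such as 'search' or 'social') and geo_country (country inferred from the IP address).
SELECT
  refr_medium,                                   -- output of the referrer-categorisation enrichment
  geo_country,                                   -- output of the IP-based location enrichment
  COUNT(DISTINCT domain_userid) AS unique_visitors
FROM events
WHERE event = 'page_view'
GROUP BY refr_medium, geo_country;
```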
  12. Understanding the technology that powers the Snowplow data pipeline
     The Snowplow data pipeline consists of five loosely coupled modules:
     Storage – make data available for analysis
     • Store data in Amazon S3 for processing using big data tools e.g. Qubole
     • Also support storage in Amazon Redshift / PostgreSQL for analysis using traditional BI tools
  13. So what does Snowplow data look like?
     • A single table
     • One line of data per event
     • Fat table: 98 different fields (and counting)…
     Type of field | Example field(s) | Description
     User ID | domain_userid, network_userid | Fields to identify the user browsing: 1st- and 3rd-party cookie IDs, browser fingerprints, IP address, and a separate field for setting a custom value are all available
     Web page | page_urlpath | Fields that describe the web page the event occurred on, including document size, URL and title
     Traffic source | mkt_source, refr_source | Fields that indicate the source of traffic. Snowplow includes fields that can be set via utm parameters, and others based on the referrer
     Event (rather than context) | event, se_action, tr_total | Fields that relate to a specific event (e.g. transaction total)
     User tech setup | br_type, os_name, dvce_type, br_viewheight | Fields that describe the user’s browser / OS / device setup
     … | … | …
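To illustrate how these event-level fields combine in practice, here is a minimal HiveQL sketch over that single fat table. The column names come from the table above, but the query itself is an illustrative assumption rather than something shown in the deck.

```sql
-- Minimal sketch: page views and unique visitors per traffic source,
-- computed directly against the single event-level Snowplow table.
SELECT
  mkt_source,                                    -- traffic-source field, settable via utm parameters
  COUNT(*)                      AS page_views,   -- one row per event, so COUNT(*) counts events
  COUNT(DISTINCT domain_userid) AS unique_visitors
FROM events
WHERE event = 'page_view'                        -- assuming 'page_view' marks page-view events
GROUP BY mkt_source
ORDER BY page_views DESC;
```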
  14. How do you analyse Snowplow data with Qubole?
     • Common approach: use Hive on Qubole (could also use Pig or other Hadoop-based jobs)
     • Create the events table (incl. recovering partitions)
     • Write highly bespoke queries directly against the complete events table
     (Both steps are sketched below.)
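A rough HiveQL sketch of those two steps, assuming the enriched events are tab-separated files in S3 laid out in run-based partitions. The bucket path is a placeholder and the table definition is heavily abridged; the real DDL, covering all 98 fields, ships with Snowplow.

```sql
-- Step 1: define an external events table over the Snowplow output in S3.
-- Abridged to a few columns for illustration; the full table has ~98 fields.
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  collector_tstamp STRING,
  event            STRING,
  domain_userid    STRING,
  page_urlpath     STRING,
  mkt_source       STRING,
  refr_source      STRING
)
PARTITIONED BY (run STRING)                             -- assumes run=... directories in S3
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://your-snowplow-bucket/enriched/events/';  -- placeholder path

-- Pick up the partitions already sitting in S3.
-- (ALTER TABLE ... RECOVER PARTITIONS is the Qubole / EMR Hive form;
--  MSCK REPAIR TABLE events; is the vanilla Hive equivalent.)
ALTER TABLE events RECOVER PARTITIONS;

-- Step 2: write bespoke queries directly against the complete event-level table,
-- e.g. how many distinct pages each visitor viewed, per traffic source.
SELECT
  mkt_source,
  domain_userid,
  COUNT(DISTINCT page_urlpath) AS distinct_pages_viewed
FROM events
WHERE event = 'page_view'
GROUP BY mkt_source, domain_userid;
```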
  15. DEMO!
  16. Performing more sophisticated analysis
     • Unfortunately there isn’t time on this webinar to do a deeper demo…
     • …however, there are resources available, in particular the Snowplow Analytics Cookbook: http://snowplowanalytics.com/analytics/index.html
