Big Data Beers - Introducing Snowplow

Technical introduction to Snowplow given at Big Data Beers on 25th September 2014. It explores how we use a variety of "big data" technologies including Hadoop, Kinesis and Redshift.

  1. Introducing Snowplow. Big Data Beers, Berlin. Huge thanks to Zalando for hosting!
  2. Snowplow is an open-source web and event analytics platform; the first version was released in early 2012.
     • Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business, in 2008
     • After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy
     • We released Snowplow as a skunkworks prototype at the start of 2012: github.com/snowplow/snowplow
     • We started working full time on Snowplow in summer 2013
  3. At Keplar, we grew frustrated by significant limitations in traditional web analytics programs:
     Data collection
     • Sample-based (e.g. Google Analytics)
     • Limited set of events, e.g. page views, goals, transactions
     • Limited set of ways of describing events (custom dim 1, custom dim 2…)
     Data processing
     • Data is processed ‘once’
     • No validation
     • No opportunity to reprocess, e.g. following an update to business rules
     • Data is aggregated prematurely
     Data access
     • Only particular combinations of metrics / dimensions can be pivoted together (Google Analytics)
     • Only particular types of analysis are possible on different types of dimension (e.g. sProps, eVars, conversion goals in SiteCatalyst)
     • Data is either aggregated (e.g. Google Analytics), or available as a complete log file for a fee (e.g. Adobe SiteCatalyst)
     • As a result, data is siloed: hard to join with other data sets
  4. And we saw the potential of new “big data” technologies and services to solve these problems in a scalable, low-cost manner: CloudFront, Amazon S3, Amazon EMR, Amazon Redshift. These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis.
  5. We wanted to take a fresh approach to web analytics:
     • Your own web event data -> in your own data warehouse
     • Your own event data model
     • Slice / dice and mine the data in highly bespoke ways to answer your specific business questions
     • Plug in the broadest possible set of analysis tools to drive value from your data
     Data pipeline -> data warehouse -> analyse your data in any analysis tool
  6. Early on, we made a crucial decision: Snowplow should be composed of a set of loosely coupled subsystems:
     1. Trackers: generate event data from any environment. Launched with: JavaScript tracker
     2. Collectors: log raw events from trackers. Launched with: CloudFront collector
     3. Enrich: validate and enrich raw events. Launched with: HiveQL + Java UDF-based enrichment
     4. Storage: store enriched events ready for analysis. Launched with: Amazon S3
     5. Analytics: analyze enriched events. Launched with: HiveQL recipes
     The standardised data protocols (A, B, C, D) linking these subsystems turned out to be critical to allowing us to evolve the above stack.
  7. Our initial skunkworks version of Snowplow: it was basic but it worked, and we started getting traction. Snowplow data pipeline v1 (spring 2012): website / webapp with the JavaScript event tracker -> CloudFront-based pixel collector -> HiveQL + Java UDF “ETL” -> Amazon S3.
  8. What did people start using it for? Warehousing their web event data, to enable…
     • Agile aka ad hoc analytics
     • Marketing attribution modelling
     • Customer lifetime value calculations
     • Customer churn detection
     • RTB fraud
     • Product recommendations
  9. Current Snowplow design and architecture
  10. Our protocol-first, loosely-coupled approach made it possible to start swapping out existing components… Snowplow data pipeline v2 (spring 2013): website / webapp with the JavaScript event tracker -> CloudFront-based or Clojure-based event collector -> Scalding-based enrichment (replacing the HiveQL + Java UDF “ETL”) -> Amazon S3 -> Amazon Redshift / PostgreSQL.
  11. The same v2 pipeline, annotated with the reasons for each swap:
      • Clojure-based event collector: allows Snowplow users to set a third-party cookie with a user ID. Important for ad networks, widget companies, multi-domain retailers
      • Amazon Redshift / PostgreSQL: because Snowplow users wanted a much faster query loop than HiveQL/MapReduce
      • Scalding-based enrichment: we wanted a robust, feature-rich framework for managing validations, enrichments etc.
  12. So far we have open-sourced a number of different trackers, with more planned.
      Production ready: JavaScript, No-JavaScript (image beacon), Python, Lua, Arduino
      Beta: Ruby, iOS, Android, Node.js
      In development: .NET, PHP
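Conceptually, every one of these trackers does the same thing: it assembles the event as a set of name/value pairs and fires them at a collector endpoint over HTTP. A minimal sketch in Scala, assuming a hypothetical collector URL and illustrative parameter names rather than the exact Snowplow tracker protocol fields:

```scala
import java.net.{URL, URLEncoder}

// Illustrative only: a tracker encodes an event as name/value pairs and fires them
// at the collector endpoint. Endpoint and field names here are hypothetical.
object MiniTracker {
  val collector = "http://collector.acme.com/i" // hypothetical collector endpoint

  private def enc(s: String) = URLEncoder.encode(s, "UTF-8")

  def trackPageView(pageUrl: String, pageTitle: String): Unit = {
    val params = Map(
      "e"    -> "pv",          // event type: page view
      "url"  -> pageUrl,       // page URL
      "page" -> pageTitle,     // page title
      "p"    -> "web",         // platform
      "tv"   -> "scala-0.1.0"  // tracker name and version
    )
    val qs = params.map { case (k, v) => s"${enc(k)}=${enc(v)}" }.mkString("&")
    // Fire-and-forget GET; the CloudFront/Clojure collector simply logs the request
    new URL(s"$collector?$qs").openStream().close()
  }
}
```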
  13. Enrichment process: what is Scalding? Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop. (Stack: Scalding, Cascalog, PyCascading and cascading.jruby sit on top of Cascading, which, like Hive, Pig and plain Java MapReduce code, runs on Hadoop MapReduce over the Hadoop DFS.)
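For readers who have not seen Scalding before, here is the classic word-count job, just to show the shape of the API; Snowplow's actual enrichment job is structured along the same lines but is considerably more involved:

```scala
import com.twitter.scalding._

// The canonical Scalding example: read lines, split into words, count occurrences.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
```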
  14. Our “enrichment process” (formerly known as ETL) actually does two things: validation and enrichment.
      • Our validation model looks like this: raw events go into the Enrichment Manager, which emits “good” enriched events plus “bad” raw events together with the reasons why they are bad
      • Under the covers, we use a lot of monadic Scala (Scalaz) code
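A rough sketch of that validation style, using Scalaz's ValidationNel so that every reason an event is bad can be accumulated rather than stopping at the first problem. The RawEvent and EnrichedEvent case classes and the individual checks are simplified stand-ins, not Snowplow's real types:

```scala
import scalaz._
import Scalaz._

// Simplified event shapes for illustration only
case class RawEvent(ip: String, useragent: String, pageUrl: String)
case class EnrichedEvent(ip: String, useragent: String, pageUrl: String)

// Each check either succeeds with a value or fails with a reason
def validateIp(raw: RawEvent): ValidationNel[String, String] =
  if (raw.ip.nonEmpty) Success(raw.ip) else Failure(NonEmptyList("Missing IP address"))

def validateUseragent(raw: RawEvent): ValidationNel[String, String] =
  if (raw.useragent.nonEmpty) Success(raw.useragent) else Failure(NonEmptyList("Missing useragent"))

def validateUrl(raw: RawEvent): ValidationNel[String, String] =
  if (raw.pageUrl.startsWith("http")) Success(raw.pageUrl) else Failure(NonEmptyList("Invalid page URL"))

// Combine the checks applicatively: a good event comes out enriched,
// a bad event comes out with the full list of reasons it failed.
def enrich(raw: RawEvent): ValidationNel[String, EnrichedEvent] =
  (validateIp(raw) |@| validateUseragent(raw) |@| validateUrl(raw)) { EnrichedEvent(_, _, _) }
```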
  15. Adding the enrichments that web analysts expect = very important to Snowplow uptake.
      • Web analysts are used to a very specific set of enrichments from Google Analytics, SiteCatalyst etc.
      • These enrichments have evolved over the past 15-20 years and are very domain specific:
        • Page querystring -> marketing campaign information (utm_* fields)
        • Referer data -> search engine name, country, keywords
        • IP address -> geographical location
        • Useragent -> browser, OS, computer information
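As an illustration of the first of those, a toy campaign enrichment that pulls the standard utm_* fields out of a page URL querystring; the CampaignInfo type and function name are made up for this sketch:

```scala
import java.net.URLDecoder

// Illustrative types and names, not Snowplow's actual enrichment code
case class CampaignInfo(source: Option[String], medium: Option[String],
                        campaign: Option[String], term: Option[String], content: Option[String])

def extractCampaign(pageUrl: String): CampaignInfo = {
  // Pull out the querystring, if any
  val qs = pageUrl.split("\\?", 2) match {
    case Array(_, q) => q
    case _           => ""
  }
  // Decode the querystring into a key -> value map
  val params: Map[String, String] = qs.split("&").filter(_.contains("=")).map { kv =>
    val Array(k, v) = kv.split("=", 2)
    URLDecoder.decode(k, "UTF-8") -> URLDecoder.decode(v, "UTF-8")
  }.toMap

  CampaignInfo(params.get("utm_source"), params.get("utm_medium"),
               params.get("utm_campaign"), params.get("utm_term"), params.get("utm_content"))
}

// extractCampaign("http://acme.com/?utm_source=google&utm_medium=cpc&utm_campaign=summer")
```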
  16. Ongoing evolution of Snowplow
  17. There are three big aspects to Snowplow’s 2014 roadmap:
      1. Make Snowplow work for non-web (e.g. mobile, IoT) environments as well as the web – RELEASED
      2. Make Snowplow work with users’ JSON events as well as with our pre-defined events (aka page views, ecommerce transactions etc) – RELEASED
      3. Move Snowplow away from an S3-based data pipeline to a unified log (Kinesis/Kafka)-based data pipeline – ONGOING
  18. Snowplow is developing into an event analytics platform (not just a web analytics platform): collect event data from any connected device into the data warehouse.
  19. Web analysts work with a small number of event types – outside of web, the number of possible event types is… infinite.
      Web events: page view, page activity, order, add to basket
      All events: game saved, car started, machine broke, spellcheck run, fridge empty, screenshot taken, app crashed, SMS sent, disk full, screen viewed, player died, tweet drafted, till opened, product returned, taxi arrived, cluster started, phonecall ended, …
  20. As we get further away from the web, we needed to start supporting users’ own JSON events – specifically, events represented as JSONs with arbitrary name: value pairs (arbitrary to Snowplow, not to the company using Snowplow!).
  21. Supporting both a fixed set of web events and JSON events is a difficult problem. Almost everybody in event analytics falls on one or the other side of this divide: a fixed set of web events (page views etc) plus custom variables, versus send-anything JSONs.
  22. We wanted to bridge that divide, making it so that Snowplow comes with structured events “out of the box”, but is extensible with unstructured events – spanning both the fixed set of web events (page views etc) + custom variables and the send-anything JSONs.
  23. Issues with the event name:
      • Separate from the event properties
      • Not versioned
      • Not unique – HBO video played versus Brightcove video played
      Lots of unanswered questions about the properties:
      • Is length required, and is it always a number?
      • Is id required, and is it always a string?
      • What other optional properties are allowed for a video play?
      Other issues:
      • What if the developer accidentally starts sending “len” instead of “length”? The data will end up split across two separate fields
      • Why does the analyst need to keep an implicit schema in their head to analyze video played events?
  24. MixPanel et al cause “schema loss”
  25. We decided to use JSON Schema, with additional metadata about what the schema represents.
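Roughly what such a schema might look like for the video_played example used on the following slides, with a self block carrying the extra metadata about what the schema represents; treat the exact layout as illustrative rather than the definitive Iglu format:

```scala
// Illustrative self-describing JSON Schema for a video_played event;
// the "self" block is the additional metadata mentioned on the slide.
val videoPlayedSchema = """
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "self": {
    "vendor": "com.channel2.vod",
    "name": "video_played",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "id":     { "type": "string" },
    "length": { "type": "number" }
  },
  "required": ["id", "length"],
  "additionalProperties": false
}
"""
```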
  26. From a tracker, you send in a JSON which is self-describing, with a schema header and data body.
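For the same example, the payload a tracker sends might look like this: a schema reference as the header and the event properties as the body (the property values here are made up):

```scala
// Illustrative self-describing event: schema header plus data body
val selfDescribingEvent = """
{
  "schema": "iglu:com.channel2.vod/video_played/jsonschema/1-0-0",
  "data": {
    "id": "hbo-12345",
    "length": 3600
  }
}
"""
```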
  27. Anatomy of an Iglu schema URI: iglu:com.channel2.vod/video_played/jsonschema/1-0-0 is made up of the vendor of this event (com.channel2.vod), the event name (video_played), the schema format (jsonschema) and the schema version (1-0-0). We are calling our schema repository technology Iglu.
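A small, hypothetical helper that unpacks such a URI into its four parts, mirroring the anatomy described above:

```scala
// Hypothetical helper types and names, for illustration only
case class SchemaKey(vendor: String, name: String, format: String, version: String)

def parseIgluUri(uri: String): Option[SchemaKey] = {
  // vendor / name / format / version, each a slash-free segment
  val IgluUri = "iglu:([^/]+)/([^/]+)/([^/]+)/([^/]+)".r
  uri match {
    case IgluUri(vendor, name, format, version) => Some(SchemaKey(vendor, name, format, version))
    case _                                      => None
  }
}

// parseIgluUri("iglu:com.channel2.vod/video_played/jsonschema/1-0-0")
// => Some(SchemaKey("com.channel2.vod", "video_played", "jsonschema", "1-0-0"))
```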
  28. To add this to Snowplow, we developed a new schema repository called Iglu, and a shredding step in Hadoop.
  29. JSON Schema just gives us a data structure for events – we are also evolving a grammar to capture the semantics of events: Subject, Verb, Direct Object, Indirect Object, Prepositional Object and Event Context.
  30. In parallel, we plan to evolve Snowplow from an event analytics platform into a “digital nervous system” for data-driven companies.
      • The event data fed into Snowplow is written into a “unified log”
      • This becomes the “single source of truth”, upstream from the data warehouse
      • The same source of truth is used for real-time data processing as well as analytics, e.g. product recommendations, ad targeting, real-time website personalisation, systems monitoring
      Snowplow will drive data-driven processes as well as offline analytics.
  31. Some background on unified log based architectures. (Diagram: narrow data silos – search, e-commerce, CRM, email marketing, ERP and CMS silos spread across SaaS vendors and your own cloud vendor / data center, with only some low-latency local loops – feed a unified log via streaming APIs / web hooks. The log is archived to Hadoop, giving wide data coverage and full data history for high-latency uses such as ad hoc analytics and management reporting, while the low-latency eventstream drives systems monitoring, product rec’s, fraud detection, churn prevention and APIs.)
  32. We are part way through our Kinesis support, with additional components being released soon.
      Snowplow Trackers -> Scala Stream Collector -> raw event stream -> Enrich Kinesis app -> enriched event stream (plus a bad raw events stream) -> S3 sink Kinesis app -> S3, and Redshift sink Kinesis app -> Redshift.
      • Some of these components are still under development – we are working with Snowplow community members on these collaboratively
      • We are also starting work on support for Apache Kafka alongside Kinesis – for users who don’t want to run Snowplow on AWS
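To make the enriched event stream concrete, here is a bare-bones consumer that tails a single shard using the low-level AWS Java SDK from Scala. The stream and shard names are placeholders, and the real sink apps are proper Kinesis applications rather than a polling loop like this:

```scala
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.{GetRecordsRequest, GetShardIteratorRequest}
import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.JavaConverters._

// A minimal sketch: tail one shard of a Kinesis stream and print each record.
// Credentials and region come from the default AWS provider chain.
object EnrichedStreamTailer extends App {
  val kinesis    = new AmazonKinesisClient()
  val streamName = "snowplow-enriched-events" // placeholder stream name

  val iteratorResult = kinesis.getShardIterator(
    new GetShardIteratorRequest()
      .withStreamName(streamName)
      .withShardId("shardId-000000000000")   // placeholder shard id
      .withShardIteratorType("TRIM_HORIZON"))

  var iterator = iteratorResult.getShardIterator
  while (iterator != null) {
    val result = kinesis.getRecords(new GetRecordsRequest().withShardIterator(iterator).withLimit(100))
    result.getRecords.asScala.foreach { record =>
      println(UTF_8.decode(record.getData).toString) // each record is one enriched event
    }
    iterator = result.getNextShardIterator
    Thread.sleep(1000) // simple throttle between polls
  }
}
```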
  33. Questions?
      • ulogprugcf (43% off the Unified Log Processing eBook)
      • http://snowplowanalytics.com
      • https://github.com/snowplow/snowplow
      • @snowplowdata
      I am in Berlin tomorrow – to meet up or chat, @alexcrdean on Twitter or alex@snowplowanalytics.com
