SlideShare a Scribd company logo

Big Data Beers - Introducing Snowplow

Technical introduction to Snowplow given at Big Data Beers on 25th September 2014. a Explored how we use a variety of "big data" technologies including Hadoop, Kinesis and Redshift.

1 of 33
Download to read offline
Introducing Snowplow 
Big Data Beers, Berlin 
Huge thanks to Zalando for hosting!
Snowplow is an open-source web and event analytics platform, 
first version released in early 2012 
• Co-founders Alex Dean and Yali Sassoon met at 
OpenX, the open-source ad technology business 
in 2008 
• After leaving OpenX, Alex and Yali set up Keplar, 
a niche digital product and analytics consultancy 
• We released Snowplow as a skunkworks 
prototype at start of 2012: 
github.com/snowplow/snowplow 
• We started working full time on Snowplow in 
summer 2013
At Keplar, we grew frustrated by significant limitations in 
traditional web analytics programs 
Data collection Data processing Data access 
• Sample-based (e.g. 
Google Analytics) 
• Limited set of events e.g. 
page views, goals, 
transactions 
• Limited set of ways of 
describing events 
(custom dim 1, custom 
dim 2…) 
• Data is processed ‘once’ 
• No validation 
• No opportunity to 
reprocess e.g. following 
update to business rules 
• Data is aggregated 
prematurely 
• Only particular 
combinations of metrics 
/ dimensions can be 
pivoted together 
(Google Analytics) 
• Only particular type of 
analysis are possible on 
different types of 
dimension (e.g. sProps, 
eVars, conversion goals 
in SiteCatalyst 
• Data is either aggregated 
(e.g. Google Analytics), 
or available as a 
complete log file for a 
fee (e.g. Adobe 
SiteCatalyst) 
• As a result, data is siloed: 
hard to join with other 
data sets
And we saw the potential of new “big data” technologies and 
services to solve these problems in a scalable, low-cost manner 
CloudFront Amazon S3 Amazon EMR Amazon Redshift 
These tools make it possible to capture, transform, store and analyse all your 
granular, event-level data, to you can perform any analysis
We wanted to take a fresh approach to web analytics 
• Your own web event data -> in your own data warehouse 
• Your own event data model 
• Slice / dice and mine the data in highly bespoke ways to answer your 
specific business questions 
• Plug in the broadest possible set of analysis tools to drive value from your 
data 
Data pipeline Data warehouse 
Analyse your data in 
any analysis tool
Early on, we made a crucial decision: Snowplow should be 
composed of a set of loosely coupled subsystems 
1. Trackers 2. Collectors A B 3. Enrich C 4. Storage D 5. Analytics 
Generate event 
data from any 
environment 
Launched with: 
• JavaScript 
tracker 
Log raw events 
from trackers 
Launched with: 
• CloudFront 
collector 
Validate and 
enrich raw 
events 
Launched with: 
• HiveQL + 
Java UDF-based 
enrichment 
D = Standardised data protocols 
Store enriched 
events ready 
for analysis 
Launched with: 
• Amazon S3 
Analyze 
enriched 
events 
Launched with: 
• HiveQL 
recipes 
These turned out to be critical to allowing us 
to evolve the above stack
Ad

Recommended

Snowplow Analytics: from NoSQL to SQL and back again
Snowplow Analytics: from NoSQL to SQL and back againSnowplow Analytics: from NoSQL to SQL and back again
Snowplow Analytics: from NoSQL to SQL and back againAlexander Dean
 
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...Alexander Dean
 
Big data meetup budapest adding data schemas to snowplow
Big data meetup budapest   adding data schemas to snowplowBig data meetup budapest   adding data schemas to snowplow
Big data meetup budapest adding data schemas to snowplowyalisassoon
 
Understanding event data
Understanding event dataUnderstanding event data
Understanding event datayalisassoon
 
Unified Log London (May 2015) - Why your company needs a unified log
Unified Log London (May 2015) - Why your company needs a unified logUnified Log London (May 2015) - Why your company needs a unified log
Unified Log London (May 2015) - Why your company needs a unified logAlexander Dean
 
How we use Hive at SnowPlow, and how the role of HIve is changing
How we use Hive at SnowPlow, and how the role of HIve is changingHow we use Hive at SnowPlow, and how the role of HIve is changing
How we use Hive at SnowPlow, and how the role of HIve is changingyalisassoon
 
A taste of Snowplow Analytics data
A taste of Snowplow Analytics dataA taste of Snowplow Analytics data
A taste of Snowplow Analytics dataRobert Kingston
 
Modelling event data in look ml
Modelling event data in look mlModelling event data in look ml
Modelling event data in look mlyalisassoon
 

More Related Content

What's hot

2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modeling2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modelingyalisassoon
 
Snowplow, Metail and Cascalog
Snowplow, Metail and CascalogSnowplow, Metail and Cascalog
Snowplow, Metail and CascalogRobert Boland
 
How to evolve your analytics stack with your business using Snowplow
How to evolve your analytics stack with your business using SnowplowHow to evolve your analytics stack with your business using Snowplow
How to evolve your analytics stack with your business using SnowplowGiuseppe Gaviani
 
Snowplow is at the core of everything we do
Snowplow is at the core of everything we doSnowplow is at the core of everything we do
Snowplow is at the core of everything we doyalisassoon
 
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016yalisassoon
 
Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...yalisassoon
 
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, HadoopMonitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, HadoopSenthil Pandurangan
 
Snowplow: open source game analytics powered by AWS
Snowplow: open source game analytics powered by AWSSnowplow: open source game analytics powered by AWS
Snowplow: open source game analytics powered by AWSGiuseppe Gaviani
 
Snowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your businessSnowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your businessyalisassoon
 
Snowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.comSnowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.comyalisassoon
 
Snowplow: where we came from and where we are going - March 2016
Snowplow: where we came from and where we are going - March 2016Snowplow: where we came from and where we are going - March 2016
Snowplow: where we came from and where we are going - March 2016yalisassoon
 
Implementing improved and consistent arbitrary event tracking company-wide us...
Implementing improved and consistent arbitrary event tracking company-wide us...Implementing improved and consistent arbitrary event tracking company-wide us...
Implementing improved and consistent arbitrary event tracking company-wide us...yalisassoon
 
PayPal Real Time Analytics
PayPal  Real Time AnalyticsPayPal  Real Time Analytics
PayPal Real Time AnalyticsAnil Madan
 
Data to Drive Decision-Making - CaliStream Meetup
Data to Drive Decision-Making - CaliStream MeetupData to Drive Decision-Making - CaliStream Meetup
Data to Drive Decision-Making - CaliStream MeetupJerome Boulon
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in MotionRuhani Arora
 
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...Amazon Web Services
 
Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3Snowplow Analytics
 

What's hot (20)

2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modeling2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modeling
 
Snowplow, Metail and Cascalog
Snowplow, Metail and CascalogSnowplow, Metail and Cascalog
Snowplow, Metail and Cascalog
 
How to evolve your analytics stack with your business using Snowplow
How to evolve your analytics stack with your business using SnowplowHow to evolve your analytics stack with your business using Snowplow
How to evolve your analytics stack with your business using Snowplow
 
Snowplow is at the core of everything we do
Snowplow is at the core of everything we doSnowplow is at the core of everything we do
Snowplow is at the core of everything we do
 
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
 
Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...
 
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, HadoopMonitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
 
Snowplow: open source game analytics powered by AWS
Snowplow: open source game analytics powered by AWSSnowplow: open source game analytics powered by AWS
Snowplow: open source game analytics powered by AWS
 
Snowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your businessSnowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your business
 
Snowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.comSnowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.com
 
Snowplow: where we came from and where we are going - March 2016
Snowplow: where we came from and where we are going - March 2016Snowplow: where we came from and where we are going - March 2016
Snowplow: where we came from and where we are going - March 2016
 
Implementing improved and consistent arbitrary event tracking company-wide us...
Implementing improved and consistent arbitrary event tracking company-wide us...Implementing improved and consistent arbitrary event tracking company-wide us...
Implementing improved and consistent arbitrary event tracking company-wide us...
 
PayPal Real Time Analytics
PayPal  Real Time AnalyticsPayPal  Real Time Analytics
PayPal Real Time Analytics
 
Data to Drive Decision-Making - CaliStream Meetup
Data to Drive Decision-Making - CaliStream MeetupData to Drive Decision-Making - CaliStream Meetup
Data to Drive Decision-Making - CaliStream Meetup
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in Motion
 
Clickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache SparkClickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache Spark
 
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
 
Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3
 

Similar to Big Data Beers - Introducing Snowplow

Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Demi Ben-Ari
 
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Codemotion
 
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics PlatformWSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics PlatformWSO2
 
Integrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsIntegrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsDamien Dallimore
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Demi Ben-Ari
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaAlexander Dean
 
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...Big Data Spain
 
[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_DataNAVER D2
 
A Global Source of Truth for the Microservices Generation
A Global Source of Truth for the Microservices GenerationA Global Source of Truth for the Microservices Generation
A Global Source of Truth for the Microservices GenerationBen Stopford
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Codemotion
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Demi Ben-Ari
 
WSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics PlatformWSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics PlatformWSO2
 
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareMaking Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareData Con LA
 
WSO2 Workshop Sydney 2016 - Analytics
WSO2 Workshop Sydney 2016 -  AnalyticsWSO2 Workshop Sydney 2016 -  Analytics
WSO2 Workshop Sydney 2016 - AnalyticsDassana Wijesekara
 
Introduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 UpdateIntroduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 UpdateSrinath Perera
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming VisualizationGuido Schmutz
 
Apache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsApache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsANKIT GUPTA
 
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake EditionFrom BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake EditionRittman Analytics
 

Similar to Big Data Beers - Introducing Snowplow (20)

Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
 
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
 
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics PlatformWSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
 
Integrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsIntegrating Splunk into your Spring Applications
Integrating Splunk into your Spring Applications
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in Scala
 
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 
Google Cloud Dataflow
Google Cloud DataflowGoogle Cloud Dataflow
Google Cloud Dataflow
 
[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data
 
A Global Source of Truth for the Microservices Generation
A Global Source of Truth for the Microservices GenerationA Global Source of Truth for the Microservices Generation
A Global Source of Truth for the Microservices Generation
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
 
WSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics PlatformWSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
 
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareMaking Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
 
WSO2 Workshop Sydney 2016 - Analytics
WSO2 Workshop Sydney 2016 -  AnalyticsWSO2 Workshop Sydney 2016 -  Analytics
WSO2 Workshop Sydney 2016 - Analytics
 
Introduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 UpdateIntroduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 Update
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 
Apache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsApache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analytics
 
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake EditionFrom BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition
 

More from Alexander Dean

Asynchronous micro-services and the unified log
Asynchronous micro-services and the unified logAsynchronous micro-services and the unified log
Asynchronous micro-services and the unified logAlexander Dean
 
What Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registriesWhat Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registriesAlexander Dean
 
Snowplow New York City Meetup #2
Snowplow New York City Meetup #2Snowplow New York City Meetup #2
Snowplow New York City Meetup #2Alexander Dean
 
Introducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricIntroducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricAlexander Dean
 
AWS User Group UK: Why your company needs a unified log
AWS User Group UK: Why your company needs a unified logAWS User Group UK: Why your company needs a unified log
AWS User Group UK: Why your company needs a unified logAlexander Dean
 
Span Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logSpan Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logAlexander Dean
 
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...Alexander Dean
 

More from Alexander Dean (7)

Asynchronous micro-services and the unified log
Asynchronous micro-services and the unified logAsynchronous micro-services and the unified log
Asynchronous micro-services and the unified log
 
What Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registriesWhat Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registries
 
Snowplow New York City Meetup #2
Snowplow New York City Meetup #2Snowplow New York City Meetup #2
Snowplow New York City Meetup #2
 
Introducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricIntroducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabric
 
AWS User Group UK: Why your company needs a unified log
AWS User Group UK: Why your company needs a unified logAWS User Group UK: Why your company needs a unified log
AWS User Group UK: Why your company needs a unified log
 
Span Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logSpan Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified log
 
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
 

Recently uploaded

DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkTimothy Spann
 
maximum subarray ppt for killing camp students
maximum subarray ppt for killing camp studentsmaximum subarray ppt for killing camp students
maximum subarray ppt for killing camp studentsssuser82c38d
 
100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTS
100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTS100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTS
100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTSi-engage
 
Essence of Requirements Engineering: Pragmatic Insights for 2024
Essence of Requirements Engineering: Pragmatic Insights for 2024Essence of Requirements Engineering: Pragmatic Insights for 2024
Essence of Requirements Engineering: Pragmatic Insights for 2024Asher Sterkin
 
Getting Started with Trello for Beginners.pptx
Getting Started with Trello for Beginners.pptxGetting Started with Trello for Beginners.pptx
Getting Started with Trello for Beginners.pptxmavinoikein
 
The Age of AI: Elevating Experiences & Delivering Customer Value!
The Age of AI: Elevating Experiences & Delivering Customer Value!The Age of AI: Elevating Experiences & Delivering Customer Value!
The Age of AI: Elevating Experiences & Delivering Customer Value!ISPMAIndia
 
Embracing Change - The Impact of Generative AI on Strategic Portfolio Management
Embracing Change - The Impact of Generative AI on Strategic Portfolio ManagementEmbracing Change - The Impact of Generative AI on Strategic Portfolio Management
Embracing Change - The Impact of Generative AI on Strategic Portfolio ManagementOnePlan Solutions
 
AI Product Management by Abhijit Bendigiri
AI Product Management by Abhijit BendigiriAI Product Management by Abhijit Bendigiri
AI Product Management by Abhijit BendigiriISPMAIndia
 
P1 Inspection Types in Municity 5 Smartsheet
P1 Inspection Types in Municity 5 SmartsheetP1 Inspection Types in Municity 5 Smartsheet
P1 Inspection Types in Municity 5 SmartsheetMatthewTHawley
 
killing camp week 6 problem - maximal matrix.pdf
killing camp week 6 problem - maximal matrix.pdfkilling camp week 6 problem - maximal matrix.pdf
killing camp week 6 problem - maximal matrix.pdfssuser82c38d
 
SPM 2024 – Overview of and benefits of AI in Product Management
SPM 2024 – Overview of and benefits of AI in Product ManagementSPM 2024 – Overview of and benefits of AI in Product Management
SPM 2024 – Overview of and benefits of AI in Product ManagementISPMAIndia
 
OpenChain AI Study Group - North America and Europe - 2024-02-20
OpenChain AI Study Group - North America and Europe - 2024-02-20OpenChain AI Study Group - North America and Europe - 2024-02-20
OpenChain AI Study Group - North America and Europe - 2024-02-20Shane Coughlan
 
"Taking an idea to a Product in Health diagnostics" by Dr. Geetha Manjunath, ...
"Taking an idea to a Product in Health diagnostics" by Dr. Geetha Manjunath, ..."Taking an idea to a Product in Health diagnostics" by Dr. Geetha Manjunath, ...
"Taking an idea to a Product in Health diagnostics" by Dr. Geetha Manjunath, ...ISPMAIndia
 
Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)
Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)
Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)GDSCNiT
 
Sql server types of joins with example.pptx
Sql server types of joins with example.pptxSql server types of joins with example.pptx
Sql server types of joins with example.pptxsameer gaikwad
 
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdf
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdfAUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdf
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdfAutokey
 
killingcamp longest common subsequence.pdf
killingcamp longest common subsequence.pdfkillingcamp longest common subsequence.pdf
killingcamp longest common subsequence.pdfssuser82c38d
 
The Game-Changer_ How Software Development Outsource Can Catapult Your Growth...
The Game-Changer_ How Software Development Outsource Can Catapult Your Growth...The Game-Changer_ How Software Development Outsource Can Catapult Your Growth...
The Game-Changer_ How Software Development Outsource Can Catapult Your Growth...emili denli
 
No more Dockerfiles? Buildpacks to help you ship your image!
No more Dockerfiles? Buildpacks to help you ship your image!No more Dockerfiles? Buildpacks to help you ship your image!
No more Dockerfiles? Buildpacks to help you ship your image!Anthony Dahanne
 

Recently uploaded (20)

DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
 
maximum subarray ppt for killing camp students
maximum subarray ppt for killing camp studentsmaximum subarray ppt for killing camp students
maximum subarray ppt for killing camp students
 
100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTS
100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTS100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTS
100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTS
 
Essence of Requirements Engineering: Pragmatic Insights for 2024
Essence of Requirements Engineering: Pragmatic Insights for 2024Essence of Requirements Engineering: Pragmatic Insights for 2024
Essence of Requirements Engineering: Pragmatic Insights for 2024
 
Getting Started with Trello for Beginners.pptx
Getting Started with Trello for Beginners.pptxGetting Started with Trello for Beginners.pptx
Getting Started with Trello for Beginners.pptx
 
The Age of AI: Elevating Experiences & Delivering Customer Value!
The Age of AI: Elevating Experiences & Delivering Customer Value!The Age of AI: Elevating Experiences & Delivering Customer Value!
The Age of AI: Elevating Experiences & Delivering Customer Value!
 
Embracing Change - The Impact of Generative AI on Strategic Portfolio Management
Embracing Change - The Impact of Generative AI on Strategic Portfolio ManagementEmbracing Change - The Impact of Generative AI on Strategic Portfolio Management
Embracing Change - The Impact of Generative AI on Strategic Portfolio Management
 
AI Product Management by Abhijit Bendigiri
AI Product Management by Abhijit BendigiriAI Product Management by Abhijit Bendigiri
AI Product Management by Abhijit Bendigiri
 
P1 Inspection Types in Municity 5 Smartsheet
P1 Inspection Types in Municity 5 SmartsheetP1 Inspection Types in Municity 5 Smartsheet
P1 Inspection Types in Municity 5 Smartsheet
 
killing camp week 6 problem - maximal matrix.pdf
killing camp week 6 problem - maximal matrix.pdfkilling camp week 6 problem - maximal matrix.pdf
killing camp week 6 problem - maximal matrix.pdf
 
SPM 2024 – Overview of and benefits of AI in Product Management
SPM 2024 – Overview of and benefits of AI in Product ManagementSPM 2024 – Overview of and benefits of AI in Product Management
SPM 2024 – Overview of and benefits of AI in Product Management
 
OpenChain AI Study Group - North America and Europe - 2024-02-20
OpenChain AI Study Group - North America and Europe - 2024-02-20OpenChain AI Study Group - North America and Europe - 2024-02-20
OpenChain AI Study Group - North America and Europe - 2024-02-20
 
"Taking an idea to a Product in Health diagnostics" by Dr. Geetha Manjunath, ...
"Taking an idea to a Product in Health diagnostics" by Dr. Geetha Manjunath, ..."Taking an idea to a Product in Health diagnostics" by Dr. Geetha Manjunath, ...
"Taking an idea to a Product in Health diagnostics" by Dr. Geetha Manjunath, ...
 
eLearning Content Development Company Code and Pixels.pdf
eLearning Content Development Company Code and Pixels.pdfeLearning Content Development Company Code and Pixels.pdf
eLearning Content Development Company Code and Pixels.pdf
 
Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)
Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)
Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)
 
Sql server types of joins with example.pptx
Sql server types of joins with example.pptxSql server types of joins with example.pptx
Sql server types of joins with example.pptx
 
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdf
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdfAUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdf
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdf
 
killingcamp longest common subsequence.pdf
killingcamp longest common subsequence.pdfkillingcamp longest common subsequence.pdf
killingcamp longest common subsequence.pdf
 
The Game-Changer_ How Software Development Outsource Can Catapult Your Growth...
The Game-Changer_ How Software Development Outsource Can Catapult Your Growth...The Game-Changer_ How Software Development Outsource Can Catapult Your Growth...
The Game-Changer_ How Software Development Outsource Can Catapult Your Growth...
 
No more Dockerfiles? Buildpacks to help you ship your image!
No more Dockerfiles? Buildpacks to help you ship your image!No more Dockerfiles? Buildpacks to help you ship your image!
No more Dockerfiles? Buildpacks to help you ship your image!
 

Big Data Beers - Introducing Snowplow

  • 1. Introducing Snowplow Big Data Beers, Berlin Huge thanks to Zalando for hosting!
  • 2. Snowplow is an open-source web and event analytics platform, first version released in early 2012 • Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business in 2008 • After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy • We released Snowplow as a skunkworks prototype at start of 2012: github.com/snowplow/snowplow • We started working full time on Snowplow in summer 2013
  • 3. At Keplar, we grew frustrated by significant limitations in traditional web analytics programs Data collection Data processing Data access • Sample-based (e.g. Google Analytics) • Limited set of events e.g. page views, goals, transactions • Limited set of ways of describing events (custom dim 1, custom dim 2…) • Data is processed ‘once’ • No validation • No opportunity to reprocess e.g. following update to business rules • Data is aggregated prematurely • Only particular combinations of metrics / dimensions can be pivoted together (Google Analytics) • Only particular type of analysis are possible on different types of dimension (e.g. sProps, eVars, conversion goals in SiteCatalyst • Data is either aggregated (e.g. Google Analytics), or available as a complete log file for a fee (e.g. Adobe SiteCatalyst) • As a result, data is siloed: hard to join with other data sets
  • 4. And we saw the potential of new “big data” technologies and services to solve these problems in a scalable, low-cost manner CloudFront Amazon S3 Amazon EMR Amazon Redshift These tools make it possible to capture, transform, store and analyse all your granular, event-level data, to you can perform any analysis
  • 5. We wanted to take a fresh approach to web analytics • Your own web event data -> in your own data warehouse • Your own event data model • Slice / dice and mine the data in highly bespoke ways to answer your specific business questions • Plug in the broadest possible set of analysis tools to drive value from your data Data pipeline Data warehouse Analyse your data in any analysis tool
  • 6. Early on, we made a crucial decision: Snowplow should be composed of a set of loosely coupled subsystems 1. Trackers 2. Collectors A B 3. Enrich C 4. Storage D 5. Analytics Generate event data from any environment Launched with: • JavaScript tracker Log raw events from trackers Launched with: • CloudFront collector Validate and enrich raw events Launched with: • HiveQL + Java UDF-based enrichment D = Standardised data protocols Store enriched events ready for analysis Launched with: • Amazon S3 Analyze enriched events Launched with: • HiveQL recipes These turned out to be critical to allowing us to evolve the above stack
  • 7. Our initial skunkworks version of Snowplow – it was basic but it worked, and we started getting traction Website / webapp Snowplow data pipeline v1 (spring 2012) CloudFront-based pixel collector HiveQL + Java UDF “ETL” Amazon S3 JavaScript event tracker
  • 8. What did people start using it for? Warehousing their web event data To enable… Agile aka ad hoc analytics Marketing attribution modelling Customer lifetime value calculations Customer churn detection RTB fraud Product recommendations
  • 9. Current Snowplow design and architecture
  • 10. Our protocol-first, loosely-coupled approach made it possible to start swapping out existing components… Website / webapp Snowplow data pipeline v2 (spring 2013) CloudFront-based event collector Scalding-based enrichment JavaScript event tracker HiveQL + Java UDF “ETL” Amazon S3 Amazon Redshift / PostgreSQL or Clojure-based event collector
  • 11. Our protocol-first, loosely-coupled approach made it possible to start swapping out existing components… Website / webapp Snowplow data pipeline v2 (spring 2013) CloudFront-based event collector Scalding-based enrichment JavaScript event tracker HiveQL + Java UDF “ETL” Amazon S3 Amazon Redshift / PostgreSQL or Clojure-based event collector • Allow Snowplow users to set a third-party cookie with a user ID • Important for ad networks, widget companies, multi-domain retailers • Because Snowplow users wanted a much faster query loop than HiveQL/MapReduc e • We wanted a robust, feature-rich framework for managing validations, enrichments etc
  • 12. So far we have open-sourced a number of different trackers – with more planned Production ready: • JavaScript • No-JavaScript (image beacon) • Python • Lua • Arduino Beta: • Ruby • iOS • Android • Node.js In development: • .NET • PHP
  • 13. Enrichment process: what is Scalding? • Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop: Scalding Cascalog PyCascading cascading. jruby Cascading Hive Pig Java Hadoop MapReduce Hadoop DFS
  • 14. Our “enrichment process” (formerly known as ETL) actually does two things: validation and enrichment • Our validation model looks like this: Raw events “Bad” raw events + reasons why they are bad Enrichment Manager “Good” enriched events • Under the covers, we use a lot of monadic Scala (Scalaz) code
  • 15. Adding the enrichments that web analysts expect = very important to Snowplow uptake • Web analysts are used to a very specific set of enrichments from Google Analytics, Site Catalyst etc • These enrichments have evolved over the past 15-20 years and are very domain specific: • Page querystring -> marketing campaign information (utm_ fields) • Referer data -> search engine name, country, keywords • IP address -> geographical location • Useragent -> browser, OS, computer information
  • 17. There are three big aspects to Snowplow’s 2014 roadmap 1. Make Snowplow work for non-web (e.g. mobile, IoT) environments as well as the web – RELEASED 2. Make Snowplow work with users’ JSON events as well as with our pre-defined events (aka page views, ecommerce transactions etc) – RELEASED 3. Move Snowplow away from an S3-based data pipeline to a unified log (Kinesis/Kafka)-based data pipeline – ONGOING 
  • 18. Snowplow is developing into an event analytics platform (not just a web analytics platform) Data warehouse Collect event data from any connected device
  • 19. Web analysts work with a small number of event types – outside of web, the number of possible event types is… infinite Web events • Page view • Page activity • Order • Add to basket All events • Game saved • Car started • Machine broke • Spellcheck run • Fridge empty • Screenshot taken • App crashed • SMS sent • Disk full • Screen viewed • Player died • Tweet drafted • Till opened • Product returned ∞ • Taxi arrived • Cluster started • Phonecall ended
  • 20. As we get further away from the web, we needed to start supporting user’s own JSON events • Specifically, events represented as JSONs with arbitrary name: value pairs (arbitrary to Snowplow, not to the company using Snowplow!)
  • 21. Supporting a fixed set of web events and JSON events is a difficult problem • Almost everybody in event analytics falls on one or other side of this divide: Fixed set of web events (page views etc) + custom variables Send anything JSONs
  • 22. We wanted to bridge that divide, making it so that Snowplow comes with structured events “out of the box”, but is extensible with unstructured events Fixed set of web events (page views etc) + custom variables Send anything JSONs
  • 23. Issues with the event name: • Separate from the event properties • Not versioned • Not unique – HBO video played versus Brightcove video played Lots of unanswered questions about the properties: • Is length required, and is it always a number? • Is id required, and is it always a string? • What other optional properties are allowed for a video play? Other issues: • What if the developer accidentally starts sending “len” instead of “length”? The data will end up split across two separate fields • Why does the analyst need to keep an implicit schema in their head to analyze video played events?
  • 24. MixPanel et al cause “schema loss”
  • 25. We decided to use JSON Schema, with additional metadata about what the schema represents
  • 26. From a tracker, you send in a JSON which is self-describing, with a schema header and data body
  • 27. iglu:com.channel2.vod/video_played/jsonschema/1-0-0 Schema format Event name The vendor of this event We are calling our schema repository technology Iglu Schema version Anatomy of an Iglu schema URI
  • 28. To add this to Snowplow, we developed a new schema repository called Iglu, and a shredding step in Hadoop
  • 29. JSON Schema just gives us a data structure for events – we are also evolving a grammar to capture the semantics of events Subject Direct Object Indirect Object Verb Event Context Prep. ~ Object
  • 30. In parallel, we plan to evolve Snowplow from an event analytics platform into a “digital nervous system” for data driven companies • The event data fed into Snowplow is written into a “Unified Log” • This becomes the “single source of truth”, upstream from the datawarehouse • The same source of truth is used for real-time data processing as analytics e.g. • Product recommendations • Ad targeting • Real-time website personalisation • Systems monitoring Snowplow will drive data-driven processes as well as off-line analytics
  • 31. Some background on unified log based architectures CLOUD VENDOR / OWN DATA CENTER Search Silo SOME LOW LATENCY LOCAL LOOPS E-comm Silo CRM SAAS VENDOR #2 Email marketing ERP Silo CMS Silo SAAS VENDOR #1 NARROW DATA SILOES Streaming APIs / web hooks Unified log Archiving Hadoop < WIDE DATA COVERAGE > < FULL DATA HISTORY > Systems monitoring Eventstream HIGH LATENCY LOW LATENCY Product rec’s Ad hoc analytics Management reporting Fraud detection Churn prevention APIs
  • 32. We are part way through our Kinesis support, with additional components being released soon Scala Stream Collector Raw event stream Enrich Kinesis app Bad raw events stream Enriched event stream S3 Redshift S3 sink Kinesis app Redshift sink Kinesis app Snowplow Trackers • The parts in grey are still under development – we are working with Snowplow community members on these collaboratively • We are also starting work on support for Apache Kafka alongside Kinesis – for users who don’t want to run Snowplow on AWS
  • 33. Questions? ulogprugcf (43% off Unified Log Processing eBook) http://snowplowanalytics.com https://github.com/snowplow/snowplow @snowplowdata I am in Berlin tomorrow – to meet up or chat, @alexcrdean on Twitter or alex@snowplowanalytics.com