Snowplow: evolve your analytics stack with your business
Snowplow Meetup Tel Aviv July 2016
Hello! I’m Yali
• Co-founder at Snowplow: open source event data pipeline
• Analytics Lead, focused on business analytics
I work with our clients so they get more out of their data
• Marketing / customer analytics: how do we engage users better?
• Product analytics: how do we improve our user-facing products?
• Content / merchandise analytics:
• How do we write/produce/buy better content?
• How do we optimise the use of our existing content?
Self-describing data + event data modeling = an event data pipeline that evolves with your business
Self-describing data
Overview
Event data varies widely by company
As a Snowplow user, you can define your own events and entities
Events:
• Build castle • Form alliance • Declare war
• View product • Buy product • Deliver product
Entities (contexts):
• Player • Game • Level • Currency
• Product • Customer • Basket • Delivery van
You then define a schema for each event and entity
{
"$schema":
"http://iglucentral.com/schemas/com.snowplowanalytics.self
-desc/schema/jsonschema/1-0-0#",
"description": "Schema for a fighter context",
"self": {
"vendor": "com.ufc",
"name": "fighter_context",
"format": "jsonschema",
"version": "1-0-1"
},
"type": "object",
"properties": {
"FirstName": {
"type": "string"
},
"LastName": {
"type": "string"
},
"Nickname": {
"type": "string"
},
"FacebookProfile": {
"type": "string"
},
"TwitterName": {
"type": "string"
},
"GooglePlusProfile": {
"type": "string"
},
"HeightFormat": {
"type": "string"
},
"HeightCm": {
"type": ["integer", "null"]
},
"Weight": {
"type": ["integer", "null"]
},
"WeightKg": {
"type": ["integer", "null"]
},
"Record": {
"type": "string",
"pattern": "^[0-9]+-[0-9]+-[0-9]+$"
},
"Striking": {
"type": ["number", "null"],
"maxdecimal": 15
},
"Takedowns": {
"type": ["number", "null"],
"maxdecimal": 15
},
"Submissions": {
"type": ["number", "null"],
"maxdecimal": 15
},
"LastFightUrl": {
"type": "string"
},
"LastFightEventText": {
"type": "string"
},
"NextFightUrl": {
"type": "string"
},
"NextFightEventText": {
"type": "string"
},
"LastFightDate": {
"type": "string",
"format": "timestamp"
}
},
"additionalProperties": false
}
Upload the schema to Iglu
Then send data into Snowplow as self-describing JSONs
{
"schema": "iglu:com.israel365/temperature_measure/jsonschema/1-0-0",
"data": {
"timestamp": "2016-07-11 17:53:21",
"location": "Tel-Aviv",
"temperature": 32,
"units": "Centigrade"
}
}
{
"$schema":
"http://iglucentral.com/schemas/com.snowplowanalytics.self-
desc/schema/jsonschema/1-0-0#",
"description": "Schema for an ad impression
event",
"self": {
"vendor": “com.israel365",
"name": “temperature_measure",
"format": "jsonschema",
"version": "1-0-0"
},
"type": "object",
"properties": {
"timestamp": {
"type": "string"
},
"location": {
"type": "string"
},
…
},
The event (left) carries a schema reference that points to its schema (right); a sketch of sending such an event follows below.
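As a rough sketch of the sending step (the collector endpoint and app_id below are assumptions, and constructor arguments vary between versions of the Snowplow Python Tracker), the temperature event above could be tracked like this:

# Rough sketch: sending the self-describing temperature event with the
# Snowplow Python Tracker. Endpoint and app_id are illustrative; constructor
# details differ between tracker versions.
from snowplow_tracker import Tracker, Emitter, SelfDescribingJson

emitter = Emitter("collector.example.com")      # your Snowplow collector
tracker = Tracker(emitter, app_id="weather-app")

tracker.track_self_describing_event(SelfDescribingJson(
    "iglu:com.israel365/temperature_measure/jsonschema/1-0-0",
    {
        "timestamp": "2016-07-11 17:53:21",
        "location": "Tel-Aviv",
        "temperature": 32,
        "units": "Centigrade",
    },
))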
The schemas can then be used in a number of ways
• Validate the data (important for data quality; see the sketch below)
• Load the data into tidy tables in your data warehouse
• Make it easy / safe to write downstream data processing applications (for real-time users)
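The validation step can be sketched with the generic Python jsonschema library; the "required" list and field types below are assumptions, and Snowplow's own pipeline performs this validation for you against the schema stored in Iglu.

# Rough sketch: validating an incoming event body against its JSON Schema.
# Field types and the "required" list are illustrative assumptions.
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "timestamp":   {"type": "string"},
        "location":    {"type": "string"},
        "temperature": {"type": "number"},
        "units":       {"type": "string"},
    },
    "required": ["timestamp", "location", "temperature"],
    "additionalProperties": False,
}

event = {
    "timestamp": "2016-07-11 17:53:21",
    "location": "Tel-Aviv",
    "temperature": 32,
    "units": "Centigrade",
}

jsonschema.validate(event, schema)  # raises jsonschema.ValidationError on bad data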
Event data modeling
Overview
What is event data modeling?
Event data modeling is the process of using business logic to aggregate over event-level data to produce 'modeled' data that is easier to query.
Unmodeled data: immutable, unopinionated, hard to consume, not contentious.
Modeled data: mutable, opinionated, easy to consume, may be contentious.
In general, event data modeling is performed on the complete event stream
• Late-arriving events can change the way you understand earlier-arriving events
• If we change our data models, this gives us the flexibility to recompute historical data based on the new model (see the sessionization sketch below)
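A minimal sketch of such a modeling step, assuming events are plain dicts with user_id and timestamp fields: sessionize the complete stream with a 30-minute inactivity rule. Because it runs over all events, changing the rule and re-running it recomputes the derived data for all of history under the new logic.

# Minimal sessionization sketch over the complete event stream.
# A session ends after 30 minutes of inactivity; re-running with different
# logic recomputes the derived "sessions" data for all of history.
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def sessionize(events):
    """events: iterable of dicts with 'user_id' and ISO-format 'timestamp'."""
    sessions = {}          # (user_id, session_index) -> list of events
    last_seen = {}         # user_id -> timestamp of their previous event
    session_index = {}     # user_id -> current session number
    for event in sorted(events, key=lambda e: (e["user_id"], e["timestamp"])):
        user = event["user_id"]
        ts = datetime.fromisoformat(event["timestamp"])
        if user not in last_seen or ts - last_seen[user] > SESSION_GAP:
            session_index[user] = session_index.get(user, 0) + 1
        last_seen[user] = ts
        sessions.setdefault((user, session_index[user]), []).append(event)
    return sessions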
The evolving event data pipeline
How do we handle pipeline evolution?
• Push factors: what is being tracked will change over time
• Pull factors: what questions are being asked of the data will change over time
Businesses are not static, so event pipelines should not be either
Push example: a new source of event data
• If data is self-describing, it is easy to add additional sources
• Self-describing data is good for managing bad data and pipeline evolution
For example, an email send event might describe itself as: "I'm an email send event and I have information about the recipient (email address, customer ID) and the email (id, tags, variation)." A tracking sketch follows below.
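A hedged sketch of what tracking this new source might look like with the Snowplow Python Tracker (the schemas, vendor, and field names here are hypothetical, and constructor details vary by tracker version): the email send is an event, and the recipient rides along as a context entity.

# Hypothetical sketch: a new "email send" event source, with the recipient
# attached as a context entity. Schemas and fields are illustrative only.
from snowplow_tracker import Tracker, Emitter, SelfDescribingJson

tracker = Tracker(Emitter("collector.example.com"), app_id="email-service")

email_send = SelfDescribingJson(
    "iglu:com.acme/email_send/jsonschema/1-0-0",   # hypothetical event schema
    {"email_id": "welcome-01", "tags": ["onboarding"], "variation": "B"},
)
recipient = SelfDescribingJson(
    "iglu:com.acme/recipient/jsonschema/1-0-0",    # hypothetical entity schema
    {"email_address": "jane@example.com", "customer_id": "c-123"},
)

# Because the event carries its own schema reference, the new source plugs into
# the same pipeline without touching existing tracking.
tracker.track_self_describing_event(email_send, context=[recipient])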
Pull example: a new business question
Answering the question: 3 possibilities
1. Existing data model supports the answer: it is possible to answer the question with the existing modeled data.
2. Need to update the data model: the data collected already supports the answer, but additional computation (additional logic) is required in the data modeling step.
3. Need to update the data model and data collection: event tracking needs to be extended, and the data models updated to incorporate the additional data (and potentially additional logic).
Self-describing data and the ability to recompute data models are essential to enable pipeline evolution
Self-describing data lets you:
• Update existing events and entities in a backwards-compatible way, e.g. add optional new fields (see the versioning sketch below)
• Update existing events and entities in a backwards-incompatible way, e.g. change field types, remove fields, add compulsory fields
• Add new event and entity types
Recomputing data models on the entire data set lets you:
• Add new columns to existing derived tables, e.g. a new audience segmentation
• Change the way existing derived tables are generated, e.g. change sessionization logic
• Create new derived tables
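To make the schema side concrete, here is a rough, hypothetical illustration (the vendor and fields are made up) of backwards-compatible versus breaking evolution under SchemaVer, where versions read MODEL-REVISION-ADDITION:

# Hypothetical sketch of schema evolution under SchemaVer (MODEL-REVISION-ADDITION).
v1_0_0 = {
    "self": {"vendor": "com.acme", "name": "email_send",
             "format": "jsonschema", "version": "1-0-0"},
    "type": "object",
    "properties": {"email_id": {"type": "string"}},
    "required": ["email_id"],
    "additionalProperties": False,
}

# Backwards compatible: add an *optional* field and bump the ADDITION digit
# (1-0-0 -> 1-0-1). Old events still validate; derived tables gain a column.
v1_0_1 = {
    **v1_0_0,
    "self": {**v1_0_0["self"], "version": "1-0-1"},
    "properties": {**v1_0_0["properties"], "variation": {"type": "string"}},
}

# Backwards incompatible: removing a field, changing a type, or making a field
# compulsory breaks old data, so the MODEL digit is bumped instead (-> 2-0-0).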