SlideShare a Scribd company logo
1 of 24
Understanding event data
Data Insights, Cambridge April 2015
In the beginning…
3 years ago, we were pretty frustrated…
Product / platform development decisions
But…
• Hard to identify patterns in user behavior
• Hard to identify good / bad engagements
• We were using tools (GA / Adobe) to answer questions that the tools were not
designed to support
Cloud services + open source big data technology -> we can
collect and warehouse the event-level data in a linearly scalable
way…
…and perform any analysis we want on it
Snowplow v1 was born…
• Every event on your website, represented as a line of data in your own data
warehouse (on EMR)
• Data queried mostly using Apache Hive, but opportunity to use any Hadoop-based
data processing framework
Lots of flexibility to perform more involved / advanced web analytics
Is web analytics data a subset of a broader category of digital
event data?
• Stream of events describe “what has happened” over time
• High volume of data: one line of data per event, potentially 1000s of events per
second
• In all cases, having the complete, event-level dataset available for analysis provides
the possibility of using the data to answer a very broad set of questions
Could we extend the pipeline we built for web data to encompass digital
event data more generally?
Yes! Just build more trackers…
…BUT – what about the structure of the data? Doesn’t this vary
by event type? And aren’t we now looking at many more event
types?
Web events
All events
• Page view • Order • Add to basket• Page activity
• Game saved • Machine broke• Car started
• Spellcheck run • Screenshot taken• Fridge empty
• App crashed • Disk full• SMS sent
• Screen viewed • Tweet drafted• Player died
• Taxi arrived • Phonecall ended• Cluster started
• Till opened • Product returned
There are two historic approaches to dealing with the explosion
of possible event types
Web analytics vendors Mobile and app analytics vendors
Custom Variables Schema-less JSONs
Custom variables are very restrictive
1. Take a standard web event, like a page view:
2. and add custom variables until it becomes something totally different:
= a “taxi arrived” event, kind of!
Page View
Page View vehicle=taxi23 status=arrived+ +
Schema-less JSONs are better, but they have a different set of
problems
Issues with the event name:
• Separate from the event properties
• Not versioned
• Not unique – HBO video played
versus Brightcove video played
Lots of unanswered questions about the
properties:
• Is length required, and is it always a
number?
• Is id required, and is it always a string?
• What other optional properties are
allowed for a video play?
Other issues:
• What if the developer
accidentally starts sending
“len” instead of “length”? The
data will end up split across
two separate fields
• Why does the analyst need to
keep an implicit schema in
their head to analyze video
played events?
Our approach: schema our
JSONs
When a developer or analyst defines a new event in JSON, let’s
ask them to create a JSON Schema for that event
Additional optional field we might
not know about otherwise
No other fields
allowed
Yes length should always be a
number
But we need to let our event definitions evolve, so let’s
add versioning – we’re calling this SchemaVer
MODEL-REVISION-ADDITION
• Start versioning at 1-0-0 – so 1-0-0, 1-0-1, 1-0-2, 1-1-0 etc
• Try to stick to backwards-compatible ADDITION upgrades as much
as possible
We make the event JSONs self-describing, with a schema header
and data body
The schema field determines
where the schema can be found in
Iglu, our schema repository
Schemas are namespaced…
For each event being processed, we can ‘look up’ the schema
from the repo and use it to drive event validation and loading of
event data into structured data stores
Schema repo{}
Being able to load data into multiple different stores is very
valuable
• Different data stores support different types of analyses
• SQL DBS -> pivoting / OLAP
• Elasticsearch -> search, simple OLAP
• Graph databases -> pathing analysis
• Many of these data stores (all except elasticsearch) are ‘structured’
• Having the data pass through the pipeline in a schemaed format means we do not
need to manually structure it ourselves, which is expensive and error-prone
We are working on a second, real-time version of the Snowplow
data pipeline
Batch:
Real-time:
Requests logged to S3
Requests logged to Amazon
Kinesis and Kafka
Event data processed in
Scalding / EMR and SQL
Event data processing using
Kinesis Client Library / Samza
Kinesis and Kafka enable us to publish
and consume events from a
distributed stream in a real-time,
robust and scalable way
With the real-time pipeline, event data can be fed into data-
driven applications alongside the datawarehouse
CLOUD VENDOR / OWN DATA CENTER
Search
Silo
SOME LOW LATENCY LOCAL LOOPS
E-comm
Silo
CRM
SAAS VENDOR #2
Email
marketing
ERP
Silo
CMS
Silo
SAAS VENDOR #1
NARROW DATA SILOES
Streaming APIs /
web hooks
Unified log
LOW LATENCY WIDE DATA
COVERAGE
< WIDE DATA
COVERAGE >
< FULL DATA
HISTORY >
FEW DAYS’
DATA HISTORY
Systems
monitoring
Eventstream
HIGH LATENCY LOW LATENCY
Product rec’s
Ad hoc
analytics
Management
reporting
Fraud
detection
Churn
prevention
APIs
Data
warehouse
This is exciting for data scientists
• Build predictive models based on the complete history of
events in the datawarehouse e.g. forecast revenue for
acquired users over lifetime…
• Put those models live on the same source of truth, as the
data comes in in real-time
• Use this approach for all types of applications: fraud detection,
real-time personalization, product recommendation
Data
warehouse
RT data-driven
application
However, this only makes figuring out how to model and
describe events more important
• Lots of applications (not just offline reporting / data warehouse) fed off the
event stream
• All those applications are decoupled: downstream applications have no control
over the structure of data inputted generated upstream of them
• So the better able we are to specify (and constrain) the data structures
upstream, the easier it’ll be to write downstream applications to consume the
data
Can we specify a standard framework / structure for events?
What about a semantic model?
We can extend our self-describing JSON model to encapsulate
this semantic model…
This is something we need to put
more thought / research into
The next few months are pretty exciting
• Building out the real-time pipeline
• Encouraging an ecosystem of developers / partners to build apps to
run on the real-time stream
• Develop the semantic model / event grammar
Any
questions?

More Related Content

What's hot

Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016yalisassoon
 
How to evolve your analytics stack with your business using Snowplow
How to evolve your analytics stack with your business using SnowplowHow to evolve your analytics stack with your business using Snowplow
How to evolve your analytics stack with your business using SnowplowGiuseppe Gaviani
 
Snowplow the evolving data pipeline
Snowplow   the evolving data pipelineSnowplow   the evolving data pipeline
Snowplow the evolving data pipelineyalisassoon
 
Snowplow: open source game analytics powered by AWS
Snowplow: open source game analytics powered by AWSSnowplow: open source game analytics powered by AWS
Snowplow: open source game analytics powered by AWSGiuseppe Gaviani
 
Snowplow - Evolve your analytics stack with your business
Snowplow - Evolve your analytics stack with your businessSnowplow - Evolve your analytics stack with your business
Snowplow - Evolve your analytics stack with your businessGiuseppe Gaviani
 
Snowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your businessSnowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your businessyalisassoon
 
Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...yalisassoon
 
Snowplow, Metail and Cascalog
Snowplow, Metail and CascalogSnowplow, Metail and Cascalog
Snowplow, Metail and CascalogRobert Boland
 
Snowplow is at the core of everything we do
Snowplow is at the core of everything we doSnowplow is at the core of everything we do
Snowplow is at the core of everything we doyalisassoon
 
Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3Snowplow Analytics
 
Snowplow at DA Hub emerging technology showcase
Snowplow at DA Hub emerging technology showcaseSnowplow at DA Hub emerging technology showcase
Snowplow at DA Hub emerging technology showcaseyalisassoon
 
Simply Business and Snowplow - Multichannel Attribution Analysis
Simply Business and Snowplow - Multichannel Attribution AnalysisSimply Business and Snowplow - Multichannel Attribution Analysis
Simply Business and Snowplow - Multichannel Attribution AnalysisStewart Duncan
 
Implementing improved and consistent arbitrary event tracking company-wide us...
Implementing improved and consistent arbitrary event tracking company-wide us...Implementing improved and consistent arbitrary event tracking company-wide us...
Implementing improved and consistent arbitrary event tracking company-wide us...yalisassoon
 
Snowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.comSnowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.comyalisassoon
 
The analytics journey at Viewbix - how they came to use Snowplow and the setu...
The analytics journey at Viewbix - how they came to use Snowplow and the setu...The analytics journey at Viewbix - how they came to use Snowplow and the setu...
The analytics journey at Viewbix - how they came to use Snowplow and the setu...yalisassoon
 
Simply Business - Near Real Time Event Processing
Simply Business - Near Real Time Event ProcessingSimply Business - Near Real Time Event Processing
Simply Business - Near Real Time Event Processingidan_by
 
Snowplow: putting digital analysts at the heart of digital analytics - the fo...
Snowplow: putting digital analysts at the heart of digital analytics - the fo...Snowplow: putting digital analysts at the heart of digital analytics - the fo...
Snowplow: putting digital analysts at the heart of digital analytics - the fo...yalisassoon
 
Viewbix tracking journey
Viewbix tracking journeyViewbix tracking journey
Viewbix tracking journeyidan_by
 
In-Memory Computing Webcast. Market Predictions 2017
In-Memory Computing Webcast. Market Predictions 2017In-Memory Computing Webcast. Market Predictions 2017
In-Memory Computing Webcast. Market Predictions 2017SingleStore
 
Introducing Sauna - Decisioning and response platform from Snowplow
Introducing Sauna - Decisioning and response platform from SnowplowIntroducing Sauna - Decisioning and response platform from Snowplow
Introducing Sauna - Decisioning and response platform from SnowplowGiuseppe Gaviani
 

What's hot (20)

Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
 
How to evolve your analytics stack with your business using Snowplow
How to evolve your analytics stack with your business using SnowplowHow to evolve your analytics stack with your business using Snowplow
How to evolve your analytics stack with your business using Snowplow
 
Snowplow the evolving data pipeline
Snowplow   the evolving data pipelineSnowplow   the evolving data pipeline
Snowplow the evolving data pipeline
 
Snowplow: open source game analytics powered by AWS
Snowplow: open source game analytics powered by AWSSnowplow: open source game analytics powered by AWS
Snowplow: open source game analytics powered by AWS
 
Snowplow - Evolve your analytics stack with your business
Snowplow - Evolve your analytics stack with your businessSnowplow - Evolve your analytics stack with your business
Snowplow - Evolve your analytics stack with your business
 
Snowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your businessSnowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your business
 
Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...
 
Snowplow, Metail and Cascalog
Snowplow, Metail and CascalogSnowplow, Metail and Cascalog
Snowplow, Metail and Cascalog
 
Snowplow is at the core of everything we do
Snowplow is at the core of everything we doSnowplow is at the core of everything we do
Snowplow is at the core of everything we do
 
Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3
 
Snowplow at DA Hub emerging technology showcase
Snowplow at DA Hub emerging technology showcaseSnowplow at DA Hub emerging technology showcase
Snowplow at DA Hub emerging technology showcase
 
Simply Business and Snowplow - Multichannel Attribution Analysis
Simply Business and Snowplow - Multichannel Attribution AnalysisSimply Business and Snowplow - Multichannel Attribution Analysis
Simply Business and Snowplow - Multichannel Attribution Analysis
 
Implementing improved and consistent arbitrary event tracking company-wide us...
Implementing improved and consistent arbitrary event tracking company-wide us...Implementing improved and consistent arbitrary event tracking company-wide us...
Implementing improved and consistent arbitrary event tracking company-wide us...
 
Snowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.comSnowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.com
 
The analytics journey at Viewbix - how they came to use Snowplow and the setu...
The analytics journey at Viewbix - how they came to use Snowplow and the setu...The analytics journey at Viewbix - how they came to use Snowplow and the setu...
The analytics journey at Viewbix - how they came to use Snowplow and the setu...
 
Simply Business - Near Real Time Event Processing
Simply Business - Near Real Time Event ProcessingSimply Business - Near Real Time Event Processing
Simply Business - Near Real Time Event Processing
 
Snowplow: putting digital analysts at the heart of digital analytics - the fo...
Snowplow: putting digital analysts at the heart of digital analytics - the fo...Snowplow: putting digital analysts at the heart of digital analytics - the fo...
Snowplow: putting digital analysts at the heart of digital analytics - the fo...
 
Viewbix tracking journey
Viewbix tracking journeyViewbix tracking journey
Viewbix tracking journey
 
In-Memory Computing Webcast. Market Predictions 2017
In-Memory Computing Webcast. Market Predictions 2017In-Memory Computing Webcast. Market Predictions 2017
In-Memory Computing Webcast. Market Predictions 2017
 
Introducing Sauna - Decisioning and response platform from Snowplow
Introducing Sauna - Decisioning and response platform from SnowplowIntroducing Sauna - Decisioning and response platform from Snowplow
Introducing Sauna - Decisioning and response platform from Snowplow
 

Viewers also liked

Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015
Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015
Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015yalisassoon
 
Using Snowplow for A/B testing and user journey analysis at CustomMade
Using Snowplow for A/B testing and user journey analysis at CustomMadeUsing Snowplow for A/B testing and user journey analysis at CustomMade
Using Snowplow for A/B testing and user journey analysis at CustomMadeyalisassoon
 
Yali presentation for snowplow amsterdam meetup number 2
Yali presentation for snowplow amsterdam meetup number 2Yali presentation for snowplow amsterdam meetup number 2
Yali presentation for snowplow amsterdam meetup number 2yalisassoon
 
Modeling event data
Modeling event dataModeling event data
Modeling event datayalisassoon
 
A KPI framework for startups
A KPI framework for startupsA KPI framework for startups
A KPI framework for startupsyalisassoon
 
China high speed railway lines network-201407
China high speed railway lines network-201407China high speed railway lines network-201407
China high speed railway lines network-201407tjrgx
 
Marriott management philosophy
Marriott management philosophyMarriott management philosophy
Marriott management philosophyFawad Akhtar
 

Viewers also liked (11)

Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015
Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015
Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015
 
Using Snowplow for A/B testing and user journey analysis at CustomMade
Using Snowplow for A/B testing and user journey analysis at CustomMadeUsing Snowplow for A/B testing and user journey analysis at CustomMade
Using Snowplow for A/B testing and user journey analysis at CustomMade
 
Basic unix commands
Basic unix commandsBasic unix commands
Basic unix commands
 
Yali presentation for snowplow amsterdam meetup number 2
Yali presentation for snowplow amsterdam meetup number 2Yali presentation for snowplow amsterdam meetup number 2
Yali presentation for snowplow amsterdam meetup number 2
 
Modeling event data
Modeling event dataModeling event data
Modeling event data
 
A KPI framework for startups
A KPI framework for startupsA KPI framework for startups
A KPI framework for startups
 
China high speed railway lines network-201407
China high speed railway lines network-201407China high speed railway lines network-201407
China high speed railway lines network-201407
 
HAKQ Profile
HAKQ ProfileHAKQ Profile
HAKQ Profile
 
Solinea Lazuli Tower Project Brief
Solinea Lazuli Tower Project BriefSolinea Lazuli Tower Project Brief
Solinea Lazuli Tower Project Brief
 
Effectivnoe upravlenie personalom
Effectivnoe upravlenie personalomEffectivnoe upravlenie personalom
Effectivnoe upravlenie personalom
 
Marriott management philosophy
Marriott management philosophyMarriott management philosophy
Marriott management philosophy
 

Similar to Understanding event data

[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_DataNAVER D2
 
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...confluent
 
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent) K...
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent)  K...Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent)  K...
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent) K...confluent
 
Apache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsApache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsANKIT GUPTA
 
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics PlatformWSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics PlatformWSO2
 
How APIs are Changing Software Development
How APIs are Changing Software DevelopmentHow APIs are Changing Software Development
How APIs are Changing Software Development3scale
 
Crime Analysis & Prediction System
Crime Analysis & Prediction SystemCrime Analysis & Prediction System
Crime Analysis & Prediction SystemBigDataCloud
 
BDA307 Real-time Streaming Applications on AWS, Patterns and Use Cases
BDA307 Real-time Streaming Applications on AWS, Patterns and Use CasesBDA307 Real-time Streaming Applications on AWS, Patterns and Use Cases
BDA307 Real-time Streaming Applications on AWS, Patterns and Use CasesAmazon Web Services
 
Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...Sri Ambati
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaTreasure Data, Inc.
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming VisualizationGuido Schmutz
 
Effective Microservices In a Data-centric World
Effective Microservices In a Data-centric WorldEffective Microservices In a Data-centric World
Effective Microservices In a Data-centric WorldRandy Shoup
 
Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Eric Sammer
 
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?TechWell
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
Using AWS To Build A Scalable Machine Data Analytics Service
Using AWS To Build A Scalable Machine Data Analytics ServiceUsing AWS To Build A Scalable Machine Data Analytics Service
Using AWS To Build A Scalable Machine Data Analytics ServiceChristian Beedgen
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...Amazon Web Services
 
[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQLWSO2
 

Similar to Understanding event data (20)

[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data
 
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...
 
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent) K...
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent)  K...Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent)  K...
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent) K...
 
Apache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsApache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analytics
 
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics PlatformWSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
 
Security data deluge
Security data delugeSecurity data deluge
Security data deluge
 
How APIs are Changing Software Development
How APIs are Changing Software DevelopmentHow APIs are Changing Software Development
How APIs are Changing Software Development
 
Crime Analysis & Prediction System
Crime Analysis & Prediction SystemCrime Analysis & Prediction System
Crime Analysis & Prediction System
 
BDA307 Real-time Streaming Applications on AWS, Patterns and Use Cases
BDA307 Real-time Streaming Applications on AWS, Patterns and Use CasesBDA307 Real-time Streaming Applications on AWS, Patterns and Use Cases
BDA307 Real-time Streaming Applications on AWS, Patterns and Use Cases
 
Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with Rocana
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 
Effective Microservices In a Data-centric World
Effective Microservices In a Data-centric WorldEffective Microservices In a Data-centric World
Effective Microservices In a Data-centric World
 
Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015
 
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Using AWS To Build A Scalable Machine Data Analytics Service
Using AWS To Build A Scalable Machine Data Analytics ServiceUsing AWS To Build A Scalable Machine Data Analytics Service
Using AWS To Build A Scalable Machine Data Analytics Service
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
 
[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL
 

Recently uploaded

Best VIP Call Girls Noida Sector 40 Call Me: 8448380779
Best VIP Call Girls Noida Sector 40 Call Me: 8448380779Best VIP Call Girls Noida Sector 40 Call Me: 8448380779
Best VIP Call Girls Noida Sector 40 Call Me: 8448380779Delhi Call girls
 
HONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael HawkinsHONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael HawkinsMichael W. Hawkins
 
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptxB.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptxpriyanshujha201
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageMatteo Carbone
 
Pharma Works Profile of Karan Communications
Pharma Works Profile of Karan CommunicationsPharma Works Profile of Karan Communications
Pharma Works Profile of Karan Communicationskarancommunications
 
Regression analysis: Simple Linear Regression Multiple Linear Regression
Regression analysis:  Simple Linear Regression Multiple Linear RegressionRegression analysis:  Simple Linear Regression Multiple Linear Regression
Regression analysis: Simple Linear Regression Multiple Linear RegressionRavindra Nath Shukla
 
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒anilsa9823
 
Call Girls in Gomti Nagar - 7388211116 - With room Service
Call Girls in Gomti Nagar - 7388211116  - With room ServiceCall Girls in Gomti Nagar - 7388211116  - With room Service
Call Girls in Gomti Nagar - 7388211116 - With room Servicediscovermytutordmt
 
Cracking the Cultural Competence Code.pptx
Cracking the Cultural Competence Code.pptxCracking the Cultural Competence Code.pptx
Cracking the Cultural Competence Code.pptxWorkforce Group
 
Event mailer assignment progress report .pdf
Event mailer assignment progress report .pdfEvent mailer assignment progress report .pdf
Event mailer assignment progress report .pdftbatkhuu1
 
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...Aggregage
 
Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...
Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...
Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...lizamodels9
 
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...Lviv Startup Club
 
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...Dipal Arora
 
RSA Conference Exhibitor List 2024 - Exhibitors Data
RSA Conference Exhibitor List 2024 - Exhibitors DataRSA Conference Exhibitor List 2024 - Exhibitors Data
RSA Conference Exhibitor List 2024 - Exhibitors DataExhibitors Data
 
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLMONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLSeo
 
Call Girls In Holiday Inn Express Gurugram➥99902@11544 ( Best price)100% Genu...
Call Girls In Holiday Inn Express Gurugram➥99902@11544 ( Best price)100% Genu...Call Girls In Holiday Inn Express Gurugram➥99902@11544 ( Best price)100% Genu...
Call Girls In Holiday Inn Express Gurugram➥99902@11544 ( Best price)100% Genu...lizamodels9
 
Monthly Social Media Update April 2024 pptx.pptx
Monthly Social Media Update April 2024 pptx.pptxMonthly Social Media Update April 2024 pptx.pptx
Monthly Social Media Update April 2024 pptx.pptxAndy Lambert
 
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876dlhescort
 

Recently uploaded (20)

Best VIP Call Girls Noida Sector 40 Call Me: 8448380779
Best VIP Call Girls Noida Sector 40 Call Me: 8448380779Best VIP Call Girls Noida Sector 40 Call Me: 8448380779
Best VIP Call Girls Noida Sector 40 Call Me: 8448380779
 
HONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael HawkinsHONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael Hawkins
 
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptxB.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usage
 
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabiunwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
 
Pharma Works Profile of Karan Communications
Pharma Works Profile of Karan CommunicationsPharma Works Profile of Karan Communications
Pharma Works Profile of Karan Communications
 
Regression analysis: Simple Linear Regression Multiple Linear Regression
Regression analysis:  Simple Linear Regression Multiple Linear RegressionRegression analysis:  Simple Linear Regression Multiple Linear Regression
Regression analysis: Simple Linear Regression Multiple Linear Regression
 
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒
 
Call Girls in Gomti Nagar - 7388211116 - With room Service
Call Girls in Gomti Nagar - 7388211116  - With room ServiceCall Girls in Gomti Nagar - 7388211116  - With room Service
Call Girls in Gomti Nagar - 7388211116 - With room Service
 
Cracking the Cultural Competence Code.pptx
Cracking the Cultural Competence Code.pptxCracking the Cultural Competence Code.pptx
Cracking the Cultural Competence Code.pptx
 
Event mailer assignment progress report .pdf
Event mailer assignment progress report .pdfEvent mailer assignment progress report .pdf
Event mailer assignment progress report .pdf
 
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
 
Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...
Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...
Call Girls In DLf Gurgaon ➥99902@11544 ( Best price)100% Genuine Escort In 24...
 
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
 
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
 
RSA Conference Exhibitor List 2024 - Exhibitors Data
RSA Conference Exhibitor List 2024 - Exhibitors DataRSA Conference Exhibitor List 2024 - Exhibitors Data
RSA Conference Exhibitor List 2024 - Exhibitors Data
 
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLMONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
 
Call Girls In Holiday Inn Express Gurugram➥99902@11544 ( Best price)100% Genu...
Call Girls In Holiday Inn Express Gurugram➥99902@11544 ( Best price)100% Genu...Call Girls In Holiday Inn Express Gurugram➥99902@11544 ( Best price)100% Genu...
Call Girls In Holiday Inn Express Gurugram➥99902@11544 ( Best price)100% Genu...
 
Monthly Social Media Update April 2024 pptx.pptx
Monthly Social Media Update April 2024 pptx.pptxMonthly Social Media Update April 2024 pptx.pptx
Monthly Social Media Update April 2024 pptx.pptx
 
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
 

Understanding event data

  • 1. Understanding event data Data Insights, Cambridge April 2015
  • 3. 3 years ago, we were pretty frustrated… Product / platform development decisions But… • Hard to identify patterns in user behavior • Hard to identify good / bad engagements • We were using tools (GA / Adobe) to answer questions that the tools were not designed to support
  • 4. Cloud services + open source big data technology -> we can collect and warehouse the event-level data in a linearly scalable way… …and perform any analysis we want on it
  • 5. Snowplow v1 was born… • Every event on your website, represented as a line of data in your own data warehouse (on EMR) • Data queried mostly using Apache Hive, but opportunity to use any Hadoop-based data processing framework Lots of flexibility to perform more involved / advanced web analytics
  • 6. Is web analytics data a subset of a broader category of digital event data? • Stream of events describe “what has happened” over time • High volume of data: one line of data per event, potentially 1000s of events per second • In all cases, having the complete, event-level dataset available for analysis provides the possibility of using the data to answer a very broad set of questions Could we extend the pipeline we built for web data to encompass digital event data more generally?
  • 7. Yes! Just build more trackers…
  • 8. …BUT – what about the structure of the data? Doesn’t this vary by event type? And aren’t we now looking at many more event types? Web events All events • Page view • Order • Add to basket• Page activity • Game saved • Machine broke• Car started • Spellcheck run • Screenshot taken• Fridge empty • App crashed • Disk full• SMS sent • Screen viewed • Tweet drafted• Player died • Taxi arrived • Phonecall ended• Cluster started • Till opened • Product returned
  • 9. There are two historic approaches to dealing with the explosion of possible event types Web analytics vendors Mobile and app analytics vendors Custom Variables Schema-less JSONs
  • 10. Custom variables are very restrictive 1. Take a standard web event, like a page view: 2. and add custom variables until it becomes something totally different: = a “taxi arrived” event, kind of! Page View Page View vehicle=taxi23 status=arrived+ +
  • 11. Schema-less JSONs are better, but they have a different set of problems Issues with the event name: • Separate from the event properties • Not versioned • Not unique – HBO video played versus Brightcove video played Lots of unanswered questions about the properties: • Is length required, and is it always a number? • Is id required, and is it always a string? • What other optional properties are allowed for a video play? Other issues: • What if the developer accidentally starts sending “len” instead of “length”? The data will end up split across two separate fields • Why does the analyst need to keep an implicit schema in their head to analyze video played events?
  • 12. Our approach: schema our JSONs
  • 13. When a developer or analyst defines a new event in JSON, let’s ask them to create a JSON Schema for that event Additional optional field we might not know about otherwise No other fields allowed Yes length should always be a number
  • 14. But we need to let our event definitions evolve, so let’s add versioning – we’re calling this SchemaVer MODEL-REVISION-ADDITION • Start versioning at 1-0-0 – so 1-0-0, 1-0-1, 1-0-2, 1-1-0 etc • Try to stick to backwards-compatible ADDITION upgrades as much as possible
  • 15. We make the event JSONs self-describing, with a schema header and data body The schema field determines where the schema can be found in Iglu, our schema repository Schemas are namespaced…
  • 16. For each event being processed, we can ‘look up’ the schema from the repo and use it to drive event validation and loading of event data into structured data stores Schema repo{}
  • 17. Being able to load data into multiple different stores is very valuable • Different data stores support different types of analyses • SQL DBS -> pivoting / OLAP • Elasticsearch -> search, simple OLAP • Graph databases -> pathing analysis • Many of these data stores (all except elasticsearch) are ‘structured’ • Having the data pass through the pipeline in a schemaed format means we do not need to manually structure it ourselves, which is expensive and error-prone
  • 18. We are working on a second, real-time version of the Snowplow data pipeline Batch: Real-time: Requests logged to S3 Requests logged to Amazon Kinesis and Kafka Event data processed in Scalding / EMR and SQL Event data processing using Kinesis Client Library / Samza Kinesis and Kafka enable us to publish and consume events from a distributed stream in a real-time, robust and scalable way
  • 19. With the real-time pipeline, event data can be fed into data- driven applications alongside the datawarehouse CLOUD VENDOR / OWN DATA CENTER Search Silo SOME LOW LATENCY LOCAL LOOPS E-comm Silo CRM SAAS VENDOR #2 Email marketing ERP Silo CMS Silo SAAS VENDOR #1 NARROW DATA SILOES Streaming APIs / web hooks Unified log LOW LATENCY WIDE DATA COVERAGE < WIDE DATA COVERAGE > < FULL DATA HISTORY > FEW DAYS’ DATA HISTORY Systems monitoring Eventstream HIGH LATENCY LOW LATENCY Product rec’s Ad hoc analytics Management reporting Fraud detection Churn prevention APIs Data warehouse
  • 20. This is exciting for data scientists • Build predictive models based on the complete history of events in the datawarehouse e.g. forecast revenue for acquired users over lifetime… • Put those models live on the same source of truth, as the data comes in in real-time • Use this approach for all types of applications: fraud detection, real-time personalization, product recommendation Data warehouse RT data-driven application
  • 21. However, this only makes figuring out how to model and describe events more important • Lots of applications (not just offline reporting / data warehouse) fed off the event stream • All those applications are decoupled: downstream applications have no control over the structure of data inputted generated upstream of them • So the better able we are to specify (and constrain) the data structures upstream, the easier it’ll be to write downstream applications to consume the data
  • 22. Can we specify a standard framework / structure for events? What about a semantic model?
  • 23. We can extend our self-describing JSON model to encapsulate this semantic model… This is something we need to put more thought / research into
  • 24. The next few months are pretty exciting • Building out the real-time pipeline • Encouraging an ecosystem of developers / partners to build apps to run on the real-time stream • Develop the semantic model / event grammar Any questions?