Continuous Data Processing with Kinesis at Snowplow
Since its inception, the Snowplow open source event analytics platform (https://github.com/snowplow/snowplow) has always been tightly coupled to the batch-based Hadoop ecosystem, and Elastic MapReduce in particular.

With the release of Amazon Kinesis in late 2013, we set ourselves the challenge of porting Snowplow to Kinesis, to give our users access to their Snowplow event stream in near-real-time.

With this porting process nearing completion, Alex Dean, Snowplow Analytics co-founder and technical lead, will share Snowplow’s experiences in adopting stream processing as a complementary architecture to Hadoop and batch-based processing.

In particular, Alex will explore:

- “Hero” use cases for event streaming which drove our adoption of Kinesis
- Why we waited for Kinesis, and thoughts on how Kinesis fits into the wider streaming ecosystem
- How Snowplow achieved a lambda architecture with minimal code duplication, allowing Snowplow users to choose which (or both) platforms to use
- Key considerations when moving from a batch mindset to a streaming mindset – including aggregate windows, recomputation, backpressure
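To make the first of those streaming-mindset considerations concrete, here is a minimal, illustrative Scala sketch of a tumbling aggregate window – fixed-size time buckets over an event stream. It is not code from the talk; the `Event` shape and `tumblingCounts` helper are assumptions for illustration only.

```scala
// Illustrative sketch (hypothetical types, not from the talk):
// a tumbling "aggregate window" buckets events into fixed-size,
// non-overlapping time windows keyed by the window's start time.
final case class Event(timestampMs: Long, user: String)

def tumblingCounts(events: Seq[Event], windowMs: Long = 60000L): Map[Long, Int] =
  events
    .groupBy(e => (e.timestampMs / windowMs) * windowMs) // window start
    .map { case (start, evs) => start -> evs.size }      // count per window

val counts = tumblingCounts(Seq(
  Event(1000L, "a"), Event(59000L, "b"), Event(61000L, "a")
))
// counts == Map(0L -> 2, 60000L -> 1): two events land in the
// [0, 60000) window and one in the [60000, 120000) window
```

In a batch job the "window" is implicitly the whole input; in a streaming app you must pick the window size up front, and recomputation means replaying the stream through the same function.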

Transcript

  • 1. Continuous data processing with Kinesis at Snowplow (Budapest DW Forum 2014)
  • 2. Agenda today 1. Introduction to Snowplow 2. Our batch data flow & use cases 3. Why are we excited about Kinesis? 4. Adding Kinesis support to Snowplow 5. Questions
  • 3. Introduction to Snowplow
  • 4. Snowplow is an open-source web and event analytics platform, first version released in early 2012 • Co-founders Alex Dean and Yali Sassoon met in 2008 at OpenX, the open-source ad technology business • After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy • We released Snowplow as a skunkworks prototype at the start of 2012: github.com/snowplow/snowplow • We started working full time on Snowplow in summer 2013
  • 5. We wanted to take a fresh approach to web analytics • Your own web event data -> in your own data warehouse • Your own event data model • Slice / dice and mine the data in highly bespoke ways to answer your specific business questions • Plug in the broadest possible set of analysis tools to drive value from your data [diagram: data pipeline -> data warehouse -> analyse your data in any analysis tool]
  • 6. And we saw the potential of new “big data” technologies and services to solve these problems in a scalable, low-cost manner. These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis [logos: CloudFront, Amazon S3, Amazon EMR, Amazon Redshift]
  • 7. Early on, we made a crucial decision: Snowplow should be composed of a set of loosely coupled subsystems: 1. Trackers (generate event data from any environment) 2. Collectors (log raw events from trackers) 3. Enrich (validate and enrich raw events) 4. Storage (store enriched events ready for analysis) 5. Analytics (analyze enriched events), with standardised data protocols (A–D) between each pair. These protocols turned out to be critical to allowing us to evolve our technology stack
  • 8. Our batch data flow & use cases
  • 9. By spring 2013 we had arrived at a relatively stable batch-based processing architecture [diagram: website / webapp with JavaScript event tracker -> CloudFront-based or Clojure-based event collector -> Amazon S3 -> Scalding-based enrichment on Hadoop -> Amazon Redshift / PostgreSQL]
  • 10. What did people start using Snowplow for? Warehousing their web event data and agile aka ad hoc analytics, to enable: marketing attribution modelling, customer lifetime value calculations, customer churn prediction, RTB fraud detection, email product recs
  • 11. These use cases (e.g. agile aka ad hoc analytics, marketing attribution modelling, RTB fraud detection) tended to be characterized by a few important traits: 1. They use data collected over long time periods 2. They demand ongoing & hands-on involvement from a BA / data scientist 3. They tend not to elicit synchronous / deterministic responses
  • 12. So why did we get excited about Kinesis?
  • 13. A quick history lesson: the three eras of business data processing 1. The classic era, 1996+ 2. The hybrid era, 2005+ 3. The unified era, 2013+ For more see http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/
  • 14. The classic era, 1996+ [diagram: in your own data center, narrow data siloes (CMS, CRM, E-comm, ERP), each with its own low-latency local loop, feed a data warehouse via point-to-point connections and a nightly batch ETL process; the warehouse gives wide data coverage and full data history for management reporting, at high latency]
  • 15. The hybrid era, 2005+ [diagram: narrow data siloes now spread across the cloud vendor / own data center (search, e-comm, CRM, ERP, CMS) and multiple SaaS vendors (email marketing, web analytics), each with its own low-latency local loop; APIs and bulk exports feed stream processing (product rec’s), micro-batch processing (systems monitoring) and batch processing into Hadoop and the data warehouse for ad hoc analytics and management reporting, at high latency]
  • 16. The unified era, 2013+ [diagram: the same siloes, with some low-latency local loops remaining, now feed a unified log via streaming APIs / web hooks and APIs; the log offers low latency and wide data coverage but only a few days’ data history, and is archived to Hadoop (wide data coverage, full data history, high latency); the event stream drives systems monitoring, product rec’s, fraud detection and churn prevention at low latency, alongside ad hoc analytics and management reporting]
  • 17. [the same unified-era diagram as the previous slide] The unified log is Kinesis (or Kafka)
  • 18. [the same unified-era diagram] We asked: can we implement Snowplow on top of Kinesis?
  • 19. What kinds of use cases can we support if we implement Snowplow on top of Kinesis? Populating a unified log with your company’s event streams, to enable: in-session product recs, holistic systems monitoring, in-game difficulty tuning, in-session upselling, ad retargeting & RTB… anything requiring a low-latency response / holistic view of our data!
  • 20. Adding Kinesis support to Snowplow
  • 21. Where we are heading with our Kinesis architecture [diagram: Snowplow Trackers -> Scala Stream Collector -> raw event stream -> Enrich Kinesis app -> enriched event stream (plus a bad raw events stream); S3 sink Kinesis app -> S3; Redshift sink Kinesis app -> Redshift]
  • 22. This is where we are today [the same architecture diagram as the previous slide]
  • 23. What have we and the Snowplow community learnt about Kinesis and continuous data processing so far? 1. One stream -> many consuming apps is unexpected for many people (a legacy of old MQs?) 2. Think of Kinesis apps as distributed Unix commands, with streams mapping onto stdin, stderr, stdout 3. Build more complex systems by chaining simple Kinesis apps – the Kinesis stream is a really powerful primitive for continuous data flows 4. Scalability and elasticity are going to be much bigger challenges than in our batch flow
  • 24. Questions? http://snowplowanalytics.com https://github.com/snowplow/snowplow @snowplowdata To talk offline – @alexcrdean on Twitter or alex@snowplowanalytics.com
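The Unix-command analogy from slide 23 can be sketched in a few lines of Scala. This is a conceptual toy, not Snowplow's real code: the `Raw`/`Enriched` types and the `enrich`/`s3Sink` functions are hypothetical stand-ins for Kinesis apps, with in-memory sequences standing in for streams.

```scala
// Conceptual sketch only: each "Kinesis app" reads one stream (stdin)
// and writes a good-events stream (stdout) and a bad-events stream
// (stderr); bigger flows are built by chaining simple apps.
final case class Raw(payload: String)
final case class Enriched(payload: String, collector: String)

// "Enrich app": partitions the raw stream into enriched and bad events.
def enrich(in: Seq[Raw]): (Seq[Enriched], Seq[Raw]) = {
  val (good, bad) = in.partition(_.payload.nonEmpty)
  (good.map(r => Enriched(r.payload, collector = "stream-collector")), bad)
}

// "S3 sink app": consumes the enriched stream, emits object keys.
def s3Sink(in: Seq[Enriched]): Seq[String] =
  in.map(e => s"enriched/${e.payload}")

val raw = Seq(Raw("page_view"), Raw(""), Raw("add_to_cart"))
val (enriched, badRaw) = enrich(raw)    // app 1: raw -> enriched + bad
val keys = s3Sink(enriched)             // app 2: enriched -> S3 keys
```

Because each stage only agrees on the shape of the stream between them, apps can be added, swapped or scaled independently – which is exactly what makes the Kinesis stream a powerful primitive for continuous data flows.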
