Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

AWS User Group UK: Why your company needs a unified log

556 views

Published on

Apache Kafka and Amazon Kinesis are more than just message queues — they can serve as a unified log which you can put at the heart of your business, effectively creating a "digital nervous system" which your company's applications and processes can be re-structured around.

In this talk, Alex will provide an introduction to unified log technology, highlight some killer use cases and also show how Kinesis is being used "in anger" at Snowplow. Alex's talk will draw on his experiences working with event streams over the last two and a half years at Snowplow; it’s also heavily influenced by Jay Kreps’ unified log monograph, and by Alex's recent work penning Unified Log Processing, a Manning book. Alex's talk will show how event streams inside a unified log are an incredibly powerful primitive for building rich event-centric applications, unbundling local transactional silos and creating a single version of truth for a company.

Published in: Software
  • Be the first to comment

  • Be the first to like this

AWS User Group UK: Why your company needs a unified log

  1. 1. Why your company needs a Unified Log AWS User Group UK, 28th January 2015
  2. 2. Introducing myself • Alex Dean • Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1] • Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2] [1] https://github.com/snowplow/snowplow [2] http://manning.com/dean
  3. 3. So what is a Unified Log?
  4. 4. A quick history lesson: the three eras of business data processing [1] 1. The classic era, 1996+ 2. The hybrid era, 2005+ 3. The unified era, 2013+ [1] http://snowplowanalytics.com/blog/ 2014/01/20/the-three-eras-of-business-data-processing/
  5. 5. The classic era of business data processing, 1996+ OWN DATA CENTER Data warehouse HIGH LATENCY Point-to-point connections WIDE DATA COVERAGE CMS Silo CRM Local loop Local loop NARROW DATA SILOES LOW LATENCY LOCAL LOOPS E-comm Silo Local loop Management reporting ERP Silo Local loop Silo Nightly batch ETL process FULL DATA HISTORY
  6. 6. The hybrid era, 2005+ CLOUD VENDOR / OWN DATA CENTER Search Silo Local loop LOW LATENCY LOCAL LOOPS E-comm Silo Local loop CRM Local loop SAAS VENDOR #2 Email marketing Local loop ERP Silo Local loop CMS Silo Local loop SAAS VENDOR #1 NARROW DATA SILOES Stream processing Product rec’s Micro-batch processing Systems monitoring Batch processing Data warehouse Management reporting Batch processing Ad hoc analytics Hadoop SAAS VENDOR #3 Web analytics Local loop Local loop Local loop LOW LATENCY LOW LATENCY HIGH LATENCY HIGH LATENCY APIs Bulk exports
  7. 7. The hybrid era: a surfeit of software vendors CLOUD VENDOR / OWN DATA CENTER Search Silo Local loop LOW LATENCY LOCAL LOOPS E-comm Silo Local loop CRM Local loop SAAS VENDOR #2 Email marketing Local loop ERP Silo Local loop CMS Silo Local loop SAAS VENDOR #1 NARROW DATA SILOES Stream processing Product rec’s Micro-batch processing Systems monitoring Batch processing Data warehouse Management reporting Batch processing Ad hoc analytics Hadoop SAAS VENDOR #3 Web analytics Local loop Local loop Local loop LOW LATENCY LOW LATENCY HIGH LATENCY HIGH LATENCY APIs Bulk exports
  8. 8. The hybrid era: company-wide reporting and analytics ends up like Rashomon The bandit’s story vs. The wife’s story vs. The samurai’s story vs. The woodcutter’s story
  9. 9. The hybrid era: the number of data integrations is unsustainable
  10. 10. So how do we unravel the hairball?
  11. 11. The unified era, 2013+ CLOUD VENDOR / OWN DATA CENTER Search Silo SOME LOW LATENCY LOCAL LOOPS E-comm Silo CRM SAAS VENDOR #2 Email marketing ERP Silo CMS Silo SAAS VENDOR #1 NARROW DATA SILOES Streaming APIs / web hooks Unified log LOW LATENCY WIDE DATA COVERAGE Archiving Hadoop < WIDE DATA COVERAGE > < FULL DATA HISTORY > FEW DAYS’ DATA HISTORY Systems monitoring Eventstream HIGH LATENCY LOW LATENCY Product rec’s Ad hoc analytics Management reporting Fraud detection Churn prevention APIs
  12. 12. CLOUD VENDOR / OWN DATA CENTER Search Silo SOME LOW LATENCY LOCAL LOOPS E-comm Silo CRM SAAS VENDOR #2 Email marketing ERP Silo CMS Silo SAAS VENDOR #1 NARROW DATA SILOES Streaming APIs / web hooks Unified log Archiving Hadoop < WIDE DATA COVERAGE > < FULL DATA HISTORY > Systems monitoring Eventstream HIGH LATENCY LOW LATENCY Product rec’s Ad hoc analytics Management reporting Fraud detection Churn prevention APIs The unified log is Amazon Kinesis, or Apache Kafka • Amazon Kinesis, a hosted AWS service • Extremely similar semantics to Kafka • Apache Kafka, an append- only, distributed, ordered commit log • Developed at LinkedIn to serve as their organization’s unified log
  13. 13. “Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization” [1] [1] http://kafka.apache.org/
  14. 14. So what does a unified log give us? A single version of the truth Our truth is now upstream from the data warehouse The hairball of point-to-point connections has been unravelled Local loops have been unbundled 1 2 3 4
  15. 15. What does a unified log let us do that we couldn’t do before? Populating a unified log with your company’s event streams Real-time management reporting To enable… Holistic systems monitoring Re-running models from Day 0 A/B testing end-to-end pipelines Shipping offline models to RT … anything requiring low latency response / holistic view of our company’s data!
  16. 16. How are we embracing the unified log at Snowplow?
  17. 17. Some background: early on, we decided that Snowplow should be composed of a set of loosely coupled subsystems 1. Trackers 2. Collectors 3. Enrich 4. Storage 5. AnalyticsA B C D D = Standardised data protocols Generate event data from any environment Log raw events from trackers Validate and enrich raw events Store enriched events ready for analysis Analyze enriched events These turned out to be critical to allowing us to evolve the above stack
  18. 18. Today most users are running a batch-based Snowplow configuration Hadoop- based enrichment Snowplow event tracking SDK Amazon Redshift Amazon S3 HTTP-based event collector • Batch-based • Normally run overnight; sometimes every 4-6 hoursThe Snowplow batch-based flow uses Amazon S3 as a “poor man’s” unified log
  19. 19. CLOUD VENDOR / OWN DATA CENTER Search Silo SOME LOW LATENCY LOCAL LOOPS E-comm Silo CRM SAAS VENDOR #2 Email marketing ERP Silo CMS Silo SAAS VENDOR #1 NARROW DATA SILOES Streaming APIs / web hooks Unified log Archiving Hadoop < WIDE DATA COVERAGE > < FULL DATA HISTORY > Systems monitoring Eventstream HIGH LATENCY LOW LATENCY Product rec’s Ad hoc analytics Management reporting Fraud detection Churn prevention APIs Can we implement Snowplow on top of Kinesis/Kafka?
  20. 20. We are working on Amazon Kinesis support first; Apache Kafka + Samza will come later this year scala- stream- collector scala- kinesis- enrich S3 Amazon Redshift S3 sink Kinesis app Redshift sink Kinesis app Snowplow Trackers = not yet released kinesis- elasticsearch- sink DynamoDB Elastic- search Event aggregator Kinesis app Analytics on Read for agile exploration of events, machine learning, auditing, re-processing… Raw event stream Bad raw event stream Enriched event stream Google BigQuery kinesis- bigquery- sink Analytics on Write (for dashboarding, audience segmentation, RTB, etc)
  21. 21. Snowplow users can already write stream processing applications which leverage the Snowplow enriched event stream scala- stream- collector scala- kinesis- enrich AWS Lambda Apache Storm Snowplow Trackers Apache Samza Raw event stream Bad raw event stream Enriched event stream Apache Spark Streaming Kinesis Client Library
  22. 22. Questions? http://snowplowanalytics.com https://github.com/snowplow/snowplow @snowplowdata To meet up or chat, @alexcrdean on Twitter or alex@snowplowanalytics.com Discount code: ulogprugcf (43% off Unified Log Processing eBook)

×