Event Driven Architecture



Jay Kreps' articles on the Stream Data Platform design pattern

Published in: Data & Analytics


  1. Event Driven Architecture. bluck. © OCTO 2015
  2. The first problem was how to transport data between systems. The second part of this problem was the need to do richer analytical data processing with very low latency.
  3. The pipeline for log data was scalable but lossy, and could only deliver data with high latency. The pipeline between Oracle instances was fast, exact, and real-time, but not available to any other systems.
  4. The pipeline of Oracle data for Hadoop was periodic CSV dumps: high throughput, but batch. The pipeline of data to our search system was low latency, but unscalable and tied directly to the database. The messaging systems were low latency but unreliable and unscalable.
  5. [Slide without transcribed text]
  6. We added data centers geographically distributed around the world, so we had to build out geographical replication for each of these data flows. The data was always unreliable. Our reports were untrustworthy, derived indexes and stores were questionable, and everyone spent a lot of time battling data quality issues of all kinds. At the same time, we weren't just shipping data from place to place; we also wanted to do things with it. Hadoop had given us a platform for batch processing, data archival, and ad hoc processing, and this had been enormously successful, but we lacked an analogous platform for low-latency processing.
  7. Stream Data Platform
  8. Stream Data Platform
  9. The Rise of Events and Event Streams. Your database stores the current state of your data. But the current state is always caused by some actions that took place in the past. The actions are the events. Much of what people refer to when they talk about "big data" is really the act of capturing these events that previously weren't recorded anywhere and putting them to use for analysis, optimization, and decision making. Event streams are an obvious fit for log data or things like "orders", "sales", "clicks" or "trades" that are obviously event-like.
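
The claim that current state is caused by past events can be sketched as a replay: fold the events, in order, into a state table. This is only an illustration; the event fields (`type`, `account`, `amount`) and the `replay` function are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical event log: each entry records an action that took place in the past.
events = [
    {"type": "deposit", "account": "alice", "amount": 100},
    {"type": "deposit", "account": "bob", "amount": 50},
    {"type": "withdrawal", "account": "alice", "amount": 30},
]


def replay(event_log):
    """Rebuild the current balances by applying every event in order."""
    balances = defaultdict(int)
    for event in event_log:
        sign = 1 if event["type"] == "deposit" else -1
        balances[event["account"]] += sign * event["amount"]
    return dict(balances)


# The "database" view is just the result of replaying the stream.
current_state = replay(events)
print(current_state)  # {'alice': 70, 'bob': 50}
```

The database keeps only `current_state`; the stream keeps the three actions that produced it.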
  10. Databases Are Event Streams. Data in databases can also be thought of as an event stream. There are two ways to create a backup or standby copy of a database: dump out the contents, or take a "diff" of what has changed. Change capture: if we take our diffs more and more frequently, what we will be left with is a continuous sequence of single-row changes. By publishing the database changes into the stream data platform, you add this to the other set of event streams. You can use these streams to synchronize other systems like a Hadoop cluster, a replica database, or a search index, or you can feed these changes into applications or stream processors to directly compute new things off the changes.
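
A minimal sketch of the "diff" idea, assuming a table can be modeled as a dict keyed by primary key. The functions `diff_snapshots` and `apply_changes` and the event shape (`op`, `key`, `row`) are hypothetical names chosen for illustration; real change-capture systems usually read the database log rather than compare snapshots.

```python
def diff_snapshots(old, new):
    """Compare two table snapshots and return row-level change events."""
    changes = []
    for key in sorted(new.keys() - old.keys()):
        changes.append({"op": "insert", "key": key, "row": new[key]})
    for key in sorted(old.keys() & new.keys()):
        if old[key] != new[key]:
            changes.append({"op": "update", "key": key, "row": new[key]})
    for key in sorted(old.keys() - new.keys()):
        changes.append({"op": "delete", "key": key, "row": None})
    return changes


def apply_changes(table, changes):
    """Bring a replica up to date by applying change events to a copy of it."""
    replica = dict(table)
    for change in changes:
        if change["op"] == "delete":
            replica.pop(change["key"], None)
        else:
            replica[change["key"]] = change["row"]
    return replica


snapshot_1 = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
snapshot_2 = {1: {"name": "Ann"}, 2: {"name": "Robert"}, 3: {"name": "Cy"}}

changes = diff_snapshots(snapshot_1, snapshot_2)
replica = apply_changes(snapshot_1, changes)
```

Taking the diff more often shrinks each batch of `changes` until it approaches one event per row change, which is the stream the slide describes.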
  11. What Is a Stream Data Platform For? A stream data platform has two primary uses. Data integration: the stream data platform captures streams of events or data changes and feeds these to other data systems such as relational databases, key-value stores, Hadoop, or the data warehouse. Stream processing: it enables continuous, real-time processing and transformation of these streams and makes the results available system-wide. The stream data platform is a central hub for data streams. It also acts as a buffer between these systems: the publisher of data doesn't need to be concerned with the various systems that will eventually consume and load the data. This means consumers of data can come and go and are fully decoupled from the source.
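
The decoupling described above can be sketched with a toy append-only log in which each consumer tracks its own read position. `Stream` and `Consumer` are hypothetical classes for illustration only, not the API of Kafka or any other product, and the sketch assumes the log retains everything ever published.

```python
class Stream:
    """A toy append-only log; the publisher never learns who reads it."""

    def __init__(self):
        self._log = []

    def publish(self, event):
        """Append an event and return its offset in the log."""
        self._log.append(event)
        return len(self._log) - 1

    def read(self, offset):
        """Return every event at or after the given offset."""
        return self._log[offset:]


class Consumer:
    """Each consumer keeps its own offset, so consumers can come and go."""

    def __init__(self, stream, offset=0):
        self.stream = stream
        self.offset = offset

    def poll(self):
        batch = self.stream.read(self.offset)
        self.offset += len(batch)
        return batch


stream = Stream()
stream.publish("order-created")
stream.publish("order-paid")

warehouse = Consumer(stream)
first_batch = warehouse.poll()      # the two events published so far

stream.publish("order-shipped")

search_index = Consumer(stream)     # joins late, starts from offset 0
late_batch = search_index.poll()    # still sees the full history
second_batch = warehouse.poll()     # only the event it has not yet seen
```

The publisher's code does not change when `search_index` is added, which is the buffering role the slide attributes to the platform.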
  12. What Is a Stream Data Platform For? (Zoom: Hadoop) Hadoop wants to be able to maintain a full copy of all the data in your organization and act as a "data lake" or "enterprise data hub". Directly integrating each data source with HDFS is a hugely time-consuming proposition, and the end result only makes that data available to Hadoop. This type of data capture isn't suitable for real-time processing or syncing other real-time applications. This same pipeline can run in reverse: Hadoop and the data warehouse environment can publish out results that need to flow into appropriate systems for serving in customer-facing applications.
  13. What Is a Stream Data Platform For? (Zoom: ETL) The stream processing use case plays off the data integration use case. The results of the stream processing are just a new, derived stream. Stream processing is both a way to develop applications that need low-latency transformations and directly part of the data integration usage: integrating systems often requires some munging of data streams in between.
  14. How Does a Stream Data Platform Relate To Existing Things? A stream data platform is similar to an enterprise messaging system: it receives messages and distributes them to interested subscribers. There are three important differences. (1) Messaging systems are typically run in one-off deployments for different applications, whereas the purpose of the stream data platform is very much as a central data hub. (2) Messaging systems do a poor job of supporting integration with batch systems, such as a data warehouse or a Hadoop cluster, as they have limited data storage capacity. (3) Messaging systems do not provide semantics that are easily compatible with rich stream processing.
  15. How Does a Stream Data Platform Relate To Existing Things? In other words, a stream data platform is a messaging system whose role has been rethought at a company-wide scale.
  16. Data Integration Tools. A stream data platform is a true platform that any other system can choose to tap into and many applications can build around. By making data available in a uniform format in a single place with a common stream abstraction, many of the routine data clean-up tasks can be avoided entirely.
  17. Enterprise Service Buses. The advantage of a stream data platform is that transformation is fundamentally decoupled from the stream itself. The transformation code can live in applications or stream processing tasks, allowing teams to iterate at their own pace without a central bottleneck for application development.
  18. Change Capture Systems. Databases have long had similar log mechanisms such as Golden Gate. However, these mechanisms are limited to database changes only and are not a general-purpose event capture platform.
  19. Data Warehouses and Hadoop. A stream data platform doesn't replace your data warehouse; in fact, quite the opposite: it feeds it data.
  20. Stream Processing Systems. They attempt to add richer processing semantics to subscribers and can make implementing data transformation easier.
  21. What Does This Look Like In Practice? Everything from user activity to database changes to administrative actions like restarting a process is captured in real-time streams that are subscribed to and processed in real time.
  22. Recommendations: Limit The Number of Clusters. Part of the promise of this approach to data management is having a central repository with the full set of data streams your organization generates. This works best when data is all in the same place, which simplifies system architecture: fewer integration points for data consumers, fewer things to operate, a lower incremental cost for adding new applications, and data flow that is easier to reason about. But there are several reasons to end up with multiple clusters: to keep activity local to a datacenter, for security reasons, and for SLA control.
  23. Recommendations: Pick A Single Data Format. Apache Kafka does not enforce any particular format. If each individual or application chooses a representation of their own preference (say some use JSON, others XML, and others CSV), the result is that any system or process which uses multiple data streams has to munge and understand each of these. Local optimization, that is, choosing your favorite format for data you produce, leads to huge global sub-optimization, since now each system needs to write N adaptors, one for each format it wants to ingest. Imagine how useless the Unix toolchain would be if each tool invented its own format: you would have to translate between formats every time you wanted to pipe one command to another.
  24. Recommendations: Pick A Single Data Format. Connecting all systems directly would look something like this [diagram], whereas having this central stream data platform looks something like this [diagram].
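
The difference between the two topologies can be made concrete with a little arithmetic. The sketch below assumes the worst case, in which every system both produces and consumes data and every pair of systems exchanges data in both directions; the function names are hypothetical.

```python
def point_to_point_links(n_systems):
    """Directed pipelines when every system feeds every other system directly."""
    return n_systems * (n_systems - 1)


def hub_links(n_systems):
    """Pipelines with a central platform: one in and one out per system."""
    return 2 * n_systems


for n in (3, 10, 50):
    print(n, point_to_point_links(n), hub_links(n))
# 3 6 6
# 10 90 20
# 50 2450 100
```

Under these assumptions the direct approach grows quadratically with the number of systems while the hub grows linearly, and the same reasoning applies to format adaptors when each producer picks its own format.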
  25. Recommendations: Use Avro as Your Data Format. We think Avro is the best choice for a number of reasons: (1) It has a direct mapping to and from JSON. (2) It has a very compact format. The bulk of JSON, repeating every field name with every single record, is what makes JSON inefficient for high-volume usage. (3) It is very fast. (4) It has great bindings for a wide variety of programming languages, so you can generate Java objects that make working with event data easier, but it does not require code generation, so tools can be written generically for any data stream. (5) It has a rich, extensible schema language defined in pure JSON. (6) It has the best notion of compatibility for evolving your data over time.
  26. The Need For Schemas. Isn't the modern world of big data all about unstructured data, dumped in whatever form is convenient, and parsed later when it is queried? One of the primary advantages of this type of architecture, where data is modeled as streams, is that applications are decoupled. Applications produce a stream of events capturing what occurred without knowledge of which things subscribe to these streams.
  27. Share Event Schemas. Whenever you see a common activity across multiple systems, try to use a common schema for this activity. An example of this that is common to all businesses is application errors.
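
As a sketch of what a shared schema for application errors might look like, the example below defines one record type that every service emits. The field names, service names, and values are hypothetical, and JSON is used only to keep the example within the Python standard library; the deck itself recommends Avro for the wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ApplicationError:
    """One common shape for error events, whatever service produced them."""

    service: str
    host: str
    timestamp_ms: int
    error_class: str
    message: str


def encode(event):
    """Serialize an error event for publication on a shared stream."""
    return json.dumps(asdict(event), sort_keys=True)


def decode(payload):
    """Parse a payload back into the shared record type."""
    return ApplicationError(**json.loads(payload))


checkout_error = ApplicationError(
    "checkout", "web-01", 1420070400000, "TimeoutError", "payment gateway timed out"
)
search_error = ApplicationError(
    "search", "web-07", 1420070401000, "ValueError", "empty query"
)

# A single consumer can count errors per service without per-service parsing code.
errors_by_service = {}
for payload in (encode(checkout_error), encode(search_error)):
    error = decode(payload)
    errors_by_service[error.service] = errors_by_service.get(error.service, 0) + 1
```

Because both services use the same record, the monitoring consumer needs one decoder rather than one per producer.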