Unbounded, unordered, large-scale data sets are increasingly common in day-to-day business, and IoT, for example, is continuously bringing in more and more data. Big data is a buzzword that describes the unusually large-scale data systems we started building to deal with internet-scale data sets. Hadoop is the canonical example of a system built for this purpose, and recently there has been a big push towards streaming models and hence faster data.
In this presentation we are going to look at the evolution of data technologies: from the early days and the introduction of data warehouses to the rise of Hadoop. We are going to see how MapReduce changed the way we think about data, and how we arrived at today's real-time streaming technologies and tools.
For a long time we have used databases to serve online transactions. But when it came to aggregating, joining, filtering and transforming this data, data warehouses were the technology used for integration and data archival.
So in a traditional ETL pipeline we would run overnight batches to produce some reports. But isn't the batching mechanism limiting our responsiveness?
A lot of things started changing in the early 2000s. There was a series of Google papers describing things they were doing internally, and people took those ideas and wrote alternatives in the open source world. One of the most famous is the MapReduce paper (2004), which described a general-purpose distributed compute model. It took away the need for end users to know how to break their computations into smaller pieces, distribute them over a cluster, handle failures and reruns, and bring all the data back together for the final result. The framework handles all that for you. This was a very important step forward, and it ultimately became Hadoop.
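To make the model concrete, here is a minimal single-process sketch of the three steps the framework takes care of: map, shuffle and reduce. It is only an illustration in plain Python, not Hadoop code; the function names are our own, and a real cluster would run the map and reduce calls on different machines and handle failures for us.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "fast data"]))
```

The user only writes the map and reduce functions; splitting, distribution and regrouping are the framework's job.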
This resulted in the birth of Hadoop, and everything (!) in this talk is related to MapReduce. Hadoop celebrated its 10th anniversary at the Hadoop Summit last April in Dublin.
Hadoop is a framework that implements the MapReduce paradigm and introduces an immutable distributed file system, along with other tools that have been added to the ecosystem over time. By default it replicates data across multiple nodes of the cluster (usually 3) and executes distributed sets of map and reduce tasks by utilizing data locality. Its core philosophy is to send the computation to the data instead of pulling the data to the computation.
So Hadoop proved to be both resilient and very flexible. There are a number of use cases, ranging from:
- social media analytics
- data warehousing
- machine learning
It is currently used in production across many industries. However, it's not perfect: it is kind of difficult and sometimes slow, and it does not support streaming.
Apache Spark became very popular in early 2014, when it graduated to an Apache top-level project as one of the fastest-growing OSS projects. Unlike MapReduce, which writes to disk at every map and reduce phase, Spark is a lot more efficient because it keeps the intermediate data in memory, by introducing new distributed collections.
On top of its API, Spark added the capability to run SQL queries. So it became a natural choice as the compute engine: it still plugs into Hadoop and other systems, and now you can write Spark jobs instead, with better performance. In fact, Spark claims to be up to 100 times faster, but in practice it is usually 4-7 times faster than MapReduce, and it's getting better. So where a computation commonly takes 45 minutes as a MapReduce job, today we can optimize it to take 5 - 15 minutes on Spark.
How about low latency? What happens when I want to react to events in real time, for example to update my search engines based on inventory changes? How can I detect anomalies immediately, so I can respond before I get angry customers?
Application requirements have changed dramatically in recent years. Only a few years ago a large application had tens of servers, seconds of response time, hours of offline maintenance and gigabytes of data. Today applications are deployed on everything from mobile devices to cloud-based clusters running thousands of multi-core processors. Users expect millisecond response times and 100% uptime. Data is measured in Petabytes. Today's demands are simply not met by yesterday’s software architectures.
Reactive Systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling, isolation and location transparency. They scale up and down, and they are resilient because they respond to failures, which makes error handling a first-class citizen.
So we want to apply those universal principles in a fast, cheap and scalable way
And of course not to lose data.
So with data streaming from everywhere…
How do I build scalable, fault-tolerant distributed data processing systems that can handle massive amounts of data? From diverse sources? With different structures? How about back-pressure? How long should I buffer data so that I don't run out of memory?
Kafka is a distributed pub-sub messaging system that started at LinkedIn. It:
- writes with high throughput and low latency
- is a multi-subscriber system
- replicates data for resilience
- uses partitions for sharding
It's designed to feed data into multiple systems, including batch systems, and it is persistent by default. If the queue is persistent, you don't have to worry about back-pressure; that is solved by design. This is what Kafka is: a persistent queue that can buffer far more data than can live in an application's memory. So Kafka is becoming the de facto standard for storing streams of events. The key abstraction in Kafka is the topic. Producers publish their records to a topic, and consumers subscribe to one or more topics.
The key idea of Kafka is the log. The log is:
1. An abstract data structure with some specific properties
2. A structured, ordered sequence of messages
3. Immutable, so once written it does not change
4. Written in order and also read in order, so access is sequential, which ensures high performance
So the log provides the ordering semantics that are required for stream processing
But if you want to scale this log, you shard it into multiple partitions. And if you do that, you essentially have the backend of Kafka: the log is the topic, it physically lives in partitions, and those partitions are replicated across a whole bunch of brokers.
So a Kafka topic is just a sharded write-ahead log. Producers append records to these logs and consumers subscribe to changes
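As a rough mental model, a sharded append-only log can be sketched in a few lines of Python. This is not how Kafka is implemented: the class and method names are made up for illustration, there is no replication or persistence, and the CRC32 hash is only a stand-in for the client's real partitioner.

```python
import zlib


class PartitionedLog:
    """Toy in-memory model of a topic: one append-only list per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Producer side: route by key, append, return (partition, offset)."""
        # Assumption: CRC32 stands in for the real partitioning hash.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition, offset):
        """Consumer side: read sequentially from a given offset onwards."""
        return self.partitions[partition][offset:]


if __name__ == "__main__":
    log = PartitionedLog(num_partitions=4)
    partition, first = log.append("item-1", "shipment of 20")
    log.append("item-1", "sale of 3")
    # Records with the same key land in the same partition, in write order.
    print(log.read(partition, first))
```

Ordering is only guaranteed inside one partition, which is why records that must stay in order share a key.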
ETSY.com for example – is using a single topic sharded over 200 partitions – distributed over 20+ servers
Microsoft, Netflix and LinkedIn have surpassed the 1 Trillion messages / day rate (!)
So it’s pretty damned fast
So let’s dive a little bit more into Kafka
Kafka Connect was introduced recently. It is a large-scale tool for importing and exporting streaming data to and from Kafka: both a framework for building connectors and a tool for copying streaming data in and out of Kafka.
We normally have source connectors (which can be, for example, your JMS system) and sink connectors (which, for example, update some indexes on ElasticSearch).
What is interesting is that multiple open source connectors already exist, some of them certified, that ensure:
- exactly-once semantics
- retries and redeliveries
- error policies
Designing and maintaining the ETL process is often considered one of the most difficult and resource intensive portions of a data warehouse project:
- Each step can introduce errors and risks
- Tools can cost millions
- Increased complexity, due to tight coupling
- Can introduce data duplication after fail-over
Kafka enables us to break the E apart from the T and the L, and so decouple them
We can now source data from an external system and sink it into another. So the only thing left to define is how to run our data transformations
And by data transformations we mean filtering, aggregations, data enrichment etc…
The request-response model is synchronous, tightly coupled and latency sensitive: you send one input and get one output, and the only way to scale such a service is to deploy multiple instances of it. In batch, you send ALL your data and wait a long time to get ALL the outputs. Stream processing, unlike both, is a model where we have some inputs and get some outputs back, and the definition of "some" is left to the program. It is a generalization of request-response and batch.
The most prominent streaming frameworks right now are: Spark Streaming – Flink and Kafka Streams
Our available options start with a DIY stream processing approach (using the Kafka libraries). That seems simple to begin with, but we need to manually take care of many aspects:
▪ fault tolerance and fast failover
▪ state, when doing distributed joins and aggregations
▪ reprocessing
▪ time windowing
There are also established stream processing frameworks, such as Spark Streaming. Spark started as a batch model, like MapReduce. But because it's much more efficient than MapReduce, its developers realized that they could actually support a streaming model by using a definable window of time of up to a few seconds, pretty much like a mini-batch.
Flink and Kafka Streams use a different model – an event-at-a-time processing model (not microbatch) with millisecond latency
The key difference between Spark Streaming and Flink / Kafka Streams is the expected latency. Spark Streaming runs mini-batches and produces results in a matter of 2 to 10 seconds, whereas Flink and Kafka Streams process one event at a time with millisecond latency.
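The difference between the two models can be sketched without any framework. The snippet below is a plain-Python illustration, not Spark or Flink code: the function names are ours, and events are assumed to be (timestamp in seconds, value) pairs.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Mini-batch model: collect events into fixed time intervals.

    Nothing is emitted for an interval until the interval is complete,
    so the latency is at least the batch length.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_start = int(timestamp // batch_seconds) * batch_seconds
        batches[batch_start].append(value)
    return [(start, batches[start]) for start in sorted(batches)]


def event_at_a_time(events, handler):
    """Event-at-a-time model: the handler runs as soon as each event arrives."""
    return [handler(value) for _, value in events]


if __name__ == "__main__":
    events = [(0.5, "a"), (3.9, "b"), (4.1, "c")]
    print(micro_batches(events, batch_seconds=4))
    print(event_at_a_time(events, str.upper))
```

With a 4 second batch, the first two events are only processed once their interval closes, while the event-at-a-time handler sees each one immediately.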
What happens if a developer introduces a bug and sends some bad data into a topic? We also mentioned that topics are immutable. So are multiple consumer applications going to be affected?
We can avoid a lot of suffering - by using a Schema Registry
It provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas.
It provides serializers that plug into the Kafka clients and handle schema retrieval for Kafka messages that are sent in the Avro format.
In that case our application would configure a particular serializer in every producer object. That serializer ensures that the schema of each message is both registered and valid.
Using the same schema registry in dev and production allows Uber to catch schema mismatches in unit tests, before rolling to production.
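To show the idea of "schema ID + data" on the wire, here is a small self-contained sketch. It follows the framing used by the Confluent serializers (one magic byte, a 4-byte big-endian schema ID, then the payload), but everything else is an assumption for illustration: the registry is an in-memory dictionary rather than the REST service, the class and function names are ours, and JSON bytes stand in for the Avro binary encoding.

```python
import json
import struct

MAGIC_BYTE = 0  # first byte of the framed message


class InMemorySchemaRegistry:
    """Hypothetical stand-in for the real registry service."""

    def __init__(self):
        self._ids = {}
        self._schemas = {}

    def register(self, schema):
        """Return the ID of the schema, registering it on first use."""
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self._ids:
            schema_id = len(self._ids) + 1
            self._ids[canonical] = schema_id
            self._schemas[schema_id] = schema
        return self._ids[canonical]

    def get(self, schema_id):
        return self._schemas[schema_id]


def serialize(registry, schema, record):
    """Producer side: validate field names, then frame as ID + payload."""
    expected = {field["name"] for field in schema["fields"]}
    if set(record) != expected:
        raise ValueError("record does not match schema fields")
    schema_id = registry.register(schema)
    # Assumption: JSON bytes stand in for the Avro binary encoding.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def deserialize(registry, message):
    """Consumer side: read the schema ID, look the schema up, decode."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown message framing")
    return registry.get(schema_id), json.loads(message[5:].decode("utf-8"))
```

A record with a missing or unexpected field is rejected by the producer before it ever reaches the topic, which is the protection we are after.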
So let's consider a simple model of a retail store. The core streams in retail are sales of products, orders placed for new products, and shipments of products that arrive in stores. The inventory on hand is a table computed off the sale and shipment streams, which add to and subtract from our stock of products on hand. From there we can reorder products when the stock starts to run low and adjust prices based on demand. How do we model real-world things as a combination of streams?
This streaming example was presented in the Hadoop Strata conference in London last month, and we were motivated to implement it
So let's assume you have a shipment topic and a sales topic. For the sake of the example, the message format is ItemID, StoreId and the product count. So messages stream in in real time.
So we implemented this example (the code will be available afterwards, by the way) by following these steps:
- Generate some synthetic data to capture the problem and continuously feed Kafka
- Run a long-running Spark Streaming application which aggregates the topics and generates messages for low inventory
- Applications can subscribe to the low inventory topic and react in real time
So as we said, it's important to define a schema that validates the format and types of the messages. Avro provides 1) rich data structures and 2) a compact, fast binary data format, and Avro schemas are defined in JSON. It's considered one of the best practices in both streaming and batch processing applications. What is really interesting about Avro is that it supports schema evolution, which allows forward and backward compatibility.
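As an illustration of what such a schema and its evolution might look like, here is a sketch using only the Python standard library. The field names, the added "currency" field and its default are assumptions for the example, not the schema from our demo, and read_with_schema is a simplified stand-in for the resolution rules an Avro library applies for you.

```python
import json

# Version 1 of a hypothetical sales schema, written as Avro-style JSON.
SALE_SCHEMA_V1 = json.loads("""
{
  "type": "record",
  "name": "Sale",
  "fields": [
    {"name": "item_id", "type": "long"},
    {"name": "store_id", "type": "long"},
    {"name": "count", "type": "int"}
  ]
}
""")

# Version 2 adds a field WITH a default, which keeps old data readable.
SALE_SCHEMA_V2 = dict(
    SALE_SCHEMA_V1,
    fields=SALE_SCHEMA_V1["fields"]
    + [{"name": "currency", "type": "string", "default": "EUR"}],
)


def read_with_schema(record, reader_schema):
    """Simplified schema resolution: keep known fields, fill in defaults."""
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            result[name] = record[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError(f"missing field without default: {name}")
    return result


if __name__ == "__main__":
    old_record = {"item_id": 1, "store_id": 7, "count": 3}
    # A new reader can still read old data: the default fills the gap.
    print(read_with_schema(old_record, SALE_SCHEMA_V2))
    # An old reader can still read new data: the unknown field is ignored.
    print(read_with_schema(dict(old_record, currency="GBP"), SALE_SCHEMA_V1))
```

The rule of thumb this shows is that new fields need defaults if old and new producers and consumers are to coexist.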
So we generate both sales and shipment records for about 2 M messages / sec… and this is only by utilising 1 thread.
The serializer automatically registers the new schema in the schema registry, and this is what it would look like.
To start building a Spark Streaming application you need to add the dependencies, of course, and create the Spark streaming context (a bit of boilerplate is required), mainly to set up the window, and pretty much run a Spark job every 4 seconds.
In our Spark Streaming app we define the typed data format. What is interesting is that even if the producer of this message decides to evolve/change the data schema, this code will still work.
We set up the code to subscribe one consumer to each topic, continuously poll data from Kafka for these topics and deserialise it into objects.
We pull shipment and sales records and calculate the "inventory on hand" by increasing the availability of a product with each shipment message and decreasing it with each sale. If the availability of a product drops below a certain threshold, we publish a 'low-availability' event/message to a new topic, for other applications to react to.
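The core of that calculation can be sketched outside Spark in a few lines. This is a plain-Python illustration of the logic only, not our streaming job: the tuple layout, the topic names and the threshold value of 10 are assumptions made for the example.

```python
from collections import defaultdict

LOW_STOCK_THRESHOLD = 10  # hypothetical value, chosen for the example


def process(messages, threshold=LOW_STOCK_THRESHOLD):
    """Fold shipment and sales messages into an inventory-on-hand table.

    Each message is (topic, item_id, store_id, count). Returns the table
    and the list of low-inventory events that would be published.
    """
    inventory = defaultdict(int)
    alerts = []
    for topic, item_id, store_id, count in messages:
        key = (item_id, store_id)
        if topic == "shipments":
            inventory[key] += count
        elif topic == "sales":
            inventory[key] -= count
            if inventory[key] < threshold:
                alerts.append(
                    {"item_id": item_id, "store_id": store_id,
                     "on_hand": inventory[key]}
                )
    return dict(inventory), alerts


if __name__ == "__main__":
    stream = [
        ("shipments", 1, 7, 20),
        ("sales", 1, 7, 5),
        ("sales", 1, 7, 8),
    ]
    table, low_inventory = process(stream)
    print(table)
    print(low_inventory)
```

In the streaming version the same fold runs once per window, and the alerts go to the low inventory topic instead of a list.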
The idea is for downstream applications (or connectors) to:
- update our catalogue in real time, as presented in web and mobile apps
- re-order products with low inventory at a particular store
This is the entire spark streaming application you require for this example. Simple?
So we generated 1B messages in less than 30 min.
Obviously, when it comes to production, monitoring is crucial. There are numerous tools out there for this job, and Kafka itself provides a rich set of metrics that tools such as Prometheus and Grafana can expose and visualise.
And if you are the poor devops guy who needs to deploy, configure, scale up or down, view logs etc… How can you deliver such an infrastructure?
Actually this is a hard task. Fortunately some amazing developers provide an integration that sorts out this problem in a matter of minutes.
Confluent-On-Cloudera => Excellent integration with Hadoop http://www.landoop.com/blog/2016/07/confluent-on-cloudera/
So these tools we've seen seem to be very commonly used together in a streaming architecture: Spark, and in particular Spark Streaming, and Kafka for ingesting the data in a durable, resilient, scalable way.
You normally require some sort of scalable distributed storage. Cassandra is the most commonly used, but in principle it can be almost any other data store.
And some missing parts: "how do I glue things together?" and "what infrastructure do I run them on?"
So we've seen people adopting this kind of humorous acronym, the SMACK stack, which stands for … Mesos is emerging as the next generation of cluster management; it's still early days, but we really like it. Akka meets the need for microservices and glues things together, especially with Akka Streams. We don't necessarily agree with it, but all these technologies fit nicely together, and we need to be wise when choosing our technology stack.
So this would be a high-level architecture for your distributed systems.
We can use our Hadoop cluster as a DW and for running our analytics and machine learning, and run our Kafka-based streaming platform on it as well
We will most probably use one or more NoSQL clusters, and for our custom and stateless applications we can utilize a Mesos Cluster and run them within docker containers.
Depending on your needs, you adjust your architecture, but this is a very common pattern we've seen across many organizations.
From Big to Fast Data. How #kafka and #kafka-connect can redefine your ETL and #stream-processing
[Slide diagram labels: check format; retrieve schema ID; schema ID + data; low inventory topic; let's see some code; define the data contract / schema in Avro format]