12. Real Time Data Infrastructure
[Architecture diagram: producers and consumers; Messaging as a Service (Kafka) and Stream Processing as a Service (Flink) as building blocks under the Keystone Data Pipeline; topics, routing, transformations (filter, projection, UDF), schema/hygiene]
13. Stream Processing
[Same architecture diagram as slide 12]
23. Offerings by complexity
● Simple: drag-and-drop filter, projection, and data hygiene
○ Available now via Keystone router
● Medium: SQL, UDF (User Defined Function)
○ Coming 2018
● Advanced: custom stream processing applications
○ Available now
24. Ease of use vs. capability
[Chart: ease-of-use axis (easy to difficult) vs. capability axis (limited to full feature); color legend: available now, coming 2018]
25. Ease of use vs. capability
[Same chart, with the Keystone Router added]
26. Ease of use vs. capability
[Same chart, with the Keystone Router and a custom SPaaS app]
27. Ease of use vs. capability
[Same chart: Keystone Router, custom SPaaS app, and SQL]
28. Ease of use vs. capability
[Same chart: Keystone Router + UDF, SQL + UDF, and custom SPaaS app]
42. Drag-and-drop Keystone router
● Stateless and embarrassingly parallel
● ~2,000 jobs in prod
● Self-serve and fully managed
● At-least-once delivery semantics
● Isolation
90. Hive backfill vs. Flink rewind

                   Hive backfill    Flink rewind
Warm-up issue      Yes              No
Ordering issue     Yes              No
Data retention     Months           Hours or days
Applicability      Stateless        Stateless and stateful
91. Pros for Hive backfill source
● Long-term storage (a few months)
● Fast recovery
○ S3 is very scalable
○ Runs in parallel with live job
Who has used a stream processing framework before? What about Apache Flink?
We care about stream processing at Netflix, because we have real business needs for it.
If you go to the Netflix home screen, it shows rows of recommended TV shows and movies, personalized to your taste. Most of those recommendations are produced by offline batch processing jobs. But one row on the home screen, called “Trending Now,” shows the videos that are trending on Netflix right now. It is also personalized. “Trending Now” is computed by a stream processing application in real time. Here, now really means now: not an hour ago, not yesterday.
In some cases, streaming is simply a more natural way to process unbounded data than batch. In this simple example, we have an unbounded user activity stream and we want to sessionize it. A session boundary is defined by a gap of inactivity.
In the batch world, we split the input into fixed-size windows (e.g., daily windows) of bounded data sets, then repeatedly process each of those fixed-size windows.
Here is how a batch job would compute the sessions.
Here is the problem for Bob: one of his sessions crosses a day boundary. Since they are processed on different days, these two events are incorrectly divided into two different sessions.
The correct computation should group them as one session.
In the streaming world, we don’t artificially chunk and process the data along daily boundaries. We just process events as they come in. It is a more natural mapping to real-world activity.
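The gap-based sessionization above can be sketched in a few lines of plain Python (this is not Flink code; the 30-minute gap and Bob's event times are made up for illustration):

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap=timedelta(minutes=30)):
    """Group event timestamps into sessions separated by inactivity gaps."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # continues the current session
        else:
            sessions.append([ts])     # gap exceeded: start a new session
    return sessions

# Bob's activity straddles midnight: in reality it is one session...
events = [datetime(2018, 7, 1, 23, 50), datetime(2018, 7, 2, 0, 5)]
print(len(sessionize(events)))        # streaming view: 1 session

# ...but daily batch windows process each day separately: 2 sessions.
day1 = [e for e in events if e.date() == datetime(2018, 7, 1).date()]
day2 = [e for e in events if e.date() == datetime(2018, 7, 2).date()]
print(len(sessionize(day1)) + len(sessionize(day2)))  # batch view: 2 sessions
```

The streaming view groups the two events into one session because only the inactivity gap matters, while the daily-window batch view splits them exactly as in Bob's example.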
Last but not least, we will take a deep look at how to recover a streaming job from an extended outage. We will discuss two solutions: backfill and rewind.
We have a small team of 10 people. At a high level, we provide three platforms for the company. At the bottom layer, we have two building blocks: messaging as a service, built on top of Apache Kafka, and stream processing as a service, built on top of Apache Flink. At the top layer, we have the Keystone data pipeline, built on top of those two building blocks.
The components provided by stream processing are highlighted in red.
At the 10,000-foot level, the Keystone data pipeline is responsible for moving data from producers to sinks for consumption. We will get into more details of the Keystone pipeline when we talk about the Keystone router later.
Why did we choose Apache Flink as our stream processing engine? Here are some of the key features that Flink provides.
Flink guarantees exactly-once semantics for stateful computations. In other words, Flink ensures that operators do not perform duplicate updates to their state in the event of failure. We need a rewindable source like Kafka to achieve exactly-once application state.
If a job writes to a sink (which most jobs do), there are two ways to achieve end-to-end exactly-once. First, use an idempotent sink. Second, if the sink supports transactions, Flink can do a two-phase commit and commit the transaction as part of the checkpoint.
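To see why an idempotent sink gives end-to-end exactly-once under replay: if writes are keyed upserts, reprocessing the same records leaves the sink in the same final state. A minimal sketch in plain Python (not a real Flink connector; the record keys and values are made up):

```python
class IdempotentSink:
    """Keyed upsert sink: writing the same (key, value) twice is harmless."""
    def __init__(self):
        self.store = {}

    def write(self, key, value):
        # Last write wins, similar to Cassandra/EVCache semantics.
        self.store[key] = value

sink = IdempotentSink()
batch = [("user:alice", 3), ("user:bob", 7)]
for k, v in batch:
    sink.write(k, v)

# After a failure, the job rewinds and replays the same batch...
for k, v in batch:
    sink.write(k, v)

# ...and the sink state is unchanged: the upsert collapses the duplicates.
print(sink.store)  # {'user:alice': 3, 'user:bob': 7}
```

This is why at-least-once delivery plus an idempotent sink behaves like exactly-once from the reader's point of view.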
Windows are at the heart of processing unbounded streams. Windows split the stream into “buckets” of finite size. Then we can apply computations on those buckets. Flink supports flexible windowing based on time, count, or sessions. Windows can also be customized with flexible triggering conditions to support sophisticated streaming patterns.
Flink supports stream processing and windowing with event time semantics. Event time makes it easy to compute accurate results with out-of-order or late-arrival events.
A Flink state backend defines the data structure that holds the state. It also implements the logic to take a snapshot of the job state and store that snapshot, as part of a checkpoint, in a distributed file system like HDFS or S3. Checkpointing is how Flink achieves fault tolerance.
Flink also supports incremental checkpointing for the RocksDB state backend, which greatly reduces the snapshot size in many cases.
As the checkpoint barrier flows through the job graph as part of the data stream, each operator takes a snapshot and uploads it. Most of the snapshot work happens asynchronously in the background. This allows the system to maintain high throughput while achieving fault tolerance.
Flink’s checkpoint performance is one of the biggest reasons we chose Flink as our stream processing engine. It really opens up the possibility of large-state jobs.
Flink provides multiple levels of programming APIs, from low-level process functions to high-level SQL.
Let’s look at those offerings plotted in the two dimensions of “ease of use” and “capability”.
We first migrated our Keystone router to the Flink platform. The Keystone router is arguably our most important use case. It is also probably the simplest one. We have ~2,000 routing jobs in Keystone globally, so we have to make this use case as easy as possible. We attacked the Keystone router use case first because it showcases that we can run Flink at very large scale, which builds confidence for our users.
Then we built our platform to enable users to write custom stream processing jobs. This gives users total control, but it is not necessarily the simplest way to get things done.
That’s why we are exploring SQL. Industry trends show that SQL can increase the adoption of stream processing. Users just need to write a SQL query; my team manages the execution of SQL jobs, so users don’t need to worry about operations.
User-defined functions extend the capability of the Keystone router and SQL.
UDFs extend the expressiveness of SQL, which should help SQL adoption.
Right now, our router only supports simple XPath-based filters and top-level field projection. If users want more sophisticated filtering or transformation logic, they currently have to write and manage their own custom applications. We want to make this simpler for them. If we can plug user-defined functions (e.g., in JavaScript) into our router and manage those routing jobs, users won’t need to worry about operations.
Titus is a container orchestration platform developed in-house. It is similar to Kubernetes. Our Flink jobs run on top of the Titus platform.
We start two Titus jobs for the Flink runtime: one for the job manager and one for the task managers. Together they form one Flink cluster. We submit only one job per Flink cluster, for isolation.
Pretty much every application publishes some data to our data pipeline.
We ingest over 1 trillion unique events per day. That is about ~3 PB.
One of our largest data streams has a fan-out of 13 outputs. It used to be 20.
We have ~2,000 routing jobs running in the prod environment. That is why it is so important to make the platform self-serve and fully automated; otherwise, my team would be overwhelmed just dealing with operational work. It is also why our initial focus for stream processing was the Keystone router: it is our biggest use case.
Once users provision routing jobs via our self-service UI, my team is responsible for operating those jobs.
We also run each routing job in isolation, as its own Flink cluster. Such isolation is important for dealing with outages or backpressure. If one ElasticSearch cluster is having issues, we don’t want it to affect routing to other ElasticSearch clusters, or to the Kafka and Hive sinks.
After showing you the simple point-and-click Keystone router, let’s see how we enabled users to write custom stream processing jobs.
Dashboards cover system metrics (CPU, memory, disk), JVM metrics (e.g., GC pauses), Flink metrics, Kafka metrics, etc.
That greatly lowers the barrier to entry and takes care of the boilerplate work. What used to take hours now takes literally a minute. Some of our data engineers have said that creating a streaming job is now so easy that they don’t really need SQL, or at least it’s not an urgent need.
We tested the code locally. Now we are ready to deploy it in the cloud.
We automatically extract the application config from the app’s Docker image and present it in the UI. Users can override any setting at deployment time. For example, it’s very common that the Kafka cluster name/VIP differs between the test and prod environments, so when users deploy the same image in prod, they very likely need to override the Kafka cluster VIP.
When we build a platform, we have to consider the non-happy path. In this last section, we will take a deep look at how a streaming job should recover from failure. Failure recovery is one of my favorite topics.
Things can fail.
Your job may need to query other external services to enrich data, and those dependencies may fail.
We offer two solutions for failure recovery. The first is to launch a backfill job that runs in parallel with the live job and reprocesses data from the outage period. The second is to rewind the regular Flink job to a checkpoint taken before the outage and resume data processing from that point.
Most of our data streams land in Hive tables in addition to Kafka, where stream processing applications read from. Our Hive data warehouse usually stores data for a few months, which makes it a perfect source for backfill: we can launch a backfill job replaying Hive data from the outage period. Note that the backfill job processes a finite dataset from Hive, so it finishes once it is done processing the data from the outage period.
It is very easy to deploy a backfill job using the same Docker image as the live job. In this example, I configured two sources, but the job actually reads from only one, depending on which source is chosen. For the long-running live job, I would choose the Kafka source; for the short-lived backfill job, the Hive source. This UI is a little misleading, because users may think the job reads from both Kafka and Hive. We will improve it soon.
To read from the Hive source, users need to provide the Hive database and table name, plus a SQL WHERE clause to specify the Hive partitions. Most of our Hive tables are partitioned by date and hour, so the WHERE clause is typically the time range of the data you want to reprocess.
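For example, a small helper that builds such a clause for an outage window might look like this (the partition column names `dateint` and `hour` are illustrative assumptions, not our actual schema):

```python
def partition_where_clause(dateint, start_hour, end_hour):
    """Build a WHERE clause selecting the Hive partitions for an outage window.
    Assumes hypothetical partition columns `dateint` and `hour`."""
    return f"dateint = {dateint} AND hour BETWEEN {start_hour} AND {end_hour}"

# A three-hour outage on 2018-07-01, 14:00-16:59:
print(partition_where_clause(20180701, 14, 16))
# dateint = 20180701 AND hour BETWEEN 14 AND 16
```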
What is Lambda architecture?
This all sounds great. So what’s the catch?
With a stateful stream processor, each input record is evaluated against application state accumulated over time. When we launch a brand-new backfill job, we need to build up that state before we can emit correct output.
So in addition to processing the data from the outage period, we also need to process additional data from a warm-up period. How long the warm-up period needs to be is highly use-case dependent: it could be as short as a few minutes or as long as hours or days.
We added warm-up data, but here is the chicken-and-egg problem: computation during the warm-up period also produces incorrect results. It’s up to the application to make sure no output is emitted during the warm-up period, which is added complexity for the application.
Kafka consumers read messages in the same order they were written to a partition. If data is evenly distributed across all partitions, that achieves good message ordering for the system.
However, ordering is very problematic for the Hive source. To understand why, we should look at how Hive reads data from S3. Here we are reading a Hive partition with 5 S3 files.
Hadoop MapReduce usually divides input data into logical chunks called splits. In this example, those 5 S3 files are divided into 10 logical splits.
In our implementation, the Flink job manager does the split calculation and ships the result to all task managers as part of the job graph info.
Each task manager then runs the same split assignment algorithm (e.g., round-robin) to decide which splits it should read.
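Because every task manager receives the full split list, each one can run the same deterministic assignment locally and pick out its own share without any coordination. A minimal round-robin sketch (plain Python; the split IDs are made up):

```python
def my_splits(all_splits, task_index, num_tasks):
    """Deterministic round-robin: task i takes splits i, i+n, i+2n, ...
    Every task manager runs this locally and gets a disjoint share."""
    return [s for i, s in enumerate(all_splits) if i % num_tasks == task_index]

splits = [f"split-{i}" for i in range(10)]  # 5 S3 files -> 10 logical splits

# Three task managers each compute their own assignment independently.
assignment = [my_splits(splits, t, 3) for t in range(3)]
print(assignment[0])  # ['split-0', 'split-3', 'split-6', 'split-9']

# The shares are disjoint and together cover every split exactly once.
print(sorted(sum(assignment, [])) == sorted(splits))  # True
```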
Hive doesn’t track or maintain ordering among those files; that is not required of a batch system. If ordering is needed, users have to sort the data, e.g., using an ORDER BY in SQL. If we look at the first task manager, it processes splits from hours out of order: hour 0 first, then hours 23, 12, and 3.
Note that we are not asking for total ordering in a distributed system. But a stream processing job usually expects certain behavior, or a threshold, for late events. Users set the allowed lateness for the live job accordingly (e.g., to 10 minutes), which is very reasonable for a live source like Kafka. But with a backfill job whose Hive source emits records in arbitrary time order, this can ruin the computation’s correctness and produce wrong results.
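To see why arbitrary ordering breaks a job tuned for a live source, consider a simplified late-event filter with 10 minutes of allowed lateness (this is a sketch of the idea, not Flink's actual watermark mechanism; the hour values follow the split example above):

```python
from datetime import datetime, timedelta

def filter_late_events(events, allowed_lateness=timedelta(minutes=10)):
    """Drop events older than (max event time seen so far - allowed lateness),
    mimicking how a streaming job discards late arrivals."""
    max_seen = datetime.min
    kept = []
    for ts in events:
        max_seen = max(max_seen, ts)
        if ts >= max_seen - allowed_lateness:
            kept.append(ts)
    return kept

h = lambda hour: datetime(2018, 7, 1, hour)

in_order = [h(0), h(1), h(2), h(3)]            # Kafka-like order
print(len(filter_late_events(in_order)))       # 4: nothing is late

hive_order = [h(0), h(23), h(12), h(3)]        # out-of-order Hive splits
print(len(filter_late_events(hive_order)))     # 2: hours 12 and 3 are dropped
```

Once the split containing hour 23 is read, everything more than 10 minutes older looks "late" and is silently discarded, which is exactly the wrong-result failure mode described above.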
We talked about Flink checkpoints earlier; they are how Flink implements fault tolerance. Upon job failure, Flink automatically rewinds the job to the last successful checkpoint. The rewound application state also includes the Kafka offsets.
Now let’s extend the basic idea to more general failure recovery.
With Flink rewind, we have neither the warm-up nor the ordering issue. Kafka offsets are part of the application state, so Flink rewinds them along with the rest of the state. After the rewind, the job resumes from a correct application state, so there is no warm-up issue. And since we are still working with the Kafka source, ordering is maintained.
Rewind works for all situations, stateless or stateful.
Obviously this requires users to resolve the issue before the Kafka retention expires; otherwise we can’t rewind. Our typical Kafka retention is hours or days.
Here is a comparison of the two solutions.
You may wonder if Flink rewind works for all use cases, why bother to provide Hive backfill.
We can only keep relatively short retention for Kafka data, typically hours or days. Our Hive data warehouse stores data in S3, usually with a few months of retention, so the data should be there when we need it.
With Hive backfill, we can probably achieve fast recovery. S3 is super scalable, and we can run the backfill job in parallel with the live job. We can even run the backfill job on a much bigger cluster, which lets it process events at a much higher rate and speeds up recovery.
Today we recommend Hive backfill for stateless jobs and Flink rewind for stateful jobs. But we are still exploring whether we can push the boundary.
For example, can we solve the ordering issue for the Hive source? Can we take care of the warm-up issue at the platform level? If we can, we can probably apply Hive backfill to stateful jobs too.
Or maybe we forget Hive as a backfill source altogether, make sure we have enough retention on our Kafka topics, and rely only on Flink rewind.
We don’t know for sure where we will ultimately land; we need to do more evaluation.
Whether it’s Hive backfill or Flink rewind, there are some caveats to watch out for.
At steady state, when there is no lag, the live job processes events as they come into Kafka, so throughput is determined by the incoming message rate. During the recovery phase it’s a different story, since now there is a backlog to clear: the streaming job will process events as fast as it can. As mentioned earlier, we can also run the backfill job with a bigger cluster to speed up recovery. It is possible for reprocessing to put 10x load on dependency services or sinks, so we need to watch out for that impact.
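A back-of-the-envelope way to size the recovery: with a backlog B, a live incoming rate, and a reprocessing rate, the catch-up time is B divided by the difference of the two rates. The numbers below are made up for illustration, not from the talk:

```python
def catch_up_time_hours(backlog_events, incoming_rate, processing_rate):
    """Hours to clear a backlog while new events keep arriving (rates in
    events/second). Requires processing_rate > incoming_rate."""
    if processing_rate <= incoming_rate:
        raise ValueError("job will never catch up at this rate")
    return backlog_events / (processing_rate - incoming_rate) / 3600

# Hypothetical: a 6-hour outage at 100k events/s leaves a 2.16B-event backlog.
backlog = 6 * 3600 * 100_000

# A 5x-sized backfill cluster does 500k events/s while 100k/s keep arriving.
print(round(catch_up_time_hours(backlog, 100_000, 500_000), 2))  # 1.5
```

The same formula also shows why a cluster sized barely above the live rate takes a very long time to recover, which is exactly when the 10x-load concern on downstream services matters.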
You may wonder why this is an issue: won’t back-pressure kick in and slow the streaming job down naturally? The problem is the impact on those external services, because they may be serving requests on the critical path of video streaming for Netflix customers, and we don’t want to affect that critical path.
We should define a reasonable recovery time and size the job cluster properly.
During the outage period, the streaming job probably continues to write output to the sink, and that output may be incorrect. Such side effects can’t be undone by the stream processing platform; it’s up to the application to handle them.
Duplicates are OK; the consumer can deal with them.
Idempotent sinks work too: e.g., ElasticSearch, Cassandra, and EVCache have last-write-wins semantics.
Or the “bad” data can be wiped out in the sink, e.g., by dropping the Hive table partitions containing the “bad” data and populating them again via backfill or rewind.
If a Flink job calls an external service, that service is unlikely to participate in the rewind. In this example, we have an application calling an A/B service to enrich the original data stream with A/B test cell allocation information.
Initially everything is fine; the job is processing the latest message X.
Msg X contains user Alice. So our streaming job asks A/B service which A/B test cell is Alice allocated to. A/B service returns cell A.
Then an outage happened.
When we query the A/B service again after the rewind, it now returns a different answer: cell B. This can affect computation correctness. How much of an issue this becomes is highly use-case dependent.
One possible way to solve this is to convert the A/B data from a table into a stream. Previously, we were essentially doing a table lookup against the A/B service.
If we convert the A/B data into a stream source and join it with the original data stream, the A/B data also becomes part of the application state. It is checkpointed together with the rest of the state, and when a rewind happens, the A/B data is rewound along with everything else.
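The difference can be sketched in plain Python (not Flink code; the A/B allocations and user names are made up). With an external lookup, a rewind re-queries the live service and may get a different answer; with a joined stream, the allocation lives in replayable, checkpointed state, so reprocessing is reproducible:

```python
def process(ab_stream, events):
    """Joined-stream approach: fold replayable A/B change events into
    application state (user -> cell), then enrich the main stream."""
    state = {}                        # in Flink this state is checkpointed
    for user, cell in ab_stream:      # join: A/B stream updates the state
        state[user] = cell
    return [(u, state.get(u)) for u in events]

ab_stream = [("alice", "A")]          # replayable A/B allocation events
events = ["alice"]                    # main data stream

first_run = process(ab_stream, events)
# Rewind: replay the same streams -> identical enrichment, even if the live
# A/B service would now answer "B" for alice.
after_rewind = process(ab_stream, events)
print(first_run == after_rewind)  # True
```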
Netflix is a pioneer in chaos engineering: we constantly trigger or simulate failures to evaluate how our systems react. Similarly, for stream processing applications, we encourage our users to regularly exercise the backfill and rewind tooling. Practice, practice, practice. Users become familiar with the recovery process, and we can make sure the tooling works when we need it.
We want to make stream processing so easy that a caveman can do it. That is our slogan.
Before I open up for questions, I want to mention that I will be at the O'Reilly booth between 3 and 4 pm this afternoon. If you have more questions or would just like to chat, please drop by.