Hi, I’m Neha Narkhede… There is a big paradigm shift happening around the world where companies are moving rapidly towards leveraging data in real-time and fundamentally moving away from batch-oriented computing. But how do you do that? Well that is what today’s talk is about. I’m going to summarize 6 years of work in 15 mins, so let’s get started.
Unordered, unbounded, and large-scale datasets are increasingly common in day-to-day business. Stream data means different things for different businesses: for retail, it might mean streams of orders and shipments; for finance, streams of stock ticker data; for web companies, streams of user activity data. Stream data is everywhere. At the same time, there is a huge push towards getting faster results: detecting credit card fraud instantly, processing credit card payments instantly rather than only 5 times a day, being able to detect and alert on a problem that causes retail sales to dip within seconds rather than a day later (you can only imagine what that would do to retail companies over Black Friday).
So the takeaway is that businesses operate in real-time, not batch. If you go to a store to buy something, you don't wait there for several hours to get it. So the data processing required to make key business decisions and to operate a business effectively should also happen in real-time.
Here are some examples to support that claim…
Event = something that happened. Different for different businesses.
Log files are also event streams. For instance, every line in a log file is an event that in this case tells you how the service is being used.
There is an inherent duality between tables and streams: traditional databases are all about tables full of state, but they are not designed to respond to the streams of events that modify those tables.
Tables have rows that store the latest value for a unique key. But… there is no notion of time.
If you look at how a table gets constructed over time, you will notice that…
The operations are actually a stream of events, where each event is just the operation that modifies the table. Every database does this internally; it is called a changelog.
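The table/stream duality is easy to see in code. Here is a minimal sketch (the event shapes and key names are illustrative, not any database's actual changelog format): replaying the changelog in order reconstructs the table's latest state per key.

```python
# Illustrative sketch: a table is just a changelog stream, folded up.
# Each event is an (operation, key, value) tuple; replaying the stream
# in order reconstructs the latest value for every key.
changelog = [
    ("upsert", "user_1", {"name": "Alice", "city": "SF"}),
    ("upsert", "user_2", {"name": "Bob", "city": "NY"}),
    ("upsert", "user_1", {"name": "Alice", "city": "LA"}),  # later update wins
    ("delete", "user_2", None),
]

def materialize(events):
    """Fold a stream of change events into a table (dict of latest values)."""
    table = {}
    for op, key, value in events:
        if op == "upsert":
            table[key] = value
        elif op == "delete":
            table.pop(key, None)
    return table

table = materialize(changelog)
# table now holds only user_1, with the latest city, "LA"
```

Running the fold over a prefix of the changelog gives you the table as it looked at that point in time, which is exactly the notion of time the table alone is missing.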
So events are everywhere, what next? We need to fundamentally move to event-centric thinking. For a retail website, there are possibly various avenues that generate the “product view” event. A standard thing to do is to ensure that all product view data ends up in Hadoop so you can run analytics on user interest to power various business functions from marketing to product positioning and so on.
The reality is about 100x more complex. In some corner, you are using some messaging system for app-to-app communication. You might have a custom way of loading data from various databases into Hadoop. But then more destinations appear over time, and now you have to feed the same data to a search system, various caches, etc. This is a common reality, and even this picture is a simplified version.
The core insight is that a data pipeline is also an event stream.
What you need instead of that scary picture is a central streaming platform at the heart of a datacenter. A central nervous system that collects data from various sources and feeds all other systems and apps that need to consume and process data in real-time.
Why does this make sense?
Why is a streaming platform needed? Because data sources and destinations add up over time. Initially you might have just the web app that produces the product view event and maybe you’ve only thought about analyzing it in Hadoop.
But over time, a mobile app shows up that also produces the same data, and several more applications appear as destinations: search, recommendations, security, etc.
Event centric thinking involves building a forward-compatible architecture. You will never be able to foresee what future apps might show up that will need the same data. So capture it in a central, scalable streaming platform that asynchronously feeds downstream systems.
So how do you build such a streaming platform?
That journey starts with Apache Kafka.
At a high level, Kafka is a pub-sub messaging system: producers capture events, events are sent to and stored on a central cluster of brokers, and consumers subscribe to topics, i.e., named categories of data. End-to-end, producer-to-consumer data flow is real-time.
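The core abstraction behind this model is an append-only log per topic, with each consumer tracking its own read offset. A toy in-memory sketch of that idea (illustrative only; real Kafka partitions, replicates, and persists these logs across brokers):

```python
# Toy in-memory model of the topic-as-log abstraction (not Kafka's
# actual implementation): producers append, consumers read by offset.
class Topic:
    def __init__(self, name):
        self.name = name
        self.log = []  # append-only sequence of events

    def produce(self, event):
        """Append an event; return its offset in the log."""
        self.log.append(event)
        return len(self.log) - 1

    def consume(self, offset):
        """Read all events at or after the given offset."""
        return self.log[offset:]

views = Topic("product-views")
views.produce({"user": "u1", "product": "p42"})
views.produce({"user": "u2", "product": "p7"})

# Two independent consumers track their own offsets into the same log:
analytics = views.consume(0)   # reads everything from the start
indexer = views.consume(1)     # resumes from where it left off
```

Because consumers pull at their own pace from their own offsets, adding a new downstream system never requires touching the producers, which is what makes the "central nervous system" picture work.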
Magic of Kafka is in the implementation. It is not just a pub-sub messaging system, it is a modern distributed platform…
All that means, you can throw lots of data at Kafka and have it be made available throughout the company within milliseconds. At LinkedIn and several other companies, Kafka is deployed at a large scale…
In the last 5 years since it was open-sourced, it has been widely adopted by 1000s of companies worldwide.
So Kafka is the foundation of the central streaming platform.
Infrastructure is really only as useful as the data it has. The next step in moving to a streaming-platform-based data architecture is solving the ETL problem.
REST APIs for management
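This is where Kafka Connect comes in: connectors stream data between Kafka and external systems, configured declaratively and managed over a REST API rather than with custom loading code. As a hedged sketch, a source connector config might look like the following (the connector class and property names follow Confluent's JDBC source connector; the connection details and names are placeholders):

```json
{
  "name": "orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://localhost:5432/shop",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "orders",
    "topic.prefix": "db-"
  }
}
```

POSTing a config like this to the Connect REST API continuously copies new rows from the `orders` table into a Kafka topic, turning the database-to-Hadoop pipeline into just another event stream.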
Core: data pipeline
Venture bet: stream processing
Most people think they know…
- Doesn’t mean you drop everything on the floor if anything slows down
- Streaming algorithms: the online-space model
- Can compute a median on the fly
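The running median is the classic example of an online algorithm. A minimal sketch using two heaps (a standard textbook technique, not tied to any particular streaming framework): the heaps hold the lower and upper halves of the values seen so far, so the median is available at any point without storing or re-sorting the full stream.

```python
import heapq

class RunningMedian:
    """Online median over a stream, using two balanced heaps."""

    def __init__(self):
        self.lo = []  # max-heap of the lower half (values stored negated)
        self.hi = []  # min-heap of the upper half

    def add(self, x):
        # Push into the lower half, then move its max to the upper half
        # so every element of lo is <= every element of hi.
        heapq.heappush(self.lo, -x)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        # Rebalance so lo holds at most one extra element.
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2

m = RunningMedian()
for v in [5, 1, 9, 3, 7]:
    m.add(v)
# m.median() is now the median of {1, 3, 5, 7, 9}, i.e. 5
```

Each element costs O(log n) to insert and the median query is O(1), which is why this works on an unbounded stream where sorting would not.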
About how inputs are translated into outputs (very fundamental)
Request/response:
- HTTP/REST
- All databases
- Runs all the time
- Each request totally independent: no real ordering
- Can fail individual requests if you want
- Very simple!
- About the future!
Batch:
- “Ed, the MapReduce job never finishes if you watch it like that”
- Job kicks off at a certain time (cron!)
- Processes all the input, produces all the output
- Data is usually static
- Hadoop! DWH, JCL: archaic but powerful
- Can do analytics! Complex algorithms!
- Also can be really efficient!
- Inherently high latency
Stream processing:
- Generalizes request/response and batch
- Program takes some inputs and produces some outputs
- Could be all inputs, could be one at a time
- Runs continuously, forever!
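That generalization can be made concrete with a tiny sketch (illustrative only, not any framework's API): a stream processor is just a program over an iterator of inputs that emits outputs as it goes. Fed one event, it behaves like request/response; fed a finite dataset, it behaves like batch; fed a live feed, it runs forever.

```python
# Illustrative stream processor: a running count per key over an
# (in principle unbounded) iterator of events, emitting as it goes.
def count_by_key(events):
    counts = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]  # emit an updated result per input event

# Batch-like use: a finite input drains the stream and stops.
stream = iter(["view", "click", "view", "view"])
results = list(count_by_key(stream))
# results == [("view", 1), ("click", 1), ("view", 2), ("view", 3)]
```

Because the generator never needs the whole input up front, the same logic works unchanged whether the source is a file, a single request, or a Kafka topic that never ends.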
For some time, stream processing was thought of as a faster map-reduce layer useful for faster analytics, requiring deployment of a central cluster much like Hadoop. But in my experience, I’ve learnt that the most compelling applications that do stream processing look much more like event-driven microservices and less like Hive queries or Spark jobs.
Companies == streams. What does a retail store do? Streams for retail:
- Sales
- Shipments and logistics
- Pricing
- Re-ordering
- Analytics
- Fraud and theft
Let’s dive into the real-time analytics and apps area
There is only one thing you can do if you think the world needs to change and you live in Silicon Valley: quit your job and do it. Mission: build a streaming platform. Product: Confluent Platform.
Building a real-time streaming platform using Kafka Connect + Kafka Streams
Jeremy Custenborder, Systems Engineer, Confluent
• Everything in the company is a real-time stream
• > 1.2 trillion messages written per day
• > 3.4 trillion messages read per day
• ~ 1 PB of stream data
• Thousands of engineers
• Tens of thousands of producer processes