Building a real-time streaming platform using Kafka Connect + Kafka Streams


How Apache Kafka brings an event-centric approach to building streaming applications, and how to use Kafka Connect and Kafka Streams to build them.

  • Hi, I’m Neha Narkhede…
There is a big paradigm shift happening around the world where companies are moving rapidly towards leveraging data in real time and fundamentally moving away from batch-oriented computing. But how do you do that? Well, that is what today’s talk is about. I’m going to summarize 6 years of work in 15 mins, so let’s get started.
• Unordered, unbounded, large-scale datasets are increasingly common in day-to-day business. Stream data means different things for different businesses: for retail it might mean streams of orders and shipments, for finance streams of stock ticker data, and for web companies streams of user activity data. Stream data is everywhere. At the same time, there is a huge push towards getting faster results: instant credit card fraud detection, instant credit card payment processing instead of only five times a day, and being able to detect and alert on a problem that causes retail sales to dip within seconds instead of a day later (you can only imagine what a day’s delay would do to retail companies over Black Friday).
• So the takeaway is that businesses operate in real time, not in batch: if you go to a store to buy something, you don’t wait there for several hours to get it. So the data processing required to make key business decisions and to operate a business effectively should also happen in real time.

    Here are some examples to support that claim…
  • Event = something that happened. Different for different businesses.
  • Log files are also event streams. For instance, every line in a log file is an event that in this case tells you how the service is being used.
• There is an inherent duality between tables and streams. Traditional databases are all about tables full of state, but they are not designed to respond to the streams of events that modify those tables.
  • Tables have rows that store the latest value for a unique key. But…no notion of time
  • If you look at how a table gets constructed over time, you will notice that…
  • The operations are actually a stream of events where the event is just the operation that modifies the table.
    Every database does this internally and it is called a changelog
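The changelog idea above can be sketched in a few lines: replaying a stream of key/value operations reconstructs the table’s latest state. This is a generic illustration, not Kafka API code:

```python
# Illustrative sketch (not a Kafka API): replaying a changelog of
# (operation, key, value) events reconstructs the table's current state.

def materialize(changelog):
    """Apply a stream of (op, key, value) events to build a table."""
    table = {}
    for op, key, value in changelog:
        if op == "upsert":
            table[key] = value          # latest value wins for each key
        elif op == "delete":
            table.pop(key, None)        # a delete event removes the row
    return table

changelog = [
    ("upsert", "item-1", {"views": 1}),
    ("upsert", "item-2", {"views": 1}),
    ("upsert", "item-1", {"views": 2}),  # later event overwrites the earlier one
    ("delete", "item-2", None),
]

print(materialize(changelog))  # {'item-1': {'views': 2}}
```

Running the changelog forward always yields the same table, which is exactly why databases can use it internally for replication and recovery.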
  • So events are everywhere, what next? We need to fundamentally move to event-centric thinking. For a retail website, there are possibly various avenues that generate the “product view” event. A standard thing to do is to ensure that all product view data ends up in Hadoop so you can run analytics on user interest to power various business functions from marketing to product positioning and so on.
• The reality is about 100x more complex. In some corner, you are using some messaging system for app-to-app communication. You might have a custom way of loading data from various databases into Hadoop. Then more destinations appear over time, and now you have to feed the same data to a search system, various caches, etc.
    This is a common reality and a simplified version.

    300 services
    ~100 databases
    Trolling: load into Oracle, search, etc
  • The core insight is that a data pipeline is also an event stream.
  • What you need instead of that scary picture is a central streaming platform at the heart of a datacenter. A central nervous system that collects data from various sources and feeds all other systems and apps that need to consume and process data in real-time.

    Why does this make sense?
  • Why is a streaming platform needed? Because data sources and destinations add up over time. Initially you might have just the web app that produces the product view event and maybe you’ve only thought about analyzing it in Hadoop.
• But over time, the mobile app shows up, also producing the same data, and several more applications appear as destinations: search, recommendations, security, etc.

    Event centric thinking involves building a forward-compatible architecture. You will never be able to foresee what future apps might show up that will need the same data. So capture it in a central, scalable streaming platform that asynchronously feeds downstream systems.

  • So how do you build such a streaming platform?
  • That journey starts with Apache Kafka.
• At a high level, Kafka is a pub-sub messaging system: producers capture events, events are sent to and stored on a central cluster of brokers, and consumers subscribe to topics, i.e. named categories of data. The end-to-end, producer-to-consumer data flow is real-time.
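The producer/broker/consumer model can be sketched with a toy in-memory version: a topic is an append-only log, and each consumer tracks its own read position (offset). This is an illustration of the semantics only, not the real Kafka client API:

```python
# Toy in-memory model of Kafka's pub/sub semantics (not the real client API):
# a topic is an append-only log, and each consumer tracks its own offset.

class Broker:
    def __init__(self):
        self.topics = {}                      # topic name -> list of events

    def produce(self, topic, event):
        """Append an event to the end of the topic's log."""
        self.topics.setdefault(topic, []).append(event)

    def consume(self, topic, offset):
        """Return events at or after `offset`; the log itself is never mutated."""
        log = self.topics.get(topic, [])
        return log[offset:], len(log)         # events plus the new offset

broker = Broker()
broker.produce("product-views", {"user": "alice", "product": "shoes"})
broker.produce("product-views", {"user": "bob", "product": "hats"})

# Any number of independent consumers can read the same topic at their own pace.
events, offset = broker.consume("product-views", 0)
print(len(events), offset)  # 2 2
```

Because consuming never removes data from the log, many downstream systems can subscribe to the same topic independently, which is the property the central-nervous-system picture relies on.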
  • Magic of Kafka is in the implementation. It is not just a pub-sub messaging system, it is a modern distributed platform…

    How so?
• All that means you can throw lots of data at Kafka and have it made available throughout the company within milliseconds. At LinkedIn and several other companies, Kafka is deployed at a large scale…
• In the 5 years since it was open-sourced, it has been widely adopted by thousands of companies worldwide.
  • So Kafka is the foundation of the central streaming platform.
• Infrastructure is really only as useful as the data it has. The next step in moving to a streaming-platform-based data architecture is solving the ETL problem.
  • Kafka Connect, introduced in Kafka 0.9
  • REST APIs for management
  • Core: Data pipeline
    Venture bet: Stream processing
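The notes above mention Kafka Connect and its REST APIs for management. As an illustration, a source connector can be registered by POSTing a JSON config to a Connect worker; the connector name, file path, and topic below are hypothetical, using the FileStreamSource connector that ships with Kafka:

```json
{
  "name": "file-source-example",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/var/log/app/events.log",
    "topic": "app-events"
  }
}
```

A sketch of the request, assuming a Connect worker listening on the default port 8083: `curl -X POST -H "Content-Type: application/json" --data @file-source.json http://localhost:8083/connectors`.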
  • Most people think they know…
  • Stream processing doesn’t mean you drop data on the floor if anything slows down
    Streaming algorithms operate online, in limited space
    You can compute a median incrementally
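As an aside on the online-algorithm point: a running median can be maintained incrementally with two heaps, processing each element exactly once instead of re-sorting a batch on every query. This is a generic illustration, not something from the talk:

```python
import heapq

# Sketch of an online (streaming) algorithm: a running median kept with two
# heaps. Each new element is processed once, and the median is available at
# any point in the stream without re-sorting.

class RunningMedian:
    def __init__(self):
        self.lo = []   # max-heap (stored negated) for the lower half
        self.hi = []   # min-heap for the upper half

    def add(self, x):
        # Push through the lower half, then rebalance so len(lo) >= len(hi).
        heapq.heappush(self.lo, -x)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2

rm = RunningMedian()
for x in [5, 1, 3, 2, 4]:
    rm.add(x)
print(rm.median())  # 3
```

Note that an exact median still needs all elements retained; truly sublinear-space streaming algorithms give approximate answers, which is usually the trade-off in practice.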
  • About how inputs are translated into outputs (very fundamental)
    All databases
    Run all the time
Each request totally independent; no real ordering
    Can fail individual requests if you want
    Very simple!
    About the future!
  • “Ed, the MapReduce job never finishes if you watch it like that”
    Job kicks off at a certain time
Processes all the input, produces all the output
    Data is usually static
    DWH, JCL
Archaic but powerful. Can do analytics! Complex algorithms!
    Also can be really efficient!
    Inherently high latency
  • Generalizes request/response and batch.
    Program takes some inputs and produces some outputs
    Could be all inputs
    Could be one at a time
    Runs continuously forever!
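The claim that streaming generalizes the other two models can be made concrete with a sketch (plain Python, not Kafka): one processing function serves the request/response shape, the batch shape, and the unbounded streaming shape:

```python
import itertools

# One processing function covers all three models:
# request/response = feed it one input; batch = feed it a finite collection;
# streaming = feed it an unbounded iterator and consume results as they appear.

def process(events):
    """Uppercase each event; works identically for 1, N, or unbounded inputs."""
    for event in events:
        yield event.upper()

# Request/response: exactly one input, one output.
print(next(process(iter(["hello"]))))            # HELLO

# Batch: all inputs at once, all outputs at the end.
print(list(process(["a", "b", "c"])))            # ['A', 'B', 'C']

# Streaming: the same code over a (potentially) unbounded source.
stream = (f"event-{i}" for i in itertools.count())
first_three = list(itertools.islice(process(stream), 3))
print(first_three)                               # ['EVENT-0', 'EVENT-1', 'EVENT-2']
```

The only thing that changes across the three calls is the shape of the input, which is the sense in which streaming is the general case.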
  • For some time, stream processing was thought of as a faster map-reduce layer useful for faster analytics, requiring deployment of a central cluster much like Hadoop. But in my experience, I’ve learnt that the most compelling applications that do stream processing look much more like an event-driven microservice and less like a Hive query or Spark job.
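To make the microservice analogy concrete, here is a minimal sketch (plain Python, not the Kafka Streams API, which is Java) of a small stateful processor that consumes events one at a time, keeps a per-key count, and emits an updated record after every event — the same shape as a grouped count in Kafka Streams, minus the framework:

```python
# Hedged sketch of the "event-driven microservice" style of stream processing:
# consume events one at a time, keep per-key state, and emit an updated
# (key, count) record for every input event, like a changelog of the counts.

def count_by_key(events):
    counts = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]       # each input event yields an update

views = ["shoes", "hats", "shoes", "shoes"]
updates = list(count_by_key(views))
print(updates)  # [('shoes', 1), ('hats', 1), ('shoes', 2), ('shoes', 3)]
```

The output stream of updates is itself a changelog, so it can be materialized back into a table by any downstream consumer — the table/stream duality from earlier, applied to application logic rather than a database.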
  • Companies == streams
What does a retail store do?
    - Sales
    - Shipments and logistics
    - Pricing
    - Re-ordering
    - Analytics
    - Fraud and theft
  • Let’s dive into the real-time analytics and apps area
• There’s only one thing you can do if you think the world needs to change and you live in Silicon Valley: quit your job and do it.
    Mission: Build a Streaming Platform
    Product: Confluent Platform
    1. Building a real-time streaming platform using Kafka Connect + Kafka Streams (Jeremy Custenborder, Systems Engineer, Confluent)
    2. • Everything in the company is a real-time stream
       • > 1.2 trillion messages written per day
       • > 3.4 trillion messages read per day
       • ~ 1 PB of stream data
       • Thousands of engineers
       • Tens of thousands of producer processes
    3. Resources
       • Confluent: company website, blog, free ebook “Making Sense of Stream Processing”
       • Apache Kafka
       • Kafka Connect: low-latency-data-pipelines
       • Kafka Streams: made-simple
    4. Thanks! Jeremy Custenborder. Download Kafka and Confluent Platform