Real-Time Messages at Scale
with Apache Kafka
Will Gardella
Product Manager
©2015 Couchbase Inc. 2
Agenda
• You might need Kafka if…
• Kafka architecture
• Background - Couchbase
• Couchbase & Kafka
• Behind the scenes
• Demo
• An example producer and consumer
What’s Apache Kafka for?
You might need Kafka if…
You might need Kafka if…
Photo Credit: Cory Doctorow
https://www.flickr.com/photos/doctorow/14638938
Different speeds for different systems
• NoSQL
• RDBMS
• Cache
• Search
• Apps
• Metrics
• Logs
• Hadoop
• Relational Data Warehouse
Source: Confluent
Typical Kafka Use Cases
Kafka Architecture
[Diagram] Three producers publish to a Kafka cluster of three brokers (Broker 1, Broker 2, Broker 3); three consumers read from it.
Kafka Architecture
[Diagram] The same cluster, now showing Zookeeper coordinating the brokers. Each broker hosts topic partitions (Topic 1 – Partition 1, Topic 1 – Partition 2, Topic 2 – Partition 1, Topic 2 – Partition 2, Topic 3 – Partition 1, Topic 3 – Partition 2): topics are partitioned and replicated across Broker 1, 2, and 3.
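The architecture sketched above (brokers hosting partitioned, append-only topic logs; keyed producers; consumers that track their own offsets) can be modeled in a few lines. This is a toy illustration only, with made-up class names; real Kafka persists the log to disk and replicates partitions across brokers.

```python
class Partition:
    """An append-only message log, like one Kafka topic partition."""

    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1   # offset of the new message

    def read(self, offset):
        return self.log[offset] if offset < len(self.log) else None


class Topic:
    """A topic split across a fixed number of partitions."""

    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def produce(self, key, message):
        # Keyed messages always land in the same partition,
        # preserving per-key ordering.
        p = hash(key) % len(self.partitions)
        return p, self.partitions[p].append(message)


topic = Topic(num_partitions=2)
topic.produce("user-1", "login")
topic.produce("user-1", "purchase")

# The broker keeps no per-message read state; each consumer just
# remembers its own offset and reads the log sequentially.
p = hash("user-1") % 2
assert topic.partitions[p].read(0) == "login"
assert topic.partitions[p].read(1) == "purchase"
```

Because the broker only appends and consumers only remember offsets, many independent consumers can read the same partition without coordinating with one another.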
Couchbase Server 4.0
A brief Introduction
Couchbase Server 4.0 for modern applications
Combines the flexibility of JSON, the power of SQL, and the scale of NoSQL
Develop with Agility:
• Flexible JSON data model
• Dynamic schema support
• A powerful query language that extends SQL to JSON
Operate at Any Scale:
• Sub-millisecond latencies at scale
• Elastic scaling on commodity servers
• High availability
Couchbase Server Defined
The first NoSQL database that enables you to develop with
agility and operate at any scale.
[Diagram] Managed cache • Key-value store • Document database • Embedded database • Sync management
The Power Of The Flexible JSON Schema
Ability to store data in multiple ways:
• A denormalized single document, as opposed to normalizing data across multiple tables
• A dynamic schema to add new values when needed
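As a small illustration of the denormalized, dynamic-schema approach described above, here is a hypothetical order modeled as one JSON document rather than rows spread across several relational tables (the field names are invented for the example):

```python
import json

# One denormalized document: customer and line items live together,
# instead of being normalized into customer / order / line_item tables.
order = {
    "type": "order",
    "customer": {"name": "J. Smith", "email": "jsmith@example.com"},
    "items": [
        {"sku": "A-100", "qty": 2, "price": 9.99},
        {"sku": "B-205", "qty": 1, "price": 24.50},
    ],
}

# Dynamic schema: a new attribute can be added when needed,
# with no migration step.
order["coupon"] = "WINTER15"

doc = json.dumps(order)
assert "WINTER15" in doc
```

The whole object round-trips as a single document, which is why reads that would need joins in a relational model become single key-value fetches here.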
Couchbase and Other Big Data Systems
[Diagram] Up to 10^10 application users are served by the NoSQL database; 10^1–10^2 data scientists and engineers work with Kafka, Hadoop, Spark, Elasticsearch, and the EDW.
Kafka & Couchbase Use Cases
Couchbase & Kafka Use Cases
• Couchbase as the master database
– Changes in the bucket update data elsewhere
• Triggers / event handling
– Handle events like deletions / expirations externally
– E.g. expiration & replicated session tokens
• Real-time data integration
– Extract from Couchbase, transform, and load data in real time
• Real-time data processing
– Extract from a bucket, process in real time, and load back to another bucket
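The last use case above (read from one bucket, process, write to another) can be sketched as a plain transform loop. The bucket contents, key scheme, and transform are all invented for illustration; in a real deployment the stream between the two sides would flow through Kafka rather than an in-process loop.

```python
# Stand-in "buckets": plain dicts keyed by document ID.
source_bucket = {
    "order::1": {"total": 120.0},
    "order::2": {"total": 45.0},
}
dest_bucket = {}

def process(doc_id, doc):
    # Example transform: flag large orders for review.
    return {"order_id": doc_id, "flagged": doc["total"] > 100}

# Extract from the source bucket, transform, load into the destination.
for doc_id, doc in source_bucket.items():
    dest_bucket["review::" + doc_id] = process(doc_id, doc)

assert dest_bucket["review::order::1"]["flagged"] is True
```

Keeping the processed results in a second bucket means the application serving users never competes with the processing job for the source data.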
The Couchbase Kafka Connector
How it works
Database Change Protocol (DCP)
Couchbase Server’s internal data sync mechanism since Couchbase Server 3.x
• Used for:
– Intra-cluster replication
– Indexing
– XDCR (Cross Datacenter Replication, for HA/DR)
– Some connectors, including Kafka and Spark, which use the Couchbase Java SDK 2.x core-io library for DCP handling
• Sends mutations:
– A mutation is the creation, update, or deletion of an item
– Each mutation that occurs in a vBucket has a sequence number
Important: DCP is not supported for external clients!
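The sequence-number behavior described above can be sketched as follows: every mutation in a vBucket gets a monotonically increasing sequence number, much like a Kafka offset. This is an illustrative model only (the class and field names are invented), and, as the slide warns, DCP itself is not supported for external clients.

```python
class VBucket:
    """Toy model of a vBucket assigning sequence numbers to mutations."""

    def __init__(self):
        self.seqno = 0
        self.mutations = []

    def mutate(self, key, value):
        # Creates, updates, and deletes are all just mutations,
        # each stamped with the next sequence number.
        self.seqno += 1
        self.mutations.append({"key": key, "value": value, "seqno": self.seqno})
        return self.seqno


vb = VBucket()
vb.mutate("user::1", {"name": "a"})
vb.mutate("user::1", {"name": "b"})   # an overwrite is just another mutation

# Key + sequence number gives version information; a connector can
# resume streaming from the last sequence number it processed.
assert [m["seqno"] for m in vb.mutations] == [1, 2]
```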
An Example Producer and Consumer
Connecting Couchbase via Kafka to an Application
Kafka Generator Example
Kafka Producer Example
Kafka Producer Example
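The producer slides here were code screenshots; the speaker notes describe a filter that logs each DCP event and can mark events false so they are not written to Kafka. As a stand-in, here is a hedged sketch of that filtering step. The event shape and the `keep_event` name are hypothetical, not the connector's actual API.

```python
def keep_event(dcp_event):
    """Log the event; return False to drop it before it reaches Kafka."""
    print("dcp event:", dcp_event)
    # Example policy (invented for illustration): forward mutations,
    # drop deletions.
    return dcp_event.get("operation") != "delete"


events = [
    {"operation": "mutation", "key": "session::1"},
    {"operation": "delete", "key": "session::2"},
]

# Only events the filter approves would be published to the topic.
forwarded = [e for e in events if keep_event(e)]
assert [e["key"] for e in forwarded] == ["session::1"]
```

The demo's filter passes everything through and exists only so the events can be read on the console; the drop-deletions policy above just shows where custom logic would hook in.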
A Kafka Consumer Example
A Kafka Consumer Example
Demo
Couchbase Kafka Connector Roadmap
Available now: 1.2 GA
• Kafka producer or consumer
• Stream events
• Filters
• Transform events
Planned
• Monthly maintenance releases
Under discussion
• Merge code for the Storm connector
• Adopt Kafka Connect (Kafka 0.9)
• ???
Code: https://github.com/couchbase/couchbase-kafka-connector/
Issues: https://issues.couchbase.com/projects/KAFKAC
Docs: http://developer.couchbase.com/documentation/server/4.1/connectors/kafka-1.2/kafka-intro.html
Learn More - Couchbase Kafka Connector
Confluent’s Ewen Cheslack-Postava at Couchbase Connect 2015
• A great high-level intro to Kafka in ~20 minutes
• https://youtu.be/fFPVwYKUTHs
Couchbase and Kafka - Up and Running in 10 Minutes
• Run through the sample code yourself
• http://blog.couchbase.com/2015/november/kafka-and-couchbase-up-and-running-in-10-minutes
Product docs
• http://developer.couchbase.com/documentation/server/4.1/connectors/kafka-1.2/kafka-intro.html
Avalon Consulting blog and GitHub repo
• http://blogs.avalonconsult.com/blog/big-data/purchase-transaction-alerting-with-couchbase-and-kafka/
• https://github.com/Avalon-Consulting-LLC/couchbase-kafka
Thank you.
will.gardella@couchbase.com
Twitter: @WillGardella


Editor's Notes

  • #2 Don’t forget to intro
  • #3 If you tweet about Kafka, you may be followed by this weird Franz Kafka bot…
  • #4 … but the real Franz Kafka died of tuberculosis in 1924…
  • #5 Interestingly, this Franz Kafka bot sounds quite a bit like the real Franz Kafka. Not overly cheery… I thought perhaps it was real quotes from his work
  • #6 But no – it just sounds like it could be him…
  • #9 It helps you decouple systems in time – systems can be asynchronous, but this is more than that: consuming systems don't have to be on, or even exist, at the time that producers are making messages. A schema registry describes what is known about data produced in different systems. It's a publish/subscribe system hooking up application server logs, caches, databases, and so forth. You don't want each system to have to be matched or hand-integrated with every other system and service, with different adapter code, different error-handling behavior, logging, etc. And how do you share metadata? What do you do with different revisions of the schema? That's madness. Imagine trying to add a new service that needs to read from 10 other services… Organizationally that's difficult, and every team has the potential to make different decisions about what systems to use…
  • #10 Kafka helps mitigate different expectations of speed and size of data being ingested in various systems. Hadoop – HDFS specifically – can take tons of data, but not in tiny pieces; it's a batch-oriented system. NoSQL databases like Couchbase can scale to billions of users with sub-millisecond response times, but not with bulk load. Compare application server logs, a bulk database extraction, and processing a stream of Twitter messages. There can be issues with integrations where the slowest system sets the pace.
  • #11 Vision is to have a scalable, low-latency pub/sub message queue as the standard interface for realtime streaming data. Hadoop – HDFS specifically – fills this role for batch systems and led to a large ecosystem of useful tools that can interoperate via Hadoop data storage. Kafka does the same for realtime data, and can scale to handle your entire organization's data. Kafka acts as the hub and applications hang off of it, exchanging data through Kafka. We refer to this architecture as a stream data platform. Reminder: on this slide, talk about the differences between Couchbase and Hadoop – they are complementary; they solve different problems. Messaging: decouple data processing from data producers. Log aggregation: a log as a stream of messages. Stream processing: consume data from one topic and put the filtered/transformed data into another one. Click-stream analysis: page views/searches as real-time publish-subscribe feeds.
  • #13 Publish/subscribe. Broker: stores messages; failover (leader vs. follower); load balanced. Producer: publishes data/messages to the topic. Consumer: applications/processes/threads that are subscribed to the topic; can be grouped (consumer groups) in order to process messages in parallel; multiple consumer instances can load-balance reading the partitions of a topic; consumer groups are elastic and fault tolerant.
  • #14 Topic: a distributed and partitioned message queue. Topics are partitioned so they can scale across multiple servers, and partitions are also replicated for fault tolerance. This is what the producers actually write to and what the consumers actually read; it scales the Kafka brokers. High performance: Kafka operates only on logs – they are always append-only, and messages are always read sequentially. It does not track per-message read state – you don't need to, because access is sequential. Retention is based on policy, either time-based or size-based. Multiple consumers reading from the same log means each can do what it needs to do (they know where they left off; Kafka doesn't need to). This is like DCP in Couchbase. Zookeeper: a distributed synchronization and configuration store, needed to partition topics and to support consumer groups (where multiple consumers work together to process a Kafka topic in parallel).
  • #16 Multiple data models N1QL - SQL-Like query language Multiple indexes SDKs, ODBC / JDBC drivers and frameworks Push-button scalability Consistent high-performance Always on 24x7 with HA - DR Easy Administration with Web UI, Rest API and CLI
  • #17 KEY POINT: COUCHBASE HAS YOU COVERED FOR YOUR GENERAL PURPOSE DB NEEDS. FROM CACHING TO KV STORE, TO JSON DOCUMENT STORE, TO MOBILE APPS. NO OTHER NOSQL DB VENDOR HAS THIS BREADTH AND DEPTH OF TECHNOLOGY The purpose of this slide is to discuss the high level concepts of Couchbase, and if the SE wants to discuss what parts of Couchbase make up each concept. It is not to go over specific technologies like N1QL, ODBC, etc
  • #18 KEY POINT: YOU HAVE THE OPTION TO REPRESENT DATA QUITE DIFFERENTLY USING JSON AS OPPOSED TO A RELATIONAL DATABASE. - Where in relational databases you might have to have multiple tables to best represent your data, in JSON you can model your data like an object might already be in your programming language of choice. No ORM (Object Relational Model) needed. You can do relationships in Couchbase, but they are different than in a relational database and outside of the scope of an intro call normally. Make sure to stress that normalization is still something that can be done in Couchbase where it makes sense for the application, but this diagram is something that helps people coming from relational understand what is possible for JSON.
  • #19 Work people do in these systems: training ML models, ETL / data wrangling, aggregations, reporting / BI. Kafka is a data multiplexer – some people are still going to want to do this, but it's designed for higher-latency applications with a known high complexity (e.g. eBay – many different consumers for information). A traditional data warehouse will definitely be a different programming language – how do you make sense of the data feed? You get into the problem that making changes on one side introduces tons of complexity on the other. Downsides: maturity is not 100% on the Spark side, and KV / N1QL are still in active development on the Couchbase side.
  • #20 KEY POINT: ENTERPRISES ARE USING COUCHBASE ACROSS A RANGE OF MISSION CRITICAL USE CASES. As the slide shows, Couchbase supports a wide range of use cases, from Profile Management to HA Cache. Each use case has its own set of requirements – some need very high performance, some need very high availability, some need flexibility of the data model. The ability to meet all of these requirements is what has driven adoption of Couchbase by large enterprise companies You should memorize a few things about a customer use per case so you can quickly go through these. What you want is a sound bite per use case.
  • #22 1. All your data is managed in Couchbase and the other systems record these changes – for example, a user's purchase might be logged. 2. A user's web session is being stored in a Couchbase bucket and you want to react to it – for example, delete the session in another system, like people do in single sign-on. Couchbase can handle 100,000s of operations per second. 3. Real-time data integration: for example, you want to do a quick check on purchases to see if there's anything suspicious about them – that may be done in another system. 4. In this case, it's important to note that Couchbase can be a Kafka consumer or a Kafka producer, so you can do tasks like ML – flow data out, train models, and flow data back into Couchbase. This is similar to number 3, but the difference is you're loading something back into Couchbase so that users can quickly interact with it. You may have systems that build recommendations and then flow those back into Couchbase so that the next visitors get a slightly better mix of product offers. Write data to a topic, process it with a framework, and load it back into another separate bucket to serve users.
  • #24 Skip if short of time – don’t need to cover anything besides DCP Punchline is, this mechanism allows Couchbase to scale elastically and without downtime while still enabling any client to find exactly where the active copy of a piece of data is (using the cluster map) Multiple buckets can exist within a single cluster of nodes (1, 2 or 3 extra copies) Each data set has 1024 Virtual Buckets (vBuckets) Each vBucket contains 1/1024th portion of the data set vBuckets do not have a fixed physical server location
  • #25 Add lots of notes
  • #26 Add lots of notes
  • #27 What can possibly go wrong if you write your own connector with DCP? A lot – First of all, you need to be able to drink from the firehose. Couchbase 100K’s of messages per second – the Kafka brokers sit there and soak up those messages and can write them out the other end at whatever speed your consuming systems are capable of DCP is written for memory to memory type replication – if you’re writing to a system that can’t keep up, the client needs to do some fancy footwork to make everything come out ok
  • #30 What can possibly go wrong? A lot – First of all, you need to be able to drink from the firehose. Couchbase 100K’s of messages per second – the Kafka brokers sit there and soak up those messages and can write them out the other end at whatever speed your consuming systems are capable of DCP is written for memory to memory type replication – if you’re writing to a system that can’t keep up, the client needs to do some fancy footwork to make everything come out ok
  • #36 This just does the work of creating some messages – for demo purposes, I can type things in here and see them show up as documents in Couchbase The keys are random – and limited to 10 DCP is a way of doing mutations, so sometimes we are going to overwrite existing docs and sometimes we will end up making new docs Overwrites are captured as sequence numbers (when combined with a document key, you have version information) similar to the offset in Kafka
  • #37 Producer is going to grab the docs and send them to Kafka
  • #38 This filter that we’re using prints out the dcpEvent to the console so we can read it but otherwise does no filtering You can add logic to the filter to mark events false, in which case they won’t be written to Kafka
  • #40 Finally, this is attaching to my Kafka vagrant image on port 9092 and subscribing to the topic default, partition 0 (we only have one partition)
  • #43 Fully transparent cluster and bucket management, including direct access if needed