From a Kafkaesque Story to
the Promised Land

7/7/2013
Ran Silberman
Open Source paradigm
The Cathedral & the Bazaar by Eric S. Raymond, 1999:
the struggle between top-down and bottom-up design
Challenges of data platform[1]

• High throughput
• Horizontal scale to address growth
• High availability of data services
• No data loss
• Satisfy real-time demands
• Enforce structural data with schemas
• Process Big Data and Enterprise Data
• Single Source of Truth (SSOT)
SLA's of data platform
(Diagram: real-time servers publish to a data bus, which feeds real-time customers, offline customers, the BI DWH, and real-time dashboards.)
• Offline path (BI DWH) SLA: 98% of data in < 1/2 hr; 99.999% in < 4 hrs
• Real-time path (dashboards) SLA: 98% in < 500 msec; no send > 2 sec
Legacy Data flow in LivePerson
(Diagram: RealTime servers feed an ETL process (Sessionize, Modeling, Schema View) that loads the BI DWH (Oracle); customers view reports from the DWH.)
1st phase - move to Hadoop
(Diagram: RealTime servers feed the ETL (Sessionize, Modeling, Schema View) into Hadoop (HDFS); an MR job transfers the data to the BI DWH (Vertica); customers view reports.)
2. Move to Kafka
(Diagram: RealTime servers produce events to Kafka (Topic-1); the events land in Hadoop (HDFS); an MR job transfers the data to the BI DWH (Vertica); customers view reports.)
3. Integrate with new producers
(Diagram: the existing RealTime servers and new RealTime servers produce to Kafka (Topic-1 and Topic-2); the data lands in Hadoop (HDFS); an MR job transfers it to the BI DWH (Vertica); customers view reports.)
4. Add Real-time BI
(Diagram: both sets of RealTime servers produce to Kafka (Topic-1, Topic-2); a Storm topology consumes the topics for real-time BI, while the data also lands in Hadoop (HDFS); an MR job transfers it to the BI DWH (Vertica); customers view reports.)
5. Standardize Data-Model using Avro
(Diagram: RealTime servers produce Avro messages to Kafka (Topic-1, Topic-2); a Storm topology consumes them in real time, and Camus copies them into Hadoop (HDFS); an MR job transfers the data to the BI DWH (Vertica); customers view reports.)
6. Define Single Source of Truth (SSOT)
(Diagram: the same pipeline as in phase 5, with Hadoop (HDFS) now serving as the single source of truth feeding the BI DWH (Vertica) and customer reports.)
Kafka[2] as Backbone for Data

• Central "Message Bus"
• Support for multiple topics (MQ style)
• Write-ahead to files
• Distributed & highly available
• Horizontal scale
• High throughput (10s of MB/sec per server)
• Service is agnostic to consumers' state
• Retention policy
Kafka Architecture
(Diagram: overall Kafka cluster architecture.)
Kafka Architecture cont.
(Diagram: Producers 1-3 write to broker Nodes 1-3, coordinated by Zookeeper; Consumer 1 reads from each node.)
Kafka Architecture cont.
(Diagram: Producer 1 writes to Topic1 and Producer 2 to Topic2; the topics are spread across broker Nodes 1-4, coordinated by Zookeeper; consumer Group1 (Consumers 1-3) reads them.)
Kafka replay messages
(Diagram: each topic partition on a broker node holds messages between a min offset and a max offset; the consumer's current offset is kept in Zookeeper.)

fetchRequest = new FetchRequest(topic, partition, offset, size);
• currentOffset: taken from Zookeeper
• Earliest offset: -2
• Latest offset: -1
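
A rough sketch of replaying a partition from a chosen offset with the SimpleConsumer API (0.7-era Java client; signatures differ slightly in 0.8). The broker host, port, and buffer sizes are illustrative placeholders:

// A sketch only: replay a topic partition from the earliest retained offset
// using the 0.7-era SimpleConsumer Java API; host/port/sizes are placeholders.
import java.nio.ByteBuffer;

import kafka.api.FetchRequest;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.javaapi.message.ByteBufferMessageSet;
import kafka.message.MessageAndOffset;

public class ReplaySketch {
    public static void main(String[] args) {
        SimpleConsumer consumer =
                new SimpleConsumer("broker-host", 9092, 10000, 64 * 1024);
        String topic = "topic1";
        int partition = 0;

        // -2 asks for the earliest available offset, -1 for the latest (as on the slide).
        long earliest = consumer.getOffsetsBefore(topic, partition, -2L, 1)[0];
        long latest   = consumer.getOffsetsBefore(topic, partition, -1L, 1)[0];

        long offset = earliest;  // replay everything still retained on the broker
        while (offset < latest) {
            FetchRequest fetchRequest = new FetchRequest(topic, partition, offset, 64 * 1024);
            ByteBufferMessageSet messages = consumer.fetch(fetchRequest);
            long nextOffset = offset;
            for (MessageAndOffset mo : messages) {
                ByteBuffer payload = mo.message().payload();
                byte[] bytes = new byte[payload.remaining()];
                payload.get(bytes);
                // do something with the message bytes here
                nextOffset = mo.offset();  // in 0.7 this is the offset to fetch next
            }
            if (nextOffset == offset) {
                break;                     // nothing new returned; stop replaying
            }
            offset = nextOffset;
        }
        consumer.close();
    }
}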
Kafka API[3]

• Producer API
• Consumer API
  o High-level API: uses Zookeeper to access brokers and to save offsets
  o SimpleConsumer API: direct access to Kafka brokers
• Kafka-Spout, Camus, and KafkaHadoopConsumer all use SimpleConsumer
Kafka API[3]

• Producer

List<KeyedMessage<String, String>> messages = new ArrayList<KeyedMessage<String, String>>();
messages.add(new KeyedMessage<String, String>("topic1", null, msg1));
producer.send(messages);

• Consumer

Map<String, List<KafkaStream<byte[], byte[]>>> streams =
        consumer.createMessageStreams(Collections.singletonMap("topic1", 1));
for (MessageAndMetadata<byte[], byte[]> message : streams.get("topic1").get(0)) {
    // do something with message
}
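
For context, a minimal end-to-end sketch of how the producer and the high-level consumer above get wired up, assuming the 0.8 Java client; the broker list, Zookeeper address, group id, and topic name are placeholder values:

// A sketch only: configure a 0.8 Java producer and high-level consumer.
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.javaapi.producer.Producer;
import kafka.message.MessageAndMetadata;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KafkaApiSketch {
    public static void main(String[] args) {
        // Producer: send one message to topic1.
        Properties producerProps = new Properties();
        producerProps.put("metadata.broker.list", "broker-host:9092");
        producerProps.put("serializer.class", "kafka.serializer.StringEncoder");
        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(producerProps));
        producer.send(new KeyedMessage<String, String>("topic1", null, "hello"));
        producer.close();

        // High-level consumer: brokers are discovered and offsets saved via Zookeeper.
        Properties consumerProps = new Properties();
        consumerProps.put("zookeeper.connect", "zk-host:2181");
        consumerProps.put("group.id", "example-group");
        ConsumerConnector consumer =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(consumerProps));
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                consumer.createMessageStreams(Collections.singletonMap("topic1", 1));
        ConsumerIterator<byte[], byte[]> it = streams.get("topic1").get(0).iterator();
        while (it.hasNext()) {
            MessageAndMetadata<byte[], byte[]> message = it.next();
            System.out.println(new String(message.message()));  // do something with the message
        }
    }
}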
Kafka in Unit Testing

• Use of class KafkaServer
• Run an embedded server
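
A minimal sketch of what an embedded broker can look like in a test, assuming the 0.7-era KafkaServer(KafkaConfig) constructor; property names differ between Kafka versions, and the port, log directory, and Zookeeper address are placeholders:

// A sketch only: start an embedded Kafka broker for a unit test
// (0.7-era property names; a running Zookeeper is assumed).
import java.util.Properties;

import kafka.server.KafkaConfig;
import kafka.server.KafkaServer;

public class EmbeddedKafkaSketch {
    private KafkaServer server;

    public void startBroker() {
        Properties props = new Properties();
        props.put("brokerid", "0");                       // broker id
        props.put("port", "9092");                        // listen port for the test broker
        props.put("log.dir", "/tmp/embedded-kafka-logs"); // temp dir for the broker's log files
        props.put("zk.connect", "localhost:2181");        // Zookeeper used by the broker
        server = new KafkaServer(new KafkaConfig(props));
        server.startup();
    }

    public void stopBroker() {
        if (server != null) {
            server.shutdown();
            server.awaitShutdown();
        }
    }
}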
Introducing Avro[5]

• Schema representation using JSON
• Supported types:
  o Primitive types: boolean, int, long, string, etc.
  o Complex types: Record, Enum, Union, Arrays, Maps, Fixed
• Data is serialized using its schema
• Avro files include a file header containing the schema
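
As a small illustration (not from the deck), a record schema defined in JSON and used to binary-encode one record with the Avro Java API; the "Event" schema and its fields are made up for the example:

// A sketch only: define an Avro schema in JSON and serialize one record with it.
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroSketch {
    private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"sessionId\",\"type\":\"string\"},"
          + "{\"name\":\"timestamp\",\"type\":\"long\"}]}";

    public static byte[] encode() throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord event = new GenericData.Record(schema);
        event.put("sessionId", "102122");
        event.put("timestamp", 12346L);

        // Serialize the record using its schema (the schema file header is only added
        // when writing Avro *files* with DataFileWriter, not for raw messages).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(event, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}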
Add Avro protocol to the story
(Diagram: Producer 1 creates a message according to schema 1.0, for example
{event1:{header:{sessionId:"102122", timestamp:"12346"}}...
It registers schema 1.0 in the Schema Repo, encodes the message with schema 1.0, adds the schema revision to the Kafka message header, and sends the Avro message to a topic. The consumers (Camus/Storm) read the message, extract the header to obtain the schema version, get schema 1.0 from the Schema Repo by version, decode the message with schema 1.0, and pass the message on.)
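
A rough sketch of the producer and consumer sides of this flow. The single-byte revision header and the in-memory Map standing in for the Schema Repo are illustrative assumptions; the deck does not specify LivePerson's actual header format or repository API:

// A sketch only: tag each Avro-encoded Kafka message with its schema revision and
// decode it on the consumer side by looking the writer schema up by that revision.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class VersionedAvroSketch {
    // Stand-in for the Schema Repo: revision -> registered schema.
    private static final Map<Integer, Schema> SCHEMA_REPO = new HashMap<Integer, Schema>();

    // Producer side: encode the record with its schema and prepend the revision byte.
    public static byte[] encode(GenericRecord record, Schema schema, int revision) throws IOException {
        SCHEMA_REPO.put(revision, schema);                     // "register schema 1.0"
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(revision);                                   // message header: schema revision
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    // Consumer side (Camus/Storm): read the revision, fetch that schema, decode the payload.
    public static GenericRecord decode(byte[] message) throws IOException {
        int revision = message[0] & 0xFF;                      // extract header: schema version
        Schema writerSchema = SCHEMA_REPO.get(revision);       // "get schema by version"
        byte[] payload = Arrays.copyOfRange(message, 1, message.length);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        return new GenericDatumReader<GenericRecord>(writerSchema).read(null, decoder);
    }
}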
Kafka + Storm + Avro example

• Demonstrates Avro data passing from Kafka to Storm
• Explains Avro revision evolution
• Requires Kafka and Zookeeper installed
• Uses the Storm artifact and the Kafka-Spout artifact in Maven
• A Maven plugin generates Java classes from the Avro schema
• https://github.com/ransilberman/avro-kafka-storm
Resiliency
(Diagram: on the producer machine, the producer sends each message to Kafka on a "Fast Topic" and also persists it to a local file; a Kafka Bridge process reads the local file and sends the message to Kafka on a "Consistent Topic". The real-time consumer (Storm) reads the Fast Topic; the offline consumer (Hadoop) reads the Consistent Topic.)
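
A very rough sketch of the producer side of that pattern, under the assumption that "persist to local disk" simply means appending the raw message to a spool file that the separate Kafka Bridge process later replays onto the consistent topic; the topic names, spool path, and producer wiring are placeholders:

// A sketch only: send on the fast topic and spool the message locally for the bridge.
import java.io.FileOutputStream;
import java.io.IOException;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;

public class ResilientSendSketch {
    private final Producer<String, String> producer;   // assumed to be configured elsewhere
    private final String spoolFile = "/var/spool/kafka-bridge/messages.log";

    public ResilientSendSketch(Producer<String, String> producer) {
        this.producer = producer;
    }

    public void send(String msg) throws IOException {
        // Durable path: append to the local spool file; the Kafka Bridge tails this
        // file and produces the same message to the consistent topic.
        FileOutputStream out = new FileOutputStream(spoolFile, true);
        try {
            out.write((msg + "\n").getBytes("UTF-8"));
            out.getFD().sync();   // make sure the message survives a process crash
        } finally {
            out.close();
        }

        // Best-effort, low-latency path: straight to the fast topic for Storm.
        producer.send(new KeyedMessage<String, String>("fast-topic", null, msg));
    }
}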
Challenges of Kafka

• Still not mature enough
• Not enough supporting tools (viewers, maintenance)
• Duplications may occur
• API not documented enough
• Open source: supported by the community only
• Difficult to replay messages from a specific point in time
• Eventually consistent...
Eventually Consistent
Because it is a distributed system:

• No guarantee of delivery order
• No way to tell to which broker a message is sent
• Kafka does not guarantee that there are no duplications
• ...but eventually, all messages will arrive!
(Image: a desert lying between the point where an event is generated and the event's destination.)
Major Improvements in Kafka 0.8[4]

• Partition replication
• Message send guarantees
• Consumer offsets are represented as sequential numbers instead of byte positions (e.g., 1, 2, 3, ...)
Addressing Data Challenges

• High throughput
  o Kafka, Hadoop
• Horizontal scale to address growth
  o Kafka, Storm, Hadoop
• High availability of data services
  o Kafka, Storm, Zookeeper
• No data loss
  o Highly available services, no ETL
Addressing Data Challenges Cont.

• Satisfy real-time demands
  o Storm
• Enforce structural data with schemas
  o Avro
• Process Big Data and Enterprise Data
  o Kafka, Hadoop
• Single Source of Truth (SSOT)
  o Hadoop, no ETL
References

• [1] Satisfying New Requirements for Data Integration, by David Loshin
• [2] Apache Kafka
• [3] Kafka API
• [4] Kafka 0.8 Quick Start
• [5] Apache Avro
• [6] Storm
Thank you!
