From a Kafkaesque Story to the Promised Land
7/7/2013
Ran Silberman

LivePerson moved from an ETL-based data platform to a new data platform built on emerging technologies from the Open Source community: Hadoop, Kafka, Storm, Avro, and more.
This presentation tells the story and focuses on Kafka.

Transcript of "From a Kafkaesque Story to the Promised Land"

  1. From a Kafkaesque Story to the Promised Land, 7/7/2013, Ran Silberman
  2. Open Source paradigm: The Cathedral & the Bazaar by Eric S. Raymond, 1999; the struggle between top-down and bottom-up design
  3. Challenges of a data platform[1]
     • High throughput
     • Horizontal scale to address growth
     • High availability of data services
     • No data loss
     • Satisfy real-time demands
     • Enforce structured data with schemas
     • Process Big Data and enterprise data
     • Single Source of Truth (SSOT)
  4. SLAs of the data platform. Diagram: Real-time servers feed a Data Bus with two classes of consumers: offline customers (BI DWH; SLA: 98% of data in < 1/2 hr, 99.999% in < 4 hrs) and real-time customers (real-time dashboards; SLA: 98% of messages in < 500 msec, no send taking > 2 sec).
  5. Legacy data flow in LivePerson. Diagram: RealTime servers feed an ETL chain (Sessionize, Modeling, Schema View) into the BI DWH (Oracle), from which customers view reports.
  6. 1st phase: move to Hadoop. Diagram: RealTime servers write to HDFS; Hadoop runs the ETL chain (Sessionize, Modeling, Schema View) and an MR job transfers the data to the BI DWH (Vertica, alongside the legacy Oracle DWH); customers view reports.
  7. 2nd phase: move to Kafka. Diagram: RealTime servers publish to Kafka (Topic-1), which feeds HDFS/Hadoop; an MR job transfers the data to the BI DWH (Vertica); customers view reports.
  8. 3rd phase: integrate with new producers. Diagram: the existing RealTime servers and new RealTime servers publish to Kafka (Topic-1, Topic-2), which feeds HDFS/Hadoop; an MR job transfers the data to the BI DWH (Vertica); customers view reports.
  9. 4th phase: add real-time BI. Diagram: as before, plus a Storm topology that consumes from Kafka to serve customers in real time.
  10. 5th phase: standardize the data model using Avro. Diagram: as before, with Camus loading the Kafka topics into HDFS.
  11. 6th phase: define the Single Source of Truth (SSOT). Diagram: the same Kafka/Camus/Hadoop/Storm architecture as the previous phase.
  12. Kafka[2] as Backbone for Data
     • Central "Message Bus"
     • Supports multiple topics (MQ style)
     • Write-ahead to files
     • Distributed & highly available
     • Horizontal scale
     • High throughput (10s of MB/sec per server)
     • Service is agnostic to consumers' state
     • Retention policy
  13. Kafka Architecture (diagram)
  14. Kafka Architecture cont. Diagram: Producers 1-3 write to broker Nodes 1-3; Zookeeper coordinates the cluster; three consumer instances read from the brokers.
  15. Kafka Architecture cont. Diagram: Producers 1-2 write Topic1 and Topic2 across broker Nodes 1-4; Zookeeper coordinates the cluster; Consumers 1-3 form Group1.
  16. Kafka message replay. Diagram: Zookeeper and broker Nodes 3-4, with the min and max offsets marked on a partition log.
     FetchRequest fetchRequest = new FetchRequest(topic, partition, offset, size);
     The current offset is taken from Zookeeper; earliest offset: -2, latest offset: -1.
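     A minimal sketch of replaying a partition from its earliest offset, assuming the 0.7-era SimpleConsumer API that the slide quotes (broker host, port, and topic name are hypothetical):

       import kafka.api.FetchRequest;
       import kafka.javaapi.consumer.SimpleConsumer;
       import kafka.javaapi.message.ByteBufferMessageSet;
       import kafka.message.MessageAndOffset;

       public class ReplayExample {
           public static void main(String[] args) {
               // Connect directly to one broker (hypothetical host/port)
               SimpleConsumer consumer = new SimpleConsumer("broker1", 9092, 30000, 64 * 1024);
               // -2 asks for the earliest offset still on disk; -1 would ask for the latest
               long offset = consumer.getOffsetsBefore("topic1", 0, -2L, 1)[0];
               // Fetch up to 64 KB of messages from partition 0 starting at that offset
               ByteBufferMessageSet messages = consumer.fetch(new FetchRequest("topic1", 0, offset, 64 * 1024));
               for (MessageAndOffset mo : messages) {
                   // process mo.message(); resume the next fetch from mo.offset()
               }
               consumer.close();
           }
       }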
  17. Kafka API[3]
     • Producer API
     • Consumer API
       o High-level API: uses Zookeeper to access brokers and to save offsets
       o SimpleConsumer API: direct access to Kafka brokers
     • Kafka-Spout, Camus, and KafkaHadoopConsumer all use the SimpleConsumer API
  18. Kafka API[3]
     • Producer
       List<KeyedMessage<K, V>> messages = new ArrayList<KeyedMessage<K, V>>();
       messages.add(new KeyedMessage("topic1", null, msg1));
       producer.send(messages);
     • Consumer
       streams = consumer.createMessageStreams(Collections.singletonMap("topic1", 1)).get("topic1");
       for (MessageAndMetadata message : streams.get(0)) {
           // do something with message
       }
  19. Kafka in Unit Testing
     • Use of class KafkaServer
     • Run embedded server
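     A minimal sketch of what an embedded broker can look like in a test, assuming the 0.7-era KafkaServer API; the property values are hypothetical, and Zookeeper is disabled to keep the test self-contained:

       import java.util.Properties;
       import kafka.server.KafkaConfig;
       import kafka.server.KafkaServer;

       public class EmbeddedKafkaTest {
           private KafkaServer server;

           public void setUp() {
               Properties props = new Properties();
               props.put("brokerid", "0");
               props.put("port", "9092");               // hypothetical test port
               props.put("log.dir", "/tmp/kafka-test"); // hypothetical scratch dir
               props.put("enable.zookeeper", "false");  // no external dependencies
               server = new KafkaServer(new KafkaConfig(props));
               server.startup();
           }

           public void tearDown() {
               server.shutdown();
           }
       }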
  20. Introducing Avro[5]
     • Schema representation using JSON
     • Supported types
       o Primitive types: boolean, int, long, string, etc.
       o Complex types: Record, Enum, Union, Arrays, Maps, Fixed
     • Data is serialized using its schema
     • Avro files include a file header with the schema
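     For illustration, a small record schema in Avro's JSON form, parsed with the standard Avro Java API (the Event record and its fields are made up for this example):

       import org.apache.avro.Schema;

       public class SchemaExample {
           // A record mixing primitive (string, long) and complex (enum) types
           static final String SCHEMA_JSON =
               "{\"type\": \"record\", \"name\": \"Event\", \"fields\": ["
             + " {\"name\": \"sessionId\", \"type\": \"string\"},"
             + " {\"name\": \"timestamp\", \"type\": \"long\"},"
             + " {\"name\": \"channel\", \"type\": {\"type\": \"enum\","
             + "   \"name\": \"Channel\", \"symbols\": [\"WEB\", \"MOBILE\"]}}"
             + "]}";

           public static void main(String[] args) {
               Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
               System.out.println(schema.getField("sessionId").schema().getType()); // STRING
           }
       }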
  21. Add the Avro protocol to the story. Diagram: a Schema Repo sits between Producer 1 and the consumers (Camus/Storm) on Topics 1-2. Producer side: register schema 1.0, create the message according to schema 1.0, encode it with schema 1.0, add the schema revision to the message header, and send. Consumer side: read the message, extract the header to obtain the schema version, get schema 1.0 from the repo by version, and decode the message with it. A Kafka message thus wraps a header plus the Avro message, e.g. {event1:{header:{sessionId:"102122"},{timestamp:"12346"}}...
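     A sketch of the producer-side encoding under one possible wire format; the slides do not specify it, so the single revision byte prefixed to the Avro payload is an assumption for illustration:

       import java.io.ByteArrayOutputStream;
       import java.io.IOException;
       import org.apache.avro.Schema;
       import org.apache.avro.generic.GenericDatumWriter;
       import org.apache.avro.generic.GenericRecord;
       import org.apache.avro.io.BinaryEncoder;
       import org.apache.avro.io.EncoderFactory;

       public class VersionedEncoder {
           // Prefix the Avro payload with a schema revision so consumers can
           // fetch the matching schema from the repository before decoding
           public static byte[] encode(GenericRecord record, Schema schema, byte revision) throws IOException {
               ByteArrayOutputStream out = new ByteArrayOutputStream();
               out.write(revision); // header: schema version (assumed one byte)
               BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
               new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
               encoder.flush();
               return out.toByteArray();
           }
       }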
  22. Kafka + Storm + Avro example
     • Demonstrates Avro data passing from Kafka to Storm
     • Explains Avro revision evolution
     • Requires Kafka and Zookeeper installed
     • Uses the Storm and Kafka-Spout artifacts in Maven
     • A Maven plugin generates Java classes from the Avro schema
     • https://github.com/ransilberman/avro-kafka-storm
  23. Resiliency. Diagram: on the producer machine, the producer sends messages directly to Kafka on a fast topic (real-time consumer: Storm) and also persists them to local disk, from where a bridge sends them to Kafka on a consistent topic (offline consumer: Hadoop).
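     A sketch of that dual-path idea, assuming the 0.8-style producer API used earlier in the talk; topic names, broker list, and the spool-file path are hypothetical:

       import java.io.FileOutputStream;
       import java.util.Properties;
       import kafka.javaapi.producer.Producer;
       import kafka.producer.KeyedMessage;
       import kafka.producer.ProducerConfig;

       public class ResilientProducer {
           private final Producer<String, String> producer;

           public ResilientProducer() {
               Properties props = new Properties();
               props.put("metadata.broker.list", "broker1:9092"); // hypothetical broker
               props.put("serializer.class", "kafka.serializer.StringEncoder");
               producer = new Producer<String, String>(new ProducerConfig(props));
           }

           public void send(String msg) throws Exception {
               // Fast path: straight to Kafka for the real-time consumer (Storm)
               producer.send(new KeyedMessage<String, String>("fast-topic", msg));
               // Consistent path: persist to local disk; a separate bridge process
               // replays this file into the consistent topic for the offline
               // consumer (Hadoop), so a Kafka outage does not lose data
               FileOutputStream spool = new FileOutputStream("/var/spool/events.log", true);
               spool.write((msg + "\n").getBytes("UTF-8"));
               spool.close();
           }
       }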
  24. Challenges of Kafka
     • Still not mature enough
     • Not enough supporting tools (viewers, maintenance)
     • Duplications may occur
     • API not documented well enough
     • Open source: supported by the community only
     • Difficult to replay messages from a specific point in time
     • Eventually consistent...
  25. Eventually Consistent. Because it is a distributed system:
     • No guarantee of delivery order
     • No way to tell to which broker a message was sent
     • Kafka does not guarantee that there are no duplications
     • ...but eventually, all messages will arrive! (Cartoon: an event crossing a desert between "Event generated" and "Event destination".)
  26. Major Improvements in Kafka 0.8[4]
     • Partition replication
     • Message send guarantees
     • Consumer offsets are represented as numbers instead of bytes (e.g., 1, 2, 3, ...)
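     For example, the 0.8 producer exposes the send guarantee as a config knob (a sketch; the broker address is hypothetical):

       import java.util.Properties;
       import kafka.producer.ProducerConfig;

       public class AckConfig {
           public static ProducerConfig withStrongAcks() {
               Properties props = new Properties();
               props.put("metadata.broker.list", "broker1:9092"); // hypothetical broker
               props.put("serializer.class", "kafka.serializer.StringEncoder");
               // 0 = fire-and-forget, 1 = leader ack, -1 = all in-sync replicas ack
               props.put("request.required.acks", "-1");
               return new ProducerConfig(props);
           }
       }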
  27. Addressing Data Challenges
     • High throughput
       o Kafka, Hadoop
     • Horizontal scale to address growth
       o Kafka, Storm, Hadoop
     • High availability of data services
       o Kafka, Storm, Zookeeper
     • No data loss
       o Highly available services, no ETL
  28. Addressing Data Challenges cont.
     • Satisfy real-time demands
       o Storm
     • Enforce structured data with schemas
       o Avro
     • Process Big Data and enterprise data
       o Kafka, Hadoop
     • Single Source of Truth (SSOT)
       o Hadoop, no ETL
  29. References
     • [1] Satisfying New Requirements for Data Integration, by David Loshin
     • [2] Apache Kafka
     • [3] Kafka API
     • [4] Kafka 0.8 Quick Start
     • [5] Apache Avro
     • [6] Storm
  30. Thank you!