Ran Silberman, developer & technical leader at LivePerson, presents how LivePerson moved its data platform from a legacy ETL concept to the "Data Integration" concept of our era.
Kafka is the main infrastructure that forms the backbone for data flow in the new Data Integration. That said, Kafka cannot stand on its own: supporting systems such as Hadoop, Storm, and the Avro protocol were integrated alongside it.
In this lecture Ran describes the implementation at LivePerson and shares some tips on how to avoid pitfalls.
Read More: https://connect.liveperson.com/community/developers/blog/2013/11/21/from-a-kafkaesque-story-to-the-promised-land
2. Open Source paradigm
The Cathedral & the Bazaar by Eric S. Raymond, 1999
The struggle between top-down and bottom-up design
3. Challenges of data platform[1]
• High throughput
• Horizontal scale to address growth
• High availability of data services
• No data loss
• Satisfy Real-Time demands
• Enforce structured data with schemas
• Process Big Data and Enterprise Data
• Single Source of Truth (SSOT)
4. SLAs of the data platform
Diagram: Real-time servers send events to a Data Bus that serves real-time customers and real-time dashboards directly, and offline customers through the BI DWH.
Real-time SLA: 98% of events delivered in < 500 msec; no event later than 2 sec.
Offline (BI DWH) SLA: 98% of data available in < 1/2 hr; 99.999% in < 4 hrs.
5. Legacy Data flow in LivePerson
Diagram: RealTime servers → ETL (Sessionize, Modeling, Schema View) → BI DWH (Oracle) → customers view reports.
6. 1st phase - move to Hadoop
Diagram: RealTime servers → ETL (Sessionize, Modeling, Schema View) → Hadoop (HDFS); an MR job transfers the data to the BI DWH (Vertica) → customers view reports.
7. 2nd phase - move to Kafka
Diagram: RealTime servers → Kafka (Topic-1) → Hadoop (HDFS); an MR job transfers the data to the BI DWH (Vertica) → customers view reports.
8. 3rd phase - Integrate with new producers
Diagram: existing RealTime servers and new RealTime servers → Kafka (Topic-1, Topic-2) → Hadoop (HDFS); an MR job transfers the data to the BI DWH (Vertica) → customers view reports.
9. 4th phase - Add Real-time BI
Diagram: RealTime servers → Kafka (Topic-1, Topic-2) → Storm Topology for real-time processing, and Hadoop (HDFS); an MR job transfers the data to the BI DWH (Vertica) → customers view reports.
10. 5th phase - Standardize Data-Model using Avro
Diagram: RealTime servers → Kafka (Topic-1, Topic-2) → Storm Topology, and Hadoop (HDFS) via Camus; an MR job transfers the data to the BI DWH (Vertica) → customers view reports.
11. 6th phase - Define Single Source of Truth (SSOT)
Diagram: same flow as the previous phase: RealTime servers → Kafka (Topic-1, Topic-2) → Storm Topology, and Hadoop (HDFS) via Camus; an MR job transfers the data to the BI DWH (Vertica) → customers view reports.
12. Kafka[2] as Backbone for Data
• Central "Message Bus"
• Support multiple topics (MQ style)
• Write ahead to files
• Distributed & Highly Available
• Horizontal Scale
• High throughput (10s of MB/sec per server)
• Service is agnostic to consumers' state
• Retention policy
16. Kafka replay messages
Diagram: ZooKeeper holds the consumer's current offset; each broker node (e.g., Node 3, Node 4) holds a range of messages between a min offset and a max offset.

    fetchRequest = new FetchRequest(topic, partition, offset, size);

currentOffset: taken from ZooKeeper
Earliest offset: -2
Latest offset: -1
A fuller sketch of replaying from a given offset follows.
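As an illustration of the replay mechanism above, here is a minimal sketch against the Kafka 0.8 SimpleConsumer Java API; the broker host/port, topic name, partition number, and clientId are placeholder values.

    import kafka.api.FetchRequest;
    import kafka.api.FetchRequestBuilder;
    import kafka.javaapi.FetchResponse;
    import kafka.javaapi.consumer.SimpleConsumer;
    import kafka.javaapi.message.ByteBufferMessageSet;
    import kafka.message.MessageAndOffset;

    public class ReplayExample {
        public static void main(String[] args) {
            // Connect directly to a broker (SimpleConsumer bypasses the high-level API).
            SimpleConsumer consumer =
                new SimpleConsumer("broker1", 9092, 100000, 64 * 1024, "replay-client");

            // Offset to start from: in practice the value saved in ZooKeeper, or an offset
            // resolved via an OffsetRequest using the special values -2 (earliest) / -1 (latest).
            long offset = 0L;

            FetchRequest fetchRequest = new FetchRequestBuilder()
                .clientId("replay-client")
                .addFetch("topic1", 0 /* partition */, offset, 100000 /* max bytes */)
                .build();

            FetchResponse response = consumer.fetch(fetchRequest);
            ByteBufferMessageSet messageSet = response.messageSet("topic1", 0);
            for (MessageAndOffset messageAndOffset : messageSet) {
                offset = messageAndOffset.nextOffset();   // remember where to resume
                // messageAndOffset.message().payload() holds the raw message bytes
            }
            consumer.close();
        }
    }

Replaying simply means fetching again from an older offset, which works as long as the messages are still within the retention policy.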
17. Kafka API[3]
• Producer API
• Consumer API
  o High-level API: uses ZooKeeper to access brokers and to save offsets (sketched below)
  o SimpleConsumer API: direct access to Kafka brokers
    Kafka-Spout, Camus, and KafkaHadoopConsumer all use SimpleConsumer
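For the high-level API, a minimal consumer sketch against Kafka 0.8 might look like the following; the ZooKeeper connection string, group id, and topic name are placeholders.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class HighLevelConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181"); // brokers are discovered via ZooKeeper
            props.put("group.id", "example-group");           // offsets are saved per consumer group
            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
            topicCountMap.put("topic1", 1);                   // one stream (thread) for topic1
            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(topicCountMap);

            ConsumerIterator<byte[], byte[]> it = streams.get("topic1").get(0).iterator();
            while (it.hasNext()) {
                byte[] payload = it.next().message();         // raw message bytes
                // do something with payload
            }
        }
    }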
18. Kafka API[3]
• Producer (see the fuller sketch after this slide)

    List<KeyedMessage<String, String>> messages = new ArrayList<KeyedMessage<String, String>>();
    messages.add(new KeyedMessage<String, String>("topic1", null, msg1));
    producer.send(messages);

• Consumer

    Map<String, List<KafkaStream<byte[], byte[]>>> streams =
        connector.createMessageStreams(topicCountMap);
    for (MessageAndMetadata<byte[], byte[]> message : streams.get("topic1").get(0)) {
        // do something with message
    }
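The snippets above omit setup; a fuller, self-contained producer sketch for the Kafka 0.8 Java API follows. The broker list and topic name are placeholders, and StringEncoder is just one possible serializer.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class ProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // bootstrap brokers
            props.put("serializer.class", "kafka.serializer.StringEncoder"); // send String values
            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));

            List<KeyedMessage<String, String>> messages = new ArrayList<KeyedMessage<String, String>>();
            messages.add(new KeyedMessage<String, String>("topic1", null, "msg1")); // null key: any partition
            producer.send(messages);
            producer.close();
        }
    }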
19. Kafka in Unit Testing
• Use of class KafkaServer
• Run embedded server (see the sketch below)
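A minimal sketch of an embedded broker for tests, assuming Kafka 0.8 on the test classpath and a ZooKeeper instance reachable on localhost:2181; the ports, paths, and property names are illustrative and may differ between Kafka versions.

    import java.nio.file.Files;
    import java.util.Properties;
    import kafka.server.KafkaConfig;
    import kafka.server.KafkaServer;
    import kafka.utils.SystemTime$;

    public class EmbeddedKafkaTest {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("broker.id", "0");
            props.put("port", "9092");
            props.put("log.dirs", Files.createTempDirectory("kafka-test-logs").toString());
            props.put("zookeeper.connect", "localhost:2181");

            // KafkaServer is a Scala class; SystemTime$.MODULE$ supplies its Time argument from Java.
            KafkaServer server = new KafkaServer(new KafkaConfig(props), SystemTime$.MODULE$);
            server.startup();
            try {
                // ... exercise producers/consumers against localhost:9092 ...
            } finally {
                server.shutdown();
                server.awaitShutdown();
            }
        }
    }

In a real test this would typically live in a JUnit setUp/tearDown pair rather than a main() method.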
20. Introducing Avro[5]
• Schema representation using JSON (see the example below)
• Supported types:
  o Primitive types: boolean, int, long, string, etc.
  o Complex types: Record, Enum, Union, Arrays, Maps, Fixed
• Data is serialized using its schema
• Avro files include a file header containing the schema
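For instance, a small record schema and a record built against it might look as follows (Avro generic API; the Event record and its fields are invented for illustration).

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class AvroSchemaExample {
        // The schema itself is plain JSON.
        private static final String SCHEMA_JSON =
            "{\"type\": \"record\", \"name\": \"Event\", \"fields\": ["
          + " {\"name\": \"sessionId\", \"type\": \"string\"},"
          + " {\"name\": \"timestamp\", \"type\": \"long\"}"
          + "]}";

        public static void main(String[] args) {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
            GenericRecord event = new GenericData.Record(schema);  // data is tied to its schema
            event.put("sessionId", "102122");
            event.put("timestamp", 12346L);
            System.out.println(event);  // prints the record as JSON
        }
    }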
21. Add Avro protocol to the story
Flow: Producer 1 creates a message according to Schema 1.0, registers schema 1.0 in the Schema Repo, adds the schema revision to the message header, encodes the message with Schema 1.0, and sends it to Kafka (Topic 1, Topic 2).
Kafka message: header + Avro-encoded payload, e.g.
    {event1:{header:{sessionId:"102122"},{timestamp:"12346"}}...
Consumers (Camus/Storm): read the message, extract the header to obtain the schema version, get schema 1.0 by version from the Schema Repo, decode the message with schema 1.0, and pass the message on.
A producer-side encoding sketch follows.
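A minimal sketch of the producer-side encoding described above: serialize the record with its schema and prepend the schema revision so that consumers can fetch the right schema from the repository. The single-byte version header is an assumption made for illustration, not necessarily LivePerson's wire format.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class VersionedAvroEncoder {
        public static byte[] encode(GenericRecord record, Schema schema, byte schemaVersion)
                throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(schemaVersion);                         // header: schema revision (illustrative)
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
            encoder.flush();                                  // Avro payload follows the header
            return out.toByteArray();                         // this byte[] becomes the Kafka message
        }
    }

On the consumer side the version header is read back, the matching schema is fetched from the repository, and the rest of the payload is decoded with a DatumReader.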
22. Kafka + Storm + Avro example
• Demonstrates use of Avro data passing from Kafka to Storm
• Explains Avro revision evolution
• Requires Kafka and Zookeeper installed
• Uses the Storm artifact and Kafka-Spout artifact in Maven
• A plugin generates Java classes from the Avro Schema
• https://github.com/ransilberman/avro-kafka-storm
23. Resiliency
Diagram: on the producer machine, the producer sends messages to Kafka and also persists them to a local file; a Kafka Bridge process reads the local file and sends the messages to Kafka. Two topics are used: a Fast topic (real-time consumer: Storm) and a Consistent topic (offline consumer: Hadoop).
A producer-side sketch of this idea follows.
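The sketch below illustrates the resiliency idea: try Kafka first, and spool the event to a local file if the send fails so that a separate bridge process can replay it later. Class, topic, and file names are illustrative, not LivePerson's actual implementation.

    import java.io.FileWriter;
    import java.io.IOException;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;

    public class ResilientSender {
        private final Producer<String, String> producer;
        private final String spoolFile = "/var/spool/kafka-bridge/events.log"; // read by the Kafka Bridge

        public ResilientSender(Producer<String, String> producer) {
            this.producer = producer;
        }

        public void send(String topic, String event) {
            try {
                producer.send(new KeyedMessage<String, String>(topic, null, event));
            } catch (Exception e) {
                spoolToDisk(event); // Kafka unreachable: keep the event locally so it is not lost
            }
        }

        private void spoolToDisk(String event) {
            try (FileWriter writer = new FileWriter(spoolFile, true)) {
                writer.write(event + "\n");
            } catch (IOException io) {
                // last resort: both Kafka and the local disk are unavailable
            }
        }
    }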
24. Challenges of Kafka
• Still not mature enough
• Not enough supporting tools (viewers, maintenance)
• Duplications may occur
• API not documented enough
• Open Source - supported by the community only
• Difficult to replay messages from a specific point in time
• Eventually Consistent...
25. Eventually Consistent
Because it is a distributed system -
• No guarantee of delivery order
• No way to tell to which broker a message is sent
• Kafka does not guarantee that there are no duplications
• ...But eventually, all messages will arrive!
(Illustration: an event generated at one edge of a desert eventually reaches its destination at the other edge.)
26. Major Improvements in Kafka 0.8[4]
• Partition replication
• Message send guarantee (see the sketch below)
• Consumer offsets are represented as sequential message numbers instead of byte positions (e.g., 1, 2, 3, ...)
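The send guarantee is controlled on the producer; a minimal sketch of the relevant 0.8 configuration is below (the broker list is a placeholder).

    import java.util.Properties;
    import kafka.producer.ProducerConfig;

    public class AcksConfigExample {
        public static ProducerConfig config() {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092,broker2:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // 0 = fire and forget, 1 = wait for the partition leader, -1 = wait for all in-sync replicas
            props.put("request.required.acks", "1");
            return new ProducerConfig(props);
        }
    }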
27. Addressing Data Challenges
• High throughput
  o Kafka, Hadoop
• Horizontal scale to address growth
  o Kafka, Storm, Hadoop
• High availability of data services
  o Kafka, Storm, Zookeeper
• No data loss
  o Highly Available services, No ETL
28. Addressing Data Challenges Cont.
• Satisfy Real-Time demands
  o Storm
• Enforce structured data with schemas
  o Avro
• Process Big Data and Enterprise Data
  o Kafka, Hadoop
• Single Source of Truth (SSOT)
  o Hadoop, No ETL