From a Kafkaesque Story to The Promised Land at LivePerson


Ran Silberman, developer & technical leader at LivePerson, presents how LivePerson moved its data platform from a legacy ETL concept to the "Data Integration" concept of our era.

Kafka is the main infrastructure and forms the backbone of data flow in the new Data Integration. That said, Kafka does not stand alone: other supporting systems such as Hadoop, Storm, and the Avro protocol were also integrated.

In this lecture Ran describes the implementation at LivePerson and shares tips on how to avoid pitfalls.

Read More: https://connect.liveperson.com/community/developers/blog/2013/11/21/from-a-kafkaesque-story-to-the-promised-land

    Presentation Transcript

    • From a Kafkaesque Story to the Promised Land 7/7/2013 Ran Silberman
    • Open Source paradigm: "The Cathedral & the Bazaar" by Eric S. Raymond, 1999; the struggle between top-down and bottom-up design
    • Challenges of data platform[1]
      • High throughput
      • Horizontal scale to address growth
      • High availability of data services
      • No data loss
      • Satisfy real-time demands
      • Enforce structural data with schemas
      • Process Big Data and Enterprise Data
      • Single Source of Truth (SSOT)
    • SLA's of data platform (diagram: Real-time servers -> Data Bus -> offline customers (BI DWH) and real-time customers (real-time dashboards))
      • Offline path (BI DWH) SLA: 1. 98% in < 1/2 hr; 2. 99.999% in < 4 hrs
      • Real-time path (dashboards) SLA: 1. 98% in < 500 msec; 2. no send > 2 sec
    • Legacy Data flow in LivePerson (diagram: RealTime servers -> ETL: Sessionize, Modeling, Schema View -> BI DWH (Oracle); customers view reports)
    • 1st phase - move to Hadoop (diagram: RealTime servers -> ETL: Sessionize, Modeling, Schema View -> Hadoop HDFS; MR job transfers data to BI DWH (Vertica); customers view reports)
    • 2. move to Kafka (diagram: RealTime servers -> Kafka Topic-1 -> Hadoop HDFS; MR job transfers data to BI DWH (Vertica); customers view reports)
    • 3. Integrate with new producers (diagram: new RealTime servers and RealTime servers -> Kafka Topic-1, Topic-2 -> Hadoop HDFS; MR job transfers data to BI DWH (Vertica); customers view reports)
    • 4. Add Real-time BI (diagram: new RealTime servers and RealTime servers -> Kafka Topic-1, Topic-2 -> Storm Topology and Hadoop HDFS; MR job transfers data to BI DWH (Vertica); customers view reports)
    • 5. Standardize Data-Model using Avro (diagram: new RealTime servers and RealTime servers -> Kafka Topic-1, Topic-2 -> Storm Topology and Camus -> Hadoop HDFS; MR job transfers data to BI DWH (Vertica); customers view reports)
    • 6. Define Single Source of Truth (SSOT) (diagram: same flow; new RealTime servers and RealTime servers -> Kafka Topic-1, Topic-2 -> Storm Topology and Camus -> Hadoop HDFS; MR job transfers data to BI DWH (Vertica); customers view reports)
    • Kafka[2] as Backbone for Data
      • Central "Message Bus"
      • Supports multiple topics (MQ style)
      • Write-ahead to files
      • Distributed & Highly Available
      • Horizontal scale
      • High throughput (10s of MB/sec per server)
      • Service is agnostic to consumers' state
      • Retention policy
    • Kafka Architecture
    • Kafka Architecture cont. (diagram: Producers 1-3 write to broker Nodes 1-3; Zookeeper coordinates; Consumer 1 reads from all three nodes)
    • Kafka Architecture cont. (diagram: Producer 1 writes Topic1 and Producer 2 writes Topic2 to broker Nodes 1-4; Zookeeper coordinates; consumer group Group1 with Consumers 1-3 reads the topics)
    • Kafka replay messages (diagram: Zookeeper tracks the current offset between the min and max offsets held on the broker nodes)
      • fetchRequest = new FetchRequest(topic, partition, offset, size);
      • currentOffset: taken from Zookeeper
      • Earliest offset: -2
      • Latest offset: -1
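    Below is a rough sketch of the replay idea, written against the 0.7-era SimpleConsumer API that the slide's pseudocode reflects (the constructor, the fetch call, and the special -2/-1 offsets changed in later Kafka versions); the broker host, port, and topic are placeholder values.

      import java.nio.ByteBuffer;

      import kafka.api.FetchRequest;
      import kafka.javaapi.consumer.SimpleConsumer;
      import kafka.javaapi.message.ByteBufferMessageSet;
      import kafka.message.MessageAndOffset;

      public class ReplayExample {
          public static void main(String[] args) {
              // Direct connection to one broker: host, port, socket timeout (ms), buffer size.
              SimpleConsumer consumer = new SimpleConsumer("broker1", 9092, 10000, 64 * 1024);
              String topic = "topic1";
              int partition = 0;

              // -2 asks for the earliest offset still retained, -1 for the latest offset.
              long earliest = consumer.getOffsetsBefore(topic, partition, -2L, 1)[0];
              long latest = consumer.getOffsetsBefore(topic, partition, -1L, 1)[0];

              // Replay from the earliest retained offset up to the current end of the log.
              // (Simplified: assumes each fetch returns at least one message.)
              long offset = earliest;
              while (offset < latest) {
                  FetchRequest request = new FetchRequest(topic, partition, offset, 1024 * 1024);
                  ByteBufferMessageSet messages = consumer.fetch(request);
                  for (MessageAndOffset messageAndOffset : messages) {
                      ByteBuffer payload = messageAndOffset.message().payload();
                      byte[] bytes = new byte[payload.remaining()];
                      payload.get(bytes);
                      System.out.println("replayed: " + new String(bytes));
                      offset = messageAndOffset.offset(); // offset to fetch from next
                  }
              }
              consumer.close();
          }
      }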
    • Kafka API[3]
      • Producer API
      • Consumer API
        o High-level API: uses Zookeeper to access brokers and to save offsets
        o SimpleConsumer API: direct access to Kafka brokers
      • Kafka-Spout, Camus, and KafkaHadoopConsumer all use the SimpleConsumer API
    • Kafka API[3]
      • Producer:
          List<KeyedMessage<String, String>> messages = new ArrayList<KeyedMessage<String, String>>();
          messages.add(new KeyedMessage<String, String>("topic1", null, msg1));
          producer.send(messages);
      • Consumer (high-level API):
          Map<String, List<KafkaStream<byte[], byte[]>>> streams =
              consumer.createMessageStreams(Collections.singletonMap("topic1", 1));
          for (MessageAndMetadata<byte[], byte[]> message : streams.get("topic1").get(0)) {
              // do something with message
          }
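    Putting the fragments above into a self-contained form, here is a sketch against the Kafka 0.8 Java API; the broker list, Zookeeper address, and consumer group id are placeholder values. The producer sends one message and the high-level consumer, whose offsets are tracked in Zookeeper, reads it back.

      import java.util.Collections;
      import java.util.List;
      import java.util.Map;
      import java.util.Properties;

      import kafka.consumer.Consumer;
      import kafka.consumer.ConsumerConfig;
      import kafka.consumer.KafkaStream;
      import kafka.javaapi.consumer.ConsumerConnector;
      import kafka.javaapi.producer.Producer;
      import kafka.message.MessageAndMetadata;
      import kafka.producer.KeyedMessage;
      import kafka.producer.ProducerConfig;

      public class ProducerConsumerExample {
          public static void main(String[] args) {
              // Producer: send one string message to topic1.
              Properties producerProps = new Properties();
              producerProps.put("metadata.broker.list", "broker1:9092");
              producerProps.put("serializer.class", "kafka.serializer.StringEncoder");
              Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(producerProps));
              producer.send(new KeyedMessage<String, String>("topic1", "hello"));
              producer.close();

              // High-level consumer: offsets are stored in Zookeeper on our behalf.
              Properties consumerProps = new Properties();
              consumerProps.put("zookeeper.connect", "zk1:2181");
              consumerProps.put("group.id", "group1");
              consumerProps.put("auto.offset.reset", "smallest");
              ConsumerConnector connector = Consumer.createJavaConsumerConnector(new ConsumerConfig(consumerProps));

              Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                      connector.createMessageStreams(Collections.singletonMap("topic1", 1));
              for (MessageAndMetadata<byte[], byte[]> message : streams.get("topic1").get(0)) {
                  System.out.println(new String(message.message()));
                  break; // read one message and stop
              }
              connector.shutdown();
          }
      }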
    • Kafka in Unit Testing
      • Use of class KafkaServer
      • Run embedded server
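    A rough sketch of the embedded-broker idea, assuming the Kafka 0.7-era API in use at the time of the talk; the KafkaServer constructor and the property names (brokerid, log.dir, enable.zookeeper) changed in later versions, so treat the details below as assumptions rather than a definitive recipe.

      import java.util.Properties;

      import kafka.server.KafkaConfig;
      import kafka.server.KafkaServer;

      public class EmbeddedKafkaTest {
          public static void main(String[] args) throws Exception {
              Properties props = new Properties();
              props.put("brokerid", "0");                        // 0.7-style property names
              props.put("port", "9092");
              props.put("log.dir", "/tmp/embedded-kafka-logs");  // throwaway log directory
              props.put("num.partitions", "1");
              props.put("enable.zookeeper", "false");            // run standalone in the test

              // Start the broker inside the test JVM...
              KafkaServer server = new KafkaServer(new KafkaConfig(props));
              server.startup();

              // ... run producer/consumer test code against localhost:9092 here ...

              // ... and shut it down at the end of the test.
              server.shutdown();
              server.awaitShutdown();
          }
      }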
    • Introducing Avro[5]
      • Schema representation using JSON
      • Supported types
        o Primitive types: boolean, int, long, string, etc.
        o Complex types: Record, Enum, Union, Arrays, Maps, Fixed
      • Data is serialized using its schema
      • Avro files include a file header containing the schema
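    To make the Avro points concrete, here is a small sketch using Avro's Java API: a hypothetical event schema with sessionId and timestamp fields (echoing the header shown on the next slide) is parsed from JSON, and a record is serialized to bytes and read back with the same schema.

      import java.io.ByteArrayOutputStream;

      import org.apache.avro.Schema;
      import org.apache.avro.generic.GenericData;
      import org.apache.avro.generic.GenericDatumReader;
      import org.apache.avro.generic.GenericDatumWriter;
      import org.apache.avro.generic.GenericRecord;
      import org.apache.avro.io.BinaryDecoder;
      import org.apache.avro.io.BinaryEncoder;
      import org.apache.avro.io.DecoderFactory;
      import org.apache.avro.io.EncoderFactory;

      public class AvroRoundTrip {
          public static void main(String[] args) throws Exception {
              // The schema is plain JSON: a Record with two primitive-typed fields.
              String schemaJson = "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                      + "{\"name\":\"sessionId\",\"type\":\"string\"},"
                      + "{\"name\":\"timestamp\",\"type\":\"long\"}]}";
              Schema schema = new Schema.Parser().parse(schemaJson);

              // Build a record and serialize it using the schema.
              GenericRecord event = new GenericData.Record(schema);
              event.put("sessionId", "102122");
              event.put("timestamp", 12346L);

              ByteArrayOutputStream out = new ByteArrayOutputStream();
              BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
              new GenericDatumWriter<GenericRecord>(schema).write(event, encoder);
              encoder.flush();
              byte[] bytes = out.toByteArray();

              // Deserialize with the same schema.
              BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
              GenericRecord decoded = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
              System.out.println(decoded);
          }
      }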
    • Add Avro protocol to the story (diagram: Producer 1 -> Kafka Topic 1 / Topic 2 -> Consumers: Camus/Storm, with a Schema Repo on the side)
      • Producer: create the message according to Schema 1.0, e.g. header {event1:{header:{sessionId:"102122", timestamp:"12346"}}}...
      • Producer: register schema 1.0 in the Schema Repo, add the schema revision to the message header, encode the message with Schema 1.0, and send it
      • Consumer: read the message, extract the header and obtain the schema version, get the schema by version from the Schema Repo, decode the message with Schema 1.0, and pass it on
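    The slides do not show the exact wire format, so the sketch below only illustrates the idea: a single schema-revision byte is prepended to the Avro payload on the producer side, and the consumer uses that byte to look the schema up in a hypothetical in-memory schema repository before decoding.

      import java.io.ByteArrayOutputStream;
      import java.io.IOException;
      import java.util.Map;

      import org.apache.avro.Schema;
      import org.apache.avro.generic.GenericDatumReader;
      import org.apache.avro.generic.GenericDatumWriter;
      import org.apache.avro.generic.GenericRecord;
      import org.apache.avro.io.BinaryDecoder;
      import org.apache.avro.io.BinaryEncoder;
      import org.apache.avro.io.DecoderFactory;
      import org.apache.avro.io.EncoderFactory;

      public class VersionedAvroEnvelope {

          // Producer side: [schema revision byte][Avro binary payload].
          public static byte[] encode(GenericRecord record, Schema schema, byte revision) throws IOException {
              ByteArrayOutputStream out = new ByteArrayOutputStream();
              out.write(revision);
              BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
              new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
              encoder.flush();
              return out.toByteArray();
          }

          // Consumer side: read the revision, look the schema up, decode the rest.
          public static GenericRecord decode(byte[] message, Map<Byte, Schema> schemaRepo) throws IOException {
              byte revision = message[0];
              Schema schema = schemaRepo.get(revision);
              BinaryDecoder decoder = DecoderFactory.get()
                      .binaryDecoder(message, 1, message.length - 1, null);
              return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
          }
      }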
    • Kafka + Storm + Avro example
      • Demonstrates the use of Avro data passed from Kafka to Storm
      • Explains Avro revision evolution
      • Requires Kafka and Zookeeper installed
      • Uses the Storm artifact and the Kafka-Spout artifact in Maven
      • A Maven plugin generates Java classes from the Avro schema
      • https://github.com/ransilberman/avro-kafka-storm
    • Resiliency (diagram: on the producer machine, the producer sends each message to Kafka and also persists it to a local file; a Kafka Bridge re-sends the persisted messages to Kafka; the fast topic feeds the real-time consumer (Storm) and the consistent topic feeds the offline consumer (Hadoop))
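    A simplified sketch of the producer-side part of this resiliency pattern, using the KeyedMessage-based producer API of Kafka 0.8 shown earlier; the broker addresses and the spool-file path are made-up values, and the Kafka Bridge that later re-sends the spooled messages is not shown.

      import java.io.FileWriter;
      import java.io.IOException;
      import java.util.Properties;

      import kafka.javaapi.producer.Producer;
      import kafka.producer.KeyedMessage;
      import kafka.producer.ProducerConfig;

      public class ResilientSender {
          private final Producer<String, String> producer;
          private final String spoolPath = "/var/spool/kafka-bridge/events.log"; // hypothetical spool file

          public ResilientSender() {
              Properties props = new Properties();
              props.put("metadata.broker.list", "broker1:9092,broker2:9092");
              props.put("serializer.class", "kafka.serializer.StringEncoder");
              props.put("request.required.acks", "1");
              producer = new Producer<String, String>(new ProducerConfig(props));
          }

          public void send(String topic, String event) {
              try {
                  // Normal path: send straight to Kafka.
                  producer.send(new KeyedMessage<String, String>(topic, event));
              } catch (Exception e) {
                  // Fallback path: persist locally; the bridge re-sends it later.
                  spoolToDisk(topic, event);
              }
          }

          private void spoolToDisk(String topic, String event) {
              try {
                  FileWriter writer = new FileWriter(spoolPath, true);
                  writer.write(topic + "\t" + event + "\n");
                  writer.close();
              } catch (IOException io) {
                  // If even the local disk fails, the event is lost; log and alert here.
              }
          }
      }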
    • Challenges of Kafka
      • Still not mature enough
      • Not enough supporting tools (viewers, maintenance)
      • Duplications may occur
      • API not documented well enough
      • Open Source: supported by the community only
      • Difficult to replay messages from a specific point in time
      • Eventually Consistent...
    • Eventually Consistent: because it is a distributed system -
      • No guarantee of delivery order
      • No way to tell to which broker a message is sent
      • Kafka does not guarantee that there are no duplications
      • ...But eventually, all messages will arrive! (illustration: an event crossing a desert from "Event generated" to "Event destination")
    • Major Improvements in Kafka 0.8[4]
      • Partition replication
      • Message send guarantee
      • Consumer offsets are represented as sequential numbers instead of byte positions (e.g., 1, 2, 3, ...)
    • Addressing Data Challenges
      • High throughput
        o Kafka, Hadoop
      • Horizontal scale to address growth
        o Kafka, Storm, Hadoop
      • High availability of data services
        o Kafka, Storm, Zookeeper
      • No data loss
        o Highly available services, no ETL
    • Addressing Data Challenges Cont.
      • Satisfy real-time demands
        o Storm
      • Enforce structural data with schemas
        o Avro
      • Process Big Data and Enterprise Data
        o Kafka, Hadoop
      • Single Source of Truth (SSOT)
        o Hadoop, no ETL
    • References
      • [1] Satisfying New Requirements for Data Integration, by David Loshin
      • [2] Apache Kafka
      • [3] Kafka API
      • [4] Kafka 0.8 Quick Start
      • [5] Apache Avro
      • [6] Storm
    • Thank you!