From a Kafkaesque Story to The Promised Land at LivePerson

Ran Silberman, developer & technical leader at LivePerson, presents how LivePerson moved its data platform from a legacy ETL concept to the "Data Integration" concept of our era.

Kafka is the main infrastructure that forms the backbone for data flow in the new Data Integration. That said, Kafka does not come by itself: other supporting systems such as Hadoop, Storm, and the Avro protocol were also integrated.

In this lecture Ran describes the implementation at LivePerson and shares some tips on how to avoid pitfalls.

Read More: https://connect.liveperson.com/community/developers/blog/2013/11/21/from-a-kafkaesque-story-to-the-promised-land

From a Kafkaesque Story to The Promised Land at LivePerson

  1. From a Kafkaesque Story to the Promised Land, 7/7/2013, Ran Silberman
  2. Open Source paradigm: The Cathedral & the Bazaar by Eric S. Raymond, 1999; the struggle between top-down and bottom-up design
  3. Challenges of data platform[1]
     • High throughput
     • Horizontal scale to address growth
     • High availability of data services
     • No Data loss
     • Satisfy Real-Time demands
     • Enforce structural data with schemas
     • Process Big Data and Enterprise Data
     • Single Source of Truth (SSOT)
  4. SLAs of the data platform (diagram: Real-time servers, Real-time Customers, Offline Customers, Data Bus, BI DWH, Real-time dashboards). Data Bus SLA: 1) 98% in < 1/2 hr, 2) 99.999% in < 4 hrs. BI DWH SLA: 1) 98% in < 500 msec, 2) no send > 2 sec.
  5. Legacy data flow in LivePerson (diagram: RealTime servers, ETL (Sessionize, Modeling, Schema View), Customers View Reports, BI DWH (Oracle))
  6. 1st phase - move to Hadoop (diagram: RealTime servers, ETL (Sessionize, Modeling, Schema View), Customers View Reports, Hadoop, MR job transfers data to BI DWH, HDFS, BI DWH (Vertica))
  7. 2nd phase - move to Kafka (diagram: RealTime servers, Kafka Topic-1, Customers View Reports, Hadoop, MR job transfers data to BI DWH, HDFS, BI DWH (Vertica))
  8. 3rd phase - Integrate with new producers (diagram: New RealTime servers, RealTime servers, Kafka Topic-1, Topic-2, Customers View Reports, Hadoop, MR job transfers data to BI DWH, HDFS, BI DWH (Vertica))
  9. 4th phase - Add Real-time BI (diagram: New RealTime servers, RealTime servers, Kafka Topic-1, Topic-2, Storm Topology, Customers View Reports, Hadoop, MR job transfers data to BI DWH, HDFS, BI DWH (Vertica))
  10. 5th phase - Standardize Data-Model using Avro (diagram: New RealTime servers, RealTime servers, Kafka Topic-1, Topic-2, Storm Topology, Hadoop, Customers View Reports, Camus, MR job transfers data to BI DWH, HDFS, BI DWH (Vertica))
  11. 6th phase - Define Single Source of Truth (SSOT) (diagram: New RealTime servers, RealTime servers, Kafka Topic-1, Topic-2, Storm Topology, Hadoop, Customers View Reports, Camus, MR job transfers data to BI DWH, HDFS, BI DWH (Vertica))
  12. Kafka[2] as Backbone for Data
     • Central "Message Bus"
     • Support multiple topics (MQ style)
     • Write ahead to files
     • Distributed & Highly Available
     • Horizontal Scale
     • High throughput (10s MB/sec per server)
     • Service is agnostic to consumers' state
     • Retention policy
  13. Kafka Architecture
  14. Kafka Architecture cont. (diagram: Producer 1, Producer 2, Producer 3, Zookeeper, Node 1, Consumer 1, Node 2, Consumer 1, Node 3, Consumer 1)
  15. Kafka Architecture cont. (diagram: Producer 1, Topic1, Producer 2, Topic2, Zookeeper, Node 1, Node 2, Node 3, Node 4, Group1 (Consumer 1, Consumer 2, Consumer 3))
  16. Kafka replay messages (diagram: Zookeeper, Min Offset ->, Max Offset ->, Node 3, Node 4). fetchRequest = new FetchRequest(topic, partition, offset, size); currentOffset: taken from Zookeeper; earliest offset: -2; latest offset: -1
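     For the replay flow above, here is a minimal sketch using the Kafka 0.8 Java SimpleConsumer API; the broker host, topic, partition, and offset values are placeholders, not LivePerson's actual configuration. The -2 and -1 values on the slide correspond to the earliest and latest offsets (kafka.api.OffsetRequest.EarliestTime() and LatestTime()).

        import kafka.api.FetchRequest;
        import kafka.api.FetchRequestBuilder;
        import kafka.javaapi.FetchResponse;
        import kafka.javaapi.consumer.SimpleConsumer;
        import kafka.message.MessageAndOffset;

        public class ReplayExample {
            public static void main(String[] args) {
                // Connect directly to one broker: host, port, socket timeout, buffer size, client id
                SimpleConsumer consumer = new SimpleConsumer("broker1", 9092, 100000, 64 * 1024, "replay-client");

                long offset = 0L; // replay point, e.g. the last offset saved in ZooKeeper
                FetchRequest request = new FetchRequestBuilder()
                        .clientId("replay-client")
                        .addFetch("topic1", 0, offset, 100000) // topic, partition, offset, max bytes
                        .build();

                FetchResponse response = consumer.fetch(request);
                for (MessageAndOffset mo : response.messageSet("topic1", 0)) {
                    // process mo.message(); mo.nextOffset() is the next offset to fetch
                    System.out.println("offset = " + mo.offset());
                }
                consumer.close();
            }
        }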
  17. Kafka API[3]
     • Producer API
     • Consumer API
       o High-level API - uses Zookeeper to access brokers and to save offsets
       o SimpleConsumer API - direct access to Kafka brokers
     • Kafka-Spout, Camus, and KafkaHadoopConsumer all use SimpleConsumer
  18. Kafka API[3]
     • Producer:
       List<KeyedMessage<K, V>> messages = new ArrayList<KeyedMessage<K, V>>();
       messages.add(new KeyedMessage<K, V>("topic1", null, msg1));
       producer.send(messages);
     • Consumer:
       streams = consumer.createMessageStreams("topic1", 1);
       for (message : streams[0]) { // do something with message }
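     As a fuller sketch of the two APIs sketched on the slide, the following uses the Kafka 0.8 Java producer and high-level consumer; the broker/ZooKeeper addresses, topic name, and group id are placeholders.

        import java.util.Collections;
        import java.util.List;
        import java.util.Map;
        import java.util.Properties;

        import kafka.consumer.Consumer;
        import kafka.consumer.ConsumerConfig;
        import kafka.consumer.KafkaStream;
        import kafka.javaapi.consumer.ConsumerConnector;
        import kafka.javaapi.producer.Producer;
        import kafka.message.MessageAndMetadata;
        import kafka.producer.KeyedMessage;
        import kafka.producer.ProducerConfig;

        public class ProducerConsumerExample {
            public static void main(String[] args) {
                // --- Producer ---
                Properties pProps = new Properties();
                pProps.put("metadata.broker.list", "broker1:9092");
                pProps.put("serializer.class", "kafka.serializer.StringEncoder");
                Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(pProps));
                producer.send(new KeyedMessage<String, String>("topic1", null, "hello kafka"));
                producer.close();

                // --- High-level consumer (offsets tracked in ZooKeeper) ---
                Properties cProps = new Properties();
                cProps.put("zookeeper.connect", "zk1:2181");
                cProps.put("group.id", "group1");
                ConsumerConnector connector = Consumer.createJavaConsumerConnector(new ConsumerConfig(cProps));
                Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                        connector.createMessageStreams(Collections.singletonMap("topic1", 1));
                for (MessageAndMetadata<byte[], byte[]> msg : streams.get("topic1").get(0)) {
                    System.out.println(new String(msg.message())); // do something with the message
                }
            }
        }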
  19. Kafka in Unit Testing
     • Use of class KafkaServer
     • Run embedded server
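     A minimal sketch of running an embedded broker in a test, assuming Kafka 0.8's KafkaServerStartable wrapper and a ZooKeeper instance already available to the test; the config values below are placeholders, not the configuration used at LivePerson.

        import java.util.Properties;

        import kafka.server.KafkaConfig;
        import kafka.server.KafkaServerStartable;

        public class EmbeddedKafkaTestHelper {
            private KafkaServerStartable server;

            public void start() {
                Properties props = new Properties();
                props.put("broker.id", "0");
                props.put("port", "9092");
                props.put("log.dir", "/tmp/embedded-kafka-logs");   // temp dir for test logs
                props.put("zookeeper.connect", "localhost:2181");    // test ZooKeeper
                server = new KafkaServerStartable(new KafkaConfig(props));
                server.startup();
            }

            public void stop() {
                server.shutdown();
            }
        }

     A test would call start() in setup, produce and consume against localhost:9092, and call stop() in teardown.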
  20. Introducing Avro[5]
     • Schema representation using JSON
     • Supported types
       o Primitive types: boolean, int, long, string, etc.
       o Complex types: Record, Enum, Union, Arrays, Maps, Fixed
     • Data is serialized using its schema
     • Avro files include a file header with the schema
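     A small sketch of the points above, using the Avro Java library to parse a JSON schema and serialize a record with it; the schema and field values are illustrative only (they echo the header example on the next slide).

        import java.io.ByteArrayOutputStream;

        import org.apache.avro.Schema;
        import org.apache.avro.generic.GenericData;
        import org.apache.avro.generic.GenericDatumWriter;
        import org.apache.avro.generic.GenericRecord;
        import org.apache.avro.io.BinaryEncoder;
        import org.apache.avro.io.EncoderFactory;

        public class AvroExample {
            public static void main(String[] args) throws Exception {
                // Schema is plain JSON: a record with primitive-typed fields
                String schemaJson = "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                        + "{\"name\":\"sessionId\",\"type\":\"string\"},"
                        + "{\"name\":\"timestamp\",\"type\":\"long\"}]}";
                Schema schema = new Schema.Parser().parse(schemaJson);

                // Build a record that conforms to the schema
                GenericRecord event = new GenericData.Record(schema);
                event.put("sessionId", "102122");
                event.put("timestamp", 12346L);

                // Serialize the record using its schema (binary encoding)
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
                new GenericDatumWriter<GenericRecord>(schema).write(event, encoder);
                encoder.flush();
                System.out.println("Serialized " + out.size() + " bytes");
            }
        }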
  21. Add Avro protocol to the story (diagram). Producer side: register schema 1.0 in the Schema Repo; create the message according to Schema 1.0, with a header such as {event1:{header:{sessionId:"102122"},{timestamp:"12346"}}}...; add the schema revision to the message header; encode the message with Schema 1.0; send the Avro message to a Kafka topic (Topic 1 / Topic 2). Consumer side (Camus/Storm): read the message, extract the header and obtain the schema version, get the schema by version (1.0) from the Schema Repo, decode the message with schema 1.0, and pass the message on.
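     A hedged sketch of that flow: prefix the Avro payload with a schema-revision byte, and on the consumer side look the writer schema up by that revision before decoding. The in-memory map stands in for the actual schema repository, and the one-byte framing is an assumption for illustration, not necessarily LivePerson's wire format.

        import java.io.ByteArrayOutputStream;
        import java.util.HashMap;
        import java.util.Map;

        import org.apache.avro.Schema;
        import org.apache.avro.generic.GenericDatumReader;
        import org.apache.avro.generic.GenericDatumWriter;
        import org.apache.avro.generic.GenericRecord;
        import org.apache.avro.io.BinaryEncoder;
        import org.apache.avro.io.DecoderFactory;
        import org.apache.avro.io.EncoderFactory;

        public class VersionedAvroCodec {
            // Stand-in for the schema repository: revision -> schema
            private final Map<Byte, Schema> schemaRepo = new HashMap<Byte, Schema>();

            public VersionedAvroCodec(byte revision, Schema schema) {
                schemaRepo.put(revision, schema);
            }

            // Producer side: encode the record and prepend its schema revision
            public byte[] encode(byte revision, GenericRecord record) throws Exception {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                out.write(revision); // one-byte schema revision "header"
                BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
                new GenericDatumWriter<GenericRecord>(schemaRepo.get(revision)).write(record, encoder);
                encoder.flush();
                return out.toByteArray();
            }

            // Consumer side (Camus/Storm): read the revision, fetch the schema, decode the payload
            public GenericRecord decode(byte[] message) throws Exception {
                byte revision = message[0];
                Schema writerSchema = schemaRepo.get(revision);
                GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(writerSchema);
                return reader.read(null, DecoderFactory.get().binaryDecoder(message, 1, message.length - 1, null));
            }
        }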
  22. Kafka + Storm + Avro example
     • Demonstrating use of Avro data passing from Kafka to Storm
     • Explains Avro revision evolution
     • Requires Kafka and Zookeeper installed
     • Uses Storm artifact and Kafka-Spout artifact in Maven
     • Plugin generates Java classes from Avro Schema
     • https://github.com/ransilberman/avro-kafka-storm
  23. Resiliency (diagram). On the producer machine, the producer sends the message to Kafka (Fast Topic) and also persists it to a local file on disk; a Bridge then sends the message from the local file to Kafka (Consistent Topic). Real-time consumer: Storm. Offline consumer: Hadoop.
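     A rough sketch of that dual-path idea, heavily simplified and not LivePerson's actual code: the producer appends every event to a local file in addition to the direct Kafka send, and a bridge replays the file into a second topic. The topic names, file handling, and error handling are assumptions.

        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.FileWriter;
        import java.io.PrintWriter;
        import java.util.Properties;

        import kafka.javaapi.producer.Producer;
        import kafka.producer.KeyedMessage;
        import kafka.producer.ProducerConfig;

        public class ResilientSender {
            private final Producer<String, String> producer;
            private final PrintWriter localLog;

            public ResilientSender(String brokers, String localFile) throws Exception {
                Properties props = new Properties();
                props.put("metadata.broker.list", brokers);
                props.put("serializer.class", "kafka.serializer.StringEncoder");
                producer = new Producer<String, String>(new ProducerConfig(props));
                localLog = new PrintWriter(new FileWriter(localFile, true)); // append-only local file
            }

            // Fast path: best-effort direct send; the event is always persisted locally as well
            public void send(String event) {
                localLog.println(event);
                localLog.flush();
                try {
                    producer.send(new KeyedMessage<String, String>("fast-topic", null, event));
                } catch (Exception e) {
                    // Kafka unavailable: the event is still on disk and will be replayed by the bridge
                }
            }

            // Bridge: replay the local file into the "consistent" topic
            public void bridge(String localFile) throws Exception {
                BufferedReader reader = new BufferedReader(new FileReader(localFile));
                String line;
                while ((line = reader.readLine()) != null) {
                    producer.send(new KeyedMessage<String, String>("consistent-topic", null, line));
                }
                reader.close();
            }
        }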
  24. Challenges of Kafka
     • Still not mature enough
     • Not enough supporting tools (viewers, maintenance)
     • Duplications may occur
     • API not documented enough
     • Open source - supported by the community only
     • Difficult to replay messages from a specific point in time
     • Eventually consistent...
  25. Eventually Consistent. Because it is a distributed system:
     • No guarantee of delivery order
     • No way to tell to which broker a message is sent
     • Kafka does not guarantee that there are no duplications
     • ...But eventually, all messages will arrive!
     (diagram: desert, event generated, event destination)
  26. Major Improvements in Kafka 0.8[4]
     • Partition replication
     • Message send guarantee
     • Consumer offsets are represented as numbers instead of bytes (e.g., 1, 2, 3, ...)
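     For the "message send guarantee" point, the Kafka 0.8 producer exposes an acknowledgement setting. A minimal sketch of the relevant producer configuration follows; the broker list is a placeholder.

        import java.util.Properties;

        import kafka.javaapi.producer.Producer;
        import kafka.producer.KeyedMessage;
        import kafka.producer.ProducerConfig;

        public class GuaranteedSendExample {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("metadata.broker.list", "broker1:9092,broker2:9092");
                props.put("serializer.class", "kafka.serializer.StringEncoder");
                // -1 waits for acknowledgement from the in-sync replicas (strongest guarantee);
                // 1 waits for the partition leader only; 0 does not wait at all.
                props.put("request.required.acks", "-1");

                Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(props));
                producer.send(new KeyedMessage<String, String>("topic1", null, "durable message"));
                producer.close();
            }
        }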
  27. Addressing Data Challenges
     • High throughput
       o Kafka, Hadoop
     • Horizontal scale to address growth
       o Kafka, Storm, Hadoop
     • High availability of data services
       o Kafka, Storm, Zookeeper
     • No Data loss
       o Highly Available services, No ETL
  28. Addressing Data Challenges Cont.
     • Satisfy Real-Time demands
       o Storm
     • Enforce structural data with schemas
       o Avro
     • Process Big Data and Enterprise Data
       o Kafka, Hadoop
     • Single Source of Truth (SSOT)
       o Hadoop, No ETL
  29. References
     • [1] Satisfying New Requirements for Data Integration, by David Loshin
     • [2] Apache Kafka
     • [3] Kafka API
     • [4] Kafka 0.8 Quick Start
     • [5] Apache Avro
     • [6] Storm
  30. Thank you!
