Real Time Data Streaming using Kafka & Storm

This presentation describes three real use cases of real-time data streaming and how they were implemented at LivePerson using Kafka and Storm.

Transcript

  • 1. DATA LivePerson Case Study: Real Time Data Streaming March 20th 2014 Ran Silberman
  • 2. About me ● Technical Leader of Data Platform in LivePerson ● Bird watcher and amateur bird photographer Pharaoh Eagle-Owl / Bubo ascalaphus This is what the people from previous slide were looking at… Amir Silberman
  • 3. Agenda ● Why we chose Kafka + Storm ● How implementation was done ● Measures of success ● Two examples of use ● Tips from our experience
  • 4. Data in LivePerson Visitor in Site Chat Window Agent console LivePerson SaaS Server LoginMonitor Rules, Intelligence, Decision Chat Chat Invite DATA DATA DATA BIG DATA
  • 5. Legacy Data flow in LivePerson BI DWH (Oracle) RealTime servers ETL Sessionize Modeling Schema View Real-Time data Historical data
  • 6. Why Kafka + Storm? ● Need to scale out and plan for future scale ○ Limit for scale should not be technology ○ Let the limit be cost of (commodity) hardware ● What Data platforms can be implemented quickly? ○ Open source - fast evolving and community ○ Micro-services - do only what you ought to do! ● Are there risks in this choice? ○ Yes! technology is not mature enough ○ But, there is no other mature technology that can address our needs!
  • 7. Long-eared Owl / Asio otus Amir Silberman
  • 8. Legacy Data flow in LivePerson BI DWH (Oracle) RealTime servers Customers ETL Sessionize Modeling Schema View
  • 9. 1st phase - move to Hadoop ETL Sessionize Modeling Schema View RealTime servers BI DWH (Vertica)HDFS Hadoop MR Job transfers data to BI DWH Customers
  • 10. 2. move to Kafka 6 RealTime servers HDFS BI DWH (Vertica) Hadoop MR Job transfers data to BI DWH Kafka Topic-1 Customers
  • 11. 3. Integrate with new producers 6 RealTime servers HDFS BI DWH (Vertica) Hadoop MR Job transfers data to BI DWH Kafka Topic-1 Topic-2 New RealTime servers Customers
  • 12. 4. Add Real-time BI 6 Customers RealTime servers HDFS BI DWH (Vertica) Hadoop MR Job transfers data to BI DWH Kafka Topic-1 Topic-2 New RealTime servers Storm Topology Analytics DB
  • 13. Architecture Real-time servers Kafka Storm Cassandra/ CouchBase Real Time Processing Flow rate into Kafka: 33 MB/Sec Flow rate from Kafka: 20 MB/Sec Total daily data in Kafka: 17 Billion events Some Numbers: Cyber Monday 2013 Dashboards 4 topologies reading all events
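To put the Cyber Monday figures on slide 13 in perspective, here is a small back-of-the-envelope calculation. It assumes events were spread evenly over the day and, for the volume figure, that the 33 MB/sec inbound rate was sustained for 24 hours; the slide does not say whether that rate is a peak or an average, so treat the results as rough orders of magnitude.

```python
# Figures quoted on the slide (Cyber Monday 2013).
EVENTS_PER_DAY = 17_000_000_000
INBOUND_MB_PER_SEC = 33
SECONDS_PER_DAY = 24 * 60 * 60

# Average event rate, assuming events were spread evenly over the day.
avg_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY

# Daily inbound volume IF 33 MB/sec were sustained all day (an assumption;
# the slide does not say whether this is a peak or an average).
# Decimal units are used: 1 TB = 1,000,000 MB.
daily_inbound_tb = INBOUND_MB_PER_SEC * SECONDS_PER_DAY / 1_000_000

print(f"average events/sec: {avg_events_per_sec:,.0f}")
print(f"daily inbound volume if sustained: {daily_inbound_tb:.2f} TB")
```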
  • 14. Eurasian Wryneck / Jynx torquilla Amir Silberman
  • 15. Two use cases 1. Visitor list 2. Agent State
  • 16. 1st Storm Use Case: “Visitors List” Use case: ● Show list of visitors in the “Agent Console” ● Collect data about visitor in real time ● Visitor stickiness in streaming process
  • 17. Visitors List Topology
  • 18. Selected Analytics DB - Couchbase 1st Storm Use Case: “Visitors List” ● Document Store - for complex documents ● Searchable - possible to search by different attributes. ● High throughput - Read & Write
  • 19. First Storm Topology – Visitor Feed Storm Topology Kafka Spout Analyze relevant events Write event to Visitor document emit emit Kafka events stream Add/ Update Couchbase “Visitor List” Topology: Analytics DB: Couchbase - Document store Parse Avro into tuple emit
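The chain of stages on slide 19 (spout, parse, analyze, write) can be sketched in plain Python as a pipeline of generators. This is not Storm code and not the authors' implementation: JSON stands in for Avro, a dict stands in for Couchbase, and the event types and field names (`visitor_id`, `type`, `ts`) are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator

# Hypothetical event types that matter for the visitor list.
RELEVANT_TYPES = {"page_view", "chat_started"}

def spout(raw_messages: Iterable[bytes]) -> Iterator[bytes]:
    """Stand-in for the Kafka spout: emits raw messages one at a time."""
    yield from raw_messages

def parse(messages: Iterable[bytes]) -> Iterator[dict]:
    """Stand-in for the Avro-parsing bolt (JSON is used here instead)."""
    for message in messages:
        yield json.loads(message)

def analyze(events: Iterable[dict]) -> Iterator[dict]:
    """Pass through only the events relevant to the visitor list."""
    for event in events:
        if event["type"] in RELEVANT_TYPES:
            yield event

def write(events: Iterable[dict], store: Dict[str, dict]) -> None:
    """Add or update the visitor document; a dict stands in for Couchbase."""
    for event in events:
        doc = store.setdefault(event["visitor_id"], {"events": []})
        doc["events"].append(event["type"])
        doc["last_seen"] = event["ts"]

raw = [
    json.dumps({"visitor_id": "v1", "type": "page_view", "ts": 1}).encode(),
    json.dumps({"visitor_id": "v1", "type": "heartbeat", "ts": 2}).encode(),
    json.dumps({"visitor_id": "v1", "type": "chat_started", "ts": 3}).encode(),
]
visitor_docs: Dict[str, dict] = {}
write(analyze(parse(spout(raw))), visitor_docs)
print(visitor_docs)
```

In a real topology each stage would run as its own spout or bolt with independent parallelism; the generator chain only illustrates the order of the steps.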
  • 20. Visitors List - Storm considerations ● Complex calculations before sending to DB ○ Ignore delayed events ○ Reorder events before storing ● Document cached in memory ● Fields Grouping to bolt that writes to CouchBase ● High parallelism in bolt that writes to CouchBase
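Two of the considerations on slide 20 lend themselves to a short sketch: routing every event for a visitor to the same task (the idea behind fields grouping) and holding events briefly so they can be stored in timestamp order while ignoring ones that arrive too late. The code below is an illustration under stated assumptions, not the deck's implementation: `crc32` is not Storm's actual hash function, and `ReorderBuffer` and `max_delay` are hypothetical names.

```python
import heapq
import zlib
from typing import List, Tuple

def task_for(visitor_id: str, num_tasks: int) -> int:
    """Fields-grouping idea: the same key always maps to the same task.
    crc32 is an illustrative choice, not Storm's actual hash function."""
    return zlib.crc32(visitor_id.encode()) % num_tasks

class ReorderBuffer:
    """Holds events briefly so they can be released in timestamp order."""

    def __init__(self, max_delay: int) -> None:
        self.max_delay = max_delay      # how long to wait for stragglers
        self.heap: List[Tuple[int, str]] = []
        self.released_up_to = -1        # timestamp of the last released event
        self.newest_seen = -1

    def add(self, ts: int, payload: str) -> List[Tuple[int, str]]:
        """Accept one event; return any events now safe to store, in order."""
        if ts < self.released_up_to:
            return []                   # delayed too long: ignore it
        heapq.heappush(self.heap, (ts, payload))
        self.newest_seen = max(self.newest_seen, ts)
        ready = []
        # Release everything older than the newest timestamp minus the delay.
        while self.heap and self.heap[0][0] <= self.newest_seen - self.max_delay:
            ready.append(heapq.heappop(self.heap))
        if ready:
            self.released_up_to = ready[-1][0]
        return ready

buf = ReorderBuffer(max_delay=2)
out = []
for ts, payload in [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (2, "late")]:
    out.extend(buf.add(ts, payload))
print(out)  # the event at ts=5 is still buffered; "late" was dropped
```

Because all events for one visitor land on one task, a buffer like this (and the cached document) can live in that task's memory without cross-task coordination.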
  • 21. Visitors List Topology
  • 22. European Roller / Coracias garrulus Amir Silberman
  • 23. 2nd Storm Use Case: “Agent State” Use case: ● Show Agent activity on “Agent Console” ● Count Agent statistics ● Display graphs
  • 24. Agent Status Topology
  • 25. Selected Analytics DB - Cassandra 2nd Storm Use Case: “Agent State” ● Wide Column Store DB ● Highly Available w/o Single point of failure ● High throughput ● Optimized for counters
  • 26. First Storm Topology – Visitor Feed Storm Topology Kafka Spout Analyze relevant events Send events emit emit Kafka events stream Add “Agent Status” Topology: Analytics DB: Cassandra - Wide column store Parse Avro into tuple emit Data visualization using Highcharts
  • 27. Agent Status - Storm considerations ● Counters stored by topology ● Calculations done after reading from DB ● Delayed events should not be ignored ● Order of events does not matter ● Using Highcharts for data visualization
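Slide 27 contrasts with the Visitors List case: counter increments are commutative, so event order does not matter and delayed events can still be counted. A minimal sketch of that property follows, with a `Counter` standing in for Cassandra counter columns and a hypothetical key of (agent, minute, state); the totals are computed on the read side, in line with "calculations done after reading from DB".

```python
from collections import Counter
from typing import Iterable, Tuple

def apply_events(counters: Counter, events: Iterable[Tuple[str, int, str]]) -> None:
    """Increment one counter per (agent, minute, state) for each event."""
    for agent_id, ts_seconds, state in events:
        minute = ts_seconds // 60
        counters[(agent_id, minute, state)] += 1

def total_for(counters: Counter, agent_id: str, state: str) -> int:
    """Read-side calculation: sum the stored buckets after reading them."""
    return sum(n for (a, _, s), n in counters.items() if a == agent_id and s == state)

events = [("agent-1", 10, "chatting"), ("agent-1", 70, "chatting"), ("agent-1", 15, "away")]
in_order, shuffled = Counter(), Counter()
apply_events(in_order, events)
apply_events(shuffled, reversed(events))  # same events, opposite arrival order

print(in_order == shuffled)
print(total_for(in_order, "agent-1", "chatting"))
```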
  • 28. Spur-winged Lapwing / Vanellus spinosus Amir Silberman
  • 29. 3rd Storm Use Case: Data Auditing Use case: ● Need to be able to tell whether events arrived ○ Were there any missing events? ○ Were there any duplicated events? ○ How long did it take for events to arrive? ● Data not important - only count of events
  • 30. 3rd Storm Use Case: Data Auditing Realtime server Kafka Topics Auditing Topic Storm Sync topology Audit-loader topology MySql Hadoop HDFS audit job kafka 1 3 4 2 Auditor
  • 31. First Storm Topology – Visitor Feed Storm Topology Kafka Spout Analyze relevant events Send events emit emit Kafka events stream Add “Sync Audit” Topology: Sync messages between two topics Parse Avro into tuple emit Kafka Audit topic
  • 32. First Storm Topology – Visitor Feed Storm Topology Kafka Spout Analyze relevant events Send events emit emit Kafka Audit topic Add “Load Audit” Topology: Analytics DB: MySql - RDBMS Parse Avro into tuple emit Auditing Report
  • 33. “Load Audit” Topology: ● Stores statistics of events count ● SQL type DB ● Used for Auditing and other statistics ● Requires metadata in events header
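The auditing idea in slides 29 to 33 (only counts matter, and events carry metadata in their header) can be sketched as a comparison between the counts producers report and the counts observed downstream. The header fields `producer` and `bucket` and the report wording are hypothetical, and the sketch covers only the missing and duplicated questions, not arrival latency.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Key = Tuple[str, int]  # (producer id, time bucket): hypothetical header fields

def count_received(headers: Iterable[dict]) -> Counter:
    """Count events per (producer, bucket) using only header metadata."""
    received: Counter = Counter()
    for h in headers:
        received[(h["producer"], h["bucket"])] += 1
    return received

def audit(sent: Dict[Key, int], received: Counter) -> Dict[Key, str]:
    """Compare what producers say they sent with what was counted downstream."""
    report = {}
    for key, expected in sent.items():
        got = received.get(key, 0)
        if got < expected:
            report[key] = f"missing {expected - got}"
        elif got > expected:
            report[key] = f"duplicated {got - expected}"
        else:
            report[key] = "ok"
    return report

sent = {("rt-1", 100): 3, ("rt-2", 100): 2, ("rt-3", 100): 1}
headers = (
    [{"producer": "rt-1", "bucket": 100}] * 2
    + [{"producer": "rt-2", "bucket": 100}] * 3
    + [{"producer": "rt-3", "bucket": 100}]
)
report = audit(sent, count_received(headers))
print(report)
```

Since only the header is inspected, the event payload never needs to be decoded, which matches the slide's point that the data itself is not important for auditing.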
  • 34. Challenges: ● High network traffic ● Writing to Kafka is faster than reading ● All topologies read all events ● How to avoid resource starvation in Storm Subalpine Warbler / Sylvia cantillans Amir Silberman
  • 35. Optimizations of Kafka ● Increase Kafka consuming rate by adding partitions ● Run on physical machines with RAID ● Set retention to the proper need ● Monitor data flow!
  • 36. Optimizations of Storm ● # of Kafka spouts = total number of partitions ● Set “Isolation mode” for important topologies ● Validate that network cards can carry the network traffic ● Set Storm cluster on high CPU machines ● Monitor servers' CPU & memory (Graphite) ● Assess min. # of cores that topology needs ○ Use “top” -> “load” to find server load
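The first tip on slide 36 follows from the fact that a Kafka partition is read by only one consumer task at a time: fewer spout tasks than partitions means some tasks read several partitions, and more tasks than partitions leaves the extras idle. The round-robin assignment below is an illustrative scheme, not necessarily how the Kafka spout distributes partitions.

```python
from typing import Dict, List

def assign_partitions(num_partitions: int, num_spout_tasks: int) -> Dict[int, List[int]]:
    """Round-robin partitions over spout tasks (an illustrative scheme)."""
    assignment: Dict[int, List[int]] = {t: [] for t in range(num_spout_tasks)}
    for p in range(num_partitions):
        assignment[p % num_spout_tasks].append(p)
    return assignment

# Six partitions read by 3, 6, and 8 spout tasks.
for tasks in (3, 6, 8):
    a = assign_partitions(6, tasks)
    idle = [t for t, parts in a.items() if not parts]
    print(tasks, a, "idle tasks:", idle)
```

With six partitions, three tasks each read two partitions, six tasks read one each, and with eight tasks two of them have nothing to read.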
  • 37. Demo ● Agent Console - https://z1.le.liveperson.net/ 71394613 / rans@liveperson.com ● My Site - http://birds-of-israel.weebly.com/
  • 38. Questions? Little Owl / Athene noctua Amir Silberman
  • 39. Thank you! Ruff / Philomachus pugnax Amir Silberman