
LivePerson DLD 2015


In this meetup, Kobi Salant, Data Platform Technical Lead, and Vladi Feigin, Data System Architect, both from LivePerson, talk about making scale a non-issue for real-time data apps.

Have you ever tried to build a system that processes hundreds of thousands of events per second in real time and serves more than 1M concurrent visitors?

We're going to talk about the LivePerson real-time stream processing solution that does exactly that. Learn how we empower digital call centers with insights for their critical decision-making processes and never-ending efficiency goals.


  1. DLD, Tel-Aviv, 2015. Making Scale a Non-Issue for Real-Time Data Apps. Vladi Feigin, LivePerson; Kobi Salant, LivePerson
  2. Agenda
     - Intro
     - About LivePerson
     - Digital Engagements
     - Call Center Use Case
     - Architecture
     - Zoom-In
  3. Bio. Vladi Feigin
     - System Architect at LivePerson
     - 18 years in software development
     - Interests: distributed computing, data, analytics, and martial arts
  4. Bio. Kobi Salant
     - Data Platform Tech Lead at LivePerson
     - 25 years in software development
     - Interests: application performance, traveling, and coffee
  5. LivePerson
     - We do digital engagements
     - Agile and highly technological
     - A real big data and analytics company
     - A really cool place to work
     - One of the SaaS pioneers
     - 6 data centers across the world
     - Founded in 1995; a public company since 2000 (NASDAQ: LPSN)
     - More than 18,000 customers worldwide
     - More than 1,000 employees
  6. LivePerson technology stack
  7. We Are Big Data
     - 1.4 million concurrent visits
     - 1 million events per second
     - 2 billion site visits per month
     - 27 million live engagements per month
     - Data freshness SLA (real-time flow): up to 5 seconds
  8. Visitor
  9. Agent
  10. Visitor
  11. Agent
  12. Call Center Operating
     Digital engagement requires operating a call center in the most efficient way.
     - How to do that? Provide operational metrics... in real time
     - What are the challenges? Huge scale, load peaks, real-time calculations, a tight data-freshness SLA
  13. Call Center Operating
  14. Architecture. Real-Time Data Flow
     [Diagram: producers (agent, session, chat, conversation, other) feed Kafka fast and consistent topics; Storm processes the streams into Cassandra, ElasticSearch, and CouchBase behind an API; a batch layer (Hadoop) and custom apps consume the consistent topic]
  15. Chat History. Example
     [Diagram: producers (agent, session, chat) feed Kafka; the fast topic flows through Storm into ElasticSearch behind an API with very low latency but only 99.5% of the data; the consistent topic flows through an MR job with high latency but 99.999% of the data]
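The split in this example is a classic lambda-style tradeoff: the fast path answers in seconds but can miss a small fraction of events, while the batch path is slow but nearly complete. A minimal sketch of how a serving layer might reconcile the two views (all names here are hypothetical, not LivePerson code):

```python
def merge_views(fast_events, consistent_events):
    """Combine the low-latency (fast topic) view with the complete
    (consistent topic) view. The batch result wins per event id,
    since it is the more reliable of the two."""
    merged = {e["id"]: e for e in fast_events}
    # Batch output overwrites any fast-path duplicates.
    merged.update({e["id"]: e for e in consistent_events})
    return sorted(merged.values(), key=lambda e: e["id"])
```

A query layer would run something like this per time window, preferring the batch output wherever it already exists and falling back to fast-path results for the most recent data.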
  16. Data Producers. Requirements
     - Real time
     - "Five nines" persistence
     - Small footprint
     - No interference with the service
     - Multiple producers and platforms
     - Moving from monolithic to service-oriented: many more services
  17. Data Producers. Lessons Learned
     - Hundreds of services mean complex rollouts
     - Keep producer logic minimal to avoid painful fixes
     - Auditing the stream? Split it into buckets
     - Real time and "five nines" persistence are incompatible
  18. Data Producers. Flow
     [Diagram: each producer sends the message to the Kafka fast topic in real time and also persists it to a local file on disk; a Kafka bridge reads the local file and sends the message to the consistent topic, serving offline consumers with full resilience while the fast topic serves real-time consumers]
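The producer flow can be sketched in a few lines: persist every message to a local spool file first, and let a separate bridge ship the file's contents to the consistent topic. This is an illustrative sketch, not LivePerson's actual code; in the real flow the producer also fires the same event at the Kafka fast topic.

```python
import json
import os

class DiskBackedProducer:
    """Persist every event to a local spool file before anything else,
    so the consistent path survives a Kafka or network outage."""

    def __init__(self, spool_path):
        self.spool = open(spool_path, "a")

    def send(self, event):
        # Persist to local disk first ("five nines" durability).
        self.spool.write(json.dumps(event) + "\n")
        self.spool.flush()
        os.fsync(self.spool.fileno())
        # A real producer would also send the event to the Kafka
        # fast topic here, without waiting for the bridge.

def bridge(spool_path):
    """The 'Kafka bridge' tails the spool file and forwards each event
    to the consistent topic; here it simply yields the parsed events."""
    with open(spool_path) as f:
        for line in f:
            yield json.loads(line)
```

The key property is ordering of the two writes: the local fsync happens before any network send, so an event acknowledged to the service can always be replayed from disk.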
  19. Data Model Framework
     Why Avro?
     - Schema-based evolution
     - Performance: untagged bytes
     - HDFS ecosystem support
     Lessons learned:
     - Schema evolution breaks
     - A big schema (ours is over 65k) is not recommended
     - Avoid deep nesting and multiple unions
     - You need a framework
     Chaos: non-schema, space-delimited. Order: Avro schema.
  20. Framework Flow
     1. An event is created according to Avro schema version 3.5
     2. The schema is registered in the repository (once)
     3. The value 3.5 is written to the header
     4. The event is encoded with schema version 3.5 and added to the message
     5. The message is sent to Kafka
     6. The message is read by a consumer
     7. The header is read from the message
     8. The schema is retrieved from the repository according to the schema version
     9. The event is decoded using the proper Avro schema
     10. The decoded event is processed
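The steps above can be sketched as a tiny version-header codec. This is a toy illustration of the pattern, not the actual framework: the repository is an in-memory dict, and JSON stands in for Avro encoding.

```python
import json
import struct

SCHEMA_REPOSITORY = {}  # version -> schema; stands in for the central repository

def register_schema(version, schema):
    """Step 2: register the schema once, keyed by its version."""
    SCHEMA_REPOSITORY[version] = schema

def encode(version, event):
    """Steps 3-5: pack the schema version into a two-byte header,
    then append the encoded event."""
    major, minor = (int(p) for p in version.split("."))
    return struct.pack(">BB", major, minor) + json.dumps(event).encode()

def decode(message):
    """Steps 7-9: read the header, fetch the schema from the
    repository, and decode the payload with it."""
    major, minor = struct.unpack(">BB", message[:2])
    version = f"{major}.{minor}"
    schema = SCHEMA_REPOSITORY[version]  # fails loudly if never registered
    return schema, json.loads(message[2:].decode())
```

The point of the header is that consumers never have to guess: every message carries enough information to find the exact schema it was written with, which is what makes schema evolution survivable at all.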
  21. Apache Kafka
     - More than 15 billion events a day
     - More than 1 million events per second
     - Hundreds of producers and consumers
     Why Kafka?
     - Scales where traditional MQs fail
     - The industry standard for big-data log messaging
     - Reliable, flexible, and easy to use
     Deployment:
     - 15 clusters across the world
     - Our biggest cluster has 8 nodes holding more than 6 TB (Avro + Kafka compression)
     - Maximum retention of 72 hours
  22. Apache Kafka. Lessons Learned
     - Scale horizontally for hardware resources and vertically for throughput
     - Watch trends in network, I/O, and Kafka's JMX statistics (partitions, servers, bytes in)
  23. Apache Kafka. Lessons Learned (cont.)
     - Know your data and message sizes:
       - Large messages can break you
       - Data growth can overfill your capacity
     - Set the right configuration
     - Adding or removing a broker is not trivial
     - Decide on single or multiple topics
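"Data growth can overfill your capacity" is easy to sanity-check with arithmetic. A back-of-envelope sketch, where the average compressed message size is our own assumption rather than a figure from the talk:

```python
def retention_bytes(events_per_sec, avg_msg_bytes, retention_hours, replication=1):
    """Disk needed to hold `retention_hours` of data at a steady ingest
    rate. Multiply by the replication factor for cluster-wide usage."""
    return events_per_sec * avg_msg_bytes * retention_hours * 3600 * replication

# At 1M events/s, an assumed ~25 compressed bytes per message, and the
# 72-hour retention from the previous slide, a single copy of the data
# already needs ~6.5 TB - the same ballpark as the 6 TB cluster above.
needed = retention_bytes(1_000_000, 25, 72)
```

Rerunning this whenever traffic or message sizes grow shows how quickly a fixed retention window can outgrow a cluster.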
  24. Apache Storm
     Why Storm?
     - A growing community with good Kafka integration
     - At the time, it was the leading product
     - Easy development and customization
     - The POC was successful
     Deployment:
     - 6 clusters across the world
     - Our biggest cluster has more than 30 nodes
     - 20 topologies on a single cluster
     - Uptime of months for a single topology
  25. Apache Storm. Typical Topology
     [Diagram: a Kafka spout fetches from the fast topic and commits offsets via Zookeeper; it emits tuples to a filter bolt, which emits to a writer bolt; acks flow back from the bolts to the spout]
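The typical topology can be simulated in a few lines: a spout feeds tuples to a filter bolt, the filter feeds a writer, and every tuple is acked back to the spout so its offset can be committed. A toy sketch of the control flow, not Storm's actual Java API:

```python
class KafkaSpout:
    """Feeds tuples into the topology and tracks acks; a real spout
    would commit acked offsets back via Zookeeper."""

    def __init__(self, messages):
        self.pending = list(messages)
        self.acked = []

    def next_tuple(self):
        return self.pending.pop(0) if self.pending else None

    def ack(self, msg_id):
        self.acked.append(msg_id)

def filter_bolt(event):
    """Pass through only the events this topology cares about."""
    return event if event["type"] == "chat" else None

def run_topology(spout, writer_sink):
    """Drive tuples through filter -> writer, acking every tuple so
    the spout never replays it."""
    while (event := spout.next_tuple()) is not None:
        kept = filter_bolt(event)
        if kept is not None:
            writer_sink.append(kept)  # the writer bolt's output store
        spout.ack(event["id"])        # ack even filtered-out tuples
```

Note that filtered-out tuples are still acked; forgetting this is a classic Storm mistake that causes endless replays and a stalled spout once the pending-tuple limit is hit.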
  26. Apache Storm. Lessons Learned
     - Develop an SDK and educate R&D
     - Where did my topology run last week? What is my capacity over time?
     - Know your bolts: they must return a timely answer
     - Coding is easy; performance is hard
     - Use isolation
  27. Apache Storm. Lessons Learned (cont.)
     - Use local shuffling
     - Use acks
     [Diagram: two workers, each with its own Kafka spout, filter bolt, writer bolt, and acker bolt; with local shuffling, emits stay inside a worker instead of crossing between workers via the communication bolts]
  28. Summary
     - There is no one-size-fits-all solution
     - Ask product for a clearly defined SLA
     - Separate the fast and consistent data flows - they don't merge!
     - Use a schema for the data model; keep it flat and small
     - Kafka rules! It's reliable and fast - use it
     - Storm has its toll; for some use cases we would use Spark Streaming today
  29. THANK YOU! We are hiring. Q&A