Liveperson DLD 2015

DLD. Tel-Aviv. 2015
Making Scale a Non-Issue
for Real-Time Data Apps
Vladi Feigin, LivePerson
Kobi Salant, LivePerson

Agenda
 Intro
 About LivePerson
 Digital Engagements
 Call Center Use Case
 Architecture
 Zoom-In

Bio
Vladi Feigin
 System Architect in LivePerson
 18 years in software development
  Interests: distributed computing, data, analytics and martial arts

Bio
Kobi Salant
 Data Platform Tech Lead in LivePerson
 25 years in software development
 Interests : Application performance, traveling and coffee

LivePerson
 We do Digital Engagements
 Agile and very technological
 Real Big Data and Analytics company
 Really cool place to work in
 One of the SaaS pioneers
 6 Data Centers across the world
  Founded in 1995, a public company since 2000 (NASDAQ: LPSN)
  More than 18,000 customers worldwide
  More than 1000 employees

We are Big Data
 1.4 Million concurrent visits
 1 Million events per second
 2 billion site visits per month
 27 million live engagements per month
 Data freshness SLA (RT flow): up to 5 seconds

Call Center Operating
Digital engagement requires operating a call center in the most efficient way.
How do you operate a call center in the most efficient way?
  Provide operational metrics … in real time
What are the challenges?
  Huge scale, load peaks, real-time calculations, a strict data freshness SLA

Architecture. Real-Time data flow
[Diagram. Components: producers (agent, sess., chat, conv., other); Kafka with a Fast topic and a Consistent topic; Storm; Cassandra, ElasticSearch, CouchBase; API; Custom Apps; Batch layer (Hadoop)]

Chat History. Example
[Diagram. Producers (agent, sess., chat) send to Kafka.
Fast topic, Storm, ElasticSearch, API: very low latency, 99.5% of data.
Consistent topic, MR job: high latency, 99.999% of data.]

Data Producers. Requirements
 Real time
 “Five nines” persistence
 Small footprint
 No interference with service
 Multiple producers & platforms
 Monolithic to service oriented
[Figure label: Many More Services]

Data Producers. Lessons learned
 Hundreds of services
 Complex rollouts
 Minimal logic to avoid painful fixes
 Audit streaming? Split to buckets
 Real time and “five nines” persistence are incompatible
[Figure labels: In House; 1; Bucket, Bucket]

Data Producers. Flow
[Diagram. Labels: send message to Kafka (Fast Topic, real-time customers); persist message to local disk (local file); Kafka Bridge sends message to Kafka (Consistent Topic, offline customers); Kafka resilience]
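
The flow on this slide can be sketched roughly as follows. This is a minimal simulation, not LivePerson's implementation: `FakeKafka`, `Producer`, `KafkaBridge` and the topic names `fast` and `consistent` are hypothetical, and the pairing of the direct send with the fast topic and of the bridge with the consistent topic is a reading of the diagram labels. The idea shown is that the fast path is best effort, while the consistent path goes through a local file so that events survive a broker outage.

```python
import json
import os
import tempfile


class FakeKafka:
    """In-memory stand-in for a Kafka client (hypothetical, for illustration)."""

    def __init__(self):
        self.topics = {}
        self.available = True

    def send(self, topic, message):
        if not self.available:
            raise ConnectionError("broker unavailable")
        self.topics.setdefault(topic, []).append(message)


class Producer:
    """Runs inside the service: spool to local disk, then try the fast topic."""

    def __init__(self, kafka, spool_path):
        self.kafka = kafka
        self.spool_path = spool_path

    def publish(self, event):
        line = json.dumps(event)
        # Consistent path: persist locally first so the event is not lost.
        with open(self.spool_path, "a", encoding="utf-8") as spool:
            spool.write(line + "\n")
        # Fast path: best effort, a broker failure must not break the service.
        try:
            self.kafka.send("fast", line)
        except ConnectionError:
            pass


class KafkaBridge:
    """Forwards spooled lines to the consistent topic and remembers its position."""

    def __init__(self, kafka, spool_path):
        self.kafka = kafka
        self.spool_path = spool_path
        self.offset = 0  # byte offset of the next unsent line

    def drain(self):
        sent = 0
        with open(self.spool_path, "rb") as spool:
            spool.seek(self.offset)
            while True:
                raw = spool.readline()
                if not raw:
                    break
                try:
                    self.kafka.send("consistent", raw.decode("utf-8").rstrip("\n"))
                except ConnectionError:
                    break  # keep the offset, retry on the next drain
                self.offset = spool.tell()
                sent += 1
        return sent


def demo():
    kafka = FakeKafka()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "events.spool")
        producer = Producer(kafka, path)
        bridge = KafkaBridge(kafka, path)
        producer.publish({"id": 1})
        kafka.available = False
        producer.publish({"id": 2})      # dropped on the fast path, kept on disk
        sent_while_down = bridge.drain()
        kafka.available = True
        sent_after_recovery = bridge.drain()
        return (len(kafka.topics["fast"]), len(kafka.topics["consistent"]),
                sent_while_down, sent_after_recovery)
```

In `demo()` the fast topic misses the event published during the outage, while the consistent topic receives both events once the broker is back. That is the trade-off behind keeping the two flows separate.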

Data Model Framework
Why Avro:
 Schema based evolution
 Performance - Untagged bytes
 HDFS ecosystem support
Lessons Learned:
 Schema evolution breaks
 Big schema (ours is over 65k) not recommended
 Avoid deep nesting and multiple unions
 Need a framework
Chaos – Non-Schema, space delimited
Order – Avro Schema

Framework Flow
1. Event is created according to Avro schema version 3.5
2. Schema is registered in the repository (once)
3. Value 3.5 is written to the header
4. Event is encoded with schema version 3.5 and added to the message
5. Message is sent to Kafka
6. Message is read by the consumer
7. Header is read from the message
8. Schema is retrieved from the repository according to the schema version
9. Event is decoded using the proper Avro schema
10. Decoded event is processed
[Diagram labels: schema version 3.5 in the message header; Consumer; Repository]
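
The ten steps above can be sketched as a small round trip. This is an illustration under stated assumptions, not the framework itself: `SchemaRepository`, `encode` and `decode` are hypothetical names, the header layout (one length byte followed by the version string) is invented for the example, and a JSON array of field values stands in for Avro's untagged binary encoding so that the sketch needs only the standard library.

```python
import json
import struct


class SchemaRepository:
    """In-memory stand-in for the schema repository (hypothetical)."""

    def __init__(self):
        self._schemas = {}

    def register(self, version, schema):
        # Step 2: a schema version is stored once; repeats are no-ops.
        self._schemas.setdefault(version, schema)

    def get(self, version):
        return self._schemas[version]


def encode(event, version, schema, repo):
    """Steps 1-4: register the schema, write the version header, encode the event."""
    repo.register(version, schema)
    header = version.encode("ascii")
    # Untagged body: values only, in schema field order (no field names).
    body = json.dumps([event[name] for name in schema["fields"]]).encode("utf-8")
    return struct.pack(">B", len(header)) + header + body


def decode(message, repo):
    """Steps 7-9: read the header, fetch the schema by version, decode the event."""
    (header_len,) = struct.unpack_from(">B", message, 0)
    version = message[1:1 + header_len].decode("ascii")
    schema = repo.get(version)
    values = json.loads(message[1 + header_len:].decode("utf-8"))
    return version, dict(zip(schema["fields"], values))


def demo():
    repo = SchemaRepository()
    schema = {"fields": ["visitor", "agent"]}  # hypothetical event schema
    message = encode({"visitor": "v1", "agent": "a7"}, "3.5", schema, repo)
    return message, decode(message, repo)
```

Because the body carries no field names, a consumer cannot interpret it without the schema version from the header. That is why the repository lookup in step 8 comes before decoding.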

Apache Kafka
 More than 15 billion events a day
 More than 1 million events per second
 Hundreds of producers & consumers
Why Kafka?
 Scale where traditional MQs fail
 Industry standard for big data log messaging
 Reliable, flexible and easy to use
Deployment:
 We have 15 clusters across the world
  Our biggest cluster has 8 nodes with more than 6TB (Avro + Kafka compression)
 Maximum retention of 72 hours

Apache Kafka. Lessons Learned
  Scale horizontally for hardware resources and vertically for throughput
  Watch the trends in network, I/O and Kafka's JMX statistics
[Chart labels: Partitions, Servers, Bytes in]

Apache Kafka. Lessons Learned cont.
 Know your data and message sizes:
 Large messages can break you
 Data growth can overfill your capacity
 Set the right configuration
 Adding or removing a broker is not trivial
 Decide on single or multiple topics

Apache Storm
Why Storm?
 Growing community with good integration to Kafka
 At the time, it was the leading product
 Easy development and customization
 The POC was successful
Deployment:
 We have 6 clusters across the world
  Our biggest cluster has more than 30 nodes
 We have 20 topologies on a single cluster
 Uptime of months for a single topology

Apache Storm. Typical topology
[Diagram: Storm Topology. KAFKA SPOUT, emit, FILTER BOLT, emit, WRITER BOLT, with acks flowing back. Other labels: fetch, Kafka Fast topic, Zookeeper, write, commit]
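
The spout, filter and writer chain can be simulated in a few lines. This is plain Python, not the Storm API (which is Java): `KafkaSpout`, `run_topology` and the `keep` predicate are hypothetical names, and a list stands in for the Kafka fast topic. The sketch assumes the spout commits its read position only after tuples are acked, which is one plausible reading of the fetch, ack and commit labels on the diagram.

```python
class KafkaSpout:
    """Reads from a list standing in for the Kafka fast topic."""

    def __init__(self, messages):
        self.messages = messages
        self.next_offset = 0
        self.pending = set()   # offsets emitted but not yet acked
        self.committed = 0     # every offset below this one is fully processed

    def fetch(self):
        if self.next_offset >= len(self.messages):
            return None
        offset = self.next_offset
        self.next_offset += 1
        self.pending.add(offset)
        return offset, self.messages[offset]

    def ack(self, offset):
        self.pending.discard(offset)
        # Commit only up to the lowest offset that is still in flight.
        self.committed = min(self.pending) if self.pending else self.next_offset


def run_topology(messages, keep):
    """Spout -> filter bolt -> writer bolt, acking every tuple back to the spout."""
    spout = KafkaSpout(messages)
    store = []  # stand-in for the data store the writer bolt writes to
    while True:
        item = spout.fetch()
        if item is None:
            break
        offset, event = item
        if keep(event):          # filter bolt
            store.append(event)  # writer bolt
        spout.ack(offset)        # filtered-out tuples are acked as well
    return store, spout.committed
```

For example, `run_topology(["chat:a", "noise", "chat:b"], lambda e: e.startswith("chat:"))` keeps the two chat events and leaves the committed position at 3, since every tuple was acked.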

Apache Storm. Lessons learned
 Develop SDK and educate R&D
  Where did my topology run last week? What is my capacity over time?
  Know your bolts: each must return a timely answer
 Coding is easy, performance is hard
 Use isolation
[Chart: Capacity]
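
The capacity figure shown in the Storm UI for a bolt is the number of executed tuples multiplied by the average execute latency, divided by the measurement window. A value near 1.0 means the bolt is running as fast as it can. The helper below is a small sketch of that formula; the function names and the 0.8 alert threshold are assumptions chosen for the example.

```python
def bolt_capacity(executed, avg_execute_latency_ms, window_ms=600_000):
    """Fraction of the window the bolt spent executing tuples (10-minute window by default)."""
    return executed * avg_execute_latency_ms / window_ms


def needs_more_parallelism(executed, avg_execute_latency_ms, threshold=0.8):
    """True when the bolt is close to saturation (threshold is an assumption)."""
    return bolt_capacity(executed, avg_execute_latency_ms) >= threshold
```

Recording this value over time helps answer the capacity-over-time question above, and it flags the bolts that do not return a timely answer.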

Apache Storm. Lessons learned cont.
 Use local shuffling
 Use Ack
[Diagram: Worker A and Worker B, each running a COMM BOLT and an ACKER BOLT; emits stay local within each worker]
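
Local shuffling means preferring a target task in the same worker process, so that a tuple does not cross the network. In Storm this is the local-or-shuffle grouping. The function below is a plain Python sketch of the selection rule only; `local_or_shuffle` and the task names are hypothetical.

```python
import random


def local_or_shuffle(source_worker, target_tasks, rng):
    """Pick a target task, preferring tasks that run in the source's worker.

    target_tasks is a list of (task_id, worker) pairs.
    """
    local = [task for task, worker in target_tasks if worker == source_worker]
    # Fall back to a shuffle across all tasks when no local task exists.
    candidates = local or [task for task, _ in target_tasks]
    return rng.choice(candidates)
```

With tasks `[("comm-1", "A"), ("comm-2", "B")]`, a tuple emitted in worker A always goes to `comm-1`. A tuple emitted in a worker without a local task is shuffled across both.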

Summary
 No one-size-fits-all solution
 Ask product for a clearly defined SLA
  Separate the fast and consistent data flows: they don’t merge!
 Use schema for a data model - keep it flat and small
 Kafka rules! It’s reliable and fast - use it
  Storm has its toll. For some use cases we would use Spark Streaming today

THANK YOU!
We are hiring
http://www.liveperson.com/company/careers
Q/A

YouTube.com/LivePersonDev
Twitter.com/LivePersonDev
Facebook.com/LivePersonDev
Slideshare.net/LivePersonDev
