Streaming millions of Contact Center interactions in (near) real-time with Pulsar
Frank Kelly
Principal Engineer, Cogito Corp
Slack: https://apache-pulsar.slack.com/
A panoply of parameters
● Cogito & What we do
● Architecture & Use-Cases
● Challenges
● Initial lessons learned
● Kubernetes lessons learned
● Performance & Scaling settings
● Results
● Q&A
Intended Audience
Those who understand the main APIs and components but who may not be familiar with all the configuration settings or how to optimize the system for high write throughput and/or millions of topics.
Overview
Formed in 2007 out of MIT - based out of Boston - now with a Global Engineering Footprint
Vision: Elevating the human connection in real time . . . .
Product: Call center AI solution that analyzes the human voice and provides real-time guidance to enhance emotional intelligence and customer service.
Cogito: Who we are and what we do
Architecture
● Streaming: Real-time audio and analytic results from our AI/ML models
● We break each customer call into separate logical units called “intervals”
● Each interval is backed by two Pulsar topics
○ Real-time Audio Topic
○ Real-time Analytics Topic
● Slicing up binary formats into discrete messages → Deduplication is VERY important!
● With 15,000 concurrent users, we estimate 1.5m to 2m topics per day
● Each topic has moderate throughput ~ 32 Kb/s
● Also Messaging: Work-Queue events
Use-Cases for Pulsar
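A rough back-of-the-envelope sketch of what the topics-per-day estimate implies per user. Only the 15,000 users, the two topics per interval, and the 1.5m to 2m range come from the slide; the function name and the per-user breakdown are illustrative assumptions.

```python
TOPICS_PER_INTERVAL = 2  # from the slide: one audio topic + one analytics topic


def intervals_per_user_per_day(topics_per_day: int, users: int) -> float:
    """Implied number of intervals each user generates per day."""
    return topics_per_day / TOPICS_PER_INTERVAL / users


low_intervals = intervals_per_user_per_day(1_500_000, 15_000)
high_intervals = intervals_per_user_per_day(2_000_000, 15_000)
print(f"{low_intervals:.0f} to {high_intervals:.0f} intervals per user per day")
```

Under these figures each user would account for roughly 50 to 67 intervals a day, which gives a feel for how quickly topic metadata accumulates.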
● Streaming Use-Case
○ Lots of throughput ~ 10 Gbps
○ Message-ordering & deduplication are critical
○ Near real-time requirements (< 250ms)
■ Think about timeouts/retries/failover
● Challenges
○ ZooKeeper stores all the topics for a namespace under one ZNode
○ Brokers require more memory
● Alternatives considered
○ Using key_shared would require us to disable batching in the producer (not a huge deal)
○ Risk: Message dispatch will stop if there is a subscription / consumer that has built up a backlog of messages in their hash-range
○ Filtering on the client-side
The Challenges
● Processing real-time binary streams
○ Consumer: SubscriptionInitialPosition.Earliest
○ Broker Configuration: brokerDeduplicationEnabled: "true"
● Client Performance
○ Producer: sendAsync() ~10x improvement
○ Producer: blockIfQueueFull(true)
○ Batching: Enabled but the throughput per Producer is so low it rarely becomes helpful
● Default Timeouts
○ For our real-time system the default connection / operation timeout of 30s is too high
● Persistent vs. Non-Persistent
○ We support both use-cases (some customers wish for zero persistence)
Initial Lessons on the basics
● 15k Users ⇒ ingress of 5 Gbps Audio Data ⇒ 20 TB in a 12 hour window
● Open Subscriptions keep the topic data from being deleted
○ Code: pulsarAdmin.namespaces().setSubscriptionExpirationTime(...);
○ Broker Deduplication has its own subscription
■ brokerDeduplicationEntriesInterval: "50" (default: 1000)
■ brokerDeduplicationProducerInactivityTimeoutMinutes: "15" (default: 360)
● Bookie Compaction Thresholds (Delete more and do it more frequently)
○ majorCompactionInterval / majorCompactionThreshold
○ minorCompactionInterval / minorCompactionThreshold
○ compactionRate
● Tiered Storage
○ Although we use some Tiered storage there will be too many topics in ZK over time
○ Created our own Stream Offload that stores S3 location in RDS DB
Disk Space Challenges
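A minimal sketch of the disk-volume arithmetic behind the 12-hour figure. It assumes the earlier "32 Kb/s" per-topic figure means 32 kilobytes per second of audio per user; that reading is an assumption, as are the constant and function names. The result is raw ingress, before any replication across Bookies.

```python
USERS = 15_000
BYTES_PER_SEC_PER_USER = 32 * 1000  # assumption: "32 Kb/s" read as 32 kilobytes/s
WINDOW_SECONDS = 12 * 3600          # the 12-hour window from the slide


def ingress_bytes_per_sec(users: int, per_user: int) -> int:
    """Aggregate audio ingress in bytes per second."""
    return users * per_user


def window_terabytes(bytes_per_sec: int, seconds: int) -> float:
    """Data accumulated over the window, in decimal terabytes."""
    return bytes_per_sec * seconds / 1e12


rate = ingress_bytes_per_sec(USERS, BYTES_PER_SEC_PER_USER)
gbps = rate * 8 / 1e9
terabytes = window_terabytes(rate, WINDOW_SECONDS)
print(f"{gbps:.2f} Gbps ingress, {terabytes:.1f} TB in 12 hours")
```

With these assumptions the ingress comes to a little under 4 Gbps and about 20.7 TB per window, in line with the rounded figures on the slide.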
● Which Helm chart?
○ Apache Pulsar (“Official”) vs Streamnative (Also “Official”) vs Kafkaesque
● GC Settings
○ Java Ergonomics: -XX:+PrintFlagsFinal
○ GC Settings tied to Pod Memory: -Xms2g -Xmx2g -XX:MaxDirectMemorySize=6g
○ resources.requests.memory = Heap + Direct Memory + Some Buffer
○ Looking forward to seeing modern JVM settings e.g. -XX:MaxRAMPercentage=75.0
● Most helm charts set requests but not limits. We set requests == limits
○ JVM Memory is not elastic
○ CPU is elastic; however, we experienced a lot of CPU throttling in K8S
● Istio Service Mesh
○ Integration with Istio for mTLS and service-level authorization took a chunk of time
Kubernetes Lessons
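A small sketch of the pod sizing rule "Heap + Direct Memory + Some Buffer", using the -Xmx2g and -XX:MaxDirectMemorySize=6g values from the slide. The 1 GiB buffer and the helper names are assumptions for illustration; the slide does not say how large the buffer should be.

```python
def parse_size_mib(value: str) -> int:
    """Convert a JVM-style size such as '2g' or '512m' to MiB."""
    units = {"m": 1, "g": 1024}
    return int(value[:-1]) * units[value[-1].lower()]


def pod_memory_request_mib(heap: str, direct: str, buffer_mib: int) -> int:
    """Heap + direct memory + a safety buffer, in MiB."""
    return parse_size_mib(heap) + parse_size_mib(direct) + buffer_mib


# buffer_mib=1024 is a hypothetical value, not taken from the talk
request = pod_memory_request_mib("2g", "6g", buffer_mib=1024)
print(f"requests.memory = limits.memory = {request}Mi")
```

Setting the request and the limit to the same value matches the "requests == limits" approach described above.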
● Config: exposeTopicLevelMetricsInPrometheus: "false"
Passive Monitoring with Prometheus / Grafana
Active Monitoring with Prometheus Alerts
Integration with Prometheus Alerting to Slack / PagerDuty
● Namespace Bundles
○ For 15 Brokers: defaultNumberOfNamespaceBundles: "128" (Default: 4)
● Pulsar Load Balancer
○ # Disable Bundle split due to https://github.com/apache/pulsar/issues/5510
○ loadBalancerAutoBundleSplitEnabled: "false"
● Balancing throughput, durability and reliability across Bookies
○ managedLedgerDefaultEnsembleSize: "N"
○ managedLedgerDefaultWriteQuorum: "2"
○ managedLedgerDefaultAckQuorum: "1"
○ Striping is great for write-throughput but adds cost for read throughput
Real-Time / Scaling Journey Lessons
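A simplified model of why striping helps writes but costs reads. It sketches round-robin entry placement, where entry e goes to the write-quorum bookies starting at index e mod ensemble size. The ensemble size of 3 is an assumed stand-in for "N", and the function name is illustrative.

```python
def write_set(entry_id: int, ensemble_size: int, write_quorum: int) -> list:
    """Indexes of the bookies that store a given entry (round-robin placement)."""
    return [(entry_id + i) % ensemble_size for i in range(write_quorum)]


ENSEMBLE_SIZE = 3  # assumption: the slide only says "N"
WRITE_QUORUM = 2   # managedLedgerDefaultWriteQuorum from the slide

for entry in range(4):
    print(entry, write_set(entry, ENSEMBLE_SIZE, WRITE_QUORUM))

# Entries held by bookie 0 out of the first six
on_bookie_0 = [e for e in range(6) if 0 in write_set(e, ENSEMBLE_SIZE, WRITE_QUORUM)]
```

When the ensemble is larger than the write quorum, each bookie holds only a fraction of the entries, so writes spread out, but a sequential read of one topic has to touch every bookie in the ensemble.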
● Error
○ PerChannelBookieClient - Add for failed on bookie bookkeeper-2:3181 code EIO
Bookie EIO Error
Root Cause: At peak load the write cache was not big enough to hold the accumulated data while waiting on the second cache flush
● Key Prometheus Metrics
○ Bookie
■ bookie_throttled_write_requests
■ bookie_rejected_write_requests
○ Broker
■ pulsar_ml_cache_hits_rate
■ pulsar_ml_cache_misses_rate
Bookie EIO Error
[Graph panels: BAD / GOOD]
Key Lesson
The more we read from the Broker cache, the less we use the Bookie ledger disk (enabling faster flush of write cache → ledger)
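One way to track this lesson is to combine the two Broker metrics into a single hit ratio. The sketch below assumes both rates have already been read from Prometheus as plain numbers; the function name and the sample values are hypothetical.

```python
def cache_hit_ratio(hits_rate: float, misses_rate: float) -> float:
    """Share of reads served from the Broker cache (0.0 when there is no read traffic)."""
    total = hits_rate + misses_rate
    return hits_rate / total if total > 0 else 0.0


# Hypothetical samples of pulsar_ml_cache_hits_rate / pulsar_ml_cache_misses_rate
ratio = cache_hit_ratio(hits_rate=900.0, misses_rate=100.0)
print(f"broker cache hit ratio: {ratio:.0%}")
```

A ratio that drops under load would suggest more reads are falling through to the Bookie ledger disk, competing with write-cache flushes.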
● EBS drives for Journal & Ledger
○ GP3 with max settings 16000 IOPS, 1000 MB/s
● Broker Cache
○ managedLedgerCacheEvictionTimeThresholdMillis: "5000" (Default: 1000)
○ managedLedgerCacheSizeMB: "512" (Default: 20% of total direct Memory)
● Bookie
○ dbStorage_writeCacheMaxSizeMb: "3072" (Default: 25% of total direct memory)
○ dbStorage_rocksDB_blockCacheSize: "1073741824" (Default: 10% of total direct memory)
○ journalMaxGroupWaitMSec: "10" (Default: 1ms)
● Scaling approach
○ Scale-out Bookies
○ Scale-up and Scale-out Brokers
Key Scaling Settings . . .
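To compare the overridden cache sizes with their percentage-based defaults, the sketch below expresses each setting as a share of direct memory. It assumes a 6 GiB direct-memory budget (borrowed from the -XX:MaxDirectMemorySize=6g example earlier); the talk does not state the actual value for each component, so the shares are illustrative only.

```python
MIB = 1024 * 1024
DIRECT_MEMORY_MIB = 6 * 1024  # assumption: 6 GiB of direct memory per process


def share_of_direct_memory(size_mib: float, direct_memory_mib: float) -> float:
    """Fraction of the direct-memory budget taken by one cache."""
    return size_mib / direct_memory_mib


# dbStorage_rocksDB_blockCacheSize is given in bytes, the other two in MB
block_cache_mib = 1073741824 / MIB
write_cache_share = share_of_direct_memory(3072, DIRECT_MEMORY_MIB)
block_cache_share = share_of_direct_memory(block_cache_mib, DIRECT_MEMORY_MIB)
broker_cache_share = share_of_direct_memory(512, DIRECT_MEMORY_MIB)

print(f"write cache {write_cache_share:.0%}, block cache {block_cache_share:.0%}, "
      f"broker cache {broker_cache_share:.0%}")
```

Under that assumption the Bookie write cache doubles from the default 25% to 50% of direct memory, which fits the root cause described for the EIO error.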
We’re not at millions yet but we’re seeing a trend . . . .
1) Simulated 300 users for about 18 hours with artificially short 1 minute calls
2) 500k topics created (250k Audio / 250k Signal Analytics)
Latest Results
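For scale, the test figures can be set against the earlier production estimate of 1.5m to 2m topics per day. This is a rough comparison only; the variable and function names are illustrative, and "about 18 hours" is treated as exactly 18.

```python
TEST_TOPICS = 500_000
TEST_USERS = 300
TEST_HOURS = 18  # the slide says "about 18 hours"


def fraction_of_daily_target(test_topics: int, daily_topics: int) -> float:
    """How much of a production day's topic count the test reached."""
    return test_topics / daily_topics


topics_per_hour = TEST_TOPICS / TEST_HOURS
topics_per_user = TEST_TOPICS / TEST_USERS
vs_high_estimate = fraction_of_daily_target(TEST_TOPICS, 2_000_000)
vs_low_estimate = fraction_of_daily_target(TEST_TOPICS, 1_500_000)

print(f"{topics_per_hour:,.0f} topics/hour, {topics_per_user:,.0f} topics/user")
print(f"{vs_high_estimate:.0%} to {vs_low_estimate:.0%} of the estimated daily topic count")
```

So the simulation reached roughly a quarter to a third of one production day's estimated topic count, using short calls to generate many topics from few users.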
Observations: ZooKeeper
ZK JVM Heap demands increasing . . .
Observations: ZooKeeper
ZK Disk Usage Increasing . . .
Suppressed: java.io.IOException: No space left on device
    at org.apache.zookeeper.server.SyncRequestProcessor$1.run(SyncRequestProcessor.java:135) [org.apache.pulsar-pulsar-zookeeper-2.6.1.jar:2.6.1]
    at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:312) [org.apache.pulsar-pulsar-zookeeper-2.6.1.jar:2.6.1]
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:406) ~[org.apache.pulsar-pulsar-zookeeper-2.6.1.jar:2.6.1]
.
.
[Snapshot Thread] ERROR org.apache.zookeeper.server.ZooKeeperServer - Severe unrecoverable error, exiting
Observations: ZooKeeper
ZK 99%ile response times increasing. . .
Observations: Broker
Broker Heap Increasing . . .
Topic Metadata here as well as in ZK
Implications
1) ZooKeeper
a) More Heap
b) More CPU for GC (and to avoid throttling during GC)
c) Watch ZooKeeper disk space /pulsar/data
2) Broker
a) More Heap
b) Maybe more CPU for GC (and to avoid throttling during GC)
c) Watch for Broker → ZK latency issues
i) zooKeeperSessionTimeoutMillis: "60000" (default: 30000)
ii) zooKeeperOperationTimeoutSeconds: "60" (default: 30)
Recap: Key Metrics for our Streaming Use-Case
Thanks
Cogito
Bruce, Hamid, Andy, Jimmy, George, Gibby, Kyle, Matt, Amanda, John, Ian, Mihai, Luis, Anthony, Karl and many more
Pulsar Community
Addison, Sijie, Matteo, Joshua etc.
Thank you!
● Benchmarking Pulsar and Kafka - A More Accurate Perspective on Pulsar’s Performance
○ https://streamnative.io/en/blog/tech/2020-11-09-benchmark-pulsar-kafka-performance#maximum-throughput-test
● Taking a Deep-Dive into Apache Pulsar Architecture for Performance Tuning
○ https://streamnative.io/en/blog/tech/2021-01-14-pulsar-architecture-performance-tuning
● Understanding How Apache Pulsar Works
○ https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-works
References

Streaming millions of Contact Center interactions in (near) real-time with Pulsar

  • 1. Streaming millions of Contact Center interactions in (near) real-time with Pulsar
    A panoply of parameters
    Frank Kelly, Principal Engineer, Cogito Corp
    Slack: https://apache-pulsar.slack.com/
  • 2. Overview
    ● Cogito & What we do
    ● Architecture & Use-Cases
    ● Challenges
    ● Initial lessons learned
    ● Kubernetes lessons learned
    ● Performance & Scaling settings
    ● Results
    ● Q&A
    Intended Audience: Those who understand the main APIs and components but who may not be familiar with all the configuration settings or how to optimize the system for high write throughput and/or millions of topics.
  • 3. Cogito: Who we are and what we do
    Formed in 2007 out of MIT - based out of Boston - now with a Global Engineering Footprint
    Vision: Elevating the human connection in real time
    Product: Call center AI solution that analyzes the human voice and provides real-time guidance to enhance emotional intelligence and customer service.
  • 5. Use-Cases for Pulsar
    ● Streaming: Real-time audio and analytic results from our AI/ML models
    ● We break each customer call into separate logical units called “intervals”
    ● Each interval is backed by two Pulsar topics
      ○ Real-time Audio Topic
      ○ Real-time Analytics Topic
    ● Splicing up binary formats into discrete messages → Deduplication is VERY important!
    ● With 15,000 concurrent users - we estimate 1.5m to 2m topics per day
    ● Each topic has moderate throughput ~ 32 Kb/s
    ● Also Messaging: Work-Queue events
  • 6. The Challenges
    ● Streaming Use-Case
      ○ Lots of throughput ~ 10 Gbps
      ○ Message-ordering & deduplication are critical
      ○ Near real-time requirements (< 250ms)
        ■ Think about timeouts/retries/failover
    ● Challenges
      ○ Zookeeper stores all the topics for a namespace under one ZNode
      ○ Brokers require more memory
    ● Alternatives considered
      ○ Using key_shared would require us to disable batching in the producer (not a huge deal)
      ○ Risk: Message dispatch will stop if there is a subscription / consumer that has built up a backlog of messages in their hash-range
      ○ Filtering on the client-side
  • 7. Initial Lessons on the basics
    ● Processing real-time binary streams
      ○ Consumer: SubscriptionInitialPosition.Earliest
      ○ Broker Configuration: brokerDeduplicationEnabled: "true"
    ● Client Performance
      ○ Producer: sendAsync() ~10x improvement
      ○ Producer: blockIfQueueFull(true)
      ○ Batching: Enabled but the throughput per Producer is so low it rarely becomes helpful
    ● Default Timeouts
      ○ For our real-time system the default connection / operation timeout of 30s is too high
    ● Persistent vs. Non-Persistent
      ○ We support both use-cases (some customers wish for zero persistence)
  • 8. Disk Space Challenges
    ● 15k Users ⇒ ingress of 5 Gbps Audio Data ⇒ 20 TB in a 12 hour window
    ● Open Subscriptions keep the topic data from being deleted
      ○ Code: pulsarAdmin.namespaces().setSubscriptionExpirationTime()
      ○ Broker Deduplication has its own subscription
        ■ brokerDeduplicationEntriesInterval: "50" (default: 1000)
        ■ brokerDeduplicationProducerInactivityTimeoutMinutes: "15" (default: 360)
    ● Bookie Compaction Thresholds (Delete more and do it more frequently)
      ○ majorCompactionInterval / majorCompactionThreshold
      ○ minorCompactionInterval / minorCompactionThreshold
      ○ compactionRate
    ● Tiered Storage
      ○ Although we use some Tiered storage there will be too many topics in ZK over time
      ○ Created our own Stream Offload that stores S3 location in RDS DB
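The ingress-to-disk arithmetic on this slide can be sanity-checked with a few lines. This is a minimal sketch, assuming decimal units (1 Gb = 10^9 bits, 1 TB = 10^12 bytes) and a constant ingress rate, neither of which the slides state; the function name and the `write_quorum` parameter are illustrative.

```python
def ingress_terabytes(rate_gbps: float, hours: float, write_quorum: int = 1) -> float:
    """Data volume in TB for a constant ingress rate over a time window.

    write_quorum multiplies the result to approximate the space used across
    all bookies when every entry is stored write_quorum times.
    """
    bytes_per_second = rate_gbps * 1e9 / 8      # gigabits/s -> bytes/s
    seconds = hours * 3600
    return bytes_per_second * seconds * write_quorum / 1e12  # bytes -> TB


if __name__ == "__main__":
    # Sustained peak rate for the whole window, i.e. an upper bound.
    print(f"{ingress_terabytes(5, 12):.1f} TB ingested")
    print(f"{ingress_terabytes(5, 12, write_quorum=2):.1f} TB stored with 2 copies")
```

A sustained 5 Gbps over 12 hours comes to about 27 TB, so the quoted 20 TB is consistent with an average rate somewhat below peak. That reading is an inference, not something the slides state.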
  • 9. Kubernetes Lessons
    ● Which Helm chart?
      ○ Apache Pulsar (“Official”) vs Streamnative (Also “Official”) vs Kafkaesque
    ● GC Settings
      ○ Java Ergonomics: -XX:+PrintFlagsFinal
      ○ GC Settings tied to Pod Memory: -Xms2g -Xmx2g -XX:MaxDirectMemorySize=6g
      ○ resources.requests.memory = Heap + Direct Memory + Some Buffer
      ○ Looking forward to seeing modern JVM settings e.g. -XX:MaxRAMPercentage=75.0
    ● Most helm charts set requests but not limits. We set requests == limits
      ○ JVM Memory is not elastic
      ○ CPU is elastic; however, we experienced a lot of throttling from the K8S Scheduler
    ● Istio Service Mesh
      ○ Integration with Istio for mTLS and service-level authorization took a chunk of time
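The sizing rule "requests.memory = Heap + Direct Memory + Some Buffer" can be written out as a small helper. This is a sketch only: the helper names are made up for illustration, and the 1024 MiB buffer in the usage example is an assumed value, since the slides do not say how much headroom was used.

```python
import re

# Multipliers to convert JVM size suffixes into MiB.
_UNITS_MIB = {"k": 1 / 1024, "m": 1, "g": 1024}


def parse_jvm_size_mib(value: str) -> float:
    """Parse a JVM size such as '2g' or '512m' into MiB."""
    match = re.fullmatch(r"(\d+)([kmg])", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported JVM size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS_MIB[unit]


def pod_memory_request_mib(xmx: str, max_direct: str, buffer_mib: int) -> int:
    """Heap + direct memory + headroom, following the rule on the slide."""
    return int(parse_jvm_size_mib(xmx) + parse_jvm_size_mib(max_direct) + buffer_mib)


if __name__ == "__main__":
    # -Xmx2g and -XX:MaxDirectMemorySize=6g come from the slide;
    # the 1024 MiB buffer is a hypothetical choice.
    print(pod_memory_request_mib("2g", "6g", buffer_mib=1024), "Mi")
```

With requests == limits, the same number would go into both `resources.requests.memory` and `resources.limits.memory`.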
  • 10. Passive Monitoring with Prometheus / Grafana
    ● Config: exposeTopicLevelMetricsInPrometheus: "false"
  • 11. Active Monitoring with Prometheus Alerts
    Integration with Prometheus Alerting to Slack / PagerDuty
  • 12. Real-Time / Scaling Journey Lessons
    ● Namespace Bundles
      ○ For 15 Brokers: defaultNumberOfNamespaceBundles: "128" (Default: 4)
    ● Pulsar Load Balancer
      ○ # Disable Bundle split due to https://github.com/apache/pulsar/issues/5510
      ○ loadBalancerAutoBundleSplitEnabled: "false"
    ● Balancing throughput, durability and reliability across Bookies
      ○ managedLedgerDefaultEnsembleSize: "N"
      ○ managedLedgerDefaultWriteQuorum: "2"
      ○ managedLedgerDefaultAckQuorum: "1"
      ○ Striping is great for write-throughput but adds cost for read throughput
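To make the striping trade-off concrete, here is a simplified model of how entries are spread over an ensemble when the ensemble size is larger than the write quorum. It assumes BookKeeper's round-robin placement, where each entry goes to `write_quorum` consecutive bookies starting at `entry_id mod ensemble_size`; the function name is illustrative and this is not the BookKeeper API.

```python
from typing import List


def write_set(entry_id: int, ensemble_size: int, write_quorum: int) -> List[int]:
    """Indices (within the ensemble) of the bookies that store one entry."""
    if not 1 <= write_quorum <= ensemble_size:
        raise ValueError("write quorum must be between 1 and the ensemble size")
    start = entry_id % ensemble_size
    return [(start + i) % ensemble_size for i in range(write_quorum)]


if __name__ == "__main__":
    # Ensemble of 3 with write quorum 2: consecutive entries rotate
    # across the bookies, so each bookie stores 2/3 of the entries.
    for entry in range(4):
        print(entry, write_set(entry, ensemble_size=3, write_quorum=2))
```

In this model, writes are spread over every bookie in the ensemble, which helps write throughput. Reading a topic sequentially, however, has to touch all of them, and no single bookie holds the full run of entries. That is one plausible reading of the read-cost remark on the slide.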
  • 13. Bookie EIO Error
    ● Error
      ○ PerChannelBookieClient - Add for failed on bookie bookkeeper-2:3181 code EIO
    ● Root Cause: At peak load Write Cache not big enough to hold accumulated data while waiting on second cache flush
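The root cause can be framed as a race between incoming writes and the flush to the ledger disk. The sketch below assumes the bookie's write cache is split into two equal halves, one accepting writes while the other is being flushed. This matches my understanding of DbLedgerStorage but is not stated in the slides. The per-bookie write rate in the usage example is hypothetical.

```python
def seconds_until_writes_stall(write_cache_mb: float, write_rate_mb_per_s: float) -> float:
    """Time for incoming writes to fill the active half of the write cache.

    Assumes two equal halves: one receives writes while the other is
    flushed. If a flush takes longer than the returned time, the active
    half fills up and the bookie can no longer accept new entries.
    """
    if write_rate_mb_per_s <= 0:
        raise ValueError("write rate must be positive")
    return (write_cache_mb / 2) / write_rate_mb_per_s


if __name__ == "__main__":
    # 3072 MB is the dbStorage_writeCacheMaxSizeMb value from the deck;
    # 100 MB/s per bookie is a made-up rate for illustration.
    print(f"{seconds_until_writes_stall(3072, 100):.2f} s flush budget")
```

Under this model, a larger write cache or a faster ledger disk both widen the window in which a flush must complete.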
  • 14. Bookie EIO Error
    ● Key Prometheus Metrics
      ○ Bookie
        ■ bookie_throttled_write_requests
        ■ bookie_rejected_write_request
      ○ Broker
        ■ pulsar_ml_cache_hits_rate
        ■ pulsar_ml_cache_misses_rate
    (Charts labelled BAD and GOOD)
    Key Lesson: The more we read from the Broker cache, the less we use the Bookie ledger disk (enabling faster flush of write cache → ledger)
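One way to track the key lesson is to turn the two broker cache metrics into a single hit ratio. This is a minimal sketch; the function name is illustrative and the inputs are assumed to be the sampled values of `pulsar_ml_cache_hits_rate` and `pulsar_ml_cache_misses_rate`.

```python
def cache_hit_ratio(hits_rate: float, misses_rate: float) -> float:
    """Fraction of managed-ledger reads served from the broker cache."""
    total = hits_rate + misses_rate
    # With no reads at all there is nothing to measure; report 0.0 by convention.
    return hits_rate / total if total > 0 else 0.0


if __name__ == "__main__":
    print(cache_hit_ratio(hits_rate=900.0, misses_rate=100.0))
```

A ratio close to 1 means consumers are mostly served from broker memory, which leaves the ledger disk free for flushing the bookie write cache.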
  • 15. Key Scaling Settings . . .
    ● EBS drives for Journal & Ledger
      ○ GP3 with max settings 16000 IOPS, 1000 MB/s
    ● Broker Cache
      ○ managedLedgerCacheEvictionTimeThresholdMillis: "5000" (Default: 1000)
      ○ managedLedgerCacheSizeMB: "512" (Default: 20% of total direct Memory)
    ● Bookie
      ○ dbStorage_writeCacheMaxSizeMb: "3072" (Default: 25% of total direct memory)
      ○ dbStorage_rocksDB_blockCacheSize: "1073741824" (Default: 10% of total direct memory)
      ○ journalMaxGroupWaitMSec: "10" (Default: 1ms)
    ● Scaling approach
      ○ Scale-out Bookies
      ○ Scale-up and Scale-out Brokers
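Several of the defaults listed here are percentages of direct memory, so the effective default depends on `-XX:MaxDirectMemorySize`. The sketch below computes them for a given direct-memory size using the percentages quoted on the slide. The function name and the `_mb` suffix on the RocksDB key are illustrative, and using 6 GiB for both broker and bookie is an assumption, since the deck only shows one set of JVM flags.

```python
def default_cache_sizes_mb(direct_memory_mb: int) -> dict:
    """Defaults derived from direct memory, using the percentages on the slide."""
    return {
        "managedLedgerCacheSizeMB": direct_memory_mb * 20 // 100,             # broker
        "dbStorage_writeCacheMaxSizeMb": direct_memory_mb * 25 // 100,        # bookie
        "dbStorage_rocksDB_blockCacheSize_mb": direct_memory_mb * 10 // 100,  # bookie
    }


if __name__ == "__main__":
    # 6144 MB corresponds to -XX:MaxDirectMemorySize=6g (assumed for both roles).
    for name, size in default_cache_sizes_mb(6144).items():
        print(f"{name}: {size} MB")
```

Comparing these values with the explicit settings shows how far each override moves from what the JVM flags would otherwise imply.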
  • 16. Latest Results
    We’re not at millions yet but we’re seeing a trend . . . .
    1) Simulated 300 users for about 18 hours with artificially short 1 minute calls
    2) 500k topics created (250k Audio / 250k Signal Analytics)
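The topic count can be cross-checked against the test parameters. This sketch assumes one interval per simulated call and two topics per interval (audio and analytics), with calls running back to back. The first assumption is mine; the deck only says that each interval is backed by two topics.

```python
def max_topics(users: int, hours: float, call_minutes: float, topics_per_call: int = 2) -> int:
    """Upper bound on topics created if every user makes back-to-back calls."""
    calls_per_user = int(hours * 60 / call_minutes)
    return users * calls_per_user * topics_per_call


if __name__ == "__main__":
    upper_bound = max_topics(users=300, hours=18, call_minutes=1)
    print(upper_bound)                       # theoretical ceiling
    print(round(500_000 / upper_bound, 2))   # share of the ceiling actually reached
```

The ceiling under these assumptions is 648,000 topics, so the reported 500k is roughly three quarters of it, which is plausible given "about 18 hours" and any gaps between simulated calls.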
  • 17. Observations: ZooKeeper
    ZK JVM Heap demands increasing . . .
  • 18. Observations: ZooKeeper
    ZK Disk Usage Increasing . . .
    Suppressed: java.io.IOException: No space left on device
      at org.apache.zookeeper.server.SyncRequestProcessor$1.run(SyncRequestProcessor.java:135) [org.apache.pulsar-pulsar-zookeeper-2.6.1.jar:2.6.1]
      at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:312) [org.apache.pulsar-pulsar-zookeeper-2.6.1.jar:2.6.1]
      at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:406) ~[org.apache.pulsar-pulsar-zookeeper-2.6.1.jar:2.6.1]
    . .
    [Snapshot Thread] ERROR org.apache.zookeeper.server.ZooKeeperServer - Severe unrecoverable error, exiting
  • 19. Observations: ZooKeeper
    ZK 99%ile response times increasing . . .
  • 20. Observations: Broker
    Broker Heap Increasing . . .
    Topic Metadata here as well as in ZK
  • 21. Implications
    1) ZooKeeper
       a) More Heap
       b) More CPU for GC (and to avoid throttling during GC)
       c) Watch ZooKeeper disk space /pulsar/data
    2) Broker
       a) More Heap
       b) Maybe more CPU for GC (and to avoid throttling during GC)
       c) Watch for Broker → ZK latency issues
          i) zooKeeperSessionTimeoutMillis: "60000" (default: 30000)
          ii) zooKeeperOperationTimeoutSeconds: "60" (default: 30)
  • 22. Recap: Key Metrics for our Streaming Use-Case
  • 23. Thanks
    Cogito: Bruce, Hamid, Andy, Jimmy, George, Gibby, Kyle, Matt, Amanda, John, Ian, Mihai, Luis, Anthony, Karl and many more
    Pulsar Community: Addison, Sijie, Matteo, Joshua etc.
  • 25. References
    ● Benchmarking Pulsar and Kafka - A More Accurate Perspective on Pulsar’s Performance
      ○ https://streamnative.io/en/blog/tech/2020-11-09-benchmark-pulsar-kafka-performance#maximum-throughput-test
    ● Taking a Deep-Dive into Apache Pulsar Architecture for Performance Tuning
      ○ https://streamnative.io/en/blog/tech/2021-01-14-pulsar-architecture-performance-tuning
    ● Understanding How Apache Pulsar Works
      ○ https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-works