Presenters: Madhavan Narayanan, Sajith Sebastian, Amit Kaushal, Gokul Sarangapani
Pulsar Journey @ Intuit
Building our next-gen messaging platform
Topic: persistent://pulsar/intuit/our-migration-story
Agenda
➢ Messaging at Intuit: background of the Intuit messaging platform and the current technology used
➢ Need for Migration: limitations of the current platform and our migration goals
➢ Messaging with Pulsar: feasibility study and the target architecture for the next-gen platform
➢ Challenges and Solutions: problems faced and the solutions
➢ Journey Ahead: the future roadmap items
Messaging at Intuit
Current State
Intuit Messaging Platform - Current State
[Diagram: Products drive Use Cases (Tax Filing Workflow, Order Management, Payments Processing, Billing, …) implemented by Services (Dispatchers, Schedulers, Processors, Observers, …), all built on the Intuit Messaging Platform, which runs on an ActiveMQ Network-of-Brokers]
The platform provides:
➢ Point-to-Point Queues
➢ Multi-Subscription Topics
➢ Persistent Storage
➢ Multi-Region Support
➢ Active-Active use cases
➢ High resilience
Messaging with ActiveMQ Network-of-Brokers
[Diagram: Producers and Consumers use the JMS API and connect through Route53 to an NLB in each region (WEST and EAST); Broker1-Broker6 form a network in which each broker has a connection to every other broker]
➢ ActiveMQ brokers distributed across 2 regions
➢ An NLB in each region routes connections to brokers
➢ Route53 provides latency-based routing to the closest NLB
➢ All brokers know each other and form a network; not easily scalable
➢ Brokers store messages in local files
➢ Producers and Consumers use JMS APIs and connect to the Route53 endpoint (see the connection sketch below)
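As a concrete illustration of the client side described above, a minimal JMS producer might look like the following sketch. It uses the standard ActiveMQ failover transport; the endpoint and queue name here are hypothetical stand-ins for the real Route53 endpoint and destinations:

    import javax.jms.*;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class NobProducer {
        public static void main(String[] args) throws JMSException {
            // The Route53 latency-based endpoint resolves to the closest NLB;
            // the failover transport transparently reconnects on broker failure.
            ConnectionFactory factory = new ActiveMQConnectionFactory(
                    "failover:(tcp://messaging.example.intuit.com:61616)");
            Connection connection = factory.createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(
                    session.createQueue("orders.queue"));
            producer.send(session.createTextMessage("hello"));
            connection.close();
        }
    }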
Active-Active support with ActiveMQ
[Diagram: West Producer, East Producer, and East Consumer each connect to brokers in their local region; Broker1-Broker6 span WEST and EAST within the Network-of-Brokers]
➢ Producers and Consumers connect to broker(s) in the local region
➢ Producers always see low latency
➢ Messages for a given topic can be stored in multiple brokers
➢ Messages are internally forwarded between brokers and find their way to consumers; there is no message replication (a sample network connector configuration follows this list)
➢ Inefficient and wasteful of bandwidth due to the high volume of inter-broker traffic
➢ Highly resilient to individual broker failures
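The inter-broker forwarding described above is configured through ActiveMQ network connectors in each broker's activemq.xml. A minimal sketch with hypothetical host names; a real NoB deployment would declare one connector per peer, or use a discovery URI:

    <broker xmlns="http://activemq.apache.org/schema/core" brokerName="broker1">
      <networkConnectors>
        <!-- duplex="true" lets messages flow in both directions over one connector -->
        <networkConnector name="to-broker2"
                          uri="static:(tcp://broker2.example.intuit.com:61616)"
                          duplex="true"/>
      </networkConnectors>
    </broker>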
Handling Region failure with ActiveMQ
[Diagram: with the WEST brokers down, the West Producer reconnects to a broker in EAST, alongside the East Producer and East Consumer]
➢ Producers transparently reconnect to a broker in the remote region
➢ Producers now see a high publish latency
➢ However, producers can continue their operation without any adverse impact
➢ Messages stored in the affected brokers are not available to consumers until those brokers come back online
➢ The network automatically recovers once the brokers are available
Need for Migration
Why we were looking to migrate
Technology
➢ ActiveMQ is an outdated technology with architectural limitations
➢ We want to keep abreast of modern, cloud-native technology
Scalability
➢ Scaling an ActiveMQ NoB is non-trivial and complex
➢ Overheads increase significantly as more brokers are added to the network
Throughput
➢ The maximum throughput of an ActiveMQ NoB is limited, with little room to grow
➢ We need to be ready for future needs at Intuit; significant growth in traffic is projected
Cost
➢ Poor price-performance of the NoB; significant bandwidth is lost to inter-broker traffic
➢ We need a solution that maximizes throughput with the available resources
Operations
➢ No central management in a NoB, making maintenance operations costly
➢ No cluster-level statistics and monitoring
Migration Focus
While evaluating multiple options against ActiveMQ capabilities, our focus was to:
Retain
● Multi-Region support
● Active-Active support
● Resiliency to system failures
Improve
● Ease of scalability
● Ease of operations
● Throughput and performance
Avoid
● A single layer handling both storage and customer traffic
● Inefficient inter-broker traffic within the platform
● Duplicate message storage for each subscriber
Messaging with Pulsar
Feasibility Study
What we did
➢ Set up a Pulsar cluster that was equivalent in cost to an ActiveMQ NoB
➢ Extended the Pulsar broker to encrypt/decrypt messages for parity with the existing system
➢ Verified all basic messaging functions for the queueing use case (produce/consume operations for persistent topics, single and multiple subscriptions)
➢ Verified scalability of the broker and proxy tiers
➢ Verified dynamic addition of bookies, rack placement strategies, and namespace isolation
➢ Ran extensive performance tests (a sample command is sketched below)
Results
➢ For nearly the same cost, a Pulsar cluster was able to support 3.5x the throughput of an equivalent ActiveMQ NoB
➢ Publish latencies stayed highly consistent and contained even at high throughput. Unlike with ActiveMQ brokers, producers were relatively unaffected by the presence of consumer connections
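Pulsar ships a load generator, pulsar-perf, that can drive this kind of comparison. A hedged sketch of what such a run might look like (the exact flags and rates used in our tests are not recorded here, and the service URL and topic are hypothetical):

    # publish 1 KB messages at 10,000 msg/s through the proxy
    bin/pulsar-perf produce persistent://pulsar/intuit/perf-test \
        --rate 10000 --size 1024 \
        -u pulsar://proxy.example.intuit.com:6650

    # drive a matching subscriber to observe end-to-end latency
    bin/pulsar-perf consume persistent://pulsar/intuit/perf-test \
        -u pulsar://proxy.example.intuit.com:6650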
Next Gen Messaging Platform with Pulsar
➢ Global ZooKeeper spanning multiple regions
➢ Proxies, brokers, and bookies connect to a local ZooKeeper
➢ Scalable and extensible proxy tier for managing traffic
➢ Scalable broker tier for serving messages
➢ A separate, scalable storage tier with rack support
➢ JMS wrapper over the Pulsar client library (a sketch follows this list)
[Diagram: JMS Producers and JMS Consumers use the JMS API over the Pulsar client; Pulsar SDK Producers and Consumers use the Pulsar client directly]
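Intuit's actual JMS wrapper is internal. A minimal sketch of the idea, exposing a JMS-style send() while delegating to the Pulsar client underneath (the class name and any host names are hypothetical):

    import org.apache.pulsar.client.api.*;

    public class JmsStyleProducer implements AutoCloseable {
        private final PulsarClient client;
        private final Producer<byte[]> producer;

        public JmsStyleProducer(String serviceUrl, String topic) throws PulsarClientException {
            this.client = PulsarClient.builder().serviceUrl(serviceUrl).build();
            this.producer = client.newProducer().topic(topic).create();
        }

        // JMS TextMessage equivalent: send a UTF-8 payload
        public void send(String text) throws PulsarClientException {
            producer.send(text.getBytes(java.nio.charset.StandardCharsets.UTF_8));
        }

        @Override
        public void close() throws PulsarClientException {
            producer.close();
            client.close();
        }
    }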
Challenges & Solutions
Zookeeper Issues
Challenge #1 - Zookeeper Quorum Issue
Issue
➢ Intuit operates primarily in 2 AWS regions in the US, namely us-west-2 and us-east-2
➢ The messaging platform also spans only these 2 regions
➢ When a region fails, the entire Pulsar cluster collapses due to ZooKeeper failure
➢ ZooKeeper loses its majority quorum when one region is down and takes the cluster down with it
➢ Our clients suddenly start failing since the cluster is unavailable. This is a regression
Solution
➢ We added one more region, us-east-1, to the cluster (see the sketch below)
➢ Only one ZooKeeper instance runs in us-east-1; no other components are deployed there
➢ us-east-1 is rarely used by Intuit services and doesn't have the same support/SLA from AWS as the other 2 regions
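Assuming, say, a five-participant ensemble split 2/2/1 across the three regions, a majority of 3 survives the loss of either primary region, which is exactly what the extra us-east-1 instance buys. A hedged sketch of the broker-side wiring (host names are hypothetical; zookeeperServers and configurationStoreServers are the standard Pulsar 2.x setting names):

    # broker.conf (west-region brokers)
    # Brokers talk to the in-region participants of the global ensemble
    zookeeperServers=zk1.west.example.com:2181,zk2.west.example.com:2181
    # Configuration store: all five participants across the three regions
    configurationStoreServers=zk1.west.example.com:2181,zk2.west.example.com:2181,zk1.east2.example.com:2181,zk2.east2.example.com:2181,zk1.east1.example.com:2181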
Challenge #2 - Zookeeper issue again
Issue
➢ With ZooKeeper in 3 regions, we started seeing frequent issues even during normal operation, i.e., when all 3 regions were active
➢ ZooKeeper would frequently seize up and stall, making the cluster unavailable
➢ The ZooKeeper instance in us-east-1 was becoming the leader most of the time, but was unable to moderate and keep the quorum working
➢ This was due to the high network latency of us-east-1. Overall cluster performance also dropped significantly whenever an east ZooKeeper became the leader (most of the traffic is in the west)
➢ We could not find any supported way to precisely control which instance becomes the leader in a ZK cluster
Solution
➢ After a lot of troubleshooting and experiments, we found that the ZooKeeper instance with a larger server id had a higher probability of becoming the leader
➢ We then only had to control the ordering of ZooKeeper server ids in the configuration, keeping the us-east-1 instance at the smallest value (see the sketch below)
➢ We never saw the issue again after this fix
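This exploits a detail of ZooKeeper's leader election: when votes are otherwise tied, the candidate with the larger server id wins. A hedged sketch of the resulting ensemble configuration (host names are hypothetical; the server ids are the point):

    # zoo.cfg - identical on every participant
    # us-east-1 gets the smallest id so it loses election tie-breaks
    server.1=zk1.east1.example.com:2888:3888
    server.2=zk1.east2.example.com:2888:3888
    server.3=zk2.east2.example.com:2888:3888
    server.4=zk1.west.example.com:2888:3888
    server.5=zk2.west.example.com:2888:3888
    # each host's dataDir/myid file contains its own server number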
Challenges & Solutions
Latency and Ledger Issues
Challenge #3 - High publish latencies
Issue
➢ Pulsar's design assigns a single owner broker to each topic; all traffic for the topic is handled by this broker
➢ Message producers from both regions end up connected to this single broker (via proxies in their local region)
➢ This creates a latency disparity between producers in the same region as the broker and those in the remote region
➢ Cross-region latencies average as high as 50 ms. This was a serious regression compared to the ActiveMQ Network-of-Brokers
Solution
➢ Since our customers use region-agnostic topic names and expect active-active support from us, we had to implement region-level isolation of topics underneath
➢ Implemented a service discovery extension, configured in the proxy, to handle custom topic name lookups. Also used namespace isolation policies to pin topics to specific brokers (an example admin command follows the diagram below)
➢ Implemented a wrapper over the Pulsar client library that uses the extended lookup to transparently map the region-agnostic topic name to a region-specific sub-topic (see the sketch below)
➢ Consumers read messages from all the sub-topics
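Intuit's lookup extension and client wrapper are internal, but the core idea can be sketched on the client side. The naming convention and the hard-coded region list below are hypothetical illustrations:

    import org.apache.pulsar.client.api.*;

    public class RegionAwareTopics {
        // Hypothetical convention: suffix the logical topic name with the region
        static String toRegionalTopic(String logicalTopic, String region) {
            return logicalTopic + "-" + region;  // e.g. "orders" -> "orders-west"
        }

        public static Producer<byte[]> createProducer(PulsarClient client,
                                                      String logicalTopic,
                                                      String localRegion)
                throws PulsarClientException {
            // Producers publish only to the local region's sub-topic, which
            // isolation policies pin to brokers in that region
            return client.newProducer()
                    .topic(toRegionalTopic(logicalTopic, localRegion))
                    .create();
        }

        public static Consumer<byte[]> createConsumer(PulsarClient client,
                                                      String logicalTopic,
                                                      String subscription)
                throws PulsarClientException {
            // Consumers subscribe to the sub-topics of all regions
            return client.newConsumer()
                    .topics(java.util.Arrays.asList(
                            toRegionalTopic(logicalTopic, "west"),
                            toRegionalTopic(logicalTopic, "east")))
                    .subscriptionName(subscription)
                    .subscribe();
        }
    }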
Challenge #3 - High publish latencies - Solution
[Diagram: In each region, JMS and Pulsar SDK producers and consumers connect to a Pulsar Proxy running the custom Service Discovery extension. West proxies route to the West Namespace Brokers and east proxies to the East Namespace Brokers, while Zookeepers span both regions. Storage is split into a Bookie Group - West Local and a Bookie Group - East Local, each with three racks (Rack1-Rack3) of two bookies mapped to availability zones (west-2a/b/c and east-2a/b/c)]
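The broker pinning shown in the diagram uses Pulsar's namespace isolation policies. A hedged example of setting one via pulsar-admin (the cluster name, namespace regex, and broker regex are hypothetical):

    # Pin west namespaces to brokers whose host names match west-broker.*
    bin/pulsar-admin ns-isolation-policy set intuit-cluster west-pinning \
        --namespaces "pulsar/intuit-west.*" \
        --primary "west-broker.*" \
        --auto-failover-policy-type min_available \
        --auto-failover-policy-params min_limit=1,usage_threshold=80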
Challenge #4 - Ledger recovery failure
Issue
➢ Messages for a topic are stored in a sequence of ledgers in BookKeeper; the owner broker for the topic manages the state of the ledgers
➢ Ledgers are replicated to multiple bookies based on the write quorum value (see the configuration sketch below)
➢ When a bookie crashes, its open ledgers are closed by brokers, which then create new ledgers on other available bookies. When a broker crashes, other brokers assume ownership of the abandoned topics and are able to re-open the ledgers
➢ However, multiple simultaneous failures involving a combination of broker and bookie crashes can leave ledgers unrecoverable; topic producers stall and new messages cannot be published. This results in business impact
Solution
➢ Recovery requires a quick restart of the bookies. Because of the sync operation involved, a delayed restart can overshoot the SLA and impact customers
➢ We are working on a solution that uses our custom service discovery to detect this condition and redirect producers to the sub-topic of the remote region
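For reference, the replication factors involved here are broker-side settings. A sketch with illustrative values (these are the standard Pulsar broker.conf keys; the values Intuit actually runs are not stated in the talk):

    # broker.conf - managed ledger replication (illustrative values)
    managedLedgerDefaultEnsembleSize=3   # bookies a ledger is striped across
    managedLedgerDefaultWriteQuorum=3    # copies written per entry
    managedLedgerDefaultAckQuorum=2      # acks required before a write succeeds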
Journey Ahead
Journey Ahead
➢ We are in production now, with limited availability to a restricted set of customers
➢ As we move toward making the platform generally available to all customers, our items of focus include:
○ Enhancing and fortifying the resiliency of the system
○ Enabling transaction support
○ Auto-scaling of brokers
○ Enabling Pulsar schema support using a custom schema registry
➢ We also have long-term plans to:
○ Move the platform to Intuit's Kubernetes platform
○ Support multi-cloud messaging
Let us know your thoughts
Please send your feedback and comments to
● madhavan_narayanan@intuit.com
● gokul_s@intuit.com
● sajith_sebastian@intuit.com
● amit_kaushal@intuit.com
Thank You