MessageMedia’s next generation gateway processes more than 10 billion events per year. This talk covers the approach adopted in developing the system and shares the lessons learnt in building such a scalable and resilient system.
2. MessageMedia provides a high performance two way messaging platform for developers to build robust and highly scalable applications. Every year more than 23,000 customers send over 1.5 billion messages, enriching their applications with business messaging.
6. A message queue provides an asynchronous communications protocol, i.e. a system that puts a message onto a queue does not require an immediate response to continue processing.
7. Approach
Crawling, walking and running skeleton approach
Build tools and instrumentation
Test performance continuously and consistently
19. Lessons Learnt
Expect AWS services to fail (occasionally)
Expect developers to change to different AWS services
Ensure queueing system provides durable and persistent storage and application can survive upgrades (if self-hosted)
20. Conclusion
Skeleton approach de-risked project
Appropriate tooling and instrumentation is critical
Consistent and iterative performance testing helped build
stakeholder confidence
Introduction
Ben gave an introduction about MessageMedia: we are the #1 business messaging provider in Australia, processing over 1.5 billion messages per year for over 23,000 customers. We send messages worldwide.
SMS: born in 1992, ubiquitous, very high open rate (~98%), much higher than other messaging means such as email – we send to 12 million unique phones in Australia each month
OTP
Reminders
Emergency messages
Marketing – we don’t send messages selling certain types of pharmaceutical products or claiming that you have just won a lottery somewhere in Africa
We processed 1B events in our new system, but what was wrong with the old, legacy system?
Gateway is a collection of applications and databases, responsible for message processing.
Where did we come from? Legacy system:
Monolithic application built in a LAMP stack deployed in data centres
Unreliable performance as business grew, especially when we had spikes in usage from customers
Using the database for transient transactions; a single message sent or received can lead to ~10 DB reads and writes
Scaling relational databases is hard – fundamentally, an RDBMS is designed to run on a single server in order to maintain data integrity
We also had plenty of problems with the inherent issues of relational databases, such as contention, replication lag etc
Lack of metrics; evidence from logs suggests a peak throughput of about 8k messages/minute
The business wanted a system that won’t become obsolete in 5 years but rather something that can grow easily as the business grows.
This is going to be the focus of today’s talk.
Problem #2
What does our new architecture look like?
Microservices – it’s a complete re-architecture
Deployed on cloud and taking advantage of AWS services
Queueing systems
Benefits:
Use queues to facilitate communication among worker applications
This async approach allows easy scaling
As applications process messages from queues, they can also generate events, which get published to event buses; other applications can subscribe to these buses, receive the events and then perform their own business logic (sketched below)
Proven to work well at large tech companies.
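A minimal sketch of this worker pattern, assuming RabbitMQ via the Python pika library; the queue and exchange names are illustrative, not our actual topology:

```python
# Minimal worker sketch: consume work from a queue, run some business logic,
# then publish a resulting event to a fanout exchange that other applications
# can subscribe to. Queue and exchange names are illustrative only.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.queue_declare(queue="messages.outbound", durable=True)
channel.exchange_declare(exchange="message.events", exchange_type="fanout", durable=True)

def handle(ch, method, properties, body):
    message = json.loads(body)
    # ... perform this worker's business logic on `message` ...
    event = {"type": "message.processed", "id": message.get("id")}
    ch.basic_publish(exchange="message.events", routing_key="", body=json.dumps(event))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="messages.outbound", on_message_callback=handle)
channel.start_consuming()
```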
Our Approach – 3 Key Points
Crawling, walking and running skeleton approach to building out our platform
Tools and instrumentation critical to be able to test throughout each stage of our build
Performance and functional testing approaches during each stage
Skeleton #1
The first stage of our build was the crawling skeleton.
This involved testing RabbitMQ versus SQS to determine which one would be better and to ensure that the chosen technology could meet our requirement of 50,000 messages/minute with 2 second latency
There are _2 hops_ in the system
Tooling & Instrumentation #1
Completely different tools were required for testing a queue based system to an RDBMS – publishers that could flood the system with messages for performance testing and metrics we could use to measure how the system was performing
We built tools to publish and consume high volumes of messages
We built tools to measure throughput and latency end to end
We built tools to check for message duplication and consistency, by checking how many messages were published vs consumed
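A rough sketch of the measurement idea (not our actual tooling): every test message carries a unique id and a publish timestamp, so the consuming side can derive throughput, end-to-end latency, duplicates and losses by comparing what went in against what came out:

```python
# Sketch of the measurement approach: tag every message with an id and a
# publish timestamp, then compare published vs consumed on the other side.
import time
import uuid

published = {}        # id -> publish time
consumed_counts = {}  # id -> number of times seen (detects duplicates)
latencies = []        # end-to-end latency samples in seconds

def make_test_message(payload):
    msg_id = str(uuid.uuid4())
    published[msg_id] = time.time()
    return {"id": msg_id, "sent_at": published[msg_id], "payload": payload}

def record_consumed(message):
    msg_id = message["id"]
    consumed_counts[msg_id] = consumed_counts.get(msg_id, 0) + 1
    latencies.append(time.time() - message["sent_at"])

def report(window_seconds):
    duplicates = sum(n - 1 for n in consumed_counts.values() if n > 1)
    lost = len(published) - len(consumed_counts)
    throughput_per_min = len(latencies) / window_seconds * 60
    print(f"throughput={throughput_per_min:.0f}/min "
          f"max_latency={max(latencies):.3f}s duplicates={duplicates} lost={lost}")
```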
RabbitMQ:
1 broker (rabbit), 2 consumers (3 machines): 187k/min, max latency 873ms,
Multiple producers and consumers (11 machines): 97k/min, max latency 17000ms
SQS: 2 consumers: 2.5k/min
Conclusion: RabbitMQ has 70 times more throughput in a single cluster, has better latency and can be scaled up.
With SQS, to support the 50k/min throughput target we would need around 100 machines, and the duplication rate also increases roughly in proportion to the number of consumers.
Skeleton #2
This involved creating all the workers, which performed very basic business logic and logged messages
Testing here involved ensuring that in our architecture we could still meet the performance requirement of 50,000 messages/minute with no message loss or duplication whilst maintaining the 2 second latency
We would check for end to end performance, as well as each individual worker’s performance.
We also started __testing RabbitMQ clustering and configuration__. This stage of testing picked up a networking issue which was present on certain EC2 instance types and required further investigation by our platform engineers; it was great to find this out at this stage rather than further down the track.
The walking skeleton approach allowed us to easily find this issue _before_ the system became too complex.
Skeleton #3
This involved adding business logic to all the workers in the system, including __calls to microservices__, __adding data to the system__, and testing different __workflows and failure scenarios__ (e.g. what happens when microservice x is unavailable), whilst still meeting the 50,000 messages/minute target with no message loss or duplication and maintaining the 2 second latency
Talk to the diagram to explain the databases and cache servers, and how we _populate data_ into them
Tooling & Instrumentation #3
Then we needed tools to load the system with data, tools to test with real-life traffic, and tools to measure all parts of the system so we could identify any bottleneck that would breach our 2 second latency requirement as messages flowed through.
We built a traffic pump that took production data from the legacy gateway’s log files, obfuscated and multiplied it, and replayed it into the new gateway.
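A simplified sketch of the traffic pump idea; the log format, obfuscation scheme and gateway endpoint below are invented for illustration, not the real ones:

```python
# Sketch of a traffic pump: replay legacy log lines into the new gateway,
# obfuscating phone numbers and multiplying volume. The log format and the
# endpoint are hypothetical.
import hashlib
import json
import requests

NEW_GATEWAY_URL = "http://new-gateway.internal/messages"  # hypothetical endpoint
MULTIPLIER = 5                                            # replay each message N times

def obfuscate(phone_number):
    # Deterministic fake number derived from a hash of the original.
    digest = hashlib.sha256(phone_number.encode()).hexdigest()
    return "+6140" + str(int(digest[:8], 16) % 10_000_000).zfill(7)

def pump(log_path):
    with open(log_path) as log:
        for line in log:
            record = json.loads(line)  # assumes one JSON record per log line
            record["to"] = obfuscate(record["to"])
            for _ in range(MULTIPLIER):
                requests.post(NEW_GATEWAY_URL, json=record, timeout=5)
```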
We improved our tooling by collecting metrics from all parts of the system – RabbitMQ, database, EC2 instances across all workers and microservices.
We also built HTTP and SMPP simulators with configurable parameters to simulate real-world providers, e.g. rate limiting and contractual agreements
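To give a flavour of the provider simulators, here is a toy HTTP endpoint with a fixed-window rate limit that returns 429 once the allowed rate is exceeded; the limit and route are illustrative, not a real provider contract:

```python
# Toy HTTP provider simulator with a fixed-window rate limit.
# Not thread-safe; fine for a single-threaded test simulator.
import time
from flask import Flask, jsonify

app = Flask(__name__)

RATE_LIMIT_PER_SECOND = 100
window_start = time.time()
count_in_window = 0

@app.route("/submit", methods=["POST"])
def submit():
    global window_start, count_in_window
    now = time.time()
    if now - window_start >= 1.0:  # start a new one-second window
        window_start, count_in_window = now, 0
    count_in_window += 1
    if count_in_window > RATE_LIMIT_PER_SECOND:
        return jsonify(error="throttled"), 429
    return jsonify(status="accepted"), 200

if __name__ == "__main__":
    app.run(port=8080)
```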
A diagram of the sampling tool, pulling all these metrics into one central place?
As we added more business logic to the system, we started using the traffic pump to feed more and more data into the new system, which was running side by side with the legacy gateway. Because we put a lot of effort into tooling and instrumentation, we had great visibility into all applications and AWS services. We graphed all this data to show spikes in latency or drops in throughput, and more importantly tried to correlate these events with spikes in CPU, network, database utilisation and so on.
When we found a bottleneck, we would tweak the system, e.g. _add more capacity to the database_ or _add more workers_ to the pool, and rerun the test. This was a rinse-and-repeat process until all bottlenecks had been identified and resolved, providing us with a platform that could handle 50k messages/min with <2 second latency, while at the same time proving we had 0 duplicates and 0 message loss.
__We started to tune EC2s__, e.g. 2 machines with 2 GB RAM each may provide less performance than 4 machines with 1 GB RAM each; we compared CPU optimised, memory optimised and IO optimised instance types.
Ex 1. We could see that throughput started to flatline, and EC2 metrics showed a spike in disk I/O, so we changed the EC2 instance type and introduced compression of messages.
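As an example of the compression change, something along these lines can be applied before publishing; this is a sketch only, and the commented properties assume the pika setup from the earlier sketch:

```python
# Sketch: gzip-compress message bodies before publishing to reduce broker
# disk I/O. Exchange name and properties usage mirror the earlier sketch.
import gzip
import json

def compress_body(message_dict):
    return gzip.compress(json.dumps(message_dict).encode("utf-8"))

def decompress_body(body_bytes):
    return json.loads(gzip.decompress(body_bytes).decode("utf-8"))

# On the publishing side (assuming a pika channel as before):
# channel.basic_publish(exchange="message.events", routing_key="",
#                       body=compress_body(message),
#                       properties=pika.BasicProperties(content_encoding="gzip"))
```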
Ex 2. When we pushed a lot of messages through, latency suddenly shot up, but EC2 metrics looked normal; it turned out we were being throttled by DynamoDB provisioned throughput limits.
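For context, DynamoDB throttling surfaces as an explicit error code, so it can be exposed in application metrics rather than hiding behind normal-looking EC2 graphs; a rough boto3 sketch, with the table name and metrics client invented for illustration:

```python
# Sketch: surface DynamoDB throttling explicitly instead of inferring it
# from EC2-level metrics. Table name and metrics client are hypothetical.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("messages")  # hypothetical table

def put_message(item, metrics):
    try:
        table.put_item(Item=item)
    except ClientError as err:
        if err.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
            metrics.increment("dynamodb.throttled")  # hypothetical metrics client
        raise
```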
#1. AWS is known for its service reliability, but don’t assume its services never go down. A famous example is the S3 outage during which the AWS status dashboard stayed all green, because the dashboard itself used S3. AWS services can also suffer degraded performance, such as networking in a multi-tenanted environment. We need to ensure the system can still function, or at least run in a degraded mode with known and acceptable behaviours.
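One common way to keep functioning in a degraded mode is to wrap calls to external dependencies with retries and a fallback; a generic sketch, not our actual code:

```python
# Sketch: retry with exponential backoff, then fall back to a known,
# acceptable degraded-mode behaviour if the dependency stays unavailable.
import time

def call_with_fallback(primary, fallback, attempts=3, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            if attempt == attempts - 1:
                return fallback()  # degraded but known and acceptable behaviour
            time.sleep(base_delay * (2 ** attempt))

# Example (names are hypothetical): if a lookup service is down,
# fall back to a cached or default value.
# result = call_with_fallback(lambda: lookup_service.get(key),
#                             lambda: cached_defaults.get(key))
```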
#2. Expect the development team to change which AWS services they use, e.g. DynamoDB vs. ElastiCache, RDS vs. Redshift, SQS vs. Kinesis etc. The ability to consistently test and measure results makes this a lot less risky from a functional and performance point of view.
#3. Expect the system to go through many upgrades, some of which could contain breaking changes, e.g. handled through message versioning.
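As an illustration of message versioning, each message can carry a version field so producers and consumers can be upgraded independently during a rolling upgrade; a minimal sketch with invented field names:

```python
# Sketch: version every message so consumers can handle old and new formats
# during rolling upgrades. Field names are invented for illustration.
def parse_message(raw):
    version = raw.get("version", 1)
    if version == 1:
        # Old format: a single "recipient" string.
        return {"recipients": [raw["recipient"]], "body": raw["body"]}
    if version == 2:
        # New format: a list of recipients (a breaking change for v1-only consumers).
        return {"recipients": raw["recipients"], "body": raw["body"]}
    raise ValueError(f"unsupported message version: {version}")
```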
Conclusion
The crawling, walking, running skeleton approach allowed us to de-risk the project by proving performance early and identifying issues early, and it allowed our testing tools to be developed in a similarly iterative fashion, so we could adapt and remodel our tools to continually test against the original requirements.
Tooling and instrumentation was critical. Building hundreds of metrics into our codebase, combined with the multitude of metrics available from RabbitMQ and AWS, allowed us to produce highly detailed pictures of how the system was performing at every point, _making performance testing and tuning highly data driven_.
Iterative performance and functional testing was done at every stage of the skeleton build, allowing us to easily isolate and identify issues early on – this included message duplication with SQS, networking irregularities on certain EC2 machines, differing performance patterns with RabbitMQ depending on configuration. This also helped build stakeholder confidence throughout the journey as well as helping platform/operations plan and manage to support the system in production.
Similar to slide 5 to loop back and reinforce your 3 key points