MessageMedia’s next generation gateway processes more than 10 billion events per year. This talk covers the approach adopted in developing the system and shares the lessons learnt in building such a scalable and resilient system.
2. MessageMedia provides a high performance two way messaging platform for developers to build robust and highly scalable applications. Every year more than 23,000 customers send over 1.5 billion messages, enriching their applications with business messaging.
6. A message queue provides an asynchronous communications protocol, i.e. a system that puts a message onto a queue does not require an immediate response to continue processing.
7. Approach
Crawling, walking and running skeleton approach
Build tools and instrumentation
Test performance continuously and consistently
19. Lessons Learnt
Expect AWS services to fail (occasionally)
Expect developers to change to different AWS services
Ensure queueing system provides durable and persistent storage and application can survive upgrades (if self-hosted)
20. Conclusion
Skeleton approach de-risked project
Appropriate tooling and instrumentation is critical
Consistent and iterative performance testing helped build
stakeholder confidence
Introduction
Ben gave an introduction about MessageMedia: we are the #1 business messaging provider in Australia, processing over 1.5 billion messages per year for over 23,000 customers. We send messages worldwide.
SMS: born in 1992, ubiquitous, very high open rate (~98%), much higher than other messaging means such as email – we send to 12 million unique phones in Australia each month
OTP
Reminders
Emergency messages
Marketing – we don’t send messages selling certain types of pharmaceutical products or claiming that you have just won a lottery somewhere in Africa
We processed 1B events in our new system, but what was wrong with the old, legacy system?
Gateway is a collection of applications and databases, responsible for message processing.
Where did we come from? Legacy system:
Monolithic application built in a LAMP stack deployed in data centres
Unreliable performance as business grew, especially when we had spikes in usage from customers
Using the database for transient transactions; a single message sent or received can lead to ~10 DB reads and writes
Scaling relational databases is hard – fundamentally, an RDBMS is designed to run on a single server in order to maintain data integrity
We also had plenty of problems with the inherent issues of relational databases, such as contention, replication lag etc
Lack of metrics; evidence from logs suggests a peak throughput of about 8k messages/minute
The business wanted a system that won’t become obsolete in 5 years but rather something that can grow easily as the business grows.
This is going to be the focus of today’s talk.
Problem #2
What does our new architecture look like?
Microservices – it’s a complete re-architecture
Deployed on cloud and taking advantage of AWS services
Queueing systems
Benefits:
Use queues to facilitate communication among worker applications
This async approach allows easy scaling
As applications process messages from queues, they can also generate events, which get published to event buses; other applications can subscribe to these buses, receive the events and then perform their own business logic (sketched below)
Proven to work well at large tech companies.
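A minimal sketch of this worker pattern, assuming RabbitMQ via the Python pika library; the queue and exchange names are illustrative, not our actual topology:

```python
# Minimal worker sketch: consume work from a queue, run some business logic,
# then publish a resulting event to a fanout exchange that other applications
# can subscribe to. Queue and exchange names are illustrative only.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.queue_declare(queue="messages.outbound", durable=True)
channel.exchange_declare(exchange="message.events", exchange_type="fanout", durable=True)

def handle(ch, method, properties, body):
    message = json.loads(body)
    # ... perform this worker's business logic on `message` ...
    event = {"type": "message.processed", "id": message.get("id")}
    ch.basic_publish(exchange="message.events", routing_key="", body=json.dumps(event))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="messages.outbound", on_message_callback=handle)
channel.start_consuming()
```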
Our Approach – 3 Key Points
Crawling, walking and running skeleton approach to building out our platform
Tools and instrumentation critical to be able to test throughout each stage of our build
Performance and functional testing approaches during each stage
Skeleton #1
The first stage of our build was the crawling skeleton.
This involved testing RabbitMQ versus SQS to determine which one would be better and to ensure that the chosen technology could meet our requirement of 50,000 messages/minute with 2 second latency
There are _2 hops_ in the system
Tooling & Instrumentation #1
Completely different tools were required for testing a queue based system to an RDBMS – publishers that could flood the system with messages for performance testing and metrics we could use to measure how the system was performing
We built tools to publish and consume high volumes of messages
We built tools to measure throughput and latency end to end
We built tools to check for message duplication and consistency, by checking how many messages were published vs consumed
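A rough sketch of the measurement idea (not our actual tooling): every test message carries a unique id and a publish timestamp, so the consuming side can derive throughput, end-to-end latency, duplicates and losses by comparing what went in against what came out:

```python
# Sketch of the measurement approach: tag every message with an id and a
# publish timestamp, then compare published vs consumed on the other side.
import time
import uuid

published = {}        # id -> publish time
consumed_counts = {}  # id -> number of times seen (detects duplicates)
latencies = []        # end-to-end latency samples in seconds

def make_test_message(payload):
    msg_id = str(uuid.uuid4())
    published[msg_id] = time.time()
    return {"id": msg_id, "sent_at": published[msg_id], "payload": payload}

def record_consumed(message):
    msg_id = message["id"]
    consumed_counts[msg_id] = consumed_counts.get(msg_id, 0) + 1
    latencies.append(time.time() - message["sent_at"])

def report(window_seconds):
    duplicates = sum(n - 1 for n in consumed_counts.values() if n > 1)
    lost = len(published) - len(consumed_counts)
    throughput_per_min = len(latencies) / window_seconds * 60
    print(f"throughput={throughput_per_min:.0f}/min "
          f"max_latency={max(latencies):.3f}s duplicates={duplicates} lost={lost}")
```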
RabbitMQ:
1 broker (rabbit), 2 consumers (3 machines): 187k/min, max latency 873ms,
Multiple producers and consumers (11 machines): 97k/min, max latency 17000ms
SQS: 2 consumers: 2.5k/min
Conclusion: RabbitMQ has 70 times more throughput in a single cluster, has better latency and can be scaled up.
With SQS, to support the 50k/min throughput target we would need around 100 machines, and the duplication rate also increases roughly in proportion to the number of consumers.
Skeleton #2
This involved creating all the workers, which performed very basic business logic and logged messages
Testing here involved ensuring that in our architecture we could still meet the performance requirement of 50,000 messages/minute with no message loss or duplication whilst maintaining the 2 second latency
We would check for end to end performance, as well as each individual worker’s performance.
We also started __testing RabbitMQ clustering and configuration__. This stage of testing picked up a networking issue which was present on certain EC2 instance types and required further investigation by our platform engineers; it was great to find this out at this stage rather than further down the track.
The walking skeleton approach allowed us to easily find this issue _before_ the system became too complex.
Skeleton #3
This involved adding business logic to all the workers in the system, including __calls to microservices__, __adding data to the system__, and testing different __workflows and failure scenarios__ (e.g. what happens when microservice x is unavailable), whilst still meeting the 50,000 messages/minute target with no message loss or duplication and maintaining the 2 second latency
Talk to the diagram to explain the databases and cache servers, and how we _populate data_ into them
Tooling & Instrumentation #3
Then we needed tools to load the system with data, tools to test with real-life traffic, and tools to measure all parts of the system so we could identify any bottleneck that would breach our 2 second latency requirement as messages flowed through.
We built a traffic pump that took production data from the legacy gateway’s log files, obfuscated and multiplied it, and replayed it into the new gateway.
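A simplified sketch of the traffic pump idea; the log format, obfuscation scheme and gateway endpoint below are invented for illustration, not the real ones:

```python
# Sketch of a traffic pump: replay legacy log lines into the new gateway,
# obfuscating phone numbers and multiplying volume. The log format and the
# endpoint are hypothetical.
import hashlib
import json
import requests

NEW_GATEWAY_URL = "http://new-gateway.internal/messages"  # hypothetical endpoint
MULTIPLIER = 5                                            # replay each message N times

def obfuscate(phone_number):
    # Deterministic fake number derived from a hash of the original.
    digest = hashlib.sha256(phone_number.encode()).hexdigest()
    return "+6140" + str(int(digest[:8], 16) % 10_000_000).zfill(7)

def pump(log_path):
    with open(log_path) as log:
        for line in log:
            record = json.loads(line)  # assumes one JSON record per log line
            record["to"] = obfuscate(record["to"])
            for _ in range(MULTIPLIER):
                requests.post(NEW_GATEWAY_URL, json=record, timeout=5)
```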
We improved our tooling by collecting metrics from all parts of the system – RabbitMQ, database, EC2 instances across all workers and microservices.
We also built HTTP and SMPP simulators with configurable parameters to simulate real-world providers, e.g. rate limiting and contractual agreements
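To give a flavour of the provider simulators, here is a toy HTTP endpoint with a fixed-window rate limit that returns 429 once the allowed rate is exceeded; the limit and route are illustrative, not a real provider contract:

```python
# Toy HTTP provider simulator with a fixed-window rate limit.
# Not thread-safe; fine for a single-threaded test simulator.
import time
from flask import Flask, jsonify

app = Flask(__name__)

RATE_LIMIT_PER_SECOND = 100
window_start = time.time()
count_in_window = 0

@app.route("/submit", methods=["POST"])
def submit():
    global window_start, count_in_window
    now = time.time()
    if now - window_start >= 1.0:  # start a new one-second window
        window_start, count_in_window = now, 0
    count_in_window += 1
    if count_in_window > RATE_LIMIT_PER_SECOND:
        return jsonify(error="throttled"), 429
    return jsonify(status="accepted"), 200

if __name__ == "__main__":
    app.run(port=8080)
```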
A diagram of the sampling tool, pulling all these metrics into one central place?
As we added more business logic to the system, we started using the traffic pump to feed more and more data into the new system, which was running side by side with the legacy gateway. Because we put a lot of effort into tooling and instrumentation, we had great visibility into all applications and AWS services. We graphed all this data to show spikes in latency or drops in throughput, and more importantly tried to correlate these events with spikes in CPU, network, database utilisation and so on.
When we found a bottleneck, we would tweak the system, e.g. _add more capacity to the database_ or _add more workers_ to the pool, and rerun the test. This was a rinse-and-repeat process until all bottlenecks had been identified and resolved, providing us with a platform that could handle 50k messages/min with <2 second latency, while at the same time proving we had 0 duplicates and 0 message loss.
__We started to tune EC2s__, e.g. 2 machines with 2 GB RAM each may provide less performance than 4 machines with 1 GB RAM each; we compared CPU optimised, memory optimised and IO optimised instance types.
Ex 1. We could see that throughput started to flatline, and EC2 metrics showed a spike in disk I/O, so we changed the EC2 instance type and introduced compression of messages.
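As an example of the compression change, something along these lines can be applied before publishing; this is a sketch only, and the commented properties assume the pika setup from the earlier sketch:

```python
# Sketch: gzip-compress message bodies before publishing to reduce broker
# disk I/O. Exchange name and properties usage mirror the earlier sketch.
import gzip
import json

def compress_body(message_dict):
    return gzip.compress(json.dumps(message_dict).encode("utf-8"))

def decompress_body(body_bytes):
    return json.loads(gzip.decompress(body_bytes).decode("utf-8"))

# On the publishing side (assuming a pika channel as before):
# channel.basic_publish(exchange="message.events", routing_key="",
#                       body=compress_body(message),
#                       properties=pika.BasicProperties(content_encoding="gzip"))
```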
Ex 2. When we pushed a lot of messages through, latency suddenly shot up, but EC2 metrics looked normal; it turned out we were being throttled by DynamoDB provisioned throughput limits.
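For context, DynamoDB throttling surfaces as an explicit error code, so it can be exposed in application metrics rather than hiding behind normal-looking EC2 graphs; a rough boto3 sketch, with the table name and metrics client invented for illustration:

```python
# Sketch: surface DynamoDB throttling explicitly instead of inferring it
# from EC2-level metrics. Table name and metrics client are hypothetical.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("messages")  # hypothetical table

def put_message(item, metrics):
    try:
        table.put_item(Item=item)
    except ClientError as err:
        if err.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
            metrics.increment("dynamodb.throttled")  # hypothetical metrics client
        raise
```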
#1. AWS is known for its service reliability, but don’t assume its services never go down. A famous example is the S3 outage during which the AWS status dashboard stayed all green, because the dashboard itself used S3. AWS services can also suffer degraded performance, such as networking in a multi-tenanted environment. We need to ensure the system can still function, or at least run in a degraded mode with known and acceptable behaviours.
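One common way to keep functioning in a degraded mode is to wrap calls to external dependencies with retries and a fallback; a generic sketch, not our actual code:

```python
# Sketch: retry with exponential backoff, then fall back to a known,
# acceptable degraded-mode behaviour if the dependency stays unavailable.
import time

def call_with_fallback(primary, fallback, attempts=3, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            if attempt == attempts - 1:
                return fallback()  # degraded but known and acceptable behaviour
            time.sleep(base_delay * (2 ** attempt))

# Example (names are hypothetical): if a lookup service is down,
# fall back to a cached or default value.
# result = call_with_fallback(lambda: lookup_service.get(key),
#                             lambda: cached_defaults.get(key))
```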
#2. Expect the development team to change which AWS services they use, e.g. DynamoDB vs. ElastiCache, RDS vs. Redshift, SQS vs. Kinesis etc. The ability to consistently test and measure results makes this a lot less risky from a functional and performance point of view.
#3. Expect the system to go through many upgrades, some of which could contain breaking changes, e.g. handled through message versioning.
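As an illustration of message versioning, each message can carry a version field so producers and consumers can be upgraded independently during a rolling upgrade; a minimal sketch with invented field names:

```python
# Sketch: version every message so consumers can handle old and new formats
# during rolling upgrades. Field names are invented for illustration.
def parse_message(raw):
    version = raw.get("version", 1)
    if version == 1:
        # Old format: a single "recipient" string.
        return {"recipients": [raw["recipient"]], "body": raw["body"]}
    if version == 2:
        # New format: a list of recipients (a breaking change for v1-only consumers).
        return {"recipients": raw["recipients"], "body": raw["body"]}
    raise ValueError(f"unsupported message version: {version}")
```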
Conclusion
The crawling, walking, running skeleton approach allowed us to de-risk the project by proving performance early and identifying issues early, and it allowed our testing tools to be developed in a similarly iterative fashion, so we could adapt and remodel our tools to continually test against the original requirements.
Tooling and instrumentation was critical. Building hundreds of metrics into our codebase, combined with the multitude of metrics available from RabbitMQ and AWS, allowed us to produce highly detailed pictures of how the system was performing at every point, _making performance testing and tuning highly data driven_.
Iterative performance and functional testing was done at every stage of the skeleton build, allowing us to easily isolate and identify issues early on – this included message duplication with SQS, networking irregularities on certain EC2 machines, differing performance patterns with RabbitMQ depending on configuration. This also helped build stakeholder confidence throughout the journey as well as helping platform/operations plan and manage to support the system in production.
Similar to slide 5 to loop back and reinforce your 3 key points