STRICTLY CONFIDENTIAL
TESTING A HIGH PERFORMANCE CLOUD
BASED DISTRIBUTED MESSAGING SYSTEM
MessageMedia provides a high performance two way
messaging platform for developers to build robust and
highly scalable applications. Every year more than
23,000 customers send over 1.5 billion messages,
enriching their applications with business messaging.
Old gateway couldn’t deliver reliability and
performance anymore.
Requirements
50,000 messages per minute throughput
99% of messages delivered in less than 2 seconds
Resilient, durable, scalable and cost effective
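The two numeric requirements above can be turned into a pass/fail check for a test run. The following is a minimal sketch, not the tooling used in the project: the function names, the nearest-rank percentile method and the idea of collecting per-message latencies for a fixed test window are all assumptions made for illustration.

```python
TARGET_PER_MINUTE = 50_000   # throughput requirement from the slide
LATENCY_LIMIT_S = 2.0        # latency requirement from the slide
PERCENTILE = 99              # "99% of messages"


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no latency samples recorded")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def meets_requirements(latencies_s, window_s):
    """Return (throughput_per_minute, p99_latency_s, passed) for one test window.

    latencies_s: end-to-end latency in seconds of every message delivered in the window.
    window_s: length of the test window in seconds.
    """
    throughput = len(latencies_s) / window_s * 60
    p99 = percentile(latencies_s, PERCENTILE)
    passed = throughput >= TARGET_PER_MINUTE and p99 < LATENCY_LIMIT_S
    return throughput, p99, passed
```

For example, 1,000 messages delivered in a one-second window correspond to 60,000 messages per minute; the run passes only if at least 99% of those messages took less than 2 seconds.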
New architecture, new platforms, new
technologies, how do we test all this?
A message queue provides an asynchronous
communications protocol, i.e. a system that puts a message
onto a queue does not require an immediate response to
continue processing.
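The idea can be illustrated in-process with Python's standard `queue` and `threading` modules. This is only a sketch of the pattern: it stands in for a real broker such as RabbitMQ or SQS, and the function name and message format are made up for the example.

```python
import queue
import threading


def run_demo(message_count=5):
    """Publish messages without waiting for the consumer, then return what it processed."""
    q = queue.Queue()
    processed = []

    def consumer():
        while True:
            message = q.get()
            if message is None:          # sentinel: no more messages
                break
            processed.append(message.upper())

    worker = threading.Thread(target=consumer)
    worker.start()

    for i in range(message_count):
        q.put(f"message-{i}")            # returns immediately; no reply is awaited
    q.put(None)                          # tell the consumer to stop
    worker.join()
    return processed
```

The producer loop never blocks on the consumer's work, which is what lets producers and consumers be scaled independently.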
Approach
Crawling, walking and running skeleton approach
Build tools and instrumentation
Test performance continuously and consistently
The Crawling Skeleton
Tooling & Instrumentation
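The notes for this slide describe tools that check for message duplication and consistency by comparing what was published with what was consumed. A minimal sketch of such a check, assuming every message carries a unique ID (the function name and the report keys are hypothetical):

```python
from collections import Counter


def check_consistency(published_ids, consumed_ids):
    """Compare published and consumed message IDs and report discrepancies."""
    published = set(published_ids)
    counts = Counter(consumed_ids)       # how many times each ID was consumed
    consumed = set(counts)
    return {
        "lost": sorted(published - consumed),
        "duplicated": sorted(mid for mid, n in counts.items() if n > 1),
        "unexpected": sorted(consumed - published),
    }
```

A clean run returns three empty lists; anything else points at message loss, duplicate delivery, or messages that the test did not publish.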
RabbitMQ vs SQS
Throughput in Messages per Minute
RabbitMQ (1 broker, 2 consumers): 187,000
RabbitMQ (multiple producers and consumers): 97,000
SQS (2 consumers): 2,400
The Walking Skeleton
Tooling & Instrumentation
The Running Skeleton
Tooling & Instrumentation #3
All The Metrics!
Worker[1..n] Instance CPU
Worker[1..n] Instance Disk I/O
Worker[1..n] Instance RAM
Worker[1..n] Instance Network I/O
Worker[1..n] Instance Latency
Worker[1..n] Instance Throughput
RabbitMQ Nodes[1..3] CPU
RabbitMQ Nodes[1..3] Disk I/O
RabbitMQ Nodes[1..3] RAM
RabbitMQ Nodes[1..3] Network I/O
DynamoDB Provisioned Throughput
DynamoDB Throttling
Redis Nodes CPU
Redis Nodes RAM
JMeter Throughput
RabbitMQ Cluster Ingest Rate
RabbitMQ Cluster Egress Rate
Micro Service[1..n] HTTP 200 Requests
Micro Service[1..n] HTTP 500 Requests
Micro Service[1..n] HTTP Request Latency
Micro Service[1..n] HTTP CPU
Micro Service[1..n] HTTP Network I/O
Worker Metrics
[Chart: CPU, RAM, Disk IO and Latency over Time]
DynamoDB Metrics
[Chart: Provisioned Reads, Provisioned Writes, Throttled Requests and Latency over Time]
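Charts like these are read by lining up latency against the other metrics over the same time axis. The same comparison can be sketched in code by ranking each metric by its Pearson correlation with latency. This is an illustration only: the metric names and sample values in the usage note are invented, and correlation is just one simple way to shortlist a suspect.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0                       # a flat series carries no signal
    return cov / math.sqrt(var_x * var_y)


def rank_suspects(latency, metrics):
    """Return (metric_name, correlation) pairs, most correlated with latency first."""
    scores = {name: pearson(latency, series) for name, series in metrics.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With a latency series that spikes in the same samples as a `dynamodb_throttled_requests` series while a `worker_cpu` series stays flat, the throttling metric comes out on top, which mirrors the case in the notes where EC2 metrics looked normal but DynamoDB was throttling.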
Lessons Learnt
Expect AWS services to fail (occasionally)
Expect developers to change to use different AWS services
Ensure queueing system provides durable and persistent storage
and application can survive upgrades (if self-hosted)
Conclusion
Skeleton approach de-risked project
Appropriate tooling and instrumentation is critical
Consistent and iterative performance testing helped build
stakeholder confidence
References
https://aws.amazon.com/sqs/
https://www.rabbitmq.com/
http://colby.id.au/benchmarking-sqs/
http://jmeter.apache.org/
Questions?
Stay in touch!!
Slack: https://messagemediadevs.slack.com
Developer Portal: https://developers.messagemedia.com
Twitter: https://twitter.com/messagemedia1

Editor's Notes

  1. Introduction. Ben gave an introduction about MM: we are the #1 business messaging provider in Australia, processing over 1.5 billion messages per year for over 23k customers. We send messages worldwide. SMS: born in 1992, ubiquitous, very high open rate (~98%), much higher than other messaging means such as email. We send to 12 million unique phones in Australia each month. Use cases: OTP, reminders, emergency messages, marketing. We don’t send messages selling certain types of pharmaceutical products or saying that you have just won a lottery somewhere in Africa. We processed 1B events in our new system, but what was wrong with the old, legacy system?
  2. The gateway is a collection of applications and databases responsible for message processing. Where did we come from? Legacy system: a monolithic application built on a LAMP stack and deployed in data centres. Performance became unreliable as the business grew, especially when we had spikes in usage from customers. It used the database for transient transactions: a single message sent or received could lead to ~10 DB reads and writes. Scaling relational databases is hard; fundamentally, RDBMSs are designed to run on a single server in order to maintain data integrity. We also had heaps of problems with inherent database issues such as contention and replication lag. Metrics were lacking: evidence (logs) suggests a peak throughput of about 8k messages/minute.
  3. The business wanted a system that won’t become obsolete in 5 years but rather something that can grow easily as the business grows. This is going to be the focus of today’s talk.
  4. Problem #2. What does our new architecture look like? Micro services: it’s a complete re-architecture, deployed on cloud and taking advantage of AWS services, with queueing systems.
  5. Benefits: We use queues to facilitate communication among worker applications. This async approach allows easy scaling. As applications process messages from queues, they can also generate events, which get published to services/event buses; other applications can subscribe to these buses to get the events and then perform business logic. This is proven to work well at large tech companies.
  6. Our Approach: 3 Key Points. (1) A crawling, walking and running skeleton approach to building out our platform. (2) Tools and instrumentation, critical to being able to test throughout each stage of our build. (3) Performance and functional testing approaches during each stage.
  7. Skeleton #1. The first stage of our build was the crawling skeleton. This involved testing RabbitMQ versus SQS to see which one would be better and to ensure that the chosen technology could meet our requirements of 50,000 messages/minute with 2 seconds latency. There are 2 hops in the system.
  8. Tooling & Instrumentation #1. Testing a queue-based system required completely different tools from testing an RDBMS: publishers that could flood the system with messages for performance testing, and metrics we could use to measure how the system was performing. We built tools to publish and consume high volumes of messages. We built tools to measure throughput and latency end to end. We built tools to check for message duplication and consistency by comparing how many messages were published vs consumed.
  9. Results. RabbitMQ with 1 broker and 2 consumers (3 machines): 187k/min, max latency 873 ms. RabbitMQ with multiple producers and consumers (11 machines): 97k/min, max latency 17,000 ms. SQS with 2 consumers: 2.5k/min. Conclusion: RabbitMQ has about 70 times more throughput in a single cluster, has better latency and can be scaled up. To support the 50k/min throughput target we would need around 100 machines. Duplication rate also increases proportionally to the number of consumers.
  10. Skeleton #2. This involved creating all the workers, which perform very basic business logic and log messages. Testing here involved ensuring that in our architecture we could still meet the performance requirements of 50,000 messages/minute with no message loss or duplication whilst maintaining the 2 seconds latency.
  11. Skeleton #2 (continued). We would check end-to-end performance as well as each individual worker’s performance. We also started testing RabbitMQ clustering and configuration. This stage of testing picked up a networking issue which was present on certain types of EC2 instance and required further investigation by our platform engineers; it was great to find this out at this stage rather than further down the track. The walking skeleton approach allowed us to find this issue before the system became too complex.
  12. Skeleton #3. This involved adding business logic to all the workers in the system, including calls to microservices, adding data to the system, and testing different workflows and failure scenarios (what happens when microservice x is unavailable?), whilst still meeting the 50,000 messages/minute with no message loss or duplication and maintaining the 2 seconds latency. Talk to the diagram to explain databases and cache servers, and how we populate data into them.
  13. Tooling & Instrumentation #3. Then we needed tools to load the system with data, tools to test with real-life traffic, and tools to measure all parts of the system as messages flowed through it, to identify bottlenecks that would breach our 2 second latency requirement. We built a traffic pump that would pump production data, taken from legacy log files, obfuscated and multiplied, from the old gateway into the new gateway. We improved our tooling by collecting metrics from all parts of the system: RabbitMQ, database, and EC2 instances across all workers and microservices. We also built HTTP and SMPP simulators with various parameters to simulate real-world providers, e.g. rate limiting and contractual agreements. A diagram of the sampling tool, pulling all these metrics into one central place?
  15. As we added more business logic to the system we started using the traffic pump to feed more and more data into the new system, which was running side by side with the legacy gateway. Because we put a lot of effort into tooling and instrumentation we had great visibility into all applications and AWS services. We graphed all this data to show us spikes in latency or drops in throughput, and more importantly tried to correlate these events with spikes in CPU, network, database utilisation and so on. When we found a bottleneck, we would tweak it, e.g. add more capacity to the database or add more workers to the pool, and rerun the test. This was a rinse-and-repeat process until all bottlenecks had been identified and resolved, giving us a platform that could handle 50k messages/min with <2 second latency while proving we had 0 duplicates and 0 message loss. We started to tune EC2s, e.g. 2 machines with 2 GB RAM each may provide less performance than 4 machines with 1 GB RAM each. Instance types: CPU optimised, memory optimised and IO optimised. Ex 1: we could see that throughput started to flatline and saw a spike in disk I/O in the EC2 metrics, so we changed EC2 type and introduced compression of messages. Ex 2: when we pushed a lot of messages through, latency suddenly shot up but the EC2 metrics looked normal; it turned out we were being throttled by DynamoDB provisioned throughput limits.
  17. We started to tune EC2s, e.g. 2 machines with 2 GB RAM each may provide less performance than 4 machines with 1 GB RAM each. Instance types: CPU optimised, memory optimised and IO optimised.
  18. #1. AWS is known for its service reliability, but don’t assume its services never go down. A famous example is the S3 outage during which the AWS dashboard stayed all green, because it uses S3. AWS services can also show degraded performance, such as networking in a multi-tenanted environment. We need to ensure the system can still function, or at least run in a degraded mode with known and acceptable behaviours. #2. Expect the development team to change its use of AWS services, e.g. DynamoDB vs. ElastiCache, RDS vs. Redshift, SQS vs. Kinesis. The ability to consistently test and measure results makes this a lot less risky from a functional and performance point of view. #3. Expect to have many upgrades to the system, some of which could contain breaking changes, e.g. message versioning.
  19. Conclusion. The crawling, walking, running skeleton approach allowed us to de-risk the project by proving performance early and identifying issues early. It also allowed our testing tools to be developed in a similarly iterative fashion, so we could adapt our tools to continually test against the original requirements. Tooling and instrumentation was critical. Building hundreds of metrics into our codebase, combined with the multitude of metrics available from RabbitMQ and AWS, allowed us to produce highly detailed pictures of how the system was performing at every point, making performance testing and tuning highly data driven. Iterative performance and functional testing was done at every stage of the skeleton build, allowing us to easily isolate and identify issues early on; this included message duplication with SQS, networking irregularities on certain EC2 machines, and differing performance patterns with RabbitMQ depending on configuration. This also helped build stakeholder confidence throughout the journey, as well as helping platform/operations plan and manage to support the system in production. Similar to slide 5 to loop back and reinforce your 3 key points.
  20. Queues: decoupling of processes; redundancy; scalable; elastic; resilience; delivery guarantee
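
Note 13 mentions HTTP and SMPP simulators that imitate real-world provider behaviour such as rate limiting. One common way to model a rate limit is a token bucket; the sketch below is an assumption about how such a simulator could decide whether to accept a message, not a description of the simulators that were actually built. The class name and parameters are hypothetical, and time is passed in explicitly so the behaviour is easy to test.

```python
class TokenBucket:
    """Accept at most `rate_per_s` messages per second on average, with bursts up to `burst`."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)       # start with a full bucket
        self.last = 0.0                  # time of the previous call, in seconds

    def allow(self, now):
        """Return True if a message arriving at time `now` (seconds) is accepted."""
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                     # the simulated provider rejects the message
```

A simulator built this way would reject (or delay) messages once the bucket is empty, which lets a load test exercise the gateway's handling of provider-side throttling.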