Patterns for building resilient and scalable microservices platform on AWS

16,594 views

Published on

In this talk we explore Hailo's H2 platform under the hood, taking a peek into the orchestration layer and introducing various patterns for building a scalable and resilient microservices platform. We share insights about our architecture and how it evolved into a cloud-agnostic, self-managed system.

Published in: Technology

  1. Patterns for building resilient and scalable microservices platform on AWS
     Boyan Dimitrov, Platform Automation Lead @ Hailo, @nathariel
  2. Back in 2011 we started simple
     We quickly found out that supporting monoliths is hard:
     • Hard to maintain the codebase
     • Hard to build new features
     • Hard to scale the dev teams
     Failure to deliver business value
     [Diagram: Frontend, Backend, MySQL]
  3. So in 2013 we ended up doing…
  4. At present we have
     • Microservices ecosystem (99.9% written in Go)
     • Designed specifically for the cloud – different building blocks and components will constantly be in flux, broken or unavailable
     • 1000+ AWS instances spanning multiple regions
     • 200+ services in production
  5. The Platform under the hood
  6. [Diagram: platform layers. Cloud Provider: VPC, Auto Scaling, S3, CF, EC2, Route 53, Redshift. Orchestration: Env, DNS, Release, AutoScaling, Compute, EIP, Whisper. Core Platform: Discovery, Monitoring, Routing, Provisioning, Login, Config. Services.]
  7. • Lowest level building blocks
     • We mostly use basic PaaS components and services as they cover most of our needs
     • We expect every underlying component to fail and we designed for this
  8. [Diagram: two regions, eu-west-1 and us-east-1, each with Proxy Layer, Message Bus+, Go Services and C*]
  9. [Diagram: eu-west-1 across eu-west-1a, eu-west-1b and eu-west-1c. Proxy Layer. Message Bus: RabbitMQ in each AZ. Services: API and Go services (x many) in each AZ. Shared Infra: C*, NSQ and ZK in each AZ.]
  10. Basic principles
      • We use auto scaling groups for everything
        - Guarantees each component can be rebuilt automatically
        - Including our database clusters that run on ephemeral storage (we do keep 6 copies of each piece of data in 2 regions)
      • Minimum of 3 AZs in every region
      • Every workflow is automated
      • Every component has to be self-healing and scalable
  11. • Our “cloud provider abstraction” layer
      • Main purpose is infrastructure and workflow automation and discovery
      • Has a global view of everything happening across our infrastructure
      • Provides additional capabilities on top of AWS
      • Runs in dedicated VPCs across two regions
      [Diagram: Orchestration layer services: Env, DNS, Release, AutoScaling, Compute, EIP, Whisper]
  12. It all started with a small challenge we had to overcome: payment providers whitelist sources
  13. EIP Service (Elastic IP Provisioning Service)
      Maintains elastic IP pools across all our accounts and matches them against auto scaling groups and environments
      51.x.x.1    nat live
      51.x.x.2    nat live
      51.x.x.3    nat live
      50.x.x.5 1  nat foo
      [Diagram: NAT LIVE and NAT FOO auto scaling groups drawing addresses from the pools]
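The pool matching described for the EIP service can be illustrated with a minimal in-memory sketch. It is not Hailo's implementation: the `PoolKey`, `EIPPool`, `Acquire` and `Release` names are hypothetical, the addresses are documentation-range placeholders, and a real service would call the AWS APIs to associate addresses rather than update a map.

```go
package main

import (
	"errors"
	"fmt"
)

// PoolKey identifies a pool, e.g. {"nat", "live"}. Field names are hypothetical.
type PoolKey struct {
	Role string
	Env  string
}

// ErrPoolExhausted is returned when a pool has no free address left.
var ErrPoolExhausted = errors.New("no free address in pool")

// EIPPool tracks free and leased addresses per pool (not goroutine-safe).
type EIPPool struct {
	free   map[PoolKey][]string
	leased map[string]string // address -> instance ID
}

func NewEIPPool() *EIPPool {
	return &EIPPool{free: map[PoolKey][]string{}, leased: map[string]string{}}
}

// Add registers an address as free in the given pool.
func (p *EIPPool) Add(key PoolKey, addr string) {
	p.free[key] = append(p.free[key], addr)
}

// Acquire leases the first free address of the pool to an instance.
func (p *EIPPool) Acquire(key PoolKey, instanceID string) (string, error) {
	addrs := p.free[key]
	if len(addrs) == 0 {
		return "", ErrPoolExhausted
	}
	addr := addrs[0]
	p.free[key] = addrs[1:]
	p.leased[addr] = instanceID
	return addr, nil
}

// Release returns a leased address to its pool; unknown addresses are ignored.
func (p *EIPPool) Release(key PoolKey, addr string) {
	if _, ok := p.leased[addr]; !ok {
		return
	}
	delete(p.leased, addr)
	p.free[key] = append(p.free[key], addr)
}

func main() {
	pool := NewEIPPool()
	live := PoolKey{Role: "nat", Env: "live"}
	pool.Add(live, "203.0.113.1")
	pool.Add(live, "203.0.113.2")
	addr, err := pool.Acquire(live, "i-example")
	fmt.Println(addr, err)
}
```

Keeping each pool fixed is what lets a payment provider whitelist a known, stable set of source addresses even though the instances behind them are replaced by their auto scaling group.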
  14. We do a lot of server discovery
      • Both external and internal orchestration tools rely on AWS APIs for server discovery
      • Puppet has AWS integration for clustering infra
      • Exponential back-off mitigates the issue but does not solve it if you have many clients
      “RequestLimitExceeded”
  15. Compute service to the rescue
      • A distributed cache of all compute instances and their metadata
      • Powerful query API (very fast!)
      • Main interface for creating new compute instances
      • Reconciles any changes in any AWS account within seconds
      [Diagram: Compute Service in front of AWS and other providers, used by internal tools, external tools and services]
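The idea of answering discovery queries from a cache instead of the AWS APIs can be sketched with a single-process, in-memory version. The real service is distributed and its API is not shown in the slides, so `Instance`, `Cache`, `Reconcile` and `Query` are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Instance is a hypothetical record of one compute instance and its metadata.
type Instance struct {
	ID, Account, Region string
	Tags                map[string]string
}

// Cache holds the last known state of all instances, keyed by instance ID.
type Cache struct {
	mu        sync.RWMutex
	instances map[string]Instance
}

func NewCache() *Cache { return &Cache{instances: map[string]Instance{}} }

// Reconcile replaces the cached view of one account with a fresh listing,
// so instances that disappeared from the account are dropped from the cache.
func (c *Cache) Reconcile(account string, current []Instance) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for id, inst := range c.instances {
		if inst.Account == account {
			delete(c.instances, id)
		}
	}
	for _, inst := range current {
		c.instances[inst.ID] = inst
	}
}

// Query returns the sorted IDs of all cached instances accepted by match.
func (c *Cache) Query(match func(Instance) bool) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var ids []string
	for id, inst := range c.instances {
		if match(inst) {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	c := NewCache()
	c.Reconcile("account-x", []Instance{
		{ID: "i-1", Account: "account-x", Region: "eu-west-1", Tags: map[string]string{"role": "nat"}},
		{ID: "i-2", Account: "account-x", Region: "us-east-1", Tags: map[string]string{"role": "api"}},
	})
	fmt.Println(c.Query(func(i Instance) bool { return i.Tags["role"] == "nat" }))
}
```

In this sketch only the reconciliation loop would talk to the cloud provider, so the number of API calls no longer grows with the number of clients asking discovery questions.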
  16. Everything in our platform emits events
      So naturally we want to capture all external events as well!
  17. Whisper Service
      It’s all about event driven compute – think Lambda but within our platform
      To subscribe to any new event source we only have to change a single service
      [Diagram: external sources send events to Whisper, which publishes them on NSQ topics; hundreds of publishers & subscribers; events trigger actions]
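The publish/subscribe flow behind this kind of event-driven compute can be sketched with an in-memory dispatcher. This is a stand-in for illustration only: it does not use the NSQ client library, and `Event`, `Dispatcher`, `Subscribe` and `Publish`, as well as the topic and event names, are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical envelope for something that happened outside
// or inside the platform.
type Event struct {
	Source string
	Type   string
}

// Handler is the action run when an event arrives on a topic.
type Handler func(Event)

// Dispatcher maps topic names to the handlers subscribed to them.
type Dispatcher struct {
	mu       sync.RWMutex
	handlers map[string][]Handler
}

func NewDispatcher() *Dispatcher {
	return &Dispatcher{handlers: map[string][]Handler{}}
}

// Subscribe registers a handler for a topic.
func (d *Dispatcher) Subscribe(topic string, h Handler) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.handlers[topic] = append(d.handlers[topic], h)
}

// Publish delivers the event to every handler of the topic, synchronously,
// and returns how many handlers were invoked.
func (d *Dispatcher) Publish(topic string, e Event) int {
	d.mu.RLock()
	hs := append([]Handler(nil), d.handlers[topic]...) // copy, then unlock
	d.mu.RUnlock()
	for _, h := range hs {
		h(e)
	}
	return len(hs)
}

func main() {
	d := NewDispatcher()
	d.Subscribe("autoscaling", func(e Event) {
		fmt.Println("action for", e.Type, "from", e.Source)
	})
	d.Publish("autoscaling", Event{Source: "aws", Type: "instance-launched"})
}
```

Because publishers and subscribers only share a topic name, a new external event source needs changes in one place: the service that translates it into platform events.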
  18. What about AWS resource access?
  19. AWS Auth under the hood
      • Each external orchestration service instance has a “global” view of our infrastructure
      • Relies heavily on STS to operate across different accounts and regions
      • Each service has a designated role for every account and region
      [Diagram: a service obtains temporary security credentials for a role in AWS Account X and for a role in AWS Account Y]
  20. Shared environments create contention. We decided to boost our developers' productivity and give them on-demand environments
      [Diagram: ENV, ENV, ENV]
  21. On demand environments
      Environment Service
      • SIE: Single Instance Environment (single server on AWS)
      • MIE: Multi instance environment (hundreds of servers / single AWS region)
      [Diagram: SIE and MIE each contain Infrastructure and Core Platform; built through CloudFormation and the Orchestration layer]
  22. On demand environments
      Environment Service
      • SIE: Single Instance Environment (single server on AWS)
      • MIE: Multi instance environment (hundreds of servers / single AWS region)
      Release Service: ANY ENV (PROD): Services, Config, *Data clone
      [Diagram: SIE and MIE each contain Infrastructure and Core Platform; built through CloudFormation and the Orchestration layer; labels ETA: ~12 min, ETA: ~40 min]
  23. On demand environments
      Environment Service
      • SIE: Single Instance Environment (single server on AWS, Vagrant support)
      • MIE: Multi instance environment (hundreds of servers / single AWS region, multi-region environments)
      Release Service: ANY ENV (PROD): Services, Config, *Data clone
      [Diagram: SIE and MIE each contain Infrastructure and Core Platform; built through CloudFormation and the Orchestration layer; labels ETA: ~12 min, ETA: ~40 min]
  24. All of this so we can do
      [Diagram: Orchestration managing several SIE and MIE environments alongside Pre Prod and Live]
  25. Preparing for…
      [Diagram: Orchestration managing several SIE and MIE environments alongside Live]
  26. • The only services directly aware of our cloud provider specifics – gives us a lot of flexibility and lets us introduce changes quickly
      • Each of them fulfills a very specific task and together they create powerful workflows
      • Nothing else in our platform is aware of the underlying cloud layer
      • We did not envision being “cloud agnostic” – it just happened
  27. Provides the most essential platform functions for every service:
      • Service Discovery
      • Service Provisioning
      • Routing & Load Balancing
      • Authentication/Authorization
      • Monitoring
      • Configuration
  28. Service Provisioning
  29. Provisioning overview
      [Diagram: Build Pipeline, Amazon S3, Docker Registry, Provisioning Manager, and a Provisioning Service on each instance; instances sit in auto scaling groups and run services as a Process or a Container]
  30. Service deployment specifics
      • Each service is decoupled from the rest and deployed individually
      • We run multiple services on the same instance but each service is deployed in at least 3 AZs
      • We rely on auto scaling groups for organizing and scaling our workload
      • We use static partitioning to match a service to an auto scaling group and this results in non-optimal resource utilisation (25% - 50%)
  31. Deploying a service
      [Diagram labels: service name, version, auto scaling group]
  32. Coming soon: Elastic resource pools and QoS scheduling
      [Diagram: Elastic Resource Pool of instances, each running an ECS Agent, across eu-west-1a, eu-west-1b and eu-west-1c; QoS Scheduler on top of the ECS Cluster Manager; AWS Cloud Provider]
  33. So what does this mean?
      Elastic resource pool: 75-80% utilization
      One word – such difference!
      [Diagram: instances across eu-west-1a, eu-west-1b and eu-west-1c]
  34. Why build our own scheduler?
      We want a cloud-native scheduler that is aware of the cloud specifics and our microservices ecosystem:
      • Service Priority
      • Service specific runtime metrics
      • Interference
      • Cloud awareness (availability zones, pool elasticity…)
      Running services in a pay-as-you-go fashion will soon be as much a reality as today's on-demand compute
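To make the "service priority" criterion concrete, here is a deliberately small placement sketch. It is not the scheduler described in the talk, whose design the slides do not show: it handles priority and free capacity only and ignores runtime metrics, interference and availability zones. `Service`, `Node` and `Schedule` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Service is a hypothetical unit of work with a priority and a CPU demand.
type Service struct {
	Name     string
	Priority int // higher values are placed first
	CPU      int
}

// Node is a hypothetical instance in the resource pool.
type Node struct {
	ID      string
	FreeCPU int
}

// Schedule places services in descending priority order. Each service goes
// to the node with the most free CPU that can still fit it. Services that
// fit nowhere are left out of the returned map (service name -> node ID).
func Schedule(services []Service, nodes []Node) map[string]string {
	pending := append([]Service(nil), services...) // do not reorder the input
	sort.SliceStable(pending, func(i, j int) bool {
		return pending[i].Priority > pending[j].Priority
	})
	free := append([]Node(nil), nodes...) // track capacity on a copy
	placement := map[string]string{}
	for _, s := range pending {
		best := -1
		for i := range free {
			if free[i].FreeCPU >= s.CPU && (best == -1 || free[i].FreeCPU > free[best].FreeCPU) {
				best = i
			}
		}
		if best == -1 {
			continue // no node has room for this service
		}
		free[best].FreeCPU -= s.CPU
		placement[s.Name] = free[best].ID
	}
	return placement
}

func main() {
	nodes := []Node{{ID: "node-a", FreeCPU: 4}, {ID: "node-b", FreeCPU: 2}}
	services := []Service{
		{Name: "reporting", Priority: 1, CPU: 2},
		{Name: "payments", Priority: 9, CPU: 4},
	}
	fmt.Println(Schedule(services, nodes))
}
```

Packing services onto a shared pool by demand, rather than pinning each service to its own auto scaling group, is the kind of change that moves utilisation from the 25% - 50% range toward the 75-80% target quoted above.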
  35. • Self-contained units of execution
      • Built around business capabilities or domain objects
      • Small enough to be rewritten in a few days
      • They are all about adding business value
  36. Service interactions – not as scary as it looks!
  37. A microservice under the hood
      Any service gets for free:
      • Service to service communication libs
      • Discovery
      • Configuration
      • A/B testing capabilities
      • Monitoring & Instrumentation
      • … and much more
      [Diagram: a Service made of Handler, Logic and Storage; service-layer: library for abstracting service-to-service comms; platform-layer: self-configuring external service adapters]
  38. Microservices are all about tooling
  39. Live request tracing
  40. You need to identify your main KPIs
  41. Thanks! Get a taxi home on us: @nathariel boyan@hailocab.com @HailoTech
