Video - https://www.youtube.com/watch?v=uAladw9ef-I
Learn how Carousell manages infrastructure to support sustained growth, high levels of uptime, and a variety of use cases. Specifically:
The journey so far
Cloud migration
How we manage configuration for hundreds of servers
A deployment pipeline that allows quick deployments and rollbacks
Adoption of new technology such as Kubernetes for smoother deployments and management
How we manage all data stores in-house with no 3rd-party help in the entire transactional flow.
Monitoring: our eyes and ears into hundreds of our servers
How all of the above was put into practice to execute a successful flash sale.
1. Scaling Infrastructure at Carousell
Harshad Rotithor & Ankur Shrivastava
January 12, 2017
Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 1 / 48
2. Who are we?
Harshad Rotithor
Principal Software Engineer
Leads Infrastructure team
Previously at Flipkart,
Airpush, Zynga, etc.
harshad@carousell.com
3. Who are we?
Ankur Shrivastava
Senior Software Engineer
Engineer in the Infrastructure
team
Previously at Flipkart,
Amazon, Zynga, etc.
ankur@carousell.com
6. Where are we currently?
Started in 2012 at a Hackathon
7 countries, 19 cities
57M+ listings
23M+ items sold
Carousell makes buying and selling
simple, so that you can fill your life
with more meaningful things
9. Where are we currently?
400+ servers
Multiple Services see 2000+ requests per second
Self Managed deployments
PostgreSQL
ElasticSearch
Cassandra
RabbitMQ
Kafka
Redis
Memcache
and more ...
Uptime of 99.95%
Ability to handle AZ failures
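To make the uptime figure concrete, here is a small arithmetic sketch. It assumes the 99.95 on the slide is a percentage; the function name and the period lengths are illustrative choices, not something stated in the talk.

```python
# Downtime budget implied by an availability target (illustrative sketch).
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_percent: float, days: int) -> float:
    """Return the minutes of downtime allowed over `days` at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * MINUTES_PER_DAY


if __name__ == "__main__":
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        budget = downtime_budget_minutes(99.95, days)
        print(f"{label}: {budget:.1f} minutes")
```

Under that assumption the budget works out to about 21.6 minutes per 30-day month, or about 262.8 minutes (roughly 4.4 hours) per year.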
10. So what is this talk about?
11. What it took to reach here
And what lies ahead!
14. Current Infrastructure - Overview
Infrastructure is:
Architecture
Systems
Operations
Stateful components most important
We self-manage user path data
stores
Enable choice of data stores
Right tradeoff in terms of
consistency
Enable possibilities of
workarounds during rough times
Have flexibility in node
configuration etc
17. Current Infrastructure - Data Stores
Master + 2 Slaves in each AZ (Total
7)
pgbouncer + HA Proxy
(config-service)
Dedicated data disks (always use
SSDs)
Master disk snapshot every 3hr
(fsync enabled)
Don’t turn off Autovacuum
(transaction id)
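The "(transaction id)" note presumably refers to transaction ID wraparound, which regular vacuuming keeps at bay. The sketch below is one way to watch for it, assuming the ages are fetched with the query held in XID_AGE_QUERY (age() and pg_database.datfrozenxid are standard PostgreSQL). The 50% warning threshold, the helper names, and the sample rows are illustrative assumptions, not values from the talk.

```python
# Sketch: how far along is each database towards transaction ID wraparound?
# The query would be run with any PostgreSQL client; it is shown here as a string only.
XID_AGE_QUERY = "SELECT datname, age(datfrozenxid) FROM pg_database;"

# Transaction IDs are 32-bit, so roughly 2**31 transactions can separate the
# oldest unfrozen row from the current transaction.
WRAPAROUND_LIMIT = 2**31


def wraparound_usage(xid_age: int) -> float:
    """Fraction of the wraparound limit already consumed by a database."""
    return xid_age / WRAPAROUND_LIMIT


def needs_attention(xid_age: int, warn_fraction: float = 0.5) -> bool:
    """True when the age has passed the (assumed) warning fraction."""
    return wraparound_usage(xid_age) >= warn_fraction


if __name__ == "__main__":
    # Hypothetical rows, shaped like the output of XID_AGE_QUERY.
    sample_rows = [("app_db", 180_000_000), ("analytics_db", 1_500_000_000)]
    for name, xid_age in sample_rows:
        status = "WARN" if needs_attention(xid_age) else "ok"
        print(f"{name}: {wraparound_usage(xid_age):.0%} of limit [{status}]")
```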
20. Current Infrastructure - Data Stores
3 clusters, largest being close to 75 nodes
Shard allocation awareness
Use Plugins (kopf /head/cerebro)
Keep masters in different AZ
HAProxy with L7 healthchecks
(config-service)
Incremental backups
Set shard count correctly, be on higher side.
Rely on linux page cache
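As a sketch of what shard allocation awareness looks like in practice, the snippet below builds a request body for the Elasticsearch cluster settings API. The attribute name "zone" and the zone names are hypothetical, and the talk does not say whether forced awareness was used; each node would also need to be started with a matching node attribute.

```python
import json
from typing import Dict, Sequence


def awareness_settings(attribute: str, zones: Sequence[str]) -> Dict[str, Dict[str, str]]:
    """Build a cluster-settings body that enables (forced) shard allocation awareness."""
    prefix = "cluster.routing.allocation.awareness"
    return {
        "persistent": {
            # Spread copies of a shard across distinct values of this node attribute.
            f"{prefix}.attributes": attribute,
            # Forced awareness: do not pile all copies into one zone if another is down.
            f"{prefix}.force.{attribute}.values": ",".join(zones),
        }
    }


if __name__ == "__main__":
    # Hypothetical zone names; the body would be sent with PUT _cluster/settings.
    print(json.dumps(awareness_settings("zone", ["zone-a", "zone-b"]), indent=2))
```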
24. History
Cloud provider ’x’
Everyday firefighting
We hit upper limits
Network
Disk
Noisy neighbours
Limited types of instances
Lack of features
Load balancer
Autoscaling
Security!
Decided on Migration
27. Planning
Around June 2016
250+ Nodes
Identify ALL nodes and their functionalities
Identify ALL traffic flows and patterns
Architecture Freeze
Perform comparative benchmarks
Redefine node and cluster configuration
Isolated deployment in GCP
Dry run data migration for all clusters
Estimate time
29. Preparation
July 2016
VPN across the providers (Heavy
Duty)
Replicate all that can be replicated
(inter DC)
Keep stateless nodes ready
Make DNS nameserver changes in
advance (3-4 days)
Script everything - node creation,
data movement, etc.
Aim for only data movement during
Migration
35. Migration
29th July 2016 at 3am
Queues - RabbitMQ, Kafka, etc
Drain on X
Switch to new on GCP
DB
Replicated slaves across DC
Promote to master and create
slaves
ElasticSearch & Cassandra
Snapshot/Restore
Very Quick - Fast GCP network
Redis
RDB restore, create slaves
Beware of cluster state in case of
redis cluster
36. Post Migration
5-6hr of Maintenance
Latency dropped to 1/4th on GCP
DNS propagation issue (even after 2 days)
L7 tunnels over VPN
Ensure monitoring is taken over after migration
38. Key Take Away
Practice makes the migration
perfect!
Keep stateless nodes ready
Keep configuration updated
Expect issues
Redis cluster state switch
DNS caching by ISPs for days
Keep Calm!
44. From Pets To Cattle
Static Infrastructure is a myth!
Manual updates can be faulty
Nodes can fail quickly, one after
another
Configuration can quickly become
stale
Misconfiguration of Nodes
Salt propagation issues
Recent config update
Painful to detect and fix
Production impact!
51. From Pets To Cattle
Infrastructure at scale needs →
Centralized configurations
Dynamic Discovery
Automatic recovery from failures
Autoscaling
Scripts for stateful nodes
(create/update/migrate)
Aggressive Monitoring and Alerting
Streamline Deployments
52. Configuration and Service Discovery
For Configuration we needed →
Centralized configuration storage
Consistent store
Audit of configuration changes
Versioning for quick reverts
Easy to deploy and manage
53. Configuration and Service Discovery
For Service Discovery we needed →
Decoupled from application code
Health checks
Easy to Scale Out
Easy to deploy and manage
55. Configuration and Service Discovery
We built ’Config-Service’ on top of
’Consul’
Configuration on nodes using Consul
Template & Envconsul
Installation on instances using
internal Debian package and repo
’Config-Service’ package takes care
of consul cluster configuration and
health check registration
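The talk does not show how 'Config-Service' performs health check registration. As a rough illustration, the sketch below builds the kind of JSON body that Consul's agent service registration endpoint (/v1/agent/service/register) accepts. The service name, address, port, health path, and check timings are all hypothetical, and nothing is sent over the network.

```python
import json
from typing import Any, Dict


def service_registration(name: str, address: str, port: int,
                         health_path: str = "/health") -> Dict[str, Any]:
    """Build a Consul-style service registration with an HTTP health check."""
    return {
        "ID": f"{name}-{address}-{port}",  # unique per instance
        "Name": name,                      # the name other services look up
        "Address": address,
        "Port": port,
        "Check": {
            "HTTP": f"http://{address}:{port}{health_path}",
            "Interval": "10s",             # assumed polling interval
            "Timeout": "2s",               # assumed check timeout
        },
    }


if __name__ == "__main__":
    # Hypothetical instance of a hypothetical service.
    print(json.dumps(service_registration("listing-api", "10.0.0.5", 8080), indent=2))
```

Once an instance is registered with a check like this, consumers such as Consul Template can render only the instances whose checks are passing.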
57. Configuration Management
Git repository to manage
configuration
Filename is the key, content is the
value
Single source of truth
Audit log of changes
Easy reverts and versioning (just use
git revert)
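The "filename is the key, content is the value" convention can be sketched as a directory walk. This is a minimal illustration, assuming keys are paths relative to the repository root and values are the trimmed file contents; the function name and the sample keys are hypothetical, and the step that pushes the result into the key-value store is left out.

```python
import tempfile
from pathlib import Path
from typing import Dict


def load_config_tree(root: Path) -> Dict[str, str]:
    """Map each file path relative to `root` (the key) to its text content (the value)."""
    entries = {}
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip directories and git's own metadata.
        if path.is_file() and ".git" not in relative.parts:
            entries[relative.as_posix()] = path.read_text().strip()
    return entries


if __name__ == "__main__":
    # Build a throwaway repository layout with hypothetical keys.
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        (repo / "search").mkdir()
        (repo / "search" / "timeout_ms").write_text("250\n")
        (repo / "feature_flags").write_text("new_checkout=on\n")
        for key, value in load_config_tree(repo).items():
            print(f"{key} = {value}")
```

Because the repository is the single source of truth, reverting a commit and re-running a sync of this kind would restore the previous values.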
61. Service Discovery
Named discovery
Loose coupling
Auto failover
Load balancing
Auto scaling on CPU usage /
Number of Requests
Node Maintenance
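Named discovery, load balancing, and auto failover can be illustrated with a toy in-memory registry. This is a sketch of the idea only, not how Config-Service or HAProxy implement it; the class, the service name, and the addresses are hypothetical.

```python
from typing import Dict


class ServiceRegistry:
    """Toy registry: look services up by name, round-robin over healthy instances."""

    def __init__(self) -> None:
        self._instances: Dict[str, Dict[str, bool]] = {}  # name -> {address: healthy}
        self._counters: Dict[str, int] = {}

    def register(self, name: str, address: str) -> None:
        self._instances.setdefault(name, {})[address] = True

    def set_health(self, name: str, address: str, healthy: bool) -> None:
        """Record the latest health check result for one instance."""
        self._instances[name][address] = healthy

    def resolve(self, name: str) -> str:
        """Return the next healthy address for `name`; unhealthy ones are skipped."""
        healthy = [addr for addr, ok in self._instances.get(name, {}).items() if ok]
        if not healthy:
            raise LookupError(f"no healthy instance registered for {name!r}")
        index = self._counters.get(name, 0)
        self._counters[name] = index + 1
        return healthy[index % len(healthy)]


if __name__ == "__main__":
    registry = ServiceRegistry()
    registry.register("search", "10.0.0.1:9200")
    registry.register("search", "10.0.0.2:9200")
    print(registry.resolve("search"), registry.resolve("search"))
    registry.set_health("search", "10.0.0.1:9200", False)  # simulated failure
    print(registry.resolve("search"))
```

Callers only know the name "search", which is the loose coupling the slide refers to: instances can be added, drained for maintenance, or fail without any change on the calling side.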
64. Auto Scaling
Pay as you go, lower cost
Better fault tolerance
Availability zone failures
Handle sudden increases in traffic (especially at midnight!)
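One common way to scale on a utilisation target is a proportional rule of the kind documented for the Kubernetes Horizontal Pod Autoscaler: scale the instance count by the ratio of observed to target utilisation. The sketch below assumes that rule; the talk does not state which policy or limits were configured, so the numbers are illustrative.

```python
import math


def desired_replicas(current: int, current_cpu: float, target_cpu: float,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Proportional scaling: grow or shrink so average CPU moves towards the target."""
    wanted = math.ceil(current * current_cpu / target_cpu)
    # Clamp to the configured bounds so a spike cannot scale without limit.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 10 instances at 90% CPU with a 60% target.
    print(desired_replicas(10, 90, 60))
    # The same fleet at 30% CPU shrinks, but not below the floor.
    print(desired_replicas(10, 30, 60, min_replicas=8))
```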
70. Key Take Away
Assume things will
break
Set Convention
Script everything
Use deb/rpm packages
Instance groups for
stateless services
More Cattle, less Pets
73. Kubernetes
Partial Kubernetes deployment since
Oct, 2016
Full Production deployment since
Nov, 2016
Using Google Container Engine
30+ deployments
500+ containers (At Peak)
Autoscale on CPU targets
Not all services onboarded yet
76. Kubernetes
We don’t use K8S Ingress/Service
Config-Service (consul) as
DaemonSet
Containers get registered on
Config-Service (NodePort) from
health check
No change in existing architecture
needed
Service discovery from
Internal/External HA Proxy still
works
79. Kubernetes
’Config-Service’ allows us to have hybrid model
Instance groups can coexist with Kubernetes
Recovery mechanism / Transitioning
Instance group size set to zero (Fully on K8S)
87. Deployment Pipeline
Jenkins Pipeline
Pipeline triggers jenkins jobs
3 Clicks to Deploy
Approval Steps
Jobs to pause, resume or
revert deployment
Tracked in Slack channels
Soon to be transformed to
CI/CD
93. Monitoring & Alerting
Monitoring is critical
Know your Infrastructure
Capture everything, always
Use Proper tools
Prometheus (with
exporters)
ELK
Sentry
StatsD
NewRelic
OpsGenie
Pingdom
Identify Retention
97. Monitoring & Alerting
Bare minimum required metrics→
Load Average
CPU percent
Memory Available
Network Bandwidth
Network Connections
Disk IOPS
Disk Usage
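A few of these metrics can be read with the Python standard library alone, which gives a feel for what an exporter collects. This is a rough sketch, not the tooling from the talk: it covers only load average, available memory (Linux /proc/meminfo), and disk usage. CPU percent, network, and IOPS need two samples and a delta, so they are left out.

```python
import os
import shutil
from typing import Dict, Optional


def read_meminfo_kb(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Parse /proc/meminfo into {field: kilobytes}; empty on non-Linux systems."""
    values = {}
    try:
        with open(path) as handle:
            for line in handle:
                key, _, rest = line.partition(":")
                fields = rest.split()
                if fields and fields[0].isdigit():
                    values[key] = int(fields[0])
    except OSError:
        pass
    return values


def collect_basic_metrics(mount_point: str = "/") -> Dict[str, Optional[float]]:
    """Take one sample of load average, available memory, and disk usage."""
    try:
        load_1m: Optional[float] = os.getloadavg()[0]
    except (AttributeError, OSError):
        load_1m = None  # not available on this platform
    disk = shutil.disk_usage(mount_point)
    disk_used = round(100 * disk.used / disk.total, 1) if disk.total else None
    return {
        "load_average_1m": load_1m,
        "memory_available_kb": read_meminfo_kb().get("MemAvailable"),
        "disk_used_percent": disk_used,
    }


if __name__ == "__main__":
    for name, value in collect_basic_metrics().items():
        print(f"{name}: {value}")
```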
108. Monitoring & Alerting
’Config-Service’ logs auto
failover
Slack for notifications
On Call
Avoid alert blindness
Keep links handy
Schedule jobs
Automate
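One common way to avoid alert blindness is to page only when a threshold is breached for several consecutive samples, so short spikes do not wake anyone up. The sketch below assumes that approach; the threshold and window are illustrative, and the talk does not say which rules were actually used.

```python
from typing import Sequence


def should_alert(samples: Sequence[float], threshold: float, consecutive: int) -> bool:
    """True only if the most recent `consecutive` samples all exceed `threshold`."""
    if consecutive <= 0 or len(samples) < consecutive:
        return False
    return all(value > threshold for value in samples[-consecutive:])


if __name__ == "__main__":
    cpu_spike = [50, 95, 60, 70]       # one-off spike: stay quiet
    cpu_sustained = [50, 92, 95, 97]   # sustained breach: page the on-call
    print(should_alert(cpu_spike, threshold=90, consecutive=3))
    print(should_alert(cpu_sustained, threshold=90, consecutive=3))
```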
114. Future Plans
Hire more engineers!
Move more services to Kubernetes
Move away from PG (don’t need ACID)
Transition to Microservices
Improve monitoring further
More fault tolerance
121. Microservices
Golang (go-kit inspired)
Cassandra for storage
ElasticSearch for lookup
gRPC for communication
Hystrix for real time
monitoring
Zipkin for request tracing
Prometheus for metrics
123. Flash Sale
Ultimate test of scalability
Hard to judge peak
Throughput can multiply in
short time
Planned for 2x throughput
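Planning for 2x throughput translates into a node count once a per-node capacity is known. The sketch below uses the 2000 requests per second and the 2x multiplier from the slides; the per-node capacity of 150 requests per second and the 70% target utilisation are hypothetical inputs chosen only to show the arithmetic.

```python
import math


def nodes_needed(baseline_rps: float, multiplier: float, per_node_rps: float,
                 target_utilization: float = 0.7) -> int:
    """Nodes required to serve the planned peak while staying under target utilisation."""
    peak_rps = baseline_rps * multiplier
    usable_rps_per_node = per_node_rps * target_utilization
    return math.ceil(peak_rps / usable_rps_per_node)


if __name__ == "__main__":
    # 2000 rps baseline and 2x peak are from the slides; 150 rps per node is assumed.
    print(nodes_needed(baseline_rps=2000, multiplier=2, per_node_rps=150))
```

With these assumed inputs the planned peak is 4000 requests per second and each node is budgeted for 105, which gives 39 nodes.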
124. Flash Sale - Latency
129. Flash Sale
Cache read calls at multiple layers
Upsized ES nodes, Eventually
replacing entire cluster
Local SSD PG slaves with RAID 0
(100k IOPS)
Identify network bottlenecks
Recheck ulimit and connection limits
Build and keep SOP handy
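Caching read calls at multiple layers can be sketched as a read-through lookup over an ordered list of caches. This is an illustration under assumptions: both layers are simple in-process TTL caches standing in for, say, a local cache and a shared one, and the names and TTLs are hypothetical. A value of None is treated as a miss, so None results are not cached.

```python
import time
from typing import Any, Callable, Dict, Sequence, Tuple


class TTLCache:
    """Tiny cache whose entries expire `ttl_seconds` after being set."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Any:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (value, self._clock() + self._ttl)


def read_through(key: str, layers: Sequence[TTLCache], load: Callable[[str], Any]) -> Any:
    """Check each layer in order; on a full miss, load once and fill every layer."""
    for depth, layer in enumerate(layers):
        value = layer.get(key)
        if value is not None:
            for upper in layers[:depth]:
                upper.set(key, value)  # backfill the faster layers above the hit
            return value
    value = load(key)
    for layer in layers:
        layer.set(key, value)
    return value


if __name__ == "__main__":
    local, shared = TTLCache(ttl_seconds=5), TTLCache(ttl_seconds=60)
    slow_lookup = lambda key: f"listing:{key}"  # stands in for a database read
    print(read_through("42", [local, shared], slow_lookup))
    print(read_through("42", [local, shared], slow_lookup))  # served from cache
```

During a traffic spike most reads for hot items stop at the first layer, so the data store only sees one load per key per TTL window per layer.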
130. Flash Sale - Standard Operating Procedure
131. Infrastructure Team at Carousell
400+ servers
Thousands of requests per second
Production Issues get looked after in < 5 Mins
Uptime of 99.95%
Failures don’t result in outages
All thanks to Planning, Monitoring and Automation
139. Take Away
Isolate stateful and stateless components
Isolating compute is equally important
Choose data stores carefully, you won’t be changing them
frequently
Use Abstractions only after understanding them
Perform Root Cause Analysis not just workarounds/isolations
Identify bottlenecks
Monitor everything
Blame CODE not CODER
140. Thank You
Q&A
P.S. we are hiring http://careers.carousell.com/