Mad scalability: Scaling when you are not Google
Abel Muiño
Abel Muiño
‣ Lead Software Engineer
‣ Tweets as @amuino
‣ In another life, co-owned 1uptalent.com, played with Docker and used AWS for everything.
Disclaimer
‣ Cabify is 5 years old
‣ I joined Cabify about 1.5 years ago to work on product
‣ What you will hear today might be
‣ 70% folklore / 30% experience
‣ Only about production
‣ Not applicable to other areas (data analytics)
Cabify
2011 2012 2013 2014 2015 2016
Completed Journeys
(Axis has no legend because NDA and stuff)
Backend committers
0
5
9
14
18
2011 2012 2013 2014 2015 2016
We are hiring!
(As if it wasn’t obvious from the charts)
Circadian rhythm
Prelude
???? - 2014
Cabify foundations
‣ Mostly Ruby, some Go
‣ Running on VPS
‣ No sysadmins (devops?)
‣ CouchDB
‣ Redis
‣ Home-grown metrics & monitoring (limited)
Servers
‣ 3 ⨉ Host servers
‣ Horizontally scalable
‣ Most services included (sidecars)
‣ Front + Back + Queue workers
‣ 1 ⨉ Realtime server
‣ Single Point of Failure
‣ Ansible for setting them up
(Diagram: VPS provider with load balancers in front of web1, web2, web3 and worker1; redis1, redis2, elastic, osrm and the realtime websocket server alongside)
CouchDB
‣ Used to be run in-house → Unreliable
‣ Moved to Cloudant
‣ Managed
‣ Bare metal servers
‣ Prerequisite for everything else: running in the same datacenter
‣ …because the network matters
Database of choice for Cabify
Pros
‣ Cheap servers
‣ Professional DB management
‣ Still cheaper than in-house staff
‣ Scales up by either
‣ Emailing Cloudant
‣ Deploying new VPSs
Cons
‣ Datacenter lock-in
‣ Scarce visibility on load
‣ Low VPS utilization (for some services)
Tl;dr: everything was fine
Until it wasn’t
2015 Road to bare metal
In 2014 we handled
7 times the load of 2013
Installed NewRelic
‣ Monitors our ruby stack
‣ Built custom adapters for API toolkit and CouchDB
‣ Golang not supported 😭
‣ Low hanging fruit for increasing performance
‣ Hint: Always contact a Sales Rep
‣ Bye bye home-grown monitoring! 👋
VPS provider
DDoSed
‣ Several times a week
‣ Cabify was unreachable
‣ VPSs were unreachable on the internal network
‣ Slow & bad support
‣ Reputation
‣ Solution: Level up!
Nobody ever got fired for choosing IBM
Moved to Bare Metal @ Softlayer
Same guys hosting our Cloudant cluster 👍
Mindset
Control the core, minimise work for everything else
Everything must go
(Diagram: the previous VPS-provider layout: load balancers, web1, web2, web3, worker1, realtime, redis1, redis2, elastic, osrm and the subscriber, all due for migration)
Load Balancer
‣ Multiple PoP (starting operations in several countries)
‣ CDN
‣ Supporting websockets
‣ … and Load Balancing
‣ Low TCO
‣ https://www.incapsula.com
Redis, ElasticSearch
‣ Same datacenter
‣ Completely managed
‣ Clustered / reliable
‣ RedisLabs
‣ Bonus: Memcached
‣ Qbox
OSRM
‣ Same datacenter
‣ Completely managed
‣ Enhanced dataset
‣ Google Maps & Places (with enterprise license)
‣ 2 / 3, good enough
Can do better?
Can we manage less infra?
(Diagram: Softlayer hosting web1, web2, web3, worker1, realtime and the subscriber, with Incapsula in front and Google, Redislabs, Qbox and Cloudant as managed services)
Subscriber
‣ Felt like reinventing the wheel
‣ Looked for battle-tested bus / queue / broker
‣ In the same datacenter
‣ Had previous experience with RabbitMQ
‣ CloudAMQP
Homebrew message bus / queue
Sidecars
‣ Every server could run Cabify
‣ All services installed
‣ Except Realtime (SPOF)
‣ Horizontal scaling
‣ Good server utilisation (bare metal servers are larger)
Make each host self-sufficient
Cut our own servers by 50%
Served 5 times more requests
(Diagram: Softlayer hosting host01, host02, host03 and realtime, with Incapsula in front and Google, Redislabs, Qbox, Cloudant and CloudAMQP as managed services)
Pros
‣ Same-datacenter latencies
‣ Only care about our product
‣ Still cheaper than in-house staff
‣ Scales up by either
‣ Emailing a provider
‣ Deploying new servers
‣ Good visibility on perf
Cons
‣ Datacenter lock-in
‣ Still no visibility on Golang perf
‣ Competing services on each server with different needs
‣ Fast & light HTTP requests
‣ Slow & heavy queue workers
‣ Debuggability
Tl;dr: everything was fine
Until it wasn’t
2016, pushing to the limit
In 2015 we handled
5 times the load of 2014
In 2016 we would invade LatAm
(new countries, cities, marketing…)
Bumps on the road
‣ Start seeing intermittent latency spikes on Cloudant
‣ Disable some services, get back on track
‣ Tied to peak hours
‣ We lived through these, but it was stressful
Be easy on the database
‣ Removed frequent N+1 query patterns
‣ Moved some queries to ElasticSearch
‣ Started caching more on Memcache
‣ Grew the cluster
‣ From 200ms to 100ms (average) 👏
(trying to sleep better)
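The N+1 fix above can be sketched in plain Ruby. This is a hypothetical illustration (the `Journey` struct, `DRIVERS` stand-in and function names are made up, not Cabify code): instead of one database round trip per journey, collect the ids and do a single bulk fetch.

```ruby
# Stand-ins for real models and the database; each DRIVERS access
# represents what would be a CouchDB round trip in production.
Journey = Struct.new(:id, :driver_id)
DRIVERS = { 1 => "Ana", 2 => "Luis" }

# N+1 pattern: one lookup per journey (N journeys => N queries).
def driver_names_n_plus_one(journeys)
  journeys.map { |j| DRIVERS[j.driver_id] }
end

# Batched pattern: collect the ids, fetch them once, then join in memory.
def driver_names_batched(journeys)
  ids = journeys.map(&:driver_id).uniq
  by_id = ids.each_with_object({}) { |id, h| h[id] = DRIVERS[id] } # one bulk fetch
  journeys.map { |j| by_id[j.driver_id] }
end

journeys = [Journey.new(10, 1), Journey.new(11, 2), Journey.new(12, 1)]
driver_names_batched(journeys) # => ["Ana", "Luis", "Ana"]
```

With CouchDB/Cloudant the bulk step would map to a single multi-key request rather than a loop of single-document GETs.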
RabbitMQ can’t cope
‣ We saturated the cluster CPU with moderate load
‣ Tied to us using tag-based routing
‣ Messages were delivered much later than expected
‣ Made changes to use simpler routing
‣ Is there anything simpler than RabbitMQ for simple routing? 🤔
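The "simpler routing" idea can be sketched without any broker. This is not RabbitMQ or NSQ code, just an in-memory Ruby illustration of the trade-off: tag/pattern routing must evaluate every binding on each publish, while direct key routing is a single hash lookup.

```ruby
# Minimal in-memory bus with direct (key-based) routing: each publish
# is an O(1) hash lookup, the property that motivated moving away from
# tag-based routing. Keys and payloads are illustrative.
class SimpleBus
  def initialize
    @queues = Hash.new { |h, k| h[k] = [] }
  end

  # Direct routing: the routing key is the queue name.
  def publish(key, message)
    @queues[key] << message
  end

  # Remove and return everything queued under a key.
  def drain(key)
    @queues[key].shift(@queues[key].length)
  end
end

bus = SimpleBus.new
bus.publish("journeys.completed", { journey_id: 42 })
bus.drain("journeys.completed") # => [{journey_id: 42}]
```

A tag-based exchange would instead match each message's tags against every consumer's binding pattern, which is the per-message CPU cost described above.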
Interlude
DynDNS goes down, and Cloudant uses them
We lose access to our database cluster's load balancer
Patched /etc/hosts with the actual IPs in 30 minutes
The right tool for the job
‣ CouchDB / Cloudant, not the best database for frequent updates
‣ Looking for alternatives to store fast-changing models
‣ RethinkDB
‣ Fast, easy to use, hosted options in same datacenter
‣ Streaming query updates
Expecting growth in line with previous years
Broke RethinkDB load balancer
Database stats were OK, but the LB couldn’t handle our rate
Slow support, no “enterprise” option
Decided to phase out RethinkDB
Wrote our first «database»
Simple in-memory store, backed by CouchDB
Update indexes on writes. All queries are indexed
Implemented in Golang, consumed from Ruby
Replaces RethinkDB, which replaced CouchDB
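The store described above (in-memory documents, indexes updated on every write, all queries indexed) can be sketched as follows. The real implementation was in Go and persisted to CouchDB; this is a minimal Ruby sketch with invented names (`MemStore`, the `:state` field), not the actual code.

```ruby
# In-memory document store with write-maintained secondary indexes.
# Every query is an index lookup; there are no scans.
class MemStore
  def initialize(indexed_fields)
    @docs = {}
    # field => (value => set of document ids)
    @indexes = indexed_fields.each_with_object({}) do |f, h|
      h[f] = Hash.new { |hh, k| hh[k] = {} }
    end
  end

  def put(id, doc)
    delete(id) if @docs.key?(id) # drop stale index entries first
    @docs[id] = doc
    @indexes.each { |field, idx| idx[doc[field]][id] = true }
    # The real store would also persist the write to CouchDB here.
  end

  def delete(id)
    doc = @docs.delete(id) or return
    @indexes.each { |field, idx| idx[doc[field]].delete(id) }
  end

  # Indexed query: hash lookups only.
  def where(field, value)
    @indexes.fetch(field)[value].keys.map { |id| @docs[id] }
  end
end

store = MemStore.new([:state])
store.put("j1", state: "active", driver: "d1")
store.put("j2", state: "done",   driver: "d2")
store.where(:state, "active") # => [{state: "active", driver: "d1"}]
```

Keeping the index update inside `put` is what makes "all queries are indexed" hold: a document can never be visible without its index entries.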
Cloudant latency spikes fixed!
Grow the cluster for the second time in the year
Load balancer hardware upgraded, problems gone
Also reduced the number of connections from Ruby
Relax the Sidecars
‣ Load on background workers interfering with serving http
‣ Split the servers:
‣ Front (ruby/golang http interface)
‣ Workers (ruby job queues, ruby background)
Remove RabbitMQ
Replace with NSQ
Nice mix of sidecar and discovery
Multiplied own servers by 3
Served 4 times more requests
(Diagram: Softlayer hosting host01-09, rt01-02 and work01-03, with Incapsula in front and Google, Redislabs, Qbox, Cloudant and CloudAMQP as managed services)
Pros
‣ Despite the problems, we had top-notch support from Cloudant
‣ Easy to scale out
‣ In-process database opened doors to new features
Cons
‣ Datacenter lock-in
‣ Still no visibility on Golang perf
Cabify @ 2017
In 2016 we handled
4 times the load of 2015
Hired our first
full-time sysadmin!
Taking ownership
Improve our infra
Own load balancers
‣ Still use Incapsula for its PoP
‣ Achieved much better load balancing
‣ 3 new dedicated servers
Better control & traceability
Plans for the future
Own redis cluster
‣ Migrating from Redislabs hosted to Redislabs Enterprise
‣ Hosted runs on virtual servers
‣ We rely heavily on Redis (and Memcached)
‣ 3 new dedicated servers
‣ WIP
Better control & traceability
Ruby → Elixir
‣ Fun to code with
‣ Higher performance
‣ Less memory
‣ Investment, about to release first service to production
Extract from Product
Dedicated teams and resources for specific components
Make the core of Cabify leaner
Thanks!
And sorry for the 60 slides
Questions?
Abel Muiño
@amuino
