Konstantin Gredeskoul

CTO, wanelo.com
DevOps without the “Ops”
A fallacy? A dream? A ________?
@kig
@kigster
How Wanelo handles thousands of writes per second with
99.97% uptime without an operations team
Proprietary and
Wanelo is the digital mall of the future, and a place to find
the most amazing products.
People often ask…
What are you running on?
No really, what’s your stack?
Are you on Mongo?
No!?!?!??
or…
You’re running Ruby? WTF? It’s slow!
You’re running Erlang? WTF? It’s in Swedish!
etc.
Backend Stack & Key Vendors
■ MRI Ruby, JRuby, Sinatra, Ruby on Rails
■ PostgreSQL, Solr, redis, twemproxy, memcached, nginx, haproxy, pgbouncer, elastic search
■ Joyent Cloud, SmartOS, Manta Object Store: ZFS, ARC Cache, superb IO, SMF, Zones, DTrace, humans
■ DNSMadeEasy, MessageBus, Chef, SiftScience
■ LeanPlum, MixPanel, Graphite analytics
■ AWS S3 + Fastly CDN for user / product images
■ Circonus, NewRelic, statsd, Boundary, PagerDuty, nagios, SumoLogic: monitoring, alerting, error reporting
How much traffic does your app get?
• If you are building an internal website in Rails, you’d be lucky to get
100 RPM (requests per minute): your users are only a limited set of employees
• Semi-popular sites with up to a few hundred concurrent users
can expect about 1K-2K RPM
• When you cross the 100K RPM mark, you have joined the “small big boys” :)
• When you are Pinterest, Facebook or Twitter… you are probably
doing 1-10M RPM
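To put these tiers on a per-second scale, here is a small conversion sketch. It assumes RPM means requests per minute; the tier labels are shorthand for the bullets above, not official categories.

```python
def rpm_to_rps(rpm):
    """Convert requests per minute to requests per second."""
    return rpm / 60.0


# Tiers named on the slide; the labels are shorthand for illustration.
tiers = {
    "internal app": 100,
    "semi-popular site": 2_000,
    "small big boys": 100_000,
    "largest sites (upper bound)": 10_000_000,
}

for label, rpm in tiers.items():
    print(f"{label}: {rpm:,} RPM is about {rpm_to_rps(rpm):,.1f} requests/second")
```

At 100K RPM the app is answering well over a thousand requests every second, which is where per-request latency starts to dominate capacity planning.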
So what is this talk about?
• Review Operations, DevOps, and the Cloud, and how
the new technologies are changing the landscape
• Learn some key points and patterns that
dramatically reduce stress and pain associated
with running a site, particularly on Ruby and/or Rails
• Discuss whether modern startups really need a dedicated
operations team, and if so, at what point
Let’s start with the basics
DevOps
What the heck is DevOps?
• DevOps is a software development method that stresses
communication, collaboration, integration, automation and
measurement of cooperation between software developers and
other information-technology (IT) professionals. [1]
• “Today, many organizations are confused on what DevOps means
for them..”[2]
1. Wikipedia article on DevOps
2. Forrester: “Eliminate DevOps Myths With Situational-Awareness-Based Performance”. John Rakowski, October 10, 2014
DevOps, however, works…
“…Efficient teams are deploying code 30 times
more frequently with 50 percent fewer failures in
2014…” [3]
“…DevOps practices correlate strongly with high
organizational performance” [3]
3. Source: PuppetLabs “State of DevOps Report”, 2014
Traditional “Heavy” Agile
• Traditional Ops responsibilities were often in conflict
with product development: stability versus change.
[Diagram: Product, Dev, QA, Operations]
Traditional Operations
• Uptime, stability and reliability
• On-call, fixing site at night
• Backups and disaster recovery
• Security, patching, OpenSSL :)
• Hardware
• Networking
• Colocation / DC
“The Cloud” changed things
• Uptime, stability and reliability
• On-call, fixing site at night
• Backups and disaster recovery
• Security, patching, OpenSSL :)
• Hardware
• Networking
• Colocation / DC
So the Cloud is a big part of
what makes DevOps possible
Let’s talk about a simpler and more
friendly way to build and deploy
software.
Early Company Goals (based on Wanelo)
• Maximize iteration speed
• Practice “aggro-agile”™
• Scale up as we go, keep the app fast
• Break things, learn, move on
• Enable, empower and inspire our team
• Remain in control of our infrastructure
And while moving really fast…
We just never hired Ops
But we did hire several brilliant engineers who
actually enjoyed infrastructure / platform work.
Except they approach it like … code.
Not having Ops meant
• We had to deploy our app to the cloud, and learn
how to provision the nodes we needed, as well as:
• How to provision load balancers and app servers
• How to configure new Solr masters and replicas
• How to install and tune PostgreSQL databases
• memcaches, redis shards, twemproxy, haproxy
Fast forward to today
• 100% cloud hosted (Joyent Cloud)
• 100% automated (Chef)
• 10,000% traffic growth in 6 months and survived
• 99.97% uptime (without trying very hard)
• on call engineers get 1-2 pages per week
• 80% of engineers are on call rotation, including
iOS & Android developers
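For scale, the uptime figure can be turned into a downtime budget. This is plain arithmetic on the 99.97% number above, assuming a 30-day month and a 365-day year.

```python
def allowed_downtime_minutes(uptime_percent, period_hours):
    """Minutes of downtime implied by an uptime percentage over a period."""
    return (1 - uptime_percent / 100.0) * period_hours * 60


per_month = allowed_downtime_minutes(99.97, 30 * 24)   # 30-day month
per_year = allowed_downtime_minutes(99.97, 365 * 24)   # non-leap year
print(f"{per_month:.1f} min/month, {per_year / 60:.1f} hours/year")
```

That works out to roughly 13 minutes of downtime per month, or about 2.6 hours per year.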
Still no “Ops” team, but plenty of Ops work
How?
1. Automation and Deployment
• Infrastructure is a first class citizen
• Pairs deliver user stories which include automation
• Did I mention we pair program? It rocks!
• We run Chef continuously in production
• I want to trust my tools, and if they break, fix them
• Partition staging and production environments
Incremental Deployment
• Roll code out everywhere, restart 2% of servers
• Watch errors, latency, other anomalies
• When satisfied continue rolling all servers
• Ensure old and new code can co-exist
• Ensure no “drop/rename” migrations happen on live tables
• Ensure no exclusive locking migrations (e.g. build indexes with
“create index concurrently”)
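The restart-a-few, watch, then continue loop above can be sketched as follows. This is an illustration, not Wanelo's deploy tooling: `restart` and `healthy` are hypothetical callbacks standing in for the real restart command and the error / latency checks.

```python
def rolling_restart(servers, restart, healthy, canary_percent=2):
    """Restart a small canary batch first; continue only if it looks healthy.

    `restart(server)` and `healthy(servers)` are hypothetical callbacks
    standing in for real deploy tooling and monitoring checks.
    Returns the list of servers that were restarted.
    """
    # Ceiling division, with at least one canary for small fleets
    canary_count = max(1, -(-len(servers) * canary_percent // 100))
    canaries, rest = servers[:canary_count], servers[canary_count:]

    for server in canaries:
        restart(server)
    if not healthy(canaries):
        return list(canaries)  # stop here; old code keeps serving on the rest

    for server in rest:
        restart(server)
    return list(servers)
```

Stopping after the canary batch is only safe because old and new code can co-exist, which is why the migration rules above matter.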
2. Fault tolerant infrastructure
• Ensure aggressive client timeouts
• Achieving fault tolerance today is much cheaper
than ever before! It’s a crime not to do it :)
• Put haproxy in front of everything, literally
• Stateless services only
• Put makara, twemproxy, Dalli in front of
database, redis and memcached
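As one illustration of an aggressive client timeout, here is a minimal sketch using Python's standard `socket` module. The function name and the 250 ms default are assumptions chosen for the example; the right value depends on the service being called.

```python
import socket


def fetch_with_timeout(host, port, timeout_s=0.25):
    """Return bytes from the service, or None if it does not answer in time."""
    try:
        # The timeout applies to the connect and to every later recv/send
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            return conn.recv(1024)
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts
        return None
```

Returning `None` quickly lets the caller degrade (skip a widget, serve a cached value) instead of tying up a worker on a slow dependency.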
Let’s look at a couple of recipes for resilience
Resilience keeps you sleeping at night
Where is everything? HAProxy + Chef Search + Stateless
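App talks to local HAProxy ports, e.g. http://127.0.0.1:8000 and http://127.0.0.1:8001
[Diagram: within one virtual zone / server, the App connects to a local HAProxy, which forwards to Backend 1 and Backend 2 of each service: Solr, Web Service, ElasticSearch]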
This pattern allows us to have one place that
knows about everything else, in Chef
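A rough sketch of the idea: a list of backends (which Chef search would supply in the setup described here) is rendered into an HAProxy `listen` section bound to a local port. The function, service name and IP addresses are made up for illustration; only the HAProxy keywords (`listen`, `bind`, `balance`, `server ... check`) are real.

```python
def render_haproxy_listen(service, local_port, backends):
    """Render one HAProxy `listen` section for a service.

    `backends` is a list of (name, host, port) tuples. In the setup
    described above this list would come from Chef search; here it is
    passed in by hand.
    """
    lines = [
        f"listen {service}",
        f"    bind 127.0.0.1:{local_port}",
        "    balance roundrobin",
    ]
    for name, host, port in backends:
        # `check` turns on health checks so dead backends leave rotation
        lines.append(f"    server {name} {host}:{port} check")
    return "\n".join(lines)


# Hypothetical Solr replicas; the addresses are made up for illustration.
print(render_haproxy_listen("solr", 8000, [
    ("solr1", "10.0.0.11", 8983),
    ("solr2", "10.0.0.12", 8983),
]))
```

The app only ever needs to know `127.0.0.1:8000`; adding or removing a backend becomes a configuration change rather than an application change.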
What the hell is Makara?
• Makara is a simple database routing tool for
ActiveRecord that has been in production at
Wanelo and TaskRabbit for years
• https://github.com/wanelo/makara (PostgreSQL)
• https://github.com/taskrabbit/makara (MySQL)
• Was the simplest library to understand and to port
• Worked in the multi-threaded environment of Sidekiq background workers
• Automatically retries if a replica goes down
• Load balances with weights
• Was running in production
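The two behaviors listed above, weighted load balancing and retrying when a replica is down, can be sketched generically. This is not Makara's API or implementation; `ReplicaPool` and its arguments are hypothetical names used only to show the logic.

```python
import random


class ReplicaPool:
    """Weighted read routing with retry, loosely modeled on the bullets above.

    This is not Makara's API; `replicas` maps a hypothetical replica name
    to a (weight, query_function) pair.
    """

    def __init__(self, replicas, rng=None):
        self.replicas = dict(replicas)
        self.rng = rng or random.Random()

    def query(self, sql):
        candidates = dict(self.replicas)
        while candidates:
            names = list(candidates)
            weights = [candidates[n][0] for n in names]
            # Higher weight means proportionally more of the read traffic
            name = self.rng.choices(names, weights=weights, k=1)[0]
            try:
                return candidates[name][1](sql)
            except ConnectionError:
                # Replica is down: drop it for this call and retry elsewhere
                del candidates[name]
        raise ConnectionError("all replicas are down")
```

A real router would also remember failed replicas for a while instead of probing them on every query.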
Replicate everything that replicates
[Diagram: web / API requests send reads through the App's HAProxy to several Solr replicas; writes go through the background worker queue to the Solr master]
Degraded state, but still up! Many replicas can be down
[Diagram: the same setup with some Solr replicas down; reads continue against the remaining replicas and writes still reach the Solr master]
Replicas are great because they are easy to add
and often ok to ignore when they die/reboot/etc.
Don’t buy an expensive load balancer
[Diagram: example.com resolves to two load balancers (haproxy + nginx) at 200.200.234.145 and 200.200.234.146, in front of the app servers]
You can build a decent one with DNS
[Diagram: the DNS provider pings each load balancer to check that it is up]
DNS auto-failover is offered with some
enterprise DNS services, e.g. from DNSMadeEasy
When an LB goes down, it is removed
from the DNS pool
[Diagram: the DNS provider pings both load balancers, 200.200.234.145 and 200.200.234.146]
It works pretty well
[Diagram: the ping to the dead load balancer fails, so example.com resolves only to the surviving load balancer]
This works best with a
short TTL
Configure LBs in pairs, each as
the other's failover, to
account for network
partitioning
This pattern allows us to tolerate reboots and
maintenance with minimal effect on our users
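The pool-pruning rule can be sketched in a few lines. This is an illustration of the logic only: `is_healthy` is a hypothetical probe standing in for the DNS provider's ping, and keeping the full pool when every probe fails is an assumption of this sketch, not something stated in the talk.

```python
def dns_pool(load_balancers, is_healthy):
    """Return the load balancer IPs that should stay in the DNS record.

    `is_healthy(ip)` is a hypothetical probe (for example, the DNS
    provider's ping). If every probe fails, keep the full pool rather
    than publish an empty record: the monitor itself may be partitioned.
    """
    healthy = [ip for ip in load_balancers if is_healthy(ip)]
    return healthy or list(load_balancers)


pool = ["200.200.234.145", "200.200.234.146"]
print(dns_pool(pool, lambda ip: ip != "200.200.234.146"))
```

With a short TTL, clients pick up the pruned record quickly, which is what keeps a load balancer reboot nearly invisible to users.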
Failover to the overflow pattern
Two queues: large primary, small secondary
The primary distributes jobs to a large set of specialized
workers, assigned to specific queues
[Diagram: the App enqueues through HAProxy, with the Redis primary queue as Primary Backend 1 feeding the primary background workers, and the Redis failover as Failover Backend 2 feeding the "overflow" workers]
The failover queue has only a small number of overflow
workers, but they will accept any work
During spikes in traffic, this pattern allows our
application to continue enqueuing jobs when the
primary is overwhelmed
This is useful in situations when you can’t easily
round robin between multiple shards.
Example: Sidekiq with a “Unique Job” extension.
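In the setup described here the switch from primary to failover happens in HAProxy. The sketch below shows the same decision in application code, purely to make the logic explicit; `QueueUnavailable` and the `push(job)` clients are hypothetical.

```python
class QueueUnavailable(Exception):
    """Raised by a queue client when it cannot accept a job in time."""


def enqueue_with_overflow(job, primary, overflow):
    """Try the primary queue first; fall back to the overflow queue.

    `primary` and `overflow` are hypothetical clients exposing `push(job)`
    that raise QueueUnavailable on timeout or connection failure.
    Returns the name of the queue that accepted the job.
    """
    try:
        primary.push(job)
        return "primary"
    except QueueUnavailable:
        overflow.push(job)  # overflow workers accept any job type
        return "overflow"
```

Because the overflow workers accept any job, the overflow queue can stay small and still absorb whatever the primary could not take during a spike.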
3. Alert only on what’s important
• Nagios is great for visibility
• Not great for knowing when to drop everything
because the site is on fire
• Some tools allow alerting on the first derivative of an
observed metric.
• This is what we want: a rapid drop (or increase) in a
key metric should generate an alert.
Alerting examples
• We never page on “host down”. Because, who cares?
The host is likely redundant, and will be back. …Probably.
• We only page for things like “sudden drop in product
saves per second”, or a spike in error rate, etc.
• The monitoring / alerting tool Circonus supports this
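Alerting on the first derivative can be sketched as follows. This is a generic illustration, not how Circonus implements it; the sample values and the threshold are invented.

```python
def rate_of_change(samples):
    """First derivative between the last two (timestamp_s, value) samples."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


def should_page(samples, max_change_per_s):
    """Page when the metric moves faster than `max_change_per_s`, up or down."""
    return abs(rate_of_change(samples)) > max_change_per_s


# Hypothetical "product saves per second" readings, one per minute.
saves = [(0, 120.0), (60, 118.0), (120, 40.0)]
print(should_page(saves, max_change_per_s=0.5))
```

A fixed threshold on the value itself would need retuning as traffic grows; a threshold on the rate of change catches the sudden cliff regardless of the current baseline.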
4. Obsessive monitoring
• Modern tools offer unprecedented visibility
• Real time application monitoring
• Real time business stats monitoring
• Real time network monitoring
• Dashboards, TV Monitor, alerts
• Real time, real time, real time.
Systems Status: Dashboard Monitoring & Graphing with Circonus, NewRelic, statsd, nagios
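Feeding business stats into statsd takes very little code. The sketch below formats a counter in the statsd line protocol and sends it over UDP; the metric name `product.saves` is hypothetical, and 8125 is the conventional statsd port rather than anything confirmed by the talk.

```python
import socket


def statsd_counter(name, value=1):
    """Format a counter in the statsd line protocol: `name:value|c`."""
    return f"{name}:{value}|c".encode("ascii")


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP, so metrics never block the request path."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical business metric name, printed here instead of sent.
print(statsd_counter("product.saves"))
```

Calling `send_metric(statsd_counter("product.saves"))` wherever a save succeeds is enough to get a real-time graph of the business metric.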
5. Cloud vendor is your partner
• We get phenomenal customer support from Joyent
• Our Cloud Partner, in a way, is our Ops
• Joyent is innovative in that they develop and run
their own cloud stack: from the OS layer (SmartOS)
to the data center management software
• They offer a unique option to take our “cloud”
in-house when that time comes
6. DevOps, really, is just code
• Hire folks who write code, so that they don’t have to
repeat the same task twice
• Everyone will be happier that way.
So here is how to reduce stress!
1. Insist on 100% automation
2. Deploy fault tolerant patterns wherever possible
3. Page only on what’s important to the business
4. Monitor everything else obsessively
5. Choose a cloud provider that can be your partner
6. Infrastructure work is software engineering
Thanks!
slideshare.net/kigster

github.com/kigster

github.com/wanelo
github.com/wanelo-chef


wanelo technical blog
building.wanelo.com
