4. Flowdock, a team collaboration app with software developers as the primary target audience.
Right-hand side: chat; left-hand side: an inbox or activity stream for your team.
If you’ve read a Node.js tutorial, you probably already know the kind of architecture this needs.
5. Facts
• Single page JavaScript front-end
• WebSocket based communication layer
• Three replicated databases
• Running on dedicated servers in Germany
• 99.98% availability
WebSockets == no third-party load balancers/PaaS for us
99.99% according to the CEO, but I’m being conservative
6. Goal: beat your hosting provider in uptime
Maintain good uptime on unreliable hardware.
7. We don’t want to wake up at night to fix our app, like the guy in this picture. The founders previously ran a hosting company.
8. This is not an exact science; every app is different.
10. Flowdock 2010
[Stack diagram: Apache in front of the Rails app and the messages backend, on top of MongoDB and PostgreSQL]
We haven’t always been doing very well. Simple stack, but the messaging part quickly became hairy: it had HTTP streaming, Twitter integration and an e-mail server. Lots of brittle state.
11. Divide and Conquer
A nice strategy for building your SOA, sorting lists and taking over the world.
12. [Architecture diagram: GeoDNS → Stunnel → HAproxy, routing HTTP and WebSocket traffic to separate RSS, IRC, Streaming, API, Rails and Message Backend processes, backed by Redis, MongoDB and PostgreSQL]
These are all different processes.
More components, but this has made it easy to add new features to individual components.
16. Chef
Infrastructure as (Ruby) Code
Chef lets you automate server configuration with Ruby code. A centralized Chef server holds the configuration; nodes communicate with it and pull updates.
17. Chef at Flowdock
• Firewall configuration
• Distribute SSH host keys
• User setup
• Join mesh-based VPN
• And app/server specific stuff
Firewall setup is based on an IP whitelist: only nodes registered in Chef can access private services.
Distributing SSH host keys prevents MITM attacks.
We have a mesh-based VPN, which is configured automatically from Chef data.
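A minimal sketch of how such a whitelist could be driven by Chef data (the recipe below is illustrative; the template and resource names are hypothetical, not Flowdock’s actual code):

# Find all nodes the Chef server knows about in this environment
peers = search(:node, "chef_environment:#{node.chef_environment}")

# Render an iptables rules file that only allows those peers in
template "/etc/iptables.d/whitelist.rules" do
  source "whitelist.rules.erb"
  owner "root"
  mode "0600"
  variables :ips => peers.map { |n| n["ipaddress"] }.compact.sort.uniq
  notifies :run, "execute[reload-firewall]"
end

execute "reload-firewall" do
  command "iptables-restore < /etc/iptables.d/whitelist.rules"
  action :nothing
end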
20. cookbooks/flowdock/oulu.rb
include_recipe "flowdock::users"

package "ruby"

# Write one envdir file per configuration variable
%w{port listen_to flowdock_domain}.each do |e|
  template "#{node[:flowdock][:oulu][:envdir]}/#{e.upcase}" do
    source "envdir_file.erb"
    variables :value => node[:flowdock][:oulu][e]
    owner "oulu"
    mode "0600"
  end
end

# Supervise the process with runit
runit_service "oulu" do
  options :use_config => true
end

The recipe for our IRC server.
21. roles/rails.rb
name "rails"
description "Rails Box"

run_list(
  "recipe[nginx]",
  "recipe[passenger]"
)

override_attributes(
  passenger: { version: "3.0.7" }
)

A role, defined in the Ruby DSL.
Each node can be assigned any number of roles.
Override attributes can be used to override recipe attributes.
24. Managing Chef cluster
$ knife ssh 'role:qa' 'echo "lol"'
imaginary-server lol
qa-db1 lol
qa-db2 lol
Most useful command: triggering a Chef run on a set of servers.
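The trigger itself is the same knife ssh pattern; something like this (standard Chef usage, not taken from the deck):
$ knife ssh 'role:rails' 'sudo chef-client'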
25. Testing Chef Recipes
• Use Chef environments to isolate changes
• Run chef-client on throw-away VMs
• cucumber-chef
sous-chef could be used to automate the VM setup.
Our experience with cucumber-chef and sous-chef is limited.
You also need to monitor things, e.g. that Chef runs have finished on all nodes and that backups are really being taken.
26. Automatic Failover
Avoiding single points of failure
MongoDB works flawlessly since failover is built in, but how do we handle Redis?
27. HAproxy
TCP/HTTP load balancer with failover handling
HAproxy provides easy failover for Rails instances.
IP failover has less latency than a DNS-based solution, but we got DNS failover for free.
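A minimal haproxy.cfg sketch of that kind of setup (hostnames, ports and the check path are made up):

backend rails
  option httpchk GET /
  # When a health check fails, traffic moves to the remaining servers
  server rails1 10.0.0.11:8080 check
  server rails2 10.0.0.12:8080 check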
28. MongoDB has automatic failover built in
MongoDB might have many problems, but failover isn’t one of them. Drivers always stay connected to the current master.
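With the Ruby driver of that era, this only requires connecting to the replica set instead of a single host; the driver then follows whichever node is elected master (hostnames made up):

require 'mongo'

# mongo gem 1.x API: seed list of replica set members
conn = Mongo::ReplSetConnection.new(['db1.example.com', 27017],
                                    ['db2.example.com', 27017],
                                    ['db3.example.com', 27017])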
29. Redis and Postgres have replication, but failover is manual
Not only do you need to promote a new master automatically, you also need to change the application configuration.
32. require 'zk'

$queue = Queue.new
zk = ZK.new # connects to localhost:2181 by default

# ZooKeeper watches are one-shot, so the watch is re-armed on every event
zk.register('/hello_world') do |event|
  data = zk.get('/hello_world', watch: true).first
  # do stuff
  $queue.push(:event)
end
zk.stat('/hello_world', watch: true) # arm the initial watch

zk.create('/hello_world', 'sup?')
$queue.pop # handle local synchronization

zk.set('/hello_world', 'omg, update')
$queue.pop
Using the high-level zk gem; the block is run every time the value is updated.
The ZK gem also has locks and other primitives implemented.
Every ZooKeeper write has to be agreed on by a majority of the servers; reads are eventually consistent.
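The locks mentioned above look roughly like this (the lock name is made up):

zk.with_lock('failover-demo') do
  # only one connected process at a time runs this block
end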
36. Redis Failover
[Diagram: apps watch ZooKeeper for the current master; Node Managers monitor the Redis nodes and update ZooKeeper when failover happens]
Our apps might not use redis_failover or read ZooKeeper directly; a script restarts the app when the ZooKeeper data changes.
HAproxy- or DNS-based solutions are also possible, but this gives us more control over the app restart.
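A sketch of that restart script, using the same zk gem (the znode path and service name are hypothetical):

require 'zk'

zk = ZK.new('zk1:2181,zk2:2181,zk3:2181')

# Restart the app via runit whenever the master pointer changes
zk.register('/redis/master') do |event|
  zk.stat('/redis/master', watch: true) # re-arm the one-shot watch
  system('sv', 'restart', 'app')
end
zk.stat('/redis/master', watch: true) # arm the initial watch

sleep # keep the watcher process alive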
37. Postgres failover with pgpool-II and ZooKeeper
pgpool manages the PG cluster; queries can be distributed to slaves.
I’m afraid of pgpool: the configuration and monitoring scripts are really scary.
38. Postgres Failover
[Diagram: App → pgpool → PG master and slave; a pgpool monitor provides redundancy via ZooKeeper]
ZooKeeper-based pgpool monitoring is used to make pgpool itself redundant.
If pgpool fails, the app needs to reconnect to the new server.
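For reference, the pgpool-II side is configured with parameters along these lines (an excerpt-style sketch; hostnames and the script path are made up):

backend_hostname0 = 'pg-master'
backend_port0 = 5432
backend_hostname1 = 'pg-slave'
backend_port1 = 5432
load_balance_mode = on  # distribute read queries to slaves
failover_command = '/usr/local/bin/promote_standby.sh %d %H'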
39. Zoos are kept
A similar scheme can be used for other master-slave replication setups, e.g. handling failover of our Twitter integration.
REMEMBER TO TEST
40. Test your failover
You might only need failover a few times a year.
I’m not sure everything we’ve built is top-notch, but even the complicated parts have had their one-time use cases.
41. Chef vs ZooKeeper
Chef: configuration files, server bootstrap
ZooKeeper: dynamic configuration variables, failover handling
Chef writes long configuration files; ZooKeeper only holds a few variables.
Chef bootstraps servers and keeps them up to date; ZooKeeper is used to elect master nodes in master-slave scenarios.
42. Mesh-based VPN between boxes
Encrypted MongoDB traffic between masters and slaves. It has saved the day a few times when there have been routing issues between data centers.
43. SSL endpoints in AWS
There were routing issues between our German ISP and Comcast. Moving the SSL front ends closer to the client fixed this and reduced latency: the front page loads 150 ms faster.
44. Winning
We don’t need to worry about waking up at night. The whole team could go sailing and be without internet access at the same time.
45. What have we learned?
46. WebSockets are cool, but make your life harder
Heroku, Amazon Elastic Load Balancer, CloudFlare and Google App Engine don’t work with WebSockets. If you only need to stream data to the client, HTTP event streaming (Server-Sent Events) is a better choice.
Decoupling had an instant effect on our uptime.
47. Let it crash
Make your app crash yourself; at least then you are there to fix things.