4. Flowdock, a team collaboration app with software developers as the primary target audience.
Right-hand side: chat; left-hand side: an inbox or activity stream for your team.
If you’ve read a Node.js tutorial, you probably already know the kind of architecture this needs.
5. Facts
• Single page JavaScript front-end
• WebSocket based communication layer
• Three replicated databases
• Running on dedicated servers in Germany
• 99.98% availability
WebSockets == no third-party load balancers/PaaS for us
99.99% according to the CEO, but I’m being conservative
6. Goal: beat your hosting provider in uptime
Maintain good uptime on unreliable hardware.
7. We don’t want to wake up at night to fix our app, like the guy in this picture. The founders previously ran a hosting company.
8. This is not an exact science; every app is different.
10. Flowdock 2010
[Stack diagram: Apache in front of the Rails app and the messages backend, on top of MongoDB and PostgreSQL]
We haven’t always been doing very well. Simple stack, but the messaging part quickly became hairy: it had HTTP streaming, Twitter integration and an e-mail server. Lots of brittle state.
11. Divide and Conquer
A nice strategy for building your SOA, sorting lists and taking over the world.
12. [Architecture diagram: GeoDNS → Stunnel → HAproxy, routing HTTP and WebSocket traffic to separate RSS, IRC, Streaming, API, Rails and Message Backend processes, backed by Redis, MongoDB and PostgreSQL]
These are all different processes.
More components, but this has made it easy to add new features to individual components.
16. Chef
Infrastructure as (Ruby) Code
Chef lets you automate server configuration with Ruby code. A centralized Chef server holds the configuration; nodes communicate with it and pull updates.
17. Chef at Flowdock
• Firewall configuration
• Distribute SSH host keys
• User setup
• Join mesh-based VPN
• And app/server specific stuff
Firewall setup is based on an IP whitelist: only nodes registered in Chef can access private services.
Distributing SSH host keys prevents MITM attacks.
We have a mesh-based VPN, which is configured automatically from Chef data.
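A minimal sketch of how such a whitelist could be driven by Chef data (the recipe below is illustrative; the template and resource names are hypothetical, not Flowdock’s actual code):

# Find all nodes the Chef server knows about in this environment
peers = search(:node, "chef_environment:#{node.chef_environment}")

# Render an iptables rules file that only allows those peers in
template "/etc/iptables.d/whitelist.rules" do
  source "whitelist.rules.erb"
  owner "root"
  mode "0600"
  variables :ips => peers.map { |n| n["ipaddress"] }.compact.sort.uniq
  notifies :run, "execute[reload-firewall]"
end

execute "reload-firewall" do
  command "iptables-restore < /etc/iptables.d/whitelist.rules"
  action :nothing
end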
20. cookbooks/flowdock/oulu.rb
include_recipe "flowdock::users"

package "ruby"

# Write one envdir file per configuration variable
%w{port listen_to flowdock_domain}.each do |e|
  template "#{node[:flowdock][:oulu][:envdir]}/#{e.upcase}" do
    source "envdir_file.erb"
    variables :value => node[:flowdock][:oulu][e]
    owner "oulu"
    mode "0600"
  end
end

# Supervise the process with runit
runit_service "oulu" do
  options :use_config => true
end

The recipe for our IRC server.
21. roles/rails.rb
name "rails"
description "Rails Box"

run_list(
  "recipe[nginx]",
  "recipe[passenger]"
)

override_attributes(
  passenger: { version: "3.0.7" }
)

A role, defined in the Ruby DSL.
Each node can be assigned any number of roles.
Override attributes can be used to override recipe attributes.
24. Managing Chef cluster
$ knife ssh 'role:qa' 'echo "lol"'
imaginary-server lol
qa-db1 lol
qa-db2 lol
Most useful command: triggering a Chef run on a set of servers.
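The trigger itself is the same knife ssh pattern; something like this (standard Chef usage, not taken from the deck):
$ knife ssh 'role:rails' 'sudo chef-client'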
25. Testing Chef Recipes
• Use Chef environments to isolate changes
• Run chef-client on throw-away VMs
• cucumber-chef
sous-chef could be used to automate the VM setup.
Our experience with cucumber-chef and sous-chef is limited.
You also need to monitor things, e.g. that Chef runs have finished on all nodes and that backups are really being taken.
26. Automatic Failover
Avoiding single points of failure
MongoDB works flawlessly since failover is built in, but how do we handle Redis?
27. HAproxy
TCP/HTTP load balancer with failover handling
HAproxy provides easy failover for Rails instances.
IP failover has less latency than a DNS-based solution, but we got DNS failover for free.
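A minimal haproxy.cfg sketch of that kind of setup (hostnames, ports and the check path are made up):

backend rails
  option httpchk GET /
  # When a health check fails, traffic moves to the remaining servers
  server rails1 10.0.0.11:8080 check
  server rails2 10.0.0.12:8080 check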
28. MongoDB has automatic failover built in
MongoDB might have many problems, but failover isn’t one of them. Drivers always stay connected to the current master.
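With the Ruby driver of that era, this only requires connecting to the replica set instead of a single host; the driver then follows whichever node is elected master (hostnames made up):

require 'mongo'

# mongo gem 1.x API: seed list of replica set members
conn = Mongo::ReplSetConnection.new(['db1.example.com', 27017],
                                    ['db2.example.com', 27017],
                                    ['db3.example.com', 27017])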
29. Redis and Postgres have replication, but failover is manual
Not only do you need to promote a new master automatically, you also need to change the application configuration.
32. require 'zk'

$queue = Queue.new
zk = ZK.new # connects to localhost:2181 by default

# ZooKeeper watches are one-shot, so the watch is re-armed on every event
zk.register('/hello_world') do |event|
  data = zk.get('/hello_world', watch: true).first
  # do stuff
  $queue.push(:event)
end
zk.stat('/hello_world', watch: true) # arm the initial watch

zk.create('/hello_world', 'sup?')
$queue.pop # handle local synchronization

zk.set('/hello_world', 'omg, update')
$queue.pop
Using the high-level zk gem; the block is run every time the value is updated.
The ZK gem also has locks and other primitives implemented.
Every ZooKeeper write has to be agreed on by a majority of the servers; reads are eventually consistent.
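The locks mentioned above look roughly like this (the lock name is made up):

zk.with_lock('failover-demo') do
  # only one connected process at a time runs this block
end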
36. Redis Failover
[Diagram: apps watch ZooKeeper for the current master; Node Managers monitor the Redis nodes and update ZooKeeper when failover happens]
Our apps might not use redis_failover or read ZooKeeper directly; a script restarts the app when the ZooKeeper data changes.
HAproxy- or DNS-based solutions are also possible, but this gives us more control over the app restart.
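A sketch of that restart script, using the same zk gem (the znode path and service name are hypothetical):

require 'zk'

zk = ZK.new('zk1:2181,zk2:2181,zk3:2181')

# Restart the app via runit whenever the master pointer changes
zk.register('/redis/master') do |event|
  zk.stat('/redis/master', watch: true) # re-arm the one-shot watch
  system('sv', 'restart', 'app')
end
zk.stat('/redis/master', watch: true) # arm the initial watch

sleep # keep the watcher process alive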
37. Postgres failover with pgpool-II and ZooKeeper
pgpool manages the PG cluster; queries can be distributed to slaves.
I’m afraid of pgpool: the configuration and monitoring scripts are really scary.
38. Postgres Failover
[Diagram: App → pgpool → PG master and slave; a pgpool monitor provides redundancy via ZooKeeper]
ZooKeeper-based pgpool monitoring is used to make pgpool itself redundant.
If pgpool fails, the app needs to reconnect to the new server.
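For reference, the pgpool-II side is configured with parameters along these lines (an excerpt-style sketch; hostnames and the script path are made up):

backend_hostname0 = 'pg-master'
backend_port0 = 5432
backend_hostname1 = 'pg-slave'
backend_port1 = 5432
load_balance_mode = on  # distribute read queries to slaves
failover_command = '/usr/local/bin/promote_standby.sh %d %H'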
39. Zoos are kept
A similar scheme can be used for other master-slave replication setups, e.g. handling failover of our Twitter integration.
REMEMBER TO TEST
40. Test your failover
You might only need failover a few times a year.
I’m not sure everything we’ve built is top-notch, but even the complicated parts have had their one-time use cases.
41. Chef vs ZooKeeper
Chef: configuration files, server bootstrap
ZooKeeper: dynamic configuration variables, failover handling
Chef writes long configuration files; ZooKeeper only holds a few variables.
Chef bootstraps servers and keeps them up to date; ZooKeeper is used to elect master nodes in master-slave scenarios.
42. Mesh-based VPN between boxes
Encrypted MongoDB traffic between masters and slaves. It has saved the day a few times when there have been routing issues between data centers.
43. SSL endpoints in AWS
There were routing issues between our German ISP and Comcast. Moving the SSL front ends closer to the client fixed this and reduced latency: the front page loads 150 ms faster.
44. Winning
We don’t need to worry about waking up at night. The whole team could go sailing and be without internet access at the same time.
45. What have we learned?
46. WebSockets are cool, but make your life harder
Heroku, Amazon Elastic Load Balancer, CloudFlare and Google App Engine don’t work with WebSockets. If you only need to stream data to the client, HTTP event streaming (Server-Sent Events) is a better choice.
Decoupling had an instant effect on our uptime.
47. Let it crash
Make your app crash yourself; at least then you are there to fix things.