The Automation Factory

nathan@milford.io
blog.milford.io
twitter.com/NathanMilford
github.com/nmilford

This is NOT strictly a
Cassandra talk.

♫ There's no earthly way of knowing ♫

This is an infrastructure talk.

♫ How your infrastructure's growing. ♫

Startups move fast.

Priorities change.

Infrastructure needs to be able to
pivot, too.

♫ Who knows where business is going.
Or which way the data's flowing. ♫

When you scale up,

so do your problems.

♫ Drives imploding?
IO plateauing? ♫

Not to mention unexpected
disasters.

We lost a whole data center during
Hurricane Sandy.
♫ Is a hurricane a'blowing? ♫

How do you keep up with growth?

♫ There's no earthly way of knowing ♫

How do you deal with failure?

♫ Are the status LEDs a'glowing?
Is the server reaper mowing? ♫

How do you deal with too much
success?

♫ Yes! The danger must be growing
For the data keeps on flowing. ♫

What do you do?

♫ And they're certainly not showing
any signs that they are slowing! ♫

Hold your breath.

Make a wish.

Automate!

♫ Come with me
And you'll be
In a world of systems automation ♫

♫ Take a look
And you’ll see
Into my Chef lucubrations

So log in, install, begin
With the Chef cookbook of my creation
What you'll see might require
Explanation ♫

♫ If you want to view paradise
Simply go to GitHub and view it
Pull requests welcome, go to it
Want to change the code
A merge will do it ♫

https://github.com/linkedin/glu/
https://github.com/octo/collectd/
https://github.com/opscode/chef/
https://github.com/saltstack/salt/
https://github.com/outbrain/onering/
https://github.com/nmilford/chef-cassandra/
https://github.com/rabbitmq/rabbitmq-server/

def discover_cassandra_schema
  require 'cassandra-cql'
  schema = {}
  server = "#{node[:ipaddress]}:#{node[:Cassandra][:rpc_port]}"

  # Connect to the local node; fall back to nil if it cannot be reached.
  db = CassandraCQL::Database.new(server) rescue nil
  if db
    # Map each keyspace name to the names of its column families.
    db.keyspaces.each do |ks|
      schema[ks.name] = ks.column_families.collect { |cfname, _cfobj| cfname }
    end
    # Drop the system and OpsCenter keyspaces.
    schema.delete("system")
    schema.delete("OpsCenter")
    return schema
  end
  return nil
end

♫ There is no life I know
To compare with writing automation
Write it once
You’ll be free ♫

*clickity*

*clickity*

*clickity*

♫ To play Diablo 3 ♫

♫ If you want to scale past a petabyte
Just install Chef, Salt and Graphite
If you want to sleep the whole night
Automate the world
It will be all right ♫

♫ There is no life I know
To compare with writing automation
Write it once
You’ll be free ♫

♫ If you truly wish to be. ♫

The Automation Factory
A Journey from Bare Metal to Active Cassandra Node

Cassandra NYC 2011

http://www.slideshare.net/nmilford/cassandra-for-sysadmins

2 Years Later
● 80 billion impressions a month.
● 4 clusters for disparate use cases, more in planning.
● 73 Cassandra nodes across 3 data centers.

Mo' Servers,
Mo' Problems

We got multiple cages of servers.

So... yeah... you can see where
automation might help :)

Automation Attack Plan

● Provisioning!
● Orchestration!
● Command and Control!
● Config Management!
● Monitoring and Alerting!

Provisioning
● Started with Cobbler (which is awesome!).
● High-performance infrastructures are snowflakes and can get out of hand fast.
● No tool worked completely, end to end, and the tool won't write itself.

We Built Our Own: Onering

Note: I am only a moderate Lord of the Rings fan, and the guy who did most of the work on it, Gary Hetzel, is a Star Trek fan. We are not responsible for any LotR puns.
https://github.com/outbrain/onering/

Onering: Provisioning & Orchestration
● Initiates/manages provisioning and inventory.
● Acts as an orchestration layer in our automation.
● Keeps all metadata, which is searchable.
● Has a CLI tool and REST API to work with.
● Acts as our single point of truth & final authority on state.

Onering Provisioning Workflow
➔ Developers put in machine requests by role for the quarterly order.
➔ Machines show up, get racked and powered on.
➔ Machines boot into the Razor microkernel and report to Onering.
➔ Appropriate nodes get kickstarted & bootstrapped into the roles specified.
➔ Additional nodes sit idle in the 'allocatable' state.
➔ Once the OS is installed, configuration is handed off to...
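
The workflow above behaves like a small state machine. Here is a minimal Ruby sketch of that idea; it is not Onering's actual code. Only 'allocatable' is a state name from the talk. The other state names, the Node class, and the host names are made up for illustration.

```ruby
# Hypothetical node lifecycle, loosely following the workflow above.
# Only :allocatable comes from the talk; every other name is invented.
TRANSITIONS = {
  racked:       [:discovered],
  discovered:   [:allocatable, :provisioning],
  allocatable:  [:provisioning],
  provisioning: [:configuring],
  configuring:  [:active],
  active:       []
}.freeze

class Node
  attr_reader :name, :role, :state

  def initialize(name, role: nil)
    @name  = name
    @role  = role
    @state = :racked
  end

  # Move to a new state, refusing transitions the table does not allow.
  def advance(to, role: nil)
    allowed = TRANSITIONS.fetch(@state)
    raise ArgumentError, "#{@state} -> #{to} not allowed" unless allowed.include?(to)
    @role  = role if role
    @state = to
    self
  end
end

# Once a machine has reported in, nodes with a requested role get
# provisioned and the rest wait in the allocatable pool.
def triage(node)
  node.advance(:discovered)
  node.role ? node.advance(:provisioning) : node.advance(:allocatable)
end

cass  = triage(Node.new("cass9.ny", role: "cassandra"))
spare = triage(Node.new("spare1.ny"))
puts "#{cass.name}: #{cass.state}"
puts "#{spare.name}: #{spare.state}"
```

Keeping the allowed transitions in one table means a spare node can later be pulled out of the pool with `spare.advance(:provisioning, role: "cassandra")`, while anything that skips a step fails loudly.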

Config Management: Chef
● Onering bootstraps into a Chef run.
● Chef installs all the system stuff.
● Chef sets up Java and tunes the OS how we like.
● Chef runs the Cassandra Cookbook.
include_recipe "java"

package "apache-cassandra1" do
  action :install
end

template "/etc/cassandra/conf/cassandra.yaml" do
  owner "cassandra"
  group "cassandra"
  mode "0755"
  source "cassandra.yaml.erb"
end

https://github.com/opscode/chef/

Cassandra Cookbook does it all!
● Builds/mounts disks.
● Handles multiple clusters, different versions.
● Generates configs (in some cases automatically based on hardware profile).
● Connects to local instance and gets the schema.
● Generates collectd config and maintenance script.
● Schedules maintenance.
https://github.com/nmilford/chef-cassandra
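
The slides do not show how configs are derived from the hardware profile. As one possibility, here is a sketch that follows the default heap heuristic in the stock cassandra-env.sh of that era, as I recall it. The function names are made up, and it is an assumption that the cookbook does anything like this. The memory string imitates the "NNNkB" form that Ohai reports.

```ruby
# Hypothetical helper: pick JVM heap sizes from RAM and core count.
# Follows the default cassandra-env.sh heuristic (an assumption, not the
# cookbook's confirmed logic):
#   max heap = max(min(RAM/2, 1 GB), min(RAM/4, 8 GB))
#   new gen  = min(100 MB per core, max heap / 4)
def heap_sizes(ram_mb, cores)
  half     = [ram_mb / 2, 1024].min
  quarter  = [ram_mb / 4, 8192].min
  max_heap = [half, quarter].max
  new_size = [100 * cores, max_heap / 4].min
  { max_heap_mb: max_heap, heap_newsize_mb: new_size }
end

# Ohai-style memory strings look like "65932148kB"; String#to_i stops
# at the first non-digit, so the unit suffix is ignored.
def ohai_kb_to_mb(total)
  total.to_i / 1024
end

sizes = heap_sizes(ohai_kb_to_mb("65932148kB"), 16)
puts "MAX_HEAP_SIZE=#{sizes[:max_heap_mb]}M"
puts "HEAP_NEWSIZE=#{sizes[:heap_newsize_mb]}M"
```

In a recipe, values like these would typically be passed to a template as variables instead of being printed.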

Glu: Continuous Deployment
● Not related to getting a C* node to production, but it's how we get apps there.
● Built at LinkedIn.
● Onering talks to it!
● Holds deployment metadata.
● Maven builds an RPM and dumps it to a repo.
● Glu-Agent yum installs it and performs checks.

https://github.com/linkedin/glu

Command & Control: Salt
Distributed commands:
salt '*ny*' cassandra.column_families
salt 'cass*' cassandra.compactionstats
salt '*stg*' cassandra.info
salt 'cass1.ny.*' cassandra.keyspaces
salt -E 'cass1-(stg|prod)' cassandra.netstats
salt '*' cassandra.tpstats

Scary commands:
salt '*' --batch-size 25% service.restart cassandra
salt '*' -b 2 cmd.run 'nodetool -h $(hostname) -p 7199 snapshot'

We actually wrap Salt in Onering to provide AAA, as well as to allow use of Onering metadata for node targeting.

https://github.com/saltstack/salt
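
To illustrate metadata-driven targeting, here is a rough sketch of turning inventory metadata into a Salt command line. The inventory records, host names, and helper names are all hypothetical, and this is not how Onering actually does it. The only Salt features assumed are the `-L` list target and `--batch-size`; `cassandra.tpstats` is the custom module function shown above.

```ruby
require "shellwords"

# Hypothetical inventory records; the real metadata lives in Onering.
INVENTORY = [
  { name: "cass1.ny.example.com",  role: "cassandra", site: "ny", env: "prod" },
  { name: "cass2.ny.example.com",  role: "cassandra", site: "ny", env: "prod" },
  { name: "cass1.stg.example.com", role: "cassandra", site: "ny", env: "stg"  },
  { name: "web1.ny.example.com",   role: "web",       site: "ny", env: "prod" }
].freeze

# Return the names of nodes whose metadata matches every filter.
def select_nodes(inventory, **filters)
  inventory.select { |n| filters.all? { |k, v| n[k] == v } }
           .map { |n| n[:name] }
end

# Build (but do not run) a salt invocation using list targeting.
def salt_argv(targets, function, *args, batch: nil)
  raise ArgumentError, "no targets matched" if targets.empty?
  argv = ["salt", "-L", targets.join(",")]
  argv += ["--batch-size", batch] if batch
  argv + [function, *args]
end

nodes = select_nodes(INVENTORY, role: "cassandra", env: "prod")
puts Shellwords.join(salt_argv(nodes, "cassandra.tpstats", batch: "25%"))
```

Building an argument array instead of a shell string keeps quoting out of the picture. A wrapper like this is also a natural place for the authentication and audit checks before anything runs.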

Common Monitoring & Events Bus
● A single infrastructure-wide bus for systems data:
  – Metrics
  – Events
  – Metadata
● Collectd as systems agent.
● RabbitMQ as message bus.
● Graphite as metrics endpoint.
● Working on an events mechanism.
● Each layer should be interchangeable.
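
One reason the layers can be swapped is that a metric on the bus can be a single line of text. This sketch formats a sample in Graphite's plaintext form, `<path> <value> <timestamp>`. The path layout imitates the `collectd.machines.*.cass2*` pattern in the render example later in the deck. The `site` segment and the host name are guesses for illustration.

```ruby
# Format one sample as a Graphite plaintext line: "<path> <value> <ts>\n".
# The path layout (collectd.machines.<site>.<host>.<plugin>.<metric>) is an
# assumption inferred from the render targets shown in this deck.
def graphite_line(site, host, plugin, metric, value, time = Time.now)
  # Dots separate path components in Graphite, so flatten them in host names.
  safe_host = host.tr(".", "_")
  path = ["collectd", "machines", site, safe_host, plugin, metric].join(".")
  "#{path} #{value} #{time.to_i}\n"
end

print graphite_line("ny", "cass2-01", "GenericJMX", "ReadStage.PendingTasks", 3)
```

Anything that can produce or consume lines like this can replace the agent, the bus, or the metrics store without the other layers noticing.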

Collectd
● Been around forever.
● Had to rebuild the JMX plugin to not use OpenJDK.
● Easy to write plugins and extend.
● Writes to RabbitMQ out of the box.
● Easy to templatize config for Chef.
<% @node[:Cassandra][:Keyspaces].each do |ks| -%>
<% ks[1].each do |cf| -%>
Collect "<%= ks[0] %>.<%= cf %>"
Collect "KeyCache.<%= ks[0] %>.<%= cf %>"
Collect "RowCache.<%= ks[0] %>.<%= cf %>"
<% end -%>
<% end -%>
https://github.com/octo/collectd
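
To see what the template produces, the same loop can be rendered outside Chef with Ruby's standard ERB library. The sample schema is made up. A local variable `keyspaces` stands in for `@node[:Cassandra][:Keyspaces]`, on the assumption that the attribute holds a hash of keyspace name to column family names, as `discover_cassandra_schema` returns.

```ruby
require "erb"

# Same loop as the cookbook template, with a local variable in place of the
# node attribute. Iterating a Hash yields [key, value] pairs, which is why
# the template reads ks[0] (keyspace) and ks[1] (column families).
TEMPLATE = <<~'TPL'
  <% keyspaces.each do |ks| -%>
  <% ks[1].each do |cf| -%>
  Collect "<%= ks[0] %>.<%= cf %>"
  Collect "KeyCache.<%= ks[0] %>.<%= cf %>"
  Collect "RowCache.<%= ks[0] %>.<%= cf %>"
  <% end -%>
  <% end -%>
TPL

# Hypothetical schema, in the shape discover_cassandra_schema returns.
schema = { "Ads" => ["Impressions", "Clicks"] }

# trim_mode "-" makes the "-%>" tags swallow their trailing newline.
rendered = ERB.new(TEMPLATE, trim_mode: "-").result_with_hash(keyspaces: schema)
puts rendered
```

Each column family yields three Collect lines, so the monitoring config grows with the schema and nobody has to edit it by hand.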

RabbitMQ
● Lots of apps support AMQP.
● Shovel plugin for multi-site.
● Pretty stable.
● I'm not mad at it.

https://github.com/rabbitmq/rabbitmq-server

Graphite

● Plays well with RabbitMQ.
● Easy to get metrics into.
● Scads of functions.
● Easy to get meaningful data out of.

https://launchpad.net/graphite

Graphite Render, Activate!
http://graphite/render?
width=800
&height=600
&from=-2hours
&until=now
&target=sortByMaxima(highestCurrent(collectd.machines.*.cass2*.GenericJMX.ReadStage.PendingTasks,5))
&target=sortByMaxima(highestCurrent(collectd.machines.*.cass2*.GenericJMX.MutationStage.PendingTasks,5))
&hideLegend=false
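
A URL like this is easier to generate than to type. Here is a small sketch that builds the same request with Ruby's standard library. The helper names are made up, and `http://graphite` is the placeholder host from the slide.

```ruby
require "uri"

# Build a Graphite render URL; repeated "target" parameters are allowed,
# so the query is assembled as a list of pairs instead of a hash.
def render_url(base, targets, width: 800, height: 600, from: "-2hours", to: "now")
  params  = [["width", width], ["height", height], ["from", from], ["until", to]]
  params += targets.map { |t| ["target", t] }
  params << ["hideLegend", "false"]
  "#{base}/render?#{URI.encode_www_form(params)}"
end

# Top-N hosts by current pending tasks for one Cassandra thread pool stage.
def top_pending(stage, n = 5)
  "sortByMaxima(highestCurrent(" \
    "collectd.machines.*.cass2*.GenericJMX.#{stage}.PendingTasks,#{n}))"
end

puts render_url("http://graphite",
                [top_pending("ReadStage"), top_pending("MutationStage")])
```

The parentheses and commas come out percent-encoded, which Graphite decodes as usual.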

Alerting: Nagios Self Serve
● Uses Onering for new node discovery.
● Developers add their own alerts based off of Graphite data.
● Ops get fewer alerts and are not a bottleneck.
● Devs are more engaged.
● Everyone is happy.
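
A self-serve alert can be little more than a threshold applied to Graphite's JSON output. This is a sketch, not the actual check. It assumes the `format=json` render response shape (a list of series, each with a `target` and `[value, timestamp]` datapoints) and the standard Nagios plugin exit codes. The sample data and thresholds are made up.

```ruby
require "json"

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Evaluate the latest non-null datapoint of every series against the
# thresholds and report the worst one. Returns [exit_code, message].
def check_series(json, warn:, crit:)
  worst = nil
  JSON.parse(json).each do |series|
    latest = series["datapoints"].map(&:first).compact.last
    next if latest.nil?
    worst = [series["target"], latest] if worst.nil? || latest > worst[1]
  end
  return [UNKNOWN, "UNKNOWN: no datapoints"] if worst.nil?

  target, value = worst
  if value >= crit
    [CRITICAL, "CRITICAL: #{target} = #{value}"]
  elsif value >= warn
    [WARNING, "WARNING: #{target} = #{value}"]
  else
    [OK, "OK: #{target} = #{value}"]
  end
end

# Made-up sample in the shape Graphite returns for format=json.
sample = <<~JSON
  [{"target": "cass2-01.ReadStage.PendingTasks",
    "datapoints": [[2.0, 1365000000], [12.0, 1365000060], [null, 1365000120]]},
   {"target": "cass2-02.ReadStage.PendingTasks",
    "datapoints": [[1.0, 1365000000], [3.0, 1365000060], [null, 1365000120]]}]
JSON

status, message = check_series(sample, warn: 10, crit: 50)
puts message
# A real plugin would finish with: exit status
```

Since the thresholds are just arguments, developers can own them per metric, and the check itself stays generic.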
