1. The
Automation
Factory
nathan@milford.io
blog.milford.io
twitter.com/NathanMilford
github.com/nmilford
2. This is NOT strictly a
Cassandra talk.
♫ There's no earthly way of knowing ♫
3. This is an infrastructure talk.
♫ How your infrastructure's growing. ♫
4. Startups move fast.
Priorities change.
Infrastructure needs to be able to
pivot, too.
♫ Who knows where business is going.
Or which way the data's flowing. ♫
5. When you scale up,
so do your problems.
♫ Drives imploding?
IO plateauing? ♫
6. Not to mention unexpected
disasters.
We lost a whole data center during
Hurricane Sandy.
♫ Is a hurricane a'blowing? ♫
7. How do you keep up with growth?
♫ There's no earthly way of knowing ♫
8. How do you deal with failure?
♫ Are the status LEDs a'glowing?
Is the server reaper mowing? ♫
9. How do you deal with too much
success?
♫ Yes! The danger must be growing
For the data keeps on flowing. ♫
10. What do you do?
♫ And they're certainly not showing
any signs that they are slowing! ♫
12. ♫ Come with me
And you'll be
In a world of systems automation ♫
13. ♫ Take a look
And you’ll see
Into my Chef lucubrations
So login, Install, begin
With the Chef cookbook of my creation
What you'll see might require
Explanation ♫
14. ♫ If you want to view paradise
Simply go to GitHub and view it
Pull requests welcome, go to it
Want to change the code
A merge will do it ♫
https://github.com/linkedin/glu/
https://github.com/octo/collectd/
https://github.com/opscode/chef/
https://github.com/saltstack/salt/
https://github.com/outbrain/onering/
https://github.com/nmilford/chef-cassandra/
https://github.com/rabbitmq/rabbitmq-server/
15. def discover_cassandra_schema
      require 'cassandra-cql'
      schema = {}
      server = "#{node[:ipaddress]}:#{node[:Cassandra][:rpc_port]}"
      db = CassandraCQL::Database.new("#{server}") rescue nil
      if db
        db.keyspaces.collect{|s| schema[s.name] =
          s.column_families.collect{|cfname, cfobj| cfname } }
        schema.delete("system")
        schema.delete("OpsCenter")
        return schema
      end
      return nil
    end
♫ There is no life I know
To compare with writing automation
Write it once
You’ll be free ♫
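To make the shape of that return value concrete, here is a minimal sketch that runs without a live cluster. The `Keyspace` struct and the keyspace/column family names are hypothetical stand-ins for what cassandra-cql hands back; only the filtering of `system` and `OpsCenter` comes from the slide.

```ruby
# Hypothetical stand-in for the keyspace objects cassandra-cql returns.
Keyspace = Struct.new(:name, :column_families)

# Same mapping and filtering as discover_cassandra_schema, minus the connection.
def schema_from(keyspaces)
  schema = {}
  keyspaces.each do |ks|
    # column_families is a name => object hash; we only keep the names.
    schema[ks.name] = ks.column_families.keys
  end
  # Drop the internal keyspaces we do not want to monitor.
  %w[system OpsCenter].each { |name| schema.delete(name) }
  schema
end

# Made-up sample data for illustration.
sample = [
  Keyspace.new("system",    { "schema_keyspaces" => nil }),
  Keyspace.new("OpsCenter", { "rollups60" => nil }),
  Keyspace.new("ads",       { "impressions" => nil, "clicks" => nil })
]

puts schema_from(sample).inspect
```

The result is a plain hash of keyspace name to column family names, which is the structure later templates can loop over.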
17. ♫ If you want to scale past a petabyte
Just install Chef, Salt and Graphite
If you want to sleep the whole night
Automate the world
It will be all right♫
18. ♫ There is no life I know
To compare with writing automation
Write it once
You’ll be free ♫
20. The
Automation
Factory
A Journey from Bare Metal
to Active Cassandra Node
nathan@milford.io
blog.milford.io
twitter.com/NathanMilford
github.com/nmilford
22. 2 Years Later
● 80 billion impressions a month.
● 4 clusters for disparate
use-cases, more in planning.
● 73 Cassandra nodes
across 3 data centers.
23. Mo' Servers,
Mo' Problems
We got multiple cages of servers.
So... yeah... you can see where
automation might help :)
24. Automation Attack Plan
● Provisioning!
● Orchestration!
● Config Management!
● Command and Control!
● Monitoring and Alerting!
25. Provisioning
● Started with Cobbler (which is awesome!)
● High-performance infrastructures are snowflakes and can get out of hand fast.
● No tool worked completely, end to end, and the tool won't write itself.
26. We Built Our Own: Onering
Note: I am only a moderate Lord of the Rings Fan, and the guy who did most of the work on it, Gary Hetzel, is a
Star Trek fan. We are not responsible for any LotR puns.
https://github.com/outbrain/onering/
27. Onering: Provisioning &
Orchestration
● Initiates/manages provisioning and inventory.
● Acts as an orchestration layer in our automation.
● Keeps all metadata, which is searchable.
● Has a CLI tool and REST API to work with.
● Acts as our single point of truth & final authority on state.
29. Onering Provisioning Workflow
➔ Developers put in machine requests by role for the quarterly order.
➔ Machines show up, get racked and powered on.
➔ Machines boot into the Razor microkernel and report to Onering.
➔ Appropriate nodes get kickstarted & bootstrapped into the roles specified.
➔ Additional nodes sit idle in 'allocatable' state.
➔ Once OS is installed, configuration is handed off to...
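One way to picture that workflow is as a small state machine. This is only a sketch: 'allocatable' is the one state name taken from the slide, and every other state name, the `Node` struct, and the transition table are hypothetical, not Onering's actual model.

```ruby
# Hypothetical node lifecycle. Only "allocatable" is a name from the talk.
TRANSITIONS = {
  "racked"       => ["discovered"],                  # powered on, booting the microkernel
  "discovered"   => ["provisioning", "allocatable"], # reported in: assign a role or park it
  "allocatable"  => ["provisioning"],                # idle spare gets claimed later
  "provisioning" => ["installed"],                   # kickstart and bootstrap
  "installed"    => ["configured"],                  # handed off to config management
  "configured"   => []
}.freeze

Node = Struct.new(:name, :role, :state) do
  # Move to a new state, refusing anything the table does not allow.
  def advance(to)
    allowed = TRANSITIONS.fetch(state)
    raise ArgumentError, "#{name}: #{state} -> #{to} not allowed" unless allowed.include?(to)
    self.state = to
    self
  end
end

# A node that reports in with a requested role gets provisioned;
# one without a role waits as a spare.
def next_state_on_report(node)
  node.role ? "provisioning" : "allocatable"
end

node = Node.new("cass9.ny", "cassandra", "racked")
node.advance("discovered")
node.advance(next_state_on_report(node))
puts "#{node.name} is now #{node.state}"
```

Keeping the allowed transitions in one table is what lets a single system act as the authority on state: anything not in the table simply cannot happen.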
30. Config Management: Chef
● Onering bootstraps into a Chef run.
● Chef installs all the system stuff.
● Chef sets up Java and tunes the OS how we like.
● Chef runs the Cassandra Cookbook.
include_recipe "java"
package "apache-cassandra1" do
  action :install
end
template "/etc/cassandra/conf/cassandra.yaml" do
  owner "cassandra"
  group "cassandra"
  mode "0755"
  source "cassandra.yaml.erb"
end
https://github.com/opscode/chef/
31. Cassandra Cookbook does it all!
● Builds/mounts disks.
● Handles multiple clusters, different versions.
● Generates configs (in some cases automatically based on hardware profile).
● Connects to local instance and gets the schema.
● Generates collectd config and maintenance script.
● Schedules maintenance.
https://github.com/nmilford/chef-cassandra
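As an illustration of "config based on hardware profile", here is a sketch of heap sizing derived from RAM and core count. It is modeled on the default heuristic in Cassandra's stock cassandra-env.sh; whether the cookbook uses this exact rule is an assumption, and the `heap_sizes` helper name is made up for the example.

```ruby
# Heap sizing modeled on the heuristic in Cassandra's stock cassandra-env.sh:
#   max heap = max(min(ram / 2, 1024 MB), min(ram / 4, 8192 MB))
#   new gen  = min(100 MB * cores, max heap / 4)
# Assumption: the cookbook may compute these differently.
def heap_sizes(total_ram_mb, cores)
  half     = [total_ram_mb / 2, 1024].min
  quarter  = [total_ram_mb / 4, 8192].min
  max_heap = [half, quarter].max
  new_gen  = [100 * cores, max_heap / 4].min
  { max_heap_mb: max_heap, new_gen_mb: new_gen }
end

# A 64 GB, 16-core box versus a 2 GB, 2-core test VM.
puts heap_sizes(65_536, 16).inspect
puts heap_sizes(2_048, 2).inspect
```

In a recipe the inputs would presumably come from Ohai attributes such as `node[:memory][:total]` and `node[:cpu][:total]`, with the results passed to a template as variables.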
32. Glu: Continuous Deployment
● Not related to getting a C* node to production, but it's how we get apps there.
● Built at LinkedIn.
● Onering talks to it!
● Holds deployment metadata.
● Maven builds an RPM and dumps it to a repo.
● Glu-Agent yum installs it and performs checks.
https://github.com/linkedin/glu
35. Command & Control: Salt
Distributed commands:
salt '*ny*' cassandra.column_families
salt 'cass*' cassandra.compactionstats
salt '*stg*' cassandra.info
salt 'cass1.ny.*' cassandra.keyspaces
salt -E 'cass1-(stg|prod)' cassandra.netstats
salt '*' cassandra.tpstats
Scary commands:
salt '*' --batch-size 25% service.restart cassandra
salt '*' -b2 cmd.run 'nodetool -h $(hostname) -p 7199 snapshot'
We actually wrap salt in Onering to provide AAA, as well as to allow use of Onering
metadata for node targeting.
https://github.com/saltstack/salt
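To show what the targeting and batching in those commands amount to, here is a rough Ruby model. The minion IDs are hypothetical, and the batching uses fixed slices for simplicity; Salt's real batch mode may schedule differently.

```ruby
# Hypothetical minion IDs, for illustration only.
MINIONS = %w[
  cass1.ny.example.com
  cass2.ny.example.com
  cass1-stg.example.com
  cass1-prod.example.com
  web1.ny.example.com
].freeze

# Shell-style glob targeting, in the spirit of: salt '*ny*' ...
def glob_target(minions, pattern)
  minions.select { |id| File.fnmatch(pattern, id) }
end

# Regex targeting, in the spirit of: salt -E 'cass1-(stg|prod)' ...
# Assumption: the pattern is anchored at the start of the minion ID.
def regex_target(minions, pattern)
  re = Regexp.new("\\A(?:#{pattern})")
  minions.select { |id| id.match?(re) }
end

# Split targets into groups, in the spirit of --batch-size 25% or -b 2.
def batches(minions, size)
  per_batch =
    if size.to_s.end_with?("%")
      (minions.length * size.to_f / 100).floor
    else
      size.to_i
    end
  # Never let a batch be empty, even for tiny target lists.
  minions.each_slice([per_batch, 1].max).to_a
end

puts glob_target(MINIONS, "*ny*").inspect
puts regex_target(MINIONS, "cass1-(stg|prod)").inspect
puts batches(glob_target(MINIONS, "cass*"), 2).inspect
```

The point of batching on the scary commands is that only one slice of the cluster is touched at a time, so a bad restart never takes every node down together.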
37. Common Monitoring & Events Bus
● A single infrastructure-wide bus for systems data:
  – Metrics
  – Events
  – Metadata
● Collectd as systems agent.
● RabbitMQ as message bus.
● Graphite as metrics endpoint.
● Working on an events mechanism.
● Each layer should be interchangeable.
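Part of what keeps the layers interchangeable is a dumb wire format. As a sketch, assuming metrics travel the bus in Graphite's plaintext format ("path value timestamp"), a message could be built like this. The naming scheme in `metric_path` is hypothetical, not the one from the talk.

```ruby
# One line of Graphite's plaintext protocol: "<path> <value> <timestamp>\n".
def graphite_line(path, value, time = Time.now)
  "#{path} #{value} #{time.to_i}\n"
end

# Hypothetical naming scheme: <site>.<host>.<plugin>.<metric>.
# Dots and whitespace inside a part are replaced, since "." separates
# levels of the Graphite metric tree.
def metric_path(site, host, plugin, metric)
  [site, host, plugin, metric].map { |part| part.to_s.gsub(/[.\s]/, "_") }.join(".")
end

path = metric_path("ny", "cass1.ny", "cassandra", "read latency")
print graphite_line(path, 1.5, Time.at(1_700_000_000))
```

Anything that can emit or consume a line like that can be swapped in as agent, bus, or endpoint without the other layers noticing.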
38. Collectd
● Been around forever.
● Had to rebuild the JMX plugin to not use OpenJDK.
● Easy to write plugins and extend.
● Writes to RabbitMQ out of the box.
● Easy to templatize config for Chef.
<% @node[:Cassandra][:Keyspaces].each do |ks| -%>
<% ks[1].each do |cf| -%>
Collect "<%= ks[0] %>.<%= cf %>"
Collect "KeyCache.<%= ks[0] %>.<%= cf %>"
Collect "RowCache.<%= ks[0] %>.<%= cf %>"
<% end -%>
<% end -%>
https://github.com/octo/collectd
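To see what that template expands to, here is a sketch that renders it outside of Chef with Ruby's standard ERB library. The node attribute is swapped for a local `keyspaces` variable, and the keyspace and column family names are made up.

```ruby
require "erb"

# The collectd template from the slide, with the Chef node attribute
# replaced by a local variable so it can render standalone.
TEMPLATE = <<~'ERB_SRC'
  <% keyspaces.each do |ks| -%>
  <% ks[1].each do |cf| -%>
  Collect "<%= ks[0] %>.<%= cf %>"
  Collect "KeyCache.<%= ks[0] %>.<%= cf %>"
  Collect "RowCache.<%= ks[0] %>.<%= cf %>"
  <% end -%>
  <% end -%>
ERB_SRC

# keyspaces is a hash of keyspace name => list of column family names.
# Iterating a hash with one block argument yields [name, families] pairs,
# which is why the template indexes ks[0] and ks[1].
def render_collectd(keyspaces)
  ERB.new(TEMPLATE, trim_mode: "-").result_with_hash(keyspaces: keyspaces)
end

# Hypothetical schema for illustration.
puts render_collectd({ "ads" => ["impressions", "clicks"] })
```

Each column family yields three `Collect` lines, so a new column family shows up in monitoring on the next Chef run with no hand editing.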
39. RabbitMQ
● Lots of apps support AMQP.
● Shovel plugin for multi-site.
● Pretty stable.
● I'm not mad at it.
https://github.com/rabbitmq/rabbitmq-server
40. Graphite
● Plays well with RabbitMQ.
● Easy to get metrics into.
● Scads of functions.
● Easy to get meaningful data out of.
https://launchpad.net/graphite
43. Alerting: Nagios Self Serve
● Uses Onering for new node discovery.
● Developers add their own alerts based off of Graphite data.
● Ops get fewer alerts and are not a bottleneck.
● Devs are more engaged.
● Everyone is happy.
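A developer-written alert on Graphite data can be very small. The sketch below assumes the check receives the body of a Graphite render request in JSON format and maps the latest value to the standard Nagios plugin exit codes. The metric name and thresholds are hypothetical, and this is not the actual self-serve tooling from the talk.

```ruby
require "json"

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Evaluate the latest non-null datapoint of a Graphite JSON render
# response against warning and critical thresholds.
def check_graphite(json_body, warn:, crit:)
  series = JSON.parse(json_body).first
  return [UNKNOWN, "UNKNOWN: no series returned"] if series.nil?

  # Datapoints are [value, timestamp] pairs; the newest ones may be null.
  value = series["datapoints"].map(&:first).compact.last
  return [UNKNOWN, "UNKNOWN: no data for #{series['target']}"] if value.nil?

  status, label =
    if value >= crit then [CRITICAL, "CRITICAL"]
    elsif value >= warn then [WARNING, "WARNING"]
    else [OK, "OK"]
    end
  [status, "#{label}: #{series['target']} = #{value}"]
end

# Made-up sample response for illustration.
body = '[{"target": "ny.cass1.cassandra.pending_compactions",
          "datapoints": [[3, 1700000000], [12, 1700000060], [null, 1700000120]]}]'

status, message = check_graphite(body, warn: 10, crit: 50)
puts message
# A real plugin would finish with: exit status
```

Because the check only needs a metric path and two thresholds, developers can own their alerts without ops writing or reviewing each one.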