
A #NYCCassandra2013 talk wherein I outline Outbrain's automation infrastructure and how we go from metal to working cluster nodes.

  21. Cassandra NYC 2011
  22. 2 Years Later
      ● 80 billion impressions a month.
      ● 4 clusters for disparate use cases, more in planning.
      ● 73 Cassandra nodes across 3 data centers.
  23. Mo Servers, Mo Problems
      We got multiple cages of servers. So... yeah... you can see where automation might help :)
  24. Automation Attack Plan
      ● Provisioning!
      ● Orchestration!
      ● Command and Control!
      ● Config Management!
      ● Monitoring and Alerting!
  25. Provisioning
      ● Started with Cobbler (which is awesome!)
      ● High-performance infrastructures are snowflakes, and can get out of hand fast.
      ● No tool worked completely, end to end, and the tool won't write itself.
  26. We Built Our Own: Onering
      Note: I am only a moderate Lord of the Rings fan, and the guy who did most of the work on it, Gary Hetzel, is a Star Trek fan. We are not responsible for any LotR puns.
  27. Onering: Provisioning & Orchestration
      ● Initiates/manages provisioning and inventory.
      ● Acts as an orchestration layer in our automation.
      ● Keeps all metadata, which is searchable.
      ● Has a CLI tool and REST API to work with.
      ● Acts as our single point of truth & final authority on state.
  28. Onering Provisioning Workflow
      ➔ Developers put in machine requests by role for the quarterly order.
      ➔ Machines show up, get racked and powered on.
      ➔ Machines boot into the Razor microkernel and report to Onering.
      ➔ Appropriate nodes get kickstarted & bootstrapped into the roles specified.
      ➔ Additional nodes sit idle in an allocatable state.
      ➔ Once the OS is installed, configuration is handed off to...
  29. Config Management: Chef
      ● Onering bootstraps into a Chef run.
      ● Chef installs all the system stuff.
      ● Chef sets up Java and tunes the OS how we like.
      ● Chef runs the Cassandra Cookbook.

          include_recipe "java"

          package "apache-cassandra1" do
            action :install
          end

          template "/etc/cassandra/conf/cassandra.yaml" do
            owner "cassandra"
            group "cassandra"
            mode "0755"
            source "cassandra.yaml.erb"
          end
  30. Cassandra Cookbook does it all!
      ● Builds/mounts disks.
      ● Handles multiple clusters, different versions.
      ● Generates configs (in some cases automatically based on hardware profile).
      ● Connects to local instance and gets the schema.
      ● Generates collectd config and maintenance script.
      ● Schedules maintenance.
  31. Glu: Continuous Deployment
      ● Not related to getting a C* node to production, but it's how we get apps there.
      ● Built at LinkedIn.
      ● Onering talks to it!
      ● Holds deployment metadata.
      ● Maven builds an RPM, dumps to a repo.
      ● Glu-Agent yum installs it and performs checks.
  32. Command & Control:
      Distributed commands:
          salt '*ny*' cassandra.column_families
          salt 'cass*' cassandra.compactionstats
          salt '*stg*' cassandra.info
          salt 'cass1.ny.*' cassandra.keyspaces
          salt -E 'cass1-(stg|prod)' cassandra.netstats
          salt '*' cassandra.tpstats
      Scary commands:
          salt '*' --batch-size 25% service.restart cassandra
          salt '*' -b2 "nodetool -h $(hostname) -p 7199 snapshot"
      We actually wrap salt in Onering to provide AAA, as well as to allow use of Onering metadata for node targeting.
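The target expressions in the salt commands above can be illustrated with a small sketch. It assumes, as the Salt documentation describes, that a plain target is a shell-style glob matched against the minion ID and that `-E` treats the target as a regular expression. The host names are hypothetical, and Python's `fnmatch` and `re` modules stand in for Salt's own matcher here.

```python
import fnmatch
import re

# Hypothetical minion IDs, invented purely for illustration.
MINIONS = [
    "cass1-prod-01.ny.example",
    "cass1-stg-01.ny.example",
    "cass2-prod-01.la.example",
    "web-stg-01.ny.example",
]


def match_glob(pattern, minions):
    """Return the minion IDs matching a shell-style glob (plain salt target)."""
    return [m for m in minions if fnmatch.fnmatchcase(m, pattern)]


def match_pcre(pattern, minions):
    """Return the minion IDs matching a regex anchored at the start (salt -E)."""
    rx = re.compile(pattern)
    return [m for m in minions if rx.match(m)]


if __name__ == "__main__":
    print(match_glob("*ny*", MINIONS))
    print(match_glob("cass*", MINIONS))
    print(match_pcre("cass1-(stg|prod)", MINIONS))
```

Checking a pattern against an inventory like this before running a "scary" command is one cheap way to confirm which nodes it would touch.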
  33. Monitoring Is Hard...
  34. Common Monitoring & Events Bus
      ● A single infrastructure-wide bus for systems data:
          – Metrics
          – Events
          – Metadata
      ● Collectd as systems agent.
      ● RabbitMQ as message bus.
      ● Graphite as metrics endpoint.
      ● Working on an events mechanism.
      ● Each layer should be interchangeable.
  35. Collectd
      ● Been around forever.
      ● Had to rebuild the JMX plugin to not use OpenJDK.
      ● Easy to write plugins and extend.
      ● Writes to RabbitMQ out of the box.
      ● Easy to templatize config for Chef.

          <% @node[:Cassandra][:Keyspaces].each do |ks| -%>
          <% ks[1].each do |cf| -%>
            Collect "<%= ks[0] %>.<%= cf %>"
            Collect "KeyCache.<%= ks[0] %>.<%= cf %>"
            Collect "RowCache.<%= ks[0] %>.<%= cf %>"
          <% end -%>
          <% end -%>
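To make the template's output concrete, here is a rough Python equivalent of the loop in the ERB snippet above: for every keyspace and column family it emits three `Collect` lines. The keyspace and column family names are hypothetical sample data, and the function is only an illustration of what the template expands to, not part of the cookbook.

```python
def collect_lines(keyspaces):
    """Expand a {keyspace: [column_family, ...]} mapping into Collect lines.

    Mirrors the ERB loop: one plain, one KeyCache and one RowCache entry
    per column family.
    """
    lines = []
    for keyspace, column_families in keyspaces.items():
        for cf in column_families:
            lines.append(f'Collect "{keyspace}.{cf}"')
            lines.append(f'Collect "KeyCache.{keyspace}.{cf}"')
            lines.append(f'Collect "RowCache.{keyspace}.{cf}"')
    return lines


if __name__ == "__main__":
    # Hypothetical schema, standing in for what the cookbook reads from the node.
    sample = {"Users": ["profiles", "sessions"]}
    print("\n".join(collect_lines(sample)))
```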
  36. RabbitMQ
      ● Lots of apps support AMQP.
      ● Shovel plugin for multi-site.
      ● Pretty stable.
      ● I'm not mad at it.
  37. Graphite
      ● Plays well with RabbitMQ.
      ● Easy to get metrics into.
      ● Scads of functions.
      ● Easy to get meaningful data out of.
  38. Graphite Render, Activate!
      http://graphite/render?width=800&height=600&from=-2hours&until=now&target=sortByMaxima(highestCurrent(collectd.machines.*.cass2*.GenericJMX.ReadStage.PendingTasks,5))&target=sortByMaxima(highestCurrent(collectd.machines.*.cass2*.GenericJMX.MutationStage.PendingTasks,5))&hideLegend=false
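A render URL like the one above is easier to maintain when it is built from its parts. The sketch below assembles the same query with Python's standard library; the function name and its defaults are my own, and the output is percent-encoded (parentheses, commas and asterisks are escaped), which a web server decodes like any other query string.

```python
from urllib.parse import urlencode


def render_url(base, metrics, top_n=5, window="-2hours", width=800, height=600):
    """Build a Graphite render URL with one target per metric pattern.

    Each pattern is wrapped as sortByMaxima(highestCurrent(pattern, top_n)),
    matching the expression used on the slide.
    """
    params = [
        ("width", width),
        ("height", height),
        ("from", window),
        ("until", "now"),
    ]
    for metric in metrics:
        params.append(("target", f"sortByMaxima(highestCurrent({metric},{top_n}))"))
    params.append(("hideLegend", "false"))
    return f"{base}/render?{urlencode(params)}"


if __name__ == "__main__":
    stages = ["ReadStage", "MutationStage"]
    patterns = [
        f"collectd.machines.*.cass2*.GenericJMX.{stage}.PendingTasks"
        for stage in stages
    ]
    print(render_url("http://graphite", patterns))
```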
  39. Alerting: Nagios Self Serve
      ● Uses Onering for new node discovery.
      ● Developers add their own alerts based on Graphite data.
      ● Ops get fewer alerts and are not a bottleneck.
      ● Devs are more engaged.
      ● Everyone is happy.
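The slides do not show how an alert based on Graphite data is evaluated, so the following is only a sketch of one way it could work. It assumes the check receives the JSON that Graphite's render endpoint returns with `format=json` (a list of series, each with `datapoints` as `[value, timestamp]` pairs) and maps the most recent value to the standard Nagios plugin exit codes. The thresholds and sample payload are hypothetical.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(series):
    """Return the most recent non-null value in a series, or None."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def check(payload, warn, crit):
    """Map the worst latest value across all series to a Nagios status."""
    values = [latest_value(s) for s in json.loads(payload)]
    values = [v for v in values if v is not None]
    if not values:
        return UNKNOWN, "no data"
    worst = max(values)
    if worst >= crit:
        return CRITICAL, f"value {worst} >= {crit}"
    if worst >= warn:
        return WARNING, f"value {worst} >= {warn}"
    return OK, f"value {worst}"


if __name__ == "__main__":
    # Hypothetical payload, shaped like a Graphite format=json response.
    sample = json.dumps([
        {"target": "node1.PendingTasks", "datapoints": [[3, 1], [12, 2], [None, 3]]},
        {"target": "node2.PendingTasks", "datapoints": [[1, 1], [2, 2]]},
    ])
    print(check(sample, warn=10, crit=50))
```

A real plugin would fetch the payload over HTTP and exit with the returned code; that part is left out to keep the sketch self-contained.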
