The Automation Factory


A #NYCCassandra2013 talk wherein I outline Outbrain's automation infrastructure and how we go from metal to working cluster nodes.

Published in: Technology

  1. The Automation Factory
  2. This is NOT strictly a Cassandra talk. ♫ There's no earthly way of knowing ♫
  3. This is an infrastructure talk. ♫ How your infrastructure's growing. ♫
  4. Startups move fast. Priorities change. Infrastructure needs to be able to pivot, too. ♫ Who knows where business is going. Or which way the data's flowing. ♫
  5. When you scale up, so do your problems. ♫ Drives imploding? IO plateauing? ♫
  6. Not to mention unexpected disasters. We lost a whole data center during Hurricane Sandy. ♫ Is a hurricane a-blowing? ♫
  7. How do you keep up with growth? ♫ There's no earthly way of knowing ♫
  8. How do you deal with failure? ♫ Are the status LEDs a-glowing? Is the server reaper mowing? ♫
  9. How do you deal with too much success? ♫ Yes! The danger must be growing For the data keeps on flowing. ♫
  10. What do you do? ♫ And they're certainly not showing any signs that they are slowing! ♫
  11. Hold your breath. Make a wish. Automate!
  12. ♫ Come with me And you'll be In a world of systems automation ♫
  13. ♫ Take a look And you'll see Into my Chef lucubrations So login, Install, begin With the Chef cookbook of my creation What you'll see might require Explanation ♫
  14. ♫ If you want to view paradise Simply go to GitHub and view it Pull requests welcome, go to it Want to change the code A merge will do it ♫
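  15. ♫ There is no life I know To compare with writing automation Write it once You'll be free ♫
      def discover_cassandra_schema
        require 'cassandra-cql'
        schema = {}
        server = "#{node[:ipaddress]}:#{node[:Cassandra][:rpc_port]}"
        # Connect to the local node; db stays nil if it is unreachable.
        # (Constructor name assumed from the cassandra-cql gem.)
        db = CassandraCQL::Database.new("#{server}") rescue nil
        if db
          # Map each keyspace name to the names of its column families.
          db.keyspaces.collect { |s|
            schema[s.name] = s.column_families.collect { |cfname, cfobj| cfname }
          }
          # Drop keyspaces we do not manage ourselves.
          schema.delete("system")
          schema.delete("OpsCenter")
          return schema
        end
        return nil
      end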
  16. *clickity* *clickity* *clickity* ♫ To play Diablo 3 ♫
  17. ♫ If you want to scale past a petabyte Just install Chef, Salt and Graphite If you want to sleep the whole night Automate the world It will be all right ♫
  18. ♫ There is no life I know To compare with writing automation Write it once You'll be free ♫
  19. ♫ If you truly wish to be. ♫
  20. The Automation Factory: A Journey from Bare Metal to Active Cassandra
  21. Cassandra NYC 2011
  22. 2 Years Later
      ● 80 billion impressions a month.
      ● 4 clusters for disparate use-cases, more in planning.
      ● 73 Cassandra nodes across 3 data centers.
  23. Mo Servers, Mo Problems
      We got multiple cages of servers. So... yeah... you can see where automation might help :)
  24. Automation Attack Plan
      ● Provisioning!
      ● Orchestration!
      ● Command and Control!
      ● Config Management!
      ● Monitoring and Alerting!
  25. Provisioning
      ● Started with Cobbler (which is Awesome!)
      ● High performance infrastructures are snowflakes, and they can get out of hand fast.
      ● No tool worked completely, end to end, and the tool won't write itself.
  26. We Built Our Own: Onering
      Note: I am only a moderate Lord of the Rings fan, and the guy who did most of the work on it, Gary Hetzel, is a Star Trek fan. We are not responsible for any LotR puns.
  27. Onering: Provisioning & Orchestration
      ● Initiates/manages provisioning and inventory.
      ● Acts as an orchestration layer in our automation.
      ● Keeps all metadata, which is searchable.
      ● Has a CLI tool and REST API to work with.
      ● Acts as our single point of truth & final authority on state.
  28. Onering Provisioning Workflow
      ➔ Developers put in machine requests by role for quarterly order.
      ➔ Machines show up, get racked and powered on.
      ➔ Machines boot into the Razor microkernel and report to Onering.
      ➔ Appropriate nodes get kickstarted & bootstrapped into roles specified.
      ➔ Additional nodes sit idle in allocatable state.
      ➔ Once OS is installed, configuration is handed off to...
  29. Config Management: Chef
      ● Onering bootstraps into a Chef run.
      ● Chef installs all the system stuff.
      ● Chef sets up Java and tunes the OS how we like.
      ● Chef runs the Cassandra Cookbook.

      include_recipe "java"

      package "apache-cassandra1" do
        action :install
      end

      template "/etc/cassandra/conf/cassandra.yaml" do
        owner "cassandra"
        group "cassandra"
        mode "0755"
        source "cassandra.yaml.erb"
      end
  30. Cassandra Cookbook does it all!
      ● Builds/mounts disks.
      ● Handles multiple clusters, different versions.
      ● Generates configs (in some cases automatically based on hardware profile).
      ● Connects to local instance and gets the schema.
      ● Generates collectd config and maintenance script.
      ● Schedules maintenance.
  31. Glu: Continuous Deployment
      ● Not related to getting a C* node to production, but it's how we get apps there.
      ● Built at LinkedIn.
      ● Onering talks to it!
      ● Holds deployment metadata.
      ● Maven builds an RPM, dumps to a repo.
      ● Glu-Agent yum installs it and performs checks.
  32. Command & Control
      Distributed commands:
        salt '*ny*' cassandra.column_families
        salt 'cass*' cassandra.compactionstats
        salt '*stg*' cassandra.info
        salt 'cass1.ny.*' cassandra.keyspaces
        salt -E 'cass1-(stg|prod)' cassandra.netstats
        salt '*' cassandra.tpstats
      Scary commands:
        salt '*' --batch-size 25% service.restart cassandra
        salt '*' -b2 "nodetool -h $(hostname) -p 7199 snapshot"
      We actually wrap salt in Onering to provide AAA, as well as to allow use of Onering metadata for node targeting.
  33. Monitoring Is Hard...
  34. Common Monitoring & Events Bus
      ● A single infrastructure-wide bus for systems data:
        – Metrics
        – Events
        – Metadata
      ● Collectd as systems agent.
      ● RabbitMQ as message bus.
      ● Graphite as metrics endpoint.
      ● Working on an events mechanism.
      ● Each layer should be interchangeable.
  35. Collectd
      ● Been around forever.
      ● Had to rebuild the JMX plugin to not use OpenJDK.
      ● Easy to write plugins and extend.
      ● Writes to RabbitMQ out of the box.
      ● Easy to templatize config for Chef.

      <% @node[:Cassandra][:Keyspaces].each do |ks| -%>
      <% ks[1].each do |cf| -%>
        Collect "<%= ks[0] %>.<%= cf %>"
        Collect "KeyCache.<%= ks[0] %>.<%= cf %>"
        Collect "RowCache.<%= ks[0] %>.<%= cf %>"
      <% end -%>
      <% end -%>
  36. RabbitMQ
      ● Lots of apps support AMQP.
      ● Shovel plugin for multi-site.
      ● Pretty stable.
      ● I'm not mad at it.
  37. Graphite
      ● Plays well with RabbitMQ.
      ● Easy to get metrics into.
      ● Scads of functions.
      ● Easy to get meaningful data out of.
  38. Graphite Render, Activate!
      http://graphite/render?width=800&height=600&from=-2hours&until=now&target=sortByMaxima(highestCurrent(collectd.machines.*.cass2*.GenericJMX.ReadStage.PendingTasks,5))&target=sortByMaxima(highestCurrent(collectd.machines.*.cass2*.GenericJMX.MutationStage.PendingTasks,5))&hideLegend=false
  39. Alerting: Nagios Self Serve
      ● Uses Onering for new node discovery.
      ● Developers add their own alerts based off of Graphite data.
      ● Ops get fewer alerts and are not a bottleneck.
      ● Devs are more engaged.
      ● Everyone is happy.
  40. Questions?
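
A note on slide 38: a render URL with nested Graphite functions is easier to keep correct when it is assembled from parts rather than typed by hand. What follows is a minimal Ruby sketch of that idea, not code from the talk. The helper names `graphite_render_url` and `top_pending` are hypothetical; the host `graphite`, the metric path, and the query parameters are taken from the slide.

```ruby
require "uri"

# Hypothetical helper: wraps a thread-pool stage in the same Graphite
# functions the slide uses (top N by current value, sorted by maxima).
def top_pending(stage, cluster_glob: "cass2*", limit: 5)
  metric = "collectd.machines.*.#{cluster_glob}.GenericJMX.#{stage}.PendingTasks"
  "sortByMaxima(highestCurrent(#{metric},#{limit}))"
end

# Hypothetical helper: builds a /render URL with one `target` per series.
# URI.encode_www_form percent-encodes the parentheses and commas.
def graphite_render_url(host, targets, from: "-2hours", until_time: "now",
                        width: 800, height: 600)
  params = [["width", width], ["height", height],
            ["from", from], ["until", until_time]]
  targets.each { |t| params << ["target", t] }
  params << ["hideLegend", "false"]
  "http://#{host}/render?#{URI.encode_www_form(params)}"
end

puts graphite_render_url("graphite",
                         [top_pending("ReadStage"), top_pending("MutationStage")])
```

The printed URL carries the same parameters as the one on the slide, with the function syntax percent-encoded so it survives being pasted into a dashboard or a Nagios check.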
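
A note on slides 15 and 35: the schema hash that the cookbook discovers is what drives the collectd template, so each column family gets its own `Collect` lines. The sketch below renders a template of the same shape outside of Chef, using Ruby's standard `erb` library. It assumes the schema is a hash of keyspace name to column family names (the shape `discover_cassandra_schema` appears to return); `render_collect_lines` and the sample keyspace `Demo` are made up for illustration.

```ruby
require "erb"

# Same structure as the slide's template, with the node attribute
# replaced by a local variable so it can run without Chef.
COLLECT_TEMPLATE = <<~'TPL'
  <% keyspaces.each do |ks, cfs| -%>
  <% cfs.each do |cf| -%>
  Collect "<%= ks %>.<%= cf %>"
  Collect "KeyCache.<%= ks %>.<%= cf %>"
  Collect "RowCache.<%= ks %>.<%= cf %>"
  <% end -%>
  <% end -%>
TPL

# Hypothetical helper: renders the Collect lines for a schema hash.
# trim_mode "-" lets the "-%>" tags swallow their trailing newlines.
def render_collect_lines(keyspaces)
  ERB.new(COLLECT_TEMPLATE, trim_mode: "-").result_with_hash(keyspaces: keyspaces)
end

puts render_collect_lines({ "Demo" => ["users", "events"] })
```

Because the template is driven entirely by the discovered schema, adding a column family in Cassandra changes the monitoring config on the next Chef run with no hand edits.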