The Automation Factory

A #NYCCassandra2013 talk wherein I outline Outbrain's automation infrastructure and how we go from metal to working cluster nodes.

Published in: Technology

  1. The Automation Factory
     nathan@milford.io
     blog.milford.io
     twitter.com/NathanMilford
     github.com/nmilford
  2. This is NOT strictly a Cassandra talk.
     ♫ There's no earthly way of knowing ♫
  3. This is an infrastructure talk.
     ♫ How your infrastructure's growing. ♫
  4. Startups move fast. Priorities change. Infrastructure needs to be able to pivot, too.
     ♫ Who knows where business is going. Or which way the data's flowing. ♫
  5. When you scale up, so do your problems.
     ♫ Drives imploding? IO plateauing? ♫
  6. Not to mention unexpected disasters. We lost a whole data center during Hurricane Sandy.
     ♫ Is a hurricane a-blowing? ♫
  7. How do you keep up with growth?
     ♫ There's no earthly way of knowing ♫
  8. How do you deal with failure?
     ♫ Are the status LEDs a-glowing? Is the server reaper mowing? ♫
  9. How do you deal with too much success?
     ♫ Yes! The danger must be growing
     For the data keeps on flowing. ♫
  10. What do you do?
      ♫ And they're certainly not showing any signs that they are slowing! ♫
  11. Hold your breath. Make a wish. Automate!
  12. ♫ Come with me
      And you'll be
      In a world of systems automation ♫
  13. ♫ Take a look
      And you'll see
      Into my Chef lucubrations
      So log in, install, begin
      With the Chef cookbook of my creation
      What you'll see might require
      Explanation ♫
  14. ♫ If you want to view paradise
      Simply go to GitHub and view it
      Pull requests welcome, go to it
      Want to change the code
      A merge will do it ♫
      https://github.com/linkedin/glu/
      https://github.com/octo/collectd/
      https://github.com/opscode/chef/
      https://github.com/saltstack/salt/
      https://github.com/outbrain/onering/
      https://github.com/nmilford/chef-cassandra/
      https://github.com/rabbitmq/rabbitmq-server/
  15. # Ask the local Cassandra node for its keyspaces and column families.
      def discover_cassandra_schema
        require "cassandra-cql"
        schema = {}
        server = "#{node[:ipaddress]}:#{node[:Cassandra][:rpc_port]}"
        # A failed connection leaves db as nil instead of raising.
        db = CassandraCQL::Database.new(server) rescue nil
        if db
          # Map each keyspace name to the names of its column families.
          db.keyspaces.each do |ks|
            schema[ks.name] = ks.column_families.collect { |cfname, _cfobj| cfname }
          end
          # Skip internal keyspaces.
          schema.delete("system")
          schema.delete("OpsCenter")
          return schema
        end
        return nil
      end

      ♫ There is no life I know
      To compare with writing automation
      Write it once
      You'll be free ♫
  16. *clickity* *clickity* *clickity*
      ♫ To play Diablo 3 ♫
  17. ♫ If you want to scale past a petabyte
      Just install Chef, Salt and Graphite
      If you want to sleep the whole night
      Automate the world
      It will be all right ♫
  18. ♫ There is no life I know
      To compare with writing automation
      Write it once
      You'll be free ♫
  19. ♫ If you truly wish to be. ♫
  20. The Automation Factory: A Journey from Bare Metal to Active Cassandra Node
      nathan@milford.io
      blog.milford.io
      twitter.com/NathanMilford
      github.com/nmilford
  21. Cassandra NYC 2011
      http://www.slideshare.net/nmilford/cassandra-for-sysadmins
  22. 2 Years Later
      ● 80 billion impressions a month.
      ● 4 clusters for disparate use cases, more in planning.
      ● 73 Cassandra nodes across 3 data centers.
  23. Mo Servers, Mo Problems
      We got multiple cages of servers. So... yeah... you can see where automation might help :)
  24. Automation Attack Plan
      ● Provisioning!
      ● Orchestration!
      ● Command and Control!
      ● Config Management!
      ● Monitoring and Alerting!
  25. Provisioning
      ● Started with Cobbler (which is awesome!).
      ● High-performance infrastructures are snowflakes and can get out of hand fast.
      ● No tool worked completely, end to end, and the tool won't write itself.
  26. We Built Our Own: Onering
      Note: I am only a moderate Lord of the Rings fan, and the guy who did most of the work on it, Gary Hetzel, is a Star Trek fan. We are not responsible for any LotR puns.
      https://github.com/outbrain/onering/
  27. Onering: Provisioning & Orchestration
      ● Initiates/manages provisioning and inventory.
      ● Acts as an orchestration layer in our automation.
      ● Keeps all metadata, which is searchable.
      ● Has a CLI tool and REST API to work with.
      ● Acts as our single point of truth & final authority on state.
  28. Onering Provisioning Workflow
      ➔ Developers put in machine requests by role for the quarterly order.
      ➔ Machines show up, get racked and powered on.
      ➔ Machines boot into the Razor microkernel and report to Onering.
      ➔ Appropriate nodes get kickstarted & bootstrapped into the roles specified.
      ➔ Additional nodes sit idle in an allocatable state.
      ➔ Once the OS is installed, configuration is handed off to...
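
The allocation step in the workflow above can be sketched in a few lines of Ruby. This is an illustration only: `Node`, `assign_roles`, and the state names are hypothetical and are not taken from Onering's code or API. Freshly reported nodes are matched against the outstanding role requests, and whatever is left over stays allocatable.

```ruby
# Hypothetical sketch: the names and states are illustrative, not Onering's API.
Node = Struct.new(:id, :state, :role)

# Match reported nodes against role requests ({ "role" => count }).
# Nodes that receive a role move to :provisioning; leftovers sit in :allocatable.
def assign_roles(nodes, requests)
  pending = requests.flat_map { |role, count| [role] * count }
  nodes.each do |node|
    role = pending.shift
    if role
      node.role = role
      node.state = :provisioning
    else
      node.state = :allocatable
    end
  end
  nodes
end
```

For example, three reported nodes and a request for two `cassandra` machines leave one node idle and allocatable for the next request.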
  29. Config Management: Chef
      ● Onering bootstraps into a Chef run.
      ● Chef installs all the system stuff.
      ● Chef sets up Java and tunes the OS how we like.
      ● Chef runs the Cassandra Cookbook.

          # Java first, then the Cassandra package and its main config file.
          include_recipe "java"

          package "apache-cassandra1" do
            action :install
          end

          template "/etc/cassandra/conf/cassandra.yaml" do
            owner "cassandra"
            group "cassandra"
            mode "0755"
            source "cassandra.yaml.erb"
          end

      https://github.com/opscode/chef/
  30. Cassandra Cookbook does it all!
      ● Builds/mounts disks.
      ● Handles multiple clusters, different versions.
      ● Generates configs (in some cases automatically based on hardware profile).
      ● Connects to the local instance and gets the schema.
      ● Generates the collectd config and maintenance script.
      ● Schedules maintenance.
      https://github.com/nmilford/chef-cassandra
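
To make "generates configs based on hardware profile" concrete, here is a minimal sketch, not the cookbook's actual logic. The helper name `cassandra_settings` and its inputs are hypothetical. The heap rule and the 8-writes-per-core figure are commonly cited Cassandra rules of thumb and are used here as assumptions.

```ruby
# Hypothetical helper: derive a few Cassandra settings from a hardware profile.
# The sizing rules below are assumed rules of thumb, not the cookbook's code.
def cassandra_settings(memory_mb:, cores:, data_mounts:)
  half    = [memory_mb / 2, 1024].min   # half of RAM, capped at 1 GB
  quarter = [memory_mb / 4, 8192].min   # quarter of RAM, capped at 8 GB
  {
    max_heap_mb: [half, quarter].max,
    concurrent_writes: 8 * cores,
    data_file_directories: data_mounts.map { |m| File.join(m, "cassandra", "data") }
  }
end
```

A template such as `cassandra.yaml.erb` could then read these values from node attributes, so identical recipes produce different configs on different hardware.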
  31. Glu: Continuous Deployment
      ● Not related to getting a C* node to production, but it's how we get apps there.
      ● Built at LinkedIn.
      ● Onering talks to it!
      ● Holds deployment metadata.
      ● Maven builds an RPM and dumps it to a repo.
      ● Glu-Agent yum installs it and performs checks.
      https://github.com/linkedin/glu
  32. Command & Control: Salt
      Distributed commands:

          salt '*ny*' cassandra.column_families
          salt 'cass*' cassandra.compactionstats
          salt '*stg*' cassandra.info
          salt 'cass1.ny.*' cassandra.keyspaces
          salt -E 'cass1-(stg|prod)' cassandra.netstats
          salt '*' cassandra.tpstats

      Scary commands:

          salt '*' --batch-size 25% service.restart cassandra
          # Single quotes so that $(hostname) expands on each minion, not on the master.
          salt '*' -b 2 cmd.run 'nodetool -h $(hostname) -p 7199 snapshot'

      We actually wrap salt in Onering to provide AAA, as well as to allow use of Onering metadata for node targeting.
      https://github.com/saltstack/salt
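
The reason `--batch-size 25%` makes a cluster-wide restart less scary is that only a quarter of the nodes are touched at any one time. A rough Ruby sketch of the idea follows. `rolling_batches` is a hypothetical helper, and fixed slices with round-up sizing are simplifying assumptions, not a description of how Salt schedules its batches.

```ruby
# Hypothetical sketch of percentage batching; not Salt's implementation.
# Splits hosts into consecutive groups covering `percent` of the fleet each,
# rounding up and never going below one host per batch.
def rolling_batches(hosts, percent)
  size = [(hosts.size * percent / 100.0).ceil, 1].max
  hosts.each_slice(size).to_a
end
```

With eight hosts and 25%, this yields four batches of two, so at least six nodes keep serving while each pair restarts.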
  33. Monitoring Is Hard...
  34. Common Monitoring & Events Bus
      ● A single infrastructure-wide bus for systems data:
          – Metrics
          – Events
          – Metadata
      ● Collectd as systems agent.
      ● RabbitMQ as message bus.
      ● Graphite as metrics endpoint.
      ● Working on an events mechanism.
      ● Each layer should be interchangeable.
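
Part of what keeps the layers interchangeable is how simple the metric payload is. Graphite's plaintext format is one line per datapoint: metric path, value, Unix timestamp. A minimal sketch of producing such a line is below. `graphite_line` and the metric path in the example are hypothetical, and the sketch says nothing about how collectd or RabbitMQ are configured in this setup.

```ruby
# Format one datapoint in Graphite's plaintext format: "path value timestamp\n".
# The helper name is illustrative; any producer that emits this line can feed Graphite.
def graphite_line(path, value, time = Time.now)
  "#{path} #{value} #{time.to_i}\n"
end
```

For instance, `graphite_line("collectd.machines.ny.cass201.load", 1.5)` gives a line that any agent, bus, or script can hand to the metrics endpoint.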
  35. Collectd
      ● Been around forever.
      ● Had to rebuild the JMX plugin to not use OpenJDK.
      ● Easy to write plugins and extend.
      ● Writes to RabbitMQ out of the box.
      ● Easy to templatize config for Chef.

          <% @node[:Cassandra][:Keyspaces].each do |ks| -%>
          <% ks[1].each do |cf| -%>
            Collect "<%= ks[0] %>.<%= cf %>"
            Collect "KeyCache.<%= ks[0] %>.<%= cf %>"
            Collect "RowCache.<%= ks[0] %>.<%= cf %>"
          <% end -%>
          <% end -%>

      https://github.com/octo/collectd
  36. RabbitMQ
      ● Lots of apps support AMQP.
      ● Shovel plugin for multi-site.
      ● Pretty stable.
      ● I'm not mad at it.
      https://github.com/rabbitmq/rabbitmq-server
  37. Graphite
      ● Plays well with RabbitMQ.
      ● Easy to get metrics into.
      ● Scads of functions.
      ● Easy to get meaningful data out of.
      https://launchpad.net/graphite
  38. Graphite Render, Activate!
      http://graphite/render?width=800&height=600&from=-2hours&until=now&target=sortByMaxima(highestCurrent(collectd.machines.*.cass2*.GenericJMX.ReadStage.PendingTasks,5))&target=sortByMaxima(highestCurrent(collectd.machines.*.cass2*.GenericJMX.MutationStage.PendingTasks,5))&hideLegend=false
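
URLs like the one above get long quickly, so it can help to build them in code. The following is a sketch under assumptions: `render_url` and `top_pending` are hypothetical helper names, and the host `graphite` plus the metric paths are simply reused from the slide. Only Ruby's standard `uri` library is used.

```ruby
require "uri"

# Hypothetical helper: the N busiest nodes for one Cassandra thread-pool stage,
# using the same metric naming as the slide.
def top_pending(stage, count = 5)
  "sortByMaxima(highestCurrent(collectd.machines.*.cass2*.GenericJMX.#{stage}.PendingTasks,#{count}))"
end

# Hypothetical helper: build a Graphite render URL with one target per entry.
def render_url(host, targets, from: "-2hours", width: 800, height: 600)
  params = [["width", width], ["height", height], ["from", from], ["until", "now"]]
  params += targets.map { |t| ["target", t] }
  "http://#{host}/render?#{URI.encode_www_form(params)}"
end
```

Calling `render_url("graphite", [top_pending("ReadStage"), top_pending("MutationStage")])` yields a percent-encoded equivalent of the URL on the slide, minus `hideLegend`.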
  39. Alerting: Nagios Self Serve
      ● Uses Onering for new node discovery.
      ● Developers add their own alerts based off of Graphite data.
      ● Ops get fewer alerts and are not a bottleneck.
      ● Devs are more engaged.
      ● Everyone is happy.
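
A self-serve alert based on Graphite data can be as small as a threshold check over a render response. The sketch below is an illustration, not the checks used in this setup. `check_series` is a hypothetical name, the input is assumed to be the body of a render call made with `format=json`, and the return codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```ruby
require "json"

STATES = { 0 => "OK", 1 => "WARNING", 2 => "CRITICAL", 3 => "UNKNOWN" }.freeze

# Hypothetical check: take the latest non-null value of each series in a
# Graphite JSON response and compare the worst one against the thresholds.
# Returns [exit_code, message] in Nagios plugin style.
def check_series(json, warn:, crit:)
  latest = JSON.parse(json).map { |s| s["datapoints"].map(&:first).compact.last }.compact
  return [3, "UNKNOWN: no datapoints"] if latest.empty?

  worst = latest.max
  code = if worst >= crit
           2
         elsif worst >= warn
           1
         else
           0
         end
  [code, "#{STATES[code]}: value=#{worst}"]
end
```

A wrapper script would fetch the JSON, print the message, and exit with the code, which is all Nagios needs from a plugin.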
  40. Questions?
