
SaltConf 2014: Safety with powertools

SaltConf 2014 keynote - Thomas Jackson, LinkedIn
Safety with Power tools
As infrastructure scales, simple tasks become increasingly difficult. For large infrastructures to be manageable, we use automation. But automation, like any power tool, comes with its own set of risks and challenges. Automation should be handled like production code, and great care should be exercised with power tools. This talk will cover how SaltStack is used at LinkedIn and offer tips and tricks for automating management with SaltStack at massive scale including a look at LinkedIn-inspired Salt features such as blacklist and pre-req states. It will also cover Salt master and minion instrumentation and a compilation of how not to use Salt.

Published in: Technology

  1. 1. Safety with power tools ©2013 LinkedIn Corporation. All Rights Reserved.
  2. 2. Who’s this guy?
  3. 3. What is SRE?
       • Hybrid of operations and engineering
       • Heavily involved in architecture and design
       • Application support ninjas
       • Masters of automation
  4. 4. So, what do I do with salt?
       • Heavy user
       • Active developer
       • Administrator (less so)
  5. 5. What’s LinkedIn?
       • Professional social network
       • You probably all have an account
       • You probably all get email from us too
  6. 6. Salt @ LinkedIn
       • When LinkedIn started
           – Aug 2011: Salt 0.8.9
           – ~5k minions
       • When I got involved
           – May 2012: Salt 0.9.9
           – ~10k minions
       • Today
           – Now: 2014.01
           – ~30k minions
  7. 7. How should you manage a service?
  8. 8. That’s not much of an answer…
       • Depends on use
           – Home
           – School
           – Hack
           – Work
       • How you manage the service changes over time
           – Make it work: very manual, takes a long time to get working (more of a work of art…)
           – Reproducibly make it work
           – Script it out
           – And more?
  9. 9. Apache Traffic Server
  10. 10. ATS: Apache Traffic Server
        • Fast, scalable and extensible HTTP/1.1 compliant caching proxy server
        • Non-blocking IO
        • Plugin architecture
        • This is the real logo
  11. 11. Example: ATS deployment @ LinkedIn
        • When I started, deployment was less than ideal:
            – Check into SVN
            – SCP files to hosts
            – Manually remove host from rotation
            – Replace files and install RPMs
            – Restart trafficserver
            – Check some logs to see if it’s broken
            – Put it in rotation and hope you didn’t miss anything
  12. 12.
  13. 13. Example: ATS deployment @ LinkedIn
        • So many steps!
            – Manual config management
            – Manual rpm deployment
            – Manual * (<- seriously, you name it!)
        • Works for a while, but doesn’t scale
        • Very VERY error prone
  14. 14. Solution? Automation with Salt!
        • Pillars, runners, and modules, Oh My!
        • States make this dead simple
  15. 15. Obligatory SLS formulas

        ats:
          pkg:
            - installed
            - pkgs:
              - trafficserver: x.x.x-xx
              - trafficserver-plugin-header-rewrite: x.x.x-x
              ... (there are lots)
          service:
            - name: trafficserver
            - running

        /etc/trafficserver/records.config:
          file.managed:
            - makedirs: True
            - user: nobody
            - group: nobody
            - mode: 600
            - source: http://repo/ats/records.config
            - source_hash: md5=20d90b82bb3a4f95d7f17d1be6257246

  16. 16. Great, SLS – like I wasn’t going to see those @ SaltConf
        • Had to, sorry!
  17. 17. What is Salt?
  18. 18. What is Salt @ LinkedIn?
        • Remote execution
            – salt '*' cmd.run date -s "`date`" (leap-pocalypse anyone?)
        • “Catchall” deployment system
            – ATS
            – Couchbase
            – Etc.
        • Automation platform
            – Remote execution behind LinkedIn’s new standardized deployment
            – Cache copy + torrent-style file distribution (in migration to Salt!)
  19. 19. So what’s this about power tools?
        • Growing up, my dad and I did a lot of cabinetry work
        • In the old days you did all this by hand
        • There are actually quite a few similarities
  20. 20. Learning to be a carpenter
        • When learning anything, you start with the basics and move up
            – Calculator-less math classes anyone?
        • Carpentry 101: learn the basic tools
            – Hand saws
            – Sandpaper
            – Hammer
  21. 21. Learning to be a carpenter
        • As a kid I always thought it was ridiculous to use these since I could *see* the power tools my dad was using
        • With more experience you can use more tools, once you know how to use the ones you have
            – Tools need to be respected and used properly
            – Some tools aren’t worth learning the hard way (chainsaws!)
  22. 22. So, SaltConf is about carpentry??
        • Well, not so much
        • Computers have lots of different tools
            – ssh
            – scp
            – Package managers
            – Etc.
        • As we scale it’s no longer practical to use all these manual tools, so we use power tools (automation)
  23. 23. How should you use Salt?
        • Understand the problem
        • Learn the tool
        • Test the solution
        • Watch for the result
  24. 24. How should you use Salt: Understand the problem
        • “If you can't explain it simply, you don't understand it well enough.” – Albert Einstein
        • What are you trying to automate?
            – Is this full stack? Or just the application?
            – What is already automated?
            – Should it be automated?
        • Learn how to do it without the tooling
            – Knowing how to do the deploy manually will help you when you need to debug
  25. 25. How should you use Salt: Learn the tool
        • “99% of the time you don’t have to write modules to use salt”
            – *Most* things you want to do can be done with existing code
            – If you find something that you think needs new code, reach out to the community; someone else probably wants it too!
        • Learn what it can and can’t do
        • Keep up with new features, both those just released and those still on the way
        • Continually train yourself and your users
        • Little things can add up:
            – In your __virtual__ function, check your dependencies (~5 lines x ~30K minions)
  26. 26. How should you use Salt: Test the Solution
        • Don’t be that guy
  27. 27. How should you use Salt: Test the Solution
        • Fact: “AUTOMATION IS CODE!”
        • It is common to set up extensive tests for code, but less so for automation
        • In many ways automation testing is just as important, if not more so!
            – This applies to SLS formulas, modules, runners, AND salt itself.
            – Staging is production for infrastructure!
  28. 28. How should you use Salt: Test the Solution
        • How do we do this @ LinkedIn?
            – Code reviews
            – VM environment: a pre-staging environment for testing
            – Stress tests: pathological test cases
            – Canary process: careful code rollouts
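
The canary idea in the list above can be sketched as a rollout in growing waves. This is a hypothetical illustration, not LinkedIn's process: the function names, the batch sizes, and the `deploy` and `healthy` callbacks are all assumptions.

```python
def canary_batches(hosts, first=1, factor=10):
    """Yield growing batches of hosts: a small canary first, then wider waves."""
    start, size = 0, first
    while start < len(hosts):
        yield hosts[start:start + size]
        start += size
        size *= factor


def rollout(hosts, deploy, healthy):
    """Deploy batch by batch; stop after the first batch with an unhealthy host."""
    done = []
    for batch in canary_batches(hosts):
        for host in batch:
            deploy(host)
        done.extend(batch)
        if not all(healthy(h) for h in batch):
            return False, done  # halt before touching the remaining hosts
    return True, done
```

With 25 hosts this touches 1 host, then 10, then the remaining 14, so a bad change is caught while most of the fleet is still untouched.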
  29. 29. How should you use Salt: Watch for the result
        • Once we’ve tested our automation, we need to verify that it does what we expect.
            – Code can sometimes have unintended consequences
  30. 30. Innocent enough right?

        @_withJMXConnection
        def domains(connection):
            '''
            returns a list of domains available
            '''
            domains = list(connection.getDomains())
            domains.sort()
            return domains

        Wait, what’s that decorator?
  31. 31. See the problem?

        class _withJMXConnection(object):
            connection = None

            def __init__(self, fn, url):
                self.fn = fn
                if not _withJMXConnection.connection:
                    # set up a jmx connection
                    ...
                    jpype.startJVM("libjvm.so",
                                   "-Dcom.sun.management.jmxremote.authenticate=false",
                                   "-Xms20m", "-Xmx20m")
                    jmxurl = jpype.javax.management.remote.JMXServiceURL(url)
                    jmxsoc = jpype.javax.management.remote.JMXConnectorFactory.connect(jmxurl)
                    _withJMXConnection.connection = jmxsoc.getMBeanServerConnection()
                self.connection = _withJMXConnection.connection

        Spins up a JVM!
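
In the class on this slide, `__init__` runs when the decorator is applied, which happens when the module is imported, so the JVM starts even if `domains` is never called. One way to avoid paying that cost up front is to defer the setup to the first call. The sketch below is an assumption-laden illustration, not the fix used at LinkedIn: `with_lazy_connection` and `fake_connect` are hypothetical names, and a plain dict stands in for the JMX connection.

```python
import functools


def with_lazy_connection(connect):
    """Decorator factory: create the connection on first call, then reuse it."""
    cache = {}

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if "conn" not in cache:
                cache["conn"] = connect()  # expensive step happens here, once
            return fn(cache["conn"], *args, **kwargs)
        return wrapper
    return decorator


calls = []


def fake_connect():
    """Stand-in for the costly JVM/JMX setup; records each invocation."""
    calls.append("connect")
    return {"domains": ["java.lang", "JMImplementation"]}


@with_lazy_connection(fake_connect)
def domains(connection):
    """Return the sorted list of available domains."""
    return sorted(connection["domains"])
```

Decorating `domains` no longer triggers `fake_connect`; the connection is built the first time the function is actually called and shared afterwards.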
  32. 32. How should you use Salt: Watch it
        • Once we’ve tested our automation, we need to verify that it does what we expect.
            – Code can sometimes have unintended consequences
        • What metrics do we watch?
            – CPU (load and utilization)
            – Memory (real AND virtual)
            – TCP sessions (and overflows!)
            – Event bus (MasterEvent and MinionEvent)
            – Etc.
  33. 33. Now everything is AWESOME!!!
  34. 34. NOPE! Still can have problems
  35. 35. Problems @ scale
        • Timeouts that didn’t work
            – (#3431) original implementation relied on the zmq poller timeout, which you never hit if the event bus was relatively busy
        • salt-master memory leaks (all gone now)
            – Zeromq3
            – Reaping master child processes which crash
        • Performance problems on master (we’ve dropped CPU usage by ~80%)
            – Change max open files check to not run per minion request
            – Don’t load minion modules every pillar call
        • Slow yumpkg5 module
            – Went from 20s -> 60s! Now down to ~9s (for 55 packages)
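
The timeout problem on this slide is a general one: a per-poll timeout only fires when the bus is quiet, so a busy bus keeps resetting it. A common remedy is to track an overall deadline and check it on every pass. The sketch below illustrates that idea with a standard-library queue standing in for the event bus; it is not Salt's code, and `wait_for_event` and its parameters are hypothetical.

```python
import queue
import time


def wait_for_event(events, wanted, timeout, poll_interval=0.05):
    """Return the first event equal to `wanted`, or None once `timeout` elapses.

    The deadline is checked on every loop pass, so a steady stream of
    unrelated events cannot postpone the timeout indefinitely.
    """
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        try:
            # Never block longer than the time left before the deadline.
            event = events.get(timeout=min(poll_interval, remaining))
        except queue.Empty:
            continue
        if event == wanted:
            return event
```

Using `time.monotonic()` keeps the deadline immune to wall-clock adjustments, which matters on hosts where the system time may be reset.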
  36. 36. Other features we’ve added
        • yumpkg
            – support for specific versions (back in the day)
            – major performance enhancements to the yumpkg module
        • Compound matchers (range & minion data)
        • Prereq state
        • Client_acl_blacklist
        • Check and set (cas) to the data module
        • depends decorator
        • iterative file hashing in fileclient
        • hash cache for fileserver + hash cache reaping
        • limit memory consumption on module load in *nix
        • kwarg passing with types
        • Profiler within master process
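
"Iterative file hashing" from the list above means feeding a file to the hash in chunks instead of reading it into memory at once. The sketch below shows the general technique with the standard library; it is not the fileclient code, and the function name, default algorithm, and chunk size are assumptions.

```python
import hashlib


def hash_file(path, algorithm="sha256", chunk_size=64 * 1024):
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Memory use is bounded by `chunk_size` regardless of file size, which matters when many large files are hashed on a busy master or minion.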
  37. 37. client_acl_blacklist (new in 0.13.0)
        • Salt had support for whitelisting, and per-user access control
        • Wanted to blacklist certain modules/users
            – No root (require sudo)
            – No cmd module (protect against fat-fingering)

        client_acl_blacklist:
          users:
            - root
            - '^(?!sudo_).*$'   # all non sudo users
          modules:
            - cmd
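
The second `users` entry on this slide relies on a negative lookahead. The sketch below checks the pattern with Python's `re` module, assuming the entry is applied as a regular expression against the requesting user name; `is_blacklisted` is a hypothetical helper, not a Salt function.

```python
import re

# Pattern copied from the slide: the negative lookahead rejects any name
# that starts with "sudo_", so it matches every *other* user name.
NON_SUDO = re.compile(r"^(?!sudo_).*$")


def is_blacklisted(user):
    """True if the user name would hit the 'all non sudo users' entry."""
    return NON_SUDO.match(user) is not None
```

Under that assumption, `alice` and `root` match the entry while `sudo_alice` does not, so only names carrying the `sudo_` prefix get through.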
  38. 38. Prereq state (new in 0.16.0)
        • Came up as we started migrating our deployments to salt states
        • Motivation was to take hosts out of rotation before deployment
        • This feature lets us remove our own custom wrappers!

        graceful-down:
          cmd.run:
            - name: service apache graceful
            - prereq:
              - file: site-code

        site-code:
          file.recurse:
            - name: /opt/site_code
            - source: salt://site/code
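
The ordering that `prereq` gives in the example on this slide can be modelled with a toy function. This is a simplified sketch of the behaviour, not Salt's implementation: the target state (`site-code`) is first asked, as a test run, whether it expects changes, and only then does the pre-requiring state (`graceful-down`) run ahead of it. All names below are hypothetical.

```python
def run_with_prereq(target_would_change, run_prereq, run_target):
    """Toy model of prereq ordering (not Salt's implementation).

    1. Ask the target state, in test mode, whether it expects changes.
    2. Only if it does, run the pre-requiring state first.
    3. Then run the target state for real.
    """
    log = []
    if target_would_change():
        log.append(run_prereq())
    log.append(run_target())
    return log
```

When the site code is already up to date, the graceful restart is skipped entirely, which is what makes the requisite useful for taking hosts out of rotation only when a deployment will actually happen.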
  39. 39. Kwarg passing with types
        • Found while trying to pass a pillar as a kwarg to a module (p.s. don’t)
        • Kwargs were cast as strings and passed as an arg
            – Fine if the __str__ representation == yaml
            – Problem if the __str__ representation != yaml
        • Put all kwargs in a single dict (marked as the kwarg dict) to maintain type
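
The fix described on this slide, one marked dict carrying all kwargs, can be sketched as below. This is an illustration under assumptions rather than Salt's code: the marker key `__kwarg__`, the helper names, and the use of JSON as a stand-in for the real wire format are all choices made for the example.

```python
import json


def pack_call(args, kwargs):
    """Append kwargs as one marked dict so values keep their types."""
    packed = list(args)
    if kwargs:
        packed.append(dict(kwargs, __kwarg__=True))
    return packed


def unpack_call(packed):
    """Split a packed list back into positional args and typed kwargs."""
    args, kwargs = [], {}
    for item in packed:
        if isinstance(item, dict) and item.get("__kwarg__") is True:
            kwargs.update({k: v for k, v in item.items() if k != "__kwarg__"})
        else:
            args.append(item)
    return args, kwargs


def roundtrip(args, kwargs):
    """Serialize and restore a call, as a stand-in for the wire transport."""
    return unpack_call(json.loads(json.dumps(pack_call(args, kwargs))))
```

Compare this with building `"refresh=" + str(False)`: the receiver gets the text `refresh=False` and has to guess the type, whereas the marked dict delivers an actual boolean, list, or nested dict.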
  40. 40. Takeaways
        • Respect the tool!
            – Understand the problem
            – Learn the tool
            – Test the solution
            – Watch for the result
        • Be active in the community
        • Don’t just consume, Contribute!
        • Have FUN!
  41. 41. Got more questions about Salt @ LinkedIn
        • Interested in how we manage Salt @ Scale?
            – Breakout session with Craig Sebenik @ 11:15 am in Sundance
        • Got questions?
            – Drop by our SaltConf booth!
            – Connect with me on LinkedIn www.linkedin.com/in/jacksontj
            – Jacksontj on #salt on freenode
