©2013 LinkedIn Corporation. All Rights Reserved.
SaltConf14 - Thomas Jackson, LinkedIn - Safety with Power Tools
As infrastructure scales, simple tasks become increasingly difficult. For large infrastructures to be manageable, we use automation. But automation, like any power tool, comes with its own set of risks and challenges. Automation should be handled like production code, and great care should be exercised with power tools. This talk will cover how SaltStack is used at LinkedIn and offer tips and tricks for automating management with SaltStack at massive scale including a look at LinkedIn-inspired Salt features such as blacklist and prereq states. It will also cover Salt master and minion instrumentation and a compilation of how not to use Salt.

Published in: Technology
  • SaltConf keynote review - Thomas Jackson, LinkedIn: Safety with Power Tools. As infrastructure scales, simple tasks become increasingly difficult. For large infrastructures to be manageable, we use automation. But automation, like any power tool, comes with its own set of risks and challenges. Automation should be handled like production code, and great care should be exercised with power tools. This talk will cover how SaltStack is used at LinkedIn and offer tips and tricks for automating management with SaltStack at massive scale including a look at LinkedIn-inspired Salt features such as blacklist and pre-req states. It will also cover Salt master and minion instrumentation and a compilation of how not to use Salt.
  • Professional, fewer cats
  • How long we’ve been using it. Tom is embarrassed that we started so early. 0.8.9: runners just added, outputters just added, cross-calling Salt modules using __salt__. 0.9.9: highstate test=True, external pillar, minion swarm.
  • We all manage some service, so let’s talk about it
  • To get some context, I’m going to talk a little about the main service I support
  • When I started, ATS was new, so we had a lot of manual things ;) First question is going to be…
  • Consistency problems (missing a file or a package). Missing a log entry.
  • Really are useful and AMAZINGLY simple
  • Taken from Wikipedia. TODO: get a better one?
  • Remote- faster than old one 30m
  • 14m
  • Tools to use all the tools. One such tool is Salt; as with any other tool, there are some things to keep in mind while using it.
  • Similar to power tools
  • Other automation (don’t step on toes). Clear lines of ownership.
  • JMX example
  • Nice docstring! What’s up with the decorator?
  • startJVM, WAT? Our feature: limit memory consumption on module load in *nix (modules_max_memory)
  • Consume event bus to get information on jobs running
  • Well, not necessarily..
  • Lots of features (and more coming, I’m sure). Let’s take some time to talk about two.
  • Migrating because we had re-implemented states before it existed, had wrappers to do OOR/IR operations. Another example of where something you think you need, others want too!
  • Not only new features, but we find and fix bugs too! Find the root cause; it is usually simpler than you’d think.
  • Transcript of "SaltConf14 - Thomas Jackson, LinkedIn - Safety with Power Tools"

    1. Safety with power tools
    2. Who’s this guy?
    3. What is SRE?
        • Hybrid of operations and engineering
        • Heavily involved in architecture and design
        • Application support ninjas
        • Masters of automation
    4. So, what do I do with salt?
        • Heavy user
        • Active developer
        • Administrator (less so)
    5. What’s LinkedIn?
        • Professional social network
        • You probably all have an account
        • You probably all get email from us too
    6. Salt @ LinkedIn
        • When LinkedIn started
            – Aug 2011: Salt 0.8.9
            – ~5k minions
        • When I got involved
            – May 2012: Salt 0.9.9
            – ~10k minions
        • Today
            – Now: 2014.01
            – ~30k minions
    7. How should you manage a service?
    8. That’s not much of an answer…
        • Depends on use
            – Home
            – School
            – Hack
            – Work
        • How you manage the service changes over time
            – Make it work: very manual, takes a long time to get it to work (more of a work of art…)
            – Reproducibly make it work
            – Script it out
            – And more?
    9. Apache Traffic Server
    10. ATS: Apache Traffic Server
        • Fast, scalable and extensible HTTP/1.1 compliant caching proxy server.
        • Non-blocking IO
        • Plugin architecture
        • This is the real logo
    11. Example: ATS deployment @ LinkedIn
        • When I started, deployment was less than ideal:
            – Check into SVN
            – SCP files to hosts
            – Manually remove host from rotation
            – Replace files and install RPMs
            – Restart trafficserver
            – Check some logs to see if it’s broken
            – Put it in rotation and hope you didn’t miss anything
    12.
    13. Example: ATS deployment @ LinkedIn
        • So many steps!
            – Manual config management
            – Manual rpm deployment
            – Manual * (<- seriously, you name it!)
        • Works for a while, but doesn’t scale
        • Very VERY error prone
    14. Solution? Automation with Salt!
        • Pillars, runners, and modules, Oh My!
        • States make this dead simple
    15. Obligatory SLS formulas

            ats:
              pkg:
                - installed
                - pkgs:
                  - trafficserver: x.x.x-xx
                  - trafficserver-plugin-header-rewrite: x.x.x-x
                  ... (there are lots)
              service:
                - name: trafficserver
                - running

            /etc/trafficserver/records.config:
              file.managed:
                - makedirs: True
                - user: nobody
                - group: nobody
                - mode: 600
                - source: http://repo/ats/records.config
                - source_hash: md5=20d90b82bb3a4f95d7f17d1be6257246

    16. Great, SLS: like I wasn’t going to see those @ SaltConf
        • Had to, sorry!
    17. What is Salt?
    18. What is Salt @ LinkedIn?
        • Remote execution
            – salt * cmd.run date -s "`date`" (leap-pocalypse anyone?)
        • “Catchall” deployment system
            – ATS
            – Couchbase
            – Etc.
        • Automation platform
            – Remote execution behind LinkedIn’s new standardized deployment
            – Cache copy + torrent-style file distribution (in migration to Salt!)
    19. So what’s this about power tools?
        • Growing up, my dad and I did a lot of cabinetry work
        • In the old days you did all this by hand
        • There are actually quite a few similarities
    20. Learning to be a carpenter
        • Learning in general: you start with the basics and move up
            – Calculator-less math classes anyone?
        • Carpentry 101: learn the basic tools
            – Hand saws
            – Sandpaper
            – Hammer
    21. Learning to be a carpenter
        • As a kid I always thought it was ridiculous to use these since I could *see* the power tools my dad was using
        • With more experience you can use more tools, once you know how to use the ones you have
            – Tools need to be respected and used properly
            – Some tools aren’t worth learning the hard way (chainsaws!)
    22. So, SaltConf is about carpentry??
        • Well, not so much
        • Computers have lots of different tools
            – ssh
            – scp
            – Package managers
            – Etc.
        • As we scale it’s no longer practical to use all these manual tools, so we use power tools (automation)
    23. How should you use Salt?
        • Understand the problem
        • Learn the tool
        • Test the solution
        • Watch for the result
    24. How should you use Salt: Understand the problem
        • “If you can't explain it simply, you don't understand it well enough.” – Albert Einstein
        • What are you trying to automate?
            – Is this full stack? Or just the application?
            – What is already automated?
            – Should it be automated?
        • Learn how to do it without the tooling
            – Knowing how to do the deploy manually will help you when you need to debug
    25. How should you use Salt: Learn the tool
        • “99% of the time you don’t have to write modules to use salt”
            – *Most* things you want to do can be done with existing code
            – If you find something that you think needs new code, reach out to the community; someone else probably wants it too!
        • Learn what it can and can’t do
        • Keep up with new features coming out as well as coming up
        • Continually train yourself and your users
        • Little things can add up:
            – In your __virtual__ function, check your dependencies (~5 lines x ~30K minions)
    26. How should you use Salt: Test the Solution
        • Don’t be that guy
    27. How should you use Salt: Test the Solution
        • Fact: “AUTOMATION IS CODE!”
        • It is common to set up extensive tests for code, but less so for automation
        • In many ways automation testing is just as important, if not more!
            – This applies to SLS formulas, modules, runners, AND salt itself.
            – Staging is production for infrastructure!
    28. How should you use Salt: Test the Solution
        • How do we do this @ LinkedIn?
            – Code reviews
            – VM environment: a pre-staging environment for testing
            – Stress tests: pathological test cases
            – Canary process: careful code rollouts
    29. How should you use Salt: Watch for the result
        • Once we’ve tested our automation, we need to verify that it does what we expect.
            – Code can sometimes have unintended consequences
    30. Innocent enough, right?

            @_withJMXConnection
            def domains(connection):
                '''
                returns a list of domains available
                '''
                domains = list(connection.getDomains())
                domains.sort()
                return domains

        Wait, what’s that decorator?
    31. See the problem?

            class _withJMXConnection(object):
                connection = None
                def __init__(self, fn, url):
                    self.fn = fn
                    if not _withJMXConnection.connection:
                        # set up a jmx connection ...
                        jpype.startJVM("libjvm.so",
                                       "-Dcom.sun.management.jmxremote.authenticate=false",
                                       "-Xms20m", "-Xmx20m")
                        jmxurl = jpype.javax.management.remote.JMXServiceURL(url)
                        jmxsoc = jpype.javax.management.remote.JMXConnectorFactory.connect(jmxurl)
                        _withJMXConnection.connection = jmxsoc.getMBeanServerConnection()
                    self.connection = _withJMXConnection.connection

        Spins up a JVM!
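
A class-based decorator runs its `__init__` when the decorated function is defined, which is when the module is loaded. One general way to avoid paying that cost at load time is to defer the expensive setup until the first call. The sketch below illustrates that idea only; it is not the fix described in the talk (the speaker notes mention a memory limit on module load instead). `_connect` is a hypothetical stand-in for the JVM and JMX setup, and `with_lazy_connection` is an invented name.

```python
import functools

# Records each run of the expensive setup (for illustration only).
SETUP_CALLS = []

# Shared connection cache, filled on first use rather than at import time.
_CACHE = {}


def _connect():
    """Hypothetical stand-in for the expensive JVM + JMX connection setup."""
    SETUP_CALLS.append("connect")
    return {"domains": ["java.lang", "JMImplementation"]}


def with_lazy_connection(fn):
    """Pass a shared connection to fn, creating it on the first call only."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if "connection" not in _CACHE:
            _CACHE["connection"] = _connect()  # deferred until first use
        return fn(_CACHE["connection"], *args, **kwargs)

    return wrapper


@with_lazy_connection
def domains(connection):
    """Return a sorted list of the domains the connection reports."""
    return sorted(connection["domains"])
```

With this shape, loading the module costs nothing extra; only a minion that actually calls `domains()` pays for the connection, and it pays once.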
    32. How should you use Salt: Watch it
        • Once we’ve tested our automation, we need to verify that it does what we expect.
            – Code can sometimes have unintended consequences
        • What metrics do we watch?
            – CPU (load and utilization)
            – Memory (real AND virtual)
            – TCP sessions (and overflows!)
            – Event bus (MasterEvent and MinionEvent)
            – Etc.
    33. Now everything is AWESOME!!!
    34. NOPE! Still can have problems
    35. Problems @ scale
        • Timeouts that didn’t work
            – (#3431) original implementation relied on the zmq poller timeout, which you never hit if the event bus was relatively busy
        • salt-master memory leaks (all gone now)
            – Zeromq3
            – Reaping master child processes which crash
        • Performance problems on master (we’ve dropped CPU usage by ~80%)
            – Change max open files check to not run per minion request
            – Don't load minion modules every pillar call
        • Slow yumpkg5 module
            – Went from 20s -> 60s! Now down to ~9s (for 55 packages)
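
The timeout problem on slide 35 is a general pattern: a per-poll timeout only fires when the bus goes quiet, so a busy bus can keep a caller waiting indefinitely. A minimal sketch of the usual remedy, tracking one overall deadline, is below. It is not Salt's code; `wait_for_return`, the `poll` callable, and the event layout are all hypothetical.

```python
import time


def wait_for_return(poll, jid, timeout, clock=time.monotonic):
    """Wait up to `timeout` seconds in total for an event tagged with `jid`.

    `poll(remaining)` is a hypothetical callable that blocks for at most
    `remaining` seconds and returns an event dict, or None if nothing arrived.
    """
    deadline = clock() + timeout
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return None  # overall deadline reached, even if the bus never went quiet
        event = poll(remaining)
        if event is not None and event.get("jid") == jid:
            return event
```

Because the remaining time is recomputed on every iteration, unrelated events arriving on the bus shorten the next poll instead of resetting the wait.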
    36. Other features we’ve added
        • yumpkg
            – support for specific versions (back in the day)
            – major performance enhancements to the yumpkg module
        • Compound matchers (range & minion data)
        • Prereq state
        • Client_acl_blacklist
        • Check and set (cas) to the data module
        • depends decorator
        • iterative file hashing in fileclient
        • hash cache for fileserver + hash cache reaping
        • limit memory consumption on module load in *nix
        • kwarg passing with types
        • Profiler within master process
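
As an illustration of one item in this list, "iterative file hashing" generally means feeding a file to the hash in chunks instead of reading it into memory at once. The sketch below shows that general technique with the standard library; it is not the fileclient implementation, and the function name, default algorithm, and chunk size are assumptions.

```python
import hashlib


def hash_file(path, algorithm="sha256", chunk_size=64 * 1024):
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Memory use is bounded by `chunk_size` regardless of file size, which matters when many minions hash large files at the same time.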
    37. client_acl_blacklist (new in 0.13.0)
        • Salt had support for whitelisting, and per-user access control
        • Wanted to blacklist certain modules/users
            – No root (require sudo)
            – No cmd module (protect against fat-fingering)

            client_acl_blacklist:
              users:
                - root
                - '^(?!sudo_).*$' # all non sudo users
              modules:
                - cmd

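
The user pattern in this config relies on a negative lookahead: `^(?!sudo_).*$` matches any name that does not start with `sudo_`. The sketch below shows how such a blacklist could be evaluated. It illustrates the regex semantics only; `is_blacklisted` and its matching rules are hypothetical and are not Salt's actual ACL code.

```python
import re

# Mirrors the example config; the evaluation logic below is an assumption.
BLACKLIST = {
    "users": ["root", "^(?!sudo_).*$"],
    "modules": ["cmd"],
}


def is_blacklisted(user, fun, blacklist=BLACKLIST):
    """Return True if the user, or the module part of `fun`, is blacklisted."""
    if any(re.fullmatch(pattern, user) for pattern in blacklist["users"]):
        return True
    module = fun.split(".", 1)[0]  # "cmd.run" -> "cmd"
    return module in blacklist["modules"]
```

Under these rules `alice` is rejected for any function, `sudo_alice` may run `test.ping`, and nobody may run `cmd.run`.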
    38. Prereq state (new in 0.16.0)
        • Came up as we started migrating our deployments to salt states
        • Motivation was to take hosts out of rotation before deployment
        • This feature lets us remove our own custom wrappers!

            graceful-down:
              cmd.run:
                - name: service apache graceful
                - prereq:
                  - file: site-code

            site-code:
              file.recurse:
                - name: /opt/site_code
                - source: salt://site/code

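
The ordering that prereq gives can be modeled in a few lines: the state carrying the prereq runs first, and only when a dry run of the state it references reports pending changes. The sketch below is a simplified model of that behavior, not Salt's state compiler; the `State` class and `run_with_prereq` are hypothetical.

```python
class State:
    """Hypothetical stand-in for a Salt state with a dry-run mode."""

    def __init__(self, name, pending):
        self.name = name
        self.pending = pending  # True if applying would change the host

    def would_change(self):
        """Dry run, comparable to running the state with test=True."""
        return self.pending

    def apply(self):
        self.pending = False
        return self.name


def run_with_prereq(dependent, target):
    """Run `dependent` before `target`, but only if `target` would change something."""
    executed = []
    if target.would_change():
        executed.append(dependent.apply())  # e.g. take the host out of rotation
        executed.append(target.apply())     # e.g. deploy the new code
    return executed
```

In terms of the example above, `graceful-down` plays the role of `dependent` and `site-code` the role of `target`: if the code is already up to date, the service is never touched.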
    39. Kwarg passing with types
        • Found while trying to pass a pillar as a kwarg to a module (p.s. don’t)
        • Kwargs were cast as strings and passed as an arg
            – Fine if the __str__ representation == yaml
            – Problem if the __str__ representation != yaml
        • Put all kwargs in a single dict (marked as the kwarg dict) to maintain type
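
The difference between the two approaches on slide 39 can be shown without Salt. In the sketch below, `pack_as_strings` models the old behavior (each kwarg flattened to a `key=value` string) and `pack_as_dict` models the fix (one dict carrying all kwargs, flagged by a marker key). The function names are hypothetical, and the marker name `__kwarg__` is an assumption made for this sketch.

```python
def pack_as_strings(kwargs):
    """Old-style packing: every kwarg becomes a 'key=value' string argument."""
    return ["{0}={1}".format(key, value) for key, value in kwargs.items()]


def pack_as_dict(kwargs):
    """Typed packing: all kwargs travel in one dict flagged as the kwarg dict."""
    packed = dict(kwargs)
    packed["__kwarg__"] = True  # marker name is an assumption for this sketch
    return [packed]


def unpack(args):
    """Split a packed argument list back into positional args and kwargs."""
    positional, kwargs = [], {}
    for arg in args:
        if isinstance(arg, dict) and arg.get("__kwarg__") is True:
            kwargs.update((k, v) for k, v in arg.items() if k != "__kwarg__")
        else:
            positional.append(arg)
    return positional, kwargs
```

With string packing, `None` arrives as the text `None` and a list arrives as its `__str__` form, so the receiver has to guess the type back; with the flagged dict, values arrive with the type they were sent with.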
    40. Takeaways
        • Respect the tool!
            – Understand the problem
            – Learn the tool
            – Test the solution
            – Watch for the result
        • Be active in the community
        • Don’t just consume, Contribute!
        • Have FUN!
    41. Got more questions about Salt @ LinkedIn?
        • Interested in how we manage Salt @ Scale?
            – Breakout session with Craig Sebenik @ 11:15 am in Sundance
        • Got questions?
            – Drop by our SaltConf booth!
            – Connect with me on LinkedIn: www.linkedin.com/in/jacksontj
            – Jacksontj on #salt on freenode
