SaltConf14 - Craig Sebenik, LinkedIn - SaltStack at Web Scale

This talk will focus on the unique challenges of managing, at web scale, an application stack that lives on tens of thousands of servers spread across multiple data centers. Learn more about LinkedIn's unique topology and the development of an efficient build environment, and hear about LinkedIn's plans for a deployment system based on Salt. Also, all of the software that runs LinkedIn sends a LOT of data. To stay ahead of this tidal wave of data, the team must address scale challenges seen in very few environments through efficient use of monitoring and metrics systems. This talk will highlight the best practices and user training necessary for using SaltStack in large environments.

Published in: Technology

SaltConf14 - Craig Sebenik, LinkedIn - SaltStack at Web Scale

  1. ©2013 LinkedIn Corporation. All Rights Reserved. ©2014 LinkedIn Corporation. All Rights Reserved. Salt at Web Scale • Craig Sebenik, SRE • 29 January 2014 • SaltConf
  2. Who Am I? • Programming for 30-ish years • Scientific computing • Java and Perl developer (web apps) • HATE doing the same thing more than once • Been at LinkedIn over 3 years • From the very beginning of us using salt • Manage/architect the entire salt infrastructure at LinkedIn
  3. What is LinkedIn? • Social media company connecting the world’s professionals • 5000+ employees • Offices throughout the world • Based in Mountain View, CA
  4. How Big Is linkedin.com? • Several data centers • Customer facing apps (aka “production”) • Staging for production apps • Internal only apps • Several hundred apps • 30+K hosts • 90+% Linux • Solaris • Mac and Linux desktops
  5. LinkedIn Operations • Several operations groups • Systems (e.g. OS install/config, “rack and stack”) • Database Admins • Network • Application (i.e. SRE) • Different groups have different needs for automation
  6. What Is An SRE? • Help application developers deploy their apps • Advise on rollout plans • Coordinate rollouts • Generally, the group in between all of operations and all of the developers • Lots of troubleshooting • SREs write code (automation)
  7. SREs Use Salt • Using salt since 0.8.9 • Installation of new apps • Config management • Some troubleshooting
  8. Salt Architecture • Each physical data center • multiple “fabrics” (logical grouping of hosts) • single salt master (largest set of minions = 8+k) • warm backup (same private key) • minions configured with CNAME to master • Files stored in subversion • states, grains, modules • runners • reactor
  9. Building Salt • Internal fork from GitHub • Add another number to the version, e.g. 2014.01.0.0 • Allows for internal only patches • Create specific package for testing • same git repo, with same tags • LNKD-salt-dev-2014.01.0.0-12345.noarch.rpm • Allows for emergency changes elsewhere • salt-dev is deployed on a set of virtual machines • custom test suite is run
  10. Installing Salt • OS is managed by cfengine • cfengine will push new salt releases and restart minions • cfengine also manages minion configs • master is a set of RPMs • includes config • Solaris install is handled by systems team • Roll out to one data center at a time • Entire process can take over a week
  11. Salt Master • salt master is wrapped in a “runit” script • runit is a process supervisor • restarts the master if it dies/stops • salt API • use the reactor system to send metrics • metrics gathering is all home grown • trying to open source it • file updates (every 5 mins) • modules, states, grains
  12. Master Access • Logins to the host are managed via cfengine • Have to be in a whitelisted group to log on • Access to salt command controlled via sudo • sudo logs provide audit trail • Disable cmd.* from salt cli • If you want to automate, write a state and/or module • salt API access via a whitelist of IPs • Auth using LDAP • Only a handful of commands
  13. Minions • basic salt RPM • includes “salt” command (unfortunately) • module sync • every hour • small python script using client API • minion metrics • “age” of modules (via a tracker file) • uptime of minion
  14. Deployment With Salt • LinkedIn.com apps are deployed via a custom app • App is showing its age and needs to be replaced • Team outside of operations is writing new deployment app • Uses salt api • Has a lot of custom code • Not in salt • Needs to deploy locally (for testing) • This includes Mac desktop/laptops
  15. Custom Modules and States • couchbase management (via runner) • runit • Apache Traffic Server • metrics system • alerts • data collection • data display
  16. Module Promotion • Small oversight last year caused massive issues • Developed process to “promote” modules • Salt environments: • dev -> vm -> test -> stage -> prod • different dirs in svn • sparse directories • minions are configured to look at certain environments • Changes are managed with “review board”
  17. Problems • Education! • Most salt customizations in 2 groups (out of 10) • Few power users • Corrupted keys • Syncing only every hour • No syncing on Solaris • No highstate enforcement
  18. More Problems • Lots of CPU issues on master • Key management • Reinstall of OS with same host name
  19. Future • Multi master • shared job cache via file system isn’t what we want • investigating using a returner to share job info • More training • Whitelist of states • Non-ops users • E.g. devs that want to deploy just their code • Increase amount of data in grains
  20. More Future • Pillar data • Metrics • Better visibility when things go wrong • Tools to see job cache • Logs on master are too chatty • Ability to watch all traffic from a specific minion(s) • Key management • reactor system, possibly
  21. Questions? http://www.linkedin.com/in/craigsebenik
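
The "Building Salt" slide describes appending one more number to the upstream version and shipping a separately named package for testing. A minimal sketch of that naming scheme follows. The helper names are hypothetical, and reading the trailing 12345 in the example file name as a build number is an assumption; the slides do not say what it represents.

```python
def internal_version(upstream_version: str, internal_patch: int) -> str:
    """Append an internal patch level to an upstream version string."""
    return f"{upstream_version}.{internal_patch}"


def rpm_filename(package: str, version: str, build: int, arch: str = "noarch") -> str:
    """Assemble an RPM file name; 'build' is an assumed meaning for the last number."""
    return f"{package}-{version}-{build}.{arch}.rpm"


version = internal_version("2014.01.0", 0)
print(rpm_filename("LNKD-salt-dev", version, 12345))
# prints: LNKD-salt-dev-2014.01.0.0-12345.noarch.rpm
```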
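
The "Master Access" slide mentions disabling cmd.* from the salt CLI. The slides do not show how this was done, so the following is only an illustration of the idea: a glob-style check that rejects any execution function matching a blocked pattern. The names `BLOCKED_PATTERNS` and `is_allowed` are hypothetical and are not part of Salt.

```python
from fnmatch import fnmatchcase

# Patterns taken from the slide; everything else here is an illustrative assumption.
BLOCKED_PATTERNS = ("cmd.*",)


def is_allowed(function_name: str, blocked=BLOCKED_PATTERNS) -> bool:
    """Return False when the function name matches any blocked glob pattern."""
    return not any(fnmatchcase(function_name, pattern) for pattern in blocked)


for name in ("cmd.run", "test.ping"):
    print(name, "allowed" if is_allowed(name) else "blocked")
```

Blocking by pattern rather than by exact name means new functions added under the same module are covered without updating the list.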
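
The "Minions" slide describes an hourly module sync driven by a small Python script, with the "age" of modules reported from a tracker file. The sketch below shows one way a tracker file could work, assuming it simply stores the time of the last successful sync. The function names and file format are hypothetical, and the sync itself is passed in as a callable so the example does not depend on Salt being installed; in a real script that callable would wrap a call to Salt's client API.

```python
import time
from typing import Callable, Optional


def sync_and_track(sync: Callable[[], object], tracker_path: str) -> None:
    """Run the sync callable, then record the completion time in the tracker file."""
    sync()  # if this raises, the tracker is left untouched and the age keeps growing
    with open(tracker_path, "w") as handle:
        handle.write(str(time.time()))


def module_age_seconds(tracker_path: str, now: Optional[float] = None) -> Optional[float]:
    """Return seconds since the last recorded sync, or None if no valid tracker exists."""
    if now is None:
        now = time.time()
    try:
        with open(tracker_path) as handle:
            synced_at = float(handle.read().strip())
    except (OSError, ValueError):
        return None
    return now - synced_at
```

Writing the tracker only after the sync returns means a failing sync shows up as a steadily increasing age, which is the kind of signal a metrics system can alert on.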
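
The "Module Promotion" slide lists the environments in promotion order: dev -> vm -> test -> stage -> prod. A minimal sketch of that ordering as data, with a hypothetical helper that returns the next environment a module would be promoted to (the slides describe the process but not any tooling around it):

```python
from typing import Optional

# Order taken from the slide.
PROMOTION_ORDER = ("dev", "vm", "test", "stage", "prod")


def next_environment(current: str) -> Optional[str]:
    """Return the environment that follows 'current', or None if it is the last one."""
    index = PROMOTION_ORDER.index(current)  # raises ValueError for unknown names
    if index + 1 < len(PROMOTION_ORDER):
        return PROMOTION_ORDER[index + 1]
    return None


print(next_environment("test"))
# prints: stage
```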
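
The "Future" slide mentions investigating a returner to share job info between masters instead of a job cache shared via the file system. The sketch below illustrates the general idea of writing each job return into a queryable store. It is not a Salt returner module and not LinkedIn's design: sqlite3 stands in for whatever shared database would be used, the table layout and function names are hypothetical, and the dictionary keys (`jid`, `id`, `fun`, `return`) follow the fields Salt passes to returners.

```python
import json
import sqlite3


def open_store(path: str) -> sqlite3.Connection:
    """Open the store and create the job_returns table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_returns "
        "(jid TEXT, minion_id TEXT, fun TEXT, result TEXT)"
    )
    return conn


def store_return(conn: sqlite3.Connection, ret: dict) -> None:
    """Persist one minion's return for a job; the result is stored as JSON text."""
    conn.execute(
        "INSERT INTO job_returns VALUES (?, ?, ?, ?)",
        (ret["jid"], ret["id"], ret["fun"], json.dumps(ret["return"])),
    )
    conn.commit()


def returns_for_job(conn: sqlite3.Connection, jid: str) -> dict:
    """Return a mapping of minion id to decoded result for the given job id."""
    rows = conn.execute(
        "SELECT minion_id, result FROM job_returns WHERE jid = ?", (jid,)
    ).fetchall()
    return {minion_id: json.loads(result) for minion_id, result in rows}
```

With a store like this, any master (or a separate tool) can look up a job by its id without needing access to another master's local files.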