LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine

https://github.com/linkedin/sysops-api

sysops-api is a framework designed to provide visibility across tens of thousands of machines in seconds. Instead of trying to SSH to remote machines to collect data (execute commands, grep through files), LinkedIn uses this framework to answer arbitrary questions about its infrastructure.

Published in: Technology

LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine

  1. Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine
     Mike Svoboda, Staff Systems and Automation Engineer
     www.linkedin.com/in/mikesvoboda
     msvoboda@linkedin.com
     https://github.com/linkedin/sysops-api
  2. My Background with LinkedIn / CFEngine
     • Hired at LinkedIn into System Operations in 2010
     • When I started, our server count was 300 machines
     • Implemented CFEngine automation in 2010
     • Since then, we have grown to 100 times that size
     • Created our Redis API in 2012 to provide visibility
  3. What is Redis?
     • Redis is an in-memory key-value store, similar to Memcached with additional features
     • Offers on-disk persistence (snapshots to disk) - you can use it as a real database instead of just a volatile cache
     • Offers simple data structures out of the box and commands to work with them natively: dictionaries, lists, sets, sorted sets, etc.
     • Highly scalable data store - a single Redis server can satisfy hundreds of thousands of requests per second
     • Supports transactions - group commands together so they are executed as a single transaction
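For readers unfamiliar with Redis, here is a minimal redis-py sketch of the features named above (native dictionaries and transactions). The key names are invented and nothing here is taken from the sysops-api code.

```python
# Minimal redis-py illustration: a native dictionary (hash) and a MULTI/EXEC
# transaction. Key names are made up for the example.
import redis

r = redis.Redis(host="localhost", port=6379)

# Dictionaries are first-class: store and fetch a hash natively.
r.hset("host:web01.example.com", "os", "RHEL 6")
r.hset("host:web01.example.com", "ram_gb", "64")
print(r.hgetall("host:web01.example.com"))

# Transactions: group commands so they execute as a single unit.
pipe = r.pipeline(transaction=True)
pipe.lpush("recently_updated", "web01.example.com")
pipe.incr("update_count")
pipe.execute()
```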
  4. What is CFEngine?
     CFEngine:
     • Is an IT infrastructure automation framework that helps manage infrastructure throughout its lifecycle
     • Builds, deploys, and manages systems
     • Provides auditing
     • Maintains infrastructure by enforcing intended system state for compliance
     • Runs on the smallest embedded devices, servers, desktops, mainframes, and big iron
     • Easily supports tens of thousands of hosts and provides horizontal scalability
  5. How CFEngine works
  6. CFEngine reduces operational costs
     • Using CFEngine automation is more effective than hiring additional headcount
     • Stop fighting fires every day
     • Allow operations to focus on tomorrow's problems
     • Stay ahead of the curve
     • Keeping the lights on is automated
     • Respond to outages rapidly
  7. Why LinkedIn chose CFEngine
     • Very mature codebase
     • Not dependent on underlying language virtual machines like Ruby, Python, Perl, etc.
     • Flexible architecture
     • Easily scales upwards to support thousands of machines
     • Just as simple to support smaller environments
     • Zero reported security vulnerabilities
     • Lightweight footprint
  8. What CFEngine has done for LinkedIn
     Since implementing CFEngine:
     • Operations has become extremely agile
     • We quickly respond to and resolve outages
     • System administration workload has dropped, even with 100x the number of servers
     • We have built new datacenters in minutes with little effort
     • Real-time visibility after creating our Redis infrastructure, driven by CFEngine execution
     • We can answer any question imaginable about all of our servers in seconds
     • We know every action that happens on our machines
  9. How LinkedIn uses CFEngine
     Functions we have automated:
     • Hardware failure detection
     • Account administration
     • Privilege escalation
     • Software deployment
     • O/S configuration management
     • Process / service management
     • System monitoring
     You never need to log into a machine to manage it.
  10. Two problems still existed for LinkedIn that automation didn't address
     • The company wanted to be able to answer any question imaginable about production.
     • We didn't want to break production by pushing new automation changes.
     To solve both problems, we needed visibility.
  11. Problem #1: The company wants questions answered. STAT!
     • Management and engineers want questions answered immediately, and they ask several times a day, interrupting your work.
  12. LinkedIn was hunting for data
  13. What LinkedIn sysadmins were doing
     • Questions about infrastructure were answered by sysadmins SSHing to machines to hunt for data.
     • As our scale increased, we used a remote execution tool to parallelize some variant of SSH / DSH
       • Thousands of network connections were made to remote machines from a single host to fetch data
       • Did I get results from everything?
       • Parse results after collection
  14. Forcing command execution on remote machines doesn't scale
     • Machines were missed, data wasn't collected
     • Firewalls mangled packets
     • SSHD was offline or didn't spawn on the remote host
     • Depended on system accounts being valid
     • Network connections to the remote machine failed
     • Data collection shouldn't be complicated
     • We were unsure whether we had collected all of the necessary data
  15. Problem #2: We didn't want to break production by pushing new automation changes.
     • Ops was hesitant to use automation because they didn't know where things would break
     • When automation was expanded, we didn't know where systems needed alternative behavior to work correctly (or where they had been modified by developers with root access)
     • Ops had to be agile. We have to work fast. The business needs us to modify production multiple times a day, but we had to make changes without breaking it.
  16. Automation changes were happening in the blind
     • Sysadmins were under pressure from
       • large ticket queues
       • numerous change requests
       • business needs to scale
     • Automation changes were being performed without fully understanding their impact before they were executed
     • We realized that this could lead to mistakes, disasters, outages, and pink slips. To keep this from happening, I built our Redis API to provide visibility.
  17. To provide visibility, we had to scale data collection
     • We had to build a reliable system that was extremely fast and could return the results of remote command execution from tens of thousands of systems in seconds
     • Querying this data could not put load on production systems
     • The cache needed to be publicly available to the company via an API so people could answer their own questions
     • We needed to quickly add new data into the cache before pushing automation changes, to view production impact
  18. We built a cache and populated it with data to answer arbitrary questions
     • Instead of executing commands remotely, we have CFEngine populate the cache with commonly queried data
     • CFEngine executes expensive commands like lshw or dmidecode once and makes the output available for everybody to use
     • Data collection becomes a scheduled event that happens once a day - this collection becomes a cost of doing business
     • With the same data being gathered on all machines, it becomes trivial to compare two or more pieces of hardware
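As an illustration of this pattern, a minimal sketch of caching one expensive command's output might look like the following; the "hostname#command" key layout and the two-day expiry are assumptions, not the actual sysops-api insertion code.

```python
# Hypothetical sketch: run dmidecode once on the host and publish the result
# to Redis so nobody has to SSH in to re-run it. Key layout and the 2-day
# expiry are assumptions for illustration only.
import socket
import subprocess
import redis

r = redis.Redis(host="redis01.example.com", port=6379)

hostname = socket.getfqdn()
output = subprocess.check_output(["dmidecode"])  # expensive; run once per day

key = "%s#dmidecode" % hostname
r.set(key, output)
r.expire(key, 2 * 86400)  # stale entries age out if a host stops reporting
```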
  19. Architecture of the Cache
     Step 1: Rely on CFEngine execution to drive data insertion
     Step 2: Shard your data
     Step 3: Use software load balancing!
  20. Step 1: CFEngine drives data insertion
     Leverage automation to change what you insert or remove from the cache
  21. The cache is a simple dictionary, sharded over multiple Redis servers.
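A sketch of what one host's entries in that dictionary could look like; the "hostname#path#field" key format is an assumption chosen to mirror the extraction types on the next slide.

```python
# Sketch of the dictionary layout for a single host and file. The
# "hostname#path#field" key format is an assumption that mirrors the
# extraction types listed on the next slide (contents, md5sum, stat, wordcount).
import hashlib
import os
import socket
import redis

r = redis.Redis(host="redis01.example.com", port=6379)

hostname = socket.getfqdn()
path = "/etc/passwd"
with open(path, "rb") as f:
    data = f.read()

base = "%s#%s" % (hostname, path)
r.set(base + "#contents", data)
r.set(base + "#md5sum", hashlib.md5(data).hexdigest())
r.set(base + "#wordcount", len(data.split()))
r.set(base + "#stat", repr(os.stat(path)))
```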
  22. Step 2: Extract Sharded Data
     • Determine scope. How much data do I need to answer my question?
     • For each CFEngine policy server running Redis, search Redis for matching keys in the dictionary
     • For each key we find from a search, perform the relevant data extraction:
       • contents
       • md5sum
       • os.stat()
       • wordcount
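A minimal extraction sketch under the same assumed key layout: walk each Redis-backed policy server, find the matching keys, and pull the values back. The shard host names and the search pattern are placeholders.

```python
# Minimal extraction sketch: search every Redis shard for keys matching a
# pattern and fetch the values. Shard host names and the pattern are
# placeholders, not the real policy-server list.
import redis

SHARDS = ["redis01.example.com", "redis02.example.com"]

def extract(pattern):
    results = {}
    for host in SHARDS:
        r = redis.Redis(host=host, port=6379)
        # SCAN iterates incrementally instead of blocking the server like KEYS.
        keys = list(r.scan_iter(match=pattern))
        if keys:
            results.update(dict(zip(keys, r.mget(keys))))
    return results

for key, md5 in sorted(extract("*#/etc/passwd#md5sum").items()):
    print(key, md5)
```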
  23. Step 3: Use Software Load Balancing!
     • Have clients populate multiple Redis servers on insertion; pick a Redis server at random on extraction (load balancing)
     • If we don't get a response from our first choice, pick another Redis server at random (failover)
     • Find randomized CFEngine policy servers with Redis from each level in the scope
       • If the CFEngine policy server responds, push it into the list of machines we need to query for data
       • If the CFEngine policy server doesn't respond, pick another one at random (failover)
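A hedged sketch of that client-side load balancing and failover logic; the host names are placeholders and the real sysops-api client may differ in the details.

```python
# Sketch of client-side load balancing with failover: try Redis-backed policy
# servers in random order and use the first one that answers a PING.
# Host names are placeholders.
import random
import redis

POLICY_SERVERS = ["redis01.example.com", "redis02.example.com", "redis03.example.com"]

def pick_server(candidates):
    """Return a connection to a random responsive server, or None."""
    shuffled = list(candidates)
    random.shuffle(shuffled)
    for host in shuffled:
        conn = redis.Redis(host=host, port=6379, socket_timeout=2)
        try:
            conn.ping()            # cheap liveness check
            return conn
        except redis.ConnectionError:
            continue               # fail over to the next random choice
    return None

conn = pick_server(POLICY_SERVERS)
if conn is None:
    raise SystemExit("no Redis server responded")
```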
  24. Local Scope
  25. Example: Local cache extraction
     $ time extract_sysops_cache.py --search /etc/passwd --contents | grep msvoboda | wc -l
     487

     real  0m1.813s
     user  0m1.484s
     sys   0m0.087s
  26. Site (datacenter) Scope
  27. Example: Site cache extraction
     $ time extract_sysops_cache.py --site lva1 --search /etc/passwd --contents | grep msvoboda | wc -l
     8687

     real  0m19.169s
     user  0m30.286s
     sys   0m1.271s
  28. Global Scope
  29. Example: Global cache extraction
     $ time extract_sysops_cache.py --scope global --search /etc/passwd --contents | grep msvoboda | wc -l
     27344

     real  0m44.827s
     user  1m39.532s
     sys   0m4.288s
  30. Make it fast! Become multithreaded
  31. Make it faster! Build a Redis pipeline
  32. Cache extraction with a pipeline
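A sketch of the pipelined version, again under the assumed key layout from earlier: instead of paying one round trip per GET, the client queues the commands and flushes the pipeline once.

```python
# Sketch of pipelined extraction: batch thousands of GETs into one pipeline so
# they cost a few round trips instead of one each. Key pattern is illustrative.
import redis

r = redis.Redis(host="redis01.example.com", port=6379)
keys = list(r.scan_iter(match="*#/etc/passwd#contents"))

pipe = r.pipeline(transaction=False)   # plain batching; no MULTI/EXEC needed
for key in keys:
    pipe.get(key)
values = pipe.execute()                # all queued GETs sent together

for key, value in zip(keys, values):
    print(key, len(value or b""))
```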
  33. Extracting the Cache for Fun and Profit
     [msvoboda@esv4-infra01 ~]$ extract_sysops_cache.py --scope local --search mps*cm.conf --md5sum --prefix-hostnames
     esv4-2360-mps01.corp.linkedin.com#/etc/cm.conf  12721673715de3ee6b9dec487529355e
     esv4-2360-mps02.corp.linkedin.com#/etc/cm.conf  56b03a16c69e5b246a565dbcda44ba28
     esv4-2360-mps03.corp.linkedin.com#/etc/cm.conf  11e20e28ec60ac6c71cbb71b0a6c9b35
     esv4-2360-mps04.corp.linkedin.com#/etc/cm.conf  55402eda02e7f5c17dc7535455adc097
  34. Make it fastest! Compression is significant!
     • Less network overhead on cache insertion
     • Less network overhead on cache extraction
     • More stuff we can put into the cache
     • Less network I/O = faster results delivered
     • Less CPU usage on extraction
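A sketch of the compression step; zlib is used here purely for illustration, since the deck does not say which codec LinkedIn chose.

```python
# Sketch of the compression idea: compress values before insertion and
# decompress after extraction, trading a little CPU for much less network I/O.
# zlib and the key name are illustrative assumptions.
import zlib
import redis

r = redis.Redis(host="redis01.example.com", port=6379)

with open("/etc/passwd", "rb") as f:
    data = f.read()

key = "web01.example.com#/etc/passwd#contents"
r.set(key, zlib.compress(data, 6))           # compress before insertion

print(zlib.decompress(r.get(key)).decode())  # decompress after extraction
```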
  35. Seconds for cache insertion
  36. CPU cycles for cache insertion
  37. Data size in megabytes of the cache for an entire datacenter
  38. Time for a complete cross-country datacenter cache extraction
  39. Drink from the firehose
  40. With the Redis API, you can now be confident in pushing automation changes
     • You know what systems will be affected before a change
     • You aren't hit with surprises in production
     • You have added visibility
     • You don't have to log into machines to modify or update them
  41. Summary: before and after implementation of CFEngine & Redis API at LinkedIn
     Headcount
       • Before: 6 people supporting a few hundred machines
       • After: 6 people supporting tens of thousands of machines
     Time spent
       • Before: hours to build a single machine
       • After: build complete datacenters in minutes
     Productivity
       • Before: hours spent collecting data before a change, and the change itself causing outages
       • After: can focus on building infrastructure; the team became proactive about fixing future problems instead of reactive firefighting
     Ease of scaling server deployment
       • Before: incredibly difficult to respond to change, low visibility into production
       • After: superior administration, rapid response to changing needs, complete system visibility
  42. Open Source / Questions?
     msvoboda@linkedin.com
     www.linkedin.com/in/mikesvoboda
     You can download the code from this presentation here: https://github.com/linkedin/sysops-api
