Cloud Security At Netflix, October 2013


Netflix Cloud Security Architecture

Published in: Technology
  • Netflix is at its core a video subscription service. We offer DVD by mail in the US, and streaming in the US, Canada, 43 Latin American countries, and the UK/Ireland.

    1. 1. Cloud Security @ Netflix October 25, 2013 Jay Zarfoss (Cloud Security Guy @ Netflix)
    2. 2. This presentation • What it covers: – A discussion of what it means to fit security into the Netflix Cloud universe – A description of the past, present, and future Netflix cloud security architecture • What it (mostly) skips: – The broader Netflix culture and architecture – For generally cloudy topics, see Adrian Cockcroft’s slideshare at – For general culture see
    3. 3. Netflix Company Profile now via self service* Instructions: Find your favorite BASH terminal and type the following:

           > UPDATED_SIZE=`curl | perl -ne 's/ / /g; if(/\d+ million members in \d+ countries/){print "$&";}'`
           > echo "Netflix is the world’s leading Internet subscription service for enjoying TV and movies, with more than ${UPDATED_SIZE}"

       *No whining; remember that you’ll never again need to wait for me to update this slide like you had to wait for database access when you started at your last job.
    4. 4. Our Cloudy Culture Agile Unsynchronized No waiting Dynamic Redundant *aaS Decentralized Ephemeral Open Source Freedom Self Service NoSQL Rapid Decoupled Chaotic Resilient *These are not terms normally associated with security or security architectures, yet we adopt all of them for security development; with some perspective (of course).
    5. 5. “But how can you trust the Cloud?” • This is simply an old question rephrased for the new generation of computing. – How can you trust the CPU? – How can you trust the OS? • Security design often requires trust of the lower layer. – Even though they’ve all let us down at some point before. – And “trust” does not mean “blind faith”
    6. 6. “But we have special requirements” • Frankly, they’re probably not that special – You can fail pretty much any requirement with or without using cloud methodologies – 67% of 670 surveyed companies fail PCI compliance* • The core AWS services (EC2, S3, ELB) meet PCI DSS 2.0 compliance** – It’s generally assumed that the more exotic features (DynamoDB) will be getting compliance sooner rather than later -- So why not offload some of that compliance work? * **
    7. 7. The Security Conflict • Goal: prevent us from hurting ourselves, while not preventing us from moving quickly and being flexible.
    8. 8. Perspective, Perspective, Perspective • No one will worry about you getting hurt playing paintball in a bomb disposal suit. But then, you’ll almost certainly lose the game. • Bomb technicians don’t wear paintball suits. Even if they are easier to work in.
    9. 9. Further Security Caveats Technology alone will never prevent malicious insiders from doing damage. (Never has this sentiment been more relevant.) Smart professionals will use safer tools when they’re available (so let’s give them those tools)!
    10. 10. What do good tools look like? • Intuitive yet powerful GUIs that shield you from stumbling over the secrets – Integrate with single sign-on to keep out your kids and track you down when (not if) you screw up • Powerful APIs to do just about everything… – Except what there’s no legitimate use case for
    11. 11. Reflections on Better APIs The Cloud Offers Incredible APIs so developers can call upon new hardware with a single line of code. With great power comes great responsibility.
    12. 12. Packets from the sky Don’t worry, it’s just rain… • Your own trust of software running on a cloud instance should ideally be predicated on some cryptographically authenticated material. – Ironically, your cloud provider wants to do the same thing, since they don’t want you denying your bill… • Not long ago, there was no way to do this other than deploying these keys yourself in your own build pipeline. – Thus, your security was only as nimble as your build and deployment system. Maybe ok. Probably much slower than you want/need it to be.
    13. 13. Deploying AWS keys, the Legacy Way “That was in the before time, in the long long ago… (alright, it was 2011)” Presumably, your machines in the cloud are running code that actually wants to do something against the Cloud Provider’s API, e.g. read/write to a database. The legacy AWS paradigm is that all of these operations need to be authenticated by signing (HMACing) with access keys. (Amazon’s term: “credentials”; my term: “AWS Keys”).

            import com.amazonaws.auth.BasicAWSCredentials;
            import com.amazonaws.services.simpledb.AmazonSimpleDBClient;

            // fortunately, AWS provides helper objects that do most of the work
            BasicAWSCredentials cred = new BasicAWSCredentials("accessID", "secretKeyID");
            AmazonSimpleDBClient client = new AmazonSimpleDBClient(cred);
            // ugly HMAC generating code safely tucked away in here somewhere
            client.listDomains();

        Sure... But how did “accessID” and “secretKeyID” get on the machine?
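The "ugly HMAC generating code" that the helper objects hide boils down to a keyed hash over the request. A minimal sketch using only the JDK, assuming a made-up string-to-sign: the real AWS signature formats canonicalize the request in specific ways that are not reproduced here, and `HmacSketch` and `sign` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: shows HMAC request signing in general,
// not the exact AWS signature format.
class HmacSketch {
    // Returns a Base64-encoded HMAC-SHA256 tag over the request text.
    static String sign(String secretKey, String stringToSign) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] tag = mac.doFinal(stringToSign.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(tag);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder request text (an assumption, not a real AWS canonical request).
        String request = "GET\nexample-endpoint\n/\nAction=ListDomains";
        System.out.println(sign("secretKeyID", request));
    }
}
```

The server recomputes the tag with its own copy of the secret and compares. The secret never travels with the request, which is why it has to be present on the machine in the first place.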
    14. 14. 1st Attempt: Stick them in a system property

            // if it makes you feel better, let’s pretend I obfuscated this
            BasicAWSCredentials cred = new BasicAWSCredentials(
                System.getProperty("accessID"),
                System.getProperty("secretKeyID"));
            AmazonSimpleDBClient client = new AmazonSimpleDBClient(cred);
            client.listDomains();

        • This… works… I guess…, but what happens if the key gets out?* – Rebake hundreds of AMIs – Redeploy thousands of machines • Requires all hands on deck and a big fiasco. *Thanks to supplemental security controls, like IP whitelisting, this may not be quite as horrible as it sounds. Still bad.
    15. 15. 2nd Try: Load Keys At Runtime (Better?) • Fits nicely into Cloud Platform “whatever”-aaS layer. – Security Groups can enforce who can make the request. – And makes a pretty tidy REST call:

            GET server/getAWSKey
            <AWSKEY>
              <accessKeyID>open</accessKeyID>
              <secretKey>sesame</secretKey>
            </AWSKEY>

        • What happens when the subaccount associated with the key gets accidentally deleted? – Update the key in AWS console and then swap the key in the key servers (technically easy; will still get your heart pumping when you do it for real – trust me!) – You may still have to reboot a lot of machines! But why?
    16. 16. Objects, like peaches, are sticky. (Still delicious.)

            RESTfulObj AWSKey = RESTService.get("server/getAWSKey");
            BasicAWSCredentials cred = new BasicAWSCredentials(
                AWSKey.getAccessID(),
                AWSKey.getSecretKey());
            AmazonSimpleDBClient client = new AmazonSimpleDBClient(cred);
            client.listDomains();

        The mindful Object-Oriented programmer will tend to keep this object around rather than re-creating it all of the time. (Trust me). Guess what object caches the AWS Keys.
    17. 17. Promote Safer Foods.

            // provider paradigm dynamically asks for keys every time
            AWSCredentialsProvider prov = new AWSCredentialsProvider() {
                public AWSCredentials getCredentials() {
                    RESTfulObj AWSKey = RESTService.get("server/getAWSKey");
                    return new BasicAWSCredentials(
                        AWSKey.getAccessID(),
                        AWSKey.getSecretKey());
                }
            };
            AmazonSimpleDBClient client = new AmazonSimpleDBClient(prov);
            client.listDomains();

        No cached key (yay!). But… Good luck chasing everyone around with a broomstick making them write their code this way.
    18. 18. Systematically enforce Refresh. Or: Revoke Privileges for unsafe food altogether • Only issue temporary keys good for a few hours (> your longest conceivable operation) – AWS Mechanism to do this: (AWSSecurityTokenService)

            GET server/getAWSKey
            <AWSKEY>
              <accessKeyID>open</accessKeyID>
              <secretKey>sesame</secretKey>
              <expires>1352083995</expires>
            </AWSKEY>

        • Simple, but with powerful consequences, e.g. what if keys were accidentally written to logs, or a backup was lost? The exposed keys expire within hours. – Disadvantages? (I would argue materially none)
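Expiring keys turn "please don't cache the key" into something the system enforces, and they make a small caching wrapper safe: hold the key until shortly before it expires, then ask the key server again. A minimal sketch of that idea, assuming hypothetical types (`RefreshingKeys`, `TempKey`) and a `Supplier` standing in for the `GET server/getAWSKey` call; these are not the AWS SDK's classes.

```java
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical types: a sketch of "refresh before expiry", not the AWS SDK's classes.
class RefreshingKeys {
    // Temporary key material plus its expiry, mirroring the <expires> field.
    static final class TempKey {
        final String accessKeyId;
        final String secretKey;
        final Instant expires;

        TempKey(String accessKeyId, String secretKey, Instant expires) {
            this.accessKeyId = accessKeyId;
            this.secretKey = secretKey;
            this.expires = expires;
        }
    }

    private final Supplier<TempKey> keyServer; // stands in for GET server/getAWSKey
    private final long marginSeconds;          // refresh this long before expiry
    private TempKey cached;

    RefreshingKeys(Supplier<TempKey> keyServer, long marginSeconds) {
        this.keyServer = keyServer;
        this.marginSeconds = marginSeconds;
    }

    // Returns the cached key unless it is missing or inside the refresh margin.
    synchronized TempKey get(Instant now) {
        if (cached == null || !now.isBefore(cached.expires.minusSeconds(marginSeconds))) {
            cached = keyServer.get();
        }
        return cached;
    }

    public static void main(String[] args) {
        // Fake key server issuing three-hour keys (placeholder values).
        RefreshingKeys keys = new RefreshingKeys(
            () -> new TempKey("open", "sesame", Instant.now().plusSeconds(3 * 3600)), 300);
        System.out.println("key valid until " + keys.get(Instant.now()).expires);
    }
}
```

Passing `now` in explicitly keeps the expiry logic testable without waiting for a real clock.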
    19. 19. Abracadabra at Runtime (Best) • June 11th 2012: Amazon introduces temporary AWS Security Credentials via Metadata Service – On-demand access keys via Amazon API; expire quickly – Effectively, Amazon is hosting the key server and only giving keys to your cloud instances. – Predefined “roles” determine the permissions of the keys – Wish we had had this when we first moved to the cloud. • Still useful to have your own key server, why? – For one, developers will chase you down with pitchforks if they can’t run against the cloud API at their desk. (And they’d have every right to…)
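For a sense of what the instance actually sees: the metadata service is a plain HTTP endpoint reachable only from the instance, and the credentials come back as a small JSON document. The sketch below builds the per-role URL and pulls fields out of a sample response. The class name and sample values are placeholders, the regex extraction is a shortcut for illustration only, and newer metadata service versions add a session-token step that is not shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; real code would normally let the AWS SDK do this.
class MetadataCredsSketch {
    static final String BASE =
        "http://169.254.169.254/latest/meta-data/iam/security-credentials/";

    // URL an instance would GET to obtain temporary keys for its role.
    static String urlForRole(String roleName) {
        return BASE + roleName;
    }

    // Naive string-field extraction, good enough for a sketch; use a JSON parser in real code.
    static String field(String json, String name) {
        Matcher m = Pattern.compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"")
                           .matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample document with placeholder values; only an EC2 instance can fetch a real one.
        String sample = "{ \"Code\" : \"Success\", \"AccessKeyId\" : \"open\", "
            + "\"SecretAccessKey\" : \"sesame\", \"Token\" : \"placeholder\", "
            + "\"Expiration\" : \"2013-10-25T18:00:00Z\" }";
        System.out.println(urlForRole("myRole"));
        System.out.println(field(sample, "AccessKeyId") + " until " + field(sample, "Expiration"));
    }
}
```

Because the endpoint answers only from inside the instance, a developer's desk machine gets nothing from it, which is one reason a separate key server remains useful.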
    20. 20. IAM Role configuration via Asgard View into Asgard Launch configuration assigning a Role which determines the permissions of the key an instance will receive via IAM paradigm.
    21. 21. New Ways to Hide All Your Keys • April 3, 2013: Amazon introduces variables in AWS access control policies. – Provides an obvious place to store sensitive nuggets your software needs to work

            // one ACL to rule them all
            {
              "Action": [ "s3:GetObject", "s3:PutObject" ],
              "Effect": "Allow",
              "Resource": ["arn:aws:s3:::mybucket/myclientsoftware.${aws:userid}.keystore"]
            }

        Just apply the right role to your auto-scale group and you’re done!
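The effect of the `${aws:userid}` variable is that one policy statement resolves to a different object name for each caller, so each identity can reach only its own keystore. The sketch below imitates that substitution locally to make the behavior concrete. It is an illustration of the idea, not how AWS evaluates policies, and the user-id strings are made-up placeholders.

```java
// Hypothetical illustration of policy-variable expansion; AWS performs the
// real evaluation server-side.
class PolicyVariableSketch {
    static final String RESOURCE_TEMPLATE =
        "arn:aws:s3:::mybucket/myclientsoftware.${aws:userid}.keystore";

    // Substitutes the caller's id into the resource pattern (literal replace, no regex).
    static String expand(String template, String userId) {
        return template.replace("${aws:userid}", userId);
    }

    // A caller may touch only the object whose name embeds its own id.
    static boolean allowed(String userId, String requestedArn) {
        return expand(RESOURCE_TEMPLATE, userId).equals(requestedArn);
    }

    public static void main(String[] args) {
        String mine = "arn:aws:s3:::mybucket/myclientsoftware.caller-A.keystore";
        System.out.println(allowed("caller-A", mine)); // own keystore
        System.out.println(allowed("caller-B", mine)); // someone else's keystore
    }
}
```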
    22. 22. Secure Bootstrapping (still) frustrating // at least now there’s a reasonable place to put the file <file smartly loaded from ACL-limited store> • Options are better today with new ACL Rules • But… – What if I want to hot-swap these? Wait, you mean I have to write them to a file and restart?! Yuck!! • Unfortunate artifact of software designed for the datacenter where machines stay put for a long time – One mistake in the AWS console and my keystore file (complete with SSL private keys) is open to the world? • If your eyelid isn’t twitching, it should be.
    23. 23. So… we still want our own tools

            // whenever you find yourself writing code like this,
            // I hope you’re asking yourself if the keys aren’t
            // left sitting on the kitchen counter
            cipherContext = factory.getCipherContext("algorithm", "keyName");

        • (Most) developers don’t want to think about where this key lives. So let’s have the library worry about that for them. • Some keys are more important than others – “oh, shit” vs. “OH SHIT”
    24. 24. Custom Cloud Key Management Don’t leave your child in the middle of a busy intersection.
    25. 25. Netflix Key Management • All sorts of business cases require keying material: – Password reset tokens – Encrypting sensitive databases – Authenticating Netflix Ready Devices (NRDs) – DRM keys • I’m not having the DRM debate here; so don’t try – Symmetric, Asymmetric, HMAC keys, …. • So how do you handle those keys? – Depends. (Paintballs or Pipebombs?)
    26. 26. Cryptex Service • Without going into too much detail, Cryptex is our *aaS for key management with associated client libraries in Java and Python. – We worry about where the keys live • So you (Mr./Ms. big data person) don’t have to – Flexible, dynamic, auto-scaling, fast moving • Except when it’s not supposed to be • Future/Ongoing work – Better integrating this into Datacenter-y software that wants fixed static things is a constant challenge and requires lots of new plumbing – wanna help?!
    27. 27. Variations in Key Handling • Low: Key is provided to the edge service instance – Virtually unlimited throughput, resistant to any backend service outages • Medium: Key stays on the single-purpose Netflix key management servers; each instigating crypto operation is a REST call (small data is better!) – Key never lives on a customer facing server (one nasty bug or “oops” won’t cause exposure) • High: Keys live in specialized hardware (HSM)
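One way to picture the three tiers from the caller's side: the code asks for a cipher context by sensitivity and never learns whether the operation ran locally or over the wire. The sketch below assumes hypothetical names throughout (`TieredKeysSketch`, `CipherContext`, `contextFor`); it is not the Cryptex API, and the remote tiers are stubbed with plain functions where a real system would make a REST or HSM call.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.function.UnaryOperator;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical names throughout; this is not the Cryptex API.
class TieredKeysSketch {
    enum Sensitivity { LOW, MEDIUM, HIGH }

    interface CipherContext {
        byte[] mac(byte[] data);
        String location(); // where the key material lives
    }

    // LOW: the key is handed to the calling instance, so operations are local.
    static CipherContext local(byte[] key) {
        return new CipherContext() {
            public byte[] mac(byte[] data) {
                try {
                    Mac m = Mac.getInstance("HmacSHA256");
                    m.init(new SecretKeySpec(key, "HmacSHA256"));
                    return m.doFinal(data);
                } catch (GeneralSecurityException e) {
                    throw new IllegalStateException(e);
                }
            }
            public String location() { return "edge instance"; }
        };
    }

    // MEDIUM/HIGH: the caller never sees the key; each operation is a remote call.
    static CipherContext remote(String where, UnaryOperator<byte[]> call) {
        return new CipherContext() {
            public byte[] mac(byte[] data) { return call.apply(data); }
            public String location() { return where; }
        };
    }

    static CipherContext contextFor(Sensitivity s, byte[] lowKey,
                                    UnaryOperator<byte[]> keyServer, UnaryOperator<byte[]> hsm) {
        switch (s) {
            case LOW:    return local(lowKey);
            case MEDIUM: return remote("key management server", keyServer);
            default:     return remote("HSM", hsm);
        }
    }

    public static void main(String[] args) {
        byte[] demoKey = "demo-only-key".getBytes(StandardCharsets.UTF_8);
        UnaryOperator<byte[]> stub = data -> new byte[32]; // stands in for a remote call
        for (Sensitivity s : Sensitivity.values()) {
            System.out.println(s + " -> " + contextFor(s, demoKey, stub, stub).location());
        }
    }
}
```

Keeping the interface identical across tiers means a key can be moved up or down in sensitivity without touching the calling code; only latency and failure modes change.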
    28. 28. Netflix Global Crypto Ops/Sec • Low (< 1ms latency) – It’s a (really) big number. And highly variable. • Medium (~ 4ms latency) – Tens of thousands of operations/sec at daily peak (number is shrinking as we get smarter with our protocols which favor low sensitivity keys) • High (~ 10ms latency) – Over one thousand operations/sec at daily peak
    29. 29. (Fairly) Common Dialogue Big Data Developer: I’m working on super-cool new feature, X. And it will use some crypto and need some keys. Which sensitivity of key do I want? Me: Tell me the story of what happens when we lose the key somehow.
    30. 30. Various Key Loss Scenarios Low: We’d rotate the key via one button-push and customers wouldn’t notice an impact; minimal damage control. Medium: We’d rotate the key and the whole team would have to work for a week straight cleaning up the mess created. High: I don’t want to talk about it. Let the Cloud help you along the way….Early and automated detection, combined with fast-reaction means more keys can be low/medium sensitivity (less resource intensive). Design your new system to be able to use LOW keys for the bulk of the heavy lifting!!
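A "one button-push" rotation that customers do not notice usually implies that old and new keys overlap for a while. One common way to get that, sketched here as an assumption rather than as Netflix's design, is a versioned key ring: each token records which key version produced it, new tokens use the newest key, and an exposed version can be retired on its own. All names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical design: versioned keys so rotation does not break tokens in flight.
class KeyRingSketch {
    private final Map<Integer, byte[]> keys = new HashMap<>();
    private int current = 0;

    // The "button push": new tokens use the new key, older versions still verify.
    synchronized int rotate(byte[] newKey) {
        keys.put(++current, newKey.clone());
        return current;
    }

    // Drop a lost or aged-out version; its tokens stop verifying.
    synchronized void retire(int version) {
        keys.remove(version);
    }

    // Token format (made up for this sketch): "<version>:<base64 HMAC>".
    synchronized String sign(String message) {
        byte[] key = keys.get(current);
        if (key == null) throw new IllegalStateException("no active key; call rotate first");
        return current + ":" + Base64.getEncoder().encodeToString(hmac(key, message));
    }

    synchronized boolean verify(String message, String token) {
        int sep = token.indexOf(':');
        if (sep < 1) return false;
        byte[] key;
        byte[] tag;
        try {
            key = keys.get(Integer.parseInt(token.substring(0, sep)));
            tag = Base64.getDecoder().decode(token.substring(sep + 1));
        } catch (IllegalArgumentException e) { // bad version number or bad Base64
            return false;
        }
        return key != null && MessageDigest.isEqual(tag, hmac(key, message));
    }

    private static byte[] hmac(byte[] key, String message) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(message.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyRingSketch ring = new KeyRingSketch();
        ring.rotate("first-demo-key".getBytes(StandardCharsets.UTF_8));
        String token = ring.sign("password-reset:alice");
        ring.rotate("second-demo-key".getBytes(StandardCharsets.UTF_8));
        System.out.println("old token after rotation: " + ring.verify("password-reset:alice", token));
        ring.retire(1);
        System.out.println("old token after retiring v1: " + ring.verify("password-reset:alice", token));
    }
}
```

The story test from the slide maps directly onto this: if retiring a version is a non-event, the key is a good candidate for LOW.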
    31. 31. AWS CloudHSM • March 26th 2013, AWS announces availability of SafeNet-manufactured CloudHSMs to the general cloud-computing public. – Old-skool industry standard security solution… without the need for your IT people to babysit. – All the right acronyms: FIPS 140-2, CC EAL 4+ – Amazon has no way to recover your keys (do please take care not to lose them) – Single tenant • This is the new home for our high sensitivity keys.
    32. 32. Some Final Thoughts…
    33. 33. Why are we sharing? • In a sense, Netflix benefits when other cloud users and cloud vendors follow common paths. – Problems will invariably pop up, but when these problems occur in industry standard practices, everyone shares the load of getting them fixed. • Example of a great benefit of common practice – TLS has become industry standard for secure transport, but has had its lumps lately (BEAST, RC4)* – Because it affects everyone, we’re all motivated to look for solutions and share those costs *
    34. 34. Security and Flexibility don’t always have to be at odds with each other… Security can fit in a fast-changing environment where flexibility is paramount. The trick is to leverage the same flexibility to allow the Security to keep up.
    35. 35. Sound Interesting? We’re hiring!