
Secrets at Planet-scale: Engineering the Internal Google KMS


Video and slides synchronized, mp3 and slide download available at https://bit.ly/35Y3xoz.

Anvita Pandit covers the design choices and strategies Google used to build a highly reliable, highly scalable service. She also discusses ongoing maintenance pain points and recommended practices for an internal key management service. Filmed at qconsf.com.

Anvita Pandit is a Software Engineer at Google, where she works on cryptographic key management and data protection. She previously presented a talk at DEFCON 2019 on the fallibilities of genetic-testing algorithms, and has given talks at Lesbians Who Tech in NYC and Montreal on digital currencies and society.


Secrets at Planet-scale: Engineering the Internal Google KMS

  1. 1. Secrets at Planet-Scale: Engineering the Internal Google Key Management System (KMS) QCon San Francisco 2019, Nov 11-13 Anvita Pandit Google LLC
  2. 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site worldwide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ high-availability-google-key- management/
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  4. 4. Anvita Pandit - Software engineer in the Data Protection / Security and Privacy org at Google for 2 years. - Engineering Resident. - DEFCON 2019 Biohacking Village: co-presented the “Hacking Race” workshop with @HerroAnneKim
  5. 5. Not the Google Cloud KMS
  6. 6. Agenda 1. Why use a KMS? 2. Essential product features 3. Walkthrough of encrypted storage use case 4. System specs and architectural decisions 5. Walkthrough of an outage 6. More architecture! 7. Challenge: safe key rotation
  7. 7. The Great Gmail Outage of 2014 https://googleblog.blogspot.com/2014/01/todays-outage-for-several-google.html
  8. 8. Why Use a KMS?
  9. 9. Why Use a KMS? Core motivation: code needs secrets!
  10. 10. Why Use a KMS? Core motivation: code needs secrets! Secrets like: ● Database passwords, third party API and OAuth tokens ● Cryptographic keys used for data encryption, signing, etc
  11. 11. Why Use a KMS? Core motivation: code needs secrets! Where?
  12. 12. Why Use a KMS? Core motivation: code needs secrets! Where? ● In code repository?
  13. 13. https://github.com/search?utf8=%E2%9C%93&q=remove+password&type=Commits&ref=searchresults
  14. 14. Why Use a KMS? Core motivation: code needs secrets! Where? ● In code repository? ● On production hard drives?
  15. 15. Why Use a KMS? Core motivation: code needs secrets! Where? ● In code repository? ● On production hard drives? Alternative: ● Use a KMS!
  16. 16. Centralized Key Management Solves key problems for everybody.
  17. 17. Centralized Key Management Solves key problems for everybody. Offers: ● Separate management of key-handling code
  18. 18. Centralized Key Management Solves key problems for everybody. Offers: ● Separate management of key-handling code ● Separation of trust
  19. 19. Centralized Key Management Solves key problems for everybody
  20. 20. Centralized Key Management Solves key problems for everybody 1. Access control lists (ACLs)
  21. 21. Centralized Key Management Solves key problems for everybody 1. Access control lists (ACLs) ● Who is allowed to use the key? Who is allowed to make updates to the key configuration?
  22. 22. Centralized Key Management Solves key problems for everybody 1. Access control lists (ACLs) ● Who is allowed to use the key? Who is allowed to make updates to the key configuration? ● Identities are specified with the internal authentication system (see ALTS)
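As a concrete illustration of the ACL idea, here is a minimal sketch in Python. The configuration fields, identity names, and check logic are hypothetical, not Google's actual schema; in the real system, identities would be authenticated by the internal auth system (ALTS).

```python
# Hypothetical per-key configuration: readers may use the key,
# admins may change the key's configuration. All names are illustrative.
KEY_CONFIG = {
    "key_name": "payments/db-password",               # illustrative key name
    "readers": ["payments-frontend-prod"],             # identities allowed to use the key
    "admins": ["payments-oncall", "payments-leads"],   # identities allowed to edit this config
    "audit_log": True,                                  # log every access (but never the secret itself)
}

def authorize(identity: str, action: str, config: dict) -> bool:
    """Toy ACL check: 'use' requires reader or admin, 'configure' requires admin."""
    if action == "configure":
        return identity in config["admins"]
    if action == "use":
        return identity in config["readers"] or identity in config["admins"]
    return False

assert authorize("payments-frontend-prod", "use", KEY_CONFIG)
assert not authorize("payments-frontend-prod", "configure", KEY_CONFIG)
```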
  23. 23. Centralized Key Management Solves key problems for everybody. 2. Auditing aka Who touched my keys?
  24. 24. Centralized Key Management Solves key problems for everybody. 2. Auditing aka Who touched my keys? ● Binary verification
  25. 25. Centralized Key Management Solves key problems for everybody. 2. Auditing aka Who touched my keys? ● Binary verification ● Logging (but not the secrets!)
  26. 26. Google’s Root of Trust ● Storage systems (millions): data encrypted with data keys (DEKs) ● KMS (tens of thousands): master keys and passwords are stored in KMS ● Root KMS (hundreds): KMS is protected with a KMS master key in Root KMS ● Root KMS master key distributor (hundreds): the Root KMS master key is distributed in memory ● Physical safes (a few): the Root KMS master key is backed up on hardware devices
  31. 31. Design Requirements ● Availability: 5 nines => 99.999% of requests are served ● Latency: 99% of requests are served in < 10 ms ● Scalability: planet-scale! ● Security: effortless key rotation
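As a rough illustration (not from the talk) of what the five-nines availability target implies, a quick downtime-budget calculation:

```python
# Downtime budget implied by an availability target (illustrative arithmetic).
MINUTES_PER_YEAR = 365 * 24 * 60

for target in (0.999, 0.9999, 0.99999):
    budget = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.3%} availability -> ~{budget:.1f} minutes of downtime per year")

# 99.999% ("five nines") leaves roughly 5 minutes of total unavailability per year,
# which is why a global outage is treated as unacceptable.
```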
  32. 32. Decisions, decisions ● Not an encryption/decryption service.
  33. 33. Decisions, decisions ● Not an encryption/decryption service. ● Not a traditional database
  34. 34. Decisions, decisions ● Not an encryption/decryption service. ● Not a traditional database ● Key wrapping ● Stateless serving
  35. 35. Key Wrapping
  36. 36. Key Wrapping ● Fewer centrally-managed keys improve availability but require more trust in the client
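The key-wrapping (envelope encryption) idea, sketched below with the `cryptography` package's AES-GCM as a stand-in for the actual primitives: the KMS holds a small number of key-encryption keys (KEKs); clients generate their own data-encryption keys (DEKs), ask the KMS to wrap them, and store the wrapped DEK next to the ciphertext. Names and structure here are illustrative, not the real API.

```python
# Illustrative envelope-encryption flow with AES-GCM (via the `cryptography` package).
# The KMS only ever handles small DEKs; bulk data never passes through it.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# --- KMS side: holds the KEK, exposes wrap/unwrap only ------------------------
KEK = AESGCM.generate_key(bit_length=256)  # in reality protected by Root KMS

def kms_wrap(dek: bytes) -> bytes:
    nonce = os.urandom(12)
    return nonce + AESGCM(KEK).encrypt(nonce, dek, b"dek-wrapping")

def kms_unwrap(wrapped: bytes) -> bytes:
    nonce, ct = wrapped[:12], wrapped[12:]
    return AESGCM(KEK).decrypt(nonce, ct, b"dek-wrapping")

# --- Client / storage-system side ---------------------------------------------
def encrypt_record(plaintext: bytes) -> tuple[bytes, bytes]:
    dek = AESGCM.generate_key(bit_length=256)     # fresh DEK per record/chunk
    nonce = os.urandom(12)
    ciphertext = nonce + AESGCM(dek).encrypt(nonce, plaintext, None)
    return kms_wrap(dek), ciphertext              # store the wrapped DEK with the data

def decrypt_record(wrapped_dek: bytes, ciphertext: bytes) -> bytes:
    dek = kms_unwrap(wrapped_dek)
    nonce, ct = ciphertext[:12], ciphertext[12:]
    return AESGCM(dek).decrypt(nonce, ct, None)

wrapped, blob = encrypt_record(b"user data")
assert decrypt_record(wrapped, blob) == b"user data"
```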
  37. 37. Stateless Serving Insight: At the KMS layer, key material is not mutable state. Immutable key material + key wrapping ==> Stateless server ==> Trivial scaling. Keys in RAM ==> Low-latency serving.
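Extending the sketch above, a toy "stateless shard": every key it serves lives in process memory, loaded once at startup from distributed config, so requests never touch a database and any number of identical shards can run side by side. The loading step and names are assumptions for illustration only.

```python
# Toy stateless KMS shard: the whole keyset sits in RAM; serving a request is a
# dictionary lookup plus an AES-GCM operation, with no shared mutable state.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

class KmsShard:
    def __init__(self, keyset: dict[str, bytes]):
        # key name -> KEK bytes; in the real system these would arrive wrapped
        # by Root KMS and be unwrapped into memory at startup.
        self._keks = {name: AESGCM(material) for name, material in keyset.items()}

    def wrap(self, key_name: str, dek: bytes) -> bytes:
        nonce = os.urandom(12)
        return nonce + self._keks[key_name].encrypt(nonce, dek, key_name.encode())

    def unwrap(self, key_name: str, wrapped: bytes) -> bytes:
        return self._keks[key_name].decrypt(wrapped[:12], wrapped[12:], key_name.encode())

# Every shard started from the same config behaves identically.
shard = KmsShard({"example-team/master-key": AESGCM.generate_key(bit_length=256)})
```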
  38. 38. What Could Go Wrong?
  39. 39. The Great Gmail Outage of 2014 https://googleblog.blogspot.com/2014/01/todays-outage-for-several-google.html
  40. 40. Normal operation: each team maintains its own KMS configuration, all stored (encrypted) in Google’s monolithic source repository. A config-merge cron job automatically merges individual team config changes into a single merged config, which a data pusher distributes to the many KMS servers, each serving from its local config. The outage: the merge job saw an incorrect image of the source repo, a merging bug produced a truncated config, and that bad config was pushed globally to every local config. A bad config pushed globally means a global outage.
  41. 41. Lessons Learned The KMS had become ● a single point of failure ● a startup dependency for services ● often a runtime dependency ==> KMS Must Not Fail Globally
  42. 42. ● No more all-at-once global rollout of binaries and configuration ● Regional failure isolation and client isolation ● Minimize dependencies KMS Must Not Fail Globally
  43. 43. Google KMS Current Stats: ● No downtime since the Gmail outage in January 2014: >> 99.9999% availability ● 99.9% of requests are served in < 6 ms ● ~10⁷ requests/sec (~10M QPS) ● ~10⁴ processes & cores
  44. 44. Challenge: Safe Key Rotation
  45. 45. Make It Easy To Rotate Keys ● Key compromise ○ Also requires access to cipher text
  46. 46. Make It Easy To Rotate Keys ● Key compromise ○ Also requires access to cipher text ● Broken ciphers ○ Access to cipher text is enough
  47. 47. Make It Easy To Rotate Keys ● Key compromise ○ Also requires access to cipher text ● Broken ciphers ○ Access to cipher text is enough ● Rotating keys limits the window of vulnerability
  48. 48. Make It Easy To Rotate Keys ● Key compromise ○ Also requires access to cipher text ● Broken ciphers ○ Access to cipher text is enough ● Rotating keys limits the window of vulnerability ● But rotating keys means there is potential for data loss
  49. 49. Goals 1. KMS users design with rotation in mind 2. Using multiple key versions is no harder than using a single key 3. Very hard to lose data Robust Key Rotation at Scale - 0
  50. 50. Robust Key Rotation at Scale - 1 Goal #1: KMS users design with rotation in mind ● Users choose ○ Frequency of rotation: e.g. every 30 days ○ TTL of cipher text: e.g. 30, 90, 180 days, 2 years, etc.
  51. 51. Robust Key Rotation at Scale - 1 Goal #1: KMS users design with rotation in mind ● Users choose ○ Frequency of rotation: e.g. every 30 days ○ TTL of cipher text: e.g. 30, 90, 180 days, 2 years, etc. ● KMS guarantees ‘Safety Condition’ ○ All ciphertext produced within the TTL can be deciphered using a keyset in the KMS.
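One way to read the safety condition (this derivation is an illustration, not a formula from the talk): if ciphertext may live for `ttl` days and a new primary version is created every `rotation_period` days, then every version that was primary at any point during the last `ttl` days must still be decryptable, so roughly ceil(ttl / rotation_period) + 1 versions need to remain active.

```python
import math

def versions_to_retain(ttl_days: int, rotation_period_days: int) -> int:
    """Minimum key versions that must stay active so any ciphertext produced
    within the TTL can still be deciphered (illustrative estimate)."""
    # Every version that served as primary during the last `ttl_days` may have
    # produced still-live ciphertext, plus the current primary.
    return math.ceil(ttl_days / rotation_period_days) + 1

# e.g. rotate every 30 days, ciphertext lives up to 90 days -> keep ~4 versions active
print(versions_to_retain(ttl_days=90, rotation_period_days=30))
```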
  52. 52. Robust Key Rotation at Scale - 2 Goal #2: Using multiple key versions is no harder than using a single key
  53. 53. Robust Key Rotation at Scale - 2 Goal #2: Using multiple key versions is no harder than using a single key ● Tightly integrated with Google's standard cryptographic libraries: see Tink
  54. 54. Robust Key Rotation at Scale - 2 Goal #2: Using multiple key versions is no harder than using a single key ● Tightly integrated with Google's standard cryptographic libraries: see Tink ○ Keys support multiple key versions ○ Each of which can be a different cipher
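The talk points to Tink for this; the sketch below mimics the idea with plain AES-GCM rather than Tink's actual API. A "keyset" bundles several key versions behind one object: encryption always uses the primary version, and decryption transparently uses whichever version is recorded with the ciphertext, so callers never handle individual versions.

```python
# Toy multi-version keyset (conceptual, not Tink's real API): encrypt with the
# primary version, decrypt with whichever active version produced the ciphertext.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

class Keyset:
    def __init__(self):
        self.versions: dict[int, bytes] = {}   # version id -> key material
        self.primary: int | None = None

    def add_version(self, make_primary: bool = False) -> int:
        vid = max(self.versions, default=0) + 1
        self.versions[vid] = AESGCM.generate_key(bit_length=256)
        if make_primary or self.primary is None:
            self.primary = vid
        return vid

    def encrypt(self, plaintext: bytes) -> bytes:
        nonce = os.urandom(12)
        ct = AESGCM(self.versions[self.primary]).encrypt(nonce, plaintext, None)
        return self.primary.to_bytes(4, "big") + nonce + ct   # tag ciphertext with version id

    def decrypt(self, blob: bytes) -> bytes:
        vid = int.from_bytes(blob[:4], "big")
        nonce, ct = blob[4:16], blob[16:]
        return AESGCM(self.versions[vid]).decrypt(nonce, ct, None)

ks = Keyset()
ks.add_version()                     # V1 becomes primary
old = ks.encrypt(b"hello")
ks.add_version(make_primary=True)    # rotate: V2 is now primary
assert ks.decrypt(old) == b"hello"   # old ciphertext still decrypts
```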
  55. 55. Robust Key Rotation at Scale - 3 Goal #3: Very hard to lose data ● Timeline (Time ⇢ T0 … T7) of two key versions ● V1: A P P A A A SFR ● V2: A P P A A A ● Legend: A = Active, P = Primary, SFR = Scheduled for Revocation
  59. 59. Recap: Key Rotation
  60. 60. Recap: Key Rotation ● Presents an availability vs security tradeoff
  61. 61. Recap: Key Rotation ● Presents an availability vs security tradeoff ● KMS ○ Derives the number of key versions to retain
  62. 62. Recap: Key Rotation ● Presents an availability vs security tradeoff ● KMS ○ Derives the number of key versions to retain ○ Adds/Promotes/Demotes/Deletes Key Versions over time
  63. 63. Google KMS - Summary Implementing encryption at scale required highly available key management. At Google’s scale this means 5 9s of availability. To achieve all requirements, we use several strategies: ● Best practices for change management and staged rollouts ● Minimize dependencies and aggressively defend against their unavailability ● Isolate by region & client type ● Combine immutable keys + wrapping to achieve scale ● A declarative API for key rotation
  64. 64. We Are Hiring! anvita@google.com
  65. 65. Further Reading ■ Google Cloud Encryption at Rest whitepaper: https://cloud.google.com/security/encryption-at-rest/default-encryption/ ■ Google Application Layer Transport Security: https://cloud.google.com/security/encryption-in-transit/application-layer-transport-security/ + Infographic https://cloud.withgoogle.com/infrastructure/data-encryption/step-7 ■ Tink cryptographic library https://github.com/google/tink ■ Site Reliability Engineering (SRE) handbook: https://landing.google.com/sre/book.html
  66. 66. Bonus Content
  67. 67. Challenge: Data Integrity
  68. 68. Causes of Bit Errors ■ Corruption in transit as NICs (network cards) twiddle bits. ■ Corruption in memory from broken CPUs ■ Cosmic rays flip bits in DRAM ■ [not an exhaustive list]
  69. 69. Hardware Faults ○ Crypto provides leverage ○ Key material corruption can render large chunks of data unusable.
  70. 70. Software Mitigations ○ Verify correctness of crypto operations at start of a process ■ During a request, after using the KEK to wrap a DEK and before responding to the customer, we unwrap the same DEK ■ Storage services ● Read back plain text after writing encrypted data blocks ● Replicate/parity protect at a higher layer
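A minimal sketch of the wrap-then-verify idea described above, again using AES-GCM as a stand-in: the check re-unwraps the freshly wrapped DEK and compares it to the original before the response leaves the server, so a corrupted wrap cannot silently reach the caller.

```python
# Illustrative self-check: never return a wrapped DEK without proving it can be
# unwrapped back to the original bytes (guards against bit flips introduced by
# faulty hardware between wrapping and responding).
import hmac
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

KEK = AESGCM.generate_key(bit_length=256)

def wrap_with_verification(dek: bytes) -> bytes:
    nonce = os.urandom(12)
    wrapped = nonce + AESGCM(KEK).encrypt(nonce, dek, b"dek")
    # Verification pass: unwrap exactly what we are about to return.
    recovered = AESGCM(KEK).decrypt(wrapped[:12], wrapped[12:], b"dek")
    if not hmac.compare_digest(recovered, dek):
        raise RuntimeError("wrap/unwrap self-check failed; refusing to respond")
    return wrapped
```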
  71. 71. Key Sensitivity Annotations
  72. 72. Key Sensitivity Annotations Users assess the consequences of a key compromise in terms of the CIA triad
  73. 73. Sensitivity Annotations Users assess the consequences of a key compromise in terms of the CIA triad: ● Confidentiality ● Integrity ● Availability
  74. 74. Sensitivity Annotations ● Each consequence has corresponding policy recommendations ● For example, only a verifiably built program can access a key that could leak user data.
  75. 75. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ high-availability-google-key- management/
