1© Cloudera, Inc. All rights reserved.
Charles Lamb
HDFS Transparent Encryption
SFHUG
Overview
• Developed in open source (Apache JIRA HDFS-6134)
• Data read from and written to certain directories is transparently encrypted
• No changes to user code
• Encryption/decryption always done by client
• HDFS never handles unencrypted data or unencrypted keys
• Helps applications be regulation-compliant (HIPAA, PCI DSS, FISMA, etc.)
Background
• Encryption can happen at any of several levels:
• Application: most secure and flexible, but hardest to do
• Adding encryption to legacy applications may be difficult
• Database: most DBMSs have this, but may incur performance penalties
• Secondary indices generally cannot be encrypted
• Filesystem: high performance, transparent, but may not be flexible enough
• Multi-tenancy vs per-user encryption policies
• Disk: high performance but only really protects against physical theft
• HDFS encryption is somewhere between Filesystem and Database level
Design Goals
• Performance and scalability
• Transparent to applications, including legacy apps
• End-to-end
• Data should be encrypted on the network and ‘at-rest’
• Compartmentalization
• Key management independent of HDFS management
• Includes preventing HDFS admins and root users from accessing sensitive data
• Compatibility with HDFS access methods: WebHDFS, HttpFS, FUSE, NFS, hftp, har, etc.
Architectural Concepts
• Key Management Server
• Encryption Zones
• Keys
Key Management Server
Key Management Server (KMS)
• KMS sits between client and key server
• E.g. Cloudera Navigator Key Trustee
• Provides a unified API and scalability
• REST API
• Does not actually store keys (backend does that), but does cache them
• ACLs on per-key basis
Encryption Zones
• An HDFS directory in which the contents (including subdirs) are encrypted on write and decrypted on read
• An EZ begins life as an empty directory
• Renames in/out of an EZ are prohibited
• Encryption is transparent to applications; no code changes are needed
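The rename rule can be pictured as a boundary check. The following is a minimal sketch, not HDFS code: the helper names `enclosing_zone` and `rename_allowed` are hypothetical, zones are assumed not to nest, and no attempt is made to model any special cases HDFS may apply (for example, to the zone root directory itself).

```python
from pathlib import PurePosixPath


def enclosing_zone(path, zone_roots):
    """Return the encryption-zone root that contains `path`, or None."""
    p = PurePosixPath(path)
    for root in zone_roots:
        r = PurePosixPath(root)
        # A path is inside a zone if it is the root or a descendant of it.
        if p == r or r in p.parents:
            return root
    return None


def rename_allowed(src, dst, zone_roots):
    """Allow a rename only if it does not cross an encryption-zone boundary.

    Hypothetical simplification: source and destination must resolve to the
    same zone, or both must lie outside every zone.
    """
    return enclosing_zone(src, zone_roots) == enclosing_zone(dst, zone_roots)


zones = ["/secure"]
print(rename_allowed("/secure/a", "/secure/sub/a", zones))  # stays inside the zone
print(rename_allowed("/secure/a", "/tmp/a", zones))         # would leave the zone
```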
Keys
• Every Encryption Zone has a key (“EZ Key”)
• Every file in an Encryption Zone has a unique key (“Data Encryption Key” or “DEK”)
• The HDFS NameNode stores the name of the EZ Key in an xattr of the EZ directory
• The actual EZ Key is stored in the Key Server
• The NameNode stores the DEK in an xattr of the file, but only in encrypted form
• Encrypted Data Encryption Key, or “EDEK”
• The NameNode never touches decrypted data or decrypted keys
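The relationship between EZ Key, DEK, and EDEK can be sketched as envelope encryption. This is a toy model under loud assumptions: `ToyKeyServer` and `_keystream_xor` are hypothetical names, the cipher is an HMAC-SHA256 keystream XOR used only so the example runs on the Python standard library (it is NOT AES-CTR and not suitable for real use), and the dictionaries merely stand in for NameNode xattrs.

```python
import hashlib
import hmac
import os


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (stand-in for AES-CTR): XOR with an HMAC keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(
            hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        )
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyKeyServer:
    """Stands in for the KMS and its backing key server; holds EZ keys by name."""

    def __init__(self):
        self._ez_keys = {}

    def create_key(self, name):
        self._ez_keys[name] = os.urandom(32)

    def generate_edek(self, name):
        """Create a fresh DEK and hand back only its encrypted form."""
        dek = os.urandom(32)
        nonce = os.urandom(16)
        return nonce, _keystream_xor(self._ez_keys[name], nonce, dek)

    def decrypt_edek(self, name, nonce, edek):
        """Unwrap an EDEK for a client; the EZ key itself never leaves."""
        return _keystream_xor(self._ez_keys[name], nonce, edek)


kms = ToyKeyServer()
kms.create_key("ez-key-1")

# What the "NameNode" keeps: the EZ key *name* on the directory, the EDEK on the file.
zone_xattr = {"key_name": "ez-key-1"}
nonce, edek = kms.generate_edek(zone_xattr["key_name"])
file_xattr = {"nonce": nonce, "edek": edek}

# Client write path: unwrap the EDEK via the key server, encrypt locally.
dek = kms.decrypt_edek(zone_xattr["key_name"], file_xattr["nonce"], file_xattr["edek"])
file_nonce = os.urandom(16)
ciphertext = _keystream_xor(dek, file_nonce, b"sensitive record")

# Client read path: unwrap again, decrypt locally.
dek_again = kms.decrypt_edek(zone_xattr["key_name"], file_xattr["nonce"], file_xattr["edek"])
plaintext = _keystream_xor(dek_again, file_nonce, ciphertext)
```

In this model the metadata store only ever holds `edek` and `ciphertext`, which mirrors the point that the NameNode never sees a DEK or plaintext.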
EZ Keys, Data Encryption Keys, and Encrypted Data Encryption Keys
Key Handling
Design
• End-to-end encryption
• Encryption occurs on the client and decrypted data is never touched by HDFS
• Protects against network sniffing, evil HDFS admins, and hard drive theft
• HDFS never touches key material (DEKs or EZ keys)
• Compromising an HDFS daemon is not a viable attack vector
• HDFS handles encrypted keys (EDEKs), but never in decrypted form (DEKs)
• Key permissions are handled by the KMS ACLs
• Each file is encrypted with a unique DEK
HDFS Encryption Configuration
• hadoop key create <keyname>
• hdfs dfs -mkdir <path>
• hdfs crypto -createZone -keyName <keyname> -path <path>
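For scripted setups, the three commands above can be assembled as argument lists. This is a small sketch only: the function name `encryption_zone_commands` and the sample values `mykey` and `/secure` are made up for illustration, and nothing here is executed against a cluster.

```python
import shlex


def encryption_zone_commands(key_name, path):
    """Return the three setup commands as argv lists, in the order shown above."""
    return [
        ["hadoop", "key", "create", key_name],
        ["hdfs", "dfs", "-mkdir", path],
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", path],
    ]


# Print the commands for review before running them.
for argv in encryption_zone_commands("mykey", "/secure"):
    print(shlex.join(argv))
```

On a host with a configured Hadoop client, each list could be passed to `subprocess.run(argv, check=True)`; building lists rather than strings avoids shell quoting problems in key names and paths.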
KMS Per-User ACL Configuration
• White lists (check for inclusion) and black lists (check for exclusion)
• etc/hadoop/kms-acls.xml
• hadoop.kms.acl.CREATE
• hadoop.kms.blacklist.CREATE
• … DELETE, ROLLOVER, GET, GET_KEYS, GET_METADATA, GENERATE_EEK, DECRYPT_EEK
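The white list / black list combination can be read as "included, and not excluded". A minimal sketch of that check follows; `is_allowed` is a hypothetical helper, the user names are invented, and treating an operation with no white-list entry as denied is a choice made for this sketch rather than a statement about the KMS defaults.

```python
def is_allowed(user, operation, acls, blacklists):
    """Grant access only if the user is white-listed and not black-listed.

    `acls` and `blacklists` map an operation name (e.g. "CREATE") to a set of
    user names; "*" in a white list means any user.
    """
    allowed = acls.get(operation, set())
    blocked = blacklists.get(operation, set())
    in_whitelist = "*" in allowed or user in allowed
    return in_whitelist and user not in blocked


acls = {"CREATE": {"alice", "bob"}, "DECRYPT_EEK": {"*"}}
blacklists = {"CREATE": {"bob"}, "DECRYPT_EEK": {"hdfs"}}

print(is_allowed("alice", "CREATE", acls, blacklists))      # white-listed only
print(is_allowed("bob", "CREATE", acls, blacklists))        # black list wins
print(is_allowed("hdfs", "DECRYPT_EEK", acls, blacklists))  # excluded despite "*"
```

Black-listing the HDFS superuser for `DECRYPT_EEK`, as in the last line, is one way to express the "key admin != HDFS admin" separation in this model.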
KMS Per-Key ACL Configuration
• etc/hadoop/kms-acls.xml
• key.acl.<keyname>.<operation>
• MANAGEMENT – createKey, deleteKey, rolloverNewVersion
• GENERATE_EEK – generateEncryptedKey, warmUpEncryptedKeys
• DECRYPT_EEK – decryptEncryptedKey
• READ – getKeyVersion, getKeyVersions, getMetadata, getKeysMetadata, getCurrentKey
• ALL – all of the above
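The operation classes above group individual key-provider calls. The sketch below encodes that grouping as a lookup table and shows how a per-key check might use it; `op_class_for`, `key_call_allowed`, and the `(key name, class)` dictionary layout are assumptions for illustration, not the KMS implementation.

```python
# Operation classes and the calls they cover, as listed on the slide.
KEY_OP_CLASSES = {
    "MANAGEMENT": {"createKey", "deleteKey", "rolloverNewVersion"},
    "GENERATE_EEK": {"generateEncryptedKey", "warmUpEncryptedKeys"},
    "DECRYPT_EEK": {"decryptEncryptedKey"},
    "READ": {
        "getKeyVersion",
        "getKeyVersions",
        "getMetadata",
        "getKeysMetadata",
        "getCurrentKey",
    },
}


def op_class_for(method):
    """Return the operation class that covers a key-provider call."""
    for op_class, methods in KEY_OP_CLASSES.items():
        if method in methods:
            return op_class
    raise KeyError(f"unknown key operation: {method}")


def key_call_allowed(user, key_name, method, key_acls):
    """Check a call against per-key ACLs; an "ALL" entry grants every class."""
    op_class = op_class_for(method)
    granted = key_acls.get((key_name, op_class), set()) | key_acls.get(
        (key_name, "ALL"), set()
    )
    return user in granted


key_acls = {
    ("mykey", "DECRYPT_EEK"): {"alice"},
    ("mykey", "ALL"): {"keyadmin"},
}
print(key_call_allowed("alice", "mykey", "decryptEncryptedKey", key_acls))
print(key_call_allowed("alice", "mykey", "deleteKey", key_acls))
```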
Performance
• AES-CTR with 128-bit or 256-bit keys (256-bit requires the unlimited-strength JCE policy files)
• AES-NI available
• Negligible overhead on writes and 7.5% impact on reads for datasets larger than memory
DistCp
• Encryption Zone to Encryption Zone
• use -update -skipcrccheck
• Admins use special /.reserved/raw path prefix
• /.reserved/raw is only available to the HDFS superuser and provides the encrypted contents
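The two DistCp modes can be sketched as command builders. The checksum check is skipped in the first mode because each file is encrypted with its own DEK, so the stored bytes (and therefore their checksums) differ between source and destination even when the plaintext is identical. The function names and sample paths below are hypothetical, and the commands are only constructed, not run.

```python
RAW_PREFIX = "/.reserved/raw"


def raw_path(path):
    """Map an absolute HDFS path to its /.reserved/raw equivalent."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute HDFS path")
    return RAW_PREFIX + path


def distcp_between_zones(src, dst):
    """Copy through the normal paths: data is decrypted and re-encrypted."""
    return ["hadoop", "distcp", "-update", "-skipcrccheck", src, dst]


def distcp_raw(src, dst):
    """Copy the encrypted bytes as-is via the /.reserved/raw prefix."""
    return ["hadoop", "distcp", raw_path(src), raw_path(dst)]


print(" ".join(distcp_between_zones("/zone1/data", "/zone2/data")))
print(" ".join(distcp_raw("/zone1/data", "/backup/data")))
```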
Exceptions
• Hive: may not be able to do a query that combines data from more than one encryption zone
HDFS Encryption - Summary
• Good performance (4-10% hit)
• No mods to existing applications
• Prevents attacks at the filesystem and below
• OS and filesystem only see encrypted bytes
• Data is encrypted all the way to the client
• Secure ‘at rest’ and in transit
• Key management is independent of HDFS
• Key admin != HDFS admin
• Can prevent HDFS admin from accessing secure data
Questions
