How to Protect Big Data in a
Containerized Environment
Thomas Phelan
Chief Architect, BlueData
@tapbluedata
Outline
 Securing a Big Data Environment
 Data Protection
 Transparent Data Encryption
 Transparent Data Encryption in a Containerized Environment
 Takeaways
In the Beginning …
 Hadoop was used to process public web data
- No compelling need for security
• No user or service authentication
• No data security
Then Hadoop Became Popular
Security is important.
Layers of Security in Hadoop
 Access
 Authentication
 Authorization
 Data Protection
 Auditing
 Policy (protect from human error)
Hadoop Security: Data Protection
Reference: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/sg_edh_overview.html
Focus on Data Security
 Confidentiality
- Confidentiality is lost when data is accessed by someone not
authorized to do so
 Integrity
- Integrity is lost when data is modified in unexpected ways
 Availability
- Availability is lost when data is erased or becomes inaccessible
Reference: https://www.us-cert.gov/sites/default/files/publications/infosecuritybasics.pdf
Hadoop Distributed File System (HDFS)
 Data Security Features
- Access Control
- Data Encryption
- Data Replication
Access Control
 Simple
- Identity determined by host operating system
 Kerberos
- Identity determined by Kerberos credentials
- One realm for both compute and storage
- Required for HDFS Transparent Data Encryption
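For illustration, a minimal sketch of Kerberos authentication before touching HDFS; the user principal and path below are made-up examples:
# Obtain a Kerberos ticket for an example user principal
$ kinit alice@CORP.ENTERPRISE.COM
# Subsequent HDFS operations authenticate with that ticket
$ hdfs dfs -ls /user/alice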
Data Encryption
 Transforming data into a form that can be read only with the proper key
Data Replication
 3 way replication
- Can survive any 2 failures
 Erasure Coding
- Can survive more than 2 failures, depending on the parity configuration (see the sketch below)
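Both protections are set per path with standard HDFS commands; a hedged sketch with illustrative paths (RS-6-3-1024k is one built-in Reed-Solomon policy, tolerating up to 3 failures):
# Keep three replicas of everything under /data/critical
$ hdfs dfs -setrep -R 3 /data/critical
# Or use Reed-Solomon erasure coding (6 data + 3 parity blocks)
$ hdfs ec -setPolicy -path /data/archive -policy RS-6-3-1024k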
HDFS with End-to-End Encryption
 Confidentiality
- Access Control
 Integrity
- Access Control + Data Encryption
 Availability
- Access Control + Data Replication
Data Encryption
 How to transform the data?
[Diagram: a stream of cleartext bits is transformed into unreadable ciphertext]
Data Encryption – At Rest
 Data is encrypted while on persistent media (disk)
Data Encryption – In Transit
 Data is encrypted while traveling over the network
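In-transit encryption in Hadoop is governed by standard configuration keys; a sketch of checking them (enabling them is a separate configuration step):
# Encrypt the HDFS block data transfer protocol?
$ hdfs getconf -confKey dfs.encrypt.data.transfer
# Protect Hadoop RPC traffic ("privacy" adds encryption on top of authentication and integrity)
$ hdfs getconf -confKey hadoop.rpc.protection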
The Whole Process
[Diagram: the complete write/read path; data appears only as ciphertext on the wire and on disk]
HDFS Transparent Data Encryption (TDE)
 End-to-end encryption
- Data is encrypted/decrypted at the client
• Data is protected at rest and in transit
 Transparent
- No application level code changes required
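Transparency is visible from the command line: assuming an existing encryption zone at /encrypted_dir, the usual commands work unchanged:
# No special flags; encryption and decryption happen in the HDFS client
$ hdfs dfs -put report.csv /encrypted_dir/report.csv
$ hdfs dfs -cat /encrypted_dir/report.csv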
HDFS TDE – Design
 Goals:
- Only an authorized client/user can access cleartext
- HDFS never stores cleartext or unencrypted data encryption keys
HDFS TDE – Terminology
 Encryption Zone
- A directory whose file contents will be encrypted upon write and
decrypted upon read
- An EZKEY is generated for each zone
HDFS TDE – Terminology
 EZKEY – encryption zone key
 DEK – data encryption key
 EDEK – encrypted data encryption key
HDFS TDE – Data Encryption
 The same key (the DEK) is used to encrypt and decrypt the data
 The ciphertext is exactly the same size as the original cleartext
 The DEK is wrapped and unwrapped with the EZKEY:
- EZKEY + DEK => EDEK
- EDEK + EZKEY => DEK
HDFS TDE – Services
 HDFS NameNode (NN)
 Kerberos Key Distribution Center (KDC)
 Hadoop Key Management Server (KMS)
- Key Trustee Server
HDFS TDE – Security Concepts
 Division of Labor
- KMS creates the EZKEY & DEK
- KMS encrypts/decrypts the DEK/EDEK using the EZKEY
- HDFS NN communicates with the KMS to create EZKEYs &
EDEKs to store in the extended attributes in the encryption zone
- HDFS client communicates with the KMS to get the DEK using
the EZKEY and EDEK.
HDFS TDE – Security Concepts
 The name of the EZKEY is stored in the HDFS extended
attributes of the directory associated with the encryption zone
 The EDEK is stored in the HDFS extended attributes of the file in
the encryption zone
$ hadoop key …
$ hdfs crypto …
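Illustrative expansions of those two command families (the key name is made up for the example):
# KMS key management
$ hadoop key list
$ hadoop key roll ezkey
# Encryption zone administration
$ hdfs crypto -listZones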
HDFS Examples
 Simplified for the sake of clarity:
- Kerberos actions not shown
- NameNode EDEK cache not shown
HDFS – Create Encryption Zone
[Diagram: the NameNode asks the KMS to create the EZKEY (step 3); /encrypted_dir records EZKEYNAME = KEY in its extended attributes]
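Putting the commands together, creating a zone typically follows this sketch (the key and directory names are illustrative; the directory must exist and be empty):
# 1. Create the encryption zone key in the KMS
$ hadoop key create ezkey
# 2. Create an empty directory and turn it into an encryption zone
$ hdfs dfs -mkdir /encrypted_dir
$ hdfs crypto -createZone -keyName ezkey -path /encrypted_dir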
HDFS – Create Encrypted File
[Diagram: 1. the client creates a file under /encrypted_dir; 2.-3. the NameNode asks the KMS to create an EDEK; 4. the EDEK is stored in the file's xattr; 5. success is returned, and /encrypted_dir/file holds only encrypted data]
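An ordinary write exercises this flow; assuming the zone above, the per-file encryption metadata (including the EDEK) can then be inspected:
# An ordinary put triggers steps 1-5 transparently
$ hdfs dfs -put report.csv /encrypted_dir/report.csv
# Show the file's encryption info, including its EDEK
$ hdfs crypto -getFileEncryptionInfo -path /encrypted_dir/report.csv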
HDFS TDE – File Write Work Flow
[Diagram: the client reads unencrypted data from the application, then requests the DEK using the file's EDEK and the zone's EZKEYNAME (step 3); the KMS decrypts the EDEK (step 4) and returns the DEK (step 5); the client writes encrypted data to /encrypted_dir/file]
HDFS TDE – File Read Work Flow
[Diagram: the client requests the DEK using the file's EDEK and the zone's EZKEYNAME (step 3); the KMS decrypts the EDEK (step 4) and returns the DEK (step 5); the client reads encrypted data from /encrypted_dir/file and hands unencrypted data to the application]
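To see the end-to-end property, compare the normal view with the raw namespace (superuser only), which returns the stored bytes without decryption:
# An authorized client gets cleartext back
$ hdfs dfs -cat /encrypted_dir/report.csv
# The /.reserved/raw namespace exposes the stored ciphertext
$ hdfs dfs -cat /.reserved/raw/encrypted_dir/report.csv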
Bring in the Containers (e.g. Docker)
 The issues with containers are the same as with any virtualization platform:
- Multiple compute clusters
- Multiple HDFS file systems
- Multiple Kerberos realms
- Cross-realm trust configuration
Containers as Virtual Machines
 Note – this is not about using containers to run Big Data tasks
Containers as Virtual Machines
 This is about running Hadoop / Big Data clusters in containers
Containers as Virtual Machines
 A true containerized Big Data environment
KDC Cross-Realm Trust
 Different KDC realms for corporate, data, and compute
 Must interact correctly in order for the Big Data cluster to function
- CORP.ENTERPRISE.COM: end users
- COMPUTE.ENTERPRISE.COM: Hadoop/Spark service principals
- DATALAKE.ENTERPRISE.COM: HDFS service principals
KDC Cross-Realm Trust
 Different KDC realms for corporate, data, and compute
- One-way trusts (see the kadmin sketch below)
• The compute realm trusts the corporate realm
• The data realm trusts the corporate realm
• The data realm trusts the compute realm
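A hedged sketch of establishing these one-way trusts with MIT Kerberos; each trust requires a cross-realm krbtgt principal created with the same key on both KDCs involved (shown once per trust here; exact steps vary by KDC product):
# Compute realm trusts the corporate realm
$ kadmin.local -q "addprinc krbtgt/COMPUTE.ENTERPRISE.COM@CORP.ENTERPRISE.COM"
# Data realm trusts the corporate realm
$ kadmin.local -q "addprinc krbtgt/DATALAKE.ENTERPRISE.COM@CORP.ENTERPRISE.COM"
# Data realm trusts the compute realm
$ kadmin.local -q "addprinc krbtgt/DATALAKE.ENTERPRISE.COM@COMPUTE.ENTERPRISE.COM"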
KDC Cross-Realm Trust
[Diagram: three realms, each with its own KDC - CORP.ENTERPRISE.COM, COMPUTE.ENTERPRISE.COM, and DATALAKE.ENTERPRISE.COM; the Hadoop cluster in the compute realm runs service principals such as rm@COMPUTE.ENTERPRISE.COM on behalf of users like user@CORP.ENTERPRISE.COM, while the data realm hosts HDFS at hdfs://remotedata/ and the Hadoop Key Management Service]
Key Management Service
 Must be enterprise quality
- Key store options behind the Hadoop KMS:
• Java KeyStore KMS
• Cloudera Navigator Key Trustee Server
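Clients locate the KMS through the key provider setting; a sketch of verifying it (the host, port, and URI are illustrative):
# Which key provider are clients configured to use?
$ hdfs getconf -confKey hadoop.security.key.provider.path
# List keys against that provider explicitly
$ hadoop key list -provider kms://http@kms-host.enterprise.com:9600/kms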
Containers as Virtual Machines
 A true containerized Big Data environment:
[Diagram: three containerized compute clusters, each with the same three-realm structure - CORP.ENTERPRISE.COM end users, COMPUTE.ENTERPRISE.COM Hadoop/Spark service principals, and DATALAKE.ENTERPRISE.COM HDFS service principals - sharing access to multiple data lakes]
Key Takeaways
 Hadoop has many security layers
- HDFS Transparent Data Encryption (TDE) is best of breed
- Security is hard (complex)
- Virtualization / containerization can only make it harder
- Separating compute from storage on top of virtualization / containerization makes it harder still
Key Takeaways
 Be careful with a build vs. buy decision for containerized Big Data
- Recommendation: buy one already built
- There are turnkey solutions (e.g. BlueData EPIC)
Reference: www.bluedata.com/blog/2017/08/hadoop-spark-docker-ten-things-to-know
www.bluedata.com
BlueData Booth #1508
in Strata Expo Hall
@tapbluedata
