This presentation discusses Parquet encryption at scale using Spark and Parquet. It covers the goals of Parquet modular encryption, including data privacy, integrity, and performance. It demonstrates writing and reading encrypted Parquet files in Spark and discusses the Apache community roadmap for further integration of Parquet encryption.
Efficient Spark Analytics on Encrypted Data with Gidon Gershinsky (Databricks)
Enterprises and non-profit organizations often work with sensitive business or personal information that must be stored in an encrypted form due to corporate confidentiality requirements, GDPR, and other reasons. Unfortunately, straightforward encryption doesn't work well for modern columnar data formats, such as Apache Parquet, that are leveraged by Spark to accelerate data ingest and processing. When Parquet files are bulk-encrypted at the storage layer, their internal modules can't be extracted, leading to a loss of column/row filtering capabilities and a significant slowdown of Spark workloads.
Existing solutions suffer from either performance or security drawbacks. We work with the Apache Parquet community on a new modular encryption mechanism that enables full columnar projection and predicate pushdown (filtering) functionality on encrypted data in any storage system. Besides confidentiality, the mechanism supports data authentication, where the reader can verify that a file has not been tampered with or replaced with a wrong version. Different columns can be encrypted with different keys, allowing for fine-grained access control.
In this talk, I will demonstrate Spark integration with the Parquet modular encryption mechanism, running efficient analytics directly on encrypted data. The demonstration scenarios are derived from use cases in our joint research project with a number of European companies working with sensitive data such as connected-car messages (location, speed, driver identity, etc.). I will describe the encryption mechanism and the observed performance implications of encrypting and decrypting data in Spark SQL workloads.
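To make the write/read flow concrete, here is a minimal sketch (not taken from the talk itself) of configuring Parquet modular encryption from Spark. It assumes a Spark build with parquet-mr 1.12+ on the classpath; the demo in-memory KMS, key material, column names and paths are illustrative placeholders only.

```scala
// Hedged sketch: Parquet modular encryption from Spark. Key names, key bytes,
// column names and paths are placeholders; InMemoryKMS is a demo-only KMS client.
// carsDF is assumed to be an existing DataFrame with location/driver_id/speed columns.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hc.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")   // demo KMS, not for production
hc.set("parquet.encryption.key.list",
  "footerKey: AAECAwQFBgcICQoLDA0ODw==, colKey: AAECAAECAAECAAECAAECAA==")

carsDF.write
  .option("parquet.encryption.footer.key", "footerKey")
  .option("parquet.encryption.column.keys", "colKey:location,driver_id") // encrypt only sensitive columns
  .parquet("/data/cars_encrypted")

// A reader holding the keys decrypts transparently; column projection and
// predicate pushdown keep working because Parquet modules are encrypted individually.
spark.read.parquet("/data/cars_encrypted")
  .select("location")
  .where("speed > 100")
  .show()
```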
A Deep Dive into Query Execution Engine of Spark SQL (Databricks)
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. Relational queries are compiled into executable physical plans consisting of transformations and actions on RDDs, with generated Java code. The code is compiled to Java bytecode, executed by the JVM, and optimized by the JIT to native machine code at runtime. This talk takes a deep dive into the Spark SQL execution engine, covering pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.
Tame the small files problem and optimize data layout for streaming ingestion... (Flink Forward)
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: (1) a small-files problem that can hurt read performance, and (2) poor data clustering that can make file pruning less effective. To address these two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partitioning. This can reduce the number of concurrent files that every task writes, and it can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling.
by Gang Ye & Steven Wu
Optimizing Delta/Parquet Data Lakes for Apache Spark (Databricks)
This talk will start by explaining the optimal file format, compression algorithm, and file size for plain vanilla Parquet data lakes. It discusses the small file problem and how you can compact the small files. Then we will talk about partitioning Parquet data lakes on disk and how to examine Spark physical plans when running queries on a partitioned lake.
We will discuss why it's better to avoid PartitionFilters and directly grab partitions when querying partitioned lakes. We will explain why partitioned lakes tend to have a massive small file problem and why it's hard to compact a partitioned lake. Then we'll move on to Delta lakes and explain how they offer cool features on top of what's available in Parquet. We'll start with Delta 101 best practices and then move on to compacting with the OPTIMIZE command.
We'll talk about creating a partitioned Delta lake and how OPTIMIZE works on a partitioned lake. Then we'll talk about ZORDER indexes and how to incrementally update lakes with a ZORDER index. We'll finish with a discussion of adding a ZORDER index to a partitioned Delta data lake.
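For reference, the compaction and clustering steps mentioned above boil down to commands like the following hedged sketch (Delta Lake SQL on Databricks, or OSS Delta where OPTIMIZE and ZORDER are available); the table name, predicate and column are illustrative.

```scala
// Hedged sketch of Delta compaction and Z-ordering; table/column names are placeholders.
spark.sql("OPTIMIZE events")                                                   // compact small files across the table
spark.sql("OPTIMIZE events WHERE date >= '2021-01-01' ZORDER BY (eventType)")  // cluster only recent partitions
spark.sql("DESCRIBE HISTORY events").show(false)                               // inspect what OPTIMIZE actually did
```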
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 (StreamNative)
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL, which can be joined with other data sources. This plugin will soon get a rename to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
How to Extend Apache Spark with Customized Optimizations (Databricks)
There is a growing set of optimization mechanisms that allow you to achieve competitive SQL performance. Spark has extension points that help third parties add customizations and optimizations without needing those optimizations to be merged into Apache Spark. This is very powerful and helps extensibility. We have added some enhancements to the existing extension points framework to enable fine-grained control. This talk will be a deep dive into the extension points that are available in Spark today. We will also talk about the enhancements to this API that we developed to make it more powerful. This talk will benefit developers who are looking to customize Spark in their deployments.
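As a small, hypothetical illustration of the extension points being discussed, the sketch below injects a do-nothing optimizer rule through SparkSessionExtensions; the rule name and logging are placeholders, not the API enhancements the talk describes.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: observes every optimized plan and returns it unchanged.
case class LogPlanRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"Optimizer saw plan:\n$plan")
    plan
  }
}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("extension-points-demo")
  .withExtensions(ext => ext.injectOptimizerRule(LogPlanRule))   // the extension point
  .getOrCreate()
```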
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with a comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 to show its generality and flexibility.
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB (YugabyteDB)
Slides from the "Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB" webinar by Amey Banarse, Principal Data Architect at Yugabyte, recorded on Oct 30, 2019 at 11 AM Pacific.
Playback here: https://vimeo.com/369929255
To provide better security, ORC files are adding column encryption. Column encryption provides the ability to grant access to different columns within the same file. All of the encryption is handled transparently to the user.
The Parquet Format and Performance Optimization Opportunities (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
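A brief, hypothetical sketch of how some of the optimizations above surface in everyday Spark code; the dataset, columns and path are placeholders, and the plan output described in the comment is what to look for rather than a guaranteed rendering.

```scala
import spark.implicits._

// Write a partitioned, snappy-compressed Parquet dataset (partitioning scheme + page compression).
// eventsDF is assumed to be an existing DataFrame with country/user_id/amount columns.
eventsDF.write
  .partitionBy("country")
  .option("compression", "snappy")
  .parquet("/data/events_parquet")

// Read it back with a selective query: partition pruning handles `country`,
// min/max and dictionary filtering handle `amount` via predicate pushdown.
val q = spark.read.parquet("/data/events_parquet")
  .select("user_id", "amount")
  .where($"country" === "DE" && $"amount" > 100)

// Check the physical plan for PartitionFilters and PushedFilters entries.
q.explain(true)
```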
Protect your private data with ORC column encryption (Owen O'Malley)
Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. ORC is a self-describing type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads but with integrated support for finding required rows quickly.
Owen O'Malley dives into the progress the Apache community has made in adding fine-grained column-level encryption natively into the ORC format, which also provides capabilities to mask or redact data on write while protecting sensitive column metadata such as statistics to avoid information leakage. The column encryption capabilities will be fully compatible with Hadoop Key Management Server (KMS) and use the KMS to manage master keys, providing the additional flexibility to use and manage keys per column centrally.
Netflix's Big Data Platform team manages a data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. With a data warehouse at this scale, it is a constant challenge to keep improving performance. This talk will focus on Iceberg, a new table metadata format that is designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes to guarantee jobs always use consistent table snapshots.
In this session, you'll learn:
• Some background about big data at Netflix
• Why Iceberg is needed and the drawbacks of the current tables used by Spark and Hive
• How Iceberg maintains table metadata to make queries fast and reliable
• The benefits of Iceberg's design and how it is changing the way Netflix manages its data warehouse
• How you can get started using Iceberg (a rough sketch follows this entry)
Speaker
Ryan Blue, Software Engineer, Netflix
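For the last bullet above, here is a rough sketch of what first steps with Iceberg from Spark SQL can look like; the catalog name, warehouse location and table schema are assumptions, and the iceberg-spark runtime package must be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: register a Hadoop-backed Iceberg catalog and create a first table.
// Catalog name, warehouse path and schema are placeholders.
val spark = SparkSession.builder()
  .appName("iceberg-getting-started")
  .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.demo.type", "hadoop")
  .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
  .getOrCreate()

spark.sql("CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP, payload STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp(), 'hello')")
spark.sql("SELECT * FROM demo.db.events").show()
// Table metadata (snapshots, manifests) now lives under the warehouse path,
// which is what lets Iceberg plan jobs without expensive directory listings.
```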
Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake (Databricks)
Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the change log (binlog) of a relational OLTP database and replays those changes in a timely manner to external storage such as Delta or Kudu for real-time OLAP. To implement a robust CDC streaming pipeline, many factors must be considered, such as how to ensure data accuracy, how to handle schema changes in the OLTP source, and whether the pipeline is easy to build for a variety of databases with little code.
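One common shape of such a pipeline, sketched here under assumptions rather than taken from the talk: a Structured Streaming job reads parsed binlog rows and upserts them into a Delta table with MERGE inside foreachBatch. The topic, path, key column and op codes are placeholders.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// Hedged sketch of a CDC upsert; assumes each micro-batch already carries parsed
// binlog columns (id, op, ...) and that a Delta target exists at the given path.
def upsertBatch(changes: DataFrame, batchId: Long): Unit = {
  val target = DeltaTable.forPath(changes.sparkSession, "/lake/users_delta")
  target.as("t")
    .merge(changes.as("s"), "t.id = s.id")
    .whenMatched("s.op = 'DELETE'").delete()
    .whenMatched().updateAll()
    .whenNotMatched("s.op != 'DELETE'").insertAll()
    .execute()
}

val binlogStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "mysql.users.binlog")
  .load()
// ... parse the Kafka value into (id, op, ...) columns here before writing ...

binlogStream.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) => upsertBatch(batchDF, batchId) }
  .option("checkpointLocation", "/lake/_checkpoints/users_cdc")
  .start()
```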
Achieving Lakehouse Models with Spark 3.0 (Databricks)
It's very easy to be distracted by the latest and greatest approaches with technology, but sometimes there's a reason old approaches stand the test of time. Star schemas and Kimball modelling are among those things that aren't going anywhere, but as we move towards the "Data Lakehouse" paradigm, how appropriate is this modelling technique, and how can we harness the Delta Engine and Spark 3.0 to maximise its performance?
Parallelization of Structured Streaming Jobs Using Delta Lake (Databricks)
We'll tackle the problem of running streaming jobs from another perspective using Databricks Delta Lake, while examining some of the issues that we faced at Tubi while running regular Structured Streaming. We'll give a quick overview of why we transitioned from Parquet data files to Delta and the problems it solved for us in running our streaming jobs.
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu... (HostedbyConfluent)
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Buesing | Current 2022
Businesses need to react to results immediately; to achieve this, real-time processing is becoming a requirement in many analytic verticals. But sometimes, the move from batch to real-time can leave you in a pinch. How do you handle and correct mistakes in your data? How do you migrate a new system to real-time along with historical data?
Let's start with how to run Apache Druid locally in your containerized development environment. While streaming real-time events from Kafka into Druid, an S3-compliant store captures messages via Kafka Connect for historical processing. We'll explore what happens when the real-time stream of events contains historical data, how that affects performance, and the techniques to prevent those issues, leaving you with a high-performance analytic platform supporting real-time and historical processing.
You’ll leave with the tools of doing real-time analytic processing and historical batch processing from a single source of truth. Your Druid cluster will have better rollups (pre-computed aggregates) and fewer segments, which reduces cost and improves query performance.
Tuning Apache Kafka Connectors for Flink (Flink Forward)
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows, and in this session we'll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We'll inspect the params that help us quickly spot an application lock or crash, the ones that can significantly improve performance, and the ones to touch with gloves since they could cause more harm than benefit. Moreover, we'll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we'll discuss the Kafka sink. After browsing the available options we'll dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by Olena Babenko
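As a hedged illustration (not taken from the session) of the kind of knobs discussed, the sketch below builds a Flink KafkaSource with an explicit partition-discovery interval and a plain consumer override; the broker, topic and values are placeholders.

```scala
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Hedged sketch: a KafkaSource with a couple of non-default knobs. Values are illustrative.
val source = KafkaSource.builder[String]()
  .setBootstrapServers("broker:9092")
  .setTopics("events")
  .setGroupId("flink-events-consumer")
  .setStartingOffsets(OffsetsInitializer.earliest())
  .setValueOnlyDeserializer(new SimpleStringSchema())
  .setProperty("partition.discovery.interval.ms", "60000") // pick up new partitions at runtime
  .setProperty("max.poll.records", "500")                  // plain Kafka consumer override
  .build()

env.fromSource(source, WatermarkStrategy.noWatermarks[String](), "kafka-events")
  .print()

env.execute("kafka-tuning-sketch")
```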
How Kafka Powers the World's Most Popular Vector Database System with Charles... (HostedbyConfluent)
We use Kafka as the data backbone to build Milvus, an open-source vector database system that has been adopted by thousands of organizations worldwide for vector similarity search. In this presentation, we will share how Milvus uses Kafka to enable both real-time processing and batch processing on vector data at scale. We will walk through the challenges of unified streaming and batching in vector data processing, as well as the design choices and the Kafka-based data architecture.
Mario Molina, Software Engineer
CDC systems are usually used to identify changes in data sources and to capture and replicate those changes to other systems. Companies are using CDC to sync data across systems, migrate to the cloud, or even apply stream processing, among other uses.
In this presentation we’ll see CDC patterns, how to use it in Apache Kafka, and do a live demo!
https://www.meetup.com/Mexico-Kafka/events/277309497/
These slides present how DBT, Coral, and Iceberg can provide a novel data management experience for defining SQL workflows. In this UX, users define their workflows as a cascade of SQL queries, which then get auto-materialized and incrementally maintained. Applications of this user experience include Declarative DAG workflows, streaming/batch convergence, and materialized views.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... (Databricks)
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.
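A few of the knobs such a tuning walkthrough typically touches, shown as a hedged sketch; the values and table names are illustrative, not recommendations from the talk.

```scala
// Hedged sketch: common Spark SQL tuning knobs and plan inspection. Values are illustrative.
spark.conf.set("spark.sql.adaptive.enabled", "true")            // adaptive query execution (Spark 3.x)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "64MB")  // broadcast-join size cutoff
spark.conf.set("spark.sql.shuffle.partitions", "400")           // shuffle parallelism

// Inspect the parsed/analyzed/optimized/physical plans for a join.
spark.table("sales")
  .join(spark.table("dim_store"), "store_id")
  .explain(true)
```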
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ... (Databricks)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
DataStax | Best Practices for Securing DataStax Enterprise (Matt Kennedy) | C... (DataStax)
This talk will review the advanced security features in DataStax Enterprise and discuss best practices for secure deployments. In particular, topics reviewed will cover: Authentication with Kerberos & LDAP/Active Directory, Role-based Authorization and LDAP role assignment, Auditing, Securing network communication, Encrypting data files and using the Key-Management Interoperability Protocol (KMIP) for secure off-host key management. The talk will also suggest strategies for addressing security needs not met directly by the built-in features of the database such as how to address applications that require Attribute Based Access Control (ABAC).
About the Speaker
Matt Kennedy Sr. Product Manager, DataStax
Matt Kennedy works at DataStax as the product manager for DataStax Enterprise Core. Matt has been a Cassandra user and occasional contributor since version 0.7 and was named a Cassandra MVP in 2013 shortly before joining DataStax. Unlike Cassandra, Matt is not partition tolerant.
Toni de la Fuente - Automate or die! How to survive to an attack in the Cloud... (RootedCON)
Incident response and forensic analysis procedures are different in the cloud than when they are carried out in traditional, on-premises environments. We will look at the differences between traditional digital forensics and forensics on cloud systems in AWS, Azure or Google Cloud Platform. When it comes to the cloud and we operate in a fully virtual environment, we face challenges that differ from the traditional world. What used to be hardware is now software. With cloud infrastructure providers we work with APIs; we create, delete or modify any resource with a call to their API. We have load balancers, servers, routers, firewalls, databases, WAFs, encryption systems and many more resources, without opening a box or touching a cable, all at the stroke of a command. This is what we know as infrastructure as code. If you can program it, you can automate it. How can we take advantage of this from the standpoint of incident response, forensic analysis, or even automated hardening?
Security in IaaS: attacks, hardening, incident response, forensics, and everything about their automation. Although I will talk about general concepts related to AWS, Azure and GCP, I will show specific demos and threats in AWS and go into detail on some caveats and hazards in AWS.
Describes 3 levels of complexity when implementing a secret management architecture, and presents 2 real world examples.
Technologies used: Hashicorp Vault, Chef Vault, AWS KMS, git-crypt.
Apache Spot (incubating) can be installed on a new or existing Hadoop cluster, with its components viewed as services and distributed according to common roles in the cluster.
Design-Time Properties in Custom Pipeline Components (Daniel Toomey)
Understanding Design-Time Properties for Custom Pipeline Components in BizTalk Server
Based on an article by Saravana Kumar, MCAD (Charter member), MCP (BizTalk 2004)
Published: December 2006
Key-aggregate cryptosystem for scalable data sharing in cloud (Sravan Narra)
Data sharing is an important functionality in cloud storage.
Here we show how to securely, efficiently, and flexibly share data with others in cloud storage.
We describe new public-key cryptosystems that produce constant-size ciphertexts.
One can aggregate any set of secret keys and make them as compact as a single key, but encompassing the power of all the keys being aggregated.
Leveraging Apache Spark and Delta Lake for Efficient Data Encryption at Scale (Databricks)
The increase in consumer data privacy laws brings continuing challenges to data teams all over the world which collect, store, and use data protected by these laws. The data engineering team at Mars Petcare is no exception, and in order to improve efficiency and accuracy in responding to these challenges they have built Gecko: an efficient, auditable, and simple CCPA compliance ecosystem designed for Spark and Delta Lake.
Can puppet help you run docker on a T2.Micro? (Neil Millard)
A Puppet beginner's guide to a number of the key concepts of Puppet: stages, roles and profiles, Hiera data and Puppet Forge, as well as a brief introduction to Docker.
We will use these to explain a solution that runs a Puppet manifest to configure Amazon's smallest server to run a Docker-containerised web service.
You will learn why Puppet stages are required in this solution, how roles and profiles are defined and used, and finally how to use the Puppet Forge with Hiera data to install and run Docker containers.
This talk will contain links to code that can be used afterwards and we'll touch on what docker is and how to configure the puppet module to automatically run containers.
SF Big Analytics 2018-04-18: Evolution of GoPro's data platform (Chester Chen)
Talk 1: Evolution of GoPro's data platform
In this talk, we will share GoPro's experiences in building a data analytics cluster in the cloud. We will discuss:
• Evolution of the data platform from fixed-size Hadoop clusters to a cloud-based Spark cluster with a centralized Hive metastore + S3: cost benefits and DevOps impact
• A configurable, Spark-based batch ingestion/ETL framework
• Migration of the streaming framework to the cloud + S3
• Analytics metrics delivery with Slack integration
• BedRock: data platform management, visualization & self-service portal
• Visualizing machine learning features via Google Facets + Spark
Speakers: Chester Chen
Chester Chen is the Head of Data Science & Engineering, GoPro. Previously, he was the Director of Engineering at Alpine Data Lab.
David Winters
David is an Architect on the Data Science and Engineering team at GoPro and the creator of their Spark-Kafka data ingestion pipeline. Previously he worked at Apple and Splice Machine.
Hao Zou
Hao is a senior big data engineer on the Data Science and Engineering team. Previously he worked at Alpine Data Labs and Pivotal.
The document provides an overview of the key security challenges in Big Data (Apache Hadoop) systems and showcases the solutions used by the Hortonworks distribution to address these security challenges.
Similar to Data Security at Scale through Spark and Parquet Encryption
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
• Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
• Performing data quality validations using libraries built to work with Spark (a minimal sketch follows this list)
• Dynamically generating pipelines that can be abstracted away from users
• Flagging data that doesn't meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
• Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
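As a hypothetical, minimal example of a Spark-based validation of the kind described above (not Zillow's actual library), the sketch below profiles a column's completeness and fails fast when it misses an expectation; the table, column and threshold are placeholders.

```scala
import org.apache.spark.sql.functions.col

// Hypothetical completeness check; dataset, column and threshold are placeholders.
val listings     = spark.table("listings")
val total        = listings.count()
val nonNullZip   = listings.filter(col("zipcode").isNotNull).count()
val completeness = if (total == 0) 1.0 else nonNullZip.toDouble / total

require(completeness >= 0.99,
  f"zipcode completeness $completeness%.4f is below the 0.99 expectation")
```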
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, I worked with my team to build high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM not to be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
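A hedged sketch of the Spark 3.1 stage-level scheduling API described above: the ETL stages run with default resources, and only the stages after withResources request GPUs. The resource amounts, discovery script path and placeholder training logic are assumptions, not taken from the talk.

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Build a resource profile that asks for GPU-equipped executors.
val execReqs = new ExecutorResourceRequests()
  .cores(8)
  .memory("16g")
  .resource("gpu", 1, "/opt/spark/getGpusResources.sh")   // discovery script path is illustrative
val taskReqs = new TaskResourceRequests().cpus(2).resource("gpu", 1)

val rpb = new ResourceProfileBuilder()
rpb.require(execReqs).require(taskReqs)
val gpuProfile = rpb.build

// ETL runs with the default profile; stages after withResources run under the GPU profile.
val prepared = spark.sparkContext.textFile("/data/raw").map(_.split(","))   // data prep (CPU)
val results = prepared
  .withResources(gpuProfile)                     // stages from here on request GPUs
  .mapPartitions { rows => Iterator(rows.size) } // placeholder for the real training logic
results.collect()
```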
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
• Understanding key traits of Apache Spark on Kubernetes
• Things to know when running Apache Spark on Kubernetes, such as autoscaling
• Demonstrating how to run analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe the properties of sawtooth windows that we exploit to achieve online-offline consistency while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also cover a simple deployment strategy for correcting feature drift caused by operations over change data that are not abelian groups.
We want to present multiple anti-patterns for using Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job – Dispatch New Jobs by Polling a Redis Queue
· Why?
  o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming?
· Working solution using Redis
Niche 2: Distributed Counters (a minimal sketch follows this list)
· Problems with Spark accumulators
· Utilize Redis hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
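As a minimal sketch of Niche 2, assuming the Jedis client and a hypothetical job:counters hash; HINCRBY is atomic, so concurrent tasks do not lose updates, but (per the precautions above) retried or speculative task attempts will double-count unless de-duplicated:

import redis.clients.jedis.Jedis

// Hypothetical sketch: each partition bumps a field of a Redis hash atomically.
df.rdd.foreachPartition { rows =>
  val jedis = new Jedis("redis-host", 6379)   // assumed Redis endpoint
  try {
    val processed = rows.size                 // records seen by this task
    jedis.hincrBy("job:counters", "recordsProcessed", processed)
  } finally jedis.close()
}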
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures, and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open-source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operation, including streaming, batch, and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
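As a toy illustration of rule (ii), a small decision tree can be folded into the relational plan as a nested conditional that Catalyst can then optimize; this is a hypothetical rewrite with made-up column names, not Raven's actual API:

import org.apache.spark.sql.functions.{when, lit, col}

// Hypothetical: a depth-2 decision tree over `amount` and `numItems`,
// expressed as a conditional expression inside the SQL plan.
val scored = orders.withColumn("prediction",
  when(col("amount") < 100,
    when(col("numItems") < 3, lit(0.1)).otherwise(lit(0.4)))
  .otherwise(lit(0.9)))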
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This powers various marketing scenarios activated across multiple platforms and channels such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake, and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade-Offs with Various Formats
Go Over Anti-Patterns Used (String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) reduces duplicate computation and thus can also reduce iteration time. Road networks often contain chains that can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes can be computed directly; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
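For example, the converged rank of a chain vertex can be computed directly once its single predecessor's rank is known (standard PageRank with damping factor d over n vertices; a sketch, not the STICD implementation):

// rank(v) for a chain vertex v whose only in-neighbour u has out-degree 1
def chainRank(rankU: Double, d: Double, n: Long): Double =
  (1 - d) / n + d * rankU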
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Data Security at Scale through Spark and Parquet Encryption
1. Apple logo is a trademark of Apple Inc.
Gidon Gershinsky, Tim Perelmutov | Data + AI Summit
Data Security at Scale through
Spark and Parquet Encryption
THIS IS NOT A CONTRIBUTION
2. Presenters
Gidon Gershinsky
• Designs and builds data security solutions at Apple
• Leading role in Apache Parquet community work on data encryption
Tim Perelmutov
• Data ingestion and analytics for iCloud
3. Agenda
Parquet Encryption: Goals and Features
Status in Apache projects
API and “Hello World” samples
Community Roadmap
Demo Learnings: using Parquet Encryption at Scale
4. Apache Parquet
• Popular columnar storage format
• Encoding, compression
• Advanced data filtering
  • columnar projection: skip columns
  • predicate push down: skip files, row groups, or data pages
• Performance benefits of Parquet filtering
  - less data to fetch from storage: I/O, time
  - less data to process: CPU, latency
• How to protect sensitive Parquet data?
[Figure: columnar statistics - read only the data you need (Strata 2017 Parquet Arrow Roadmap)]
5. Parquet Modular Encryption: Goals
Protect sensitive data-at-rest
• data privacy / confidentiality
  - hiding sensitive information
• data integrity
  - tamper-proofing sensitive information
[Photo by Manuel Geissinger from Pexels]
6. Parquet Modular Encryption: Goals
Preserve performance of analytic engines
• Full Parquet capabilities (columnar projection, predicate pushdown, etc.) with encrypted data
• Big Data challenge: integrity protection
  • signing full files will break Parquet filtering, and slow analytic workloads down by order(s) of magnitude
[Figure: read only the data you need (2017 Parquet Arrow Roadmap)]
7. Parquet Modular Encryption: Goals
Define an open standard for safe storage of analytic data
• works the same in any storage
  • cloud or private, file systems, object stores, archives
  • untrusted storage!
• with any KMS (key management service)
• key-based access in any storage: private - cloud - archive
• enable per-column encryption keys
8. Parquet Modular Encryption: Goals
Big Data Challenges
Safe migration from one storage to another
• no need to import / decrypt / export / encrypt
• simply move the files
Sharing a data subset / table column(s)
• no need to extract / encrypt a copy for each user
• simply provide column key access to eligible users
9. Data Privacy / Confidentiality
Full encryption mode
•all modules are hidden
Plaintext footer mode
•footer is exposed for legacy readers
•sensitive metadata is hidden
Separate keys for sensitive columns
•column access control
“Client-side” encryption
•storage backend / admin never see data or keys
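For example, these modes are selected through standard parquet-mr Hadoop properties (a sketch; key IDs and column names are hypothetical):

// separate keys per sensitive column; columns not listed stay in plaintext
sc.hadoopConfiguration.set("parquet.encryption.column.keys", "keyA:ssn,credit_card;keyB:address")
sc.hadoopConfiguration.set("parquet.encryption.footer.key", "keyF")
// plaintext footer mode, so legacy readers can still read the unencrypted columns
sc.hadoopConfiguration.set("parquet.encryption.plaintext.footer", "true")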
10. Data Integrity
File contents not tampered with
File not replaced with a wrong file
PME signs data and metadata modules
• with module ID and file ID
AES GCM: “authenticated encryption”
Framework for other encryption algorithms
[Example of file replacement: customers-may-2021.part0.parquet vs customers-jan-2020.part0.parquet]
11. Envelope Encryption
• Parquet file modules are encrypted with “Data Encryption Keys” (DEKs)
• DEKs are encrypted with “Master Encryption Keys” (MEKs)
• the result is called “key material” and is stored either in Parquet file footers or in separate files in the same folder
• MEKs are stored and managed in “Key Management Service” (KMS)
• access control verification
• Advanced mode in Parquet: Double Envelope Encryption
• DEKs are encrypted with “Key Encryption Keys” (KEKs)
• KEKs are encrypted with MEKs
• single KMS call in process lifetime / or one call in X minutes, configurable
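To make the flow concrete, here is an illustrative sketch (not the parquet-mr internals; kmsClient stands for whatever KmsClient implementation is configured, shown later in this deck, and the master key ID is hypothetical):

import java.security.SecureRandom

// Illustrative only: one fresh DEK per file/column, wrapped by a master key that never leaves the KMS
val dek = new Array[Byte](16)                                   // data encryption key (AES-128)
new SecureRandom().nextBytes(dek)
val keyMaterial = kmsClient.wrapKey(dek, "footer-master-key")   // hypothetical master key ID
// Parquet modules are encrypted locally with the DEK; only the wrapped DEK (the "key material")
// is written to the file footer or to a separate key-material file next to the data.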
12. Current Status
Thank you to all contributors!
Format
• PME specification approved and released in 2019 (v2.7)
Parquet MR
• Java implementation, released in 2021 (v1.12.0)
• C++ implementation, merged in 2021
• Python interface under construction
Spark
• Parquet updated to 1.12.0 - enables basic encryption out-of-box
• Planned for the Spark 3.2.0 release
Other analytic frameworks
• ongoing work on integrating Parquet encryption
13. Spark with Parquet Encryption
Invoke encryption via Hadoop parameters:
• pass the list of columns to encrypt
• specify IDs of master keys for these columns
• specify the ID of the master key for Parquet footers
• pass the class name of the client for your KMS
• activate encryption
• instructions at PARQUET-1854
• try today! clone the Spark repo and build a runnable distribution
[Diagram: Spark App authenticating to KMS]
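Put together, a minimal write/read sketch in Spark, assuming the demo in-memory KMS class that parquet-mr provides for illustration (key IDs, key bytes, column names, and paths are placeholders; never put master keys in the config in production):

sc.hadoopConfiguration.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")     // demo-only mock KMS
sc.hadoopConfiguration.set("parquet.encryption.key.list",
  "keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==")
// which columns are encrypted with which master key, plus the footer key
sc.hadoopConfiguration.set("parquet.encryption.column.keys", "keyA:ssn,salary")
sc.hadoopConfiguration.set("parquet.encryption.footer.key", "keyB")

df.write.parquet("/tmp/table.parquet.encrypted")              // written encrypted
spark.read.parquet("/tmp/table.parquet.encrypted").show()     // decrypted transparently on read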
17. Real World
• Master keys are kept in KMS
• Develop a client for your KMS server
• Implement the KmsClient interface

public interface KmsClient {
  // encrypt e.g. data key with master key (envelope encryption)
  String wrapKey(byte[] keyBytes, String masterKeyIdentifier);
  // decrypt key
  byte[] unwrapKey(String wrappedKey, String masterKeyIdentifier);
}
18. Example: Hashicorp Vault Client
parquet-mr-1.12.0
• Search for VaultClient in github.com/apache/parquet-mr
• Set up encryption:

sc.hadoopConfiguration.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.samples.VaultClient")
sc.hadoopConfiguration.set("parquet.encryption.key.access.token", <vault token>)
sc.hadoopConfiguration.set("parquet.encryption.kms.instance.url", <vault server url>)
20. Advanced Key Management Features
Minimization of KMS calls
• “double envelope encryption”
  • activated by default (can be disabled)
• single KMS call in process lifetime, or one call in X minutes, configurable
  • per master key
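These knobs map to parquet-mr key-tools properties; a small sketch, assuming the property names from parquet-mr 1.12 (values shown are the defaults):

// double envelope encryption is on by default; set to "false" to disable
sc.hadoopConfiguration.set("parquet.encryption.double.wrapping", "true")
// how long wrapped keys are cached per master key before calling the KMS again (seconds)
sc.hadoopConfiguration.set("parquet.encryption.cache.lifetime.seconds", "600")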
21. Advanced Key Management Features
Key Rotation
• Refresh master keys (periodically or on demand)
• Enable key rotation when writing data
• Rotate master keys in the key management system
• Re-wrap data keys in Parquet files

sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally", "false")
import org.apache.parquet.crypto.keytools.KeyToolkit
KeyToolkit.rotateMasterKeys("/path/to/table.parquet.encrypted", sc.hadoopConfiguration)
23. Parquet encryption with raw Java
Write data

EncryptionPropertiesFactory cryptoFactory =
    EncryptionPropertiesFactory.loadFactory(hadoopConfiguration);
FileEncryptionProperties fileEncryptionProperties =
    cryptoFactory.getFileEncryptionProperties(hadoopConfiguration, /path/to/folder/file, null);
ParquetWriter writer = ParquetWriter.builder(/path/to/folder/file)
    .withConf(hadoopConfiguration)
    …
    .withEncryption(fileEncryptionProperties)
    .build();
// write as usual
24. Parquet encryption with raw Java
Read data
Similar, with:
• DecryptionPropertiesFactory
• ParquetReader.builder.withDecryption
• No need to pass footer and column key properties
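A rough read-side sketch in Scala against the same Java API (readSupport and the path are placeholders, and builder details are elided as in the write example):

import org.apache.hadoop.fs.Path
import org.apache.parquet.crypto.DecryptionPropertiesFactory

val cryptoFactory = DecryptionPropertiesFactory.loadFactory(hadoopConfiguration)
val fileDecryptionProperties =
  cryptoFactory.getFileDecryptionProperties(hadoopConfiguration, new Path("/path/to/folder/file"))
val reader = ParquetReader.builder(readSupport, new Path("/path/to/folder/file"))  // readSupport: your ReadSupport impl
  .withConf(hadoopConfiguration)
  .withDecryption(fileDecryptionProperties)
  .build()
// read as usual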
25. Performance effect of Parquet Encryption
AES ciphers implemented in CPU hardware (AES-NI)
• Gigabyte(s) per second in each thread
• Order(s) of magnitude faster than the “software stack” (App/Framework/Parquet/compression)
• C++: OpenSSL EVP library
Java AES-NI
• AES-NI support in HotSpot since Java 9
• Java 11.0.4 – enhanced AES GCM decryption
• Thanks, Java community!
Bottom line: encryption won’t be your bottleneck
• app workload, data I/O, encoding, compression
26. Community Roadmap
Apache Spark: SPARK-33966: “Two-tier encryption key management”
Apache Parquet MR: new features for parquet-mr-1.13+, such as uniform encryption, CLI for encrypted data, local wrapping with key rotation, etc.
Apache Iceberg, Presto, Hudi: integration with Parquet encryption
Apache Arrow: ARROW-9947: “Python API for Parquet encryption”
27. Data Analytics
iCloud CloudKit Analytics
• Zeppelin and Jupyter on Spark
• Spark batch workflows
  • Weekly reports
  • Ad-hoc analytics
• Cohorts of iCloud users
  • iCloud-wide sample of all users (0.1%)
  • Semantic and geographic cohorts
  • Ad-hoc
• Weekly snapshot of metadata DBs (no user data)
• iCloud server-side activity (uploads, etc.) data streams
  • Anonymized and stripped of private data
• 100s of structured data types organized into external Hive tables
28. iCloud CloudKit Analytics Use Cases
iCloud Storage
• Intelligent tiered storage optimizations use a combination of snapshot and streaming data
• Storage capacity forecasting
• Delete/compaction-eligible data volume, lag
Service utilization and spike analysis
Seed builds monitoring and qualification
Data integrity verification
Quick ad-hoc analytics (minutes in CloudKit Analytics vs hours in Splunk)
29. Encryption Requirements
Master key rotation
Enforce the limit of 2^32 encryption operations with the same key
• each encryption operation = 2^35 bytes (2^31 AES blocks)
Scalable to petabytes of data
Reduce impact on performance of ingestion and analytics workflows
30. PME in CloudKit Analytics
Ingestion Pipelines Modification Steps
• Update Parquet dependency to a PME-compatible version
• Set Hadoop config properties:

parquetConf.set(EncryptionPropertiesFactory.CRYPTO_FACTORY_CLASS_PROPERTY_NAME,
    AppleCryptoFactory.class.getCanonicalName());
// KMS client class
parquetConf.set(KeyToolkit.KMS_CLIENT_CLASS_PROPERTY_NAME,
    CustomerKmsBridge.class.getCanonicalName());
// with this property turned on, we do not need to specify the individual key ids per column
parquetConf.setBoolean(AppleCryptoFactory.UNIFORM_ENCRYPTION_PROPERTY_NAME, true);
// key id for the parquet.encryption.footer.key property
parquetConf.set(PropertiesDrivenCryptoFactory.FOOTER_KEY_PROPERTY_NAME, /*Key Name from Config*/);
// store key material externally (separate files)
parquetConf.setBoolean(KeyToolkit.KEY_MATERIAL_INTERNAL_PROPERTY_NAME, false);
31. PME in CloudKit Analytics
Spark Read Configuration
• Update Spark configuration:

…
properties:
  …
  spark.hadoop.parquet.crypto.factory.class: org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
  # KMS Client class
  spark.hadoop.parquet.encryption.kms.client.class: com.apple.parquet.crypto.keytools.CustomerKmsBridge
32. Write Performance and Storage Space Impact
• All columns encrypted!
• No impact on ingestion time and resource utilization
• Minimal storage penalty
  • measurable only for datasets with small parquet files
  • key material files: a few KB each

$ hadoop fs -ls hdfs://.../bucket=0/
10100101-ff5b0f56-4779-4aea-8765-2d406bcd70a3.parquet
...
_KEY_MATERIAL_FOR_10100102-33ef104e-3ab6-49ee-9a16-b150f7da24ab.parquet.json

[Chart: ingestion time w/o encryption vs. w/ encryption]
33. No Significant Impact on Read Performance
Running a join with aggregation on 2 large tables. All columns encrypted!
• 23.4 sec without encryption
• 25.1 sec with encryption