Apple logo is a trademark of Apple Inc.
Gidon Gershinsky, Tim Perelmutov | Data + AI Summit
Data Security at Scale through Spark and Parquet Encryption
THIS IS NOT A CONTRIBUTION
Presenters
Gidon Gershinsky


• Designs and builds data security solutions at Apple


• Leading role in Apache Parquet community work on data encryption

Tim Perelmutov


• Data ingestion and analytics for iCloud
Agenda
Parquet Encryption: Goals and Features


Status in Apache projects


API and “Hello World” samples


Community Roadmap


Demo

Learnings: using Parquet Encryption at Scale
Apache Parquet
• Popular columnar storage format


• Encoding, compression


• Advanced data filtering


• columnar projection: skip columns

• predicate pushdown: skip files, row groups, or data pages

• Performance benefits of Parquet filtering


- less data to fetch from storage: I/O, time


- less data to process: CPU, latency

• How to protect sensitive Parquet data?
(diagram: columnar statistics, read only the data you need; from the Strata 2017 Parquet Arrow Roadmap)
Parquet Modular Encryption: Goals
• data privacy/confidentiality
- hiding sensitive information

• data integrity
- tamper-proofing sensitive information
Protect sensitive data-at-rest
Parquet Modular Encryption: Goals
• Full Parquet capabilities (columnar projection, predicate pushdown, etc.) with encrypted data

• Big Data challenge: Integrity protection


• signing full files would break Parquet filtering, and slow analytic workloads down by order(s) of magnitude
Read only the data you need: preserve performance of analytic engines
(2017 Parquet Arrow Roadmap)
Define open standard for safe storage of analytic data
• works the same in any storage


• cloud or private, file systems, object stores, archives


• untrusted storage!


• with any KMS (key management service)

• key-based access in any storage: private - cloud - archive

• enable per-column encryption keys
Parquet Modular Encryption: Goals
Big Data Challenges
Safe migration from one storage to another


• no need to import / decrypt / export / encrypt


• simply move the files
Sharing data subset / table column(s)


• no need to extract / encrypt a copy for each user


• simply provide column key access to eligible users
Parquet Modular Encryption: Goals
Data Privacy / Confidentiality
Full encryption mode
• all modules are hidden

Plaintext footer mode
• footer is exposed for legacy readers
• sensitive metadata is hidden

Separate keys for sensitive columns
• column access control

“Client-side” encryption
• storage backend / admin never see data or keys
Data Integrity
File contents not tampered with
File not replaced with a wrong file

PME signs data and metadata modules
• with module ID and file ID

AES GCM: “authenticated encryption”
Framework for other encryption algorithms
customers-may-2021.part0.parquet customers-jan-2020.part0.parquet
Envelope Encryption
• Parquet file modules are encrypted with “Data Encryption Keys” (DEKs)
• DEKs are encrypted with “Master Encryption Keys” (MEKs)
• the result is called “key material” and is stored either in Parquet file footers, or in separate files in the same folder
• MEKs are stored and managed in a “Key Management Service” (KMS)
• access control verification
• Advanced mode in Parquet: Double Envelope Encryption
• DEKs are encrypted with “Key Encryption Keys” (KEKs)
• KEKs are encrypted with MEKs
• single KMS call in process lifetime, or one call in X minutes, configurable
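The three-level key hierarchy above can be sketched with plain JDK crypto. This is a minimal illustration of the wrapping scheme, not parquet-mr code; the key sizes, variable names, and AES-GCM wrapping details are assumptions for the sketch:

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class EnvelopeSketch {
    static final SecureRandom RNG = new SecureRandom();

    // AES-GCM wrap: encrypt one key with another (12-byte IV prepended to ciphertext)
    static byte[] wrap(byte[] key, byte[] wrappingKey) throws Exception {
        byte[] iv = new byte[12];
        RNG.nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(wrappingKey, "AES"),
               new GCMParameterSpec(128, iv));
        byte[] ct = c.doFinal(key);
        byte[] out = new byte[iv.length + ct.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ct, 0, out, iv.length, ct.length);
        return out;
    }

    static byte[] unwrap(byte[] wrapped, byte[] wrappingKey) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(wrappingKey, "AES"),
               new GCMParameterSpec(128, wrapped, 0, 12));
        return c.doFinal(wrapped, 12, wrapped.length - 12);
    }

    public static void main(String[] args) throws Exception {
        byte[] mek = new byte[16], kek = new byte[16], dek = new byte[16];
        RNG.nextBytes(mek); RNG.nextBytes(kek); RNG.nextBytes(dek);

        // double envelope: DEK wrapped by KEK locally; KEK wrapped by MEK
        byte[] wrappedDek = wrap(dek, kek);
        byte[] wrappedKek = wrap(kek, mek); // only this step needs the KMS-held MEK

        // reader side: one KMS unwrap of the KEK, then unwrap DEKs locally
        byte[] kek2 = unwrap(wrappedKek, mek);
        byte[] dek2 = unwrap(wrappedDek, kek2);
        System.out.println(java.util.Arrays.equals(dek, dek2));
    }
}
```

This is why double envelope encryption needs only one KMS round-trip per master key: every per-module DEK is recoverable locally once the KEK is unwrapped.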
Thank you to all contributors!
Current Status

Format
• PME specification approved and released in 2019 (v2.7)

Parquet MR
• Java implementation, released in 2021 (v1.12.0)
• C++ implementation, merged in 2021
• Python interface under construction

Spark
• Parquet updated to 1.12.0 - enables basic encryption out-of-box
• Planned for Spark 3.2.0 release

Other analytic frameworks
• ongoing work on integrating Parquet encryption
Spark with Parquet Encryption
Invoke encryption via Hadoop parameters
• pass list of columns to encrypt
• specify IDs of master keys for these columns
• specify ID of master key for Parquet footers
• pass class name for client of your KMS
• activate encryption
• instructions at PARQUET-1854

• try today!
• clone Spark repo and build a runnable distribution
(diagram: Spark app authenticating to KMS)
HelloWorld: Writing Encrypted Files
• Run spark-shell
• “Arm” encryption
• Pass master encryption keys (demo only!)

sc.hadoopConfiguration.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")

sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")

sc.hadoopConfiguration.set("parquet.encryption.key.list",
  "k1:AAECAwQFBgcICQoLDA0ODw==, k2:AAECAAECAAECAAECAAECAA==")

HelloWorld: Writing Encrypted Files
• Write dataframe: “columnA” will be encrypted
• Column key format: masterKeyID:colName,colName;masterKeyID:colName, ..

sampleDF.write.
  option("parquet.encryption.footer.key", "k1").
  option("parquet.encryption.column.keys", "k2:columnA").
  parquet("/path/to/table.parquet.encrypted")
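To illustrate how that option string groups columns under master keys, here is a toy parser. This is not parquet-mr code; the key IDs and column names are made up for the example:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnKeysFormat {
    // parse "masterKeyID:col,col;masterKeyID:col" into masterKeyID -> columns
    static Map<String, List<String>> parse(String spec) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String group : spec.split(";")) {
            String[] parts = group.trim().split(":");
            result.put(parts[0], Arrays.asList(parts[1].split(",")));
        }
        return result;
    }

    public static void main(String[] args) {
        // two columns under master key k2, one under k3
        String spec = "k2:columnA,columnB;k3:columnC";
        System.out.println(parse(spec));
    }
}
```

Every column under the same master key ID gets its own data key, but all of those data keys are wrapped by that one master key.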
HelloWorld: Reading Encrypted Files
• Run spark-shell
• “Arm” decryption
• Pass master encryption keys (demo only!)
• Read dataframe with encrypted columns

sc.hadoopConfiguration.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")

sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")

sc.hadoopConfiguration.set("parquet.encryption.key.list",
  "k1:AAECAwQFBgcICQoLDA0ODw==, k2:AAECAAECAAECAAECAAECAA==")

val df = spark.read.parquet("/path/to/table.parquet.encrypted")
Real World
• Master keys are kept in KMS
• Develop client for your KMS server
• Implement KMS client interface

public interface KmsClient {

  // encrypt e.g. data key with master key (envelope encryption)
  String wrapKey(byte[] keyBytes, String masterKeyIdentifier);

  // decrypt key
  byte[] unwrapKey(String wrappedKey, String masterKeyIdentifier);
}
parquet-mr-1.12.0
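A self-contained mock with the same wrap/unwrap shape, in the spirit of the InMemoryKMS used in the demos. This sketch does not implement the actual parquet-mr interface (which also has initialization hooks); a production client would call out to a KMS server instead of holding master keys in memory:

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Mock KMS: wrapKey/unwrapKey mirror the KmsClient method shapes above.
public class MockKms {
    private final Map<String, byte[]> masterKeys = new HashMap<>();
    private final SecureRandom rng = new SecureRandom();

    public void addMasterKey(String id, byte[] key) { masterKeys.put(id, key); }

    // AES-GCM encrypt the data key with the named master key; base64 result
    public String wrapKey(byte[] keyBytes, String masterKeyIdentifier) throws Exception {
        byte[] iv = new byte[12];
        rng.nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE,
               new SecretKeySpec(masterKeys.get(masterKeyIdentifier), "AES"),
               new GCMParameterSpec(128, iv));
        byte[] ct = c.doFinal(keyBytes);
        byte[] out = new byte[iv.length + ct.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ct, 0, out, iv.length, ct.length);
        return Base64.getEncoder().encodeToString(out);
    }

    public byte[] unwrapKey(String wrappedKey, String masterKeyIdentifier) throws Exception {
        byte[] in = Base64.getDecoder().decode(wrappedKey);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE,
               new SecretKeySpec(masterKeys.get(masterKeyIdentifier), "AES"),
               new GCMParameterSpec(128, in, 0, 12));
        return c.doFinal(in, 12, in.length - 12);
    }

    public static void main(String[] args) throws Exception {
        MockKms kms = new MockKms();
        kms.addMasterKey("k1", "0123456789abcdef".getBytes(StandardCharsets.UTF_8));
        byte[] dek = "16-byte-data-key".getBytes(StandardCharsets.UTF_8);
        String wrapped = kms.wrapKey(dek, "k1");
        System.out.println(new String(kms.unwrapKey(wrapped, "k1"), StandardCharsets.UTF_8));
    }
}
```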
Example: Hashicorp Vault Client
•Search for VaultClient in github.com/apache/parquet-mr


•Set up encryption
sc.hadoopConfiguration.set(parquet.crypto.factory.class , 

“org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
)

sc.hadoopConfiguration.set(parquet.encryption.kms.client.class , 

“org.apache.parquet.crypto.keytools.samples.VaultClient
)

sc.hadoopConfiguration.set(parquet.encryption.key.access.token , vault token
)

sc.hadoopConfiguration.set(parquet.encryption.kms.instance.url , vault server url)
parquet-mr-1.12.0
Example: Hashicorp Vault Client (continued)
• Write dataframe with encrypted columns
• Read dataframe with encrypted columns

sampleDF.write.
  option("parquet.encryption.footer.key", "k1").
  option("parquet.encryption.column.keys", "k2:columnA").
  parquet("/path/to/table.parquet.encrypted")

val df = spark.read.parquet("/path/to/table.parquet.encrypted")
Advanced Key Management Features
Minimization of KMS calls
• “double envelope encryption”
• activated by default (can be disabled)
• single KMS call in process lifetime, or one call in X minutes, configurable
• per master key
Advanced Key Management Features
Key Rotation
• Refresh master keys (periodically or on demand)
• Enable key rotation when writing data
• Rotate master keys in the key management system
• Re-wrap data keys in Parquet files

sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally", "false")

import org.apache.parquet.crypto.keytools.KeyToolkit
KeyToolkit.rotateMasterKeys("/path/to/table.parquet.encrypted", sc.hadoopConfiguration)

Parquet encryption with raw Java
Create encryption properties

Configuration hadoopConfiguration = new Configuration();

hadoopConfiguration.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");

hadoopConfiguration.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");

hadoopConfiguration.set("parquet.encryption.key.list",
  "k1:AAECAwQFBgcICQoLDA0ODw==, k2:AAECAAECAAECAAECAAECAA==");

hadoopConfiguration.set("parquet.encryption.footer.key", "k1");

hadoopConfiguration.set("parquet.encryption.column.keys", "k2:columnA");
Parquet encryption with raw Java
Write data

EncryptionPropertiesFactory cryptoFactory =
  EncryptionPropertiesFactory.loadFactory(hadoopConfiguration);

FileEncryptionProperties fileEncryptionProperties =
  cryptoFactory.getFileEncryptionProperties(hadoopConfiguration, /path/to/folder/file, null);

ParquetWriter writer = ParquetWriter.builder(/path/to/folder/file)
  .withConf(hadoopConfiguration)
  …
  .withEncryption(fileEncryptionProperties)
  .build();

// write as usual
Parquet encryption with raw Java
Read data
Similar, with:
• DecryptionPropertiesFactory
• ParquetReader.builder.withDecryption
• no need to pass footer and column key properties
Performance effect of Parquet Encryption
AES ciphers implemented in CPU hardware (AES-NI)
• Gigabyte(s) per second in each thread
• Order(s) of magnitude faster than the “software stack” (App/Framework/Parquet/compression)
• C++: OpenSSL EVP library

Java AES-NI
• AES-NI support in HotSpot since Java 9
• Java 11.0.4: enhanced AES GCM decryption
• Thanks, Java community!

Bottom line: encryption won’t be your bottleneck
• app workload, data I/O, encoding, compression dominate
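The hardware-acceleration claim is easy to check with a small JDK-only measurement. The buffer size and round count below are arbitrary choices, and the printed throughput will vary with CPU and JVM version:

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class AesGcmThroughput {
    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16];
        new SecureRandom().nextBytes(key);
        byte[] iv = new byte[12];
        byte[] buf = new byte[1 << 20];   // 1 MiB of plaintext per call

        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        int rounds = 256;                 // 256 MiB encrypted in total
        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            // GCM forbids key+IV reuse, so vary the IV each round
            iv[0] = (byte) i; iv[1] = (byte) (i >> 8);
            c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                   new GCMParameterSpec(128, iv));
            c.doFinal(buf);
        }
        double seconds = (System.nanoTime() - t0) / 1e9;
        System.out.printf("%.2f MiB/s%n", rounds / seconds);
    }
}
```

On a recent AES-NI-capable CPU with Java 11+ this typically reports throughput in the GiB/s range per thread, consistent with the slide's claim.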
Community Roadmap
Apache Spark: SPARK-33966, “Two-tier encryption key management”

Apache Parquet MR: new features for parquet-mr-1.13+, such as uniform encryption, CLI for encrypted data, local wrapping with key rotation, etc.

Apache Iceberg, Presto, Hudi: integration with Parquet encryption

Apache Arrow: ARROW-9947, “Python API for Parquet encryption”
iCloud CloudKit Analytics
Data Analytics
• Zeppelin and Jupyter on Spark
• Spark Batch Workflows
• Weekly Reports
• Ad-Hoc analytics

Cohorts of iCloud Users
• iCloud-wide sample of all users (~0.1%)
• Semantic and geographic cohorts
• Ad-hoc

• Weekly snapshot of metadata DBs (no user data)
• iCloud server-side activity (uploads, etc.) data streams
• Anonymized and stripped of private data
• 100s of structured data types organized into external Hive tables
iCloud CloudKit Analytics Use Cases
iCloud Storage
• Intelligent tiered storage optimization uses a combination of snapshot and streaming data
• Storage capacity forecasting
• Delete/compaction-eligible data volume, lag

Service utilization and spike analysis

Seed builds monitoring and qualification

Data integrity verification

Quick ad-hoc analytics (minutes in CloudKit Analytics vs. hours in Splunk)
Encryption Requirements
Master key rotation

Enforce ≤ 2^32 encryption operations with the same key
• each encryption ≤ 2^35 bytes (2^31 AES blocks × 16 bytes/block)

Scalable to petabytes of data

Reduce impact on performance of ingestion and analytics workflows
PME in CloudKit Analytics
Ingestion Pipelines Modification Steps
• Update Parquet dependency to PME-compatible version
• Set Hadoop Config Properties

parquetConf.set(EncryptionPropertiesFactory.CRYPTO_FACTORY_CLASS_PROPERTY_NAME,
    AppleCryptoFactory.class.getCanonicalName());

// KMS client class
parquetConf.set(KeyToolkit.KMS_CLIENT_CLASS_PROPERTY_NAME,
    CustomerKmsBridge.class.getCanonicalName());

// with this property turned on, we do not need to specify the individual key ids per column
parquetConf.setBoolean(AppleCryptoFactory.UNIFORM_ENCRYPTION_PROPERTY_NAME, true);

// key id for parquet.encryption.footer.key property value
parquetConf.set(PropertiesDrivenCryptoFactory.FOOTER_KEY_PROPERTY_NAME, /*Key Name from Config*/);

// store key material externally (separate files)
parquetConf.setBoolean(KeyToolkit.KEY_MATERIAL_INTERNAL_PROPERTY_NAME, false);
• Update Spark configuration

…
properties:
  …
  spark.hadoop.parquet.crypto.factory.class: org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
  # KMS Client class
  spark.hadoop.parquet.encryption.kms.client.class: com.apple.parquet.crypto.keytools.CustomerKmsBridge

PME in CloudKit Analytics
Spark Read Configuration
Write Performance and Storage Space Impact
• All columns encrypted!
• No impact on ingestion time and resource utilization
• Minimal storage penalty
• measurable only for datasets with many small Parquet files
• Key Material Files: a few KB each

$ hadoop fs -ls hdfs://.../bucket=0/
10100101-ff5b0f56-4779-4aea-8765-2d406bcd70a3.parquet
...
_KEY_MATERIAL_FOR_10100102-33ef104e-3ab6-49ee-9a16-b150f7da24ab.parquet.json

(chart: ingestion resource utilization with vs. without encryption)
No Significant Impact on Read Performance
Running a join with aggregation on 2 large tables. All columns encrypted!
• 25.1 sec with encryption
• 23.4 sec without encryption
TM and © 2021 Apple Inc. All rights reserved.

Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 

Recently uploaded (20)

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 

Data Security at Scale through Spark and Parquet Encryption

  • 1. Apple logo is a trademark of Apple Inc. Gidon Gershinsky, Tim Perelmutov | Data + AI Summit Data Security at Scale through Spark and Parquet Encryption THIS IS NOT A CONTRIBUTION
  • 2. Presenters Gidon Gershinsky • Designs and builds data security solutions at Apple • Leading role in Apache Parquet community work on data encryption Tim Perelmutov • Data ingestion and analytics for iCloud
  • 3. Agenda • Parquet Encryption: Goals and Features • Status in Apache projects • API and “Hello World” samples • Community Roadmap • Demo • Learnings: using Parquet Encryption at Scale
  • 4. Apache Parquet • Popular columnar storage format • Encoding, compression • Advanced data filtering • columnar projection: skip columns • predicate pushdown: skip files, row groups, or data pages • Performance benefits of Parquet filtering: less data to fetch from storage (I/O, time), less data to process (CPU, latency) • How to protect sensitive Parquet data? [figure: Parquet = Columnar Statistics + “read only the data you need”, from “Strata 2017 Parquet Arrow Roadmap”]
  • 5. Parquet Modular Encryption: Goals • data privacy/confidentiality: hiding sensitive information • data integrity: tamper-proofing sensitive information • Protect sensitive data-at-rest (photo by Manuel Geissinger from Pexels)
  • 6. Parquet Modular Encryption: Goals • Full Parquet capabilities (columnar projection, predicate pushdown, etc) with encrypted data • Big Data challenge: integrity protection • signing full files would break Parquet filtering and slow analytic workloads down by order(s) of magnitude • Preserve performance of analytic engines [figure: “read only the data you need”, from “2017 Parquet Arrow Roadmap”]
  • 7. Parquet Modular Encryption: Goals • Define an open standard for safe storage of analytic data • works the same in any storage • cloud or private, file systems, object stores, archives • untrusted storage! • with any KMS (key management service) • key-based access in any storage: private, cloud, archive • enable per-column encryption keys
  • 8. Big Data Challenges Safe migration from one storage to another • no need to import / decrypt / export / encrypt • simply move the files Sharing data subset / table column(s) • no need to extract / encrypt a copy for each user • simply provide column key access to eligible users Parquet Modular Encryption: Goals
  • 9. Data Privacy / Confidentiality Full encryption mode •all modules are hidden Plaintext footer mode •footer is exposed for legacy readers •sensitive metadata is hidden Separate keys for sensitive columns •column access control “Client-side” encryption •storage backend / admin never see data or keys
  • 10. Data Integrity • File contents not tampered with • File not replaced with a wrong file (e.g. customers-may-2021.part0.parquet swapped for customers-jan-2020.part0.parquet) • PME signs data and metadata modules • with module ID and file ID • AES GCM: “authenticated encryption” • Framework for other encryption algorithms
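The file-swap protection above can be illustrated with plain JDK crypto: AES-GCM authenticates not only the ciphertext but also "additional authenticated data" (AAD), so binding a module to its file name and module ID makes a moved module fail decryption. This is a toy sketch of the concept only, assuming an invented `fileAndModuleId` string; it is not PME's actual AAD layout or file format.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class GcmIntegrityDemo {
    private static final byte[] KEY = new byte[16];
    // Fixed IV is acceptable only in this throwaway demo (fresh random key per run);
    // real systems must never reuse an IV under the same key.
    private static final byte[] IV = new byte[12];
    static { new SecureRandom().nextBytes(KEY); }

    private static byte[] crypt(int mode, byte[] input, String fileAndModuleId) {
        try {
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(mode, new SecretKeySpec(KEY, "AES"), new GCMParameterSpec(128, IV));
            // Bind the ciphertext to its file/module identity via AAD
            c.updateAAD(fileAndModuleId.getBytes(StandardCharsets.UTF_8));
            return c.doFinal(input);
        } catch (AEADBadTagException e) {
            return null; // authentication failed: data tampered with, or module moved to another file
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Decryption succeeds when the module is presented under its original file ID.
    public static boolean decryptsWithCorrectId() {
        byte[] ct = crypt(Cipher.ENCRYPT_MODE, "column data".getBytes(StandardCharsets.UTF_8),
                "customers-may-2021.part0.parquet:module-0");
        byte[] pt = crypt(Cipher.DECRYPT_MODE, ct, "customers-may-2021.part0.parquet:module-0");
        return pt != null && "column data".equals(new String(pt, StandardCharsets.UTF_8));
    }

    // Presenting the same bytes as part of a different file fails the GCM tag check.
    public static boolean detectsFileSwap() {
        byte[] ct = crypt(Cipher.ENCRYPT_MODE, "column data".getBytes(StandardCharsets.UTF_8),
                "customers-may-2021.part0.parquet:module-0");
        return crypt(Cipher.DECRYPT_MODE, ct, "customers-jan-2020.part0.parquet:module-0") == null;
    }
}
```

Because the check happens per module rather than per file, readers keep full columnar projection and predicate pushdown while still detecting tampering.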
  • 11. Envelope Encryption • Parquet file modules are encrypted with “Data Encryption Keys” (DEKs) • DEKs are encrypted with “Master Encryption Keys” (MEKs) • result is called “key material” and stored either in Parquet file footers, or in separate files in same folder • MEKs are stored and managed in “Key Management Service” (KMS) • access control verification • Advanced mode in Parquet: Double Envelope Encryption • DEKs are encrypted with “Key Encryption Keys” (KEKs) • KEKs are encrypted with MEKs • single KMS call in process lifetime / or one call in X minutes, configurable
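The envelope scheme above can be sketched in a few lines of JDK crypto: a random DEK encrypts the data, the MEK (held by a KMS) wraps the DEK, and the wrapped DEK is the "key material". This is a conceptual sketch under invented names (`wrapDek`, `unwrapDek`); the wrapping format is illustrative, not PME's on-disk layout.

```java
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class EnvelopeDemo {
    // Wrap a DEK with the MEK; the base64 result is the "key material"
    // that would be stored in the footer or in a separate key-material file.
    public static String wrapDek(byte[] mek, byte[] dek) {
        try {
            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(mek, "AES"), new GCMParameterSpec(128, iv));
            byte[] wrapped = c.doFinal(dek);
            byte[] material = new byte[iv.length + wrapped.length];
            System.arraycopy(iv, 0, material, 0, iv.length);
            System.arraycopy(wrapped, 0, material, iv.length, wrapped.length);
            return Base64.getEncoder().encodeToString(material);
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Recover the DEK from the key material; in the real flow this step
    // happens in (or via) the KMS, which enforces access control on the MEK.
    public static byte[] unwrapDek(byte[] mek, String keyMaterial) {
        try {
            byte[] material = Base64.getDecoder().decode(keyMaterial);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(mek, "AES"),
                   new GCMParameterSpec(128, Arrays.copyOfRange(material, 0, 12)));
            return c.doFinal(material, 12, material.length - 12);
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public static boolean roundTrips() {
        byte[] mek = new byte[16];
        byte[] dek = new byte[16];
        new SecureRandom().nextBytes(mek);
        new SecureRandom().nextBytes(dek);
        return Arrays.equals(dek, unwrapDek(mek, wrapDek(mek, dek)));
    }
}
```

The double-envelope mode adds one more layer (KEKs between DEKs and MEKs) so that most wrap/unwrap operations happen locally, reducing KMS round-trips.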
  • 12. Current Status (thank you to all contributors!) • Format: PME specification approved and released in 2019 (v2.7) • Parquet MR: Java implementation released in 2021 (v1.12.0); C++ implementation merged in 2021; Python interface under construction • Spark: Parquet updated to 1.12.0, enables basic encryption out-of-box; planned for Spark 3.2.0 release • Other analytic frameworks: ongoing work on integrating Parquet encryption
  • 13. Spark with Parquet Encryption: invoke encryption via Hadoop parameters • pass list of columns to encrypt • specify IDs of master keys for these columns • specify ID of master key for Parquet footers • pass class name for client of your KMS • activate encryption • instructions at PARQUET-1854 • try today! • clone Spark repo and build a runnable distribution (diagram: Spark App authenticates to KMS)
  • 14. HelloWorld: Writing Encrypted Files • Run spark-shell • “Arm” encryption • Pass master encryption keys (demo only!)

    sc.hadoopConfiguration.set("parquet.crypto.factory.class",
      "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
    sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
      "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
    sc.hadoopConfiguration.set("parquet.encryption.key.list",
      "k1:AAECAwQFBgcICQoLDA0ODw==, k2:AAECAAECAAECAAECAAECAA==")
  • 15. HelloWorld: Writing Encrypted Files • Write dataframe: “columnA” will be encrypted • Column key format: masterKeyID:colName,colName;masterKeyID:colName,...

    sampleDF.write.
      option("parquet.encryption.footer.key", "k1").
      option("parquet.encryption.column.keys", "k2:columnA").
      parquet("/path/to/table.parquet.encrypted")
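To make the column-key format concrete, here is a tiny parser for the `masterKeyID:colName,colName;masterKeyID:colName` string, returning a column-to-key-ID map. This is an illustrative sketch (class and method names are invented); the real parsing lives inside parquet-mr's PropertiesDrivenCryptoFactory.

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnKeysSpec {
    // Parse "k1:colA,colB;k2:colC" into {colA=k1, colB=k1, colC=k2}
    public static Map<String, String> parse(String spec) {
        Map<String, String> columnToKey = new HashMap<>();
        for (String group : spec.trim().split(";")) {
            String[] parts = group.split(":", 2); // "masterKeyID" : "col1,col2"
            String masterKeyId = parts[0].trim();
            for (String column : parts[1].split(",")) {
                columnToKey.put(column.trim(), masterKeyId);
            }
        }
        return columnToKey;
    }
}
```

For example, `parse("k2:columnA")` maps `columnA` to master key `k1`'s sibling `k2`, matching the write snippet above; columns not listed stay unencrypted (or fall under the footer key, depending on configuration).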
  • 16. HelloWorld: Reading Encrypted Files • Run spark-shell • “Arm” decryption • Pass master encryption keys (demo only!) • Read dataframe with encrypted columns

    sc.hadoopConfiguration.set("parquet.crypto.factory.class",
      "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
    sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
      "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
    sc.hadoopConfiguration.set("parquet.encryption.key.list",
      "k1:AAECAwQFBgcICQoLDA0ODw==, k2:AAECAAECAAECAAECAAECAA==")
    val df = spark.read.parquet("/path/to/table.parquet.encrypted")
  • 17. Real World • Master keys are kept in KMS • Develop client for your KMS server • Implement the KmsClient interface:

    public interface KmsClient {
      // encrypt e.g. data key with master key (envelope encryption)
      String wrapKey(byte[] keyBytes, String masterKeyIdentifier);

      // decrypt key
      byte[] unwrapKey(String wrappedKey, String masterKeyIdentifier);
    }
  • 18. Example: Hashicorp Vault Client (parquet-mr-1.12.0) • Search for VaultClient in github.com/apache/parquet-mr • Set up encryption:

    sc.hadoopConfiguration.set("parquet.crypto.factory.class",
      "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
    sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
      "org.apache.parquet.crypto.keytools.samples.VaultClient")
    sc.hadoopConfiguration.set("parquet.encryption.key.access.token", <vault token>)
    sc.hadoopConfiguration.set("parquet.encryption.kms.instance.url", <vault server url>)
  • 19. Example: Hashicorp Vault Client, continued (parquet-mr-1.12.0) • Write dataframe with encrypted columns • Read dataframe with encrypted columns

    sampleDF.write.
      option("parquet.encryption.footer.key", "k1").
      option("parquet.encryption.column.keys", "k2:columnA").
      parquet("/path/to/table.parquet.encrypted")

    val df = spark.read.parquet("/path/to/table.parquet.encrypted")
  • 20. Advanced Key Management Features: Minimization of KMS calls • “double envelope encryption” • activated by default (can be disabled) • single KMS call in process lifetime, or one call in X minutes, configurable • per master key
  • 21. Advanced Key Management Features: Key Rotation • Refresh master keys (periodically or on demand) • Enable key rotation when writing data • Rotate master keys in key management system • re-wrap data keys in Parquet files

    sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally", "false")

    import org.apache.parquet.crypto.keytools.KeyToolkit
    KeyToolkit.rotateMasterKeys("/path/to/table.parquet.encrypted", sc.hadoopConfiguration)
  • 22. Parquet encryption with raw Java: create encryption properties

    Configuration hadoopConfiguration = new Configuration();
    hadoopConfiguration.set("parquet.crypto.factory.class",
      "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");
    hadoopConfiguration.set("parquet.encryption.kms.client.class",
      "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");
    hadoopConfiguration.set("parquet.encryption.key.list",
      "k1:AAECAwQFBgcICQoLDA0ODw==, k2:AAECAAECAAECAAECAAECAA==");
    hadoopConfiguration.set("parquet.encryption.footer.key", "k1");
    hadoopConfiguration.set("parquet.encryption.column.keys", "k2:columnA");
  • 23. Parquet encryption with raw Java: write data

    EncryptionPropertiesFactory cryptoFactory =
      EncryptionPropertiesFactory.loadFactory(hadoopConfiguration);
    FileEncryptionProperties fileEncryptionProperties =
      cryptoFactory.getFileEncryptionProperties(hadoopConfiguration,
        /path/to/folder/file, null);
    ParquetWriter writer = ParquetWriter.builder(path/to/folder/file)
      .withConf(hadoopConfiguration)
      …
      .withEncryption(fileEncryptionProperties)
      .build();
    // write as usual
  • 24. Parquet encryption with raw Java: read data • Similar, with DecryptionPropertiesFactory and ParquetReader.builder.withDecryption • No need to pass footer and column key properties
  • 25. Performance effect of Parquet Encryption • AES ciphers implemented in CPU hardware (AES-NI) • Gigabyte(s) per second in each thread • Order(s) of magnitude faster than “software stack” (App/Framework/Parquet/compression) • C++: OpenSSL EVP library • Java AES-NI • AES-NI support in HotSpot since Java 9 • Java 11.0.4: enhanced AES GCM decryption • Thanks Java community! • Bottom line: encryption won’t be your bottleneck • app workload, data I/O, encoding, compression
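The "gigabytes per second" claim is easy to sanity-check on your own JVM and CPU with a rough single-thread measurement of AES-GCM encryption. This is a toy micro-benchmark (class and method names are invented, warmup is crude, absolute numbers will vary); use JMH for serious measurements.

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class GcmThroughput {
    // Encrypt a buffer of the given size several times and report MB/s,
    // skipping the first rounds so the JIT can compile the AES-NI intrinsic path.
    public static double encryptMbPerSec(int megabytes) {
        try {
            byte[] key = new byte[16];
            new SecureRandom().nextBytes(key);
            byte[] data = new byte[megabytes * 1024 * 1024];
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            long start = 0;
            for (int round = 0; round < 5; round++) {
                if (round == 2) start = System.nanoTime(); // rounds 0-1 are warmup
                byte[] iv = new byte[12];
                new SecureRandom().nextBytes(iv); // never reuse an IV with the same key
                c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                       new GCMParameterSpec(128, iv));
                c.doFinal(data);
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            return 3 * megabytes / seconds; // 3 timed rounds
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        System.out.printf("AES-GCM encrypt: %.0f MB/s%n", encryptMbPerSec(16));
    }
}
```

On a modern CPU with a recent JDK this typically lands in the hundreds of MB/s to GB/s range per thread, well above what compression and I/O sustain in the same pipeline.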
  • 26. Community Roadmap • Apache Spark: SPARK-33966, “Two-tier encryption key management” • Apache Parquet MR: new features for parquet-mr-1.13+, such as uniform encryption, CLI for encrypted data, local wrapping with key rotation, etc. • Apache Iceberg, Presto, Hudi: integration with Parquet encryption • Apache Arrow: ARROW-9947, “Python API for Parquet encryption”
  • 27. Data Analytics: iCloud CloudKit Analytics • Zeppelin and Jupyter on Spark • Spark Batch Workflows • Weekly Reports • Ad-Hoc analytics • Cohorts of iCloud Users • 0.1% iCloud-wide sample of all users • Semantic and geographic cohorts • Ad-hoc • Weekly snapshot of metadata DBs (no user data) • iCloud server-side activity data streams (uploads, etc.) • Anonymized and stripped of private data • 100s of structured data types organized into external Hive tables
  • 28. iCloud CloudKit Analytics Use Cases • iCloud Storage • Intelligent tiered storage optimization uses a combination of snapshot and streaming data • Storage capacity forecasting • Delete/compaction-eligible data volume, lag • Service utilization and spike analysis • Seed builds monitoring and qualification • Data integrity verification • Quick ad-hoc analytics (minutes in CloudKit Analytics vs hours in Splunk)
  • 29. Encryption Requirements • Master key rotation • Enforce limit of 2^32 encryption operations with the same key • Each encryption up to 2^35 bytes (2^31 AES blocks) • Scalable to petabytes of data • Reduce impact on performance of ingestion and analytics workflows
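A quick back-of-the-envelope check shows why these limits still scale to petabytes: 2^32 operations of up to 2^35 bytes each bound the data per key at 2^67 bytes. The class and method names below are invented for illustration; the 2^32 and 2^35 figures come from the requirements above.

```java
public class KeyLimits {
    // Per the requirements: at most 2^32 encryption operations per key,
    // each covering up to 2^35 bytes (2^31 16-byte AES blocks).
    public static long bytesPerOperation() { return 1L << 35; }

    public static long blocksPerOperation() { return bytesPerOperation() / 16; } // = 2^31

    // 2^32 operations x 2^35 bytes = 2^67 bytes, expressed in exbibytes (2^60 bytes).
    public static double maxExbibytesPerKey() {
        return Math.pow(2, 32) * bytesPerOperation() / Math.pow(2, 60);
    }
}
```

The bound works out to 128 EiB per key, so the petabyte-scale requirement is comfortably within a single key rotation period.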
  • 30. PME in CloudKit Analytics: Ingestion Pipelines Modification Steps • Update Parquet dependency to PME-compatible version • Set Hadoop config properties:

    parquetConf.set(EncryptionPropertiesFactory.CRYPTO_FACTORY_CLASS_PROPERTY_NAME,
      AppleCryptoFactory.class.getCanonicalName());
    // KMS client class
    parquetConf.set(KeyToolkit.KMS_CLIENT_CLASS_PROPERTY_NAME,
      CustomerKmsBridge.class.getCanonicalName());
    // with this property turned on, we do not need to specify the individual key ids per column
    parquetConf.setBoolean(AppleCryptoFactory.UNIFORM_ENCRYPTION_PROPERTY_NAME, true);
    // key id for parquet.encryption.footer.key property value
    parquetConf.set(PropertiesDrivenCryptoFactory.FOOTER_KEY_PROPERTY_NAME, /*Key Name from Config*/);
    // store key material externally (separate files)
    parquetConf.setBoolean(KeyToolkit.KEY_MATERIAL_INTERNAL_PROPERTY_NAME, false);
  • 31. PME in CloudKit Analytics: Spark Read Configuration • Update Spark configuration:

    properties:
      spark.hadoop.parquet.crypto.factory.class: org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
      # KMS Client class
      spark.hadoop.parquet.encryption.kms.client.class: com.apple.parquet.crypto.keytools.CustomerKmsBridge
  • 32. Write Performance and Storage Space Impact • All columns encrypted! • No impact on ingestion time and resource utilization • Minimal storage penalty • Measurable only for datasets with small Parquet files • Key material files: few KB each, e.g.

    $ hadoop fs -ls hdfs://.../bucket=0/
    10100101-ff5b0f56-4779-4aea-8765-2d406bcd70a3.parquet
    . . .
    _KEY_MATERIAL_FOR_10100102-33ef104e-3ab6-49ee-9a16-b150f7da24ab.parquet.json

  [chart: ingestion time with vs without encryption]
  • 33. No Significant Impact on Read Performance • Running join with aggregation on 2 large tables, all columns encrypted • 25.1 sec with encryption vs 23.4 sec without encryption
  • 34. TM and © 2021 Apple Inc. All rights reserved.