Big Data Security
in Apache Projects
Gidon Gershinsky
ApacheCon North America 2022
THIS IS NOT A CONTRIBUTION
Presenter
Gidon Gershinsky
•Committer and PMC member in Apache Parquet
•Works on big data security in Apache Parquet, Spark, Iceberg and other
projects
•Designs and builds data security solutions at Apple
Contributors
Apache community!
• Data security work: 2018 - present
• Projects: Parquet, Spark, Arrow, Iceberg, Flink, Trino
Antoine Pitrou, DB Tsai, Eliot Salant, Jack Ye, Jian Tang, Julien LeDem,
Gabor Szadovszky, Gyula Fora, Huaxin Gao, Itamar Turner-Trauring, Maya
Anderson, Muhammad Islam, Revital Eres, Roee Shlomo, Russell Spitzer,
Ryan Blue, Steven Wu, Tim Perelmutov, Tham Ha, Tomer Solomon, Vinitha
Gankidi, Xinli Shang, Yufei Gu, Zoltan Ivan
Thank you!!
Agenda
Big Data Security
• Our focus and goals
Foundation: Apache Parquet
• Security features; APIs and “HelloWorld” samples
• Performance effect of encryption
Usage in Apache Spark, Iceberg, Arrow, Flink, and Trino
Integration options in other Apache Big Data projects
[Architecture diagram: a Big Data lake over an untrusted storage backend. Ingest: 1. files (CSV, Avro, Parquet, ORC, JSON), 2. streams (Kafka messages, etc), 3. custom data input. SQL or ML consumers: notebook users; Spark, Iceberg, Flink, Trino, Java applications; custom apps. A key manager and access control sit alongside the storage.]
Goal: Protect Sensitive Data-at-Rest
Keep the data Confidential
- hiding sensitive information in storage
- via encryption
Keep the data Tamper-Proof
- protecting integrity of sensitive information in storage
- via crypto-signatures and module IDs
Apache Parquet
•Popular columnar storage format
•Encoding, compression
•Advanced data filtering
 •columnar projection: skip columns
 •predicate push down: skip files, row groups, or data pages
•Built-in encryption since 2021
[Diagram: Parquet = columnar statistics + reading only the data you need; source: Strata 2017 “Parquet Arrow Roadmap”]
Parquet Modular Encryption
Goals
• Encrypt and sign all modules in Parquet files (data and metadata modules)
• Preserve full Parquet capabilities (columnar projection, predicate pushdown, compression, etc) in encrypted files
• Preserve performance of analytic engines with encrypted files
[Diagram: read only the data you need; source: 2017 “Parquet Arrow Roadmap”]
Parquet Modular Encryption
Open standard for safe storage of analytic data
• Parquet Format spec for encryption approved in 2019
• Works the same in any storage
 • cloud or private; file systems, object stores, archives
• Integrated in a number of Apache Big Data frameworks
• write (encrypt) with one framework, read (decrypt) with another
• Supports any KMS (key management service)
• Per-file and per-column encryption keys
Parquet Modular Encryption
Use-cases
• Encryption and tamper-proofing
Keep Data Confidential
• Full encryption mode
• all modules are hidden
• Plaintext footer mode
• footer is exposed for legacy readers
• sensitive metadata is hidden
• Separate keys for sensitive columns
• column access control
• New: uniform encryption key (all
fi
le columns)
• “Client-side” encryption
• storage backend / admin never see data or keys
Keep Data Tamper-proof
• File contents not tampered with
• File not replaced with a wrong file
• Signs data and metadata modules
 • with unique module IDs and file IDs
• AES GCM: “authenticated encryption”
[Diagram: file replacement example, customers-oct-2022.part0.parquet swapped with customers-jan-2020.part0.parquet]
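To make the mechanism concrete, here is a minimal sketch (plain javax.crypto, not the Parquet implementation) of how AES GCM authenticated encryption can bind a module's identity into the ciphertext; the file and module IDs below are hypothetical:

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class GcmAadSketch {
  public static void main(String[] args) throws Exception {
    SecretKey key = KeyGenerator.getInstance("AES").generateKey();
    byte[] iv = new byte[12];
    new SecureRandom().nextBytes(iv);

    // Hypothetical IDs; Parquet derives its own module and file identifiers
    byte[] aad = "fileID=abc123,module=columnA.page0".getBytes(StandardCharsets.UTF_8);

    Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
    enc.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
    enc.updateAAD(aad); // binds the module identity into the authentication tag
    byte[] ciphertext = enc.doFinal("sensitive page data".getBytes(StandardCharsets.UTF_8));

    Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
    dec.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
    dec.updateAAD(aad); // a wrong module/file ID here makes doFinal throw AEADBadTagException
    byte[] plaintext = dec.doFinal(ciphertext);
  }
}

Because the IDs go into the GCM tag rather than the ciphertext, a reader detects both tampered bytes and a ciphertext moved to a different module or file.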
Performance of Encryption
AES ciphers implemented in CPU hardware (AES-NI)
• Gigabyte(s) per second in each thread
• Order(s) of magnitude faster than the “software stack” (App/Framework/Parquet/compression)
• C++: OpenSSL EVP library
Bottom line: encryption won’t be your bottleneck; the app workload, data I/O, compression, etc will dominate
Java AES-NI
• AES-NI support in HotSpot since Java 9
• Java 11.0.4 – enhanced AES GCM decryption
• Thanks Java community!
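A crude way to sanity-check AES-NI throughput on your own JVM; this is a rough sketch, not a proper benchmark (no warmup, single thread), and numbers vary by hardware and Java version:

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class AesGcmThroughput {
  public static void main(String[] args) throws Exception {
    SecretKey key = KeyGenerator.getInstance("AES").generateKey();
    SecureRandom random = new SecureRandom();
    byte[] buffer = new byte[1 << 20]; // encrypt 1 MiB per call
    random.nextBytes(buffer);
    byte[] iv = new byte[12];

    long encryptedBytes = 0;
    long start = System.nanoTime();
    for (int i = 0; i < 1024; i++) { // ~1 GiB in total
      random.nextBytes(iv); // never reuse a GCM IV under the same key
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
      encryptedBytes += cipher.doFinal(buffer).length;
    }
    double seconds = (System.nanoTime() - start) / 1e9;
    System.out.printf("~%.2f GB/s in one thread%n", encryptedBytes / seconds / 1e9);
  }
}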
Envelope Encryption
•Parquet file modules are encrypted with “Data Encryption Keys” (DEKs)
 • “Big Data” means lots of DEKs: one per file, or even per file/column (per NIST requirements for AES GCM)
•DEKs are encrypted with “Master Encryption Keys” (MEKs)
 • the result is called “key material”, and is stored either in Parquet file footers or in separate files “close to data”
•MEKs are stored and managed in a “Key Management Service” (KMS)
 • access control verification
 • number of MEKs << number of DEKs (1 MEK can encrypt ~2 billion DEKs)
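A minimal sketch of the envelope pattern itself, using plain javax.crypto rather than the Parquet key tools; key sizes and the IV layout are illustrative only:

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class EnvelopeSketch {
  public static void main(String[] args) throws Exception {
    SecureRandom random = new SecureRandom();

    // One fresh DEK per file (or per file/column); it encrypts the actual Parquet modules
    byte[] dek = new byte[16];
    random.nextBytes(dek);

    // Generated here for the demo; a real MEK lives in the KMS and never leaves it
    SecretKey mek = KeyGenerator.getInstance("AES").generateKey();

    // Wrap: encrypt the DEK with the MEK; IV + ciphertext form the "key material"
    byte[] iv = new byte[12];
    random.nextBytes(iv);
    Cipher wrap = Cipher.getInstance("AES/GCM/NoPadding");
    wrap.init(Cipher.ENCRYPT_MODE, mek, new GCMParameterSpec(128, iv));
    byte[] wrappedDek = wrap.doFinal(dek);

    // Unwrap: a reader authorized to use the MEK recovers the DEK and decrypts the file
    Cipher unwrap = Cipher.getInstance("AES/GCM/NoPadding");
    unwrap.init(Cipher.DECRYPT_MODE, mek, new GCMParameterSpec(128, iv));
    byte[] recoveredDek = unwrap.doFinal(wrappedDek);

    // Only the short wrapped DEK ever goes to storage alongside the data
    SecretKey dataKey = new SecretKeySpec(recoveredDek, "AES");
  }
}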
Apache Parquet-MR Encryption API
• High-level API
• Encryption parameters passed via Hadoop configuration
• Envelope encryption and DEK storage handled by Parquet
• Out-of-box integration (just update parquet version to 1.12+, and plug your KMS)
• Low-level API
• Direct calls on low-level Parquet API (before encryption)
• Direct DEK handling
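For the low-level route, a sketch of building encryption properties with explicit keys, assuming the parquet-mr 1.12+ crypto classes; the key bytes are demo placeholders, and the application manages its own DEKs:

import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;
import org.apache.parquet.crypto.ColumnEncryptionProperties;
import org.apache.parquet.crypto.FileEncryptionProperties;
import org.apache.parquet.hadoop.metadata.ColumnPath;

// Demo DEKs, generated directly by the application (no KMS / envelope layer here)
byte[] footerKey = new byte[16];
byte[] columnKey = new byte[16];
SecureRandom random = new SecureRandom();
random.nextBytes(footerKey);
random.nextBytes(columnKey);

// Encrypt "columnA" with its own key; other columns stay in plaintext
ColumnEncryptionProperties columnProps =
    ColumnEncryptionProperties.builder("columnA")
        .withKey(columnKey)
        .build();

Map<ColumnPath, ColumnEncryptionProperties> encryptedColumns = new HashMap<>();
encryptedColumns.put(columnProps.getPath(), columnProps);

FileEncryptionProperties encryptionProperties =
    FileEncryptionProperties.builder(footerKey)
        .withEncryptedColumns(encryptedColumns)
        .build();
// then: pass encryptionProperties to ParquetWriter via .withEncryption(...)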
Spark with Parquet Encryption
• Configure encryption via Hadoop parameters
• Pass list of columns to encrypt
• Specify IDs of master keys for these columns
• Specify ID of master key for Parquet footers
• Pass class name for client of your KMS
• Activate encryption
• Since Spark 3.2.0
• doc: spark.apache.org/docs/latest/sql-data-sources-parquet.html#columnar-encryption
• Coming up: uniform file encryption
• Released in Parquet, and merged in Spark master branch
[Diagram: Spark app authenticating to the KMS]
HelloWorld: Writing Encrypted Files
• Run spark-shell
• “Activate” encryption
• Pass master encryption keys (demo only!)
sc.hadoopConfiguration.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
// demo only: master keys passed in the Hadoop configuration, via a mock in-memory KMS
sc.hadoopConfiguration.set("parquet.encryption.key.list",
  "k1:AAECAwQFBgcICQoLDA0ODw== , k2:AAECAAECAAECAAECAAECAA==")
HelloWorld: Writing Encrypted Files
• Write dataframe: “columnA” will be encrypted
• Column key format
sampleDF.write.
option("parquet.encryption.footer.key" , "k1").
option("parquet.encryption.column.keys" , "k2:columnA").
parquet("/path/to/table.parquet.encrypted")
"<masterKeyID>:<colName>,<colName>;<masterKeyID>:<colName>, ..
HelloWorld: Reading Encrypted Files
• Run spark-shell
• “Activate” decryption
• Pass master encryption keys (demo only!)
• Read encrypted dataframe

sc.hadoopConfiguration.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
sc.hadoopConfiguration.set("parquet.encryption.key.list",
  "k1:AAECAwQFBgcICQoLDA0ODw== , k2:AAECAAECAAECAAECAAECAA==")

val df = spark.read.parquet("/path/to/table.parquet.encrypted")
Coming up: Uniform Encryption
• Set the same Hadoop configuration as before
• Write dataframe with uniform encryption
• Read encrypted dataframe

sampleDF.write.
  option("parquet.encryption.uniform.key", "k1").
  parquet("/path/to/table.parquet.encrypted")

val df = spark.read.parquet("/path/to/table.parquet.encrypted")
HelloWorld: Parquet API
Create encryption properties

Configuration hadoopConfiguration = new Configuration();
hadoopConfiguration.set("parquet.crypto.factory.class",
    "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");
hadoopConfiguration.set("parquet.encryption.kms.client.class",
    "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");
hadoopConfiguration.set("parquet.encryption.key.list",
    "k1:AAECAwQFBgcICQoLDA0ODw==, k2:AAECAAECAAECAAECAAECAA==");
hadoopConfiguration.set("parquet.encryption.footer.key", "k1");
hadoopConfiguration.set("parquet.encryption.column.keys", "k2:columnA");
HelloWorld: Parquet API
Write data
EncryptionPropertiesFactory cryptoFactory =
EncryptionPropertiesFactory.loadFactory(hadoopConfiguration);
FileEncryptionProperties fileEncryptionProperties =
cryptoFactory.getFileEncryptionProperties(hadoopConfiguration,
</path/to/file>, null);
ParquetWriter writer = ParquetWriter.builder(<path/to/file>)
.withConf(hadoopConfiguration)
…
.withEncryption(fileEncryptionProperties)
.build();
// write as usual
HelloWorld: Parquet API
Read data
DecryptionPropertiesFactory cryptoFactory =
DecryptionPropertiesFactory.loadFactory(hadoopConfiguration);
FileDecryptionProperties fileDecryptionProperties =
cryptoFactory.getFileDecryptionProperties(hadoopConfiguration,
</path/to/file>);
ParquetReader reader = ParquetReader.read(InputFile)
…
.withDecryption(fileDecryptionProperties)
.build();
// read as usual
* No need to pass footer and column key properties
Real World
• Master keys are kept in your KMS (Key Management Service), on your platform
 • controls key access for users in your organization
 • if your org doesn’t have a KMS: use a cloud KMS, or install an open-source KMS
• Develop a client for your KMS server
 • implement the KMS client interface
public interface KmsClient {
  // encrypt e.g. a data key with a master key (envelope encryption)
  String wrapKey(byte[] keyBytes, String masterKeyIdentifier);
  // decrypt the wrapped key
  byte[] unwrapKey(String wrappedKey, String masterKeyIdentifier);
}
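A minimal sketch of a client against the simplified interface shown above; it holds MEKs in memory and wraps DEKs locally with AES GCM. A production client would instead send wrap/unwrap requests to the KMS server (and the full parquet-mr interface also has an initialization method):

import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class LocalWrapKmsClient implements KmsClient {
  private static final SecureRandom RANDOM = new SecureRandom();
  private final Map<String, SecretKey> masterKeys; // masterKeyID -> MEK (demo only)

  public LocalWrapKmsClient(Map<String, SecretKey> masterKeys) {
    this.masterKeys = masterKeys;
  }

  @Override
  public String wrapKey(byte[] keyBytes, String masterKeyIdentifier) {
    try {
      byte[] iv = new byte[12];
      RANDOM.nextBytes(iv);
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(Cipher.ENCRYPT_MODE, masterKeys.get(masterKeyIdentifier),
          new GCMParameterSpec(128, iv));
      byte[] wrapped = cipher.doFinal(keyBytes);
      byte[] out = new byte[iv.length + wrapped.length];
      System.arraycopy(iv, 0, out, 0, iv.length);
      System.arraycopy(wrapped, 0, out, iv.length, wrapped.length);
      return Base64.getEncoder().encodeToString(out); // IV + wrapped DEK
    } catch (GeneralSecurityException e) {
      throw new RuntimeException("key wrapping failed", e);
    }
  }

  @Override
  public byte[] unwrapKey(String wrappedKey, String masterKeyIdentifier) {
    try {
      byte[] in = Base64.getDecoder().decode(wrappedKey);
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(Cipher.DECRYPT_MODE, masterKeys.get(masterKeyIdentifier),
          new GCMParameterSpec(128, in, 0, 12)); // first 12 bytes are the IV
      return cipher.doFinal(in, 12, in.length - 12);
    } catch (GeneralSecurityException e) {
      throw new RuntimeException("key unwrapping failed", e);
    }
  }
}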
Example: Hashicorp Vault
• Open source KMS
• Search for VaultClient in github.com/apache/parquet-mr
• Set up encryption
sc.hadoopConfiguration.set("parquet.crypto.factory.class" ,
“org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory”)
sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" ,
“org.apache.parquet.crypto.keytools.samples.VaultClient”)
sc.hadoopConfiguration.set("parquet.encryption.key.access.token" , "<vault token>")
sc.hadoopConfiguration.set("parquet.encryption.kms.instance.url" , "<vault server url>")
Example: Hashicorp Vault
• Write dataframe with encrypted columns
• Read dataframe with encrypted columns
sampleDF.write.
option("parquet.encryption.footer.key" , "k1").
option("parquet.encryption.column.keys" , "k2:columnA").
parquet("/path/to/table.parquet.encrypted")
val df = spark.read.parquet("/path/to/table.parquet.encrypted")
Where to Store “Key Material”
•“Key material”: a wrapped DEK plus the MEK ID. No secret info!
•Storage options
1: Inside the Parquet file (footer)
 • straightforward
 • default mode
 • shortcomings: can’t re-wrap DEKs (change the MEK) or change the KMS
Where to Store “Key Material”
•Storage options (continued)
2: Small separate “key_material” files (one per Parquet file, in the same folder)
 - allows re-wrapping / re-encrypting DEKs
  - with a new MEK or MEK version: after a key compromise, or a change in user access rights
  - allows migration to a new KMS
 - shortcoming: the additional cost of reading a small key_material file for each Parquet file

sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally",
  "false")
Where to Store “Key Material”
•Storage options (continued)
3: Apache Iceberg: native metadata files!
 Wrapped DEKs stored e.g. in manifest files
 • no shortcomings! (re-wrapping is possible, no extra files)
Apache Spark
Standalone Spark
- uses the high-level Parquet-MR API
- Parquet encryption works out of the box
- parquet-mr updated to 1.12 in Spark version 3.2.0
- encryption activation and configuration via Hadoop parameters
- DEK storage in either Parquet file footers or external key material mode

Spark with Iceberg
- WIP on table encryption in the Apache Iceberg community
- DEK storage in Iceberg metadata files
Apache Flink
Standalone Flink
- uses the high-level Parquet-MR API
- Parquet encryption works out of the box, like in Apache Spark
- parquet-mr updated to 1.12.2 in Flink version 1.15.0
- encryption activation and configuration via Hadoop parameters
- DEK storage only in Parquet file footers (no external key material mode)
 - if you need external key material, use Spark to collect/compact the Flink output
 - or update the Apache Flink code to enable the external mode

Flink with Iceberg
- WIP on table encryption in the Apache Iceberg community
- DEK storage in Iceberg metadata files
Apache Trino
Standalone Trino
- uses its own Parquet implementation (with parts of parquet-MR)
- Xinli Shang: PR for the read part: https://github.com/trinodb/trino/pull/9871
 - Apache Trino community: please review and merge!
- decryption works via Hadoop parameters
- DEKs stored either in Parquet file footers or in external key material files
- interoperable with Spark and Flink

Trino with Iceberg
- WIP on table decryption in the Apache Trino and Iceberg communities
- DEK storage in Iceberg metadata files
Apache Arrow
• C++ Parquet implementation (including the low-level encryption API)
• Python implementation of the high-level Parquet encryption API
 - encryption activation and configuration via PyArrow parameters (similar to the Hadoop examples)
 - supports envelope encryption
 - currently, DEK storage only in Parquet file footers
  - external key material mode is WIP (PR)
 - interoperable with Spark and Flink
Apache Iceberg
• Has metadata files!
 - allows for optimal storage and management of encryption keys (DEKs)
• Based on parquet-MR, with some low-level calls and changes
• Encryption project (github.com/apache/iceberg/projects/5)
 - use the standard Parquet encryption format
 - build key management: based on envelope encryption, with DEK storage in metadata files
• MVP version
 - uniform data encryption: a table key
  - all table columns encrypted with the same MEK
Apache Iceberg
• Data security roadmap
 - MVP version (some PRs merged, some WIP)
  - uniform data encryption: table key
 - Manifest encryption
 - Column encryption
 - Manifest list encryption
 - Beyond Parquet: ORC and Avro data files
 - KMS optimization
  - double-wrapping (no KMS calls per file / column)
  - two-tier key management (only the driver node talks to the KMS, not the worker nodes)
 - Table integrity / tamper-proofing
  - encrypting and signing each data and metadata file
  - end-to-end protection of big data tables! (confidential AND tamper-proof)
Other Apache Big Data projects
• Adding encryption can range from zero effort (just turn it on) to a sizable effort
• Encryption checklist for each project:
 - supports the Parquet format?
 - which implementation? (its own? parquet-mr? pyarrow? ...)
 - uses the high-level or low-level Parquet API?
 - has its own metadata files?
• Feel free to contact me! gershinsky@apache.org
Questions!
gershinsky@apache.org