SlideShare a Scribd company logo
Gidon Gershinsky, IBM Research - Haifa
gidon@il.ibm.com
Efficient Spark Analytics
on Encrypted Data
#SAISDev14
Overview
• What problem we are solving?
• Parquet modular encryption
• Spark encryption with Parquet
– connected car scenario: demo screenshots
– performance implications of encryption
– getting from prototype to real thing
2#SAISDev14
What Problem Are We Solving?
3
• Protect sensitive data-at-rest
– data confidentiality: encryption
– data integrity
– in any storage - untrusted, cloud or private, file system, object
store, archives
• Preserve performance of Spark analytics
– advanced data filtering (projection, predicate) with encrypted data
• Leverage encryption for fine-grained access control
– per-column encryption keys
– key-based access in any storage: private -> cloud -> archive
#SAISDev14
Background
• EU Horizon2020 research project
• EU partners: Adaptant, IT Innovation, OCC, Thales, UDE
– usage-based car insurance, social services
• Collaboration with UC Berkeley RISELab
– Opaque technology
• Secure cloud analytics (Spark and H/W Enclaves)
– how to protect “data-in-use”?
– how to protect “data-at-rest” – without losing analytics efficiency?
4#SAISDev14
Apache Parquet
• Integral part of Apache Spark
• Encoding, compression
• Advanced data filtering
– columnar projection: skip columns
– predicate push down: skip files, or row groups,
or data pages
• Performance benefits
– less data to fetch from storage: I/O, latency
– less data to process: CPU, latency
• How to protect sensitive Parquet data?
– in any storage - keeping performance, supporting column access control, data tamper-proofing etc.
5#SAISDev14
Parquet Modular Encryption
• Apache Parquet community work
• Full encryption: all data and metadata
modules
– metadata: indexes (min/max), schema, size and
number of secret values, encryption key metadata,
etc ..
– encrypted and tamper-proof
• Enables columnar projection and predicate
pushdown
• Storage server / admin never sees
encryption keys or cleartext data
– “client-side” encryption (e.g., Spark-side)
6#SAISDev14
Parquet Modular Encryption
• Works in any storage
– supporting range-read
• Multiple encryption algorithms
– different security and performance requirements
• Data integrity verification
– AES GCM: “authenticated encryption”
– data not tampered with
– file not replaced with wrong version
• Column access control
– encryption with column-specific keys
7#SAISDev14
AES Modes: GCM and CTR
• AES: symmetric encryption standard supported in
– Java and other languages
– CPUs (x86 and others): hardware acceleration of AES
• GCM (Galois Counter Mode)
– encryption
– integrity verification
• basic mode: make sure data not modified
• AAD mode: “additional authentication data”, e.g. file name with a version/timestamp
– prevent replacing with wrong (old) version
• CTR (Counter Mode)
– encryption only, no integrity verification
– useful when AES hardware acceleration is not available (e.g. Java 8)
8#SAISDev14
Status & Roadmap
• PARQUET-1178
– full encryption functionality layer, with API access
– format definition PR merged in Apache Parquet feature branch
– Java and C++ implementation
• one PR merged, others being reviewed
• Next: tools on top
➢ PARQUET-1373: key management tools
▪ number of different mechanisms
▪ options for storing key material
➢ PARQUET-1376: data obfuscation layer
▪ per-cell masking and aggregated anonymization
▪ reader alerts and choice of obfuscated data
9
➢ PARQUET-1396: schema service interface
▪ “no-API” interface to encryption
▪ all parameters fetched via loaded class
➢ Demo feature: Property interface
▪ “no-API” interface to encryption
▪ most parameters passed via (Hadoop) config properties
▪ key server access via loaded class
#SAISDev14
Spark Integration Prototype
• Standard Spark 2.3.0 is built with Parquet 1.8.2
• Encryption prototype: standard Spark 2.3.0 with Parquet 1.8.2-E
– Spark-side encryption and decryption
• storage (admin) never sees data key or plain data
– Column access control (key-per-column)
– Cryptographic verification of data integrity
• replaced with wrong version, tampered with
– Filtering of encrypted data
• column projection, predicate push-down
10#SAISDev14
Connected Car Usecase
11
Car Monitoring Events
- car ID (plate #)
- driver ID (license #)
- speed
- location
Personal Data of Drivers
- driver ID (license #)
- name
- home address
- email address
- credit card
Gateway
Usage-based
Insurance
Service
Spark
Parquet
who is speeding at night?
(20% and more
above limit)
Raise premium,
Notify by email
#SAISDev14
Car Event Schema
12
Secret data, to
be encrypted
Unencrypted
columns
#SAISDev14
Driver Table Schema
13
Secret data, to
be encrypted
#SAISDev14
Masked data
Writing Encrypted Parquet Files
14
Data Keys
(base64)
#SAISDev14
Writing Encrypted Parquet Files
15#SAISDev14
Reading Encrypted Parquet Files
16
location key: not eligible
credit card key: not eligible
#SAISDev14
table schema and other metadata are protected
Hidden Columns
17#SAISDev14
Query Execution
18
who is
speeding
at night?
#SAISDev14
Tampering with Data
19
missed Jack
Wilson, 80%
above speed
limit !!
#SAISDev14
Parquet Data Integrity Protection
20#SAISDev14
Key Management
21#SAISDev14
Spark
Parquet
Client
AUTH
Service
Token
KMS
• PARQUET-1373
– randomly generated Data keys
– wrapped in KMS with Master keys
– storage options for wrapped keys
• in file itself: easier to find
• in separate location: easier to re-key
– compromised Master key
– client: specify Master key ID per column
IBM Spark-based Services
• IBM Analytics Engine
– on-demand Spark (and Hadoop) clusters in IBM cloud
• Watson Studio Spark Environments
– cloud tools for data scientists and application developers
– dedicated Spark cluster per Notebook
• DB2 Event Store
– rapidly ingest, store and analyze streamed data in event-driven
applications
– analyze with a combination of Spark and C++ engines,
store with C++ Parquet
• SQL Query Service
– serverless scale-out SQL execution on TB of data
– Spark SQL engine, running in IBM Cloud
22#SAISDev14
Column Access Control with
Encryption @Uber
• Access control to sensitive columns
only
• PARQUET-1178 provides
fundamentals to encrypt columns
• PARQUET-1396 provides
schema and key retrieval interface
• Currently under development - stay
tuned!
23#SAISDev14
Encryption Performance
• Ubuntu 16.04 box
– 8-core Intel Core i7 6820HQ CPU / 2.70GHz
– 32GB memory
• Raw Java AES-GCM speed
– no Spark or Parquet, just encrypting buffers in loop
• 8 threads, 1MB buffers
– Java 8: ~ 250 MB/sec
– Java 9 and later: ~ 2-5 GB/sec
• AES-NI in Hotspot (hardware acceleration)
• Slow warmup in decryption: Bug report JDK-8201633
– can be bypassed with a workaround
– if you know someone who knows someone in Java Hotspot team…
24
Not a Benchmark
#SAISDev14
Spark Encryption Performance
• Local mode, 1 executor: 8 cores, 32GB memory
• Storage: Local RamDisk
• Write / encryption test
– load carEvents.csv files (10GB)
– cache in Spark memory!
– write as parquet{.encrypted}: measure time (of second write)
• Query / decryption test
– ‘read’ carEvents.parquet{.encrypted} files (2.7GB)
– run “speeding at night” query: measure time
25#SAISDev14
Spark Encryption Performance
• Typical usecases - only 5-10% of columns are sensitive (encrypted)
• Java 8… If needed, use GCM_CTR instead of GCM
• Next Java version in Spark will support AES hardware acceleration
• Encryption is not a bottleneck
– app workload, data I/O, encoding, compression
26
Test Write (sec) Query (sec) Notes
no encryption 26 2 query on 4 columns: input ~12% of data
encryption (GCM) 28.5 2.5 GCM on data and metadata
encryption (GCM_CTR) 26.8 2.2 CTR on data, GCM on metadata
To Be Benchmarked!
#SAISDev14
From Prototype to Real Thing
• Spark encryption prototype
– standard Spark code, built with modified Parquet
– encryption is triggered via Hadoop config
• transparent to Spark
• That’s it? Just upgrade to new Parquet version? Not really ..
– partitions!
• exception on encrypted columns? obfuscate or encrypt sensitive partition values?
– Spark support for other things above file format
• TBD!
27#SAISDev14
Parquet encryption: Community work. On track. Feedback welcome!
Spark encryption: Prototype, with gaps. Open challenge.
Backup
28#SAISDev14
Existing Solutions
• Bottom line: Need data protection mechanism built in the file format
29
Flat file
encryption
Storage server
encryption with
range read
Storage client
encryption with
range read
Preserves analytics performance X V V
Works in any storage system V X X
Hides data and keys from storage system V X V
Supports data integrity verification V V X
Enables column access control (column keys) X X X
#SAISDev14
Obfuscated Data
30
no keys for
personal data
where the worst
drivers live?
anti-speeding ad campaign:
budget per State?
Transportation
Safety
Agency
Kills!
Speeding
#SAISDev14

More Related Content

What's hot

Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Bo Yang
 
An Introduction to Druid
An Introduction to DruidAn Introduction to Druid
An Introduction to Druid
DataWorks Summit
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
Julien Le Dem
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Julien Le Dem
 
Vectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Vectorized UDF: Scalable Analysis with Python and PySpark with Li JinVectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Vectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Databricks
 
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
Amazon Web Services
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Databricks
 
An Overview of Ambari
An Overview of AmbariAn Overview of Ambari
An Overview of Ambari
Chicago Hadoop Users Group
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Flink Forward
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions
Yugabyte
 
Apache Spark チュートリアル
Apache Spark チュートリアルApache Spark チュートリアル
Apache Spark チュートリアル
K Yamaguchi
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
Amazon Web Services
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
DataWorks Summit
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
Databricks
 
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Brian Brazil
 
REMnux Tutorial-3: Investigation of Malicious PDF & Doc documents
REMnux Tutorial-3: Investigation of Malicious PDF & Doc documentsREMnux Tutorial-3: Investigation of Malicious PDF & Doc documents
REMnux Tutorial-3: Investigation of Malicious PDF & Doc documents
Rhydham Joshi
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
Ilias Okacha
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
DataWorks Summit/Hadoop Summit
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
Databricks
 

What's hot (20)

Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
An Introduction to Druid
An Introduction to DruidAn Introduction to Druid
An Introduction to Druid
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Vectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Vectorized UDF: Scalable Analysis with Python and PySpark with Li JinVectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
Vectorized UDF: Scalable Analysis with Python and PySpark with Li Jin
 
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
 
An Overview of Ambari
An Overview of AmbariAn Overview of Ambari
An Overview of Ambari
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions
 
Apache Spark チュートリアル
Apache Spark チュートリアルApache Spark チュートリアル
Apache Spark チュートリアル
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
 
REMnux Tutorial-3: Investigation of Malicious PDF & Doc documents
REMnux Tutorial-3: Investigation of Malicious PDF & Doc documentsREMnux Tutorial-3: Investigation of Malicious PDF & Doc documents
REMnux Tutorial-3: Investigation of Malicious PDF & Doc documents
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
 

Similar to Efficient Spark Analytics on Encrypted Data with Gidon Gershinsky

Big Data Security in Apache Projects by Gidon Gershinsky
Big Data Security in Apache Projects by Gidon GershinskyBig Data Security in Apache Projects by Gidon Gershinsky
Big Data Security in Apache Projects by Gidon Gershinsky
GidonGershinsky
 
ApacheCon: Apache Commons Crypto 2017
ApacheCon: Apache Commons Crypto 2017ApacheCon: Apache Commons Crypto 2017
ApacheCon: Apache Commons Crypto 2017
Dapeng Sun
 
In-Memory Key Value Store (KVS) in FPGA for Ultra Low Latency and High Throug...
In-Memory Key Value Store (KVS) in FPGA for Ultra Low Latency and High Throug...In-Memory Key Value Store (KVS) in FPGA for Ultra Low Latency and High Throug...
In-Memory Key Value Store (KVS) in FPGA for Ultra Low Latency and High Throug...
Tom Diederich
 
Analysis of Security and Compliance using Oracle SPARC T-Series Servers: Emph...
Analysis of Security and Compliance using Oracle SPARC T-Series Servers: Emph...Analysis of Security and Compliance using Oracle SPARC T-Series Servers: Emph...
Analysis of Security and Compliance using Oracle SPARC T-Series Servers: Emph...
Ramesh Nagappan
 
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-PremiseTackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Databricks
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
DataStax Academy
 
Data Pipelines and Telephony Fraud Detection Using Machine Learning
Data Pipelines and Telephony Fraud Detection Using Machine Learning Data Pipelines and Telephony Fraud Detection Using Machine Learning
Data Pipelines and Telephony Fraud Detection Using Machine Learning
Eugene
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Alluxio, Inc.
 
Oracle SPARC T7 a M7 servery
Oracle SPARC T7 a M7 serveryOracle SPARC T7 a M7 servery
Oracle SPARC T7 a M7 servery
MarketingArrowECS_CZ
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
Iulia Emanuela Iancuta
 
Deploying Cassandra Multi-cloud
Deploying Cassandra Multi-cloudDeploying Cassandra Multi-cloud
Deploying Cassandra Multi-cloud
Jeffrey Carpenter
 
DPDK IPSec Security Gateway Application
DPDK IPSec Security Gateway ApplicationDPDK IPSec Security Gateway Application
DPDK IPSec Security Gateway Application
Michelle Holley
 
Spark in the Maritime Domain
Spark in the Maritime DomainSpark in the Maritime Domain
Spark in the Maritime Domain
Demi Ben-Ari
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
Data Con LA
 
Spark on YARN: The Road Ahead
Spark on YARN: The Road AheadSpark on YARN: The Road Ahead
Spark on YARN: The Road Ahead
Cloudera, Inc.
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark Summit
 
ETL Made Easy with Azure Data Factory and Azure Databricks
ETL Made Easy with Azure Data Factory and Azure DatabricksETL Made Easy with Azure Data Factory and Azure Databricks
ETL Made Easy with Azure Data Factory and Azure Databricks
Databricks
 
Streaming meetup
Streaming meetupStreaming meetup
Streaming meetup
karthik_krk
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit
 

Similar to Efficient Spark Analytics on Encrypted Data with Gidon Gershinsky (20)

Big Data Security in Apache Projects by Gidon Gershinsky
Big Data Security in Apache Projects by Gidon GershinskyBig Data Security in Apache Projects by Gidon Gershinsky
Big Data Security in Apache Projects by Gidon Gershinsky
 
ApacheCon: Apache Commons Crypto 2017
ApacheCon: Apache Commons Crypto 2017ApacheCon: Apache Commons Crypto 2017
ApacheCon: Apache Commons Crypto 2017
 
In-Memory Key Value Store (KVS) in FPGA for Ultra Low Latency and High Throug...
In-Memory Key Value Store (KVS) in FPGA for Ultra Low Latency and High Throug...In-Memory Key Value Store (KVS) in FPGA for Ultra Low Latency and High Throug...
In-Memory Key Value Store (KVS) in FPGA for Ultra Low Latency and High Throug...
 
Analysis of Security and Compliance using Oracle SPARC T-Series Servers: Emph...
Analysis of Security and Compliance using Oracle SPARC T-Series Servers: Emph...Analysis of Security and Compliance using Oracle SPARC T-Series Servers: Emph...
Analysis of Security and Compliance using Oracle SPARC T-Series Servers: Emph...
 
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-PremiseTackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Data Pipelines and Telephony Fraud Detection Using Machine Learning
Data Pipelines and Telephony Fraud Detection Using Machine Learning Data Pipelines and Telephony Fraud Detection Using Machine Learning
Data Pipelines and Telephony Fraud Detection Using Machine Learning
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
 
Oracle SPARC T7 a M7 servery
Oracle SPARC T7 a M7 serveryOracle SPARC T7 a M7 servery
Oracle SPARC T7 a M7 servery
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
 
Deploying Cassandra Multi-cloud
Deploying Cassandra Multi-cloudDeploying Cassandra Multi-cloud
Deploying Cassandra Multi-cloud
 
DPDK IPSec Security Gateway Application
DPDK IPSec Security Gateway ApplicationDPDK IPSec Security Gateway Application
DPDK IPSec Security Gateway Application
 
Spark in the Maritime Domain
Spark in the Maritime DomainSpark in the Maritime Domain
Spark in the Maritime Domain
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Spark on YARN: The Road Ahead
Spark on YARN: The Road AheadSpark on YARN: The Road Ahead
Spark on YARN: The Road Ahead
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
 
ETL Made Easy with Azure Data Factory and Azure Databricks
ETL Made Easy with Azure Data Factory and Azure DatabricksETL Made Easy with Azure Data Factory and Azure Databricks
ETL Made Easy with Azure Data Factory and Azure Databricks
 
Streaming meetup
Streaming meetupStreaming meetup
Streaming meetup
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 

Recently uploaded (20)

Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 

Efficient Spark Analytics on Encrypted Data with Gidon Gershinsky

  • 1. Gidon Gershinsky, IBM Research - Haifa gidon@il.ibm.com Efficient Spark Analytics on Encrypted Data #SAISDev14
  • 2. Overview • What problem we are solving? • Parquet modular encryption • Spark encryption with Parquet – connected car scenario: demo screenshots – performance implications of encryption – getting from prototype to real thing 2#SAISDev14
  • 3. What Problem Are We Solving? 3 • Protect sensitive data-at-rest – data confidentiality: encryption – data integrity – in any storage - untrusted, cloud or private, file system, object store, archives • Preserve performance of Spark analytics – advanced data filtering (projection, predicate) with encrypted data • Leverage encryption for fine-grained access control – per-column encryption keys – key-based access in any storage: private -> cloud -> archive #SAISDev14
  • 4. Background • EU Horizon2020 research project • EU partners: Adaptant, IT Innovation, OCC, Thales, UDE – usage-based car insurance, social services • Collaboration with UC Berkeley RISELab – Opaque technology • Secure cloud analytics (Spark and H/W Enclaves) – how to protect “data-in-use”? – how to protect “data-at-rest” – without losing analytics efficiency? 4#SAISDev14
  • 5. Apache Parquet • Integral part of Apache Spark • Encoding, compression • Advanced data filtering – columnar projection: skip columns – predicate push down: skip files, or row groups, or data pages • Performance benefits – less data to fetch from storage: I/O, latency – less data to process: CPU, latency • How to protect sensitive Parquet data? – in any storage - keeping performance, supporting column access control, data tamper-proofing etc. 5#SAISDev14
  • 6. Parquet Modular Encryption • Apache Parquet community work • Full encryption: all data and metadata modules – metadata: indexes (min/max), schema, size and number of secret values, encryption key metadata, etc .. – encrypted and tamper-proof • Enables columnar projection and predicate pushdown • Storage server / admin never sees encryption keys or cleartext data – “client-side” encryption (e.g., Spark-side) 6#SAISDev14
  • 7. Parquet Modular Encryption • Works in any storage – supporting range-read • Multiple encryption algorithms – different security and performance requirements • Data integrity verification – AES GCM: “authenticated encryption” – data not tampered with – file not replaced with wrong version • Column access control – encryption with column-specific keys 7#SAISDev14
  • 8. AES Modes: GCM and CTR • AES: symmetric encryption standard supported in – Java and other languages – CPUs (x86 and others): hardware acceleration of AES • GCM (Galois Counter Mode) – encryption – integrity verification • basic mode: make sure data not modified • AAD mode: “additional authentication data”, e.g. file name with a version/timestamp – prevent replacing with wrong (old) version • CTR (Counter Mode) – encryption only, no integrity verification – useful when AES hardware acceleration is not available (e.g. Java 8) 8#SAISDev14
  • 9. Status & Roadmap • PARQUET-1178 – full encryption functionality layer, with API access – format definition PR merged in Apache Parquet feature branch – Java and C++ implementation • one PR merged, others being reviewed • Next: tools on top ➢ PARQUET-1373: key management tools ▪ number of different mechanisms ▪ options for storing key material ➢ PARQUET-1376: data obfuscation layer ▪ per-cell masking and aggregated anonymization ▪ reader alerts and choice of obfuscated data 9 ➢ PARQUET-1396: schema service interface ▪ “no-API” interface to encryption ▪ all parameters fetched via loaded class ➢ Demo feature: Property interface ▪ “no-API” interface to encryption ▪ most parameters passed via (Hadoop) config properties ▪ key server access via loaded class #SAISDev14
  • 10. Spark Integration Prototype • Standard Spark 2.3.0 is built with Parquet 1.8.2 • Encryption prototype: standard Spark 2.3.0 with Parquet 1.8.2-E – Spark-side encryption and decryption • storage (admin) never sees data key or plain data – Column access control (key-per-column) – Cryptographic verification of data integrity • replaced with wrong version, tampered with – Filtering of encrypted data • column projection, predicate push-down 10#SAISDev14
  • 11. Connected Car Usecase 11 Car Monitoring Events - car ID (plate #) - driver ID (license #) - speed - location Personal Data of Drivers - driver ID (license #) - name - home address - email address - credit card Gateway Usage-based Insurance Service Spark Parquet who is speeding at night? (20% and more above limit) Raise premium, Notify by email #SAISDev14
  • 12. Car Event Schema 12 Secret data, to be encrypted Unencrypted columns #SAISDev14
  • 13. Driver Table Schema 13 Secret data, to be encrypted #SAISDev14 Masked data
  • 14. Writing Encrypted Parquet Files 14 Data Keys (base64) #SAISDev14
  • 15. Writing Encrypted Parquet Files 15#SAISDev14
  • 16. Reading Encrypted Parquet Files 16 location key: not eligible credit card key: not eligible #SAISDev14 table schema and other metadata are protected
  • 19. Tampering with Data 19 missed Jack Wilson, 80% above speed limit !! #SAISDev14
  • 20. Parquet Data Integrity Protection 20#SAISDev14
  • 21. Key Management 21#SAISDev14 Spark Parquet Client AUTH Service Token KMS • PARQUET-1373 – randomly generated Data keys – wrapped in KMS with Master keys – storage options for wrapped keys • in file itself: easier to find • in separate location: easier to re-key – compromised Master key – client: specify Master key ID per column
  • 22. IBM Spark-based Services • IBM Analytics Engine – on-demand Spark (and Hadoop) clusters in IBM cloud • Watson Studio Spark Environments – cloud tools for data scientists and application developers – dedicated Spark cluster per Notebook • DB2 Event Store – rapidly ingest, store and analyze streamed data in event-driven applications – analyze with a combination of Spark and C++ engines, store with C++ Parquet • SQL Query Service – serverless scale-out SQL execution on TB of data – Spark SQL engine, running in IBM Cloud 22#SAISDev14
  • 23. Column Access Control with Encryption @Uber • Access control to sensitive columns only • PARQUET-1178 provides fundamentals to encrypt columns • PARQUET-1396 provides schema and key retrieval interface • Currently under development - stay tuned! 23#SAISDev14
  • 24. Encryption Performance • Ubuntu 16.04 box – 8-core Intel Core i7 6820HQ CPU / 2.70GHz – 32GB memory • Raw Java AES-GCM speed – no Spark or Parquet, just encrypting buffers in loop • 8 threads, 1MB buffers – Java 8: ~ 250 MB/sec – Java 9 and later: ~ 2-5 GB/sec • AES-NI in Hotspot (hardware acceleration) • Slow warmup in decryption: Bug report JDK-8201633 – can be bypassed with a workaround – if you know someone who knows someone in Java Hotspot team… 24 Not a Benchmark #SAISDev14
  • 25. Spark Encryption Performance • Local mode, 1 executor: 8 cores, 32GB memory • Storage: Local RamDisk • Write / encryption test – load carEvents.csv files (10GB) – cache in Spark memory! – write as parquet{.encrypted}: measure time (of second write) • Query / decryption test – ‘read’ carEvents.parquet{.encrypted} files (2.7GB) – run “speeding at night” query: measure time 25#SAISDev14
  • 26. Spark Encryption Performance • Typical usecases - only 5-10% of columns are sensitive (encrypted) • Java 8… If needed, use GCM_CTR instead of GCM • Next Java version in Spark will support AES hardware acceleration • Encryption is not a bottleneck – app workload, data I/O, encoding, compression 26 Test Write (sec) Query (sec) Notes no encryption 26 2 query on 4 columns: input ~12% of data encryption (GCM) 28.5 2.5 GCM on data and metadata encryption (GCM_CTR) 26.8 2.2 CTR on data, GCM on metadata To Be Benchmarked! #SAISDev14
  • 27. From Prototype to Real Thing • Spark encryption prototype – standard Spark code, built with modified Parquet – encryption is triggered via Hadoop config • transparent to Spark • That’s it? Just upgrade to new Parquet version? Not really .. – partitions! • exception on encrypted columns? obfuscate or encrypt sensitive partition values? – Spark support for other things above file format • TBD! 27#SAISDev14 Parquet encryption: Community work. On track. Feedback welcome! Spark encryption: Prototype, with gaps. Open challenge.
  • 29. Existing Solutions • Bottom line: Need data protection mechanism built in the file format 29 Flat file encryption Storage server encryption with range read Storage client encryption with range read Preserves analytics performance X V V Works in any storage system V X X Hides data and keys from storage system V X V Supports data integrity verification V V X Enables column access control (column keys) X X X #SAISDev14
  • 30. Obfuscated Data 30 no keys for personal data where the worst drivers live? anti-speeding ad campaign: budget per State? Transportation Safety Agency Kills! Speeding #SAISDev14