This presentation discusses Parquet encryption at scale using Spark and Parquet. It covers the goals of Parquet modular encryption, including data privacy, integrity, and performance. It demonstrates writing and reading encrypted Parquet files in Spark and discusses the Apache community roadmap for further integration of Parquet encryption.
Efficient Spark Analytics on Encrypted Data with Gidon Gershinsky (Databricks)
Enterprises and non-profit organizations often work with sensitive business or personal information that must be stored in an encrypted form due to corporate confidentiality requirements, GDPR, and other reasons. Unfortunately, straightforward encryption doesn't work well for modern columnar data formats, such as Apache Parquet, that are leveraged by Spark to accelerate data ingest and processing. When Parquet files are bulk-encrypted at the storage layer, their internal modules can't be extracted, leading to a loss of column/row filtering capabilities and a significant slowdown of Spark workloads.
Existing solutions suffer from either performance or security drawbacks. We work with the Apache Parquet community on a new modular encryption mechanism that enables full columnar projection and predicate pushdown (filtering) functionality on encrypted data in any storage system. Besides confidentiality, the mechanism supports data authentication, where the reader can verify that a file has not been tampered with or replaced with a wrong version. Different columns can be encrypted with different keys, allowing for fine-grained access control.
In this talk, I will demonstrate Spark integration with the Parquet modular encryption mechanism, running efficient analytics directly on encrypted data. The demonstration scenarios are derived from use cases in our joint research project with a number of European companies working with sensitive data such as connected-car messages (location, speed, driver identity, etc.). I will describe the encryption mechanism and the observed performance implications of encrypting and decrypting data in Spark SQL workloads.
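To make the write/read flow concrete, here is a minimal sketch (not taken from the talk itself) of configuring Parquet modular encryption from Spark. It assumes a Spark build with parquet-mr 1.12+ on the classpath; the demo in-memory KMS, key material, column names and paths are illustrative placeholders only.

```scala
// Hedged sketch: Parquet modular encryption from Spark. Key names, key bytes,
// column names and paths are placeholders; InMemoryKMS is a demo-only KMS client.
// carsDF is assumed to be an existing DataFrame with location/driver_id/speed columns.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hc.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")   // demo KMS, not for production
hc.set("parquet.encryption.key.list",
  "footerKey: AAECAwQFBgcICQoLDA0ODw==, colKey: AAECAAECAAECAAECAAECAA==")

carsDF.write
  .option("parquet.encryption.footer.key", "footerKey")
  .option("parquet.encryption.column.keys", "colKey:location,driver_id") // encrypt only sensitive columns
  .parquet("/data/cars_encrypted")

// A reader holding the keys decrypts transparently; column projection and
// predicate pushdown keep working because Parquet modules are encrypted individually.
spark.read.parquet("/data/cars_encrypted")
  .select("location")
  .where("speed > 100")
  .show()
```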
A Deep Dive into Query Execution Engine of Spark SQL (Databricks)
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. Relational queries are compiled into executable physical plans consisting of transformations and actions on RDDs, with generated Java code. The code is compiled to Java bytecode, executed by the JVM, and optimized by the JIT to native machine code at runtime. This talk takes a deep dive into the Spark SQL execution engine, covering pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.
Tame the small files problem and optimize data layout for streaming ingestion... (Flink Forward)
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: (1) a small-files problem that can hurt read performance, and (2) poor data clustering that can make file pruning less effective. To address these two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partitioning. This can reduce the number of concurrent files that every task writes, and it can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling.
by Gang Ye & Steven Wu
Optimizing Delta/Parquet Data Lakes for Apache Spark (Databricks)
This talk will start by explaining the optimal file format, compression algorithm, and file size for plain vanilla Parquet data lakes. It discusses the small file problem and how you can compact the small files. Then we will talk about partitioning Parquet data lakes on disk and how to examine Spark physical plans when running queries on a partitioned lake.
We will discuss why it's better to avoid PartitionFilters and directly grab partitions when querying partitioned lakes. We will explain why partitioned lakes tend to have a massive small file problem and why it's hard to compact a partitioned lake. Then we'll move on to Delta lakes and explain how they offer cool features on top of what's available in Parquet. We'll start with Delta 101 best practices and then move on to compacting with the OPTIMIZE command.
We'll talk about creating a partitioned Delta lake and how OPTIMIZE works on a partitioned lake. Then we'll talk about ZORDER indexes and how to incrementally update lakes with a ZORDER index. We'll finish with a discussion of adding a ZORDER index to a partitioned Delta data lake.
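For reference, the compaction and clustering steps mentioned above boil down to commands like the following hedged sketch (Delta Lake SQL on Databricks, or OSS Delta where OPTIMIZE and ZORDER are available); the table name, predicate and column are illustrative.

```scala
// Hedged sketch of Delta compaction and Z-ordering; table/column names are placeholders.
spark.sql("OPTIMIZE events")                                                   // compact small files across the table
spark.sql("OPTIMIZE events WHERE date >= '2021-01-01' ZORDER BY (eventType)")  // cluster only recent partitions
spark.sql("DESCRIBE HISTORY events").show(false)                               // inspect what OPTIMIZE actually did
```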
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 (StreamNative)
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL, which can be joined with other data sources. This plugin will soon get a rename to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
How to Extend Apache Spark with Customized Optimizations (Databricks)
There is a growing set of optimization mechanisms that allow you to achieve competitive SQL performance. Spark has extension points that help third parties add customizations and optimizations without needing those optimizations to be merged into Apache Spark. This is very powerful and helps extensibility. We have added some enhancements to the existing extension points framework to enable fine-grained control. This talk will be a deep dive into the extension points that are available in Spark today. We will also talk about the enhancements to this API that we developed to make it more powerful. This talk will benefit developers who are looking to customize Spark in their deployments.
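As a small, hypothetical illustration of the extension points being discussed, the sketch below injects a do-nothing optimizer rule through SparkSessionExtensions; the rule name and logging are placeholders, not the API enhancements the talk describes.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: observes every optimized plan and returns it unchanged.
case class LogPlanRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"Optimizer saw plan:\n$plan")
    plan
  }
}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("extension-points-demo")
  .withExtensions(ext => ext.injectOptimizerRule(LogPlanRule))   // the extension point
  .getOrCreate()
```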
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with a comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 to show its generality and flexibility.
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB (YugabyteDB)
Slides from the "Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB" webinar by Amey Banarse, Principal Data Architect at Yugabyte, recorded on Oct 30, 2019 at 11 AM Pacific.
Playback here: https://vimeo.com/369929255
To provide better security, ORC files are adding column encryption. Column encryption provides the ability to grant access to different columns within the same file. All of the encryption is handled transparently to the user.
The Parquet Format and Performance Optimization Opportunities (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
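A brief, hypothetical sketch of how some of the optimizations above surface in everyday Spark code; the dataset, columns and path are placeholders, and the plan output described in the comment is what to look for rather than a guaranteed rendering.

```scala
import spark.implicits._

// Write a partitioned, snappy-compressed Parquet dataset (partitioning scheme + page compression).
// eventsDF is assumed to be an existing DataFrame with country/user_id/amount columns.
eventsDF.write
  .partitionBy("country")
  .option("compression", "snappy")
  .parquet("/data/events_parquet")

// Read it back with a selective query: partition pruning handles `country`,
// min/max and dictionary filtering handle `amount` via predicate pushdown.
val q = spark.read.parquet("/data/events_parquet")
  .select("user_id", "amount")
  .where($"country" === "DE" && $"amount" > 100)

// Check the physical plan for PartitionFilters and PushedFilters entries.
q.explain(true)
```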
Protect your private data with ORC column encryption (Owen O'Malley)
Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. ORC is a self-describing type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads but with integrated support for finding required rows quickly.
Owen O'Malley dives into the progress the Apache community has made in adding fine-grained column-level encryption natively into the ORC format, which also provides capabilities to mask or redact data on write while protecting sensitive column metadata such as statistics to avoid information leakage. The column encryption capabilities will be fully compatible with Hadoop Key Management Server (KMS) and use the KMS to manage master keys, providing the additional flexibility to use and manage keys per column centrally.
Netflix's Big Data Platform team manages a data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. With a data warehouse at this scale, it is a constant challenge to keep improving performance. This talk will focus on Iceberg, a new table metadata format that is designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes to guarantee jobs always use consistent table snapshots.
In this session, you'll learn:
• Some background about big data at Netflix
• Why Iceberg is needed and the drawbacks of the current tables used by Spark and Hive
• How Iceberg maintains table metadata to make queries fast and reliable
• The benefits of Iceberg's design and how it is changing the way Netflix manages its data warehouse
• How you can get started using Iceberg (a rough sketch follows this entry)
Speaker
Ryan Blue, Software Engineer, Netflix
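For the last bullet above, here is a rough sketch of what first steps with Iceberg from Spark SQL can look like; the catalog name, warehouse location and table schema are assumptions, and the iceberg-spark runtime package must be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: register a Hadoop-backed Iceberg catalog and create a first table.
// Catalog name, warehouse path and schema are placeholders.
val spark = SparkSession.builder()
  .appName("iceberg-getting-started")
  .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.demo.type", "hadoop")
  .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
  .getOrCreate()

spark.sql("CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP, payload STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp(), 'hello')")
spark.sql("SELECT * FROM demo.db.events").show()
// Table metadata (snapshots, manifests) now lives under the warehouse path,
// which is what lets Iceberg plan jobs without expensive directory listings.
```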
Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake (Databricks)
Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the change log (binlog) of a relational OLTP database and replays those changes in a timely manner to external storage such as Delta or Kudu for real-time OLAP. To implement a robust CDC streaming pipeline, many factors must be considered, such as how to ensure data accuracy, how to handle schema changes in the OLTP source, and whether the pipeline is easy to build for a variety of databases with little code.
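One common shape of such a pipeline, sketched here under assumptions rather than taken from the talk: a Structured Streaming job reads parsed binlog rows and upserts them into a Delta table with MERGE inside foreachBatch. The topic, path, key column and op codes are placeholders.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// Hedged sketch of a CDC upsert; assumes each micro-batch already carries parsed
// binlog columns (id, op, ...) and that a Delta target exists at the given path.
def upsertBatch(changes: DataFrame, batchId: Long): Unit = {
  val target = DeltaTable.forPath(changes.sparkSession, "/lake/users_delta")
  target.as("t")
    .merge(changes.as("s"), "t.id = s.id")
    .whenMatched("s.op = 'DELETE'").delete()
    .whenMatched().updateAll()
    .whenNotMatched("s.op != 'DELETE'").insertAll()
    .execute()
}

val binlogStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "mysql.users.binlog")
  .load()
// ... parse the Kafka value into (id, op, ...) columns here before writing ...

binlogStream.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) => upsertBatch(batchDF, batchId) }
  .option("checkpointLocation", "/lake/_checkpoints/users_cdc")
  .start()
```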
Achieving Lakehouse Models with Spark 3.0 (Databricks)
It's very easy to be distracted by the latest and greatest approaches with technology, but sometimes there's a reason old approaches stand the test of time. Star schemas and Kimball modelling are among those things that aren't going anywhere, but as we move towards the "Data Lakehouse" paradigm, how appropriate is this modelling technique, and how can we harness the Delta Engine and Spark 3.0 to maximise its performance?
Parallelization of Structured Streaming Jobs Using Delta Lake (Databricks)
We'll tackle the problem of running streaming jobs from another perspective using Databricks Delta Lake, while examining some of the issues that we faced at Tubi while running regular Structured Streaming. We'll give a quick overview of why we transitioned from Parquet data files to Delta and the problems it solved for us in running our streaming jobs.
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu... (HostedbyConfluent)
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Buesing | Current 2022
Businesses need to react to results immediately; to achieve this, real-time processing is becoming a requirement in many analytic verticals. But sometimes, the move from batch to real-time can leave you in a pinch. How do you handle and correct mistakes in your data? How do you migrate a new system to real-time along with historical data?
Let's start with how to run Apache Druid locally in your containerized development environment. While streaming real-time events from Kafka into Druid, an S3-compliant store captures messages via Kafka Connect for historical processing. We'll explore what happens when the real-time stream of events contains historical data, how that affects performance, and the techniques to prevent those issues, leaving you with a high-performance analytic platform supporting real-time and historical processing.
You’ll leave with the tools of doing real-time analytic processing and historical batch processing from a single source of truth. Your Druid cluster will have better rollups (pre-computed aggregates) and fewer segments, which reduces cost and improves query performance.
Tuning Apache Kafka Connectors for Flink (Flink Forward)
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows, and in this session we'll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We'll inspect the params that help us quickly spot an application lock or crash, the ones that can significantly improve performance, and the ones to touch with gloves since they could cause more harm than benefit. Moreover, we'll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we'll discuss the Kafka sink. After browsing the available options we'll dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by Olena Babenko
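As a hedged illustration (not taken from the session) of the kind of knobs discussed, the sketch below builds a Flink KafkaSource with an explicit partition-discovery interval and a plain consumer override; the broker, topic and values are placeholders.

```scala
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Hedged sketch: a KafkaSource with a couple of non-default knobs. Values are illustrative.
val source = KafkaSource.builder[String]()
  .setBootstrapServers("broker:9092")
  .setTopics("events")
  .setGroupId("flink-events-consumer")
  .setStartingOffsets(OffsetsInitializer.earliest())
  .setValueOnlyDeserializer(new SimpleStringSchema())
  .setProperty("partition.discovery.interval.ms", "60000") // pick up new partitions at runtime
  .setProperty("max.poll.records", "500")                  // plain Kafka consumer override
  .build()

env.fromSource(source, WatermarkStrategy.noWatermarks[String](), "kafka-events")
  .print()

env.execute("kafka-tuning-sketch")
```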
How Kafka Powers the World's Most Popular Vector Database System with Charles... (HostedbyConfluent)
We use Kafka as the data backbone to build Milvus, an open-source vector database system that has been adopted by thousands of organizations worldwide for vector similarity search. In this presentation, we will share how Milvus uses Kafka to enable both real-time processing and batch processing on vector data at scale. We will walk through the challenges of unified streaming and batching in vector data processing, as well as the design choices and the Kafka-based data architecture.
Mario Molina, Software Engineer
CDC systems are usually used to identify changes in data sources and to capture and replicate those changes to other systems. Companies are using CDC to sync data across systems, migrate to the cloud, or even apply stream processing, among other uses.
In this presentation we’ll see CDC patterns, how to use it in Apache Kafka, and do a live demo!
https://www.meetup.com/Mexico-Kafka/events/277309497/
These slides present how DBT, Coral, and Iceberg can provide a novel data management experience for defining SQL workflows. In this UX, users define their workflows as a cascade of SQL queries, which then get auto-materialized and incrementally maintained. Applications of this user experience include Declarative DAG workflows, streaming/batch convergence, and materialized views.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... (Databricks)
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.
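A few of the knobs such a tuning walkthrough typically touches, shown as a hedged sketch; the values and table names are illustrative, not recommendations from the talk.

```scala
// Hedged sketch: common Spark SQL tuning knobs and plan inspection. Values are illustrative.
spark.conf.set("spark.sql.adaptive.enabled", "true")            // adaptive query execution (Spark 3.x)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "64MB")  // broadcast-join size cutoff
spark.conf.set("spark.sql.shuffle.partitions", "400")           // shuffle parallelism

// Inspect the parsed/analyzed/optimized/physical plans for a join.
spark.table("sales")
  .join(spark.table("dim_store"), "store_id")
  .explain(true)
```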
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ... (Databricks)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
DataStax | Best Practices for Securing DataStax Enterprise (Matt Kennedy) | C... (DataStax)
This talk will review the advanced security features in DataStax Enterprise and discuss best practices for secure deployments. In particular, topics reviewed will cover: Authentication with Kerberos & LDAP/Active Directory, Role-based Authorization and LDAP role assignment, Auditing, Securing network communication, Encrypting data files and using the Key-Management Interoperability Protocol (KMIP) for secure off-host key management. The talk will also suggest strategies for addressing security needs not met directly by the built-in features of the database such as how to address applications that require Attribute Based Access Control (ABAC).
About the Speaker
Matt Kennedy Sr. Product Manager, DataStax
Matt Kennedy works at DataStax as the product manager for DataStax Enterprise Core. Matt has been a Cassandra user and occasional contributor since version 0.7 and was named a Cassandra MVP in 2013 shortly before joining DataStax. Unlike Cassandra, Matt is not partition tolerant.
Toni de la Fuente - Automate or die! How to survive to an attack in the Cloud... (RootedCON)
Incident response and forensic analysis procedures are different in the cloud than when they are carried out in traditional, on-premises environments. We will look at the differences between traditional digital forensics and forensics on cloud systems in AWS, Azure or Google Cloud Platform. When it comes to the cloud and we operate in a fully virtual environment, we face challenges that differ from the traditional world. What used to be hardware is now software. With cloud infrastructure providers we work with APIs; we create, delete or modify any resource with a call to their API. We have load balancers, servers, routers, firewalls, databases, WAFs, encryption systems and many more resources, without opening a box or touching a cable, all at the stroke of a command. This is what we know as infrastructure as code. If you can program it, you can automate it. How can we take advantage of this from the standpoint of incident response, forensic analysis, or even automated hardening?
Security in IaaS: attacks, hardening, incident response, forensics, and everything about their automation. Although I will talk about general concepts related to AWS, Azure and GCP, I will show specific demos and threats in AWS and go into detail on some caveats and hazards in AWS.
Describes 3 levels of complexity when implementing a secret management architecture, and presents 2 real world examples.
Technologies used: Hashicorp Vault, Chef Vault, AWS KMS, git-crypt.
Apache Spot (incubating) can be installed on a new or existing Hadoop cluster, with its components viewed as services and distributed according to common roles in the cluster.
Design-Time Properties in Custom Pipeline Components (Daniel Toomey)
Understanding Design-Time Properties for Custom Pipeline Components in BizTalk Server
Based on an article by Saravana Kumar, MCAD (Charter member), MCP (BizTalk 2004)
Published: December 2006
Key-aggregate cryptosystem for scalable data sharing in cloud (Sravan Narra)
Data sharing is an important functionality in cloud storage.
Here we show how to securely, efficiently, and flexibly share data with others in cloud storage.
We describe new public-key cryptosystems that produce constant-size ciphertexts.
One can aggregate any set of secret keys and make them as compact as a single key, but encompassing the power of all the keys being aggregated.
Leveraging Apache Spark and Delta Lake for Efficient Data Encryption at Scale (Databricks)
The increase in consumer data privacy laws brings continuing challenges to data teams all over the world which collect, store, and use data protected by these laws. The data engineering team at Mars Petcare is no exception, and in order to improve efficiency and accuracy in responding to these challenges they have built Gecko: an efficient, auditable, and simple CCPA compliance ecosystem designed for Spark and Delta Lake.
Can puppet help you run docker on a T2.Micro? (Neil Millard)
A Puppet beginner's guide to a number of the key concepts of Puppet: stages, roles and profiles, Hiera data and Puppet Forge, as well as a brief introduction to Docker.
We will use these to explain a solution that runs a Puppet manifest to configure Amazon's smallest server to run a Docker-containerised web service.
You will learn why Puppet stages are required in this solution, how roles and profiles are defined and used, and finally how to use the Puppet Forge with Hiera data to install and run Docker containers.
This talk will contain links to code that can be used afterwards and we'll touch on what docker is and how to configure the puppet module to automatically run containers.
SF Big Analytics 2018-04-18: Evolution of GoPro's data platform (Chester Chen)
Talk 1: Evolution of GoPro's data platform
In this talk, we will share GoPro's experiences in building a data analytics cluster in the cloud. We will discuss:
• Evolution of the data platform from fixed-size Hadoop clusters to a cloud-based Spark cluster with a centralized Hive metastore + S3: cost benefits and DevOps impact
• A configurable, Spark-based batch ingestion/ETL framework
• Migration of the streaming framework to the cloud + S3
• Analytics metrics delivery with Slack integration
• BedRock: data platform management, visualization & self-service portal
• Visualizing machine learning features via Google Facets + Spark
Speakers: Chester Chen
Chester Chen is the Head of Data Science & Engineering, GoPro. Previously, he was the Director of Engineering at Alpine Data Lab.
David Winters
David is an Architect on the Data Science and Engineering team at GoPro and the creator of their Spark-Kafka data ingestion pipeline. Previously he worked at Apple and Splice Machine.
Hao Zou
Hao is a senior big data engineer on the Data Science and Engineering team. Previously he worked at Alpine Data Labs and Pivotal.
The document provides an overview of the key security challenges in Big Data (Apache Hadoop) systems and showcases the solutions used by the Hortonworks distribution to address these security challenges.
Similar to Data Security at Scale through Spark and Parquet Encryption
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
• Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
• Performing data quality validations using libraries built to work with Spark (a minimal sketch follows this list)
• Dynamically generating pipelines that can be abstracted away from users
• Flagging data that doesn't meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
• Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
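As a hypothetical, minimal example of a Spark-based validation of the kind described above (not Zillow's actual library), the sketch below profiles a column's completeness and fails fast when it misses an expectation; the table, column and threshold are placeholders.

```scala
import org.apache.spark.sql.functions.col

// Hypothetical completeness check; dataset, column and threshold are placeholders.
val listings     = spark.table("listings")
val total        = listings.count()
val nonNullZip   = listings.filter(col("zipcode").isNotNull).count()
val completeness = if (total == 0) 1.0 else nonNullZip.toDouble / total

require(completeness >= 0.99,
  f"zipcode completeness $completeness%.4f is below the 0.99 expectation")
```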
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, I worked with my team to build high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM not to be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
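A hedged sketch of the Spark 3.1 stage-level scheduling API described above: the ETL stages run with default resources, and only the stages after withResources request GPUs. The resource amounts, discovery script path and placeholder training logic are assumptions, not taken from the talk.

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Build a resource profile that asks for GPU-equipped executors.
val execReqs = new ExecutorResourceRequests()
  .cores(8)
  .memory("16g")
  .resource("gpu", 1, "/opt/spark/getGpusResources.sh")   // discovery script path is illustrative
val taskReqs = new TaskResourceRequests().cpus(2).resource("gpu", 1)

val rpb = new ResourceProfileBuilder()
rpb.require(execReqs).require(taskReqs)
val gpuProfile = rpb.build

// ETL runs with the default profile; stages after withResources run under the GPU profile.
val prepared = spark.sparkContext.textFile("/data/raw").map(_.split(","))   // data prep (CPU)
val results = prepared
  .withResources(gpuProfile)                     // stages from here on request GPUs
  .mapPartitions { rows => Iterator(rows.size) } // placeholder for the real training logic
results.collect()
```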
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
• Understanding key traits of Apache Spark on Kubernetes
• Things to know when running Apache Spark on Kubernetes, such as autoscaling
• Demonstrating how to run analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe the properties of sawtooth windows that we exploit to achieve online-offline consistency while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also cover a simple deployment strategy for correcting feature drift caused by operations over change data that are not abelian groups.
We want to present multiple anti-patterns for using Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job – Dispatch New Jobs by Polling a Redis Queue
· Why?
  o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming?
· Working solution using Redis
Niche 2: Distributed Counters (a minimal sketch follows this list)
· Problems with Spark accumulators
· Utilize Redis hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
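As a minimal sketch of Niche 2, assuming the Jedis client and a hypothetical job:counters hash; HINCRBY is atomic, so concurrent tasks do not lose updates, but (per the precautions above) retried or speculative task attempts will double-count unless de-duplicated:

import redis.clients.jedis.Jedis

// Hypothetical sketch: each partition bumps a field of a Redis hash atomically.
df.rdd.foreachPartition { rows =>
  val jedis = new Jedis("redis-host", 6379)   // assumed Redis endpoint
  try {
    val processed = rows.size                 // records seen by this task
    jedis.hincrBy("job:counters", "recordsProcessed", processed)
  } finally jedis.close()
}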
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures, and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open-source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operation, including streaming, batch, and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
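As a toy illustration of rule (ii), a small decision tree can be folded into the relational plan as a nested conditional that Catalyst can then optimize; this is a hypothetical rewrite with made-up column names, not Raven's actual API:

import org.apache.spark.sql.functions.{when, lit, col}

// Hypothetical: a depth-2 decision tree over `amount` and `numItems`,
// expressed as a conditional expression inside the SQL plan.
val scored = orders.withColumn("prediction",
  when(col("amount") < 100,
    when(col("numItems") < 3, lit(0.1)).otherwise(lit(0.4)))
  .otherwise(lit(0.9)))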
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This powers various marketing scenarios activated across multiple platforms and channels such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake, and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade-Offs with Various Formats
Go Over Anti-Patterns Used (String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) reduces duplicate computation and thus can also reduce iteration time. Road networks often contain chains that can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes can be computed directly; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
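For example, the converged rank of a chain vertex can be computed directly once its single predecessor's rank is known (standard PageRank with damping factor d over n vertices; a sketch, not the STICD implementation):

// rank(v) for a chain vertex v whose only in-neighbour u has out-degree 1
def chainRank(rankU: Double, d: Double, n: Long): Double =
  (1 - d) / n + d * rankU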
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Data Security at Scale through Spark and Parquet Encryption
1. Apple logo is a trademark of Apple Inc.
Gidon Gershinsky, Tim Perelmutov | Data + AI Summit
Data Security at Scale through
Spark and Parquet Encryption
THIS IS NOT A CONTRIBUTION
2. Presenters
Gidon Gershinsky
• Designs and builds data security solutions at Apple
• Leading role in Apache Parquet community work on data encryption
Tim Perelmutov
• Data ingestion and analytics for iCloud
3. Agenda
Parquet Encryption: Goals and Features
Status in Apache projects
API and “Hello World” samples
Community Roadmap
Demo Learnings: using Parquet Encryption at Scale
4. Apache Parquet
• Popular columnar storage format
• Encoding, compression
• Advanced data filtering
  • columnar projection: skip columns
  • predicate push down: skip files, row groups, or data pages
• Performance benefits of Parquet filtering
  - less data to fetch from storage: I/O, time
  - less data to process: CPU, latency
• How to protect sensitive Parquet data?
[Figure: columnar statistics - read only the data you need (Strata 2017 Parquet Arrow Roadmap)]
5. Parquet Modular Encryption: Goals
Protect sensitive data-at-rest
• data privacy / confidentiality
  - hiding sensitive information
• data integrity
  - tamper-proofing sensitive information
[Photo by Manuel Geissinger from Pexels]
6. Parquet Modular Encryption: Goals
Preserve performance of analytic engines
• Full Parquet capabilities (columnar projection, predicate pushdown, etc.) with encrypted data
• Big Data challenge: integrity protection
  • signing full files will break Parquet filtering, and slow analytic workloads down by order(s) of magnitude
[Figure: read only the data you need (2017 Parquet Arrow Roadmap)]
7. Parquet Modular Encryption: Goals
Define an open standard for safe storage of analytic data
• works the same in any storage
  • cloud or private, file systems, object stores, archives
  • untrusted storage!
• with any KMS (key management service)
• key-based access in any storage: private - cloud - archive
• enable per-column encryption keys
8. Parquet Modular Encryption: Goals
Big Data Challenges
Safe migration from one storage to another
• no need to import / decrypt / export / encrypt
• simply move the files
Sharing a data subset / table column(s)
• no need to extract / encrypt a copy for each user
• simply provide column key access to eligible users
9. Data Privacy / Confidentiality
Full encryption mode
•all modules are hidden
Plaintext footer mode
•footer is exposed for legacy readers
•sensitive metadata is hidden
Separate keys for sensitive columns
•column access control
“Client-side” encryption
•storage backend / admin never see data or keys
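For example, these modes are selected through standard parquet-mr Hadoop properties (a sketch; key IDs and column names are hypothetical):

// separate keys per sensitive column; columns not listed stay in plaintext
sc.hadoopConfiguration.set("parquet.encryption.column.keys", "keyA:ssn,credit_card;keyB:address")
sc.hadoopConfiguration.set("parquet.encryption.footer.key", "keyF")
// plaintext footer mode, so legacy readers can still read the unencrypted columns
sc.hadoopConfiguration.set("parquet.encryption.plaintext.footer", "true")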
10. Data Integrity
File contents not tampered with
File not replaced with a wrong file
PME signs data and metadata modules
• with module ID and file ID
AES GCM: “authenticated encryption”
Framework for other encryption algorithms
[Example of file replacement: customers-may-2021.part0.parquet vs customers-jan-2020.part0.parquet]
11. Envelope Encryption
• Parquet file modules are encrypted with “Data Encryption Keys” (DEKs)
• DEKs are encrypted with “Master Encryption Keys” (MEKs)
• the result is called “key material” and is stored either in Parquet file footers or in separate files in the same folder
• MEKs are stored and managed in “Key Management Service” (KMS)
• access control verification
• Advanced mode in Parquet: Double Envelope Encryption
• DEKs are encrypted with “Key Encryption Keys” (KEKs)
• KEKs are encrypted with MEKs
• single KMS call in process lifetime / or one call in X minutes, configurable
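To make the flow concrete, here is an illustrative sketch (not the parquet-mr internals; kmsClient stands for whatever KmsClient implementation is configured, shown later in this deck, and the master key ID is hypothetical):

import java.security.SecureRandom

// Illustrative only: one fresh DEK per file/column, wrapped by a master key that never leaves the KMS
val dek = new Array[Byte](16)                                   // data encryption key (AES-128)
new SecureRandom().nextBytes(dek)
val keyMaterial = kmsClient.wrapKey(dek, "footer-master-key")   // hypothetical master key ID
// Parquet modules are encrypted locally with the DEK; only the wrapped DEK (the "key material")
// is written to the file footer or to a separate key-material file next to the data.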
12. Current Status
Thank you to all contributors!
Format
• PME specification approved and released in 2019 (v2.7)
Parquet MR
• Java implementation, released in 2021 (v1.12.0)
• C++ implementation, merged in 2021
• Python interface under construction
Spark
• Parquet updated to 1.12.0 - enables basic encryption out-of-box
• Planned for the Spark 3.2.0 release
Other analytic frameworks
• ongoing work on integrating Parquet encryption
13. Spark with Parquet Encryption
Invoke encryption via Hadoop parameters:
• pass the list of columns to encrypt
• specify IDs of master keys for these columns
• specify the ID of the master key for Parquet footers
• pass the class name of the client for your KMS
• activate encryption
• instructions at PARQUET-1854
• try today! clone the Spark repo and build a runnable distribution
[Diagram: Spark App authenticating to KMS]
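Put together, a minimal write/read sketch in Spark, assuming the demo in-memory KMS class that parquet-mr provides for illustration (key IDs, key bytes, column names, and paths are placeholders; never put master keys in the config in production):

sc.hadoopConfiguration.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")     // demo-only mock KMS
sc.hadoopConfiguration.set("parquet.encryption.key.list",
  "keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==")
// which columns are encrypted with which master key, plus the footer key
sc.hadoopConfiguration.set("parquet.encryption.column.keys", "keyA:ssn,salary")
sc.hadoopConfiguration.set("parquet.encryption.footer.key", "keyB")

df.write.parquet("/tmp/table.parquet.encrypted")              // written encrypted
spark.read.parquet("/tmp/table.parquet.encrypted").show()     // decrypted transparently on read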
17. Real World
• Master keys are kept in KMS
• Develop a client for your KMS server
• Implement the KmsClient interface

public interface KmsClient {
  // encrypt e.g. data key with master key (envelope encryption)
  String wrapKey(byte[] keyBytes, String masterKeyIdentifier);
  // decrypt key
  byte[] unwrapKey(String wrappedKey, String masterKeyIdentifier);
}
18. Example: Hashicorp Vault Client
parquet-mr-1.12.0
• Search for VaultClient in github.com/apache/parquet-mr
• Set up encryption:

sc.hadoopConfiguration.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.samples.VaultClient")
sc.hadoopConfiguration.set("parquet.encryption.key.access.token", <vault token>)
sc.hadoopConfiguration.set("parquet.encryption.kms.instance.url", <vault server url>)
20. Advanced Key Management Features
Minimization of KMS calls
• “double envelope encryption”
  • activated by default (can be disabled)
• single KMS call in process lifetime, or one call in X minutes, configurable
  • per master key
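These knobs map to parquet-mr key-tools properties; a small sketch, assuming the property names from parquet-mr 1.12 (values shown are the defaults):

// double envelope encryption is on by default; set to "false" to disable
sc.hadoopConfiguration.set("parquet.encryption.double.wrapping", "true")
// how long wrapped keys are cached per master key before calling the KMS again (seconds)
sc.hadoopConfiguration.set("parquet.encryption.cache.lifetime.seconds", "600")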
21. Advanced Key Management Features
Key Rotation
• Refresh master keys (periodically or on demand)
• Enable key rotation when writing data
• Rotate master keys in the key management system
• Re-wrap data keys in Parquet files

sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally", "false")
import org.apache.parquet.crypto.keytools.KeyToolkit
KeyToolkit.rotateMasterKeys("/path/to/table.parquet.encrypted", sc.hadoopConfiguration)
23. Parquet encryption with raw Java
Write data

EncryptionPropertiesFactory cryptoFactory =
    EncryptionPropertiesFactory.loadFactory(hadoopConfiguration);
FileEncryptionProperties fileEncryptionProperties =
    cryptoFactory.getFileEncryptionProperties(hadoopConfiguration, /path/to/folder/file, null);
ParquetWriter writer = ParquetWriter.builder(/path/to/folder/file)
    .withConf(hadoopConfiguration)
    …
    .withEncryption(fileEncryptionProperties)
    .build();
// write as usual
24. Parquet encryption with raw Java
Read data
Similar, with:
• DecryptionPropertiesFactory
• ParquetReader.builder.withDecryption
• No need to pass footer and column key properties
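A rough read-side sketch in Scala against the same Java API (readSupport and the path are placeholders, and builder details are elided as in the write example):

import org.apache.hadoop.fs.Path
import org.apache.parquet.crypto.DecryptionPropertiesFactory

val cryptoFactory = DecryptionPropertiesFactory.loadFactory(hadoopConfiguration)
val fileDecryptionProperties =
  cryptoFactory.getFileDecryptionProperties(hadoopConfiguration, new Path("/path/to/folder/file"))
val reader = ParquetReader.builder(readSupport, new Path("/path/to/folder/file"))  // readSupport: your ReadSupport impl
  .withConf(hadoopConfiguration)
  .withDecryption(fileDecryptionProperties)
  .build()
// read as usual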
25. Performance effect of Parquet Encryption
AES ciphers implemented in CPU hardware (AES-NI)
• Gigabyte(s) per second in each thread
• Order(s) of magnitude faster than the “software stack” (App/Framework/Parquet/compression)
• C++: OpenSSL EVP library
Java AES-NI
• AES-NI support in HotSpot since Java 9
• Java 11.0.4 – enhanced AES GCM decryption
• Thanks, Java community!
Bottom line: encryption won’t be your bottleneck
• app workload, data I/O, encoding, compression
26. Community Roadmap
Apache Spark: SPARK-33966: “Two-tier encryption key management”
Apache Parquet MR: new features for parquet-mr-1.13+, such as uniform encryption, CLI for encrypted data, local wrapping with key rotation, etc.
Apache Iceberg, Presto, Hudi: integration with Parquet encryption
Apache Arrow: ARROW-9947: “Python API for Parquet encryption”
27. Data Analytics
iCloud CloudKit Analytics
• Zeppelin and Jupyter on Spark
• Spark batch workflows
  • Weekly reports
  • Ad-hoc analytics
• Cohorts of iCloud users
  • iCloud-wide sample of all users (0.1%)
  • Semantic and geographic cohorts
  • Ad-hoc
• Weekly snapshot of metadata DBs (no user data)
• iCloud server-side activity (uploads, etc.) data streams
  • Anonymized and stripped of private data
• 100s of structured data types organized into external Hive tables
28. iCloud CloudKit Analytics Use Cases
iCloud Storage
• Intelligent tiered storage optimizations use a combination of snapshot and streaming data
• Storage capacity forecasting
• Delete/compaction-eligible data volume, lag
Service utilization and spike analysis
Seed builds monitoring and qualification
Data integrity verification
Quick ad-hoc analytics (minutes in CloudKit Analytics vs hours in Splunk)
29. Encryption Requirements
Master key rotation
Enforce the limit of 2^32 encryption operations with the same key
• each encryption operation = 2^35 bytes (2^31 AES blocks)
Scalable to petabytes of data
Reduce impact on performance of ingestion and analytics workflows
30. PME in CloudKit Analytics
Ingestion Pipelines Modification Steps
• Update Parquet dependency to a PME-compatible version
• Set Hadoop config properties:

parquetConf.set(EncryptionPropertiesFactory.CRYPTO_FACTORY_CLASS_PROPERTY_NAME,
    AppleCryptoFactory.class.getCanonicalName());
// KMS client class
parquetConf.set(KeyToolkit.KMS_CLIENT_CLASS_PROPERTY_NAME,
    CustomerKmsBridge.class.getCanonicalName());
// with this property turned on, we do not need to specify the individual key ids per column
parquetConf.setBoolean(AppleCryptoFactory.UNIFORM_ENCRYPTION_PROPERTY_NAME, true);
// key id for the parquet.encryption.footer.key property
parquetConf.set(PropertiesDrivenCryptoFactory.FOOTER_KEY_PROPERTY_NAME, /*Key Name from Config*/);
// store key material externally (separate files)
parquetConf.setBoolean(KeyToolkit.KEY_MATERIAL_INTERNAL_PROPERTY_NAME, false);
31. PME in CloudKit Analytics
Spark Read Configuration
• Update Spark configuration:

…
properties:
  …
  spark.hadoop.parquet.crypto.factory.class: org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
  # KMS Client class
  spark.hadoop.parquet.encryption.kms.client.class: com.apple.parquet.crypto.keytools.CustomerKmsBridge
32. Write Performance and Storage Space Impact
• All columns encrypted!
• No impact on ingestion time and resource utilization
• Minimal storage penalty
  • measurable only for datasets with small parquet files
  • key material files: a few KB each

$ hadoop fs -ls hdfs://.../bucket=0/
10100101-ff5b0f56-4779-4aea-8765-2d406bcd70a3.parquet
...
_KEY_MATERIAL_FOR_10100102-33ef104e-3ab6-49ee-9a16-b150f7da24ab.parquet.json

[Chart: ingestion time w/o encryption vs. w/ encryption]
33. No Significant Impact on Read Performance
Running a join with aggregation on 2 large tables. All columns encrypted!
• 23.4 sec without encryption
• 25.1 sec with encryption