Building a Virtual Data Lake with Apache Arrow

As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements. 1) Generality: support reading/writing most data management/storage systems. 2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities. Data Source API V2 is one of the most important features introduced in Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 to show its generality and flexibility.
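The two requirements named in this abstract (one general read interface, plus optional capabilities a source can opt into) can be illustrated with a small sketch. This is plain Python, not Spark's Data Source API; the names `ToySource`, `SupportsEqualityPushdown`, `ListSource`, `IndexedSource`, and `read_equal` are hypothetical and exist only for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List

Row = Dict[str, int]


class ToySource(ABC):
    """Hypothetical minimal read interface every source must implement."""

    @abstractmethod
    def scan(self) -> Iterable[Row]:
        """Return every row in the source."""


class SupportsEqualityPushdown(ABC):
    """Hypothetical optional capability: the source can filter rows itself."""

    @abstractmethod
    def scan_equal(self, column: str, value: int) -> Iterable[Row]:
        """Return only rows where row[column] == value."""


class ListSource(ToySource):
    """A source with no special capabilities: it can only do a full scan."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    def scan(self) -> Iterable[Row]:
        return iter(self.rows)


class IndexedSource(ToySource, SupportsEqualityPushdown):
    """A source that keeps an index on one column and can answer filters from it."""

    def __init__(self, rows: List[Row], indexed_column: str) -> None:
        self.rows = rows
        self.indexed_column = indexed_column
        self.index: Dict[int, List[Row]] = {}
        for row in rows:
            self.index.setdefault(row[indexed_column], []).append(row)

    def scan(self) -> Iterable[Row]:
        return iter(self.rows)

    def scan_equal(self, column: str, value: int) -> Iterable[Row]:
        if column == self.indexed_column:
            return iter(self.index.get(value, []))  # answered from the index
        return (row for row in self.rows if row[column] == value)


def read_equal(source: ToySource, column: str, value: int) -> List[Row]:
    """Engine side: push the filter down if the source supports it, else filter here."""
    if isinstance(source, SupportsEqualityPushdown):
        return list(source.scan_equal(column, value))
    return [row for row in source.scan() if row[column] == value]


rows = [{"id": 1, "year": 2019}, {"id": 2, "year": 2020}, {"id": 3, "year": 2020}]
plain = read_equal(ListSource(rows), "year", 2020)
pushed = read_equal(IndexedSource(rows, "year"), "year", 2020)
```

Both calls return the same rows; the only difference is where the filtering work happens, which is the kind of per-system optimization the flexibility requirement describes.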

Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang

Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait. Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout that addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores, such as S3.

Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including:

* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
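Two of the properties listed above, lock-free snapshot isolation for readers and atomic add/remove of files, can be sketched with a toy in-memory table. This is only an illustration of the idea and not Iceberg's actual metadata layout or API; `Snapshot`, `ToyTable`, and the file names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class Snapshot:
    """An immutable list of the data files that make up one table version."""
    snapshot_id: int
    files: FrozenSet[str]


class ToyTable:
    """Hypothetical table whose whole state is a pointer to the current snapshot."""

    def __init__(self) -> None:
        self._snapshots: Dict[int, Snapshot] = {0: Snapshot(0, frozenset())}
        self._current_id = 0

    def current(self) -> Snapshot:
        # Readers need no lock and no directory listing: they read one pointer.
        return self._snapshots[self._current_id]

    def commit(self, base: Snapshot, add: Iterable[str] = (),
               remove: Iterable[str] = ()) -> Snapshot:
        # Optimistic check: refuse to commit on top of a stale snapshot.
        if base.snapshot_id != self._current_id:
            raise RuntimeError("table changed since base snapshot; retry")
        files = (base.files - frozenset(remove)) | frozenset(add)
        new = Snapshot(base.snapshot_id + 1, files)
        self._snapshots[new.snapshot_id] = new
        # The commit is a single pointer swap, so adds and removes appear together.
        self._current_id = new.snapshot_id
        return new


table = ToyTable()
s1 = table.commit(table.current(), add=["a.parquet", "b.parquet"])
reader_view = table.current()  # a reader pins snapshot 1
s2 = table.commit(s1, add=["c.parquet"], remove=["a.parquet"])
```

After the second commit, `reader_view` still lists `a.parquet` and `b.parquet`, while a new call to `table.current()` sees `b.parquet` and `c.parquet`. A writer that still holds `s1` as its base would get a `RuntimeError` and have to retry against the newer snapshot.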

Reshape Data Lake (as of 2020.07)

Eric Sun

Iceberg: A modern table format for big data (Strata NY 2018)

Ryan Blue

Introduction SQL Analytics on Lakehouse Architecture

Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unified streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs. In this talk, we will cover:

* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!

Making Apache Spark Better with Delta Lake

Data lakes have been built with a desire to democratize data: to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve this is hiding the complexity of underlying data structures and physical data storage from users. The de facto standard, the Hive table format, addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg. The Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS. Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg. You will learn:

* The issues that arise when using the Hive table format at scale, and why we need a new table format
* How a straightforward, elegant change in table format structure has enormous positive effects
* The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
* The resulting benefits of this architectural design

Apache Iceberg: An Architectural Look Under the Covers

ScyllaDB

The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads. As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general. This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
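The min/max skipping mentioned in this abstract can be sketched in a few lines. The example below is a toy model in plain Python, not the Parquet file format or any Parquet library; `build_row_groups` and `scan_greater_than` are hypothetical helpers, and each "row group" is just a dictionary holding its values plus min/max statistics.

```python
from typing import Dict, List, Tuple


def build_row_groups(values: List[int], size: int) -> List[Dict]:
    """Split values into fixed-size groups and record min/max statistics for each."""
    groups = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def scan_greater_than(groups: List[Dict], threshold: int) -> Tuple[List[int], int]:
    """Return (values > threshold, number of row groups actually read)."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # statistics prove no value here can match, so skip the read
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read


# Sorted input gives each group a narrow, non-overlapping min/max range.
groups = build_row_groups(list(range(100)), size=25)
matches, groups_read = scan_greater_than(groups, 80)
```

With sorted input, only the last of the four groups has to be read for the predicate `value > 80`. If the same values were shuffled, every group's range would likely span the threshold and nothing could be skipped, which is why data layout matters for this optimization.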

The Parquet Format and Performance Optimization Opportunities

It’s very easy to be distracted by the latest and greatest approaches with technology, but sometimes there’s a reason old approaches stand the test of time. Star Schemas & Kimball is one of those things that isn’t going anywhere, but as we move towards the “Data Lakehouse” paradigm – how appropriate is this modelling technique, and how can we harness the Delta Engine & Spark 3.0 to maximise its performance?

Achieving Lakehouse Models with Spark 3.0

In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPC-DS queries.
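The mechanism described here, filtering the dimension table first and reusing the surviving join keys to decide which fact partitions to read, can be sketched with plain dictionaries. This is a toy model and not Spark's implementation; the tables, the column names, and `join_with_pruning` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Fact table partitioned by date_id: partition key -> rows of (order, amount).
fact_partitions: Dict[int, List[Tuple[str, float]]] = {
    1: [("order-1", 10.0), ("order-2", 5.0)],
    2: [("order-3", 7.5)],
    3: [("order-4", 12.0)],
}

# Dimension table keyed by date_id.
dim_dates: Dict[int, Dict[str, int]] = {
    1: {"year": 2019},
    2: {"year": 2020},
    3: {"year": 2020},
}


def join_with_pruning(
    fact: Dict[int, List[Tuple[str, float]]],
    dim: Dict[int, Dict[str, int]],
    dim_predicate: Callable[[Dict[str, int]], bool],
) -> Tuple[List[Tuple[str, float, int]], List[int]]:
    """Join fact and dimension rows, scanning only partitions that can match."""
    # Step 1: filter the small dimension table (the side that would be broadcast).
    selected = {key: row for key, row in dim.items() if dim_predicate(row)}
    # Step 2: reuse the filtered keys at run time to pick the fact partitions to scan.
    scanned = [key for key in fact if key in selected]
    # Step 3: join only the rows from the surviving partitions.
    joined = [
        (order, amount, selected[key]["year"])
        for key in scanned
        for order, amount in fact[key]
    ]
    return joined, scanned


joined, scanned = join_with_pruning(
    fact_partitions, dim_dates, lambda row: row["year"] == 2020
)
```

Here the filter `year == 2020` is only known to select `date_id` 2 and 3 once the dimension table has been read, so partition 1 of the fact table is never scanned even though the query text contains no predicate on `date_id`.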

Architecting a datalake

Laurent Leturgez

Dynamic Partition Pruning in Apache Spark

Apache Iceberg - A Table Format for Huge Analytic Datasets

Alluxio, Inc.

Some Iceberg Basics for Beginners (CDP).pdf

Michael Kogan

Introducing Databricks Delta

Intro to Delta Lake

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse. Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today. Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow. This is an educational event. Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

Data Lakehouse Symposium | Day 4

Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet’s features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he’s learned, creating the missing guide you need. Topics include: * The tools and techniques Netflix uses to analyze Parquet tables * How to spot common problems * Recommendations for Parquet configuration settings to get the best performance out of your processing platform * The impact of this work in speeding up applications like Netflix’s telemetry service and A/B testing platform

Parquet performance tuning: the missing guide

Ryan Blue

Organizations are increasingly exploring lakehouse architectures with Databricks to combine the best of data lakes and data warehouses. Databricks SQL Analytics introduces new innovation on the “house” to deliver data warehousing performance with the flexibility of data lakes. The lakehouse supports a diverse set of use cases and workloads that require distinct considerations for data access. On the lake side, tables with sensitive data require fine-grained access controls that are enforced across the raw data and derivative data products via feature engineering or transformations. On the house side, tables can require fine-grained data access such as row-level segmentation for data sharing, and additional transformations using analytics engineering tools. On the consumption side, there are additional considerations for managing access from popular BI tools such as Tableau, Power BI or Looker. The product team at Immuta, a Databricks partner, will share their experience building data access governance solutions for lakehouse architectures across different data lake and warehouse platforms to show how to set up data access for common scenarios for Databricks teams new to SQL Analytics.

Considerations for Data Access in the Lakehouse

Every business today wants to leverage data to drive strategic initiatives with machine learning, data science and analytics, but runs into challenges from siloed teams, proprietary technologies and unreliable data. That’s why enterprises are turning to the lakehouse: it offers a single platform to unify all your data, analytics and AI workloads. Join our How to Build a Lakehouse technical training, where we’ll explore how to use Apache Spark™, Delta Lake, and other open source technologies to build a better lakehouse. This virtual session will include concepts, architectures and demos. Here’s what you’ll learn in this 2-hour session:

* How Delta Lake combines the best of data warehouses and data lakes for improved data reliability, performance and security
* How to use Apache Spark and Delta Lake to perform ETL processing, manage late-arriving data, and repair corrupted data directly on your lakehouse

Free Training: How to Build a Lakehouse

What's hot (20)

Iceberg: a fast table format for S3

Introduction to Dremio

Viewers also liked

Apache Calcite: One planner fits all

Data Science Languages and Industry Analytics

Wes McKinney

The twins that everyone loved too much

Apache Arrow - An Overview

Data comes in many shapes and sizes, and every company struggles to find ways to transform, validate, and enrich data for multiple purposes. The problem has been around as long as data, and the market has an overwhelming number of options. In this presentation we look at the problem and key options from vendors in the market today. Dremio is a new approach that eliminates the need for stand alone data prep tools.

Options for Data Prep - A Survey of the Current Market

Bi on Big Data - Strata 2016 in London

Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we evolve it so that it can optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt. We talk about new algorithms we developed for gathering statistics on massive databases, and how we infer and evolve the data model based on the queries, suggesting materialized views that will make your queries run faster without you changing them. A talk given by Julian Hyde at DataEngConf NYC, Columbia University, on 2017/10/30.

Don’t optimize my queries, optimize my data!

Apache Arrow is designed to make things faster. It is focused on speeding communication between systems as well as processing within any one system. In this talk I'll start by discussing what Arrow is and why it was built. This will include covering an overview of the key components, goals, vision and current state. I’ll then take the audience through a detailed engineering review of how we used Arrow to solve several problems when building the Apache-Licensed Dremio product. This will include talking about Arrow performance characteristics, working with Arrow APIs, managing memory, sizing Arrow vectors, and moving data between processes and/or nodes. We’ll also review several code examples of specific data processing implementations and how they interact with Arrow data. Lastly we’ll spend a short amount of time on what’s next for Arrow. This will be a highly technical talk targeted towards people building data infrastructure systems and complex workflows.

Apache Arrow: In Theory, In Practice

Enterprise data is moving into Hadoop, but some data has to stay in operational systems. Apache Calcite (the technology behind Hive’s new cost-based optimizer, formerly known as Optiq) is a query-optimization and data federation technology that allows you to combine data in Hadoop with data in NoSQL systems such as MongoDB and Splunk, and access it all via SQL. Hyde shows how to quickly build a SQL interface to a NoSQL system using Calcite. He shows how to add rules and operators to Calcite to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.

SQL on everything, in memory

Apache Calcite overview

This talk will address how a new architecture is emerging for analytics, based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK). Popular architectures like Lambda separate layers of computation and delivery and require many technologies which have overlapping functionality. Some of this results in duplicated code, untyped processes, or high operational overhead, let alone the cost (i.e. ETL). I will discuss the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. We will cover how the particular set of technologies addresses common requirements and how collaboratively they work together to enrich and reinforce each other.

Similar to Building a Virtual Data Lake with Apache Arrow

Streaming Analytics with Spark, Kafka, Cassandra and Akka

Helena Edelson

In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015

Iulia Emanuela Iancuta

Data modeling trends for analytics

Ike Ellis

Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson

Spark Summit

Tech Spark Presentation

Stephen Borg

How do you go from a strictly typed object-based streaming pipeline with simple operations to a structured streaming pipeline with higher order complex relational operations? This is what the Data Engineering team did at GoPro to scale up the development of streaming pipelines for the rapidly growing number of devices and applications. When big data frameworks such as Hadoop first came to exist, developers were happy because we could finally process large amounts of data without writing complex multi-threaded code or worse yet writing complicated distributed code. Unfortunately, only very simple operations were available such as map and reduce. Almost immediately, higher level operations were desired similar to relational operations. And so Hive and dozens (hundreds?) of SQL-based big data tools became available for more developer-efficient batch processing of massive amounts of data. In recent years, big data has moved from batch processing to stream-based processing since no one wants to wait hours or days to gain insights. Dozens of stream processing frameworks exist today and the same trend that occurred in the batch-based big data processing realm has taken place in the streaming world, so that nearly every streaming framework now supports higher level relational operations. In this talk, we will discuss in a very hands-on manner how the streaming data pipelines for GoPro devices and apps have moved from the original Spark streaming with its simple RDD-based operations in Spark 1.x to Spark's structured streaming with its higher level relational operations in Spark 2.x. We will talk about the differences, advantages, and necessary pain points that must be addressed in order to scale relational-based streaming pipelines for massive IoT streams. We will also talk about moving from “hand built” Hadoop/Spark clusters running in the cloud to using a Spark-based cloud service. DAVID WINTERS, Big Data Architect, GoPro and HAO ZOU, Senior Software Engineer, GoPro

Adding structure to your streaming pipelines: moving from Spark streaming to ...

DataWorks Summit

Meta scale kognitio hadoop webinar

Kognitio

Intake at AnacondaCon

Martin Durant

Spark_Intro_Syed_Academy

Syed Hadoop

Real Time Big Data Processing on AWS

Caserta

Nisha talagala keynote_inflow_2016

Nisha Talagala

Big Data Introduction - Solix empower

Durga Gadiraju

Big data berlin

kammeyer

Lambda architectures, data warehouses, data lakes, on-premise Hadoop deployments, elastic Cloud architecture… We’ve had to deal with most of these at one point or another in our lives when working with data. At Databricks, we have built data pipelines, which leverage these architectures. We work with hundreds of customers who also build similar pipelines. We observed some common pain points along the way: the HiveMetaStore can easily become a bottleneck, S3’s eventual consistency is annoying, file listing anywhere becomes a bottleneck once tables exceed a certain scale, there’s not an easy way to guarantee atomicity – garbage data can make it into the system along the way. The list goes on and on. Fueled with the knowledge of all these pain points, we set out to make Structured Streaming the engine to ETL and analyze data. In this talk, we will discuss how we built robust, scalable, and performant multi-cloud data pipelines leveraging Structured Streaming, Databricks Delta, and other specialized features available in Databricks Runtime such as file notification based streaming sources and optimizations around Databricks Delta leveraging data skipping and Z-Order clustering. You will walk away with the essence of what to consider when designing scalable data pipelines with the recent innovations in Structured Streaming and Databricks Runtime.

Designing and Building Next Generation Data Pipelines at Scale with Structure...

Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need for rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.

20160331 sa introduction to big data pipelining berlin meetup 0.3

Simon Ambridge

DoneDeal - AWS Data Analytics Platform

martinbpeters

From Pipelines to Refineries: scaling big data applications with Tim Hunter