Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache

Dremio Corporation
© 2017 Dremio Corporation @DremioHQ
Using Apache Arrow, Calcite and Parquet to build a
Relational Cache
Halloween 2017
@DataEngConf
Jacques Nadeau
Who?
Jacques Nadeau
@intjesus
• CTO & Co-founder of Dremio
• Apache member
• VP Apache Arrow
• PMCs: Arrow, Calcite, Incubator, Heron (incubating)
Agenda
• Tech Backgrounder
• Caching Techniques
• Relational Caching In Depth
• Definition and Matching
• Dealing with Updates
• Closing Words
Tech Backgrounder
What is Apache Arrow
• Columnar In-memory Data processing
library
• Designed to work with any programming
language
• Support for both relational and complex
data as-is
• Used by Pandas, Spark, Dremio
What is Apache Calcite
• SQL parser, Relational Algebra &
Optimizer
• Understands Materialized Views and
Lattices
• Used by many to add SQL functionality
including Apex, Drill, Hive, Flink, Kylin,
Phoenix, Samza, Storm, Cascading &
Dremio
What is Apache Parquet
• OSS implementation of Google Dremel
disk format for complex columnar data
• Supports a high level of data-aware
columnar compression and vectorized
columnar readback
• De facto standard for analytical data on
disk in the Big Data ecosystem
Caching Techniques
What does Caching Mean?
• Caching: Reduce the distance to data (DTD).
• Distance: How much time and how many resources does it
take to access data?
– How fast is the medium? How near is it?
– Is the data designed for efficient consumption?
– How similar is the data to what you need to answer a
question?
[Figure: Ways to reduce DTD: Perf & Proximity, Relevance, Consumability]
Types of Caching
• In-Memory File Pinning
• Columnar Disk Caching
• In-Memory Block Caching
• Near-CPU Data Caching
• Cube Relational Caching
• Arbitrary Relational Caching
In-Memory File Pinning
• Hold a File in Memory for frequent retrieval
• Pros
– Simple, standard and well-defined interface
– Improves the performance of the medium.
– If your performance is primarily bound by disk IO,
this might be a good option.
• Cons
– File structure not necessarily best in-memory
structure.
– Data manipulation almost always requires a copy of
data to also be held in memory (because the file
format is not directly consumable).
Columnar Disk Caching
• Store the data in an optimized columnar
format.
• Pros
– Better compression reduces IO
– Good structure improves processing
– Benefits selective workloads (those that need
only a subset of the columns)
• Cons
– Requires duplicating data
– Typically manual/semi-automated (e.g.
MapReduce/Spark to ETL persist/update)
In-Memory Block Caching
• Maintain portions of on-disk data in
Memory (e.g. Linux page cache, HBase
block cache)
• Pros
– Very mature and usually available for free
• Cons
– Not easy to control/influence.
– Very disconnected from workloads.
Near-CPU Data Caching (memory or disk)
• Hold the data directly in a representation that can
be processed without restructuring (e.g. Arrow
format)
• Pros
– Processing can be done without interpretation of
format
– Very efficient to consume
– Possible to consume data by multiple consumers
without duplicating memory
• Cons
– Larger than compressed formats
– Requires applications to agree on format
Cube-Based Relational Caching
• Create several partially aggregated cuboids that can
satisfy a range of aggregation queries
• Pros
– Low-latency performance for common aggregate
query patterns
– Cube storage requirements can be small fraction of
original dataset size
• Cons
– Analysis latency is bi-modal: cube hit is great but a
miss is either unserved or served slowly
– Difficult or impossible to satisfy arbitrary queries
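The hit-or-miss behavior of a cube can be sketched in a few lines. This is a minimal illustration only, not how any particular cube engine works: the helper names `build_cuboids` and `query`, the `sales` rows, and the choice of SUM as the only measure are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def build_cuboids(rows, dimensions, measure):
    """Pre-aggregate SUM(measure) for every subset of the given dimensions."""
    cuboids = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            totals = defaultdict(int)
            for r in rows:
                totals[tuple(r[d] for d in dims)] += r[measure]
            cuboids[frozenset(dims)] = (dims, dict(totals))
    return cuboids

def query(cuboids, group_by):
    """Return the cached aggregate, or None on a cube miss."""
    hit = cuboids.get(frozenset(group_by))
    if hit is None:
        return None
    dims, totals = hit
    # Reorder key columns to the order the caller asked for.
    idx = [dims.index(d) for d in group_by]
    return {tuple(key[i] for i in idx): v for key, v in totals.items()}

# Hypothetical sample data.
sales = [
    {"region": "east", "product": "p1", "day": 1, "amount": 10},
    {"region": "east", "product": "p2", "day": 1, "amount": 5},
    {"region": "west", "product": "p1", "day": 2, "amount": 7},
]
cube = build_cuboids(sales, ["region", "product"], "amount")
by_region = query(cube, ["region"])  # cube hit: served from a small cuboid
by_day = query(cube, ["day"])        # cube miss: "day" is not a cube dimension
```

The miss case returns `None` here; a real system would have to fall back to the original data, which is the slow mode of the bi-modal latency noted above.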
Arbitrary Relational Caching
• Create arbitrary data fragments combined
with partitioning and sorting schemes to
speed any query
• Pros
– Base case is easy to understand
– Can improve the performance of any query
• Cons
– Complex to match to arbitrary queries
– Can be large depending on needs
Types of Caching: The combination we found useful
• In-Memory File Pinning
– Too non-specific given memory scarcity
• ✔ Columnar Disk Caching
– Make sure everything is in Parquet (for any non-ephemeral data)
• ✔ In-Memory Block Caching
– Leverage existing page-cache, avoid additional memory cache layers
• ✔ Near-CPU Data Caching
– Used primarily for ephemeral/short-term persistence to avoid overhead
• ✔ Cube Relational Caching
– Useful for aggregation patterns
• ✔ Arbitrary Relational Caching
– Useful for unusual aggregation and non-aggregation needs
Relational Caching In Depth
Relational Algebra Refresher
• Relations: Source of data (a table)
• Operators: Define a set of transformations
– Join, Project, Scan, Filter, Aggregate, Window, etc
• Properties: Defining traits of data at a particular
relation
– Sorted by X, Hash distributed by Y, etc.
• Rules: Define equivalences between
collections of operators
– Project > Filter can be changed to Filter > Project; a scan
doesn’t need to project columns that aren’t used later;
etc.
• Graph/Tree: A collection of operators that define a
particular dataset in a DAG
[Figure: two equivalent operator trees built from Project, Filter and Scan, with Project and Filter in swapped order]
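The Project/Filter rule from this slide can be checked on toy data. The sketch below is an illustration under assumptions, not Calcite's implementation: rows are plain dicts, and the operator functions `scan`, `project` and `filter_` are hypothetical stand-ins for relational operators.

```python
# Rows are dicts; each operator takes and returns a list of rows.
def scan(table):
    return list(table)

def project(rows, columns):
    return [{c: r[c] for c in columns} for r in rows]

def filter_(rows, predicate):
    return [r for r in rows if predicate(r)]

# Hypothetical table t1.
t1 = [
    {"a": 1, "b": "x", "c": 5},
    {"a": 2, "b": "y", "c": 15},
    {"a": 3, "b": "z", "c": 7},
]

def pred(row):
    return row["c"] < 10

# Plan 1: Project above Filter above Scan.
plan1 = project(filter_(scan(t1), pred), ["a", "c"])

# Plan 2: Filter above Project above Scan. The swap is valid here because
# the projection keeps column c, which the predicate needs.
plan2 = filter_(project(scan(t1), ["a", "c"]), pred)
```

Both plans produce the same rows, which is what lets an optimizer treat them as interchangeable and pick the cheaper one.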
Relational Caching: Basic Concept
• Store derived data that sits
between what you want
and the original dataset
• Shortens Distance to
Data (DTD)
• Reduces resource
requirements & latency
[Figure: Original Data feeds a Persisted Shared Intermediate State, which serves three "What you Want" results. Labels: original DTD, new DTD, cost reduction.]
You Probably Already Do This!
Data Alternatives (Manually Created)
• Sessionized
• Cleansed
• Partitioned by time or region
• Summarized for a particular
purpose
Users Choose Depending on Need
• Analysts trained on using different
tables depending on use case
• Custom datasets built for
reporting
• Summarization and/or extraction
for dashboards
Benefit of Relational Caching over “Copy and Pick”
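[Figure: side-by-side comparison of “Copy and Pick” and Relational Caching. Layers shown: Source Table, Physical Optimizations (transform, sort, partition, aggregate), Logical Model, and a layer marked "????".]
• “Copy and Pick”
– User picks best optimization
– Admin picks and manages maintenance
• Relational Caching
– Cache picks best optimization
– Cache maintains representations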
Key Components of Relational Caching
• How to Express Transformations/States: SQL
• Hold and Match Relational algebra: Calcite
• Persist alternative datasets: Parquet
• A way to process: Arrow + Sabot
• And a lot of code to put it all together…
Our Approach
[Figure: architecture diagram. Components: End User Queries; UI to Define Cached Patterns; Query Planner; Relational Pattern Matching System; Relational Pattern Database; Change Detection Database; Data Processing System (Sabot); Refresh System; Cache Persistence (Parquet, Arrow); Source Storage Interface (Arrow) over HDFS, S3, Elastic.]
Definition and Matching
Coming Back to Calcite
• Calcite is a Planner & Optimizer
• Comes with a prebuilt selection of
operators, rules, properties (called
traits) and ways to express relations
• Also has a basic Materialized View
facility (relevant!)
Perfect foundation for relational caching
How We Built Caching: Reflections
• Reflection: A persisted alternative view of data in Parquet
format
– Raw Reflection: Persist all records of underlying dataset, controlling
partitioning and sortedness
– Aggregate Reflection: Persist a partially aggregated dataset based on a
selection of dimensions and measures, still controlling partitioning and
sortedness
• Reflections can be built on either source tables or arbitrarily
defined Virtual Datasets
Cache Matching: Aggregation Rollup
Given a user query, try to create an alternative version of the
query that matches the cached target.
[Figure: three plan trees; operators listed per tree]
User Query: P(a,c), F(c’ < 10), A(a, sum(c) as c’), S(t1)
Reflection Definition: A(a,b, sum(c)), S(t1)
Target Materialization: S(r1)
Alternative Plan: F(c’ < 10), A(a, sum(c) as c’), S(r1)
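The rollup match on this slide can be verified on a small example: a reflection that stores sum(c) grouped by (a, b) can answer a query that groups by a alone, because sums can be re-aggregated. The sketch below is only an illustration of that equivalence, not the matching algorithm itself; the helper `aggregate_sum`, the sample rows, and the column names `sum_c` and `c_prime` are assumptions.

```python
from collections import defaultdict

def aggregate_sum(rows, keys, value, out):
    """GROUP BY keys, computing SUM(value) AS out."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[k] for k in keys)] += r[value]
    return [dict(zip(keys, key), **{out: total})
            for key, total in sorted(totals.items())]

# Hypothetical source table t1.
t1 = [
    {"a": "x", "b": 1, "c": 4},
    {"a": "x", "b": 2, "c": 3},
    {"a": "y", "b": 1, "c": 20},
    {"a": "y", "b": 1, "c": 5},
]

# Reflection definition: A(a, b, sum(c)) over S(t1), materialized as r1.
r1 = aggregate_sum(t1, ["a", "b"], "c", "sum_c")

# User query: F(c' < 10) over A(a, sum(c) as c') over S(t1).
direct = [r for r in aggregate_sum(t1, ["a"], "c", "c_prime")
          if r["c_prime"] < 10]

# Alternative plan: the same operators, but scanning r1 and summing the
# partial sums instead of the raw values.
alternative = [r for r in aggregate_sum(r1, ["a"], "sum_c", "c_prime")
               if r["c_prime"] < 10]
```

The rewrite works for SUM because partial sums add up to the total. Measures that do not re-aggregate this way would need a different treatment.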
Cache Matching: Join/Aggregation Transposition
[Figure: three plan trees; operators listed per tree]
User Query: A(a, sum(c) as c’), Join(t1.id=t2.id), S(t1), S(t2)
Reflection Definition: A(id, sum(c)), S(t1)
Target Materialization: S(r1)
Alternative Plan: A(a, sum(c) as c’), Join(r1.id=t2.id), S(r1), S(t2)
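The transposition on this slide relies on the fact that, for SUM, aggregating t1 by the join key before the join gives the same final totals as aggregating after it. A minimal sketch of that equivalence follows; the helpers `sum_by` and `join_on_id` and the sample tables are hypothetical, and the example assumes column c comes from t1 and column a from t2.

```python
from collections import defaultdict

def sum_by(rows, key, value):
    """GROUP BY key, SUM(value); returns {key_value: total}."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)

def join_on_id(left, right):
    """Inner equi-join on the "id" column."""
    by_id = defaultdict(list)
    for r in right:
        by_id[r["id"]].append(r)
    return [{**l, **r} for l in left for r in by_id.get(l["id"], [])]

# Hypothetical tables: t1 carries the measure c, t2 carries the dimension a.
t1 = [{"id": 1, "c": 10}, {"id": 1, "c": 5}, {"id": 2, "c": 7}, {"id": 3, "c": 1}]
t2 = [{"id": 1, "a": "x"}, {"id": 2, "a": "x"}, {"id": 3, "a": "y"}]

# User query: A(a, sum(c)) over Join(t1.id = t2.id).
direct = sum_by(join_on_id(t1, t2), "a", "c")

# Reflection: A(id, sum(c)) over S(t1), materialized as r1.
r1 = [{"id": i, "sum_c": s} for i, s in sum_by(t1, "id", "c").items()]

# Alternative plan: join the smaller r1 to t2, then finish the aggregation.
alternative = sum_by(join_on_id(r1, t2), "a", "sum_c")
```

The benefit is that r1 has at most one row per id, so the join in the alternative plan processes fewer rows than the join against t1.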
Cache Matching: Costing and Partitioning Benefits
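[Figure: one user query and two candidate materializations]
User Query: F(a), S(t1)
Target Materialization 1: S(t1) stored as S(r1), Part by a
Target Materialization 2: S(t1) stored as S(r1), Part by b
Alternative Plan: S(r1), pruned on a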
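When several materializations can answer the same query, the planner can cost each one and keep the cheapest. The sketch below uses "rows read" as a stand-in cost and shows why the copy partitioned by the filter column wins. The cost model, the names `part_by_a` and `part_by_b`, and the data are assumptions for illustration only.

```python
from collections import defaultdict

def partition_by(rows, column):
    """Split rows into partitions keyed by the value of `column`."""
    parts = defaultdict(list)
    for r in rows:
        parts[r[column]].append(r)
    return dict(parts)

def scan_cost(partitions, part_column, filter_column, filter_value):
    """Rows that must be read to answer `filter_column = filter_value`."""
    if part_column == filter_column:
        # Pruning: only the matching partition is read.
        return len(partitions.get(filter_value, []))
    # No pruning possible: every partition is read.
    return sum(len(rows) for rows in partitions.values())

# Hypothetical source table with 100 rows.
t1 = [{"a": i % 4, "b": i % 2, "v": i} for i in range(100)]

# Two materializations of the same data with different partitioning.
reflections = {
    "part_by_a": ("a", partition_by(t1, "a")),
    "part_by_b": ("b", partition_by(t1, "b")),
}

# Cost the user query F(a = 3) against each candidate and keep the cheapest.
costs = {name: scan_cost(parts, col, "a", 3)
         for name, (col, parts) in reflections.items()}
best = min(costs, key=costs.get)
```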
Relational Matching, Other Examples
• Physical Property Matching
• Predicate Promotion
• Predicate Inference
• Join Decomposition
• Join Promotion
Dealing with Updates
Refresh Management
Importance of Cache Creation Ordering
• Not all updating
orderings are equal
• Want to order
updates based on
“Refresh Graph” and
dependencies
• Multiple orders
possible, cost against
each other to
minimize update cost
Freshness Management
• Underlying data
may change
• User Should define
refresh frequency
• Separately Define
Absolute TTL
[Figure: Physical dataset, Raw Reflection and Aggregate Reflection, annotated "1H refresh" and "3H expiration"]
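Both ideas on this slide can be sketched briefly: ordering refreshes by dependency, and keeping the refresh interval separate from the absolute TTL. The dependency edges below (aggregate reflection built from the raw reflection, which is built from the physical dataset) are an assumption about the diagram, and the function `status` is a hypothetical helper. The 1-hour and 3-hour values come from the slide.

```python
from graphlib import TopologicalSorter

# Refresh graph: each key is refreshed after the datasets it is built from.
# These edges are assumed for illustration.
refresh_graph = {
    "raw_reflection": {"physical_dataset"},
    "aggregate_reflection": {"raw_reflection"},
}
order = list(TopologicalSorter(refresh_graph).static_order())

REFRESH_SECONDS = 1 * 3600     # 1H refresh
EXPIRATION_SECONDS = 3 * 3600  # 3H expiration (absolute TTL)

def status(age_seconds):
    """Classify a reflection by the age of its last successful refresh."""
    if age_seconds >= EXPIRATION_SECONDS:
        return "expired"      # too stale to serve queries
    if age_seconds >= REFRESH_SECONDS:
        return "refresh due"  # still usable, but should be rebuilt
    return "fresh"
```

A topological order only guarantees that dependencies are refreshed first. Choosing among several valid orders by cost, as the slide suggests, would need an additional cost model.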
Multiple Update Modes (Depending on Mutation Pattern)
• Full: Always rebuild reflections from scratch (highly mutating)
• Incremental (files): Incrementally builds reflections based on new
files and folders (append-only)
• Incremental (rowstores): Incrementally builds reflections based on
monotonically increasing field (append-only)
• Partitioned Refresh: Maintains reflections based on source
partitions (e.g. Filesystem directories, Hive partitions). (partially
mutating)
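The row-store incremental mode can be sketched as a high-water-mark scan: remember the largest value of the monotonically increasing field seen so far and append only rows beyond it. This is an illustration under assumptions, not the actual refresh code; `incremental_refresh`, the `id` field and the in-memory lists are hypothetical.

```python
def incremental_refresh(reflection, source_rows, field, high_water_mark):
    """Append rows whose `field` exceeds the mark; return the new mark.

    Only correct for append-only sources where `field` never decreases.
    """
    new_rows = [r for r in source_rows
                if high_water_mark is None or r[field] > high_water_mark]
    reflection.extend(new_rows)
    if new_rows:
        high_water_mark = max(r[field] for r in new_rows)
    return high_water_mark

# Hypothetical append-only source.
source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
reflection = []

mark = incremental_refresh(reflection, source, "id", None)  # initial build
source.append({"id": 3, "v": "c"})                          # new data arrives
mark = incremental_refresh(reflection, source, "id", mark)  # picks up id 3 only
```

Updates or deletes in the source would go unnoticed by this scheme, which is why the slide reserves it for append-only data and uses a full rebuild for highly mutating data.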
Closing Words
What We’ve Seen Using these Techniques
• Frequent 10x-100x+ performance improvements in multiple
workloads
• Vast reduction in resources required to achieve performance
levels
• In many cases, a reduction in disk space
– Due to avoidance of excessive unused or rarely used physical copies
Find out More and Get Involved
• Drop by my office hours (East Room Lounge - now)
• Drop by the Dremio table behind you
• Join us at @ApacheArrow meetup at @enigma_data Midtown
– Wes McKinney (creator of pandas) and me, tech deep dive
• Join the Dremio community (Relational Caching)
– github.com/dremio/dremio-oss (Apache Licensed)
– dremio.com
– community.dremio.com
• Find out more about the Building Blocks
– dev@[arrow|calcite|parquet].apache.org
– http://github.com/apache/[arrow|calcite|parquet-mr]
– http://[arrow|calcite|parquet].apache.org
• Follow @DremioHQ, @intjesus, @ApacheArrow, @ApacheCalcite,
@ApacheParquet
DSD-INT 2023 European Digital Twin Ocean and Delft3D FM - Dols
Deltares7 views
Dev-Cloud Conference 2023 - Continuous Deployment Showdown: Traditionelles CI... by Marc Müller
Dev-Cloud Conference 2023 - Continuous Deployment Showdown: Traditionelles CI...Dev-Cloud Conference 2023 - Continuous Deployment Showdown: Traditionelles CI...
Dev-Cloud Conference 2023 - Continuous Deployment Showdown: Traditionelles CI...
Marc Müller37 views
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx by animuscrm
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx
animuscrm14 views
DSD-INT 2023 Machine learning in hydraulic engineering - Exploring unseen fut... by Deltares
DSD-INT 2023 Machine learning in hydraulic engineering - Exploring unseen fut...DSD-INT 2023 Machine learning in hydraulic engineering - Exploring unseen fut...
DSD-INT 2023 Machine learning in hydraulic engineering - Exploring unseen fut...
Deltares7 views

Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache

  • 1. © 2017 Dremio Corporation @DremioHQ Using Apache Arrow, Calcite and Parquet to build a Relational Cache Halloween 2017 @DataEngConf Jacques Nadeau
  • 2. Who? Jacques Nadeau (@intjesus)
      • CTO & Co-founder of Dremio
      • Apache member
      • VP Apache Arrow
      • PMCs: Arrow, Calcite, Incubator, Heron (incubating)
  • 3. Agenda
      • Tech Backgrounder
      • Caching Techniques
      • Relational Caching In Depth
      • Definition and Matching
      • Dealing with Updates
      • Closing Words
  • 4. Tech Backgrounder
  • 5. What is Apache Arrow
      • Columnar in-memory data processing library
      • Designed to work with any programming language
      • Support for both relational and complex data as-is
      • Used by Pandas, Spark, Dremio
  • 6. What is Apache Calcite
      • SQL parser, relational algebra & optimizer
      • Understands materialized views and lattices
      • Used by many to add SQL functionality, including Apex, Drill, Hive, Flink, Kylin, Phoenix, Samza, Storm, Cascading & Dremio
  • 7. What is Apache Parquet
      • OSS implementation of the Google Dremel disk format for complex columnar data
      • Supports a high level of data-aware columnar compression and vectorized columnar readback
      • De facto standard for analytical data on disk in the Big Data ecosystem
  • 8. Caching Techniques
  • 9. What does Caching Mean?
      • Caching: Reduce the distance to data (DTD).
      • Distance: How much time and how many resources does it take to access data?
          – How fast is the medium? How near is it?
          – Is the data designed for efficient consumption?
          – How similar is the data to what you need to answer a question?
      • Ways to reduce DTD: Perf & Proximity, Relevance, Consumability
  • 10. Types of Caching
      • In-Memory File Pinning
      • Columnar Disk Caching
      • In-Memory Block Caching
      • Near-CPU Data Caching
      • Cube Relational Caching
      • Arbitrary Relational Caching
  • 11. In-Memory File Pinning
      • Hold a file in memory for frequent retrieval
      • Pros
          – Simple, standard and well-defined interface
          – Improves the performance of the medium
          – If your performance is primarily bound by disk IO, this might be a good option
      • Cons
          – File structure is not necessarily the best in-memory structure
          – Data manipulation almost always requires a copy of the data to also be held in memory (because the file format is not directly consumable)
  • 12. Columnar Disk Caching
      • Store the data in an optimized columnar format
      • Pros
          – Better compression reduces IO
          – Good structure improves processing
          – Benefits selective workloads (those that need only a subset of all columns)
      • Cons
          – Requires duplicating data
          – Typically manual/semi-automated (e.g. MapReduce/Spark to ETL persist/update)
  • 13. In-Memory Block Caching
      • Maintain portions of on-disk data in memory (e.g. Linux page cache, HBase block cache)
      • Pros
          – Very mature and usually available for free
      • Cons
          – Not easy to control/influence
          – Very disconnected from workloads
  • 14. Near-CPU Data Caching (memory or disk)
      • Hold the data directly in a representation that can be processed without restructuring (e.g. Arrow format)
      • Pros
          – Processing can be done without interpretation of the format
          – Very efficient to consume
          – Multiple consumers can read the data without duplicating memory
      • Cons
          – Larger than compressed formats
          – Requires applications to agree on a format
  • 15. Cube-Based Relational Caching
      • Create several partially aggregated cuboids that can satisfy a range of aggregation queries
      • Pros
          – Low-latency performance for common aggregate query patterns
          – Cube storage requirements can be a small fraction of the original dataset size
      • Cons
          – Analysis latency is bi-modal: a cube hit is great, but a miss is either unserved or served slowly
          – Difficult or impossible to satisfy arbitrary queries
  • 16. Arbitrary Relational Caching
      • Create arbitrary data fragments combined with partitioning and sorting schemes to speed up any query
      • Pros
          – Base case is easy to understand
          – Can improve the performance of any query
      • Cons
          – Complex to match to arbitrary queries
          – Can be large depending on needs
  • 17. Types of Caching: The combination we found useful
      • In-Memory File Pinning
          – Too non-specific given memory scarcity
      • ✔ Columnar Disk Caching
          – Make sure everything is in Parquet (for any non-ephemeral data)
      • ✔ In-Memory Block Caching
          – Leverage the existing page cache, avoid additional memory cache layers
      • ✔ Near-CPU Data Caching
          – Used primarily for ephemeral/short-term persistence to avoid overhead
      • ✔ Cube Relational Caching
          – Useful for aggregation patterns
      • ✔ Arbitrary Relational Caching
          – Useful for unusual aggregation and non-aggregation needs
  • 18. Relational Caching In Depth
  • 19. Relational Algebra Refresher
      • Relations: Source of data (a table)
      • Operators: Define a set of transformations
          – Join, Project, Scan, Filter, Aggregate, Window, etc.
      • Properties: Defining traits of data at a particular relation
          – Sorted by X, hash distributed by Y, etc.
      • Rules: Defining equality conditions between a collection of operations
          – Project > Filter can be changed to Filter > Project; a scan doesn’t need to project columns that aren’t used later; etc.
      • Graph/Tree: A collection of operators that define a particular dataset in a DAG
      • [Diagram: two equivalent operator trees built from Project, Filter and Scan]
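
The "Rules" bullet on slide 19 can be illustrated with a small sketch. This is not Calcite's API: the `Scan`, `Filter` and `Project` classes and the `push_filter_below_project` function are hypothetical names invented for this example, and the rule only handles a single-column predicate. It shows one rewrite (a Filter sitting on top of a Project becomes a Project on top of a Filter) and checks that both trees produce the same rows.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Scan:
    rows: list  # the relation itself: a list of dict rows


@dataclass
class Filter:
    child: "Node"
    column: str                          # column the predicate reads
    predicate: Callable[[object], bool]  # applied to that column's value


@dataclass
class Project:
    child: "Node"
    columns: tuple  # columns to keep


Node = Union[Scan, Filter, Project]


def execute(node):
    """Evaluate an operator tree bottom-up and return its rows."""
    if isinstance(node, Scan):
        return list(node.rows)
    if isinstance(node, Filter):
        return [r for r in execute(node.child) if node.predicate(r[node.column])]
    if isinstance(node, Project):
        return [{c: r[c] for c in node.columns} for r in execute(node.child)]
    raise TypeError(f"unknown operator: {node!r}")


def push_filter_below_project(node):
    """Rewrite Filter(Project(x)) into Project(Filter(x)).

    The rewrite is valid here because the filter column survives the
    projection, so the predicate sees the same values either way.
    """
    if (
        isinstance(node, Filter)
        and isinstance(node.child, Project)
        and node.column in node.child.columns
    ):
        project = node.child
        return Project(Filter(project.child, node.column, node.predicate), project.columns)
    return node  # rule does not apply; leave the plan unchanged


t1 = [
    {"a": "x", "b": 1, "c": 3},
    {"a": "y", "b": 1, "c": 12},
    {"a": "z", "b": 2, "c": 2},
]
plan = Filter(Project(Scan(t1), ("a", "c")), "c", lambda v: v < 10)
rewritten = push_filter_below_project(plan)
```

Both `execute(plan)` and `execute(rewritten)` return the same rows; a real planner applies many such rules and costs the resulting alternatives against each other.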
  • 20. Relational Caching: Basic Concept
      • Store derived data that is between what you want and the original dataset
      • Shortens Distance to Data (DTD)
      • Reduces resource requirements & latency
      • [Diagram: Original Data → Persisted Shared Intermediate State → What You Want (three consumers); labels: original DTD, new DTD, cost reduction]
  • 21. You Probably Already Do This!
      • Data Alternatives (Manually Created)
          – Sessionized
          – Cleansed
          – Partitioned by time or region
          – Summarized for a particular purpose
      • Users Choose Depending on Need
          – Analysts trained on using different tables depending on use case
          – Custom datasets built for reporting
          – Summarization and/or extraction for dashboards
  • 22. Benefit of Relational Caching over “Copy and Pick”
      • Both approaches place physical optimizations (transform, sort, partition, aggregate) on top of a source table; relational caching adds a logical model in front of them
      • “Copy and Pick”
          – User picks best optimization
          – Admin manages maintenance
      • Relational Caching
          – Cache picks best optimization
          – Cache maintains representations
  • 23. Key Components of Relational Caching
      • How to express transformations/states: SQL
      • Hold and match relational algebra: Calcite
      • Persist alternative datasets: Parquet
      • A way to process: Arrow + Sabot
      • And a lot of code to put it all together…
  • 24. Our Approach
      • [Architecture diagram; components: End User Queries, UI to Define Cached Patterns, Query Planner, Relational Pattern Matching System, Relational Pattern Database, Change Detection Database, Refresh System, Data Processing System (Sabot), Source Storage Interface (Arrow) over HDFS, S3 and Elastic, Cache Persistence (Parquet, Arrow)]
  • 25. Definition and Matching
  • 26. Coming Back to Calcite
      • Calcite is a Planner & Optimizer
      • Comes with a prebuilt selection of operators, rules, properties (called traits) and ways to express relations
      • Also has a basic Materialized View facility (relevant!)
      • Perfect foundation for Relational Caching
  • 27. How We Built Caching: Reflections
      • Reflection: A persisted alternative view of data in Parquet format
          – Raw Reflection: Persist all records of the underlying dataset, controlling partitioning and sortedness
          – Aggregate Reflection: Persist a partially aggregated dataset based on a selection of dimensions and measures, still controlling partitioning and sortedness
      • Reflections can be built on either source tables or arbitrarily defined Virtual Datasets
  • 28. Cache Matching: Aggregation Rollup
      • Given a user query, try to create an alternative version of the query that matches the cached target.
      • User Query (top to bottom): F(c’ < 10), A(a, sum(c) as c’), P(a,c), S(t1)
      • Reflection Definition: A(a,b, sum(c)), S(t1)
      • Target Materialization: S(r1)
      • Alternative Plan: F(c’ < 10), A(a, sum(c) as c’), S(r1)
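
The rollup on slide 28 works because a sum grouped by (a, b) can be summed again to get the sum grouped by a. The sketch below is a plain-Python illustration under that assumption, not Dremio's or Calcite's matching code; `aggregate_sum`, `can_roll_up` and the sample rows are made up for the example. It answers the user query once from `t1` and once from the smaller reflection `r1`, and the two answers agree.

```python
from collections import defaultdict


def aggregate_sum(rows, dims, measure):
    """GROUP BY dims, SUM(measure) over a list of dict rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return [dict(zip(dims, key), **{measure: total}) for key, total in totals.items()]


def can_roll_up(query_dims, reflection_dims):
    """A sum-based reflection can serve the query if it keeps every query dimension."""
    return set(query_dims) <= set(reflection_dims)


t1 = [
    {"a": "x", "b": 1, "c": 3},
    {"a": "x", "b": 2, "c": 4},
    {"a": "y", "b": 1, "c": 8},
    {"a": "y", "b": 1, "c": 5},
    {"a": "z", "b": 2, "c": 2},
]

# Reflection definition: A(a, b, sum(c)) over S(t1), persisted as r1.
r1 = aggregate_sum(t1, ("a", "b"), "c")

# User query: F(c' < 10) over A(a, sum(c) as c') over S(t1).
direct = [r for r in aggregate_sum(t1, ("a",), "c") if r["c"] < 10]

# Alternative plan: same filter and aggregate, but scanning r1 instead of t1.
assert can_roll_up(("a",), ("a", "b"))
via_reflection = [r for r in aggregate_sum(r1, ("a",), "c") if r["c"] < 10]

by_a = lambda r: r["a"]
same_answer = sorted(direct, key=by_a) == sorted(via_reflection, key=by_a)
```

Note that this only holds for measures that can be re-aggregated (sum, min, max, count via sum); an average, for example, would need the reflection to store sum and count separately.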
  • 29. Cache Matching: Join/Aggregation Transposition
      • User Query (top to bottom): A(a, sum(c) as c’), Join(t1.id=t2.id), S(t1) and S(t2)
      • Reflection Definition: A(id, sum(c)), S(t1)
      • Target Materialization: S(r1)
      • Alternative Plan: A(a, sum(c) as c’), Join(r1.id=t2.id), S(r1) and S(t2)
  • 30. Cache Matching: Costing and Partitioning Benefits
      • User Query: F(a), S(t1)
      • Target Materialization 1: S(t1) persisted as S(r1), partitioned by a
      • Target Materialization 2: S(t1) persisted as S(r1), partitioned by b
      • Chosen plan: S(r1) pruned on a
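
Slide 30's point is that when several materializations could answer a query, the planner costs them and prefers the one whose partitioning lets it skip data. A minimal sketch of that idea follows, assuming an equality filter and using "rows read" as the only cost; the function names, the candidate names `r1_by_a` and `r1_by_b`, and the data are all hypothetical.

```python
def partition_by(rows, column):
    """Group rows into partitions keyed by the value of one column."""
    parts = {}
    for row in rows:
        parts.setdefault(row[column], []).append(row)
    return parts


def scan_cost(partitions, partition_column, filter_column, value):
    """Rows that must be read to answer `filter_column == value`."""
    if partition_column == filter_column:
        # Partition pruning: only the matching partition is read.
        return len(partitions.get(value, []))
    # No pruning possible: every partition is read.
    return sum(len(p) for p in partitions.values())


t1 = [
    {"a": 1, "b": "u"}, {"a": 1, "b": "v"},
    {"a": 2, "b": "u"}, {"a": 2, "b": "v"},
    {"a": 3, "b": "u"}, {"a": 3, "b": "v"},
]

# Two candidate materializations of the same data, partitioned differently.
candidates = {
    "r1_by_a": ("a", partition_by(t1, "a")),
    "r1_by_b": ("b", partition_by(t1, "b")),
}

# User query: F(a = 2) over S(t1). Cost each candidate and keep the cheapest.
costs = {
    name: scan_cost(parts, column, "a", 2)
    for name, (column, parts) in candidates.items()
}
best = min(costs, key=costs.get)
```

Here the candidate partitioned by `a` reads 2 rows and the one partitioned by `b` reads all 6, so the former wins. A production cost model would also account for sort order, file sizes and column pruning.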
  • 31. Relational Matching, Other Examples
      • Physical Property Matching
      • Predicate Promotion
      • Predicate Inference
      • Join Decomposition
      • Join Promotion
  • 32. Dealing with Updates
  • 33. Refresh Management
      • Importance of Cache Creation Ordering
          – Not all update orderings are equal
          – Want to order updates based on the “Refresh Graph” and dependencies
          – Multiple orders are possible; cost them against each other to minimize update cost
      • Freshness Management
          – Underlying data may change
          – User should define refresh frequency
          – Separately define an absolute TTL
      • [Diagram: Physical dataset (1H refresh, 3H expiration) → Raw Reflection → Aggregate Reflection]
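
One way to picture the "Refresh Graph" on slide 33 is as a dependency graph that is refreshed in topological order, with a refresh interval and a separate absolute TTL per reflection. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+). The assumption that the aggregate reflection is built from the raw reflection, the node names, and the two helper functions are illustrative only; the 1-hour and 3-hour values come from the slide's diagram.

```python
from datetime import datetime, timedelta
from graphlib import TopologicalSorter

# Each key depends on the nodes in its set (assumed dependency chain).
depends_on = {
    "raw_reflection": {"physical_dataset"},
    "aggregate_reflection": {"raw_reflection"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(depends_on).static_order())
refresh_order = [node for node in order if node != "physical_dataset"]

REFRESH_EVERY = timedelta(hours=1)  # "1H refresh" on the slide
EXPIRE_AFTER = timedelta(hours=3)   # "3H expiration": the absolute TTL


def needs_refresh(last_refresh, now):
    """True once the refresh interval has elapsed since the last build."""
    return now - last_refresh >= REFRESH_EVERY


def is_expired(last_refresh, now):
    """True once the reflection is too stale to serve queries at all."""
    return now - last_refresh >= EXPIRE_AFTER


built_at = datetime(2017, 10, 31, 9, 0)
```

With these settings a reflection built at 09:00 is due for refresh at 10:00 but can still serve queries until 12:00, which gives the refresh system a window in which a slow or failed rebuild does not take the cache offline.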
  • 34. Multiple Update Modes (Depending on Mutation Pattern)
      • Full: Always rebuild reflections from scratch (highly mutating)
      • Incremental (files): Incrementally builds reflections based on new files and folders (append-only)
      • Incremental (rowstores): Incrementally builds reflections based on a monotonically increasing field (append-only)
      • Partitioned Refresh: Maintains reflections based on source partitions (e.g. filesystem directories, Hive partitions) (partially mutating)
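
The "Incremental (rowstores)" mode on slide 34 can be sketched as a high-water mark over the monotonically increasing field: each refresh reads only rows above the mark and folds them into the existing reflection. The code below is an illustration under that assumption, with a sum-per-key aggregate reflection; the field names `id`, `a`, `c` and both functions are hypothetical.

```python
def full_refresh(source_rows):
    """Full mode: rebuild the aggregate reflection from scratch."""
    reflection = {}
    for row in source_rows:
        reflection[row["a"]] = reflection.get(row["a"], 0) + row["c"]
    return reflection


def incremental_refresh(reflection, watermark, source_rows):
    """Incremental mode: fold in only rows whose id is above the watermark.

    Correct only for append-only sources where `id` never decreases and
    existing rows are never updated or deleted.
    """
    new_rows = [row for row in source_rows if row["id"] > watermark]
    updated = dict(reflection)
    for row in new_rows:
        updated[row["a"]] = updated.get(row["a"], 0) + row["c"]
    if new_rows:
        watermark = max(row["id"] for row in new_rows)
    return updated, watermark


source = [
    {"id": 1, "a": "x", "c": 3},
    {"id": 2, "a": "y", "c": 8},
    {"id": 3, "a": "x", "c": 4},
]
reflection, watermark = incremental_refresh({}, 0, source)

# New rows are appended to the source; the next refresh reads only those.
source += [
    {"id": 4, "a": "y", "c": 5},
    {"id": 5, "a": "z", "c": 2},
]
reflection, watermark = incremental_refresh(reflection, watermark, source)
```

After both refreshes the incrementally maintained reflection matches `full_refresh(source)`. If the source can mutate existing rows, this shortcut no longer holds, which is why the slide pairs each mode with a mutation pattern.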
  • 35. Closing Words
  • 36. What We’ve Seen Using these Techniques
      • Frequent 10x-100x+ performance improvements in multiple workloads
      • Vast reduction in resources required to achieve performance levels
      • In many cases, a reduction in disk space
          – Due to avoidance of excessive unused or rarely used physical copies
  • 37. Find out More and Get Involved
      • Drop by my office hours (East Room Lounge - now)
      • Drop by the Dremio table behind you
      • Join us at the @ApacheArrow meetup at @enigma_data Midtown
          – Wes McKinney, creator of Pandas, and myself: tech deep dive
      • Join the Dremio community (Relational Caching)
          – github.com/dremio/dremio-oss (Apache Licensed)
          – dremio.com
          – community.dremio.com
      • Find out more about the Building Blocks
          – dev@[arrow|calcite|parquet].apache.org
          – http://github.com/apache/[arrow|calcite|parquet-mr]
          – http://[arrow|calcite|parquet].apache.org
      • Follow @DremioHQ, @intjesus, @ApacheArrow, @ApacheCalcite, @ApacheParquet