Supercharge Your Data
Analytical Tool with a Rusty
Query Engine
Andrew Lamb
Staff Engineer, InfluxData
Apache Arrow PMC
DataFusion and
Apache Arrow
Daniël Heres
Data Engineer, GoDataDriven
Apache Arrow PMC
Introduction
Your Speakers
Staff Engineer @ InfluxData
Previously
● Query Optimizer @ Vertica, Oracle
Database server, embedded Compilers
● Chief Architect + VP Engineering roles
at ML startups
Andrew
Data/ML Engineer @ GoDataDriven
Previously
• Data / ML Engineer @ bol.com
• Startups
2
Daniël
Why should you
care?
3
Andrew
Recent Proliferation of Big Data systems
4
…
Recent Proliferation of Databases
5
6
What is going on?
COTS → Totally Custom
7
IT: "Buy and Operate"
● Buy software from vendors
● Operate on your own hardware, with sysadmins

FANG: "Build and Operate"
● Write software for, and operate, all components
● Optimized for exact needs

✓ Current Trend: "Assemble and Operate"
● Assemble from open source technologies
● Operate on resources in a public cloud
Apache Arrow
Multi-language toolkit for Processing and Interchange
Founded in 2016
Apache Software Foundation
Low level / foundational technology to build fast and
interoperable analytic systems
Open standard, implementations in 12+ languages
Adopted widely in industry products and open
source projects
8
“DataFusion is an extensible query
execution framework, written in Rust,
that uses Apache Arrow as its
in-memory format.”
- DataFusion Website
DataFusion: A Query Engine
9
DataFusion: A Query Engine
SQL Query
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
Data
Batches
DataFrame
ctx.read_table("http")?
.filter(...)?
.aggregate(..)?;
Data Batches
Catalog information:
tables, schemas, etc
10
Implementation timeline for a new Database system
[Diagram: features accumulate over years of development]

Client API · In memory storage · In-Memory filter + aggregation · Data Model / Type System · Arithmetic expressions · Date / time expressions · Query Language Parser · Durability / persistence · Metadata Catalog + Management · Heuristic Query Planner · Concurrency Control · Joins! · Optimized / Compressed storage · Execution on Compressed Data · Additional Client Languages · Outer Joins · Subquery support · Window functions · More advanced analytics · Cost based optimizer · Out of core algorithms · Storage Rearrangement · Distributed query execution · Resource Management · Online recovery

🤔 "Let's Build a Database" → 😃 "Look mom! I have a database!" → 😐 "Ok now this is pretty good"

11
But for Databases
🤔
12
LLVM-like Infrastructure for Databases
Inputs
Logical Plan Execution Plan
Plan Representations
(DataFlow Graphs)
Expression Eval
Optimizations /
Transformations
Optimizations /
Transformations
HashAggregate
Sort
…
Optimized Execution
Operators
(Arrow Based)
Join
Data
(Parquet, CSV, statistics, …) DataFusion
Query
(SQL, code, DataFrame, …)
Code
(UDF, UDA, etc)
Resources
(Cores, memory, etc)
13
DataFusion: Totally Customizable
Inputs
Logical Plan Execution Plan
Plan Representations
(DataFlow Graphs)
Expression Eval
Optimizations /
Transformations
Optimizations /
Transformations
HashAggregate
Sort
…
Optimized Execution
Operators
(Arrow Based)
Join
Data
(Parquet, CSV, statistics, …) DataFusion
Query
(SQL, code, DataFrame, …)
Code
(UDF, UDA, etc)
Resources
(Cores, memory, etc)
Extend ✅
Extend ✅
Extend ✅
Extend ✅
Extend ✅
Extend ✅
Extend ✅
Extend ✅
14
DataFusion Project Growth
15
[Chart: number of unique contributors over time, annotated with release markers 5.0.0, 6.0.0, 7.0.0, and 8.0.0]
DataFusion Project Growth
16
https://star-history.com/#apache/arrow-datafusion&Date
DataFusion Milestones: Time to Mature
17
5+ year labor of love
Dec 2016 Initial
DataFusion Commit By
Andy Grove
Feb 2019 Donation to
Apache Arrow
Apr/May 2022:
Subqueries
Apr 2021 Ballista is
donated to Apache
Arrow
2020-2021: (hash) joins,
window functions,
performance &
parallelization, etc.
Mar 2018 Arrow in Rust
started, DataFusion
switches to Arrow
Nov 2021
DataFusion Contrib
Overview of
Apache Arrow
DataFusion
18
Daniël
From Query to Results
ExecutionPlan
LogicalPlan
Optimize Optimize
Execute!
Query / DataFrame
19
From Query to Results
an example
20
select
count(*) num_visitors,
job_title
from
visitors
where
city = 'San Francisco'
group by
job_title
From Query to Results
datafusion package available via PyPI
21
visitors = ctx.table("visitors")
df = (
visitors.filter(col("city") == literal("San Francisco"))
.aggregate([col("job_title")], [f.count(literal(1))])
)
batches = df.collect() # collect results into memory (Arrow batches)
From Query to Results
ExecutionPlan
Optimize Optimize
Execute!
Query / DataFrame LogicalPlan
22
Logical Plan represents the what
Initial Logical Plan
SQL is parsed, then translated into an initial Logical Plan.
23
Projection: #COUNT(UInt8(1)) AS num_visitors, #visitors.job_title
Aggregate: groupBy=[[#visitors.job_title]], aggr=[[COUNT(UInt8(1))]]
Filter: #visitors.city = Utf8("San Francisco")
TableScan: visitors projection=None
select count(*) num_visitors, job_title
from visitors
where city = 'San Francisco'
group by job_title
visitors = ctx.table("visitors")
df = (
visitors.filter(col("city") == literal("San Francisco"))
.aggregate([col("job_title")], [f.count(literal(1))])
)
(Read plan from bottom to top)
Let's Optimize!
ExecutionPlan
Optimize
Execute!
Query / DataFrame LogicalPlan
Optimize
24
• Massively speed up execution times (10x, 100x, 1000x) by rewriting
queries to an equivalent, optimized version
• 14 built-in optimization passes in DataFusion, adding more each version
• Add custom optimization passes
Let's Optimize!
Projection Pushdown
Minimizing IO and processing (especially useful for formats like Parquet)
25
Projection: #COUNT(UInt8(1)) AS num_visitors, #visitors.job_title
Aggregate: groupBy=[[#visitors.job_title]], aggr=[[COUNT(UInt8(1))]]
Filter: #visitors.city = Utf8("San Francisco")
TableScan: visitors projection=None
Let's Optimize!
Projection Pushdown
Minimizing IO and processing (especially useful for formats like Parquet)
26
Projection: #COUNT(UInt8(1)) AS num_visitors, #visitors.job_title
Aggregate: groupBy=[[#visitors.job_title]], aggr=[[COUNT(UInt8(1))]]
Filter: #visitors.city = Utf8("San Francisco")
TableScan: visitors projection=Some([0, 1])
projection_push_down
Projection: #COUNT(UInt8(1)) AS num_visitors, #visitors.job_title
Aggregate: groupBy=[[#visitors.job_title]], aggr=[[COUNT(UInt8(1))]]
Filter: #visitors.city = Utf8("San Francisco")
TableScan: visitors projection=None
Let's Optimize!
Filter Pushdown
Minimizing IO and processing (especially useful for formats like Parquet)
27
Projection: #COUNT(UInt8(1)) AS num_visitors, #visitors.job_title
Aggregate: groupBy=[[#visitors.job_title]], aggr=[[COUNT(UInt8(1))]]
Filter: #visitors.city = Utf8("San Francisco")
TableScan: visitors projection=Some([0, 1])
Let's Optimize!
Filter Pushdown
Minimizing IO and processing (especially useful for formats like Parquet)
28
filter_push_down
Projection: #COUNT(UInt8(1)) AS num_visitors, #visitors.job_title
Aggregate: groupBy=[[#visitors.job_title]], aggr=[[COUNT(UInt8(1))]]
Filter: #visitors.city = Utf8("San Francisco")
TableScan: visitors projection=Some([0, 1]), partial_filters=[#visitors.city = Utf8("San Francisco")]
Projection: #COUNT(UInt8(1)) AS num_visitors, #visitors.job_title
Aggregate: groupBy=[[#visitors.job_title]], aggr=[[COUNT(UInt8(1))]]
Filter: #visitors.city = Utf8("San Francisco")
TableScan: visitors projection=Some([0, 1])
Let's Create...
The ExecutionPlan
Optimize
Execute!
Query / DataFrame LogicalPlan ExecutionPlan
Optimize
29
The Execution Plan represents the where and how
The Initial Execution Plan
30
ProjectionExec: expr=[COUNT(UInt8(1))@1 as num_visitors, job_title@0 as job_title]
HashAggregateExec: mode=FinalPartitioned, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
RepartitionExec: partitioning=Hash([Column { name: "job_title", index: 0 }], 16)
HashAggregateExec: mode=Partial, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
FilterExec: city@1 = San Francisco
CsvExec: files=[./data/visitors.csv], has_header=true, limit=None, projection=[job_title, city]
And... Optimize!
Execute!
Query / DataFrame LogicalPlan
Optimize Optimize
ExecutionPlan
31
Optimize
CoalesceBatches: Avoiding small batch sizes
32
ProjectionExec: expr=[COUNT(UInt8(1))@1 as num_visitors, job_title@0 as job_title]
HashAggregateExec: mode=FinalPartitioned, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
RepartitionExec: partitioning=Hash([Column { name: "job_title", index: 0 }], 16)
HashAggregateExec: mode=Partial, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
FilterExec: city@1 = San Francisco
CsvExec: files=[./data/visitors.csv], has_header=true, limit=None, projection=[job_title, city]
Optimize
CoalesceBatches: Avoiding small batch sizes
33
ProjectionExec: expr=[COUNT(UInt8(1))@1 as num_visitors, job_title@0 as job_title]
HashAggregateExec: mode=FinalPartitioned, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
CoalesceBatchesExec: target_batch_size=4096
RepartitionExec: partitioning=Hash([Column { name: "job_title", index: 0 }], 16)
CoalesceBatchesExec: target_batch_size=4096
FilterExec: city@1 = San Francisco
CsvExec: files=[./data/visitors.csv], has_header=true, limit=None, projection=[job_title, city]
coalesce_batches
ProjectionExec: expr=[COUNT(UInt8(1))@1 as num_visitors, job_title@0 as job_title]
HashAggregateExec: mode=FinalPartitioned, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
RepartitionExec: partitioning=Hash([Column { name: "job_title", index: 0 }], 16)
HashAggregateExec: mode=Partial, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
FilterExec: city@1 = San Francisco
CsvExec: files=[./data/visitors.csv], has_header=true, limit=None, projection=[job_title, city]
Optimize
Repartition: Introducing parallelism
34
ProjectionExec: expr=[COUNT(UInt8(1))@1 as num_visitors, job_title@0 as job_title]
HashAggregateExec: mode=FinalPartitioned, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
CoalesceBatchesExec: target_batch_size=4096
RepartitionExec: partitioning=Hash([Column { name: "job_title", index: 0 }], 16)
CoalesceBatchesExec: target_batch_size=4096
FilterExec: city@1 = San Francisco
CsvExec: files=[./data/visitors.csv], has_header=true, limit=None, projection=[job_title, city]
Optimize
Repartition: Introducing parallelism
35
ProjectionExec: expr=[COUNT(UInt8(1))@1 as num_visitors, job_title@0 as job_title]
HashAggregateExec: mode=FinalPartitioned, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
CoalesceBatchesExec: target_batch_size=4096
RepartitionExec: partitioning=Hash([Column { name: "job_title", index: 0 }], 16)
CoalesceBatchesExec: target_batch_size=4096
FilterExec: city@1 = San Francisco
CsvExec: files=[./data/visitors.csv], has_header=true, limit=None, projection=[job_title, city]
repartition
ProjectionExec: expr=[COUNT(UInt8(1))@1 as num_visitors, job_title@0 as job_title]
HashAggregateExec: mode=FinalPartitioned, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
CoalesceBatchesExec: target_batch_size=4096
RepartitionExec: partitioning=Hash([Column { name: "job_title", index: 0 }], 16)
CoalesceBatchesExec: target_batch_size=4096
FilterExec: city@1 = San Francisco
RepartitionExec: partitioning=RoundRobinBatch(16)
CsvExec: files=[./data/visitors.csv], has_header=true, limit=None, projection=[job_title, city]
Getting results
Return record batches (or write results)
36
Query / DataFrame LogicalPlan
Optimize
ExecutionPlan Execute!
Optimize
Arrow Batches
DataFusion Features
● Mostly complete SQL implementation (aggregates, joins, window
functions, etc)
● DataFrame API (Python, Rust)
● High performance vectorized, native, safe, multi-threaded execution
● Common file formats: Parquet, CSV, JSON, Avro
● Highly extensible / customizable
● Large, growing community driving project forward
37
SQL Support
https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html#supported-sql
Projection (SELECT), Filtering (WHERE), Ordering (ORDER BY), Aggregation
(GROUP BY)
Aggregation functions (COUNT, SUM, MIN, MAX, AVG, APPROX_PERCENTILE, etc)
Window functions (OVER ([PARTITION BY ...] [ORDER BY ...]))
Set functions: UNION (ALL), INTERSECT (ALL), EXCEPT
Scalar functions: string, Date/time,... (basic)
Joins (INNER, LEFT, RIGHT, FULL OUTER, SEMI, ANTI)
Subqueries, Grouping Sets
38
39
Extensibility
Customize DataFusion to your needs
User Defined Functions
User Defined Aggregates
User Defined Optimizer passes
User Defined LogicalPlan nodes
User Defined ExecutionPlan nodes
User Defined TableProvider
User Defined FileFormat
User Defined ObjectStore
Systems Powered
by DataFusion
40
Andrew
FLOCK
https://github.com/flock-lab/flock
● Overview:
○ Low-Cost Streaming Query Engine on FaaS Platforms
○ Project from UMD Database Group, runs streaming queries on AWS Lambda (x86
and arm64/graviton2).
● Use of DataFusion
○ SQL API
○ DataFrame API: To build plans
○ Optimized native plan execution
41
ROAPI
https://roapi.github.io/
● Overview:
○ read-only APIs for static datasets without code
○ columnq-cli: run sql queries against CSV files
● Use of DataFusion
○ SQL API
○ DataFrame API: to build plans for GraphQL
○ File formats: CSV, JSON, Parquet, Avro
○ Optimized native plan execution
42
VegaFusion
https://vegafusion.io/
● Overview:
○ Accelerates execution of (interactive) data
visualizations
○ Compiles Vega data transforms into
DataFusion query plans.
● Use of DataFusion:
○ DataFrame API: To build plans
○ UDFs: to implement some Vega expressions
○ Optimized native plan execution
43
Cube.js / Cube Store
https://cube.dev/
● Overview:
○ Headless Business Intelligence
○ cubestore pre-aggregation storage layer
● Use of DataFusion (fork)
○ SQL API (with custom extensions)
○ Custom Logical and Physical Operators
○ UDFs: custom functions
○ Optimized native plan execution
44
InfluxDB IOx
https://github.com/influxdata/influxdb_iox
● Overview:
○ In-memory columnar store using object storage, future
core of InfluxDB; supports SQL, InfluxQL, and Flux
○ Query and data reorganization built with DataFusion
● Use of DataFusion:
○ Table Provider: Custom data sources
○ SQL API
○ PlanBuilder API: Plans for custom query language
○ User Defined Logical and Execution Plans
○ UDFs: to implement the precise semantics of influxRPC
○ Optimized native plan execution
45
Coralogix
https://coralogix.com/
● Overview:
○ Stateful streaming analytics with machine learning enables teams to monitor and
visualize observability data in real-time before indexing
● Use of DataFusion:
○ Table Provider: custom data source
○ User Defined Logical and Execution Plans: to implement a custom query language
○ User Defined ObjectStore: for queries over data in object storage
○ UDFs: for working with semi-structured data
○ Optimized native plan execution
46
blaze-rs
https://github.com/blaze-init/blaze
● Overview:
○ High performance, low-cost native execution layer for Spark: executes the physical
operators in Rust
○ Translates Spark Exec nodes into DataFusion Execution Plans
● Use of DataFusion
○ Optimized native plan execution
○ HDFS Object Store Extension
47
Ballista Distributed Compute
https://github.com/apache/arrow-ballista
● Overview:
○ Spark-like distributed Query Engine (part of Arrow Project)
○ Adds distributed execution to DataFusion plans
● Use of DataFusion:
○ SQL API
○ DataFrame API
○ Optimized native plan execution
○ File formats: CSV, JSON, Parquet, Avro
48
What’s Next?
49
Daniël
Future Directions
● Embeddability
○ More regular releases to crates.io, more modularity
● Broader SQL features
○ Subqueries, more date/time functions, struct / array types
● Improved Performance
○ Query directly from Object Storage
○ More state-of-the-art techniques: JIT, NUMA-aware scheduling, hybrid row/columnar execution
● Ecosystem integration
○ FlightSQL, Substrait.io
○ Databases
● GPU support
50
Come Join Us
We ❤ Our Contributors
● Contributions at all levels are encouraged and welcomed.
● Learn Rust!
● Learn Database Internals!
● Have a great time with a welcoming community!
More details:
https://arrow.apache.org/datafusion/community/communication.html
51
52
Thank you
arrow.apache.org/datafusion
github.com/apache/arrow-datafusion
Andrew Lamb
Staff Engineer, InfluxData
Apache Arrow PMC
Daniël Heres
Data Engineer, GoDataDriven
Apache Arrow PMC
53
Backup Slides
54
DataFusion / Arrow / Parquet
[Diagram: dependency stack. DataFusion builds on sqlparser-rs, Arrow, and Parquet]
A Virtuous Cycle
Increased Use Drives Increased Contribution
57
Increased use of
open source systems
Increased capacity
for maintenance and
contribution
DataFusion and Apache Arrow are key open source technologies
for building interoperable open source systems
delta-rs
https://github.com/delta-io/delta-rs
● Overview:
○ Native Delta Lake implementation in Rust
● Use of DataFusion
○ Table Provider API: allows other DataFusion users
to read from Delta tables
58
DISCLAIMER: Not yet cleared / verified with project team
Cloudfuse Buzz
https://github.com/cloudfuse-io/buzz-rust
● Serverless cloud-based query engine
○ map using cloud functions (AWS Lambda)
○ aggregate using containers (AWS Fargate)
● Project (expected to be) continued from June
59
DISCLAIMER: Not yet cleared / verified with project team
dask-sql
https://github.com/dask-contrib/dask-sql
● Overview:
○ TBD
● Use of DataFusion:
○ WIP https://github.com/dask-contrib/dask-sql/issues/474
60
DISCLAIMER: Not yet cleared / verified with project team
Apache Arrow Analytics Toolkit
Where does DataFusion fit?
61
[Diagram: layers of the Arrow toolkit]
● Data Formats: Parquet ("Disk"), Arrow ("Memory")
● Low Level Calculations + Interchange: Compute Kernels, IPC, C ABI
● Runtime Subsystems: Arrow Flight, Arrow FlightSQL
● Analytics / Database Systems: DataFusion, C++ Query Engine
Native Implementations and Language Bindings; analytic systems are built using some of this stack
What is it and why do you need one?
1. Add SQL or DataFrame interface to your application’s data
2. Implement a custom query language / DSL
3. Implement a new data analytic system
4. Implement a new database system (natch)
Maps desired computations (SQL and DataFrame, a la Pandas)
to efficient calculation: projection pushdown, filter pushdown, joins,
expression simplification, parallelization, etc.
62
datafusion-python
https://github.com/datafusion-contrib/datafusion-python
● Overview:
○ Python dataframe library (modeled after pyspark)
● Use of DataFusion
○ SQL API
○ DataFrame API
○ File formats: CSV, JSON, Parquet, Avro
○ Optimized native plan execution
63
DISCLAIMER: Not yet cleared / verified with project team
Common Themes
Come for the performance, stay for the features (?)
● Native execution
● Native (non-JVM) implementations of Spark / Spark-like behavior
● SQL / DataFrame interfaces
● Projects are leveraging properties of the Rust language
64
Better, Faster, Cheaper
The DataFusion Query Engine is part of the commoditization of advanced
analytic database technologies
Will transform analytic systems over the next decade
65
Better Faster Cheaper
Andrew’s Notes
Proposal: Data + AI Summit talk
Desired Takeaways:
1. If you need a query engine (in Rust?), you should use DataFusion
Thesis: DataFusion is part of a larger trend (spearheaded by Apache Arrow) in the
commoditization of analytic database technologies, which will lead to many faster / cheaper /
better analytic systems over the next decade
Other decks for inspiration:
DataFusion: An Embeddable Query Engine Written in Rust
A Rusty Introduction to Apache Arrow and how it Applies to a Time Series Database
2021-04-20: Apache Arrow and its Impact on the Database industry.pptx 66
DataFusion and Arrow_ Supercharge Your Data Analytical Tool with a Rusty Quer...DataFusion and Arrow_ Supercharge Your Data Analytical Tool with a Rusty Quer...
DataFusion and Arrow_ Supercharge Your Data Analytical Tool with a Rusty Quer...Medcl1
 
Social media analytics using Azure Technologies
Social media analytics using Azure TechnologiesSocial media analytics using Azure Technologies
Social media analytics using Azure TechnologiesKoray Kocabas
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
 
Introduction to einstein analytics
Introduction to einstein analyticsIntroduction to einstein analytics
Introduction to einstein analyticsSteven Hugo
 
Introduction to einstein analytics
Introduction to einstein analyticsIntroduction to einstein analytics
Introduction to einstein analyticsJan Vandevelde
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Tech Community
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Tech Community
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...Amazon Web Services
 
How to Design Reports and Data Visualizations Your Users Love
How to Design Reports and Data Visualizations Your Users LoveHow to Design Reports and Data Visualizations Your Users Love
How to Design Reports and Data Visualizations Your Users LoveTIBCO Jaspersoft
 
Dive Into Azure Data Lake - PASS 2017
Dive Into Azure Data Lake - PASS 2017Dive Into Azure Data Lake - PASS 2017
Dive Into Azure Data Lake - PASS 2017Ike Ellis
 
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...WSO2
 
IFESFinal58
IFESFinal58IFESFinal58
IFESFinal58anish h
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB
 
Workshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformWorkshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformGoDataDriven
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
German introduction to sp framework
German   introduction to sp frameworkGerman   introduction to sp framework
German introduction to sp frameworkBob German
 

Similar to DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query-Engine_iteblog.pdf (20)

DataFusion and Arrow_ Supercharge Your Data Analytical Tool with a Rusty Quer...
DataFusion and Arrow_ Supercharge Your Data Analytical Tool with a Rusty Quer...DataFusion and Arrow_ Supercharge Your Data Analytical Tool with a Rusty Quer...
DataFusion and Arrow_ Supercharge Your Data Analytical Tool with a Rusty Quer...
 
Social media analytics using Azure Technologies
Social media analytics using Azure TechnologiesSocial media analytics using Azure Technologies
Social media analytics using Azure Technologies
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
 
Introduction to einstein analytics
Introduction to einstein analyticsIntroduction to einstein analytics
Introduction to einstein analytics
 
Introduction to einstein analytics
Introduction to einstein analyticsIntroduction to einstein analytics
Introduction to einstein analytics
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needs
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needs
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
How to Design Reports and Data Visualizations Your Users Love
How to Design Reports and Data Visualizations Your Users LoveHow to Design Reports and Data Visualizations Your Users Love
How to Design Reports and Data Visualizations Your Users Love
 
Dive Into Azure Data Lake - PASS 2017
Dive Into Azure Data Lake - PASS 2017Dive Into Azure Data Lake - PASS 2017
Dive Into Azure Data Lake - PASS 2017
 
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
 
Workflow Engines + Luigi
Workflow Engines + LuigiWorkflow Engines + Luigi
Workflow Engines + Luigi
 
IFESFinal58
IFESFinal58IFESFinal58
IFESFinal58
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
 
Workshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformWorkshop on Google Cloud Data Platform
Workshop on Google Cloud Data Platform
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
German introduction to sp framework
German   introduction to sp frameworkGerman   introduction to sp framework
German introduction to sp framework
 

Recently uploaded

jll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdfjll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdfjaytendertech
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjadimosmejiaslendon
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?RemarkSemacio
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Klinik Aborsi
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444saurabvyas476
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证zifhagzkk
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样jk0tkvfv
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeBoston Institute of Analytics
 
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样wsppdmt
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...mikehavy0
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024patrickdtherriault
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证acoha1
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxronsairoathenadugay
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...yulianti213969
 
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...vershagrag
 

Recently uploaded (20)

jll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdfjll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdf
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
 
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
 
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...
 

DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query-Engine_iteblog.pdf

  • 11. Implementation timeline for a new Database system
From "Lets Build a Database" 🤔, via "Look mom! I have a database!" 😃, to "Ok now this is pretty good" 😐, a new database needs, among other things:
Client API; Query Language Parser; Data Model / Type System; In-memory storage; In-memory filter + aggregation; Arithmetic expressions; Date / time expressions; Metadata Catalog + Management; Durability / persistence; Concurrency Control; Optimized / Compressed storage; Execution on Compressed Data; Joins!; Outer Joins; Subquery support; Window functions; More advanced analytics; Heuristic Query Planner; Cost based optimizer; Out of core algorithms; Storage Rearrangement; Online recovery; Distributed query execution; Resource Management; Additional Client Languages
  • 13. LLVM-like Infrastructure for Databases
Inputs to DataFusion: Query (SQL, code, DataFrame, …), Data (Parquet, CSV, statistics, …), Code (UDF, UDA, etc), Resources (Cores, memory, etc)
Plan Representations (DataFlow Graphs): Logical Plan → Optimizations / Transformations → Execution Plan → Optimizations / Transformations
Optimized Execution Operators (Arrow Based): Expression Eval, HashAggregate, Join, Sort, …
  • 14. DataFusion: Totally Customizable
The same architecture, with extension points ✅ at every layer: the inputs (Query, Data, Code, Resources), the Logical and Execution Plan representations and their Optimizations / Transformations, and the Arrow-based execution operators (Expression Eval, HashAggregate, Join, Sort, …).
  • 17. DataFusion Milestones: Time to Mature
A 5+ year labor of love:
Dec 2016: Initial DataFusion commit by Andy Grove
Mar 2018: Arrow in Rust started; DataFusion switches to Arrow
Feb 2019: Donation to Apache Arrow
2020–2021: (hash) joins, window functions, performance & parallelization, etc.
Apr 2021: Ballista is donated to Apache Arrow
Nov 2021: DataFusion Contrib
Apr/May 2022: Subqueries
  • 19. From Query to Results
Query / DataFrame → LogicalPlan → Optimize → ExecutionPlan → Optimize → Execute!
  • 20. From Query to Results: an example

    select count(*) num_visitors, job_title
    from visitors
    where city = 'San Francisco'
    group by job_title
  • 21. From Query to Results
The datafusion package is available via PyPI:

    visitors = ctx.table("visitors")
    df = (
        visitors.filter(col("city") == literal("San Francisco"))
        .aggregate([col("job_title")], [f.count(literal(1))])
    )
    batches = df.collect()  # collect results into memory (Arrow batches)
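As a plain-Python sketch (no DataFusion dependency; the `visitors` rows below are invented sample data), the filter-then-aggregate this query describes amounts to:

```python
from collections import Counter

# Sample rows standing in for the visitors table (hypothetical data).
visitors = [
    {"job_title": "Engineer", "city": "San Francisco"},
    {"job_title": "Engineer", "city": "San Francisco"},
    {"job_title": "Designer", "city": "San Francisco"},
    {"job_title": "Engineer", "city": "New York"},
]

# WHERE city = 'San Francisco'
filtered = [row for row in visitors if row["city"] == "San Francisco"]

# GROUP BY job_title with COUNT(*)
num_visitors = Counter(row["job_title"] for row in filtered)

print(dict(num_visitors))  # {'Engineer': 2, 'Designer': 1}
```

The query engine computes the same answer, but over Arrow batches, in parallel, and after plan optimization.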
  • 22. From Query to Results
Query / DataFrame → LogicalPlan → Optimize → ExecutionPlan → Optimize → Execute!
The Logical Plan represents the what.
  • 23. Initial Logical Plan
SQL (or a DataFrame) is parsed, then translated into an initial Logical Plan — read the plan from bottom to top:

    Projection: #COUNT(UInt8(1)) AS num_visitors, #visitors.job_title
      Aggregate: groupBy=[[#visitors.job_title]], aggr=[[COUNT(UInt8(1))]]
        Filter: #visitors.city = Utf8("San Francisco")
          TableScan: visitors projection=None
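The nested, bottom-up structure of such a plan can be mimicked with a toy node type (a deliberate simplification; DataFusion's real LogicalPlan is a much richer Rust enum):

```python
from dataclasses import dataclass
from typing import Optional

# A toy logical-plan node: each operator wraps the operator that feeds it.
@dataclass
class Node:
    name: str
    child: Optional["Node"] = None

def display(plan: Node, indent: int = 0) -> str:
    """Render the plan top-down with indentation, like EXPLAIN output."""
    line = "  " * indent + plan.name
    if plan.child is None:
        return line
    return line + "\n" + display(plan.child, indent + 1)

plan = Node("Projection: num_visitors, job_title",
        Node("Aggregate: groupBy=[job_title], aggr=[COUNT(1)]",
         Node("Filter: city = 'San Francisco'",
          Node("TableScan: visitors"))))

print(display(plan))
```

Execution flows from the leaf (the scan) upward, which is why the plan reads bottom to top.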
  • 24. Let's Optimize!
• Massively speed up execution times (10x, 100x, 1000x) by rewriting queries into an equivalent, optimized version
• 14 built-in optimization passes in DataFusion, with more added each version
• Custom optimization passes can be added
  • 25. Let's Optimize! Projection Pushdown
Minimizes IO (especially useful for formats like Parquet) and processing:

    Projection: #COUNT(UInt8(1)) AS num_visitors, #visitors.job_title
      Aggregate: groupBy=[[#visitors.job_title]], aggr=[[COUNT(UInt8(1))]]
        Filter: #visitors.city = Utf8("San Francisco")
          TableScan: visitors projection=None
  • 26. Let's Optimize! Projection Pushdown
After the projection_push_down pass, the scan reads only the two needed columns:

    Projection: #COUNT(UInt8(1)) AS num_visitors, #visitors.job_title
      Aggregate: groupBy=[[#visitors.job_title]], aggr=[[COUNT(UInt8(1))]]
        Filter: #visitors.city = Utf8("San Francisco")
          TableScan: visitors projection=Some([0, 1])
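The effect of projection pushdown can be sketched with a toy column store (the table and its column names are made up):

```python
# Toy columnar table: projection pushdown means the scan materializes
# only the columns the rest of the plan needs.
table = {
    "job_title": ["Engineer", "Designer", "Engineer"],
    "city": ["San Francisco", "San Francisco", "New York"],
    "bio": ["...long text...", "...long text...", "...long text..."],
}

def scan(columns=None):
    """Return only the requested columns; None means read everything."""
    wanted = table.keys() if columns is None else columns
    return {name: table[name] for name in wanted}

# Without pushdown the scan reads all three columns...
assert len(scan(None)) == 3

# ...with projection=["job_title", "city"] the wide "bio" column is never read.
projected = scan(["job_title", "city"])
print(sorted(projected))  # ['city', 'job_title']
```

For a columnar file format like Parquet, skipping a column skips its bytes on disk entirely, which is where the IO savings come from.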
  • 27. Let's Optimize! Filter Pushdown
Minimizes IO (especially useful for formats like Parquet) and processing:

    Projection: #COUNT(UInt8(1)) AS n, #visitors.job_title
      Aggregate: groupBy=[[#visitors.job_title]], aggr=[[COUNT(UInt8(1))]]
        Filter: #visitors.city = Utf8("San Francisco")
          TableScan: visitors projection=Some([0, 1])
  • 28. Let's Optimize! Filter Pushdown
After the filter_push_down pass, the filter is also evaluated during the scan:

    Projection: #COUNT(UInt8(1)) AS n, #visitors.job_title
      Aggregate: groupBy=[[#visitors.job_title]], aggr=[[COUNT(UInt8(1))]]
        Filter: #visitors.city = Utf8("San Francisco")
          TableScan: visitors projection=Some([0, 1]), partial_filters=[#visitors.city = Utf8("San Francisco")]
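One way a pushed-down filter saves IO: for formats like Parquet, the scan can compare the predicate against per-row-group min/max statistics and skip row groups that cannot match, without reading their data. A toy sketch with invented statistics:

```python
# Each "row group" carries min/max statistics for the city column,
# as Parquet files do. All values here are invented sample data.
row_groups = [
    {"min": "Austin",    "max": "Boston",        "rows": 10_000},
    {"min": "New York",  "max": "Seattle",       "rows": 10_000},
    {"min": "San Diego", "max": "San Francisco", "rows": 10_000},
]

target = "San Francisco"

# Keep only row groups whose [min, max] range could contain the target;
# the first group (Austin..Boston) is pruned without being read.
scanned = [g for g in row_groups if g["min"] <= target <= g["max"]]

print(f"read {len(scanned)} of {len(row_groups)} row groups")
```

The remaining row groups still need an exact filter on their rows; the statistics only prune, they never produce false positives on pruning.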
  • 29. Let's Create... The ExecutionPlan
Query / DataFrame → LogicalPlan → Optimize → ExecutionPlan → Optimize → Execute!
The Execution Plan represents the where and how.
  • 30. The Initial Execution Plan

    ProjectionExec: expr=[COUNT(UInt8(1))@1 as number_visitors, job_title@0 as job_title]
      HashAggregateExec: mode=FinalPartitioned, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
        RepartitionExec: partitioning=Hash([Column { name: "job_title", index: 0 }], 16)
          HashAggregateExec: mode=Partial, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
            FilterExec: city@1 = San Francisco
              CsvExec: files=[./data/visitors.csv], has_header=true, limit=None, projection=[job_title, city]
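The partial/final pair of HashAggregateExec nodes in this plan is a two-phase aggregation. A toy sketch of the idea (input partitions are invented; real execution works on Arrow batches, not Python lists):

```python
from collections import Counter

# Two input partitions, each holding some job_title values.
partitions = [
    ["Engineer", "Engineer", "Designer"],
    ["Designer", "Engineer"],
]

# Phase 1 (mode=Partial): each partition computes its own partial counts,
# in parallel, without seeing the other partitions.
partials = [Counter(p) for p in partitions]

# Shuffle (RepartitionExec partitioning=Hash): route each group key to a
# target partition by hashing it, so all counts for a key meet in one place.
shuffled = [Counter() for _ in range(2)]
for partial in partials:
    for key, count in partial.items():
        shuffled[hash(key) % 2][key] += count

# Phase 2 (mode=FinalPartitioned): merge the routed partial counts.
final = Counter()
for part in shuffled:
    final.update(part)

print(dict(final))  # Engineer: 3, Designer: 2 (key order may vary)
```

Partial aggregation shrinks the data before the shuffle, which is why the repartition sits between the two aggregate phases.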
  • 31. And... Optimize!
Query / DataFrame → LogicalPlan → Optimize → ExecutionPlan → Optimize → Execute!
  • 32. Optimize
CoalesceBatches: avoiding small batch sizes

    ProjectionExec: expr=[COUNT(UInt8(1))@1 as number_visitors, job_title@0 as job_title]
      HashAggregateExec: mode=FinalPartitioned, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
        RepartitionExec: partitioning=Hash([Column { name: "job_title", index: 0 }], 16)
          HashAggregateExec: mode=Partial, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
            FilterExec: city@1 = San Francisco
              CsvExec: files=[./data/visitors.csv], has_header=true, limit=None, projection=[job_title, city]
  • 33. Optimize
CoalesceBatches: avoiding small batch sizes. After the coalesce_batches pass, CoalesceBatchesExec nodes are inserted above the (selective) filter and the repartitioning:

    ProjectionExec: expr=[COUNT(UInt8(1))@1 as number_visitors, job_title@0 as job_title]
      HashAggregateExec: mode=FinalPartitioned, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
        CoalesceBatchesExec: target_batch_size=4096
          RepartitionExec: partitioning=Hash([Column { name: "job_title", index: 0 }], 16)
            HashAggregateExec: mode=Partial, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
              CoalesceBatchesExec: target_batch_size=4096
                FilterExec: city@1 = San Francisco
                  CsvExec: files=[./data/visitors.csv], has_header=true, limit=None, projection=[job_title, city]
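A selective filter can leave batches with only a handful of rows, and per-batch overhead then dominates. What CoalesceBatchesExec does can be sketched like this (batch contents and the target size are made up):

```python
def coalesce(batches, target):
    """Merge small batches until each output batch has at least `target` rows."""
    out, current = [], []
    for batch in batches:
        current.extend(batch)
        if len(current) >= target:
            out.append(current)
            current = []
    if current:  # flush whatever is left at the end of the stream
        out.append(current)
    return out

# Five tiny post-filter batches become two reasonably sized ones.
tiny = [[1, 2], [3], [4, 5, 6], [7], [8, 9]]
merged = coalesce(tiny, target=4)

print([len(b) for b in merged])  # [6, 3]
```

Fewer, larger batches amortize per-batch costs across the downstream operators.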
  • 34. Optimize
Repartition: introducing parallelism

    ProjectionExec: expr=[COUNT(UInt8(1))@1 as number_visitors, job_title@0 as job_title]
      HashAggregateExec: mode=FinalPartitioned, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
        CoalesceBatchesExec: target_batch_size=4096
          RepartitionExec: partitioning=Hash([Column { name: "job_title", index: 0 }], 16)
            HashAggregateExec: mode=Partial, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
              CoalesceBatchesExec: target_batch_size=4096
                FilterExec: city@1 = San Francisco
                  CsvExec: files=[./data/visitors.csv], has_header=true, limit=None, projection=[job_title, city]
  • 35. Optimize
Repartition: introducing parallelism. After the repartition pass, the scan output is spread round-robin over 16 partitions so the operators above it run in parallel:

    ProjectionExec: expr=[COUNT(UInt8(1))@1 as number_visitors, job_title@0 as job_title]
      HashAggregateExec: mode=FinalPartitioned, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
        CoalesceBatchesExec: target_batch_size=4096
          RepartitionExec: partitioning=Hash([Column { name: "job_title", index: 0 }], 16)
            HashAggregateExec: mode=Partial, gby=[job_title@0 as job_title], aggr=[COUNT(UInt8(1))]
              CoalesceBatchesExec: target_batch_size=4096
                FilterExec: city@1 = San Francisco
                  RepartitionExec: partitioning=RoundRobinBatch(16)
                    CsvExec: files=[./data/visitors.csv], has_header=true, limit=None, projection=[job_title, city]
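RoundRobinBatch partitioning is simply dealing batches out in turn. A minimal sketch with made-up batch names and 4 partitions instead of 16:

```python
# Ten incoming batches from a single-partition source (e.g. one CSV file).
batches = [f"batch-{i}" for i in range(10)]

# Deal them round-robin over n target partitions; each partition can then
# be processed by its own task/core.
n = 4
partitions = [[] for _ in range(n)]
for i, batch in enumerate(batches):
    partitions[i % n].append(batch)

print([len(p) for p in partitions])  # [3, 3, 2, 2]
```

Round-robin balances work without looking at the data, which is why it is used to fan out a single-partition scan; hash partitioning (as above the aggregate) is used when rows with equal keys must land in the same partition.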
  • 36. Getting results
Return record batches (or write results):
Query / DataFrame → LogicalPlan → Optimize → ExecutionPlan → Optimize → Execute! → Arrow Batches
  • 37. DataFusion Features
● Mostly complete SQL implementation (aggregates, joins, window functions, etc)
● DataFrame API (Python, Rust)
● High performance vectorized, native, safe, multi-threaded execution
● Common file formats: Parquet, CSV, JSON, Avro
● Highly extensible / customizable
● Large, growing community driving the project forward
  • 38. SQL Support
https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html#supported-sql
Projection (SELECT), Filtering (WHERE), Ordering (ORDER BY), Aggregation (GROUP BY)
Aggregation functions (COUNT, SUM, MIN, MAX, AVG, APPROX_PERCENTILE, etc)
Window functions (OVER .. ([ORDER BY ...] [PARTITION BY ..]))
Set functions: UNION (ALL), INTERSECT (ALL), EXCEPT
Scalar functions: string, date/time, ... (basic)
Joins (INNER, LEFT, RIGHT, FULL OUTER, SEMI, ANTI)
Subqueries, Grouping Sets
  • 39. 39 Extensibility Customize DataFusion to your needs User Defined Functions User Defined Aggregates User Defined Optimizer passes User Defined LogicalPlan nodes User Defined ExecutionPlan nodes User Defined TableProvider User Defined FileFormat User Defined ObjectStore
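As a flavor of what the "User Defined Functions" extension point means, here is a toy registry in plain Python (invented names; DataFusion's actual API is Rust): the engine keeps a table of named scalar functions, and user-registered functions become callable from plans alongside the built-ins.

```python
# Toy sketch of a UDF extension point: users plug domain-specific
# functions into the engine's function registry.

class FunctionRegistry:
    def __init__(self):
        self._funcs = {"upper": str.upper}   # a "built-in" function

    def register_udf(self, name, func):
        """Add a user-defined scalar function under a name."""
        self._funcs[name] = func

    def call(self, name, *args):
        return self._funcs[name](*args)

registry = FunctionRegistry()
# A user adds a function the engine does not ship with:
registry.register_udf("is_write_path", lambda p: p == "/api/v2/write")

result = registry.call("is_write_path", "/api/v2/write")
# result == True; registry.call("upper", "ok") == "OK"
```

The other extension points on the slide follow the same pattern: the engine defines a trait (TableProvider, ObjectStore, optimizer pass, plan node) and dispatches to whatever implementation the user registers.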
  • 41. FLOCK https://github.com/flock-lab/flock ● Overview: ○ Low-Cost Streaming Query Engine on FaaS Platforms ○ Project from UMD Database Group, runs streaming queries on AWS Lambda (x86 and arm64/graviton2). ● Use of DataFusion ○ SQL API: ○ DataFrame API: To build plans ○ Optimized native plan execution 41
  • 42. ROAPI https://roapi.github.io/ ● Overview: ○ read-only APIs for static datasets without code ○ columnq-cli: run sql queries against CSV files ● Use of DataFusion ○ SQL API: ○ DataFrame API: (to build plans for GraphQL) ○ File formats: CSV, JSON, Parquet, Avro ○ Optimized native plan execution 42
  • 43. VegaFusion https://vegafusion.io/ ● Overview: ○ Accelerates execution of (interactive) data visualizations ○ Compiles Vega data transforms into DataFusion query plans. ● Use of DataFusion: ○ DataFrame API: To build plans ○ UDFs: to implement some Vega expressions ○ Optimized native plan execution 43
  • 44. Cube.js / Cube Store https://cube.dev/ ● Overview: ○ Headless Business Intelligence ○ cubestore pre-aggregation storage layer ● Use of DataFusion (fork) ○ SQL API (with custom extensions) ○ Custom Logical and Physical Operators ○ UDFs: custom functions ○ Optimized native plan execution 44
  • 45. InfluxDB IOx https://github.com/influxdata/influxdb_iox ● Overview: ○ In-memory columnar store using object storage, future core of InfluxDB; support SQL, InfluxQL, and Flux ○ Query and data reorganization built with DataFusion ● Use of DataFusion: ○ Table Provider: Custom data sources ○ SQL API ○ PlanBuilder API: Plans for custom query language ○ UD Logical and Execution Plans ○ UDFs: to implement the precise semantics of influxRPC ○ Optimized native plan execution 45
  • 46. Coralogix https://coralogix.com/ ● Overview: ○ Stateful streaming analytics with machine learning enables teams to monitor and visualize observability data in real-time before indexing ● Use of DataFusion: ○ Table Provider: custom data source ○ User Defined Logical and Execution Plans: to implement a custom query language ○ User Defined ObjectStore: for queries over data in object storage ○ UDFs: for working with semi-structured data ○ Optimized native plan execution 46
  • 47. blaze-rs https://github.com/blaze-init/blaze ● Overview: ○ High performance, low-cost native execution layer for Spark: execute the physical operators with Rust ○ Translates Spark Exec nodes into DataFusion Execution Plans ● Use of DataFusion ○ Optimized native plan execution ○ HDFS Object Store Extension 47
  • 48. Ballista Distributed Compute https://github.com/apache/arrow-ballista ● Overview: ○ Spark-like distributed Query Engine (part of Arrow Project) ○ Adds distributed execution to DataFusion plans ● Use of DataFusion: ○ SQL API ○ DataFrame API ○ Optimized native plan execution ○ File formats: CSV, JSON, Parquet, Avro 48
  • 50. Future Directions ● Embeddability ○ More regular releases to crates.io, more modularity ● Broader SQL features ○ Subqueries, more date/time functions, struct / array types ● Improved Performance ○ Query directly from Object Storage ○ More state of the art tech: JIT, NUMA aware scheduling, hybrid row/columnar exec ● Ecosystem integration ○ FlightSQL, Substrait.io ○ Databases ● GPU support 50
  • 51. Come Join Us We ❤ Our Contributors ● Contributions at all levels are encouraged and welcomed. ● Learn Rust! ● Learn Database Internals! ● Have a great time with a welcoming community! More details: https://arrow.apache.org/datafusion/community/communication.html 51
  • 52. 52 Thank you arrow.apache.org/datafusion github.com/apache/arrow-datafusion Andrew Lamb Staff Engineer, InfluxData Apache Arrow PMC Daniël Heres Data Engineer, GoDataDriven Apache Arrow PMC
  • 55. Thank You! arrow.apache.org/datafusion github.com/apache/arrow-datafusion Andrew Lamb Staff Engineer, InfluxData Apache Arrow PMC Daniël Heres Data Engineer, GoDataDriven Apache Arrow PMC
  • 56. DataFusion / Arrow / Parquet Parquet Arrow sqlparser-rs DataFusion
  • 57. A Virtuous Cycle: Increased Use Drives Increased Contribution 57 Increased use of open source systems → increased capacity for maintenance and contribution. DataFusion and Apache Arrow are key open source technologies for building interoperable open source systems
  • 58. delta-rs https://github.com/delta-io/delta-rs ● Overview: ○ Native Delta Lake implementation in Rust ● Use of DataFusion ○ Table Provider API: allows other DataFusion users to read from Delta tables 58 DISCLAIMER: Not yet cleared / verified with project team
  • 59. Cloudfuse Buzz https://github.com/cloudfuse-io/buzz-rust ● Serverless cloud-based query engine ○ map using cloud functions (AWS Lambda) ○ aggregate using containers (AWS Fargate) ● Project expected to be continued from June 59 DISCLAIMER: Not yet cleared / verified with project team
  • 60. dask-sql https://github.com/dask-contrib/dask-sql ● Overview: ○ TBD ● Use of DataFusion: ○ WIP https://github.com/dask-contrib/dask-sql/issues/474 60 DISCLAIMER: Not yet cleared / verified with project team
  • 61. Apache Arrow Analytics Toolkit Where does DataFusion fit? 61 Parquet (“Disk”) Arrow (“Memory”) Compute Kernels Arrow Flight Arrow FlightSQL DataFusion Data Formats Low Level Calculations + Interchange Runtime Subsystems IPC C ABI Analytics / Database Systems C++ Query Engine Analytic systems built using some of this stack Native Implementations Language Bindings
  • 62. Query Engines What are they and why do you need one? 1. Add a SQL or DataFrame interface to your application’s data 2. Implement a custom query language / DSL 3. Implement a new data analytic system 4. Implement a new database system (natch) Maps desired computations (SQL and DataFrames, ala Pandas) to efficient calculation: projection pushdown, filter pushdown, joins, expression simplification, parallelization, etc 62
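Two of the rewrites named above, projection pushdown and filter pushdown, can be illustrated with a toy scan in plain Python (hypothetical names, not DataFusion code): the scan is told which columns and which predicate it needs, so unused columns are never materialized and filtered rows never leave the scan.

```python
# Toy illustration of projection pushdown (read only needed columns)
# and filter pushdown (drop rows at the scan, not in a later operator).

TABLE = [
    {"job_title": "Engineer", "city": "San Francisco", "age": 31},
    {"job_title": "Designer", "city": "New York",      "age": 28},
]

def scan(table, columns=None, predicate=None):
    """A scan that accepts a pushed-down projection and filter."""
    for row in table:
        if predicate is not None and not predicate(row):
            continue                              # filter pushdown
        if columns is not None:
            row = {c: row[c] for c in columns}    # projection pushdown
        yield row

rows = list(scan(TABLE,
                 columns=["job_title"],
                 predicate=lambda r: r["city"] == "San Francisco"))
# rows == [{"job_title": "Engineer"}]
```

With a columnar format like Parquet these pushdowns pay off directly: whole columns and row groups are skipped on disk rather than read and discarded.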
  • 63. datafusion-python https://github.com/datafusion-contrib/datafusion-python ● Overview: ○ Python dataframe library (modeled after pyspark) ● Use of DataFusion ○ SQL API ○ DataFrame API ○ File formats: CSV, JSON, Parquet, Avro ○ Optimized native plan execution 63 DISCLAIMER: Not yet cleared / verified with project team
  • 64. Common Themes Come for the performance, stay for the features (?) ● Native execution: native (non-JVM) execution of Spark / Spark-like workloads ● SQL interface / DataFrame API ● Projects are leveraging properties of the Rust language 64
  • 65. Better, Faster, Cheaper The DataFusion Query Engine is part of the commoditization of advanced analytic database technologies, and will transform analytic systems over the next decade 65 Better Faster Cheaper
  • 66. Andrew’s Notes Proposal: Data + AI Summit talk Desired Takeaways: 1. If you need a query engine (in Rust?), you should use DataFusion Thesis: DataFusion is part of a larger trend (spearheaded by Apache Arrow) in the commoditization of analytic database technologies, which will lead to many faster / cheaper / better analytic systems over the next decade Other decks for inspiration: DataFusion: An Embeddable Query Engine Written in Rust A Rusty Introduction to Apache Arrow and how it Applies to a Time Series Database 2021-04-20: Apache Arrow and its Impact on the Database industry.pptx 66