SlideShare a Scribd company logo
A Rusty Introduction to Apache
Arrow and how it Applies to a
Time Series Database
December 9, 2020
Andrew Lamb
InfluxData
IOx Team at InfluxData
Query Optimizer / Architect @ Vertica
(Columnar Database),
Chief Architect @ DataRobot (Machine
Learning Platform )
Chief Architect @ Nutonian (Machine
Learning Apps
XLST JIT Compiler Team at DataPower
Goals + Outline
Goal: ⇒ Arrow is a good basis for a new (time series) Databases ❤
● Opinions and Perspectives of Databases
● Background on Arrow
● Arrow Examples, in Rust
Databases -- Trend Towards Specialization
Relational
Key-Value
Timeseries
Graph
Array / Scientific
Document
Stream
Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st
International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. DOI:https://doi.org/10.1109/ICDE.2005.1
Data Model Deployment
Embedded / Edge
Cloud
Single-Node
Hybrid
Ecosystem
Hadoop
Java
Json / Javascript
AWS
GCP
Azure
Apple Cloud
Use Case
Transactions
Analytics
Streaming
...
… and our new database is …
🎉
InfluxDB IOx - The Future Core of InfluxDB Built
with Rust and Arrow
Analytic Systems (vs Transactional)
● Transactional (OLTP, Key-value stores, etc)
○ Workload is “lookup a record by id”, “update a record”, “keep data durable and consistent”
○ Examples: Oracle, Postgres, Cassandra, DynamoDB, MongoDB, etc etc
● Analytic (OLAP, “Big Data”, etc)
○ Workload: aggregate many rows to get historical view, bulk loads, rarely updated
○ Examples: ClickHouse, MapReduce, Spark, Vertica, Pig, Hive, InfluxDB, etc etc
⇒ Rest of the talk focused on Analytic Databases
So, you want to build a new database… ?
Databases need many features just to look like a database:
● Get Data In and Out
● Store Data and Catalog / Metadata
● Query Store: + Query Language
● Connect: Client API
…
Before you can invest in what makes your database special
Implementation timeline for a new Database system
Client
API
In memory
storage
In-Memory
filter + aggregation
Durability /
persistence
Metadata Catalog +
Management
Query
Language
Parser
Optimized /
Compressed
storage
Execution on
Compressed
Data
Joins!
Additional Client
Languages
Outer
Joins
Subquery
support
More advanced
analytics
Cost
based
optimizer
Out of core
algorithms
Storage
Rearrangement
Heuristic
Query
Planner
Arithmetic
expressions
Date / time
Expressions
Concurrency
Control
Data Model /
Type System
Distributed query
execution
Resource
Management
“Lets Build
a
Database”
🤔
“Ok now
this is pretty
good”
😐
“Look mom!
I have a
database!”
😃
Online
recovery
Arrow Project Goals
“Build a better open source
foundation for data science”
🤔 How is this related to databases?
https://arrow.apache.org/
Arrow == toolkit for a modern analytic databases
match tool_needed {
File Format (persistence) => Parquet
Columnar memory representation => Arrow Arrays
Operations (e.g. add, multiply) => Compute Kernels
Network transfer => Arrow Flight IPC
_ => ... to be continued ...
}
InfluxDB line protocol
weather,location=us-east temperature=82,humidity=67 1465839830100400200
weather,location=us-midwest temperature=82,humidity=65 1465839830100400200
weather,location=us-west temperature=70,humidity=54 1465839830100400200
weather,location=us-east temperature=83,humidity=69 1465839830200400200
weather,location=us-midwest temperature=87,humidity=78 1465839830200400200
weather,location=us-west temperature=72,humidity=56 1465839830200400200
weather,location=us-east temperature=84,humidity=67 1465839830300400200
weather,location=us-midwest temperature=90,humidity=82 1465839830400400200
weather,location=us-west temperature=71,humidity=57 1465839830400400200
Line Protocol Tutorial (link)
Measurements
Tags Fields
Timestamp
IOx Data Model
weather,location=us-east temperature=82,humidity=67 1465839830100400200
weather,location=us-midwest temperature=82,humidity=65 1465839830100400200
weather,location=us-west temperature=70,humidity=54 1465839830100400200
weather,location=us-east temperature=83,humidity=69 1465839830200400200
weather,location=us-midwest temperature=87,humidity=78 1465839830200400200
weather,location=us-west temperature=72,humidity=56 1465839830200400200
weather,location=us-east temperature=84,humidity=67 1465839830300400200
weather,location=us-midwest temperature=90,humidity=82 1465839830400400200
weather,location=us-west temperature=71,humidity=57 1465839830400400200
location
"us-east"
"us-midwest"
"us-west"
"us-east"
"us-midwest"
"us-west"
"us-east"
"us-midwest"
"us-west"
temperature
82
82
70
83
87
72
84
90
71
humidity
67
65
54
69
78
56
67
82
57
timestamp
2016-06-13T17:43:50.1004002Z
2016-06-13T17:43:50.1004002Z
2016-06-13T17:43:50.1004002Z
2016-06-13T17:43:50.2004002Z
2016-06-13T17:43:50.2004002Z
2016-06-13T17:43:50.2004002Z
2016-06-13T17:43:50.3004002Z
2016-06-13T17:43:50.3004002Z
2016-06-13T17:43:50.3004002Z
Table: weather
Code Examples
Thesis: “When writing an analytic database, you will end up implementing the
Arrow feature set”
(Ecosystem integration is another major benefit of Arrow, subject of a future talk)
+
* Take performance comparisons with a large grain of salt
Compare Plain Rust and Rust using the Arrow library
Motivating Example
“Find the rows that are not in `us-west`”
Create the Array
let string_vec: Vec<String> =
(0..NUM_TAGS)
.map(|i| {
match i % 3 {
0 => "us-east",
1 => "us-midwest",
2 => "us-west",
}.into()
})
.collect();
let mut builder =
StringBuilder::new(NUM_TAGS);
(0..NUM_TAGS).enumerate()
.for_each(|(i, _)| {
let location = match i % 3 {
0 => "us-east",
1 => "us-midwest",
2 => "us-west",
};
builder.append_value(location)
.unwrap()
});
let array = builder.finish();
> created array with 10000000 elements
~600ms
> created array with 10000000 elements
~400ms
+
Memory Footprint
let size =
size_of::<Vec<String>>() +
string_vec
.iter()
.fold(0, |sz, s| {
sz + size_of::<String>() + s.len()
});
println!("total size: {} bytes", size);
println!("total size: {} bytes",
array.get_array_memory_size());
> total size: 320000023 bytes
~320 MB *
> total size: 149206128 bytes
~150 MB
+
Find Rows != “us-west”
let not_west_bitset: Vec<bool> =
string_vec
.iter()
.map(|s| s != "us-west")
.collect();
let num_not_west = not_west_bitset
.iter()
.filter(|&&v| v)
.count();
let not_west_bitset =
neq_utf8_scalar(
&array,
"us-west"
).unwrap();
let num_not_west = not_west_bitset
.iter()
.filter(|v| matches!(v, Some(true)))
.count();
> Found 6666667 not in west
~50ms
> Found 6666667 not in west
~120ms
+
Find Rows != “us-west” (with null handling)
let string_vec: Vec<Option<String>> = ...;
let not_west_bitset: Vec<bool> =
string_vec
.iter()
.map(|s| {
s.as_ref()
.map(|s| s != "us-west")
.unwrap_or(false)
})
.collect();
let num_not_west = not_west_bitset
.iter()
.filter(|&&v| v)
.count();
+
Same as previous
> Found 6666667 not in west
~50ms
Materialize rows for future processing
let not_west: Vec<String> = not_west_bitset
.iter()
.enumerate()
.filter_map(|(i, &v)| {
if v {
Some(string_vec[i].clone())
} else {
None
}
})
.collect();
let not_west = filter(
&array,
&not_west_bitset
).unwrap();
> Made array of 6666667 Strings not in west
~450 ms
> Made array of 6666667 Strings not in west
~50 ms
+
More efficient encoding (dictionary)
let vb = StringBuilder::new();
let kb = Int8Builder::new();
let mut builder =
StringDictionaryBuilder::new(vb,kb);
(0..NUM_TAGS)
.enumerate()
.for_each(|(i, _)| {
let location = match i % 3 {
0 => "us-east",
1 => "us-midwest",
2 => "us-west",
};
builder.append(location).unwrap();
});
let array = builder.finish();
> total size: 10000688 bytes
10MB
250 ms
+
dictionary
"us-east"
"us-midwest"
"us-west"
Location
0
1
2
0
1
2
0
1
2
[0]
[1]
[2]
[u8]
SIMD Anyone?
let output = gt(
&left,
&right
).unwrap();
+
10
20
17
5
23
5
9
12
4
5
76
2
3
5
2
33
2
1
6
7
8
2
7
2
5
6
7
8
left right output
1
0
1
1
1
0
1
1
0
1
1
0
0
0
>
>
>
>
SIMD Implementation
#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"),
feature = "simd"))]
fn simd_compare_op<T, F>(left: &PrimitiveArray<T>,
right: &PrimitiveArray<T>, op: F) -> Result<BooleanArray>
where
T: ArrowNumericType,
F: Fn(T::Simd, T::Simd) -> T::SimdMask,
{
// use / error checking elided
let null_bit_buffer = combine_option_bitmap(
left.data_ref(), right.data_ref(), len
)?;
let lanes = T::lanes();
let mut result = MutableBuffer::new(
left.len() * mem::size_of::<bool>()
);
let rem = len % lanes;
for i in (0..len - rem).step_by(lanes) {
let simd_left = T::load(left.value_slice(i, lanes));
let simd_right = T::load(right.value_slice(i, lanes));
let simd_result = op(simd_left, simd_right);
T::bitmask(&simd_result, |b| {
result.write(b).unwrap();
});
}
Source: arrow/src/compute/kernels/comparison.rs
if rem > 0 {
let simd_left = T::load(left.value_slice(len - rem, lanes));
let simd_right = T::load(right.value_slice(len - rem, lanes));
let simd_result = op(simd_left, simd_right);
let rem_buffer_size = (rem as f32 / 8f32).ceil() as usize;
T::bitmask(&simd_result, |b| {
result.write(&b[0..rem_buffer_size]).unwrap();
});
}
let data = ArrayData::new(
DataType::Boolean,
left.len(),
None,
null_bit_buffer,
0,
vec![result.freeze()],
vec![],
);
Ok(PrimitiveArray::<BooleanType>::from(Arc::new(data)))
}
Other things needed in a database
Vec<Option<String>> to support nulls
Handle other data types with same code
Vectorized implementations of filter, aggregate, etc
Persist it to storage
Send data over the network
Ecosystem compatibility
...
Rust / Arrow Community: Good and Getting better
Major Roadmap Items (see also Apache Arrow (Rust) 2.0.0)
1. Support Stable Rust
2. Improved DictionaryArray support and performance
3. Improved compute kernel performance
4. SQL: Joins
5. Parallel CPU-bound operations; Additional platform support (e.g. ARMv8)
InfluxData specifically is investing in:
1. Flight IPC
2. Improved Dictionary and Date/Time support
3. Data Fusion (some other tech talk)
Thank You
Find us online
Github: https://github.com/influxdata/influxdb_iox
Slack: https://influxdata.com/slack
It is early days; there are many cool things left to implement
And we are hiring (Senior IOx Engineer Job Posting)

More Related Content

What's hot

Webinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Webinar: Secrets of ClickHouse Query Performance, by Robert HodgesWebinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Webinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Altinity Ltd
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Altinity Ltd
 
Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problems
Alexander Korotkov
 
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
InfluxData
 
PostgreSQL Query Cache - "pqc"
PostgreSQL Query Cache - "pqc"PostgreSQL Query Cache - "pqc"
PostgreSQL Query Cache - "pqc"
Uptime Technologies LLC
 
PostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsPostgreSQL Administration for System Administrators
PostgreSQL Administration for System Administrators
Command Prompt., Inc
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
Jurriaan Persyn
 
MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...
MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...
MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...
Severalnines
 
Reference Architecture: Architecting Ceph Storage Solutions
Reference Architecture: Architecting Ceph Storage Solutions Reference Architecture: Architecting Ceph Storage Solutions
Reference Architecture: Architecting Ceph Storage Solutions
Ceph Community
 
Log Structured Merge Tree
Log Structured Merge TreeLog Structured Merge Tree
Log Structured Merge Tree
University of California, Santa Cruz
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
Dremio Corporation
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
Yoshinori Matsunobu
 
Catalogs - Turning a Set of Parquet Files into a Data Set
Catalogs - Turning a Set of Parquet Files into a Data SetCatalogs - Turning a Set of Parquet Files into a Data Set
Catalogs - Turning a Set of Parquet Files into a Data Set
InfluxData
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015PostgreSQL-Consulting
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
Flink Forward
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdf
Altinity Ltd
 

What's hot (20)

Webinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Webinar: Secrets of ClickHouse Query Performance, by Robert HodgesWebinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Webinar: Secrets of ClickHouse Query Performance, by Robert Hodges
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
 
Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problems
 
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
 
PostgreSQL Query Cache - "pqc"
PostgreSQL Query Cache - "pqc"PostgreSQL Query Cache - "pqc"
PostgreSQL Query Cache - "pqc"
 
PostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsPostgreSQL Administration for System Administrators
PostgreSQL Administration for System Administrators
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...
MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...
MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...
 
Reference Architecture: Architecting Ceph Storage Solutions
Reference Architecture: Architecting Ceph Storage Solutions Reference Architecture: Architecting Ceph Storage Solutions
Reference Architecture: Architecting Ceph Storage Solutions
 
Log Structured Merge Tree
Log Structured Merge TreeLog Structured Merge Tree
Log Structured Merge Tree
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
Catalogs - Turning a Set of Parquet Files into a Data Set
Catalogs - Turning a Set of Parquet Files into a Data SetCatalogs - Turning a Set of Parquet Files into a Data Set
Catalogs - Turning a Set of Parquet Files into a Data Set
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdf
 

Similar to A Rusty introduction to Apache Arrow and how it applies to a time series database

2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
Andrew Lamb
 
A miało być tak... bez wycieków
A miało być tak... bez wyciekówA miało być tak... bez wycieków
A miało być tak... bez wycieków
Konrad Kokosa
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
ScyllaDB
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
antoinegirbal
 
2011 Mongo FR - MongoDB introduction
2011 Mongo FR - MongoDB introduction2011 Mongo FR - MongoDB introduction
2011 Mongo FR - MongoDB introduction
antoinegirbal
 
Write Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfWrite Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdf
Eric Xiao
 
ASHviz - Dats visualization research experiments using ASH data
ASHviz - Dats visualization research experiments using ASH dataASHviz - Dats visualization research experiments using ASH data
ASHviz - Dats visualization research experiments using ASH data
John Beresniewicz
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaQuerying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
Yaroslav Tkachenko
 
CBStreams - Java Streams for ColdFusion (CFML)
CBStreams - Java Streams for ColdFusion (CFML)CBStreams - Java Streams for ColdFusion (CFML)
CBStreams - Java Streams for ColdFusion (CFML)
Ortus Solutions, Corp
 
ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...
ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...
ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...
Ortus Solutions, Corp
 
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
NoSQLmatters
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
Matthias Niehoff
 
NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020
Thodoris Bais
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
Databricks
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
Paul Chao
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
b0ris_1
 
Compass Framework
Compass FrameworkCompass Framework
Compass Framework
Lukas Vlcek
 
(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby
Amazon Web Services
 

Similar to A Rusty introduction to Apache Arrow and how it applies to a time series database (20)

2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 
A miało być tak... bez wycieków
A miało być tak... bez wyciekówA miało być tak... bez wycieków
A miało być tak... bez wycieków
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
2011 Mongo FR - MongoDB introduction
2011 Mongo FR - MongoDB introduction2011 Mongo FR - MongoDB introduction
2011 Mongo FR - MongoDB introduction
 
Write Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfWrite Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdf
 
ASHviz - Dats visualization research experiments using ASH data
ASHviz - Dats visualization research experiments using ASH dataASHviz - Dats visualization research experiments using ASH data
ASHviz - Dats visualization research experiments using ASH data
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaQuerying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
 
CBStreams - Java Streams for ColdFusion (CFML)
CBStreams - Java Streams for ColdFusion (CFML)CBStreams - Java Streams for ColdFusion (CFML)
CBStreams - Java Streams for ColdFusion (CFML)
 
ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...
ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...
ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...
 
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
 
NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
 
Compass Framework
Compass FrameworkCompass Framework
Compass Framework
 
(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby
 

Recently uploaded

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 

Recently uploaded (20)

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 

A Rusty introduction to Apache Arrow and how it applies to a time series database

  • 1. A Rusty Introduction to Apache Arrow and how it Applies to a Time Series Database December 9, 2020 Andrew Lamb InfluxData
  • 2. IOx Team at InfluxData Query Optimizer / Architect @ Vertica (Columnar Database), Chief Architect @ DataRobot (Machine Learning Platform ) Chief Architect @ Nutonian (Machine Learning Apps XLST JIT Compiler Team at DataPower
  • 3. Goals + Outline Goal: ⇒ Arrow is a good basis for a new (time series) Databases ❤ ● Opinions and Perspectives of Databases ● Background on Arrow ● Arrow Examples, in Rust
  • 4. Databases -- Trend Towards Specialization Relational Key-Value Timeseries Graph Array / Scientific Document Stream Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. DOI:https://doi.org/10.1109/ICDE.2005.1 Data Model Deployment Embedded / Edge Cloud Single-Node Hybrid Ecosystem Hadoop Java Json / Javascript AWS GCP Azure Apple Cloud Use Case Transactions Analytics Streaming ...
  • 5. … and our new database is … 🎉 InfluxDB IOx - The Future Core of InfluxDB Built with Rust and Arrow
  • 6. Analytic Systems (vs Transactional) ● Transactional (OLTP, Key-value stores, etc) ○ Workload is “lookup a record by id”, “update a record”, “keep data durable and consistent” ○ Examples: Oracle, Postgres, Cassandra, DynamoDB, MongoDB, etc etc ● Analytic (OLAP, “Big Data”, etc) ○ Workload: aggregate many rows to get historical view, bulk loads, rarely updated ○ Examples: ClickHouse, MapReduce, Spark, Vertica, Pig, Hive, InfluxDB, etc etc ⇒ Rest of the talk focused on Analytic Databases
  • 7. So, you want to build a new database… ? Databases need many features just to look like a database: ● Get Data In and Out ● Store Data and Catalog / Metadata ● Query Store: + Query Language ● Connect: Client API … Before you can invest in what makes your database special
  • 8. Implementation timeline for a new Database system Client API In memory storage In-Memory filter + aggregation Durability / persistence Metadata Catalog + Management Query Language Parser Optimized / Compressed storage Execution on Compressed Data Joins! Additional Client Languages Outer Joins Subquery support More advanced analytics Cost based optimizer Out of core algorithms Storage Rearrangement Heuristic Query Planner Arithmetic expressions Date / time Expressions Concurrency Control Data Model / Type System Distributed query execution Resource Management “Lets Build a Database” 🤔 “Ok now this is pretty good” 😐 “Look mom! I have a database!” 😃 Online recovery
  • 9. Arrow Project Goals “Build a better open source foundation for data science” 🤔 How is this related to databases? https://arrow.apache.org/
  • 10. Arrow == toolkit for a modern analytic databases match tool_needed { File Format (persistence) => Parquet Columnar memory representation => Arrow Arrays Operations (e.g. add, multiply) => Compute Kernels Network transfer => Arrow Flight IPC _ => ... to be continued ... }
  • 11. InfluxDB line protocol weather,location=us-east temperature=82,humidity=67 1465839830100400200 weather,location=us-midwest temperature=82,humidity=65 1465839830100400200 weather,location=us-west temperature=70,humidity=54 1465839830100400200 weather,location=us-east temperature=83,humidity=69 1465839830200400200 weather,location=us-midwest temperature=87,humidity=78 1465839830200400200 weather,location=us-west temperature=72,humidity=56 1465839830200400200 weather,location=us-east temperature=84,humidity=67 1465839830300400200 weather,location=us-midwest temperature=90,humidity=82 1465839830400400200 weather,location=us-west temperature=71,humidity=57 1465839830400400200 Line Protocol Tutorial (link) Measurements Tags Fields Timestamp
  • 12. IOx Data Model weather,location=us-east temperature=82,humidity=67 1465839830100400200 weather,location=us-midwest temperature=82,humidity=65 1465839830100400200 weather,location=us-west temperature=70,humidity=54 1465839830100400200 weather,location=us-east temperature=83,humidity=69 1465839830200400200 weather,location=us-midwest temperature=87,humidity=78 1465839830200400200 weather,location=us-west temperature=72,humidity=56 1465839830200400200 weather,location=us-east temperature=84,humidity=67 1465839830300400200 weather,location=us-midwest temperature=90,humidity=82 1465839830400400200 weather,location=us-west temperature=71,humidity=57 1465839830400400200 location "us-east" "us-midwest" "us-west" "us-east" "us-midwest" "us-west" "us-east" "us-midwest" "us-west" temperature 82 82 70 83 87 72 84 90 71 humidity 67 65 54 69 78 56 67 82 57 timestamp 2016-06-13T17:43:50.1004002Z 2016-06-13T17:43:50.1004002Z 2016-06-13T17:43:50.1004002Z 2016-06-13T17:43:50.2004002Z 2016-06-13T17:43:50.2004002Z 2016-06-13T17:43:50.2004002Z 2016-06-13T17:43:50.3004002Z 2016-06-13T17:43:50.3004002Z 2016-06-13T17:43:50.3004002Z Table: weather
  • 13. Code Examples Thesis: “When writing an analytic database, you will end up implementing the Arrow feature set” (Ecosystem integration is another major benefit of Arrow, subject of a future talk) + * Take performance comparisons with a large grain of salt Compare Plain Rust and Rust using the Arrow library
  • 14. Motivating Example “Find the rows that are not in `us-west`”
  • 15. Create the Array let string_vec: Vec<String> = (0..NUM_TAGS) .map(|i| { match i % 3 { 0 => "us-east", 1 => "us-midwest", 2 => "us-west", }.into() }) .collect(); let mut builder = StringBuilder::new(NUM_TAGS); (0..NUM_TAGS).enumerate() .for_each(|(i, _)| { let location = match i % 3 { 0 => "us-east", 1 => "us-midwest", 2 => "us-west", }; builder.append_value(location) .unwrap() }); let array = builder.finish(); > created array with 10000000 elements ~600ms > created array with 10000000 elements ~400ms +
  • 16. Memory Footprint let size = size_of::<Vec<String>>() + string_vec .iter() .fold(0, |sz, s| { sz + size_of::<String>() + s.len() }); println!("total size: {} bytes", size); println!("total size: {} bytes", array.get_array_memory_size()); > total size: 320000023 bytes ~320 MB * > total size: 149206128 bytes ~150 MB +
  • 17. Find Rows != “us-west” let not_west_bitset: Vec<bool> = string_vec .iter() .map(|s| s != "us-west") .collect(); let num_not_west = not_west_bitset .iter() .filter(|&&v| v) .count(); let not_west_bitset = neq_utf8_scalar( &array, "us-west" ).unwrap(); let num_not_west = not_west_bitset .iter() .filter(|v| matches!(v, Some(true))) .count(); > Found 6666667 not in west ~50ms > Found 6666667 not in west ~120ms +
  • 18. Find Rows != “us-west” (with null handling) let string_vec: Vec<Option<String>> = ...; let not_west_bitset: Vec<bool> = string_vec .iter() .map(|s| { s.as_ref() .map(|s| s != "us-west") .unwrap_or(false) }) .collect(); let num_not_west = not_west_bitset .iter() .filter(|&&v| v) .count(); + Same as previous > Found 6666667 not in west ~50ms
  • 19. Materialize rows for future processing let not_west: Vec<String> = not_west_bitset .iter() .enumerate() .filter_map(|(i, &v)| { if v { Some(string_vec[i].clone()) } else { None } }) .collect(); let not_west = filter( &array, &not_west_bitset ).unwrap(); > Made array of 6666667 Strings not in west ~450 ms > Made array of 6666667 Strings not in west ~50 ms +
  • 20. More efficient encoding (dictionary) let vb = StringBuilder::new(); let kb = Int8Builder::new(); let mut builder = StringDictionaryBuilder::new(vb,kb); (0..NUM_TAGS) .enumerate() .for_each(|(i, _)| { let location = match i % 3 { 0 => "us-east", 1 => "us-midwest", 2 => "us-west", }; builder.append(location).unwrap(); }); let array = builder.finish(); > total size: 10000688 bytes 10MB 250 ms + dictionary "us-east" "us-midwest" "us-west" Location 0 1 2 0 1 2 0 1 2 [0] [1] [2] [u8]
  • 21. SIMD Anyone? let output = gt( &left, &right ).unwrap(); + 10 20 17 5 23 5 9 12 4 5 76 2 3 5 2 33 2 1 6 7 8 2 7 2 5 6 7 8 left right output 1 0 1 1 1 0 1 1 0 1 1 0 0 0 > > > >
  • 22. SIMD Implementation #[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), feature = "simd"))] fn simd_compare_op<T, F>(left: &PrimitiveArray<T>, right: &PrimitiveArray<T>, op: F) -> Result<BooleanArray> where T: ArrowNumericType, F: Fn(T::Simd, T::Simd) -> T::SimdMask, { // use / error checking elided let null_bit_buffer = combine_option_bitmap( left.data_ref(), right.data_ref(), len )?; let lanes = T::lanes(); let mut result = MutableBuffer::new( left.len() * mem::size_of::<bool>() ); let rem = len % lanes; for i in (0..len - rem).step_by(lanes) { let simd_left = T::load(left.value_slice(i, lanes)); let simd_right = T::load(right.value_slice(i, lanes)); let simd_result = op(simd_left, simd_right); T::bitmask(&simd_result, |b| { result.write(b).unwrap(); }); } Source: arrow/src/compute/kernels/comparison.rs if rem > 0 { let simd_left = T::load(left.value_slice(len - rem, lanes)); let simd_right = T::load(right.value_slice(len - rem, lanes)); let simd_result = op(simd_left, simd_right); let rem_buffer_size = (rem as f32 / 8f32).ceil() as usize; T::bitmask(&simd_result, |b| { result.write(&b[0..rem_buffer_size]).unwrap(); }); } let data = ArrayData::new( DataType::Boolean, left.len(), None, null_bit_buffer, 0, vec![result.freeze()], vec![], ); Ok(PrimitiveArray::<BooleanType>::from(Arc::new(data))) }
  • 23. Other things needed in a database Vec<Option<String>> to support nulls Handle other data types with same code Vectorized implementations of filter, aggregate, etc Persist it to storage Send data over the network Ecosystem compatibility ...
  • 24. Rust / Arrow Community: Good and Getting better Major Roadmap Items (see also Apache Arrow (Rust) 2.0.0) 1. Support Stable Rust 2. Improved DictionaryArray support and performance 3. Improved compute kernel performance 4. SQL: Joins 5. Parallel CPU-bound operations; Additional platform support (e.g. ARMv8) InfluxData specifically is investing in: 1. Flight IPC 2. Improved Dictionary and Date/Time support 3. Data Fusion (some other tech talk)
  • 25. Thank You Find us online Github: https://github.com/influxdata/influxdb_iox Slack: https://influxdata.com/slack It is early days; there are many cool things left to implement And we are hiring (Senior IOx Engineer Job Posting)