2021 04-20 apache arrow and its impact on the database industry.pptx

Andrew Lamb
Guest Lecture
CSC-132 Database Systems Implementation in Spring '21.
UCSD
April 20, 2021
Apache Arrow and its Impact
on the Database industry

IOx Team at InfluxData
Apache Arrow PMC Member
Query Optimizer / Architect @ Vertica
Technical Staff @ Oracle, on Database Server
Chief Architect @ DataRobot (ML Startup)
Chief Architect/VP Engineering @ Nutonian (ML Startup)
XSLT JIT Compiler Team at DataPower (XML startup)

Goals + Outline
Goal: ⇒ Arrow is a good basis for new (time series) databases ❤
● Opinions and Perspectives of Databases
● Why and What of Arrow
● Arrow Examples, in Rust
Props to Wes McKinney who provided many of the arrow slides

Databases -- Trend Towards Specialization
Data Model
● Relational
● Key-Value
● Timeseries
● Graph
● Array / Scientific
● Document
● Stream
Deployment
● Embedded / Edge
● Cloud
● Single-Node
● Hybrid
Ecosystem
● Hadoop
● Java
● JSON / JavaScript
● AWS
● GCP
● Azure
● Apple Cloud
Use Case
● Transactions
● Analytics
● Streaming
● ...
Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. DOI:https://doi.org/10.1109/ICDE.2005.1

… and our new database is …
🎉
InfluxDB IOx - The Future Core of InfluxDB Built
with Rust and Arrow

And we are not alone
A proliferation of new databases… and many are choosing to use parts of the Arrow ecosystem
⇒ lets them innovate at a higher level, not on the nuts and bolts
Arrow Parquet
● AWS Athena
● ClickHouse
● Snowflake
● Vertica
● DuckDB
● Greenplum
● Databricks
● …
Arrow Flight
● Google BigQuery
● Snowflake
● Dremio
● Databricks
● ...
Arrow Arrays
● Dremio
● TileDB
● SciDB
● ...

Analytic Systems (vs Transactional)
● Transactional (OLTP, Key-value stores, etc)
○ Workload is “lookup a record by id”, “update a record”, “keep data durable and consistent”
○ Examples: Oracle, Postgres, Cassandra, DynamoDB, MongoDB, SQLite etc etc
● Analytic (OLAP, “Big Data”, etc)
○ Workload: aggregate many rows to get historical view, bulk loads, rarely updated
○ Examples: ClickHouse, MapReduce, Spark, Vertica, Pig, Hive, InfluxDB, etc etc
⇒ The rest of the talk focuses on Analytic Databases

So, you want to build a new database… ?
Databases need many features just to look like a database:
● Get Data In and Out
● Store Data and Catalog / Metadata
● Query Language + Query Engine
● Connect: Client API
…
Before you can invest in what makes your database special

Implementation timeline for a new Database system
Features on the timeline:
● Client API
● In-memory storage
● In-memory filter + aggregation
● Durability / persistence
● Metadata Catalog + Management
● Query Language Parser
● Optimized / Compressed storage
● Execution on Compressed Data
● Joins!
● Additional Client Languages
● Outer Joins
● Subquery support
● More advanced analytics
● Cost based optimizer
● Out of core algorithms
● Storage Rearrangement
● Heuristic Query Planner
● Arithmetic expressions
● Date / time Expressions
● Concurrency Control
● Data Model / Type System
● Distributed query execution
● Resource Management
● Online recovery
Milestones marked on the timeline:
● “Let's Build a Database” 🤔
● “Ok now this is pretty good” 😐
● “Look mom! I have a database!” 😃
Arrow Project Goals
“Build a better open source
foundation for data science”
🤔 How is this related to databases?
https://arrow.apache.org/

• Open source project begun in 2016, founded by
Wes McKinney
• Initially focused on applying systems / database
research to data science ecosystems (pandas)

Data scientists working with “small” data have not experienced great pain
Small Data (< ~10GB)
Medium Data (~10 - ~100GB)
Big Data (> ~100GB-1TB)
Current Python/R stack begins to “fail” around this point
Users doing fine here
NYC R Conference 2018-04-20

We can do so much better through modern systems techniques
● Multi-core algorithms, GPU acceleration, code generation (LLVM)
● Lazy evaluation, “query” optimization
● Sophisticated memory management, efficient access to huge data sets
● Interoperable memory models, zero-copy interchange between system components
Note 1: Moore’s Law (and small data) enabled us to get by for a long time without confronting some of these challenges
Note 2: Most of these methods have already been widely employed in analytic databases. Limited “novel” research needed
NYC R Conference 2018-04-20

• Language agnostic, state-of-the-art library for building fast data systems
• Arrow Memory Format
• Compute Kernels (not standardized)
• Parquet File Format
• Flight: Data RPC / Network Transfer format
• Query Engines (not standardized)

Arrow Implementations and Bindings
Java
C++
JavaScript
Go
Rust
C
Ruby
Python
Julia
R
C#
Matlab

https://www.snowflake.com/blog/fetching-query-results-from-snowflake-just-got-a-lot-faster-with-apache-arrow/

https://medium.com/google-cloud/announcing-google-cloud-bigquery-version-1-17-0-1fc428512171

Apache Arrow in Academic Papers

Some Industry Contributors in Apache Arrow
ClearCode

Arrow == toolkit for modern analytic databases
match tool_needed {
File Format (persistence) => Parquet
Columnar memory representation => Arrow Arrays
Operations (e.g. add, multiply) => Compute Kernels
Network transfer => Arrow Flight IPC
_ => ... to be continued ...
}

Code Examples
Thesis: “When writing an analytic database, you will end up implementing the Arrow feature set”
Compare plain Rust with Rust using the Arrow library (Rust + Arrow)
* Take performance comparisons with a large grain of salt

InfluxDB line protocol
weather,location=us-east temperature=82,humidity=67 1465839830100400200
weather,location=us-midwest temperature=82,humidity=65 1465839830100400200
weather,location=us-west temperature=70,humidity=54 1465839830100400200
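Line Protocol Tutorial (link)
Measurement: weather
Tags: location=us-east
Fields: temperature=82,humidity=67
Timestamp: 1465839830100400200 (nanoseconds since the Unix epoch)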
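To make the four parts of a line concrete, here is a minimal sketch in plain Rust that splits one of the example lines into measurement, tags, fields and timestamp. The `ParsedLine` type and `parse_line` function are hypothetical names made up for this illustration; they are not the IOx parser, and the sketch assumes the simple subset shown on the slide (no escaping, no quoted strings, numeric fields only).

```rust
/// One parsed line: measurement, tag pairs, field pairs, timestamp (ns).
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
struct ParsedLine {
    measurement: String,
    tags: Vec<(String, String)>,
    fields: Vec<(String, f64)>,
    timestamp: i64,
}

/// Hypothetical helper: handles only the simple subset of line protocol
/// used in the examples (no escaping, no quoted strings, numeric fields).
fn parse_line(line: &str) -> Option<ParsedLine> {
    // The three sections are separated by single spaces.
    let mut sections = line.split(' ');
    let series = sections.next()?;
    let field_set = sections.next()?;
    let timestamp = sections.next()?.parse::<i64>().ok()?;

    // "weather,location=us-east" -> measurement, then key=value tags.
    let mut series_parts = series.split(',');
    let measurement = series_parts.next()?.to_string();
    let tags = series_parts
        .map(|kv| {
            let (k, v) = kv.split_once('=')?;
            Some((k.to_string(), v.to_string()))
        })
        .collect::<Option<Vec<_>>>()?;

    // "temperature=82,humidity=67" -> key=value fields.
    let fields = field_set
        .split(',')
        .map(|kv| {
            let (k, v) = kv.split_once('=')?;
            Some((k.to_string(), v.parse::<f64>().ok()?))
        })
        .collect::<Option<Vec<_>>>()?;

    Some(ParsedLine { measurement, tags, fields, timestamp })
}

fn main() {
    let line = "weather,location=us-east temperature=82,humidity=67 1465839830100400200";
    let parsed = parse_line(line).expect("valid line");
    println!("{:?}", parsed);
}
```

Tags are kept as strings and fields as numbers here because that is what the example lines contain; a real parser would need to handle more field types.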

IOx Data Model
Table: weather
location       temperature  humidity  timestamp
"us-east"      82           67        2016-06-13T17:43:50.1004002Z
"us-midwest"   82           65        2016-06-13T17:43:50.1004002Z
"us-west"      70           54        2016-06-13T17:43:50.1004002Z
"us-east"      83           69        2016-06-13T17:43:50.2004002Z
"us-midwest"   87           78        2016-06-13T17:43:50.2004002Z
"us-west"      72           56        2016-06-13T17:43:50.2004002Z
"us-east"      84           67        2016-06-13T17:43:50.3004002Z
"us-midwest"   90           82        2016-06-13T17:43:50.3004002Z
"us-west"      71           57        2016-06-13T17:43:50.3004002Z
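The table can be pictured as one vector per column rather than one struct per row. The sketch below shows that idea in plain Rust; `Row` and `WeatherTable` are hypothetical types invented for this example, and the real IOx data model (which uses Arrow arrays) is more general.

```rust
/// Hypothetical row type: one parsed line of the `weather` measurement.
struct Row {
    location: &'static str,
    temperature: f64,
    humidity: f64,
    timestamp_ns: i64,
}

/// Hypothetical columnar table: one Vec per column, all the same length.
#[derive(Default)]
struct WeatherTable {
    location: Vec<String>,
    temperature: Vec<f64>,
    humidity: Vec<f64>,
    timestamp_ns: Vec<i64>,
}

impl WeatherTable {
    /// Transposes rows into columns.
    fn from_rows(rows: &[Row]) -> Self {
        let mut table = WeatherTable::default();
        for row in rows {
            table.location.push(row.location.to_string());
            table.temperature.push(row.temperature);
            table.humidity.push(row.humidity);
            table.timestamp_ns.push(row.timestamp_ns);
        }
        table
    }

    fn num_rows(&self) -> usize {
        self.timestamp_ns.len()
    }
}

fn main() {
    let rows = [
        Row { location: "us-east", temperature: 82.0, humidity: 67.0, timestamp_ns: 1465839830100400200 },
        Row { location: "us-midwest", temperature: 82.0, humidity: 65.0, timestamp_ns: 1465839830100400200 },
        Row { location: "us-west", temperature: 70.0, humidity: 54.0, timestamp_ns: 1465839830100400200 },
    ];
    let table = WeatherTable::from_rows(&rows);
    // A query over one column only needs to read that column's Vec.
    let max_temp = table.temperature.iter().copied().fold(f64::MIN, f64::max);
    println!("{} rows, max temperature {}", table.num_rows(), max_temp);
}
```

The point of the layout is that an aggregate such as the maximum temperature scans one contiguous vector and never touches the other columns.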

Motivating Example
“Find the rows that are not in `us-west`”

Create the Array
// Plain Rust
let string_vec: Vec<String> = (0..NUM_TAGS)
    .map(|i| {
        match i % 3 {
            0 => "us-east",
            1 => "us-midwest",
            _ => "us-west", // catch-all arm keeps the match exhaustive
        }
        .into()
    })
    .collect();
> created array with 10000000 elements
~600ms

// Rust + Arrow
let mut builder = StringBuilder::new(NUM_TAGS);
(0..NUM_TAGS).for_each(|i| {
    let location = match i % 3 {
        0 => "us-east",
        1 => "us-midwest",
        _ => "us-west", // catch-all arm keeps the match exhaustive
    };
    builder.append_value(location).unwrap()
});
let array = builder.finish();
> created array with 10000000 elements
~400ms

Memory Footprint
// Plain Rust
let size = size_of::<Vec<String>>()
    + string_vec
        .iter()
        .fold(0, |sz, s| sz + size_of::<String>() + s.len());
println!("total size: {} bytes", size);
> total size: 320000023 bytes
~320 MB *

// Rust + Arrow
println!("total size: {} bytes", array.get_array_memory_size());
~150 MB
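Much of the gap comes from layout. Arrow's variable-length string arrays keep all string bytes in one contiguous buffer and find each value through an offsets buffer, whereas `Vec<String>` pays for a separate `String` header and heap allocation per row. The sketch below mimics that layout in plain Rust; `StringColumn` is a hypothetical type for illustration, not the Arrow implementation, and it leaves out the validity bitmap and any spare capacity a builder may allocate.

```rust
/// Hypothetical miniature of a variable-length string column:
/// one contiguous byte buffer plus an offsets buffer with len + 1 entries.
struct StringColumn {
    offsets: Vec<i32>,
    data: Vec<u8>,
}

impl StringColumn {
    fn from_strs(values: &[&str]) -> Self {
        let mut offsets = Vec::with_capacity(values.len() + 1);
        let mut data = Vec::new();
        offsets.push(0);
        for v in values {
            data.extend_from_slice(v.as_bytes());
            // Each offset records where the next value starts.
            offsets.push(data.len() as i32);
        }
        Self { offsets, data }
    }

    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Value i is the byte range offsets[i]..offsets[i + 1].
    fn value(&self, i: usize) -> &str {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        std::str::from_utf8(&self.data[start..end]).expect("built from valid UTF-8")
    }

    /// Bytes used by the two buffers (ignores spare Vec capacity).
    fn buffer_bytes(&self) -> usize {
        self.offsets.len() * std::mem::size_of::<i32>() + self.data.len()
    }
}

fn main() {
    let column = StringColumn::from_strs(&["us-east", "us-midwest", "us-west"]);
    println!("{} values, second = {}", column.len(), column.value(1));
    println!("buffers use {} bytes", column.buffer_bytes());
}
```

With 32-bit offsets each row costs 4 bytes plus its string bytes, about 12 bytes per row for this data, compared with `size_of::<String>()` plus the string bytes per row in the plain Rust measurement. That is in the same range as the ~150 MB on the slide; the slide does not break down the remainder.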

Find Rows != “us-west”
// Plain Rust
let not_west_bitset: Vec<bool> = string_vec
    .iter()
    .map(|s| s != "us-west")
    .collect();
let num_not_west = not_west_bitset
    .iter()
    .filter(|&&v| v)
    .count();

// Rust + Arrow
let not_west_bitset = neq_utf8_scalar(&array, "us-west").unwrap();
let num_not_west = not_west_bitset
    .iter()
    .filter(|v| matches!(v, Some(true)))
    .count();
> Found 6666667 not in west
~50ms
~120ms
+

Find Rows != “us-west” (with null handling)
// Plain Rust
let string_vec: Vec<Option<String>> = ...;
let not_west_bitset: Vec<bool> = string_vec
    .iter()
    .map(|s| {
        s.as_ref()
            .map(|s| s != "us-west")
            .unwrap_or(false)
    })
    .collect();
let num_not_west = not_west_bitset
    .iter()
    .filter(|&&v| v)
    .count();
+
Same as previous
~50ms
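The Arrow code does not change because Arrow arrays track nulls separately from the values, in a validity bitmap with one bit per row. A rough plain-Rust sketch of that idea follows; `NullableColumn` is a hypothetical type for illustration and stores values as `Vec<String>` for brevity, which is not how Arrow stores them.

```rust
/// Hypothetical nullable string column: values plus a packed validity
/// bitmap (bit i set => row i is non-null), least-significant bit first.
struct NullableColumn {
    values: Vec<String>, // null slots hold an empty placeholder
    validity: Vec<u8>,
}

impl NullableColumn {
    fn from_options(input: &[Option<&str>]) -> Self {
        let mut values = Vec::with_capacity(input.len());
        let mut validity = vec![0u8; (input.len() + 7) / 8];
        for (i, v) in input.iter().enumerate() {
            match v {
                Some(s) => {
                    validity[i / 8] |= 1 << (i % 8);
                    values.push(s.to_string());
                }
                None => values.push(String::new()),
            }
        }
        Self { values, validity }
    }

    fn is_valid(&self, i: usize) -> bool {
        (self.validity[i / 8] & (1 << (i % 8))) != 0
    }

    /// Counts rows that are non-null and not equal to `needle`.
    /// Null rows count as "no match", like `unwrap_or(false)` above.
    fn count_not_equal(&self, needle: &str) -> usize {
        self.values
            .iter()
            .enumerate()
            .filter(|(i, v)| self.is_valid(*i) && v.as_str() != needle)
            .count()
    }
}

fn main() {
    let column = NullableColumn::from_options(&[
        Some("us-east"),
        None,
        Some("us-west"),
        Some("us-midwest"),
    ]);
    println!("{} rows not in west", column.count_not_equal("us-west"));
}
```

Keeping nullness out of the value storage means the same comparison code can serve nullable and non-nullable columns, at a cost of one bit per row.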

Materialize rows for future processing
// Plain Rust
let not_west: Vec<String> = not_west_bitset
    .iter()
    .enumerate()
    .filter_map(|(i, &v)| {
        if v {
            Some(string_vec[i].clone())
        } else {
            None
        }
    })
    .collect();
> Made array of 6666667 Strings not in west
~450 ms

// Rust + Arrow
let not_west = filter(&array, &not_west_bitset).unwrap();
> Made array of 6666667 Strings not in west
~50 ms

More efficient encoding (dictionary)
// Rust + Arrow
let vb = StringBuilder::new();
let kb = Int8Builder::new();
// keys builder first, then values builder
let mut builder = StringDictionaryBuilder::new(kb, vb);
(0..NUM_TAGS).for_each(|i| {
    let location = match i % 3 {
        0 => "us-east",
        1 => "us-midwest",
        _ => "us-west", // catch-all arm keeps the match exhaustive
    };
    builder.append(location).unwrap();
});
let array = builder.finish();
10MB
250 ms

dictionary
[0] "us-east"
[1] "us-midwest"
[2] "us-west"
Location (keys, [u8])
0, 1, 2, 0, 1, 2, 0, 1, 2
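What the dictionary builder does can be sketched in plain Rust: store each distinct string once and keep a small integer key per row. `DictColumn` below is a hypothetical type for illustration, not the Arrow `DictionaryArray`; it also shows one reason the encoding helps queries, since a predicate only needs to be evaluated once per dictionary entry.

```rust
use std::collections::HashMap;

/// Hypothetical dictionary-encoded column: each distinct string is stored
/// once in `dictionary`; rows hold a small integer key into it.
struct DictColumn {
    dictionary: Vec<String>,
    keys: Vec<i8>,
}

impl DictColumn {
    fn from_strs(values: &[&str]) -> Self {
        let mut dictionary: Vec<String> = Vec::new();
        let mut index: HashMap<String, i8> = HashMap::new();
        let mut keys = Vec::with_capacity(values.len());
        for v in values {
            let key = match index.get(*v) {
                Some(k) => *k,
                None => {
                    // Sketch only: i8 keys allow at most 128 distinct values.
                    assert!(dictionary.len() <= i8::MAX as usize, "too many distinct values");
                    let k = dictionary.len() as i8;
                    dictionary.push(v.to_string());
                    index.insert(v.to_string(), k);
                    k
                }
            };
            keys.push(key);
        }
        Self { dictionary, keys }
    }

    /// Row-wise `!= needle`, comparing strings once per dictionary entry
    /// and then looking up the result by key for every row.
    fn not_equal(&self, needle: &str) -> Vec<bool> {
        let entry_differs: Vec<bool> = self
            .dictionary
            .iter()
            .map(|d| d.as_str() != needle)
            .collect();
        self.keys.iter().map(|&k| entry_differs[k as usize]).collect()
    }
}

fn main() {
    let rows: Vec<&str> = (0..9)
        .map(|i| match i % 3 {
            0 => "us-east",
            1 => "us-midwest",
            _ => "us-west",
        })
        .collect();
    let column = DictColumn::from_strs(&rows);
    let not_west = column.not_equal("us-west");
    println!("dictionary = {:?}", column.dictionary);
    println!("keys = {:?}", column.keys);
    println!("{} rows not in west", not_west.iter().filter(|&&v| v).count());
}
```

With one byte per key, 10 million rows need roughly 10 MB for the keys plus a tiny dictionary, which matches the order of magnitude on the slide.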

SIMD Anyone?
// Rust + Arrow
let output = gt(&left, &right).unwrap();

left      right   output
10    >   2       1
20    >   33      0
17    >   2       1
5     >   1       1
23    >   6       1
5     >   7       0
9     >   8       1
12    >   2       1
4     >   7       0
5     >   2       1
76    >   5       1
2     >   6       0
3     >   7       0
5     >   8       0
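For comparison, here is a scalar plain-Rust stand-in for such a `gt` kernel, using the values from the slide. It is a sketch, not SIMD code and not the Arrow kernel: `gt_bitmap` and `get_bit` are hypothetical names, and the output is packed one bit per row, least-significant bit first, which is an assumption of this example. The fixed-width, branch-free inner loop has the shape that SIMD implementations process several lanes at a time.

```rust
/// Hypothetical scalar stand-in for a `gt` kernel: compares two
/// equal-length slices and returns one result bit per row, packed
/// least-significant bit first.
fn gt_bitmap(left: &[i32], right: &[i32]) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    let mut out = vec![0u8; (left.len() + 7) / 8];
    // Eight rows feed each output byte; the inner loop has no
    // data-dependent branches.
    for (byte, (l, r)) in out.iter_mut().zip(left.chunks(8).zip(right.chunks(8))) {
        for (bit, (a, b)) in l.iter().zip(r).enumerate() {
            *byte |= ((a > b) as u8) << bit;
        }
    }
    out
}

/// Reads result bit i back out of the packed bitmap.
fn get_bit(bitmap: &[u8], i: usize) -> bool {
    ((bitmap[i / 8] >> (i % 8)) & 1) == 1
}

fn main() {
    let left = [10, 20, 17, 5, 23, 5, 9, 12, 4, 5, 76, 2, 3, 5];
    let right = [2, 33, 2, 1, 6, 7, 8, 2, 7, 2, 5, 6, 7, 8];
    let output = gt_bitmap(&left, &right);
    for i in 0..left.len() {
        println!("{} > {} = {}", left[i], right[i], get_bit(&output, i) as u8);
    }
}
```

Packing the result as bits rather than `Vec<bool>` uses an eighth of the memory and lets later steps combine masks with bitwise operations on whole bytes.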

SIMD Implementation
#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), feature = "simd"))]
fn simd_compare_op<T, F>(
    left: &PrimitiveArray<T>,
    right: &PrimitiveArray<T>,
    op: F,
) -> Result<BooleanArray>
where
    T: ArrowNumericType,
    F: Fn(T::Simd, T::Simd) -> T::SimdMask,
{
    // use / error checking elided
    let null_bit_buffer =
        combine_option_bitmap(left.data_ref(), right.data_ref(), len)?;

    let lanes = T::lanes();
    let mut result = MutableBuffer::new(left.len() * mem::size_of::<bool>());

    // Compare full groups of `lanes` values at a time
    let rem = len % lanes;
    for i in (0..len - rem).step_by(lanes) {
        let simd_left = T::load(left.value_slice(i, lanes));
        let simd_right = T::load(right.value_slice(i, lanes));
        let simd_result = op(simd_left, simd_right);
        T::bitmask(&simd_result, |b| {
            result.write(b).unwrap();
        });
    }

    // Handle the remaining values that do not fill a full group
    if rem > 0 {
        let simd_left = T::load(left.value_slice(len - rem, lanes));
        let simd_right = T::load(right.value_slice(len - rem, lanes));
        let simd_result = op(simd_left, simd_right);
        let rem_buffer_size = (rem as f32 / 8f32).ceil() as usize;
        T::bitmask(&simd_result, |b| {
            result.write(&b[0..rem_buffer_size]).unwrap();
        });
    }

    let data = ArrayData::new(
        DataType::Boolean,
        left.len(),
        None,
        null_bit_buffer,
        0,
        vec![result.freeze()],
        vec![],
    );
    Ok(PrimitiveArray::<BooleanType>::from(Arc::new(data)))
}
Source: arrow/src/compute/kernels/comparison.rs

Other things needed in a database
Vec<Option<String>> to support nulls
Handle other data types with same code
Vectorized implementations of filter, aggregate, etc
Persist it to storage
Send data over the network
Ecosystem compatibility
...
