7. Arrow C++ Platform
Multi-core Work Scheduler
Core Data
Platform
Query
Engine
Datasets
Framework
Arrow Flight RPC
Network
Storage
8. ● Columnar format objects and utilities
● Memory management and generic IO
● Binary protocol / serialization functions
● Memory-mapping and zero-copy “parsing”
● Integration testing
Arrow Core
9. ● Fast read and write of multi-file datasets
● Read only the parts of the dataset relevant to your analysis
(“predicate pushdown”)
C++ Datasets
File Formats Storage Systems
CSV
11. Arrow Flight RPC (Messaging)
● Efficient client-server dataset interchange
● Focused on gRPC (Google’s messaging framework), but may
support other transports in future
● It’s fast… really fast
○ Upwards 3GB/s server-to-client on localhost
12. Arrow for R
● Rcpp-based bindings
● https://github.com/apache/arrow/tree/master/r
● Goal: enable R package developers to leverage
Arrow ecosystem for better performance and
scalability
13. Arrow format vs. R data.frame
● Type-independent representation of NA values (bits vs. special
values)
● Better computational efficiency for strings
● Naturally chunk-based (vs. large contiguous allocations)
● Supports a much wider variety of data types, including nested
data (JSON-like)
17. flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
Can be a massive Arrow dataset
18. flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
dplyr verbs can be
translated to Arrow
computation graphs,
executed by parallel
runtime
Can be a massive Arrow dataset
19. flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
dplyr verbs can be
translated to Arrow
computation graphs,
executed by parallel
runtime
R expressions can be JIT-compiled with LLVM
Can be a massive Arrow dataset
20. Keep up to date at
https://arrow.apache.org
https://ursalabs.org
https://wesmckinney.com
Thanks