Apache Arrow: Leveling Up the Data Science Stack

Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
Ursa Labs Mission
https://ursalabs.org
● Build cross-language, portable computational libraries for data science
● Grow the Apache Arrow ecosystem
● Funding and employment for full-time open source developers
● Not-for-profit, funded by multiple corporations
Strategic Partnership Model
Life without Arrow: up to 80-90% of CPU cycles spent on de/serialization
Life with Arrow: no de/serialization
Arrow C++ Platform
[Diagram: Multi-core Work Scheduler; Core Data Platform; Query Engine; Datasets Framework; Arrow Flight RPC; Network; Storage]
Arrow Core
● Columnar format objects and utilities
● Memory management and generic IO
● Binary protocol / serialization functions
● Memory-mapping and zero-copy “parsing”
● Integration testing
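The zero-copy "parsing" idea can be sketched in a few lines. This is a minimal illustration in plain Python using only the standard library, not the Arrow API: the names `buf`, `column`, and `copied` are made up for this example, and the buffer layout (four native-endian 32-bit integers) is an assumption. It contrasts decoding a buffer into new objects with viewing the same bytes in place.

```python
import struct

# Assumption for illustration: a buffer holding four native-endian int32 values,
# as it might arrive from a memory-mapped file or a socket.
buf = bytearray(struct.pack("4i", 10, 20, 30, 40))

# Conventional parsing: decode every value into a new Python object (a copy).
copied = list(struct.unpack("4i", buf))

# Zero-copy "parsing": reinterpret the same bytes as an int32 column.
# No values are decoded or copied until they are accessed.
column = memoryview(buf).cast("i")

# Overwrite the first value in the underlying buffer.
struct.pack_into("i", buf, 0, 99)

# The view reflects the change because it shares memory with buf;
# the copied list does not.
first_in_view = column[0]
first_in_copy = copied[0]
```

Because the view shares memory with the buffer, the cost of "reading" the column does not grow with its length, which is the property the slide refers to.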
C++ Datasets
● Fast read and write of multi-file datasets
● Read only the parts of the dataset relevant to your analysis (“predicate pushdown”)
[Diagram: File Formats (CSV); Storage Systems]
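A rough sketch of what predicate pushdown means for a multi-file dataset, in plain Python rather than the Arrow C++ Datasets API. The `Fragment` class, its `min_year`/`max_year` statistics, and `scan_dataset` are hypothetical names invented for this example; the assumption is that each file carries cheap per-file statistics that can be checked before its contents are read.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One file of a multi-file dataset (hypothetical structure)."""
    name: str
    min_year: int   # per-file statistics, assumed to be available cheaply
    max_year: int
    rows: list      # (year, delay) pairs standing in for the file contents
    reads: int = 0  # counts how often the file contents were actually scanned

    def scan(self):
        self.reads += 1
        return iter(self.rows)


def scan_dataset(fragments, year):
    """Return delays for the given year, skipping files that cannot match."""
    matches = []
    for frag in fragments:
        # Pushdown: evaluate the predicate against file statistics first.
        if not (frag.min_year <= year <= frag.max_year):
            continue
        matches.extend(delay for y, delay in frag.scan() if y == year)
    return matches


dataset = [
    Fragment("part-0", 2012, 2012, [(2012, 5), (2012, 12)]),
    Fragment("part-1", 2013, 2013, [(2013, 31), (2013, 7)]),
    Fragment("part-2", 2014, 2014, [(2014, 44), (2014, 2)]),
]
delays_2013 = scan_dataset(dataset, 2013)
reads_per_file = [frag.reads for frag in dataset]
```

Only `part-1` is scanned here; the other two files are ruled out from their statistics alone.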
Arrow Flight RPC (Messaging)
● Efficient client-server dataset interchange
● Focused on gRPC (Google’s messaging framework), but may support other transports in the future
● It’s fast… really fast
○ Upwards of 3 GB/s server-to-client on localhost
Arrow for R
● Rcpp-based bindings
● https://github.com/apache/arrow/tree/master/r
● Goal: enable R package developers to leverage the Arrow ecosystem for better performance and scalability
Arrow format vs. R data.frame
● Type-independent representation of NA values (bits vs. special values)
● Better computational efficiency for strings
● Naturally chunk-based (vs. large contiguous allocations)
● Supports a much wider variety of data types, including nested data (JSON-like)
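The "bits vs. special values" point can be made concrete with a small sketch. This is plain Python, not the Arrow library: `pack_validity` and `is_valid` are hypothetical helpers written for this example. It assumes the layout used by the Arrow columnar format, where each slot has one validity bit (least-significant bit first within each byte) stored separately from the values, so the same mechanism works for any data type.

```python
def pack_validity(values):
    """Split values into (bitmap, data): one validity bit per slot, LSB first."""
    bitmap = bytearray((len(values) + 7) // 8)
    data = []
    for i, value in enumerate(values):
        if value is None:
            data.append(0)  # placeholder; ignored because the validity bit is 0
        else:
            bitmap[i // 8] |= 1 << (i % 8)
            data.append(value)
    return bytes(bitmap), data


def is_valid(bitmap, i):
    """Return True if slot i holds a real value rather than NA."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


bitmap, data = pack_validity([1, None, 3, None, 5])
validity = [is_valid(bitmap, i) for i in range(5)]
```

By contrast, an R integer vector marks NA with a reserved value inside the data itself, so each type needs its own sentinel.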
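library(dplyr)  # provides %>% and the verbs used below

# Mean arrival and departure delay per day, keeping days where either exceeds 30
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)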
flights can be a massive Arrow dataset.
dplyr verbs can be translated to Arrow computation graphs, executed by a parallel runtime.
R expressions can be JIT-compiled with LLVM.
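One way to picture verbs being translated into a computation graph is a table object that records each verb instead of running it, and only executes the recorded plan on request. The sketch below is plain Python and purely illustrative: `LazyTable`, `summarise_mean`, and `collect` are hypothetical names, not dplyr or Arrow APIs, the `select` step is omitted for brevity, and execution is a simple single-threaded loop rather than a parallel runtime.

```python
from collections import defaultdict
from statistics import mean


class LazyTable:
    """Records verbs as a plan; nothing runs until collect() is called."""

    def __init__(self, rows, plan=()):
        self.rows = rows
        self.plan = plan  # tuple of (verb, argument) pairs

    def _then(self, verb, arg):
        return LazyTable(self.rows, self.plan + ((verb, arg),))

    def group_by(self, *keys):
        return self._then("group_by", keys)

    def summarise_mean(self, **columns):
        return self._then("summarise_mean", columns)

    def filter(self, predicate):
        return self._then("filter", predicate)

    def collect(self):
        rows, keys = self.rows, ()
        for verb, arg in self.plan:
            if verb == "group_by":
                keys = arg
            elif verb == "summarise_mean":
                groups = defaultdict(list)
                for row in rows:
                    groups[tuple(row[k] for k in keys)].append(row)
                # Mean per group, skipping missing values (like na.rm = TRUE).
                # Assumes every group has at least one non-missing value.
                rows = [
                    {**dict(zip(keys, key)),
                     **{out: mean(r[src] for r in members if r[src] is not None)
                        for out, src in arg.items()}}
                    for key, members in groups.items()
                ]
            elif verb == "filter":
                rows = [row for row in rows if arg(row)]
        return rows


flights = [
    {"year": 2013, "month": 1, "day": 1, "arr_delay": 40, "dep_delay": 20},
    {"year": 2013, "month": 1, "day": 1, "arr_delay": None, "dep_delay": 30},
    {"year": 2013, "month": 1, "day": 2, "arr_delay": 5, "dep_delay": 10},
]

query = (
    LazyTable(flights)
    .group_by("year", "month", "day")
    .summarise_mean(arr="arr_delay", dep="dep_delay")
    .filter(lambda row: row["arr"] > 30 or row["dep"] > 30)
)
planned_verbs = [verb for verb, _ in query.plan]
result = query.collect()
```

Because the whole plan is known before anything runs, an engine is free to reorder, fuse, or parallelize the steps, which is what makes this style a good fit for large datasets.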
Keep up to date at
https://arrow.apache.org
https://ursalabs.org
https://wesmckinney.com
Thanks