Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Apache Arrow: Leveling Up the Data Science Stack

1,529 views

Published on

Wes McKinney Talk from NYC R Conference, May 10 2019

Published in: Technology
  • Be the first to comment

Apache Arrow: Leveling Up the Data Science Stack

  1. 1. https://ursalabs.org ● Build cross-language, portable computational libraries for data science ● Grow Apache Arrow ecosystem ● Funding and employment for full-time open source developers ● Not-for-profit, funded by multiple corporations Ursa Labs Mission
  2. 2. Strategic Partnership Model
  3. 3. • • • •
  4. 4. Up to 80-90% of CPU cycles spent on de/serialization Life without Arrow Life with Arrow No de/serialization
  5. 5. • • • •
  6. 6. Arrow C++ Platform Multi-core Work Scheduler Core Data Platform Query Engine Datasets Framework Arrow Flight RPC Network Storage
  7. 7. ● Columnar format objects and utilities ● Memory management and generic IO ● Binary protocol / serialization functions ● Memory-mapping and zero-copy “parsing” ● Integration testing Arrow Core
  8. 8. ● Fast read and write of multi-file datasets ● Read only the parts of the dataset relevant to your analysis (“predicate pushdown”) C++ Datasets File Formats Storage Systems CSV
  9. 9. • • • •
  10. 10. Arrow Flight RPC (Messaging) ● Efficient client-server dataset interchange ● Focused on gRPC (Google’s messaging framework), but may support other transports in future ● It’s fast… really fast ○ Upwards 3GB/s server-to-client on localhost
  11. 11. Arrow for R ● Rcpp-based bindings ● https://github.com/apache/arrow/tree/master/r ● Goal: enable R package developers to leverage Arrow ecosystem for better performance and scalability
  12. 12. Arrow format vs. R data.frame ● Type-independent representation of NA values (bits vs. special values) ● Better computational efficiency for strings ● Naturally chunk-based (vs. large contiguous allocations) ● Supports a much wider variety of data types, including nested data (JSON-like)
  13. 13. • • • • •
  14. 14. • • • •
  15. 15. • • • • • • • •
  16. 16. flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) Can be a massive Arrow dataset
  17. 17. flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) dplyr verbs can be translated to Arrow computation graphs, executed by parallel runtime Can be a massive Arrow dataset
  18. 18. flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) dplyr verbs can be translated to Arrow computation graphs, executed by parallel runtime R expressions can be JIT-compiled with LLVM Can be a massive Arrow dataset
  19. 19. Keep up to date at https://arrow.apache.org https://ursalabs.org https://wesmckinney.com Thanks

×