Apache Arrow @ RMS
March 2019
The World's Leading Catastrophe Risk
Modeling Company
From earthquakes, hurricanes, and floods to terrorism and
infectious diseases, RMS helps financial institutions and
public agencies understand, quantify, and manage risk
So what do we actually do?
● Models
○ We have complex models for various types of risk
■ Fire, flood, earthquakes, etc.
○ Our customers run our models against their portfolios of risk items (e.g.
properties) to understand financial impact
○ The models produce a lot of data
● Interactive Queries
○ Insurance analysts are similar to data scientists
○ Lots of result data to slice and dice and visualize
○ Low latency analytics on relatively large datasets
■ Too much for a SQL database but not PB scale
RMS Datastore Stack
Intelligent query parsing, rewriting
and routing.
Cost-based optimizations.
Ability to use different query
engines depending on use case or
size of data set.
Query Service 1.0
● Native Query Execution
○ Scala code, using Apache Arrow and Parquet libraries
○ Column-based file readers with projection push-down
○ Row-based query execution
○ Apache Arrow for the type system
● Performance
○ Order of magnitude improvements compared to Spark for some use cases
○ Slower than Spark for other use cases (larger data sets, JOINs, etc.)
● SQL Interface
○ Apache Hive for our internal SQL dialect
○ Apache Hive protocol for compatibility with ODBC/JDBC drivers
○ REST API for integration with microservices
Query Service Conclusions & Next Steps
● The Query Service was successful
○ Reduced TCO (fewer Spark nodes required)
○ Improved performance for interactive queries
● In my spare time I had been working on an open source project called
DataFusion
○ DataFusion started out as a generic Rust query engine
○ I felt that Rust was much better suited to a query engine than the JVM
○ I learned a lot more about Apache Arrow and the benefits of columnar
processing
● So how could we leverage this at RMS?
○ I donated the initial Rust implementation of Apache Arrow and later donated
DataFusion as well
Why Columnar?
Row vs Column
Source code available:
https://github.com/andygrove/row-vs-col-rs
Compares:
● Rust Vec<Row>
● Rust Vec<Column>
● Rust Vec<Array> // Apache Arrow
Columnar benefits:
● Cache pipelining
● SIMD (single instruction, multiple data)
● GPU vectorized processing
[Chart: benchmark of the three layouts; higher is better]
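To make the row vs column comparison concrete, here is a minimal sketch of the two layouts. The `Row` and `Columns` types and the function names are hypothetical, made up for illustration; they are not taken from the row-vs-col-rs repository, and the sketch shows only the memory layout difference, not the benchmark itself.

```rust
// Hypothetical record type for illustration; not taken from row-vs-col-rs.
struct Row {
    id: u32,
    name: String,
    value: f64,
}

// Columnar layout: one contiguous Vec per field.
struct Columns {
    ids: Vec<u32>,
    names: Vec<String>,
    values: Vec<f64>,
}

// Pivot row-oriented data into one Vec per column.
fn to_columns(rows: &[Row]) -> Columns {
    Columns {
        ids: rows.iter().map(|r| r.id).collect(),
        names: rows.iter().map(|r| r.name.clone()).collect(),
        values: rows.iter().map(|r| r.value).collect(),
    }
}

// Row layout: each step strides over a whole Row (id, name, value)
// just to read a single f64.
fn sum_rows(rows: &[Row]) -> f64 {
    rows.iter().map(|r| r.value).sum()
}

// Column layout: the f64 values sit next to each other in memory,
// which is friendlier to CPU caches and to SIMD.
fn sum_columns(cols: &Columns) -> f64 {
    cols.values.iter().sum()
}

fn main() {
    let rows: Vec<Row> = (0..1000u32)
        .map(|i| Row { id: i, name: format!("item-{}", i), value: f64::from(i) })
        .collect();
    let cols = to_columns(&rows);
    assert_eq!(cols.ids.len(), cols.names.len());
    // Both layouts hold the same data, so the sums agree.
    assert_eq!(sum_rows(&rows), sum_columns(&cols));
    println!("sum = {}", sum_columns(&cols));
}
```

Both functions compute the same answer; the difference is only in how many unrelated bytes the CPU has to pull through the cache to get there.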
Apache Arrow
Apache Arrow
● Standardized language-independent columnar memory format
○ for flat and hierarchical data
○ organized for efficient analytic operations on modern hardware
■ Vectorized processing, SIMD, GPU
● Implementations available for many programming languages
○ C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
● Zero-copy interprocess communication
○ IPC metadata defined in FlatBuffers format
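As a rough illustration of what "columnar memory format" means for a single nullable column, the sketch below pairs a values buffer with a validity bitmap (one bit per row, least-significant bit first, set when the row is non-null). `Int32Column` and its methods are hypothetical names for this example, not the Arrow crate's API, and the real format also specifies details such as buffer alignment and padding that are left out here.

```rust
// Hypothetical simplified column; not the Arrow crate's actual types.
struct Int32Column {
    values: Vec<i32>,  // one slot per row, null rows hold a placeholder
    validity: Vec<u8>, // bit i set => row i is non-null (LSB first)
}

impl Int32Column {
    // Build the two buffers from a slice of optional values.
    fn from_options(data: &[Option<i32>]) -> Self {
        let mut values = Vec::with_capacity(data.len());
        let mut validity = vec![0u8; (data.len() + 7) / 8];
        for (i, item) in data.iter().enumerate() {
            match item {
                Some(v) => {
                    values.push(*v);
                    validity[i / 8] |= 1u8 << (i % 8);
                }
                None => values.push(0),
            }
        }
        Int32Column { values, validity }
    }

    // Check the validity bit for row i.
    fn is_valid(&self, i: usize) -> bool {
        (self.validity[i / 8] & (1u8 << (i % 8))) != 0
    }

    // Read row i, returning None for a null row.
    fn get(&self, i: usize) -> Option<i32> {
        if self.is_valid(i) { Some(self.values[i]) } else { None }
    }

    // Sum the non-null rows only.
    fn sum(&self) -> i64 {
        (0..self.values.len())
            .filter(|&i| self.is_valid(i))
            .map(|i| i64::from(self.values[i]))
            .sum()
    }
}

fn main() {
    let col = Int32Column::from_options(&[Some(1), None, Some(3)]);
    println!("row 1 = {:?}, sum = {}", col.get(1), col.sum());
}
```

Because the values live in one flat buffer and nulls are tracked separately, the same bytes can be interpreted by any language that follows the same layout.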
Apache Arrow
● Computational libraries
○ C++ libraries that leverage LLVM (donated by Dremio)
○ NVIDIA CUDA support
● Query Engines
○ Ursa Labs initiative
■ C++ query engine
○ DataFusion
■ Rust query engine
Apache Arrow
● 3 years as a top-level project
● Project Management Committee (PMC) members work for ...
○ Cloudera, Databricks, DataStax, Dremio, Hortonworks, Looker, MapR, RMS,
RStudio, Salesforce, Twitter, UC Berkeley RISELab, Ursa Labs, WeWork,
Workday
● Committers work for ...
○ Amazon, CERN, Google, IBM
● Also many individual contributors
● Companies providing financial support (via Ursa Labs)
○ NVIDIA, ODSC, RStudio, Two Sigma
Huge overhead converting
between different data formats
and duplicating data.
Zero-copy data access
Exchange metadata and pointers
to Arrow arrays
DataFusion
Rust-native in-memory query engine for Apache Arrow
Why Rust
● See https://www.rust-lang.org/ for detailed information
● My take
○ Speed of C++ with the safety of Java
○ Memory efficient (no GC)
○ Predictable performance
○ Lower TCO
○ Forces you to think about what you are doing
■ Thread safety has to be explicit
■ Memory management has to be explicit
○ The compiler acts as a peer reviewer … tough but fair
DataFusion current functionality
● SQL query planner and optimizer
● Supported SQL features
○ Projection (SELECT)
○ Selection (WHERE)
○ Aggregates (MIN, MAX, SUM)
● Expressions
○ Identifiers (column names)
○ Literal values
● Operators
○ Arithmetic (+, -, *, /, %)
○ Comparison (<, <=, =, >=, >, !=, etc.)
○ Boolean (AND, OR)
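The feature list above (projection, selection, identifiers, literals, and operators) can be pictured as a small expression tree evaluated one column at a time. The sketch below is an assumption-laden toy: `IntExpr`, `BoolExpr`, `eval_int`, `eval_bool`, and `filter` are hypothetical names for this example and are not DataFusion's actual types or API.

```rust
// Hypothetical expression types; not DataFusion's API.
enum IntExpr {
    Column(usize), // identifier, resolved to a column index
    Literal(i64),
    Add(Box<IntExpr>, Box<IntExpr>),
}

enum BoolExpr {
    Gt(IntExpr, IntExpr),
    Lt(IntExpr, IntExpr),
    And(Box<BoolExpr>, Box<BoolExpr>),
}

// Apply a binary function element-wise to two equal-length columns.
fn zip_with<T: Copy, U>(a: &[T], b: &[T], f: impl Fn(T, T) -> U) -> Vec<U> {
    a.iter().zip(b.iter()).map(|(x, y)| f(*x, *y)).collect()
}

// Evaluate an integer expression for every row, producing a new column.
fn eval_int(expr: &IntExpr, batch: &[Vec<i64>], rows: usize) -> Vec<i64> {
    match expr {
        IntExpr::Column(i) => batch[*i].clone(),
        IntExpr::Literal(v) => vec![*v; rows],
        IntExpr::Add(l, r) => zip_with(
            &eval_int(l, batch, rows),
            &eval_int(r, batch, rows),
            |a, b| a + b,
        ),
    }
}

// Evaluate a predicate for every row, producing a boolean mask.
fn eval_bool(expr: &BoolExpr, batch: &[Vec<i64>], rows: usize) -> Vec<bool> {
    match expr {
        BoolExpr::Gt(l, r) => zip_with(
            &eval_int(l, batch, rows),
            &eval_int(r, batch, rows),
            |a, b| a > b,
        ),
        BoolExpr::Lt(l, r) => zip_with(
            &eval_int(l, batch, rows),
            &eval_int(r, batch, rows),
            |a, b| a < b,
        ),
        BoolExpr::And(l, r) => zip_with(
            &eval_bool(l, batch, rows),
            &eval_bool(r, batch, rows),
            |a, b| a && b,
        ),
    }
}

// Selection: keep only the rows where the mask is true.
fn filter(values: &[i64], mask: &[bool]) -> Vec<i64> {
    values
        .iter()
        .zip(mask.iter())
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}

fn main() {
    // Roughly: SELECT a + b FROM t WHERE a > 1 AND b < 40
    let batch = vec![vec![1, 2, 3], vec![10, 20, 40]];
    let projection = IntExpr::Add(
        Box::new(IntExpr::Column(0)),
        Box::new(IntExpr::Column(1)),
    );
    let predicate = BoolExpr::And(
        Box::new(BoolExpr::Gt(IntExpr::Column(0), IntExpr::Literal(1))),
        Box::new(BoolExpr::Lt(IntExpr::Column(1), IntExpr::Literal(40))),
    );
    let mask = eval_bool(&predicate, &batch, 3);
    let projected = eval_int(&projection, &batch, 3);
    println!("{:?}", filter(&projected, &mask));
}
```

Each operator works on whole columns rather than one row at a time, which is the property that lets a columnar engine benefit from vectorized execution.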
Demo Time
PoC of a Rust-based Query Service using Apache Arrow
Benchmarks!
SELECT
riskitem_occupancyId,
occupancy_occupancyName,
SUM(risk_totalTIV)
FROM
ContractPrimaryRealPropertyView_1234
GROUP BY
riskitem_occupancyId,
occupancy_occupancyName
● Spark
○ Running in local mode
○ Parquet files on local SSD
○ Cached DataFrames
● DataFusion
○ Arrow format “MemTable”
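For readers less familiar with SQL, the benchmark query is a grouped sum. A minimal in-memory sketch of the same aggregation over three columns might look like the following; `sum_by_group` is a hypothetical helper using a standard-library `HashMap`, and the sample values are invented. It is not how DataFusion or Spark executes the query.

```rust
use std::collections::HashMap;

// Hypothetical helper: SUM(tiv) GROUP BY (occupancy id, occupancy name),
// computed over three equal-length columns.
fn sum_by_group<'a>(
    ids: &[u32],
    names: &'a [String],
    tiv: &[f64],
) -> HashMap<(u32, &'a str), f64> {
    assert!(ids.len() == names.len() && ids.len() == tiv.len());
    let mut totals = HashMap::new();
    for ((id, name), value) in ids.iter().zip(names.iter()).zip(tiv.iter()) {
        // The group key is the pair of grouping columns for this row.
        *totals.entry((*id, name.as_str())).or_insert(0.0) += *value;
    }
    totals
}

fn main() {
    // Invented sample data standing in for the real view.
    let ids = vec![1, 2, 1];
    let names = vec![
        "Office".to_string(),
        "Retail".to_string(),
        "Office".to_string(),
    ];
    let tiv = vec![100.0, 50.0, 25.0];
    let totals = sum_by_group(&ids, &names, &tiv);
    for ((id, name), total) in &totals {
        println!("{} {} {}", id, name, total);
    }
}
```

Only the three referenced columns are touched, which is why a wide table (hundreds of columns) favors a columnar layout for this kind of query.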
Benchmark Results
EC2 c5.18xlarge instance
72 vCPUs
144 GB memory
SSD (100 IOPS / 3000 burst)
Data set:
5 million risk items
Wide table (~600 columns)
~16 GB on disk
[Chart: benchmark results, Spark vs. DataFusion; higher is better]
DataFusion Roadmap
Apache Arrow is a “do-ocracy” where the individual contributors get to decide the
roadmap, but here are some things that I am planning to work on:
● DataFrame-style API for building logical query plans, as an alternative to SQL
● Parallel Query Execution (threads, partitions)
● Support for more data sources (Parquet, JSON)
● More complete SQL support (joins, subqueries, columnar UDFs)
● Distributed Execution
○ Distributed query planner & optimizer
○ Kubernetes & Docker deployment model
○ Apache Arrow Flight protocol for streaming data between nodes
Want to contribute?
● Great time to get involved!
○ The code base is still relatively small
■ Core Arrow library is 6k LOC
■ DataFusion is 4k LOC
○ Small number of regular contributors
○ Where to start?
■ https://cwiki.apache.org/confluence/display/ARROW/Rust+JIRA+Dashboard
○ Try adding DataFusion as a crate dependency
Thanks! Questions?
Contact Details
▪ @AndyGrove73
▪ andy.grove@rms.com
▪ https://www.linkedin.com/in/andygrove
Arrow Resources:
▪ @ApacheArrow
▪ https://arrow.apache.org
▪ https://github.com/apache/arrow
