Rust & Apache Arrow @ RMS

The World's Leading Catastrophe Risk
Modeling Company
From earthquakes, hurricanes, and floods to terrorism and
infectious diseases, RMS helps financial institutions and
public agencies understand, quantify, and manage risk

So what do we actually do?
● Models
○ We have complex models for various types of risk
■ Fire, flood, earthquakes, etc.
○ Our customers run our models against their portfolios of risk items (e.g.
properties) to understand financial impact
○ The models produce a lot of data
● Interactive Queries
○ Insurance analysts are similar to data scientists
○ Lots of result data to slice and dice and visualize
○ Low latency analytics on relatively large datasets
■ Too much for a SQL database but not PB scale

RMS Datastore Stack
Intelligent query parsing, rewriting
and routing.
Cost-based optimizations.
Ability to use different query
engines depending on use case or
size of data set.

Query Service 1.0
● Native Query Execution
○ Scala code, using Apache Arrow and Parquet libraries
○ Column-based file readers with projection push-down
○ Row-based query execution
○ Apache Arrow for the type system
● Performance
○ Order of magnitude improvements compared to Spark for some use cases
○ Slower than Spark for other use cases (larger data sets, JOINs, etc.)
● SQL Interface
○ Apache Hive for our internal SQL dialect
○ Apache Hive protocol for compatibility with ODBC/JDBC drivers
○ REST API for integration with microservices

Query Service Conclusions & Next Steps
● The Query Service was successful
○ Reduced TCO (fewer Spark nodes required)
○ Improved performance for interactive queries
● In my spare time I had been working on an open source project called
DataFusion
○ DataFusion started out as a generic Rust query engine
○ I felt that Rust was much better suited to this kind of work than the JVM
○ I learned a lot more about Apache Arrow and the benefits of columnar
processing
● So how could we leverage this at RMS?
○ I donated the initial Rust implementation of Apache Arrow and later donated
DataFusion as well

Row vs Column
Source code available:
https://github.com/andygrove/row-vs-col-rs
Compares:
● Rust Vec<Row>
● Rust Vec<Column>
● Rust Vec<Array> // Apache Arrow
Columnar benefits:
● Cache pipelining
● SIMD (Single instruction, multiple data)
● GPU vectorized processing
[Chart: row vs. column comparison results; higher is better]
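A minimal sketch of the two layouts being compared, using only the standard library. The `Row` and `Columns` types and their field names are hypothetical and are not taken from the linked repository; the Arrow-backed variant is not shown here.

```rust
// Illustrative only: these types and field names are assumptions,
// not the code from the row-vs-col-rs repository.

/// Row-oriented layout: one struct per record.
pub struct Row {
    pub id: u32,
    pub value: f64,
    pub name: String,
}

/// Column-oriented layout: one contiguous vector per field.
pub struct Columns {
    pub id: Vec<u32>,
    pub value: Vec<f64>,
    pub name: Vec<String>,
}

/// Sum one field across rows. Each whole `Row`, including the fields
/// we do not need, sits between consecutive `value`s in memory.
pub fn sum_rows(rows: &[Row]) -> f64 {
    rows.iter().map(|r| r.value).sum()
}

/// Sum one column. Only the contiguous `value` buffer is scanned,
/// which is friendlier to CPU caches and to auto-vectorization.
pub fn sum_column(cols: &Columns) -> f64 {
    cols.value.iter().sum()
}

/// Pivot a vector of rows into the columnar layout.
pub fn to_columns(rows: Vec<Row>) -> Columns {
    let mut cols = Columns {
        id: Vec::new(),
        value: Vec::new(),
        name: Vec::new(),
    };
    for row in rows {
        cols.id.push(row.id);
        cols.value.push(row.value);
        cols.name.push(row.name);
    }
    cols
}
```

Both functions return the same total; the difference is only in how much unrelated data the CPU has to pull through the cache to compute it.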

Apache Arrow
● Standardized language-independent columnar memory format
○ for flat and hierarchical data
○ organized for efficient analytic operations on modern hardware
■ Vectorized processing, SIMD, GPU
● Implementations available for many programming languages
○ C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
● Zero-copy interprocess communication
○ IPC metadata defined in FlatBuffers format
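To make the columnar memory format concrete, here is a simplified, standard-library-only sketch of a nullable 32-bit integer array: a contiguous values buffer plus a validity bitmap with one bit per slot (least-significant bit first, 1 meaning valid). The `Int32Array` type below is a hypothetical stand-in written for illustration and is not the API of the Arrow Rust crate.

```rust
/// Simplified stand-in for an Arrow-style primitive array.
/// Assumption: this mirrors the idea of the format, not the real crate.
pub struct Int32Array {
    /// One slot per element; null slots hold an arbitrary value (0 here).
    values: Vec<i32>,
    /// One bit per element, least-significant bit first; 1 = valid.
    validity: Vec<u8>,
    len: usize,
}

impl Int32Array {
    /// Build the two buffers from a slice of optional values.
    pub fn from_options(data: &[Option<i32>]) -> Self {
        let mut values = Vec::with_capacity(data.len());
        let mut validity = vec![0u8; (data.len() + 7) / 8];
        for (i, item) in data.iter().enumerate() {
            match item {
                Some(v) => {
                    values.push(*v);
                    validity[i / 8] |= 1u8 << (i % 8);
                }
                None => values.push(0),
            }
        }
        Self { values, validity, len: data.len() }
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Check the validity bit for slot `i`.
    pub fn is_valid(&self, i: usize) -> bool {
        (self.validity[i / 8] & (1u8 << (i % 8))) != 0
    }

    /// Read slot `i`, returning `None` for nulls.
    pub fn get(&self, i: usize) -> Option<i32> {
        if self.is_valid(i) {
            Some(self.values[i])
        } else {
            None
        }
    }

    /// Sum the non-null values, widening to i64 to avoid overflow.
    pub fn sum(&self) -> i64 {
        (0..self.len)
            .filter(|&i| self.is_valid(i))
            .map(|i| i64::from(self.values[i]))
            .sum()
    }
}
```

Keeping nulls in a separate bitmap leaves the values buffer densely packed, which is what makes vectorized scans over it practical.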

Apache Arrow
● Computational libraries
○ C++ libraries that leverage LLVM (donated by Dremio)
○ NVIDIA CUDA support
● Query Engines
○ Ursa Labs initiative
■ C++ query engine
○ DataFusion
■ Rust query engine

Apache Arrow
● 3 years as a top-level project
● Project Management Committee (PMC) members work for ...
○ Cloudera, Databricks, DataStax, Dremio, Hortonworks, Looker, MapR, RMS,
RStudio, Salesforce, Twitter, UC Berkeley RISELab, Ursa Labs, WeWork,
Workday
● Committers work for ...
○ Amazon, CERN, Google, IBM
● Also many individual contributors
● Companies providing financial support (via Ursa Labs)
○ NVIDIA, ODSC, RStudio, Two Sigma

Huge overhead converting
between different data formats
and duplicating data.

Zero-copy data access
Exchange metadata and pointers
to Arrow arrays
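The idea of exchanging metadata and pointers rather than copying data can be sketched within a single process using a reference-counted buffer. The `Float64View` type is hypothetical; real Arrow zero-copy exchange across processes relies on the standardized memory layout and IPC metadata, which this sketch does not model.

```rust
use std::sync::Arc;

/// A view over a shared, immutable buffer. Cloning or slicing a view
/// copies only the pointer, offset, and length, never the values.
/// Assumption: an in-process analogy, not Arrow's IPC mechanism.
#[derive(Clone)]
pub struct Float64View {
    buffer: Arc<[f64]>,
    offset: usize,
    len: usize,
}

impl Float64View {
    /// Move the data into a shared buffer once; later views reuse it.
    pub fn new(data: Vec<f64>) -> Self {
        let len = data.len();
        Self { buffer: Arc::from(data), offset: 0, len }
    }

    /// Create a narrower view onto the same underlying buffer.
    pub fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            buffer: Arc::clone(&self.buffer),
            offset: self.offset + offset,
            len,
        }
    }

    /// Borrow the values covered by this view.
    pub fn values(&self) -> &[f64] {
        &self.buffer[self.offset..self.offset + self.len]
    }

    /// True when both views point at the same allocation.
    pub fn shares_buffer_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.buffer, &other.buffer)
    }
}
```

A consumer that receives a `Float64View` reads the producer's memory directly, which is the in-process analogue of handing another system a pointer to an Arrow array.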

DataFusion
Rust-native in-memory query engine for Apache Arrow

Why Rust
● See https://www.rust-lang.org/ for detailed information
● My take
○ Speed of C++ with the safety of Java
○ Memory efficient (no GC)
○ Predictable performance
○ Lower TCO
○ Forces you to think about what you are doing
■ Thread safety has to be explicit
■ Memory management has to be explicit
○ The compiler acts as a peer reviewer … tough but fair

DataFusion current functionality
● SQL query planner and optimizer
● Supported SQL features
○ Projection (SELECT)
○ Selection (WHERE)
○ Aggregates (MIN, MAX, SUM)
● Expressions
○ identifiers (column names)
○ Literal values
● Operators
○ Arithmetic (+, -, *, /, %)
○ Comparison (<, <=, =, >=, >, !=, etc.)
○ Binary (AND, OR)
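As a rough illustration of how identifiers, literals, and operators like those listed above can be evaluated a column at a time, here is a small standard-library-only sketch. The `Expr`, `Op`, and `Values` types and the `evaluate` function are hypothetical and are not DataFusion's actual types or API; only integer columns and a subset of the operators are covered.

```rust
/// Result of evaluating an expression over a whole batch.
#[derive(Debug, Clone, PartialEq)]
pub enum Values {
    Int(Vec<i64>),
    Bool(Vec<bool>),
}

/// A subset of arithmetic, comparison, and boolean operators.
#[derive(Debug, Clone, Copy)]
pub enum Op { Plus, Minus, Multiply, Gt, Eq, And, Or }

/// Hypothetical expression tree: column index, literal, or binary op.
pub enum Expr {
    Column(usize),
    Literal(i64),
    Binary(Box<Expr>, Op, Box<Expr>),
}

/// Evaluate `expr` against a batch of equal-length integer columns.
pub fn evaluate(expr: &Expr, batch: &[Vec<i64>]) -> Result<Values, String> {
    let rows = batch.first().map_or(0, |c| c.len());
    match expr {
        Expr::Column(i) => batch
            .get(*i)
            .cloned()
            .map(Values::Int)
            .ok_or_else(|| format!("no column at index {}", i)),
        // A literal is broadcast to one value per row.
        Expr::Literal(v) => Ok(Values::Int(vec![*v; rows])),
        Expr::Binary(l, op, r) => {
            let left = evaluate(l, batch)?;
            let right = evaluate(r, batch)?;
            match (left, right) {
                (Values::Int(a), Values::Int(b)) => {
                    let pairs = a.iter().zip(b.iter());
                    match op {
                        Op::Plus => Ok(Values::Int(pairs.map(|(x, y)| x + y).collect())),
                        Op::Minus => Ok(Values::Int(pairs.map(|(x, y)| x - y).collect())),
                        Op::Multiply => Ok(Values::Int(pairs.map(|(x, y)| x * y).collect())),
                        Op::Gt => Ok(Values::Bool(pairs.map(|(x, y)| x > y).collect())),
                        Op::Eq => Ok(Values::Bool(pairs.map(|(x, y)| x == y).collect())),
                        _ => Err(format!("{:?} needs boolean operands", op)),
                    }
                }
                (Values::Bool(a), Values::Bool(b)) => {
                    let pairs = a.iter().zip(b.iter());
                    match op {
                        Op::And => Ok(Values::Bool(pairs.map(|(x, y)| *x && *y).collect())),
                        Op::Or => Ok(Values::Bool(pairs.map(|(x, y)| *x || *y).collect())),
                        _ => Err(format!("{:?} needs integer operands", op)),
                    }
                }
                _ => Err("operand type mismatch".to_string()),
            }
        }
    }
}
```

For example, a predicate such as `(c0 + c1) > 20` is built as nested `Expr::Binary` nodes and evaluates to a boolean mask with one entry per row, which a selection (WHERE) step could then use to filter the batch.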

Demo Time
PoC of a Rust-based Query Service using Apache Arrow

Benchmarks!
SELECT
riskitem_occupancyId,
occupancy_occupancyName,
SUM(risk_totalTIV)
FROM
ContractPrimaryRealPropertyView_1234
GROUP BY
riskitem_occupancyId,
occupancy_occupancyName
● Spark
○ Running in local mode
○ Parquet files on local SSD
○ Cached DataFrames
● DataFusion
○ Arrow format “MemTable”

Benchmark Results
EC2 c5.18xlarge instance
72 vCPUs
144 GB
SSD (100 IOPS / 3000 burst)
Data set:
5MM risk items
Wide table (~600 columns)
~16 GB on disk
[Chart: benchmark results; higher is better]

DataFusion Roadmap
● DataFrame-style API for building logical query plans, as alternative to SQL
● Parallel Query Execution (threads, partitions)
● Support for more data sources (Parquet, JSON)
● More complete SQL support (joins, subqueries, columnar UDFs)
● Distributed Execution
○ Distributed query planner & optimizer
○ Kubernetes & Docker deployment model
○ Apache Arrow Flight protocol for streaming data between nodes
Apache Arrow is a “do-ocracy” where the individual contributors get to decide the
roadmap; the items above are the things that I am planning to work on.
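One way to picture the "threads, partitions" roadmap item is a per-partition aggregate computed on separate threads and then combined. The sketch below uses scoped threads from the Rust standard library (available since Rust 1.63); the `parallel_sum` function is hypothetical and does not describe how DataFusion implements or plans to implement parallel execution.

```rust
use std::thread;

/// Sum each partition on its own thread, then combine the partial sums.
/// Assumption: one thread per partition, with no pooling or scheduling.
pub fn parallel_sum(partitions: &[Vec<f64>]) -> f64 {
    thread::scope(|s| {
        // Spawn one scoped worker per partition; scoped threads may
        // borrow `partitions` because they are joined before returning.
        let handles: Vec<_> = partitions
            .iter()
            .map(|p| s.spawn(move || p.iter().sum::<f64>()))
            .collect();

        // Merge the partial results on the calling thread.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<f64>()
    })
}
```

The same split, aggregate, and merge shape carries over to distributed execution, where partitions live on different nodes rather than different threads.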

Want to contribute?
● Great time to get involved!
○ The code base is still relatively small
■ Core Arrow library is 6k LOC
■ DataFusion is 4k LOC
○ Small number of regular contributors
○ Where to start?
■ https://cwiki.apache.org/confluence/display/ARROW/Rust+JIRA+Dashboard
○ Try adding DataFusion as a crate dependency

Thanks! Questions?
Contact Details
▪ @AndyGrove73
▪ andy.grove@rms.com
▪ https://www.linkedin.com/in/andygrove
Arrow Resources:
▪ @ApacheArrow
▪ https://arrow.apache.org
▪ https://github.com/apache/arrow
