Growing the Delta Ecosystem to Rust and Python with Delta-RS

1
Growing the Delta Lake ecosystem
to Rust, Python, and more.
introducing delta-rs
R. Tyler Croy
tech.scribd.com
github.com/rtyler
rtyler@brokenco.de

2
whoami(1)
● Long-time free and open source developer
● delta-rs contributor
● Director of Platform Engineering at Scribd
○ Core Platform
○ Data Engineering
○ Data Operations

5
Delta Lake basics
● Three major components:
○ Parquet files
○ Object storage
○ Transaction logs
● Well-defined transaction log semantics in the
protocol document
● Very little "magic"
github.com/delta-io/delta/blob/master/PROTOCOL.md
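
To make the "very little magic" point concrete, here is a minimal sketch of replaying a transaction log with nothing but the Python standard library. It assumes the layout described in PROTOCOL.md: a `_delta_log` directory of zero-padded JSON commit files, one action per line. It deliberately ignores checkpoints and every action type other than `add` and `remove`, and the `write_commit` helper and file names are hypothetical, there only to build a throwaway table for the demo.

```python
import json
import tempfile
from pathlib import Path


def active_files(table_path):
    """Replay the JSON commits in _delta_log and return the live data files."""
    log_dir = Path(table_path) / "_delta_log"
    live = set()
    # Commit names are zero-padded versions, so name order is version order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


def write_commit(table_path, version, actions):
    """Hypothetical helper: write one commit file for the demo below."""
    log_dir = Path(table_path) / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    body = "\n".join(json.dumps(action) for action in actions)
    (log_dir / f"{version:020d}.json").write_text(body)


with tempfile.TemporaryDirectory() as table:
    # Version 0 adds a file; version 1 replaces it with another.
    write_commit(table, 0, [{"add": {"path": "date=2021-03-12/part-0.parquet"}}])
    write_commit(table, 1, [
        {"remove": {"path": "date=2021-03-12/part-0.parquet"}},
        {"add": {"path": "date=2021-03-12/part-1.parquet"}},
    ])
    result = active_files(table)
    print(result)
```

Real `add` actions carry more fields (size, partition values, statistics); the sketch keeps only the path because that is all the replay needs.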

7
Why delta-rs was needed
● Not everything needs a Spark cluster 😀
○ Many workloads need "fractional compute" resources.
● Data ingestion is a key area where Scribd needed a better cost/performance ratio.
○ Some portion of writes into Delta Lake aren't results of "big data" processing.
● Hybrid workloads may need "offline" and "online" data.
brokenco.de/2021/04/27/why-delta-lake.html

8
delta-rs
Extending Delta Lake outside the JVM ecosystem
● Provides Rust and Python bindings for working with Delta tables
● Supports local filesystem, S3, and Azure Data Lake Storage (Gen 2)
● Production ready:
○ Read and metadata operations
● Still cookin':
○ Support for writing to the Delta transaction log
○ S3 multi-writer support work in progress
○ Ruby bindings in their infancy
github.com/delta-io/delta-rs

9
The power of Rust
● Correctness and speed are important to Delta users
● No runtime allows for very easy embedding
● Opens up possibilities for Delta Lake in:
○ NodeJS
○ Ruby
○ Python
○ Golang
○ etc
● It's so hot right now

10
[Diagram: the delta-rs stack]
● Language bindings: python (pip install deltalake), ruby, golang, nodejs
● Core: deltalake crate (cargo install deltalake)
● Key dependencies: parquet, arrow, tokio, rusoto (s3), azure sdk, pyo3, maturin

11
Rust in action
% delta-inspect info s3://delta/audit_logs
DeltaTable(s3://delta/audit_logs)
version: 0
metadata: GUID=bb73f716-764d-419f-b2b7-d505c32fd872, name=None, description=None,
partitionColumns=["date"], createdTime=1619026186600, configuration={}
min_version: read=1, write=2
files count: 1

12
Python in action
% python demo.py
Table version: 0
- date=2021-03-12/part-00000-2254ce65-f690-4875-b088-2476b77e8b44.c000.snappy.parquet
accountId actionName auditLevel ... timestamp version workspaceId
0 0xdeadbeef databricksAccess WORKSPACE_LEVEL ... 1615535931674 2.0 0xdeadbeef
[1 rows x 11 columns]

14
What you can do right now
● Access Delta tables in: AWS S3, Azure Data Lake, Local filesystem
● Read tables
○ By partitions
○ With checkpoints
○ With stream table updates
● Write to the transaction log
● Vacuum
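
Reading "by partitions" works because each `add` action in the transaction log records the partition values of its file, so a reader can skip files without opening them. A minimal sketch of that pruning step, assuming `add` actions have already been parsed into dictionaries with `path` and `partitionValues` keys as in the protocol document (the `prune_by_partition` name and sample paths are hypothetical):

```python
def prune_by_partition(add_actions, wanted):
    """Keep only files whose partitionValues match every wanted key/value."""
    return [
        add["path"]
        for add in add_actions
        if all(
            add.get("partitionValues", {}).get(key) == value
            for key, value in wanted.items()
        )
    ]


adds = [
    {"path": "date=2021-03-12/part-0.parquet",
     "partitionValues": {"date": "2021-03-12"}},
    {"path": "date=2021-03-13/part-0.parquet",
     "partitionValues": {"date": "2021-03-13"}},
]
selected = prune_by_partition(adds, {"date": "2021-03-12"})
print(selected)
```

Partition values are stored as strings in the log, so the comparison here is a plain string match.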

15
What you can't quite do yet
● Write parquet directly
● Create checkpoints
● Execute an OPTIMIZE command
○ Focusing on bin-packing
○ Not on z-ordering yet
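
The bin-packing idea behind OPTIMIZE is to group many small files so each group can be rewritten as one file near a target size. The sketch below uses a generic first-fit-decreasing heuristic to plan those groups; it is an illustration of the technique, not how delta-rs implements or will implement it. The `plan_bins` name, the file names, and the sizes are all hypothetical.

```python
def plan_bins(file_sizes, target_size):
    """Group small files into bins whose total size stays at or under target_size."""
    bins = []  # each bin is {"files": [...], "total": int}
    by_size = sorted(file_sizes.items(), key=lambda item: item[1], reverse=True)
    for path, size in by_size:
        if size >= target_size:
            continue  # already large enough, leave it alone
        for candidate in bins:
            if candidate["total"] + size <= target_size:
                candidate["files"].append(path)
                candidate["total"] += size
                break
        else:
            # No existing bin has room, so start a new one.
            bins.append({"files": [path], "total": size})
    # Rewriting a bin that holds a single file would gain nothing.
    return [b["files"] for b in bins if len(b["files"]) > 1]


sizes = {
    "a.parquet": 60,
    "b.parquet": 50,
    "c.parquet": 40,
    "d.parquet": 30,
    "e.parquet": 200,
}
plan = plan_bins(sizes, target_size=100)
print(plan)
```

Each returned group would then be compacted in a single commit that removes the small files and adds the merged one.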

16
Available on
● Python: pip install deltalake pandas
● Rust (Cargo.toml):
[dependencies]
deltalake = "*"
scribd.com

18
kafka-delta-ingest
Rapidly ingesting structured data from Apache Kafka into Delta Lake
● Intended to provide a high speed bridge between Apache Kafka
streams and Delta Lake
● Initially targeting mapping JSON messages into Delta table rows
● Heavily dependent on delta-rs
○ Driving significant writer-based improvements
● Not intended to do any stream transformation or manipulation
github.com/delta-io/kafka-delta-ingest
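
As a rough illustration of "mapping JSON messages into Delta table rows" without transforming them, the sketch below decodes each message, projects it onto the table's columns, and hands back a batch once enough rows have accumulated. It is not taken from kafka-delta-ingest: the `RowBuffer` class, its parameters, and the sample messages are hypothetical, and the Kafka consumer and the Delta write itself are left out.

```python
import json


class RowBuffer:
    """Hypothetical buffer: collects decoded rows and returns full batches."""

    def __init__(self, columns, max_rows):
        self.columns = columns
        self.max_rows = max_rows
        self.rows = []

    def push(self, message):
        """Decode one JSON message; return a batch once max_rows is reached."""
        record = json.loads(message)
        # Map the message onto the table's columns; absent fields become None.
        self.rows.append({name: record.get(name) for name in self.columns})
        if len(self.rows) >= self.max_rows:
            batch, self.rows = self.rows, []
            return batch
        return None


buffer = RowBuffer(columns=["accountId", "actionName"], max_rows=2)
first = buffer.push(b'{"accountId": "0xdeadbeef", "actionName": "databricksAccess"}')
second = buffer.push(b'{"accountId": "0xdeadbeef", "extra": 1}')
print(first, second)
```

A real ingester would probably also flush on elapsed time, so that a quiet topic does not hold rows back indefinitely.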

19
Problems to solve
● Kafka topics can have variable throughput volume
● Auto-scaling is important for data timeliness
● Spark is a lot of overhead for writing data from one
socket to another
github.com/delta-io/kafka-delta-ingest

20
delta-rs community
Your name here!
● Active channels in the Delta Slack workspace, join on delta.io
○ #delta-rs
○ #kafka-delta-ingest
● Lots of good first issues for anyone who wants to learn Delta Lake or Rust
● Notable contributions
○ Extensive Python binding support and bug fixes from Florian Valeye (@fvaleye)
○ Async IO and Azure storage backend from Ben Sully (@sd2k)
○ Safe concurrent writer work with S3 and DynamoDB from Mykhailo Osypov (@mosyp)
○ Parquet crate write support by Neville Dipale (@nevi-me)

21
thanks!
tech.scribd.com
github.com/rtyler
rtyler@brokenco.de
