1
Growing the Delta Lake ecosystem
to Rust, Python, and more.
introducing delta-rs
R. Tyler Croy
tech.scribd.com
github.com/rtyler
rtyler@brokenco.de
2
whoami(1)
● Long-time free and open source developer
● delta-rs contributor
● Director of Platform Engineering at Scribd
○ Core Platform
○ Data Engineering
○ Data Operations
3
Delta Lake
4
5
Delta Lake basics
● Three major components:
○ Parquet files
○ Object storage
○ Transaction logs
● Well-defined transaction log semantics in the
protocol document
● Very little "magic"
github.com/delta-io/delta/blob/master/PROTOCOL.md
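The "very little magic" point can be illustrated with a short sketch. Per the protocol document, each commit is a newline-delimited JSON file under `_delta_log/`, named by its zero-padded version, and its `add` and `remove` actions name data files by `path`. The function name `active_files` is hypothetical, and the sketch ignores checkpoints, which a real reader would use as a starting point instead of replaying every commit.

```python
import json
from pathlib import Path


def active_files(table_path):
    """Replay _delta_log/*.json in version order.

    Returns (latest_version, set of live data file paths).
    Assumption: no checkpoints are consulted; every JSON commit is replayed.
    """
    log_dir = Path(table_path) / "_delta_log"
    live = set()
    version = -1
    # Commit names are zero-padded versions, so a lexical sort is a version sort.
    for commit in sorted(log_dir.glob("*.json")):
        version = int(commit.stem)
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
            # Other actions (protocol, metaData, commitInfo, ...) are skipped here.
    return version, live
```

Calling `active_files("/path/to/table")` on a local table directory yields the table version and the Parquet files that make up that version.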
6
delta-rs
7
Why delta-rs was needed
● Not everything needs a Spark cluster 😀
○ Many workloads need "fractional compute" resources.
● Data ingestion is a key area where Scribd needed a better cost/performance ratio.
○ Some portion of writes into Delta Lake aren't results of "big data" processing.
● Hybrid workloads may need "offline" and "online" data.
brokenco.de/2021/04/27/why-delta-lake.html
8
delta-rs
Extending Delta Lake outside the JVM ecosystem
● Provides Rust and Python bindings for working with Delta tables
● Supports local filesystem, S3, and Azure Data Lake Storage (Gen 2)
● Production ready:
○ Read and metadata operations
● Still cookin':
○ Support for writing to the Delta transaction log
○ S3 multi-writer support work in progress
○ Ruby bindings in their infancy
github.com/delta-io/delta-rs
9
The power of Rust
● Correctness and speed are important to Delta users
● No runtime allows for very easy embedding
● Opens up possibilities for Delta Lake in:
○ NodeJS
○ Ruby
○ Python
○ Golang
○ etc
● It's so hot right now
github.com/delta-io/delta-rs
10
[Architecture diagram]
● Language bindings: deltalake for Python (pip install deltalake), Ruby, Golang, NodeJS
● Core: deltalake crate (cargo install deltalake)
● Key dependencies: parquet, arrow, tokio, rusoto (s3), azure sdk, pyo3, maturin
github.com/delta-io/delta-rs
11
Rust in action
% delta-inspect info s3://delta/audit_logs
DeltaTable(s3://delta/audit_logs)
version: 0
metadata: GUID=bb73f716-764d-419f-b2b7-d505c32fd872, name=None, description=None,
partitionColumns=["date"], createdTime=1619026186600, configuration={}
min_version: read=1, write=2
files count: 1
12
Python in action
% python demo.py
Table version: 0
- date=2021-03-12/part-00000-2254ce65-f690-4875-b088-2476b77e8b44.c000.snappy.parquet
accountId actionName auditLevel ... timestamp version workspaceId
0 0xdeadbeef databricksAccess WORKSPACE_LEVEL ... 1615535931674 2.0 0xdeadbeef
[1 rows x 11 columns]
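The deck shows only the output of demo.py, not the script. A minimal sketch of what such a script might look like follows, assuming the `deltalake` Python package with its `DeltaTable` class and the `version()`, `file_uris()`, and `to_pandas()` methods. The function name `show_table` is hypothetical, and `file_uris()` prints absolute URIs rather than the relative path shown on the slide.

```python
def show_table(table_uri):
    """Print the version, data files, and contents of a Delta table.

    Hypothetical reconstruction; the real demo.py is not part of the deck.
    """
    # Imported lazily so the module loads even where deltalake is not installed.
    # Needs: pip install deltalake pandas
    from deltalake import DeltaTable

    table = DeltaTable(table_uri)
    version = table.version()
    files = table.file_uris()
    print(f"Table version: {version}")
    for uri in files:
        print(f" - {uri}")

    # Load the whole table into a pandas DataFrame and print it.
    frame = table.to_pandas()
    print(frame)
    return version, files, frame
```

Usage would be along the lines of `show_table("s3://delta/audit_logs")`, with storage credentials supplied through the environment.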
13
demo!?
14
What you can do right now
● Access Delta tables in: AWS S3, Azure Data Lake, Local filesystem
● Read tables
○ By partitions
○ With checkpoints
○ With stream table updates
● Write to the transaction log
● Vacuum
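Reading by partition can be decided from the transaction log alone, because the protocol records a `partitionValues` map of strings on every `add` action. The sketch below filters a list of `add` actions on those values. The function name `files_for_partition` is hypothetical and this is not the delta-rs API, only an illustration of the idea.

```python
def files_for_partition(add_actions, **wanted):
    """Return the paths of add actions matching every wanted partition value.

    add_actions: list of dicts shaped like the `add` action in the Delta log.
    wanted: partition column name -> value (values are strings in the log).
    """
    return [
        add["path"]
        for add in add_actions
        if all(
            add.get("partitionValues", {}).get(column) == value
            for column, value in wanted.items()
        )
    ]
```

For a table partitioned by `date`, `files_for_partition(adds, date="2021-03-12")` selects only the files for that day, so no other Parquet file needs to be opened.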
15
What you can't quite do yet
● Write parquet directly
● Create checkpoints
● Execute an OPTIMIZE command
○ Focusing on bin-packing
○ Not on z-ordering yet
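To make the bin-packing focus of OPTIMIZE concrete, here is a sketch of how small files could be grouped into rewrite candidates. It uses a first-fit decreasing heuristic, which is an assumption chosen for illustration and not necessarily what delta-rs will implement. The name `plan_bins` and the `target_size` parameter are hypothetical.

```python
def plan_bins(file_sizes, target_size):
    """Group small files into bins of at most target_size bytes.

    file_sizes: dict of path -> size in bytes.
    Returns a list of path lists; each list could be rewritten as one file.
    Heuristic: first-fit decreasing (an illustrative choice).
    """
    bins = []  # each entry is [remaining_capacity, [paths]]
    by_size = sorted(file_sizes.items(), key=lambda item: item[1], reverse=True)
    for path, size in by_size:
        if size >= target_size:
            continue  # already large enough, leave it alone
        for candidate in bins:
            if candidate[0] >= size:
                candidate[0] -= size
                candidate[1].append(path)
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([target_size - size, [path]])
    # A bin holding a single file gains nothing from being rewritten.
    return [paths for _, paths in bins if len(paths) > 1]
```

In Delta terms, each returned group would become one new `add` action plus a `remove` action for every file it replaces, all in a single commit.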
16
Python
pip install deltalake pandas
Rust
[dependencies]
deltalake = "*"
Available on
scribd.com
17
What's next
18
kafka-delta-ingest
Rapidly ingesting structured data from Apache Kafka into Delta Lake
● Intended to provide a high speed bridge between Apache Kafka
streams and Delta Lake
● Initially targeting mapping JSON messages into Delta table rows
● Heavily dependent on delta-rs
○ Driving significant writer-based improvements
● Not intended to do any stream transformation or manipulation
github.com/delta-io/kafka-delta-ingest
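The JSON-to-row mapping can be pictured with a small sketch. It decodes each message into a row and hands back a batch once enough rows have accumulated. The batching policy, the class name `RowBuffer`, and the `max_rows` parameter are all assumptions for illustration; the deck does not describe how kafka-delta-ingest buffers or flushes data.

```python
import json


class RowBuffer:
    """Accumulate decoded JSON messages and release them in fixed-size batches.

    Hypothetical illustration only; not the kafka-delta-ingest design.
    """

    def __init__(self, max_rows):
        self.max_rows = max_rows
        self.rows = []

    def push(self, message):
        """Decode one JSON message (bytes or str).

        Returns a full batch of rows when max_rows is reached, else None.
        """
        self.rows.append(json.loads(message))
        if len(self.rows) >= self.max_rows:
            batch, self.rows = self.rows, []
            return batch
        return None
```

No transformation is applied to the rows, in keeping with the stated goal of moving data without manipulating the stream.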
19
Problems to solve
● Kafka topics can have variable throughput volume
● Auto-scaling is important for data timeliness
● Spark is a lot of overhead for writing data from one
socket to another
github.com/delta-io/kafka-delta-ingest
20
delta-rs community
Your name here!
● Active channels in the Delta Slack workspace, join on delta.io
○ #delta-rs
○ #kafka-delta-ingest
● Lots of good first issues for anyone who wants to learn Delta Lake or Rust
● Notable contributions
○ Extensive Python binding support and bug fixes from Florian Valeye (@fvaleye)
○ Async IO and Azure storage backend from Ben Sully (@sd2k)
○ Safe concurrent writer work with S3 and DynamoDB from Mykhailo Osypov (@mosyp)
○ Parquet crate write support by Neville Dipale (@nevi-me)
github.com/delta-io/delta-rs
21
thanks!
tech.scribd.com
github.com/rtyler
rtyler@brokenco.de

Growing the Delta Ecosystem to Rust and Python with Delta-RS
