Growing the Delta Ecosystem to Rust and Python with Delta-RS

In this session we will introduce the delta-rs project, which is helping bring the power of Delta Lake outside of the Spark ecosystem. By providing a foundational Delta Lake library in Rust, delta-rs can enable native bindings in Python, Ruby, Golang, and more. We will review the functionality delta-rs supports in its current Rust and Python APIs, as well as the upcoming roadmap.

We will also give an overview of one of the first projects to use it in production: kafka-delta-ingest, which builds on delta-rs to provide a high-throughput service for bringing data from Kafka into Delta Lake.

  1. Growing the Delta Lake ecosystem to Rust, Python, and more: introducing delta-rs
     R. Tyler Croy
     tech.scribd.com
     github.com/rtyler
     rtyler@brokenco.de
  2. whoami(1)
     ● Long-time free and open source developer
     ● delta-rs contributor
     ● Director of Platform Engineering at Scribd
       ○ Core Platform
       ○ Data Engineering
       ○ Data Operations
  3. Delta Lake
  4. (no transcribed text)
  5. Delta Lake basics
     ● Three major components:
       ○ Parquet files
       ○ Object storage
       ○ Transaction logs
     ● Well-defined transaction log semantics in the protocol document
     ● Very little "magic"
     github.com/delta-io/delta/blob/master/PROTOCOL.md
  6. delta-rs
  7. Why delta-rs was needed
     ● Not everything needs a Spark cluster 😀
       ○ Many workloads need "fractional compute" resources
     ● Data ingestion is a key area where Scribd needed a better cost/performance ratio.
       ○ Some portion of writes into Delta Lake aren't results of "big data" processing.
     ● Hybrid workloads may need "offline" and "online" data.
     brokenco.de/2021/04/27/why-delta-lake.html
  8. delta-rs: Extending Delta Lake outside the JVM ecosystem
     ● Provides Rust and Python bindings for working with Delta tables
     ● Supports local filesystem, S3, and Azure Data Lake Storage (Gen 2)
     ● Production ready:
       ○ Read and metadata operations
     ● Still cookin':
       ○ Support for writing to the Delta transaction log
       ○ S3 multi-writer support work in progress
       ○ Ruby bindings in their infancy
     github.com/delta-io/delta-rs
  9. The power of Rust
     ● Correctness and speed are important to Delta users
     ● No runtime allows for very easy embedding
     ● Opens up possibilities for Delta Lake in:
       ○ NodeJS
       ○ Ruby
       ○ Python
       ○ Golang
       ○ etc
     ● It's so hot right now
     github.com/delta-io/delta-rs
  10. [Architecture diagram]
      ● Core: deltalake crate (cargo install deltalake)
      ● Key dependencies: parquet, arrow, tokio, rusoto (s3), azure sdk, pyo3, maturin
      ● Language bindings: deltalake python (pip install deltalake), deltalake ruby, deltalake golang, deltalake nodejs
      github.com/delta-io/delta-rs
  11. Rust in action
      % delta-inspect info s3://delta/audit_logs
      DeltaTable(s3://delta/audit_logs)
        version: 0
        metadata: GUID=bb73f716-764d-419f-b2b7-d505c32fd872, name=None, description=None, partitionColumns=["date"], createdTime=1619026186600, configuration={}
        min_version: read=1, write=2
        files count: 1
  12. Python in action
      % python demo.py
      Table version: 0
       - date=2021-03-12/part-00000-2254ce65-f690-4875-b088-2476b77e8b44.c000.snappy.parquet
           accountId        actionName       auditLevel  ...      timestamp  version  workspaceId
      0   0xdeadbeef  databricksAccess  WORKSPACE_LEVEL  ...  1615535931674      2.0   0xdeadbeef
      [1 rows x 11 columns]
  13. demo!?
  14. What you can do right now
      ● Access Delta tables in: AWS S3, Azure Data Lake, Local filesystem
      ● Read tables
        ○ By partitions
        ○ With checkpoints
        ○ With stream table updates
        ○ Write to the transaction log
        ○ Vacuum
  15. What you can't quite do yet
      ● Write parquet directly
      ● Create checkpoints
      ● Execute an OPTIMIZE command
        ○ Focusing on bin-packing
        ○ Not on z-ordering yet
  16. Available on scribd.com
      ● Python: pip install deltalake pandas
      ● Rust: [dependencies] deltalake = "*"
  17. What's next
  18. kafka-delta-ingest: Rapidly ingesting structured data from Apache Kafka into Delta Lake
      ● Intended to provide a high speed bridge between Apache Kafka streams and Delta Lake
      ● Initially targeting mapping JSON messages into Delta table rows
      ● Heavily dependent on delta-rs
        ○ Driving significant writer-based improvements
      ● Not intended to do any stream transformation or manipulation
      github.com/delta-io/kafka-delta-ingest
  19. Problems to solve
      ● Kafka topics can have variable throughput volume
      ● Auto-scaling is important for data timeliness
      ● Spark is a lot of overhead for writing data from one socket to another
      github.com/delta-io/kafka-delta-ingest
  20. delta-rs community: Your name here!
      ● Active channels in the Delta Slack workspace, join on delta.io
        ○ #delta-rs
        ○ #kafka-delta-ingest
      ● Lots of good-first issues for anyone who wants to learn Delta Lake or Rust
      ● Notable contributions
        ○ Extensive Python binding support and bug fixes from Florian Valeye (@fvaleye)
        ○ Async IO and Azure storage backend from Ben Sully (@sdk2)
        ○ Safe concurrent writer work with S3 and DynamoDB from Mykhailo Osypov (@mosyp)
        ○ Parquet crate write support by Neville Dipale (@nevi-me)
      github.com/delta-io/delta-rs
  21. thanks!
      tech.scribd.com
      github.com/rtyler
      rtyler@brokenco.de
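
The "Delta Lake basics" slide describes a table as Parquet files plus a transaction log with well-defined semantics and very little "magic". As a rough illustration of that idea, the sketch below replays a tiny log using only the Python standard library. It relies on two details from the Delta protocol document: commits live under `_delta_log/` as zero-padded, 20-digit JSON files, and each line of a commit is one action such as `add` or `remove`. The helper names (`commit_name`, `write_commit`, `replay_log`, `demo`) and the file paths are hypothetical, the actions are trimmed to a single `path` field, and checkpoints, metadata, and protocol actions are ignored. This is not how delta-rs itself is implemented.

```python
import json
import tempfile
from pathlib import Path


def commit_name(version: int) -> str:
    """Commit files are named by zero-padded 20-digit version number."""
    return f"{version:020d}.json"


def write_commit(table: Path, version: int, actions: list) -> None:
    """Write one commit as newline-delimited JSON (hypothetical helper)."""
    log_dir = table / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    body = "\n".join(json.dumps(action) for action in actions)
    (log_dir / commit_name(version)).write_text(body + "\n")


def replay_log(table: Path) -> tuple:
    """Replay commits in version order; return (latest version, active files).

    Simplification: only `add` and `remove` actions are interpreted, and
    checkpoint files are not consulted.
    """
    version = -1
    active = set()
    # Zero-padded names sort lexicographically in version order.
    for commit in sorted((table / "_delta_log").glob("*.json")):
        version = int(commit.stem)
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
    return version, active


def demo() -> tuple:
    """Build a three-commit table in a temp directory and replay it."""
    with tempfile.TemporaryDirectory() as tmp:
        table = Path(tmp)
        write_commit(table, 0, [{"add": {"path": "date=2021-03-12/part-a.parquet"}}])
        write_commit(table, 1, [{"add": {"path": "date=2021-03-13/part-b.parquet"}}])
        # Commit 2 rewrites part-a into part-c: remove the old file, add the new one.
        write_commit(table, 2, [
            {"remove": {"path": "date=2021-03-12/part-a.parquet"}},
            {"add": {"path": "date=2021-03-12/part-c.parquet"}},
        ])
        version, active = replay_log(table)
        return version, sorted(active)


if __name__ == "__main__":
    version, files = demo()
    print(f"Table version: {version}")
    for path in files:
        print(f" - {path}")
```

A reader only needs object storage access and a JSON parser to work out which Parquet files make up the current table version, which is part of why a Spark-free implementation is feasible.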

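The "What you can't quite do yet" slide mentions that OPTIMIZE work is focusing on bin-packing. The slides do not say which strategy is planned, so the following is only a generic first-fit-decreasing sketch of what bin-packing small files toward a target size can look like. The function name `plan_bins`, the file names, and the sizes are all hypothetical.

```python
def plan_bins(file_sizes: dict, target_size: int) -> list:
    """Group small files into bins whose total size stays within target_size.

    First-fit decreasing: visit files from largest to smallest and place each
    one into the first bin that still has room. Files already at or above the
    target are left alone, and bins that end up holding a single file are
    dropped because rewriting them would gain nothing.
    """
    bins = []  # each bin is a list of file names
    free = []  # remaining capacity of each bin, parallel to `bins`
    # Sort by size descending; break ties by name so the plan is deterministic.
    ordered = sorted(file_sizes.items(), key=lambda item: (-item[1], item[0]))
    for name, size in ordered:
        if size >= target_size:
            continue
        for index, room in enumerate(free):
            if size <= room:
                bins[index].append(name)
                free[index] -= size
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([name])
            free.append(target_size - size)
    return [group for group in bins if len(group) > 1]


if __name__ == "__main__":
    sizes = {"a": 60, "b": 50, "c": 40, "d": 30, "e": 120, "f": 10}
    for group in plan_bins(sizes, target_size=100):
        print(group)
```

Each returned group would correspond to one rewrite: read the listed files, write a single larger Parquet file, then commit `remove` actions for the inputs and an `add` action for the output.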