Unlike just a few years ago, today the lakehouse architecture is an established data platform embraced by all major cloud data companies, including AWS, Microsoft, Google, Oracle, Snowflake, and Databricks.
This session kicks off with a technical, no-nonsense introduction to the lakehouse concept, dives deep into the lakehouse architecture and recaps how a data lakehouse is built from the ground up with streaming as a first-class citizen.
Then we focus on serverless for streaming use cases. Serverless concepts are well-known from developers triggering hundreds of thousands of AWS Lambda functions at a negligible cost. However, the same concept becomes more interesting when looking at data platforms.
We have all heard the principle "It runs best in PowerPoint", so I decided to skip slides here and bring a serverless demo instead:
A hands-on, fun, and interactive serverless streaming use case example where we ingest live events from hundreds of mobile devices (don't miss out - bring your phone and be part of it!!). Based on this use case I will critically explore how much of a modern lakehouse is serverless and how we implemented that at Databricks (spoiler alert: serverless is everywhere from data pipelines, workflows, optimized Spark APIs, to ML).
TL;DR benefits for the Data Practitioners:
- Recap the OSS foundation of the lakehouse architecture and understand its appeal.
- Understand the benefits of leveraging a lakehouse for streaming and what's there beyond Spark Structured Streaming.
- Meat of the talk: the serverless lakehouse. I give you the tech bits beyond the hype. How does a serverless lakehouse differ from other serverless offerings?
- Live, hands-on, interactive demo exploring serverless data engineering end-to-end. For each step we take a critical look and I explain what it means for you, e.g., saving costs and removing operational overhead.
7. Subsecond Latency - Project Lightspeed
Performance Improvements
• Micro-Batch Pipelining
• Offset Management
• Log Purging
• Consistent Latency for Stateful Pipelines
• State Rebalancing
• Adaptive Query Execution
Enhanced Functionality
• Multiple Stateful Operators
• Arbitrary Stateful Processing in Python
• Drop Duplicates Within Watermark
• Native support for Protobuf
Improved Observability
• Python Query Listener
Connectors & Ecosystem
• Enhanced Fanout (EFO)
• Trigger.AvailableNow support for Amazon Kinesis
• Google Pub/Sub Connector
• Integrations with Unity Catalog
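Among the functionality items above, Drop Duplicates Within Watermark (Spark's dropDuplicatesWithinWatermark) keeps per-key deduplication state only as long as the watermark still admits late duplicates. As a stdlib-only model of those semantics (not the Spark API; the class and names are illustrative):

```python
# Stdlib sketch of watermark-based deduplication; Spark's
# dropDuplicatesWithinWatermark implements these semantics at scale.
class WatermarkDeduper:
    def __init__(self, delay_seconds):
        self.delay = delay_seconds   # allowed event-time lateness
        self.watermark = 0.0         # max event time seen, minus delay
        self.seen = {}               # key -> event time of first sighting

    def process(self, key, event_time):
        """Return True if the event is kept, False if dropped as duplicate/late."""
        # Advance the watermark from the observed event time.
        self.watermark = max(self.watermark, event_time - self.delay)
        # Evict state older than the watermark; those keys can appear again.
        self.seen = {k: t for k, t in self.seen.items() if t >= self.watermark}
        # Late events (behind the watermark) and known keys are dropped.
        if event_time < self.watermark or key in self.seen:
            return False
        self.seen[key] = event_time
        return True

dedup = WatermarkDeduper(delay_seconds=10)
print(dedup.process("evt-1", 100.0))  # True  (first sighting)
print(dedup.process("evt-1", 105.0))  # False (duplicate within watermark)
print(dedup.process("evt-1", 200.0))  # True  (state evicted, treated as new)
```

The point of bounding state by the watermark is that the deduplication map cannot grow without limit, which is what makes the operator viable in a long-running stream.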
8. Spark Connect GA in Apache Spark 3.4
[Architecture diagram] Applications, IDEs/notebooks, programming languages/SDKs, and modern data applications act as thin clients with the full power of Apache Spark. The Spark Connect client API connects them, via an application gateway, to the components of Spark's previously monolithic driver: analyzer, optimizer, scheduler, and the distributed execution engine.
14. Delta Sharing: An open standard for secure data sharing
• 6,000+ active data consumers on Delta Sharing
• 300+ PB of data shared with Delta Lake per day
[Diagram] Data provider (Delta Lake table) → Delta Sharing protocol → any compatible client (data consumer)
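Because Delta Sharing is an open REST protocol, "any compatible client" just means anything that can call a handful of HTTP endpoints. A sketch of the endpoint layout as described in the open protocol spec (the base URL is an illustrative placeholder):

```python
# Endpoint layout of the Delta Sharing REST protocol (per the open spec);
# the server base URL below is a made-up example.
BASE = "https://sharing.example.com/delta-sharing"

def list_shares():
    return f"{BASE}/shares"

def list_schemas(share):
    return f"{BASE}/shares/{share}/schemas"

def list_tables(share, schema):
    return f"{BASE}/shares/{share}/schemas/{schema}/tables"

def query_table(share, schema, table):
    # Issued as a POST; the response body carries pre-signed file URLs.
    return f"{BASE}/shares/{share}/schemas/{schema}/tables/{table}/query"

print(query_table("loan", "lending", "txs"))
# https://sharing.example.com/delta-sharing/shares/loan/schemas/lending/tables/txs/query
```

Authentication is a bearer token from the provider's activation profile; any client that speaks these endpoints can consume the share.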
15. Delta Sharing Ecosystem
[Logo grid] Partner categories: 3rd-party data vendors/clean rooms, open source clients, business intelligence/analytics, governance, SaaS/multi-cloud infrastructure, and hyperscalers; newly added: Carto.
18. Introducing the MLflow AI Gateway: manage, govern, evaluate, and switch models easily
[Diagram] Multiple generative AI use cases across the organization (BI, pipelines, apps) go through the MLflow AI Gateway, which handles credentials, caching, logging, and rate limiting in front of multiple generative AI models, plus model serving and monitoring for users.
40. The Open Approach To Sharing
• Fully open, without proprietary lock-in, using any computing platform
• Simple to share live data with other organizations
• Easily managed privacy, security, and compliance
• Additional flexibility and interoperability
• Less data movement and complexity
• Ability to unlock data with strong governance
41. Delta Sharing: Under the hood
[Diagram] The data provider runs a Delta Sharing server in front of a Delta Lake table whose Parquet files live in cloud storage. After setup via an activation link, the Delta Sharing client on the data consumer side requests a table; the server returns pre-signed, short-lived URLs granting temporary direct access to the files (Parquet format) in the object store: AWS S3, GCP, ADLS, …
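The pre-signed URLs are what keep the sharing server out of the data path: it authorizes a request once, then hands back time-limited links so the consumer reads Parquet bytes straight from object storage. As a stdlib-only sketch of that idea (not the actual Delta Sharing or cloud-provider signing scheme; the key and host are illustrative):

```python
import hmac
import hashlib
import time
from urllib.parse import urlencode

SECRET = b"server-side-signing-key"  # hypothetical; real stores use cloud IAM/KMS

def presign(path, ttl_seconds, now=None):
    """Return a time-limited, tamper-evident URL for one object."""
    expires = int((now if now is not None else time.time()) + ttl_seconds)
    msg = f"{path}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"https://storage.example.com{path}?" + urlencode(
        {"expires": expires, "signature": sig})

def verify(path, expires, signature, now=None):
    """Accept only unexpired URLs whose signature matches the path and expiry."""
    msg = f"{path}:{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    fresh = (now if now is not None else time.time()) < expires
    return fresh and hmac.compare_digest(expected, signature)
```

A real sharing server returns such URLs in its response to a table query, so the provider never proxies the data itself; once the short TTL passes, the links are useless.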
42. OSS: Run a Sharing Server
https://github.com/delta-io/delta-sharing
bin/delta-sharing-server -- --config server-config.yaml
OR
docker run -p <host-port>:<container-port> … deltaio/delta-sharing-server:0.6.4 -- --config /config/server-config.yaml
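The referenced server-config.yaml declares which tables the server exposes and on which endpoint. A minimal sketch with placeholder names and a hypothetical bucket path (see the delta-io/delta-sharing README for the authoritative format):

```yaml
# Minimal delta-sharing server config (illustrative values)
version: 1
shares:
  - name: "loan"
    schemas:
      - name: "lending"
        tables:
          - name: "txs"
            location: "s3a://<bucket>/<path-to-delta-table>"
host: "localhost"
port: 8080
endpoint: "/delta-sharing"
```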
43. Databricks: Sharing Data from SQL
CREATE SHARE loan;
ALTER SHARE loan ADD TABLE demo.lending.txs;
CREATE RECIPIENT l_recipient;
GRANT SELECT ON SHARE loan TO RECIPIENT l_recipient;
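On the recipient side (Databricks-to-Databricks sharing with Unity Catalog), the share is mounted as a catalog and queried like any local table. A sketch assuming the provider is visible under the illustrative name l_provider:

```sql
-- Recipient side (illustrative names; requires Unity Catalog)
CREATE CATALOG loan_catalog USING SHARE l_provider.loan;
SELECT * FROM loan_catalog.lending.txs LIMIT 10;
```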
50. Delta Sharing Ecosystem (recap of slide 15)
[Logo grid] Partner categories: 3rd-party data vendors/clean rooms, open source clients, business intelligence/analytics, governance, SaaS/multi-cloud infrastructure, and hyperscalers; newly added: Carto.
51. Adoption of Delta Sharing protocol takes aim at Snowflake
Oracle's adoption of Databricks' Delta Sharing protocol is a major part of the updates to its Autonomous Data Warehouse. The protocol was adopted, according to Oracle's Wheeler, to avoid vendor lock-in for data sharing and to sort out issues such as security, version control, and access management of data sets.
“With this open approach, customers can now securely share data with anyone using any application or service
that supports the protocol,” the company said in a statement.
Oracle’s decision to adopt the protocol could be primarily due to its popularity and to
counter Snowflake’s product offerings, analysts said.
52. Databricks Marketplace
Databricks Marketplace provides an open marketplace for data, analytics, and AI:
• Open for Databricks & non-Databricks users
• Data sets, notebooks, ML models, and applications from top data & solution providers
• Public marketplace, private exchanges
[Diagram] Assets offered through Databricks Marketplace: data tables, data files, dashboards, notebooks, ML models, and solution accelerators.
53. Databricks Clean Rooms
Secure environments to run computations on joint data.
• Scalable: scale to multiple collaborators and any data size
• Interoperable: any data source with no replication
• Flexible: your language and workload of choice
[Diagram] Collaborator 1 through Collaborator N each keep their existing tables and connect them via Delta Sharing; mutually approved jobs run on Databricks trusted compute.