Unlike just a few years ago, today the lakehouse architecture is an established data platform embraced by all major cloud data companies, including AWS, Microsoft, Google, Oracle, Snowflake, and Databricks.
This session kicks off with a technical, no-nonsense introduction to the lakehouse concept, dives deep into the lakehouse architecture and recaps how a data lakehouse is built from the ground up with streaming as a first-class citizen.
Then we focus on serverless for streaming use cases. Serverless concepts are well-known from developers triggering hundreds of thousands of AWS Lambda functions at a negligible cost. However, the same concept becomes more interesting when looking at data platforms.
We have all heard the principle "It runs best in PowerPoint", so I decided to skip slides here and bring a serverless demo instead:
A hands-on, fun, and interactive serverless streaming use case example where we ingest live events from hundreds of mobile devices (don't miss out - bring your phone and be part of it!!). Based on this use case I will critically explore how much of a modern lakehouse is serverless and how we implemented that at Databricks (spoiler alert: serverless is everywhere from data pipelines, workflows, optimized Spark APIs, to ML).
TL;DR benefits for the Data Practitioners:
- Recap the OSS foundation of the lakehouse architecture and understand its appeal.
- Understand the benefits of leveraging a lakehouse for streaming and what's there beyond Spark Structured Streaming.
- Meat of the talk: the serverless lakehouse. I give you the tech bits beyond the hype. How does a serverless lakehouse differ from other serverless offerings?
- Live, hands-on, interactive demo exploring serverless data engineering end-to-end. For each step we take a critical look and I explain what it means for you, e.g., saving costs and removing operational overhead.
7. Subsecond Latency - Project Lightspeed
Performance Improvements
• Micro-Batch Pipelining
• Offset Management
• Log Purging
• Consistent Latency for Stateful Pipelines
• State Rebalancing
• Adaptive Query Execution
Enhanced Functionality
• Multiple Stateful Operators
• Arbitrary Stateful Processing in Python
• Drop Duplicates Within Watermark
• Native support for Protobuf
Improved Observability
• Python Query Listener
Connectors & Ecosystem
• Enhanced Fanout (EFO)
• Trigger.AvailableNow support for Amazon Kinesis
• Google Pub/Sub Connector
• Integrations with Unity Catalog
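Among the functionality items above, Drop Duplicates Within Watermark (Spark's dropDuplicatesWithinWatermark) keeps per-key deduplication state only as long as the watermark still admits late duplicates. As a stdlib-only model of those semantics (not the Spark API; the class and names are illustrative):

```python
# Stdlib sketch of watermark-based deduplication; Spark's
# dropDuplicatesWithinWatermark implements these semantics at scale.
class WatermarkDeduper:
    def __init__(self, delay_seconds):
        self.delay = delay_seconds   # allowed event-time lateness
        self.watermark = 0.0         # max event time seen, minus delay
        self.seen = {}               # key -> event time of first sighting

    def process(self, key, event_time):
        """Return True if the event is kept, False if dropped as duplicate/late."""
        # Advance the watermark from the observed event time.
        self.watermark = max(self.watermark, event_time - self.delay)
        # Evict state older than the watermark; those keys can appear again.
        self.seen = {k: t for k, t in self.seen.items() if t >= self.watermark}
        # Late events (behind the watermark) and known keys are dropped.
        if event_time < self.watermark or key in self.seen:
            return False
        self.seen[key] = event_time
        return True

dedup = WatermarkDeduper(delay_seconds=10)
print(dedup.process("evt-1", 100.0))  # True  (first sighting)
print(dedup.process("evt-1", 105.0))  # False (duplicate within watermark)
print(dedup.process("evt-1", 200.0))  # True  (state evicted, treated as new)
```

The point of bounding state by the watermark is that the deduplication map cannot grow without limit, which is what makes the operator viable in a long-running stream.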
8. Spark Connect GA in Apache Spark 3.4
[Architecture diagram] Applications, IDEs/notebooks, programming languages/SDKs, and modern data applications act as thin clients with the full power of Apache Spark. The Spark Connect client API connects them, via an application gateway, to the components of Spark's previously monolithic driver: analyzer, optimizer, scheduler, and the distributed execution engine.
14. Delta Sharing: An open standard for secure data sharing
• 6,000+ active data consumers on Delta Sharing
• 300+ PB of data shared with Delta Lake per day
[Diagram] Data provider (Delta Lake table) → Delta Sharing protocol → any compatible client (data consumer)
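Because Delta Sharing is an open REST protocol, "any compatible client" just means anything that can call a handful of HTTP endpoints. A sketch of the endpoint layout as described in the open protocol spec (the base URL is an illustrative placeholder):

```python
# Endpoint layout of the Delta Sharing REST protocol (per the open spec);
# the server base URL below is a made-up example.
BASE = "https://sharing.example.com/delta-sharing"

def list_shares():
    return f"{BASE}/shares"

def list_schemas(share):
    return f"{BASE}/shares/{share}/schemas"

def list_tables(share, schema):
    return f"{BASE}/shares/{share}/schemas/{schema}/tables"

def query_table(share, schema, table):
    # Issued as a POST; the response body carries pre-signed file URLs.
    return f"{BASE}/shares/{share}/schemas/{schema}/tables/{table}/query"

print(query_table("loan", "lending", "txs"))
# https://sharing.example.com/delta-sharing/shares/loan/schemas/lending/tables/txs/query
```

Authentication is a bearer token from the provider's activation profile; any client that speaks these endpoints can consume the share.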
15. Delta Sharing Ecosystem
[Logo grid] Partner categories: 3rd-party data vendors/clean rooms, open source clients, business intelligence/analytics, governance, SaaS/multi-cloud infrastructure, and hyperscalers; newly added: Carto.
18. Introducing the MLflow AI Gateway: manage, govern, evaluate, and switch models easily
[Diagram] Multiple generative AI use cases across the organization (BI, pipelines, apps) go through the MLflow AI Gateway, which handles credentials, caching, logging, and rate limiting in front of multiple generative AI models, plus model serving and monitoring for users.
40. The Open Approach To Sharing
• Fully open, without proprietary lock-in, using any computing platform
• Simple to share live data with other organizations
• Easily managed privacy, security, and compliance
• Additional flexibility and interoperability
• Less data movement and complexity
• Ability to unlock data with strong governance
41. Delta Sharing: Under the hood
[Diagram] The data provider runs a Delta Sharing server in front of a Delta Lake table whose Parquet files live in cloud storage. After setup via an activation link, the Delta Sharing client on the data consumer side requests a table; the server returns pre-signed, short-lived URLs granting temporary direct access to the files (Parquet format) in the object store: AWS S3, GCP, ADLS, …
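The pre-signed URLs are what keep the sharing server out of the data path: it authorizes a request once, then hands back time-limited links so the consumer reads Parquet bytes straight from object storage. As a stdlib-only sketch of that idea (not the actual Delta Sharing or cloud-provider signing scheme; the key and host are illustrative):

```python
import hmac
import hashlib
import time
from urllib.parse import urlencode

SECRET = b"server-side-signing-key"  # hypothetical; real stores use cloud IAM/KMS

def presign(path, ttl_seconds, now=None):
    """Return a time-limited, tamper-evident URL for one object."""
    expires = int((now if now is not None else time.time()) + ttl_seconds)
    msg = f"{path}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"https://storage.example.com{path}?" + urlencode(
        {"expires": expires, "signature": sig})

def verify(path, expires, signature, now=None):
    """Accept only unexpired URLs whose signature matches the path and expiry."""
    msg = f"{path}:{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    fresh = (now if now is not None else time.time()) < expires
    return fresh and hmac.compare_digest(expected, signature)
```

A real sharing server returns such URLs in its response to a table query, so the provider never proxies the data itself; once the short TTL passes, the links are useless.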
42. OSS: Run a Sharing Server
https://github.com/delta-io/delta-sharing
bin/delta-sharing-server -- --config server-config.yaml
OR
docker run -p <host-port>:<container-port> … deltaio/delta-sharing-server:0.6.4 -- --config /config/server-config.yaml
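The referenced server-config.yaml declares which tables the server exposes and on which endpoint. A minimal sketch with placeholder names and a hypothetical bucket path (see the delta-io/delta-sharing README for the authoritative format):

```yaml
# Minimal delta-sharing server config (illustrative values)
version: 1
shares:
  - name: "loan"
    schemas:
      - name: "lending"
        tables:
          - name: "txs"
            location: "s3a://<bucket>/<path-to-delta-table>"
host: "localhost"
port: 8080
endpoint: "/delta-sharing"
```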
43. Databricks: Sharing Data from SQL
CREATE SHARE loan;
ALTER SHARE loan ADD TABLE demo.lending.txs;
CREATE RECIPIENT l_recipient;
GRANT SELECT ON SHARE loan TO RECIPIENT l_recipient;
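On the recipient side (Databricks-to-Databricks sharing with Unity Catalog), the share is mounted as a catalog and queried like any local table. A sketch assuming the provider is visible under the illustrative name l_provider:

```sql
-- Recipient side (illustrative names; requires Unity Catalog)
CREATE CATALOG loan_catalog USING SHARE l_provider.loan;
SELECT * FROM loan_catalog.lending.txs LIMIT 10;
```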
50. Delta Sharing Ecosystem (recap of slide 15)
[Logo grid] Partner categories: 3rd-party data vendors/clean rooms, open source clients, business intelligence/analytics, governance, SaaS/multi-cloud infrastructure, and hyperscalers; newly added: Carto.
51. Adoption of Delta Sharing protocol takes aim at Snowflake
Oracle's adoption of Databricks' Delta Sharing protocol is a major part of the updates to its Autonomous Data Warehouse. The protocol was adopted, according to Oracle's Wheeler, to avoid vendor lock-in for data sharing and to sort out issues such as security, version control, and access management of data sets.
“With this open approach, customers can now securely share data with anyone using any application or service
that supports the protocol,” the company said in a statement.
Oracle’s decision to adopt the protocol could be primarily due to its popularity and to
counter Snowflake’s product offerings, analysts said.
52. Databricks Marketplace
Databricks Marketplace provides an open marketplace for data, analytics, and AI:
• Open for Databricks & non-Databricks users
• Data sets, notebooks, ML models, and applications from top data & solution providers
• Public marketplace, private exchanges
[Diagram] Assets offered through Databricks Marketplace: data tables, data files, dashboards, notebooks, ML models, and solution accelerators.
53. Databricks Clean Rooms
Secure environments to run computations on joint data.
• Scalable: scale to multiple collaborators and any data size
• Interoperable: any data source with no replication
• Flexible: your language and workload of choice
[Diagram] Collaborator 1 through Collaborator N each keep their existing tables and connect them via Delta Sharing; mutually approved jobs run on Databricks trusted compute.