©2022 Databricks Inc. — All rights reserved
Standing on the Shoulders of
Open-Source Giants
The Real-Time,
Serverless
Lakehouse in Action
Frank Munz, Principal TMM, Databricks / Current.io 2023
@frankmunz
Databricks
Lakehouse Platform
Data
Warehousing
Data
Engineering
Data Science
and ML
Data
Streaming
All structured and unstructured data
Cloud Data Lake
Unity Catalog
Fine-grained governance for data and AI
Delta Lake
Data reliability and performance
Simple
Unify your data warehousing and AI
use cases on a single platform
Open
Built on open source and open standards
Multicloud
One consistent data platform across clouds
Standing on
the Shoulders of OSS
What is new?
Apache Spark
Annual downloads
> 1 Billion
3,600 contributors, 40,000 commits
#1 in dev activity for 10 years
Subsecond Latency - Project Lightspeed
Performance Improvements
• Micro-Batch Pipelining
• Offset Management
• Log Purging
• Consistent Latency for Stateful Pipelines
• State Rebalancing
• Adaptive Query Execution
Enhanced Functionality
• Multiple Stateful Operators
• Arbitrary Stateful Processing in Python
• Drop Duplicates Within Watermark
• Native support for Protobuf
Improved Observability
• Python Query Listener
Connectors & Ecosystem
• Enhanced Fanout (EFO)
• Trigger.AvailableNow support for Amazon Kinesis
• Google Pub/Sub Connector
• Integrations with Unity Catalog
Spark Connect GA in Apache Spark 3.4
Applications
IDEs / Notebooks
Programming Languages / SDKs
Modern data application
Thin client, with full power of Apache Spark
Spark’s Monolith Driver
Application Gateway
Analyzer
Optimizer
Scheduler
Distributed Execution Engine
Spark Connect
Client API
Spark Assistant
Prompt engineering by
Spark experts
New LLM-powered
features
Delta.io
supports streaming
from the ground up
Introducing
Delta Kernel
Implements the
complete Delta
Data + Metadata
specification.
Unifies connector
development
Ecosystem diagram: connectors and clients built on the Delta Protocol and Delta Kernels, grouped into Java, Python, and Rust ecosystems. Names shown: AWS SDK for pandas, Ray, Airbyte, Power BI, pandas, Dask, DuckDB, StarTree (Pinot), Beam, Ballista, Kafka, DataFusion, Pulsar, Flink, PrestoDB, Hive, Trino, Glue, Athena, EMR, DLT, Spark R, Azure Synapse, delta-spark, Redshift, DataHub, Polars, Arrow, C++, Excel, Golang, Java, R, Rust, Delta Sharing, and others.
Metadata
Delta Lake
With UniForm
Metadata
Data
Delta
UniForm
Unifying the
lakehouse
formats
Parquet
Delta Sharing
Lightning talk
tomorrow at 3.30PM
Meetup Hub
6,000+ active data consumers on Delta Sharing
300+ PB per day of data shared with Delta Lake
Delta Lake
table
Delta
Sharing
protocol
Any
compatible
client
Data consumer
Data provider
An open standard for secure data sharing
Delta Sharing Ecosystem
3rd Party Data Vendors/Clean Room
Open Source Clients Business Intelligence/Analytics
Governance SaaS/Multi-Cloud Infrastructure
Hyperscalers
Carto
NEW
MLflow
Model
Serving
optimized
for LLMs
INTRODUCING
Model Serving
and Monitoring
Falcon-7B-Instruct whisper-large-v2 stable-diffusion-2-1
MPT-7B-Instruct
Manage, govern,
evaluate, and switch
models easily
MLflow AI
Gateway
INTRODUCING
Multiple Generative AI use cases
across the organization
BI Pipelines Apps
MLflow AI Gateway
Multiple Generative AI Models
Credentials Caching Logging Rate limiting
Model Serving
and Monitoring
Users
Demo Audience 1
Let's do the math…
This demo creates a sustained data rate
43 million
events / day
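To make "the math" concrete, the daily figure can be converted into a per-second rate. This is only a back-of-the-envelope sketch: it assumes the events are spread evenly over 24 hours, which the slide does not state, and the variable names are made up for illustration.

```python
# Convert the slide's daily volume into an average per-second rate.
# Assumption: events arrive evenly across the whole day.
EVENTS_PER_DAY = 43_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"about {events_per_second:.0f} events per second")
```

In other words, 43 million events per day corresponds to roughly 500 events per second sustained.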
2
Data Engineering
on the Lakehouse
Unity Catalog
Delta Lake
BI & Data
Warehousing
Data
Streaming
Data
Science & ML
Data
Engineering
Databricks Workflows
Unified orchestration for data,
analytics, and AI on the
Lakehouse Platform
● Simple authoring
● Actionable insights
● Proven reliability
YipitData: Why we migrated from
Airflow to Workflows
Workflows
Sessions
Clicks
Join
Featurize
Aggregate Analyze
Train
Orders
Building Blocks of Databricks Workflows
A unit of orchestration in Databricks Workflows is called a Job.
Databricks
Notebooks
Python
Scripts
Python
Wheels
SQL
Files/Queries
Delta Live Tables
Pipeline
dbt Java
JAR file
Spark
Submit
Jobs consist of
one or more Tasks
Sequential Parallel Conditionals
(Run If)
Jobs as a Task
(Modular)
Control flows can
be established
between Tasks.
Jobs support
different triggers
Preview
DBSQL
Dashboards
Manual
Trigger
Scheduled
(Cron)
API
Trigger
File
Arrival
Delta Table
Update
Continuous
(Streaming)
Preview
Coming
Soon
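The building blocks above (a Job made of Tasks, with control flow between them) can be sketched in plain Python. The structure below is loosely modeled on the `task_key` / `depends_on` fields of a Databricks Jobs definition, but it is not a complete or valid API payload. The task names are borrowed from the Workflows diagram, and the dependencies between them are an assumption made for illustration, as is the job name.

```python
from graphlib import TopologicalSorter

# Hypothetical job: task names come from the diagram, the edges are assumed.
job = {
    "name": "clickstream_demo",  # made-up job name
    "tasks": [
        {"task_key": "sessions", "depends_on": []},
        {"task_key": "clicks", "depends_on": []},
        {"task_key": "orders", "depends_on": []},
        {"task_key": "join", "depends_on": [
            {"task_key": "sessions"}, {"task_key": "clicks"}, {"task_key": "orders"}]},
        {"task_key": "featurize", "depends_on": [{"task_key": "join"}]},
        {"task_key": "aggregate", "depends_on": [{"task_key": "join"}]},
        {"task_key": "train", "depends_on": [{"task_key": "featurize"}]},
        {"task_key": "analyze", "depends_on": [{"task_key": "aggregate"}]},
    ],
}


def execution_waves(job_definition):
    """Group tasks into waves; tasks in the same wave could run in parallel."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task["depends_on"]}
        for task in job_definition["tasks"]
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the dependencies form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(job)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Tasks without dependencies form the first wave (parallel), and each later wave starts only after the tasks it depends on have finished (sequential), which mirrors the sequential and parallel control flows listed above.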
Serverless Workflows
Hands-off, auto-optimizing compute in Databricks’ account
Benefit from Databricks’ scale of compute and
engineering expertise through Serverless
compute in Databricks’ account:
Problem: Setting up, managing, and optimising clusters is
cumbersome and requires expert knowledge, wasting
valuable time and resources.
● High efficiency: Don’t pay for idle,
auto-optimize compute config
● Reliability so your critical
workloads are shielded from cloud
disruptions
● Faster startup: So users don’t
have to wait and critical data is
always fresh
● Simplicity that enables every user
to set up serverless
PREVIEW
What is Delta Live Tables?
Delta Live Tables (DLT) is the first ETL framework that uses a simple declarative approach to
building reliable data pipelines. DLT automatically manages your infrastructure at scale so data
analysts and engineers can spend less time on tooling and focus on getting value from data.
Accelerate ETL
Development
Automatically manage
your infrastructure
Have confidence
in your data
Simplify batch and
streaming
https://databricks.com/product/delta-live-tables
Modern software engineering for ETL processing
Reference Architecture
Most use cases will use streaming tables (STs) for ingestion and materialized views (MVs) for transformation
Bronze
cloud_files
CREATE STREAMING TABLE
Use a short retention
period to avoid
compliance risks and
reduce costs
Avoid complex
transformations
that could have
bugs or drop
important data
Retain infinite history
Easy to perform
GDPR and other
compliance tasks
CREATE MATERIALIZED VIEW
Materialized views
automatically handle
complex joins /
aggregations, and
propagate updates and
deletes.
Silver/Gold
Ad-hoc DML
for GDPR /
Corrections
Serverless Streaming optimizations
DLT Serverless also optimizes streaming TCO and latency!
PREVIEW
DLT Serverless dynamically
optimizes compute and scheduling
• Pipelined execution of multiple
microbatches
• Dynamic tuning of batch sizes
based on the amount of compute
available
Demo Audience 2
Delta Live Tables
Link to blog
Workflows Or DLT?
Often Both: Workflows can orchestrate anything, including DLT
● At some schedule
● After other tasks have
completed
● When a file arrives
● When another table is
updated
● Batch and streaming data
transformations / quality
● Easy way to run
Structured Streaming
● Creating/updating Delta tables
Use DLT for managing dataflow
Use Workflows to run any
task
The core abstractions of DLT
You define datasets, and DLT automatically keeps them up to date
Streaming Tables
A Delta table with stream(s)
writing to it.
Used for:
• Ingestion
(files, message brokers)
• Low latency transformations
• Huge scale
Materialized Views
The result of a query, stored in a
Delta table.
Used for:
• Transforming data
• Building aggregate tables
• Speeding up BI queries and
reports
Streaming does not always mean expensive
Delta Live Tables lets you choose how often to update the results.
Triggered: Manually | Costs: lowest | Latency: highest
Triggered: On a schedule using Databricks Jobs | Costs: depends on frequency | Latency: 10 minutes to months
Continually | Costs: highest | Latency: minutes to seconds
(for some workloads)
Challenge
Heavy burden on Data
Engineers to create workflows
for analysts due to the high
complexity of creating custom
workflows with Airflow.
Solution
Migrated from Airflow to
Databricks Workflows for a
unified platform providing
analysts a simple way to own
and manage their own
workflows from data ingestion
to downstream analytics.
Impact
60% lower database costs
90% reduction in processing time
“If we went back to 2018 and Databricks Workflows was available, we would never
have considered building out a custom Airflow setup. We would just use
Workflows.”
—Hillevi Crognale, Engineering Manager, YipitData
Migrating from Apache Airflow
to Databricks Workflows
From Zero to Hero
Sharing Streaming
Data with Open
Source Delta Sharing
Frank Munz, Principal TMM, Databricks
@frankmunz
About me
▪ Principal TMM @ Databricks
▪ Based in Munich, 🍻 ⛰ 🥨
▪ ❤ all things large scale data & AI
What’s the problem with
Data Sharing?
Proprietary Vendor Solutions | SFTP | Cloud Object Store | Delta Sharing
Secure ✅ ✅ ✅ ✅
Cheap ✅ ✅ ✅
Vendor agnostic ✅ ✅
Multi-cloud ✅ ✅
Open Source ✅ ✅
Table / Data Frame abstr. ✅ ✅
Live data ✅ ✅
Predicate Pushdown ✅ ✅
Object Store Bandwidth ✅ ✅
Zero compute cost ✅ ✅
Scalability ✅ ✅
How does Delta Sharing Help?
The Open Approach To Sharing
Fully open, without
proprietary lock-in using
any computing platforms
Simple to share live
data with other
organizations
Easily managed
privacy, security, and
compliance
Additional
flexibility and
interoperability
Less data
movement and
complexity
Ability to unlock
data with strong
governance
Delta
Lake
Delta Sharing
Server
Parquet files
in cloud
storage
Request table
Pre-signed
short-lived URLs
Temporary direct access to files
(parquet format) in the object
store - AWS S3, GCP, ADLS
…
DATA PROVIDER DATA CONSUMER
Delta Sharing
Client
Under the hood
Activation link
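The "pre-signed short-lived URLs" step can be illustrated with a toy example. In a real deployment the cloud object store (S3, GCS, ADLS) does the signing and validation; the HMAC scheme, the secret, the host name, and the function names below are all stand-ins invented for this sketch and are not part of Delta Sharing or any cloud SDK. The point is only that the URL itself carries an expiry time plus a signature, so the consumer can fetch a file directly for a limited time without holding long-lived credentials.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"storage-side-secret"  # hypothetical key, never sent to the consumer


def _sign(path: str, expires: int) -> str:
    """Signature over the file path and its expiry timestamp."""
    message = f"{path}:{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def presign(path: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return a URL that is valid for ttl_seconds from now."""
    current = time.time() if now is None else now
    expires = int(current + ttl_seconds)
    query = urlencode({"expires": expires, "sig": _sign(path, expires)})
    return f"https://storage.example.com{path}?{query}"


def is_valid(url: str, now: Optional[float] = None) -> bool:
    """Accept the URL only if it is unexpired and the signature matches."""
    current = time.time() if now is None else now
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    expires = int(query["expires"][0])
    signature_ok = hmac.compare_digest(_sign(parsed.path, expires), query["sig"][0])
    return signature_ok and current < expires


url = presign("/loans/part-0000.parquet", ttl_seconds=60, now=1000.0)
print(is_valid(url, now=1030.0))  # within the 60-second window
print(is_valid(url, now=1061.0))  # after the window has closed
```

Changing the path in the URL or waiting past the expiry makes validation fail, which is why the sharing server can hand such URLs to a consumer without exposing the underlying storage credentials.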
OSS: Run a Sharing Server
https://github.com/delta-io/delta-sharing
bin/delta-sharing-server -- --config server-config.yaml
OR
docker run -p <host-port>:<container-port> 
…
deltaio/delta-sharing-server:0.6.4 -- --config /config/server-config.yaml
Databricks: Sharing Data from SQL
CREATE SHARE loan;
ALTER SHARE loan ADD TABLE demo.lending.txs;
CREATE RECIPIENT l_recipient;
GRANT SELECT ON SHARE loan TO RECIPIENT l_recipient;
Databricks UI: Create share
(1) create share
(2) add table
Pandas Client
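!pip install delta-sharing
import delta_sharing
# profile_f: path to the profile file received from the data provider (placeholder name)
profile_f = "config.share"
client = delta_sharing.SharingClient(profile_f)
# table URL format: <profile file>#<share>.<schema>.<table>
table = profile_f + "#share.schema.table"
data = delta_sharing.load_as_pandas(table)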
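The `profile_f` variable in the snippet above points to a profile file, a small JSON document the recipient obtains from the provider (for example via the activation link). The sketch below writes such a file and builds the table URL that `load_as_pandas` expects. The endpoint and token are placeholders, the file name `config.share` and the helper function names are assumptions for illustration, and nothing here contacts a sharing server.

```python
import json
import os
import tempfile

# Placeholder credentials: a real profile is issued by the data provider.
profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",  # hypothetical endpoint
    "bearerToken": "<token>",  # placeholder, not a real token
}


def write_profile(profile_dict, directory):
    """Write the profile as JSON and return the path to the file."""
    path = os.path.join(directory, "config.share")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(profile_dict, handle, indent=2)
    return path


def table_url(profile_path, share, schema, table):
    """Build the '<profile>#<share>.<schema>.<table>' string used by the clients."""
    return f"{profile_path}#{share}.{schema}.{table}"


with tempfile.TemporaryDirectory() as workdir:
    profile_path = write_profile(profile, workdir)
    print(table_url(profile_path, "share", "schema", "table"))
```

The same `<profile>#<share>.<schema>.<table>` string is what the pandas and Spark clients take as the table location.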
Streaming Support: Spark Structured Streaming
# client code
df = (spark.readStream
    .format("deltasharing")
    .option("readChangeFeed", "true")
    .option("startingTimestamp", "2021-04-21 05:45:46")
    .load("<profile>#<share>.<schema>.<table>")
)
Demo
Delta Sharing
https://github.com/fmunz/bigdata-intro/blob/main/DeltaSharing_DatabricksReference.ipynb
Why Delta Sharing rocks
Adoption of Delta Sharing protocol takes aim at Snowflake
Oracle's adoption of Databricks’ Delta Sharing protocol is a major part of the updates to its Autonomous Data
Warehouse. The protocol was adopted, according to Oracle's Wheeler, to avoid vendor lock-ins for data sharing
and sort out issues such as security, version control and access management of data sets.
“With this open approach, customers can now securely share data with anyone using any application or service
that supports the protocol,” the company said in a statement.
Oracle’s decision to adopt the protocol could be primarily due to its popularity and to
counter Snowflake’s product offerings, analysts said.
Open for Databricks &
non-Databricks users
Data sets, Notebooks,
ML models and
applications from top
data & solution providers
Public marketplace,
private exchanges
Databricks Marketplace provides an open
marketplace for data, analytics, and AI
Dashboards
ML
Models
Data
Files
Data
Tables
Solution
Accelerators
Databricks
Marketplace
Notebooks
Databricks Clean Rooms
Secure environments to run computations on joint data
Collaborator 1
Mutually approved
jobs on Databricks
trusted compute
Existing tables
Scalable
Scale to multiple
collaborators and any data
size
Interoperable
Any data source with no
replication
Flexible
Your language and workload
of choice
Collaborator N
Existing tables
Delta
Sharing
Delta
Sharing
Conclusion: Delta Sharing
● Platform-independent, multi-cloud, OSS for
sharing massive amounts of live and streaming data.
● Built into Databricks Accounts, Marketplace, and Clean Rooms
● Clients can be:
○ OSS: pandas, Apache Spark
○ Enterprise BI: Tableau, Power BI
● Server
○ Pre-built reference implementation
○ OSS binary
○ OSS Docker container
Technical Questions?
Sign-up for the Databricks Community!
Ask your technical questions here: https://community.databricks.com/
New Databricks Demo Center
databricks.com/demos
Notebooks for this demo
on GitHub
This demo on Demo
Center
Thank You!
@frankmunz
Try
Databricks
free

The Zero-ETL Approach: Enhancing Data Agility and Insight
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 

Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lakehouse

  • 1. ©2022 Databricks Inc. — All rights reserved Standing on the Shoulders of Open-Source Giants The Real-Time, Serverless Lakehouse in Action Frank Munz, Principal TMM, Databricks / Current.io 2023 @frankmunz
  • 2. Databricks Lakehouse Platform: Data Warehousing, Data Engineering, Data Science and ML, Data Streaming. Cloud Data Lake: All structured and unstructured data. Unity Catalog: Fine-grained governance for data and AI. Delta Lake: Data reliability and performance. Simple: Unify your data warehousing and AI use cases on a single platform. Open: Built on open source and open standards. Multicloud: One consistent data platform across clouds.
  • 3. Standing on the Shoulders of OSS What is new?
  • 4. Apache Spark
  • 6. 3,600 contributors, 40,000 commits #1 in dev activity for 10 years
  • 7. Subsecond Latency - Project Lightspeed. Performance Improvements: • Micro-Batch Pipelining • Offset Management • Log Purging • Consistent Latency for Stateful Pipelines • State Rebalancing • Adaptive Query Execution. Enhanced Functionality: • Multiple Stateful Operators • Arbitrary Stateful Processing in Python • Drop Duplicates Within Watermark • Native support for Protobuf. Improved Observability: • Python Query Listener. Connectors & Ecosystem: • Enhanced Fanout (EFO) • Trigger.AvailableNow support for Amazon Kinesis • Google Pub/Sub Connector • Integrations with Unity Catalog
  • 8. Spark Connect GA in Apache Spark 3.4 Applications IDEs / Notebooks Programming Languages / SDKs Modern data application Thin client, with full power of Apache Spark Spark’s Monolith Driver Application Gateway Analyzer Optimizer Scheduler Distributed Execution Engine Spark Connect Client API
  • 9. Spark Assistant Prompt engineering by Spark experts New LLM-powered features
  • 10. Delta.io supports streaming from the ground up
  • 11. Introducing Delta Kernel Implements the complete Delta Data + Metadata specification. Unifies connector development = Java Ecosystem aws-pandas-sdk ray airbyte Python Ecosystem Power BI pandas dask DuckDB Rust Ecosystem Startree (pinot) beam ballista kafka datafusion pulsar flink prestodb hive trino glue athena emr dlt (spark-r) azure synapse delta-spark redshift datahub C++ Excel Golang Java Power BI R-Stats Rust Delta Sharing Others Delta Protocol Delta Kernels polars arrow
  • 13. Delta Sharing: Lightning talk tomorrow at 3:30 PM, Meetup Hub
  • 14. Delta Sharing: An open standard for secure data sharing. 6,000+ active data consumers on Delta Sharing. 300+ PB per day of data shared with Delta Lake. Data provider: Delta Lake table. Delta Sharing protocol. Data consumer: Any compatible client.
  • 15. Delta Sharing Ecosystem 3rd Party Data Vendors/Clean Room Open Source Clients Business Intelligence/Analytics Governance SaaS/Multi-Cloud Infrastructure Hyperscalers Carto NEW
  • 16. MLflow
  • 17. Model Serving optimized for LLMs INTRODUCING Model Serving and Monitoring Falcon-7B-Instruct whisper-large-v2 stable-diffusion-2-1 MPT-7B-Instruct
  • 18. Manage, govern, evaluate, and switch models easily MLflow AI Gateway INTRODUCING Multiple Generative AI use cases across the organization BI Pipelines Apps MLflow AI Gateway Multiple Generative AI Models Credentials Caching Logging Rate limiting Model Serving and Monitoring Users
  • 19. Demo Audience 1
  • 20. Let's do the math… This demo creates a sustained data rate of 43 million events / day
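As a rough check of the arithmetic on slide 20, the daily volume can be converted to an average per-second rate. This is a minimal sketch that assumes events are spread evenly over a 24-hour day; the constant and function names are illustrative and do not come from the talk.

```python
# Convert the daily event volume quoted on the slide to an average rate per second.
EVENTS_PER_DAY = 43_000_000      # figure quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds, assuming a uniform 24-hour day


def events_per_second(events_per_day: int = EVENTS_PER_DAY) -> float:
    """Average number of events per second for a given daily volume."""
    return events_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{events_per_second():.1f} events/second on average")
```

Under that assumption the sustained rate comes out at just under 500 events per second.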
  • 22. Databricks Workflows: Unified orchestration for data, analytics, and AI on the Lakehouse Platform. ● Simple authoring ● Actionable insights ● Proven reliability. YipitData: Why we migrated from Airflow to Workflows. Lakehouse Platform: Unity Catalog, Delta Lake, BI & Data Warehousing, Data Streaming, Data Science & ML, Data Engineering. Example workflow: Sessions, Clicks, Orders, Join, Featurize, Aggregate, Analyze, Train
  • 23. Building Blocks of Databricks Workflows. A unit of orchestration in Databricks Workflows is called a Job. Jobs consist of one or more Tasks: Databricks Notebooks, Python Scripts, Python Wheels, SQL Files/Queries, Delta Live Tables Pipeline, dbt, Java JAR file, Spark Submit, DBSQL Dashboards. Control flows can be established between Tasks: Sequential, Parallel, Conditionals (Run If), Jobs as a Task (Modular). Jobs support different Triggers: Manual Trigger, Scheduled (Cron), API Trigger, File Arrival, Delta Table Update, Continuous (Streaming). Some items are marked Preview or Coming Soon on the slide.
  • 24. Serverless Workflows (Preview): Hands-off, auto-optimizing compute in Databricks’ account. Problem: Setting up, managing, and optimising clusters is cumbersome and requires expert knowledge, wasting valuable time and resources. Benefit from Databricks’ scale of compute and engineering expertise through Serverless compute in Databricks’ account: ● High efficiency: Don’t pay for idle, auto-optimize compute config ● Reliability so your critical workloads are shielded from cloud disruptions ● Faster startup: So users don’t have to wait and critical data is always fresh ● Simplicity that enables every user to set up serverless
  • 25. What is Delta Live Tables? Modern software engineering for ETL processing. Delta Live Tables (DLT) is the first ETL framework that uses a simple declarative approach to building reliable data pipelines. DLT automatically manages your infrastructure at scale so data analysts and engineers can spend less time on tooling and focus on getting value from data. Accelerate ETL Development. Automatically manage your infrastructure. Have confidence in your data. Simplify batch and streaming. https://databricks.com/product/delta-live-tables
  • 26. Reference Architecture Most use cases will use STs for ingestion and MVs for transformation Bronze cloud_files CREATE STREAMING TABLE Use a short retention period to avoid compliance risks and reduce costs Avoid complex transformations that could have bugs or drop important data Retain infinite history Easy to perform GDPR and other compliance tasks CREATE MATERIALIZED VIEW Materialized views automatically handle complex joins / aggregations, and propagate updates and deletes. Silver/Gold Ad-hoc DML for GDPR / Corrections
  • 27. Serverless Streaming optimizations (Preview): DLT Serverless also optimizes streaming TCO and latency! DLT Serverless dynamically optimizes compute and scheduling • Pipelined execution of multiple microbatches • Dynamic tuning of batch sizes based on the amount of compute available
  • 28. Demo Audience 2
  • 29. Delta Live Tables Link to blog
  • 30. Workflows Or DLT? Often Both: Workflows can orchestrate anything, including DLT. Use DLT for managing dataflow: ● Batch and streaming data transformations / quality ● Easy way to run Structured Streaming ● Creating/updating delta tables. Use Workflows to run any task: ● At some schedule ● After other tasks have completed ● When a file arrives ● When another table is updated
  • 31. The core abstractions of DLT: You define datasets, and DLT automatically keeps them up to date. Streaming Tables: A delta table with stream(s) writing to it. Used for: • Ingestion (files, message brokers) • Low latency transformations • Huge scale. Materialized Views: The result of a query, stored in a delta table. Used for: • Transforming data • Building aggregate tables • Speeding up BI queries and reports
  • 32. Streaming does not always mean expensive: Delta Live Tables lets you choose how often to update the results. Triggered manually: Costs: lowest, Latency: highest. Triggered on a schedule using Databricks Jobs: Costs: depends on frequency, Latency: 10 minutes to months. Continuous: Costs: highest (for some workloads), Latency: minutes to seconds
  • 33. Migrating from Apache Airflow to Databricks Workflows. Challenge: Heavy burden on Data Engineers to create workflows for analysts due to the high complexity of creating custom workflows with Airflow. Solution: Migrated from Airflow to Databricks Workflows for a unified platform providing analysts a simple way to own and manage their own workflows from data ingestion to downstream analytics. Impact: 60% lower database costs, 90% reduction in processing time. “If we went back to 2018 and Databricks Workflows was available, we would never have considered building out a custom Airflow setup. We would just use Workflows.” (Hillevi Crognale, Engineering Manager, YipitData)
  • 35. From Zero to Hero Sharing Streaming Data with Open Source Delta Sharing Frank Munz, Principal TMM, Databricks @frankmunz
  • 36. About me ▪ Principal TMM @ Databricks ▪ Based in Munich, 🍻 ⛰ 🥨 ▪ ❤ all things large scale data & AI
  • 37. What’s the problem with Data Sharing?
  • 38. Proprietary Vendor Solutions SFTP Cloud Object Store Delta Sharing Secure ✅ ✅ ✅ ✅ Cheap ✅ ✅ ✅ Vendor agnostic ✅ ✅ Multi-cloud ✅ ✅ Open Source ✅ ✅ Table / Data Frame abstr. ✅ ✅ Live data ✅ ✅ Predicate Pushdown ✅ ✅ Object Store Bandwidth ✅ ✅ Zero compute cost ✅ ✅ Scalability ✅ ✅
  • 39. How does Delta Sharing Help?
  • 40. The Open Approach To Sharing: Fully open, without proprietary lock-in, using any computing platform. Simple to share live data with other organizations. Easily managed privacy, security, and compliance. Additional flexibility and interoperability. Less data movement and complexity. Ability to unlock data with strong governance.
  • 41. Delta Lake Delta Sharing Server Parquet files in cloud storage Request table Pre-signed short-lived URLs Temporary direct access to files (parquet format) in the object store - AWS S3, GCP, ADLS … DATA PROVIDER DATA CONSUMER Delta Sharing Client Under the hood Activation link
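The flow on slide 41 (the client requests a table, the server answers with pre-signed short-lived URLs, and the client then reads the Parquet files directly from the object store) can be illustrated with a toy model. This is a sketch of the general pre-signed URL idea using an HMAC signature and an expiry timestamp; it is not the Delta Sharing server's implementation or any cloud provider's signing scheme, and `SECRET`, `TABLE_FILES`, `storage.example.com`, and the function names are all hypothetical.

```python
import hashlib
import hmac
import time
from typing import List, Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"provider-signing-key"  # hypothetical key known only to the provider side

# Hypothetical mapping from a shared table to its Parquet files in object storage.
TABLE_FILES = {
    "share.schema.table": ["/bucket/part-0000.parquet", "/bucket/part-0001.parquet"],
}


def _sign(path: str, expires: int) -> str:
    """HMAC over the file path and expiry time, so neither can be altered."""
    return hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()


def presign(path: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return a URL for `path` that stops validating after `ttl_seconds`."""
    issued = time.time() if now is None else now
    expires = int(issued + ttl_seconds)
    query = urlencode({"expires": expires, "sig": _sign(path, expires)})
    return f"https://storage.example.com{path}?{query}"


def request_table(table: str, ttl_seconds: int = 300, now: Optional[float] = None) -> List[str]:
    """Toy 'sharing server': answer a table request with short-lived file URLs."""
    return [presign(path, ttl_seconds, now) for path in TABLE_FILES[table]]


def is_valid(url: str, now: Optional[float] = None) -> bool:
    """Toy 'object store' check: signature must match and the URL must not be expired."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires = int(query["expires"][0])
        signature = query["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    return hmac.compare_digest(signature, _sign(parsed.path, expires)) and current < expires
```

The point of the design is that the consumer never receives long-lived storage credentials: each URL grants temporary access to one file, and the data itself flows straight from the object store rather than through the sharing server.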
  • 42. OSS: Run a Sharing Server https://github.com/delta-io/delta-sharing bin/delta-sharing-server -- --config server-config.yaml OR docker run -p <host-port>:<container-port> … deltaio/delta-sharing-server:0.6.4 -- --config /config/server-config.yaml
  • 43. Databricks: Sharing Data from SQL CREATE SHARE loan; ALTER SHARE loan ADD TABLE demo.lending.txs; CREATE RECIPIENT l_recipient; GRANT SELECT ON SHARE loan TO RECIPIENT l_recipient;
  • 44. Databricks UI: Create share (1) create share (2) add table
  • 45. Pandas Client !pip install delta-sharing client = delta_sharing.SharingClient(profile_f) table = profile_f+"#share.schema.table" data = delta_sharing.load_as_pandas(table)
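The pandas client snippet on slide 45 can be written out as a small module. This is a minimal sketch: it assumes the `delta-sharing` package is installed and that a profile file (here the hypothetical `config.share`) was obtained from the data provider; the helper names `table_url`, `load_shared_table`, and `list_shared_tables` are illustrative and not part of the library.

```python
def table_url(profile_file: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' address used by the client."""
    return f"{profile_file}#{share}.{schema}.{table}"


def list_shared_tables(profile_file: str):
    """List every table the recipient can see through this profile."""
    import delta_sharing  # imported lazily; requires `pip install delta-sharing`

    client = delta_sharing.SharingClient(profile_file)
    return client.list_all_tables()


def load_shared_table(profile_file: str, share: str, schema: str, table: str):
    """Load one shared table into a pandas DataFrame."""
    import delta_sharing  # imported lazily; requires `pip install delta-sharing`

    return delta_sharing.load_as_pandas(table_url(profile_file, share, schema, table))


if __name__ == "__main__":
    # Hypothetical profile file and table coordinates, for illustration only.
    print(table_url("config.share", "share", "schema", "table"))
```

Keeping the address construction in its own function makes the `#` and `.` separators explicit, which is the part of the slide that is easiest to get wrong when typing it by hand.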
  • 46. Streaming Support: Spark Structured Streaming # client code df = (spark.readStream .format("deltasharing") .option("readChangeFeed", "true") .option("startingTimestamp", "2021-04-21 05:45:46") .load("<profile>#<share>.<schema>.<table>") )
  • 51. Adoption of Delta Sharing protocol takes aim at Snowflake Oracle's adoption of Databricks’ Delta Sharing protocol is a major part of the updates to its Autonomous Data Warehouse. The protocol was adopted, according to Oracle's Wheeler, to avoid vendor lock-ins for data sharing and sort out issues such as security, version control and access management of data sets. “With this open approach, customers can now securely share data with anyone using any application or service that supports the protocol,” the company said in a statement. Oracle’s decision to adopt the protocol could be primarily due to its popularity and to counter Snowflake’s product offerings, analysts said.
  • 52. Databricks Marketplace provides an open marketplace for data, analytics, and AI. Open for Databricks & non-Databricks users. Data sets, Notebooks, ML models and applications from top data & solution providers. Public marketplace, private exchanges. Databricks Marketplace: Dashboards, ML Models, Data Files, Data Tables, Solution Accelerators, Notebooks
  • 53. Databricks Clean Rooms Secure environments to run computations on joint data Collaborator 1 Mutually approved jobs on Databricks trusted compute Existing tables Scalable Scale to multiple collaborators and any data size Interoperable Any data source with no replication Flexible Your language and workload of choice Collaborator N Existing tables Delta Sharing Delta Sharing
  • 55. Conclusion Delta Sharing ● Platform-independent, multi-cloud, OSS for sharing massive amounts of live and streaming data. ● Built into Databricks Accounts, Marketplace, Clean Rooms ● Clients can be: ○ OSS: pandas, Apache Spark ○ Enterprise BI: Tableau, Power BI ● Server ○ Pre-built reference implementation ○ OSS binary ○ OSS Docker container
  • 56. Technical Questions? Sign-up for the Databricks Community! Ask your technical questions here: https://community.databricks.com/
  • 57. New Databricks Demo Center databricks.com/demos Notebooks for this demo on GitHub This demo on Demo Center
  • 59. Thank You! @frankmunz Try Databricks free