©2022 Databricks Inc. — All rights reserved
The Databricks Lakehouse Platform
Modern data, analytics and AI workloads
Presenters
Jason Pohl, Principal Solutions Architect
Sean Owen, Principal Solutions Architect
The future is here
...it’s just not evenly distributed
• 83% of CEOs say AI is a strategic priority
• 85% of big data projects fail
• $3.9T in business value created by AI in 2022
• 87% of data science projects never make it into production
The Data and AI Maturity Curve
From descriptive to prescriptive
Axes: Analytics Maturity, Competitive Advantage
Stages: Raw Data, Cleaned Data, Reports, Ad Hoc Queries, Data Exploration, Predictive Modeling, Prescriptive Analytics
Questions answered along the curve: What happened? Why did it happen? What will happen? How can we make it happen?
Modern Data Teams
Data Engineers, Data Scientists, Data Analysts
The vast majority of companies
struggle to get this right
Most enterprises struggle with data
Siloed data teams decrease productivity
Disconnected systems and proprietary data formats make integration difficult
Siloed stacks increase data architecture complexity
Data Warehousing (Data Analysts)
• Tools: Amazon Redshift, Teradata, Azure Synapse, Google BigQuery, Snowflake, IBM Db2, SAP, Oracle Autonomous Data Warehouse
• Stack: structured data; extract, load, transform; data warehouse; data marts; analytics and BI
Data Engineering (Data Engineers)
• Tools: Hadoop, Apache Airflow, Amazon EMR, Apache Spark, Google Dataproc, Cloudera
• Stack: structured, semi-structured, and unstructured data; data lake; data prep
Streaming (Data Engineers)
• Tools: Apache Kafka, Apache Spark, Apache Flink, Amazon Kinesis, Azure Stream Analytics, Google Dataflow, Tibco Spotfire, Confluent
• Stack: streaming data sources; streaming data engine; real-time database
Data Science and ML (Data Scientists)
• Tools: Jupyter, Amazon SageMaker, Azure ML Studio, MATLAB, Domino Data Lab, SAS, TensorFlow, PyTorch
• Stack: structured, semi-structured, and unstructured data; data lake; machine learning; data science
Where does all this complexity
come from?
data warehouses vs. data lakes
Data Warehouse vs. Data Lake
Warehouses and lakes create complexity
• Two separate copies of the data: warehouses are proprietary, lakes are open
• Incompatible interfaces: warehouses use SQL, lakes use Python
• Incompatible security and governance models: warehouses govern tables, lakes govern files
Data Lakehouse
One platform to unify all of your data, analytics, and AI workloads
Diagram: Data Warehouse, Data Lake
Workloads: Streaming Analytics, BI, Data Science, Machine Learning
Data: Structured, Semi-Structured and Unstructured Data
The data lakehouse offers a better path
Lake-first approach that builds upon where the freshest, most complete data resides
AI/ML from the ground up
Single approach to managing data
Support for all use cases on a single platform:
• Data engineering
• Data warehousing
• Real time streaming
• Data science and ML
Built on open source and open standards
High reliability and performance
Multi-cloud, work with your cloud of choice
Platform layers:
• Integrated and collaborative role-based experiences with open APIs: Modern Data Engineering, Analytics and Data Warehousing, Data Science and ML
• Data processing and management built on open source and open standards
• Common security, governance, and administration
• Cloud Data Lake: structured, semi-structured, and unstructured data
Databricks
1. The Data Lakehouse Foundation
2. Modern Data Engineering on the Lakehouse
3. Analytics & Data Warehousing on the Lakehouse
4. Governance on the Lakehouse
5. Machine Learning on the Lakehouse
1. The Data Lakehouse Foundation
An open approach to bringing data management and governance to data lakes
• Better reliability with transactions
• 48x faster data processing with indexing
• Data governance at scale with fine-grained access control lists
Delta Lake solves challenges with data lakes
• RELIABILITY & QUALITY: ACID transactions
• PERFORMANCE & LATENCY: Advanced indexing & caching
• GOVERNANCE: Governance with Data Catalogs
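To make the "ACID transactions" bullet concrete, here is a minimal sketch of how an append-only commit log can give atomic, conflict-detecting writes on plain file storage. It is a toy written for this transcript, not Delta Lake's implementation: the class `TinyTableLog`, the `_log` directory, and the file names are all hypothetical, and it assumes a local filesystem where exclusive file creation is atomic.

```python
import json
import os
import tempfile


class TinyTableLog:
    """Toy append-only commit log: one JSON file per committed version."""

    def __init__(self, root):
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def latest_version(self):
        """Return the highest committed version, or -1 for an empty table."""
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def commit(self, read_version, added_files):
        """Try to write version read_version + 1; return False on conflict."""
        path = os.path.join(self.log_dir, f"{read_version + 1:020d}.json")
        try:
            # Mode "x" creates the file exclusively, so only one writer can win.
            with open(path, "x") as handle:
                json.dump({"add": added_files}, handle)
        except FileExistsError:
            return False
        return True

    def snapshot(self, version=None):
        """List the data files visible at a version (defaults to the latest)."""
        if version is None:
            version = self.latest_version()
        files = []
        for v in range(version + 1):
            with open(os.path.join(self.log_dir, f"{v:020d}.json")) as handle:
                files.extend(json.load(handle)["add"])
        return files


log = TinyTableLog(tempfile.mkdtemp())
base = log.latest_version()                       # -1: the table is empty
first_ok = log.commit(base, ["part-0.parquet"])   # writer A commits version 0
second_ok = log.commit(base, ["part-1.parquet"])  # writer B used the same base and loses
retry_ok = log.commit(log.latest_version(), ["part-1.parquet"])  # B retries on top of version 0
```

Readers only ever see whole commits, and `log.snapshot(0)` reads the table as it was before the second commit, which is the kind of behavior the reliability bullet points at.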
Delta Lake adoption
Already >50% of Databricks workload
Broad industry support (logos for "Ingest from", "Query from", and "Store data in" not reproduced)
What Delta Lake can do for you
The foundation of your lakehouse
• Scale data insights throughout your organization with a simplified solution
• Provide best price/performance
• Enable a multi-cloud, secure infrastructure
Demo: Delta Lake
2. Modern Data Engineering on the Lakehouse
Problems with Today’s Architectures
Cheap to store all the data, but the 2-tier architecture is much more complex!
Data reliability suffers:
• Multiple storage systems with different semantics, SQL dialects, etc.
• Extra ETL steps that can go wrong
Timeliness suffers:
• Extra ETL steps before data is available in the DW
High cost:
• Continuous ETL, duplicated storage
Diagram: Structured, Semi-structured & Unstructured Data; Data Lake; ETL; Data Warehouses; BI, Reports, Data Science, Machine Learning
Lakehouse Systems
Implement data warehouse management and performance features on top of directly-accessible data in open formats
• Cheap storage in open formats for all data
• Direct access to data files
• Can we get state-of-the-art performance & governance features with this design?
Diagram: Structured, Semi & Unstructured Data; Data Lake Storage; Management & Performance Layer; ETL; SQL; BI, Reports, Data Science, Machine Learning
Modern data engineering on the lakehouse
Data Engineering on the Databricks Lakehouse Platform
Data Sources: Databases, Streaming Sources, Cloud Object Stores, SaaS Applications, NoSQL, On-premises Systems
Platform capabilities: Data ingestion; Data transformation; Data quality management; Scheduling & orchestration; Open format storage; Automatic deployment & operations; Observability, lineage, and end-to-end pipeline visibility
Data Consumers: BI / Reporting, Dashboarding, Machine Learning / Data Science, Data & ML Sharing, Data Products
Building the foundation of a Lakehouse
Greatly improve the quality of your data for end users
Sources: Data Lake; CSV, JSON, TXT…; Kinesis
• BRONZE: Raw Ingestion and History
• SILVER: Filtered, Cleaned, Augmented
• GOLD: Business-level Aggregates
Quality (increasing from Bronze to Gold)
Consumers: BI & Reporting, Streaming Analytics, Data Science & ML
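The Bronze, Silver, and Gold layers can be sketched in a few lines of plain Python. This is an illustration of the layering idea only, not Databricks code: the sample records, the `_source` metadata column, and the function names are all made up, and a real pipeline would run on tables in the lake.

```python
from collections import defaultdict

# Hypothetical raw input, as it might arrive from a CSV drop: everything is a string.
raw_events = [
    {"order_id": "1", "region": "EMEA", "amount": "120.50"},
    {"order_id": "2", "region": "emea", "amount": "79.50"},
    {"order_id": "3", "region": "APAC", "amount": "not-a-number"},
    {"order_id": "4", "region": "APAC", "amount": "40.00"},
]


def to_bronze(events):
    """Bronze: keep every raw record as received, adding only ingestion metadata."""
    return [dict(event, _source="orders.csv") for event in events]


def to_silver(bronze_rows):
    """Silver: parse types, normalize values, and drop rows that fail to parse."""
    silver_rows = []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # the bad record still exists in bronze for later inspection
        silver_rows.append(
            {
                "order_id": int(row["order_id"]),
                "region": row["region"].upper(),
                "amount": amount,
            }
        )
    return silver_rows


def to_gold(silver_rows):
    """Gold: business-level aggregate, here total order amount per region."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


bronze = to_bronze(raw_events)
silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping the unparseable order in bronze while excluding it from silver is the design point: history is never lost, and quality improves at each step.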
But the reality is not so simple
Maintaining data quality and reliability at scale is complex and brittle
Diagram: sources (Data Lake; CSV, JSON, TXT…; Kinesis) and consumers (BI & Reporting, Streaming Analytics, Data Science & ML)
Large scale ETL is complex and brittle
Complex pipeline development
• Hard to build and maintain table dependencies
• Difficult to switch between batch and stream processing
Poor data quality
• Difficult to monitor & enforce data quality
• Impossible to trace data lineage
Difficult pipeline operations
• Poor observability at granular, data level
• Error handling and recovery is laborious
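Two of the pain points above, table dependencies and data quality enforcement, can be sketched with the standard library alone. The sketch assumes Python 3.9+ for `graphlib`. The table names, rule names, and the `check_expectations` helper are hypothetical, and this is not how any particular pipeline product implements the idea.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables it reads from (hypothetical names).
dependencies = {
    "orders_bronze": set(),
    "customers_bronze": set(),
    "orders_silver": {"orders_bronze"},
    "customer_orders_silver": {"orders_silver", "customers_bronze"},
    "revenue_gold": {"customer_orders_silver"},
}

# Declaring dependencies lets the run order be derived instead of hand-maintained.
run_order = list(TopologicalSorter(dependencies).static_order())


def check_expectations(rows, expectations):
    """Count, per named rule, how many rows violate it."""
    failures = {name: 0 for name in expectations}
    for row in rows:
        for name, rule in expectations.items():
            if not rule(row):
                failures[name] += 1
    return failures


expectations = {
    "amount_is_positive": lambda row: row["amount"] > 0,
    "region_is_present": lambda row: bool(row.get("region")),
}
rows = [
    {"amount": 10.0, "region": "EMEA"},
    {"amount": -5.0, "region": "APAC"},
    {"amount": 3.0, "region": ""},
]
failures = check_expectations(rows, expectations)
```

The same dependency map doubles as coarse lineage: walking it backwards from `revenue_gold` lists every upstream table that feeds it.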
Demo: Modern Data Engineering on the Lakehouse
3. Data Warehousing on the Lakehouse
BI & SQL workloads (DW) on Databricks
• Great performance and concurrency for BI and SQL workloads on Delta Lake
• Native SQL interface for analysts
• Support for BI tools to directly query your most recent data in Delta Lake
• Serverless
SQL Analytics
World-Class Performance and Data Lake Economics
• Fast and predictable performance for all queries
• Simplified administration and fine-grained governance
• Analytics on the freshest data with your tools of choice
A platform for your tools of choice
Get critical business data in with one-click integrations, and benefit from fast performance, low latency, and high user concurrency for your existing BI tools.
Integration categories: Ingest & ETL, BI & Analytics
A first-class SQL development experience
Query data lake data using familiar ANSI SQL, and find and share new insights faster with the built-in SQL query editor, alerts, visualizations, and interactive dashboards.
Simple administration and governance
Quickly set up instant, elastic SQL compute decoupled from storage. Databricks automatically determines instance types and configuration for the best price/performance.
Then, easily manage users, data, and resources with endpoint monitoring, query history, and fine-grained governance.
Demo: Data Warehousing
on the Lakehouse
4. Data Governance on the Lakehouse
Governance requirements for
data are quickly evolving
Governance is hard to enforce on data lakes
Data types: Structured, Semi-structured, Unstructured, Streaming
Spread across: Cloud 1, Cloud 2, Cloud 3
The problem is getting bigger
Enterprises need a way to share and govern a wide variety of data products
Data products: Files, Dashboards, Models, Tables
Unity Catalog for Lakehouse Governance
• Centrally catalog, search, and discover data and AI assets
• Simplify governance with a unified cross-cloud governance model
• Easily integrate with your existing enterprise data catalogs
• Securely share live data across platforms with Delta Sharing
Data Sharing is Critical in the Digital Economy
To fully realize the value locked in data, enterprises must be able to securely exchange data with trusted customers, partners, and suppliers
Current options are not fit for purpose
Homegrown Solutions: built on APIs, direct connectivity, or SFTP
● Not scalable
● Complex
● Brittle
Commercial Solutions: vendor-provided technology
● Expensive
● Locked in
● Inflexible
The open approach to sharing
• Fully open, without proprietary lock-in, using any computing platform
• Simple to share data with other organizations
• Easily managed privacy, security, and compliance
Permissible Access via Open Source Protocol
Data Provider: Delta Lake Table, Delta Sharing Server, access permissions
Delta Sharing Protocol
Data Recipient: any sharing client
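The provider, recipient, and permission roles on this slide can be illustrated with a toy in-process server. It shows only the access-check idea and does not implement the Delta Sharing protocol, which is a network protocol with its own specification. The class `ToySharingServer`, the tokens, and the table name are all hypothetical.

```python
import hmac


class ToySharingServer:
    """Toy provider-side server: serves a table only to recipients granted access."""

    def __init__(self):
        self._tables = {}   # table name -> list of rows
        self._grants = {}   # recipient token -> set of table names

    def publish(self, table, rows):
        """Provider registers a table as shareable."""
        self._tables[table] = list(rows)

    def grant(self, token, table):
        """Provider grants one recipient (identified by a token) access to a table."""
        self._grants.setdefault(token, set()).add(table)

    def read(self, token, table):
        """Recipient reads a table; raises PermissionError without a matching grant."""
        allowed = set()
        for known_token, tables in self._grants.items():
            # Constant-time comparison avoids leaking token contents via timing.
            if hmac.compare_digest(known_token, token):
                allowed = tables
        if table not in allowed:
            raise PermissionError(f"no access to {table!r}")
        return list(self._tables[table])


server = ToySharingServer()
server.publish("sales", [{"region": "EMEA", "revenue": 200.0}])
server.grant("token-partner-a", "sales")

granted_rows = server.read("token-partner-a", "sales")
try:
    server.read("token-unknown", "sales")
    denied = False
except PermissionError:
    denied = True
```

Because the recipient only needs a token and a client that speaks the protocol, the provider keeps the data in place and controls access centrally.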
5. Machine Learning on the Lakehouse
Sean Owen, Principal Solutions Architect @ Databricks
Data Science and Machine Learning
A data-native and collaborative solution for the full ML lifecycle
• Open Multi-Cloud Data Lakehouse and Feature Store
• Collaborative Multi-Language Notebooks
• Full ML Lifecycle: Model Tracking and Registry, Model Training and Tuning, Model Serving and Monitoring, Automation and Governance
Three Data Users
Business Intelligence
• SQL and BI tools
• Prepare and run reports
• Summarize data
• Visualize data
• (Sometimes) Big Data
• Data Warehouse data store
Data Science
• R, SAS, some Python
• Statistical analysis
• Explain data
• Visualize data
• Often small data sets
• Database, data warehouse data store; local files
Machine Learning
• Python
• Deep learning and specialized GPU hardware
• Create predictive models
• Deploy models to prod
• Often big data sets
• Unstructured data in files
Data Warehouse vs Data Lake
Which appeals to each user?
Users: Business Intelligence, Data Science, Machine Learning
Options: Data Warehouse, Data Lake
What ML Needs from a Lakehouse
How Is ML Different?
• Operates on unstructured data like text and images
• Can require learning from massive data sets, not just analysis of a sample
• Uses open source tooling to manipulate data as “DataFrames” rather than with SQL
• Outputs are models rather than data or reports
• Sometimes needs special hardware
What Does ML Need from a Lakehouse?
Access to Unstructured Data
• Images, text, audio, custom formats
• Libraries understand files, not tables
• Must scale to petabytes
Open Source Libraries
• OSS dominates ML tooling (TensorFlow, scikit-learn, xgboost, R, etc.)
• Must be able to apply these in Python, R
Specialized Hardware, Distributed Compute
• Scalability of algorithms
• GPUs, for deep learning
• Cloud elasticity to manage that cost!
Model Lifecycle Management
• Outputs are model artifacts
• Artifact lineage
• Productionization of model
Architecture Sketches
Data Warehouse, Data Lakehouse
MLOps
MLOps and the Lakehouse
• Applying open tools in-place to data in the lakehouse is a win for training
• Applying them for operating models is important too!
• "Models are data too"
• Need to apply models to data
• MLflow for MLOps on the lakehouse
• Track and manage model data, lineage, inputs
• Deploy models as lakehouse "services"
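The "models are data too" idea, tracking a model artifact together with its parameters and input lineage, can be sketched as a tiny in-memory registry. This is not the MLflow API. `ToyModelRegistry`, the stage names, and the `features_table@v12` style data references are hypothetical, and the "model" is just a dictionary standing in for a trained artifact.

```python
import hashlib
import pickle
from dataclasses import dataclass


@dataclass
class ModelVersion:
    name: str
    version: int
    artifact_sha256: str   # fingerprint of the serialized model artifact
    params: dict           # hyperparameters used for training
    training_data: str     # lineage: which data the model was trained on
    stage: str = "None"


class ToyModelRegistry:
    """Toy registry: stores artifacts next to the metadata that explains them."""

    def __init__(self):
        self._versions = {}    # model name -> list of ModelVersion
        self._artifacts = {}   # (model name, version) -> serialized bytes

    def register(self, name, model, params, training_data):
        blob = pickle.dumps(model)
        versions = self._versions.setdefault(name, [])
        record = ModelVersion(
            name, len(versions) + 1, hashlib.sha256(blob).hexdigest(),
            dict(params), training_data,
        )
        versions.append(record)
        self._artifacts[(name, record.version)] = blob
        return record

    def promote(self, name, version, stage="Production"):
        """Move one version into a stage, archiving whatever held it before."""
        for record in self._versions[name]:
            if record.stage == stage:
                record.stage = "Archived"
        self._versions[name][version - 1].stage = stage

    def load(self, name, stage="Production"):
        """Return (model, metadata) for the version currently in a stage."""
        for record in self._versions[name]:
            if record.stage == stage:
                return pickle.loads(self._artifacts[(name, record.version)]), record
        raise LookupError(f"no {stage} version of {name!r}")


registry = ToyModelRegistry()
v1 = registry.register("churn", {"threshold": 0.5}, {"max_depth": 3}, "features_table@v12")
v2 = registry.register("churn", {"threshold": 0.4}, {"max_depth": 5}, "features_table@v13")
registry.promote("churn", 1)
registry.promote("churn", 2)
model, record = registry.load("churn")
```

A caller that loads "the production churn model" gets both the artifact and the record saying which data and parameters produced it, which is what makes a deployed model auditable.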
Example: Chest X-Ray Classification
Classifying Chest X-rays
• 45,000 X-ray images
• About 50GB
• Includes correct doctor diagnosis
• From the National Institutes of Health
• Relatively easy deep learning problem
• If you have access to the data
• If you have deep learning open source software
• If you have GPU hardware
• (Don't diagnose at home!)
https://databricks.com/p/webinar/operationalizing-machine-learning-at-scale
Architecture
Feature Stores
A day (or 6 months) in the life of an ML model
Diagram: Raw Data (csv files); Featurization (Joins, Aggregates, Transforms, etc.); Training; Serving; Client
Diagram note: need to be equivalent
Feature Stores for Model Inputs
• Tables are OK for managing model input
• Input often structured
• Well understood, easy to access
• … but not quite enough
• Upstream lineage: how were features computed?
• Downstream lineage: where is the feature used?
• Model caller has to read, feed inputs
• How to (also) access features in real time?
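One way to read the concerns above is that a feature store keeps a single featurization definition, records where features come from and who uses them, and serves the same values to training and to real-time lookups. The sketch below illustrates that reading only. `ToyFeatureStore`, `featurize`, and the customer fields are hypothetical and do not reflect the Databricks Feature Store API.

```python
from datetime import date


def featurize(raw):
    """Single definition of the features, shared by training and serving."""
    return {
        "tenure_days": (raw["as_of"] - raw["signup_date"]).days,
        "avg_monthly_spend": raw["total_spend"] / max(raw["months_active"], 1),
    }


class ToyFeatureStore:
    """Toy store: computes features once and tracks lineage in both directions."""

    def __init__(self, feature_fn, source):
        self.feature_fn = feature_fn
        self.source = source      # upstream lineage: where the raw data came from
        self.consumers = set()    # downstream lineage: which models read the features
        self._online = {}         # entity key -> latest feature values

    def write(self, key, raw):
        """Featurize a raw record and store the result under its key."""
        self._online[key] = self.feature_fn(raw)

    def training_rows(self, model_name, keys):
        """Batch read for training; records the model as a consumer."""
        self.consumers.add(model_name)
        return [self._online[key] for key in keys]

    def lookup(self, model_name, key):
        """Single-key read for real-time serving; same values as training."""
        self.consumers.add(model_name)
        return self._online[key]


store = ToyFeatureStore(featurize, source="raw_customers.csv")
store.write(
    "cust-1",
    {
        "as_of": date(2022, 3, 1),
        "signup_date": date(2021, 3, 1),
        "total_spend": 600.0,
        "months_active": 12,
    },
)
train = store.training_rows("churn_model", ["cust-1"])
served = store.lookup("churn_model", "cust-1")
```

Because both paths read what `featurize` wrote, the model caller passes only a key, and training and serving cannot drift apart through reimplemented feature logic.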
Example: Telco Churn Classification
Bonus: Auto ML
Learn More
The data lakehouse offers a better path
Lake-first approach that builds upon where the freshest, most complete data resides
AI/ML from the ground up
Single approach to managing data
Support for all use cases on a single platform:
• Data engineering
• Data warehousing
• Real time streaming
• Data science and ML
Built on open source and open standards
High reliability and performance
Multi-cloud, work with your cloud of choice
Thank you
 
Streaming Data Into Your Lakehouse With Frank Munz | Current 2022
Streaming Data Into Your Lakehouse With Frank Munz | Current 2022Streaming Data Into Your Lakehouse With Frank Munz | Current 2022
Streaming Data Into Your Lakehouse With Frank Munz | Current 2022
 
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?
 
Demystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWDemystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFW
 
A Tale of Two BI Standards
A Tale of Two BI StandardsA Tale of Two BI Standards
A Tale of Two BI Standards
 
Slides-Discover-Power-of-Live-Data(2).pdf
Slides-Discover-Power-of-Live-Data(2).pdfSlides-Discover-Power-of-Live-Data(2).pdf
Slides-Discover-Power-of-Live-Data(2).pdf
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data Lake
 
Data Con LA 2018 - A tale of two BI standards: Data warehouses and data lakes...
Data Con LA 2018 - A tale of two BI standards: Data warehouses and data lakes...Data Con LA 2018 - A tale of two BI standards: Data warehouses and data lakes...
Data Con LA 2018 - A tale of two BI standards: Data warehouses and data lakes...
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdf
 
Accelerate and modernize your data pipelines
Accelerate and modernize your data pipelinesAccelerate and modernize your data pipelines
Accelerate and modernize your data pipelines
 
Big Data LDN 2018: A TALE OF TWO BI STANDARDS: DATA WAREHOUSES AND DATA LAKES
Big Data LDN 2018: A TALE OF TWO BI STANDARDS: DATA WAREHOUSES AND DATA LAKESBig Data LDN 2018: A TALE OF TWO BI STANDARDS: DATA WAREHOUSES AND DATA LAKES
Big Data LDN 2018: A TALE OF TWO BI STANDARDS: DATA WAREHOUSES AND DATA LAKES
 
Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database Roundtable
 
Exploiting Data Lakes: Architecture, Capabilities & Future
Exploiting Data Lakes: Architecture, Capabilities & FutureExploiting Data Lakes: Architecture, Capabilities & Future
Exploiting Data Lakes: Architecture, Capabilities & Future
 
The Future of Data Warehousing and Data Integration
The Future of Data Warehousing and Data IntegrationThe Future of Data Warehousing and Data Integration
The Future of Data Warehousing and Data Integration
 
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2
 

Data Lakehouse Symposium | Day 4

  • 1. The Databricks Lakehouse Platform: Modern data, analytics and AI workloads
  • 2. Presenters: Jason Pohl, Principal Solutions Architect; Sean Owen, Principal Solutions Architect
  • 3. The future is here... it’s just not evenly distributed. 83% of CEOs say AI is a strategic priority; 85% of big data projects fail; $3.9T in business value created by AI in 2022; 87% of data science projects never make it into production.
  • 4. The Data and AI Maturity Curve: from descriptive to prescriptive. [Chart: Competitive Advantage against Analytics Maturity, with stages Raw Data, Cleaned Data, Reports, Ad Hoc Queries, Data Exploration, Predictive Modeling, Prescriptive Analytics.] Questions along the curve: What happened? Why did it happen? What will happen? How can we make it happen?
  • 5. Modern Data Teams: Data Engineers, Data Scientists, Data Analysts
  • 6. The vast majority of companies struggle to get this right
  • 7. Most enterprises struggle with data. Siloed stacks increase data architecture complexity. [Diagram: four separate stacks. Data Warehousing: structured data; extract, load, transform; data warehouse; data marts; analytics and BI. Data Engineering: structured, semi-structured and unstructured data; data lake; data prep. Streaming: streaming data sources; streaming data engine; real-time database. Data Science and ML: structured, semi-structured and unstructured data; data lake; data science; machine learning.]
  • 8. Most enterprises struggle with data. Siloed stacks increase data architecture complexity. Disconnected systems and proprietary data formats make integration difficult. [Diagram: the four stacks from slide 7, each labeled with example products. Data Warehousing: Amazon Redshift, Teradata, Azure Synapse, Google BigQuery, Snowflake, IBM Db2, SAP, Oracle Autonomous Data Warehouse. Data Engineering: Hadoop, Apache Airflow, Amazon EMR, Apache Spark, Google Dataproc, Cloudera. Data Science and ML: Jupyter, Amazon SageMaker, Azure ML Studio, MATLAB, Domino Data Labs, SAS, TensorFlow, PyTorch. Streaming: Apache Kafka, Apache Spark, Apache Flink, Amazon Kinesis, Azure Stream Analytics, Google Dataflow, Tibco Spotfire, Confluent.]
  • 9. Most enterprises struggle with data. Siloed stacks increase data architecture complexity. Disconnected systems and proprietary data formats make integration difficult. Siloed data teams decrease productivity. [Diagram: the four stacks and example products from slide 8, each labeled with its team. Data Warehousing: Data Analysts. Data Engineering: Data Engineers. Streaming: Data Engineers. Data Science and ML: Data Scientists.]
  • 10. Where does all this complexity come from? Data warehouses vs. data lakes
  • 11. Data Warehouse vs. Data Lake
  • 12. Warehouses and lakes create complexity. Two separate copies of the data: warehouses are proprietary, lakes are open. Incompatible interfaces: warehouses use SQL, lakes use Python. Incompatible security and governance models: warehouses govern tables, lakes govern files.
  • 13. Data Lakehouse: one platform to unify all of your data, analytics, and AI workloads. [Diagram: Data Warehouse and Data Lake combine into the Data Lakehouse, which serves streaming analytics, BI, data science, and machine learning over structured, semi-structured and unstructured data.]
  • 14. The data lakehouse offers a better path. Lake-first approach that builds upon where the freshest, most complete data resides. AI/ML from the ground up. High reliability and performance. Single approach to managing data. Support for all use cases on a single platform: data engineering, data warehousing, real-time streaming, data science and ML. Built on open source and open standards. Multi-cloud: work with your cloud of choice. [Diagram layers: Modern Data Engineering, Analytics and Data Warehousing, Data Science and ML; integrated and collaborative role-based experiences with open APIs; common security, governance, and administration; data processing and management built on open source and open standards; cloud data lake holding structured, semi-structured, and unstructured data.]
  • 15. Databricks: (1) The Data Lakehouse Foundation; (2) Modern Data Engineering on the Lakehouse; (3) Analytics & Data Warehousing on the Lakehouse; (4) Governance on the Lakehouse; (5) Machine Learning on the Lakehouse
  • 16. Section 1: The Data Lakehouse Foundation
  • 17. An open approach to bringing data management and governance to data lakes: better reliability with transactions; 48x faster data processing with indexing; data governance at scale with fine-grained access control lists. [Diagram: Data Warehouse and Data Lake.]
  • 18. Delta Lake solves challenges with data lakes. Reliability & quality: ACID transactions. Performance & latency: advanced indexing & caching. Governance: governance with data catalogs.
  • 19. Delta Lake adoption: already >50% of Databricks workload. Broad industry support. [Logo groups: ingest from, query from, store data in.]
  • 20. What Delta Lake can do for you: scale data insights throughout your organization with a simplified solution; provide best price/performance; enable a multi-cloud, secure infrastructure. The foundation of your lakehouse.
  • 21. Demo: Delta Lake
  • 22. Section 2: Modern Data Engineering on the Lakehouse
  • 23. Modern data engineering on the lakehouse. [Diagram layers: Data Engineering, Analytics and Data Warehousing, Data Science and ML; integrated and collaborative role-based experiences with open APIs; common security, governance, and administration; data processing and management built on open source and open standards; cloud data lake holding structured, semi-structured, and unstructured data.]
  • 24. Problems with Today’s Architectures. [Diagram: structured, semi-structured & unstructured data; data lake; ETL; data warehouses; BI, reports, data science, machine learning.] Cheap to store all the data, but the 2-tier architecture is much more complex! Data reliability suffers: multiple storage systems with different semantics, SQL dialects, etc.; extra ETL steps that can go wrong. Timeliness suffers: extra ETL steps before data is available in the DW. High cost: continuous ETL, duplicated storage.
  • 25. ©2022 Databricks Inc. — All rights reserved Lakehouse Systems Implement data warehouse management and performance features on top of directly-accessible data in open formats Structured, Semi & Unstructured Data Data Lake Storage BI Data Science Machine Learning Reports Management & Performance Layer ETL Can we get state-of-the-art performance & governance features with this design? Cheap storage in open formats for all data Direct access to data files SQL
  • 26. ©2022 Databricks Inc. — All rights reserved Modern data engineering on the lakehouse Data Engineering on the Databricks Lakehouse Platform Open format storage Data transformation Scheduling & orchestration Automatic deployment & operations BI / Reporting Dashboarding Machine Learning / Data Science Data & ML Sharing Data Products Databases Streaming Sources Cloud Object Stores SaaS Applications NoSQL On-premises Systems Data Sources Data Consumers Observability, lineage, and end-to-end pipeline visibility Data quality management Data ingestion
  • 27. Building the foundation of a Lakehouse Greatly improve the quality of your data for end users • BRONZE: Raw Ingestion and History • SILVER: Filtered, Cleaned, Augmented • GOLD: Business-level Aggregates
  • 28. Building the foundation of a Lakehouse Greatly improve the quality of your data for end users • BRONZE: Raw Ingestion and History • SILVER: Filtered, Cleaned, Augmented • GOLD: Business-level Aggregates Data Lake CSV, JSON, TXT… Kinesis Quality BI & Reporting Streaming Analytics Data Science & ML
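The bronze, silver, and gold layers described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the records, field names, and cleaning rules are hypothetical, and a real pipeline would run on tables rather than in-memory lists.

```python
from collections import defaultdict

# Bronze: raw ingested records kept as-is, including bad rows (hypothetical data).
bronze = [
    {"customer": "a", "amount": "10.5"},
    {"customer": "b", "amount": "oops"},
    {"customer": "a", "amount": "4.5"},
    {"customer": None, "amount": "3.0"},
]


def to_silver(rows):
    """Filter and clean: drop rows with no customer or an unparseable amount."""
    cleaned = []
    for row in rows:
        if row["customer"] is None:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        cleaned.append({"customer": row["customer"], "amount": amount})
    return cleaned


def to_gold(rows):
    """Business-level aggregate: total amount per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping the raw bronze records untouched means the silver and gold layers can be recomputed later if the cleaning rules change.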
  • 29. But the reality is not so simple Maintaining data quality and reliability at scale is complex and brittle Data Lake CSV, JSON, TXT… Kinesis BI & Reporting Streaming Analytics Data Science & ML
  • 30. Large scale ETL is complex and brittle • Hard to build and maintain table dependencies • Difficult to switch between batch and stream processing • Difficult to monitor & enforce data quality • Impossible to trace data lineage • Poor observability at granular, data level • Error handling and recovery is laborious Complex pipeline development Poor data quality Difficult pipeline operations
  • 31. Demo: Modern Data Engineering on the Lakehouse
  • 32. Data Warehousing on the Lakehouse
  • 33. Data Warehousing on the Lakehouse Integrated and collaborative role-based experiences with open APIs Data processing and management built on open source and open standards Common security, governance, and administration Cloud Data Lake Structured, semi-structured, and unstructured data Data Engineering Analytics and Data Warehousing Data Science and ML
  • 34. BI & SQL workloads (DW) on Databricks • Great performance and concurrency for BI and SQL workloads on Delta Lake • Native SQL interface for analysts • Support for BI tools to directly query your most recent data in Delta Lake • Serverless
  • 35. SQL Analytics World-Class Performance and Data Lake Economics • Fast and predictable performance for all queries • Simplified administration and fine-grained governance • Analytics on the freshest data with your tools of choice Analytics on the Lakehouse Integrated and collaborative role-based experiences with open APIs Data processing and management built on open source and open standards Common security, governance, and administration Cloud Data Lake Structured, semi-structured, and unstructured data Data Engineering Analytics and Data Warehousing Data Science and ML
  • 36. A platform for your tools of choice Get critical business data in with one-click integrations, and benefit from fast performance, low latency, and high user concurrency for your existing BI tools. Ingest & ETL BI & Analytics
  • 37. A first-class SQL development experience Query data lake data using familiar ANSI SQL, and find and share new insights faster with the built-in SQL query editor, alerts, visualizations, and interactive dashboards.
  • 38. Simple administration and governance Quickly set up instant, elastic SQL compute decoupled from storage. Databricks automatically determines instance types and configuration for the best price/performance. Then, easily manage users, data, and resources with endpoint monitoring, query history, and fine-grained governance.
  • 39. Demo: Data Warehousing on the Lakehouse
  • 40. Data Governance on the Lakehouse
  • 41. Governance requirements for data are quickly evolving
  • 42. Governance is hard to enforce on data lakes Cloud 1 Cloud 2 Cloud 3 Structured Semi-structured Unstructured Streaming
  • 43. The problem is getting bigger Enterprises need a way to share and govern a wide variety of data products Files Dashboards Models Tables
  • 44. Data governance on the lakehouse Integrated and collaborative role-based experiences with open APIs Data processing and management built on open source and open standards Common security, governance, and administration Cloud Data Lake Structured, semi-structured, and unstructured data Data Engineering Analytics and Data Warehousing Data Science and ML
  • 45. Unity Catalog for Lakehouse Governance • Centrally catalog, search, and discover data and AI assets • Simplify governance with a unified cross-cloud governance model • Easily integrate with your existing enterprise data catalogs • Securely share live data across platforms with Delta Sharing
  • 46. Data Sharing is Critical in the Digital Economy
  • 47. Data Sharing is Critical in the Digital Economy To fully realize the value locked in data, enterprises must be able to securely exchange data with trusted customers, partners, and suppliers Current options are not fit for purpose Homegrown Solutions Built on APIs, direct connectivity, or SFTP ● Not scalable ● Complex ● Brittle Commercial Solutions Vendor-provided technology ● Expensive ● Locked in ● Inflexible
  • 48. The open approach to sharing Fully open, without proprietary lock-in using any computing platform Simple to share data with other organizations Easily managed privacy, security, and compliance
  • 49. Permissible Access via Open Source Protocol Delta Lake Table Delta Sharing Server Delta Sharing Protocol Data Provider Data Recipient Any Sharing Client Access permissions
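The flow on the slide above (a data provider grants access permissions, a sharing server checks them, and a recipient's client reads the table) can be sketched as a toy permission check. This is only an illustration of the roles involved and does not reproduce the Delta Sharing protocol's wire format. The tokens, table names, file paths, and the `request_table` function are all hypothetical.

```python
# Hypothetical provider-side state: which recipient token may read which table.
PERMISSIONS = {"token-partner-a": {"sales.orders"}}
TABLE_FILES = {
    "sales.orders": ["orders/part-0.parquet", "orders/part-1.parquet"],
    "sales.customers": ["customers/part-0.parquet"],
}


def request_table(token, table):
    """Return the data files a recipient may read, or raise if access is not granted."""
    if table not in PERMISSIONS.get(token, set()):
        raise PermissionError(f"{table} is not shared with this recipient")
    return list(TABLE_FILES[table])


# The recipient can read the shared table...
granted = request_table("token-partner-a", "sales.orders")

# ...but not a table the provider has not shared with them.
try:
    request_table("token-partner-a", "sales.customers")
    denied = False
except PermissionError:
    denied = True
```

Because the check sits on the provider's side, the recipient can use any client that speaks the protocol without the provider copying data out to them.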
  • 50. Machine Learning on the Lakehouse Sean Owen Principal Solutions Architect @ Databricks
  • 51. Machine Learning on the Lakehouse Integrated and collaborative role-based experiences with open APIs Data processing and management built on open source and open standards Common security, governance, and administration Cloud Data Lake Structured, semi-structured, and unstructured data Data Engineering BI and SQL Analytics Data Science and ML
  • 52. Open Multi-Cloud Data Lakehouse and Feature Store Collaborative Multi-Language Notebooks ← Full ML Lifecycle → Model Tracking and Registry Model Training and Tuning Model Serving and Monitoring Automation and Governance Data Science and Machine Learning A data-native and collaborative solution for the full ML lifecycle
  • 53. Three Data Users Business Intelligence: • SQL and BI tools • Prepare and run reports • Summarize data • Visualize data • (Sometimes) Big Data • Data Warehouse data store Data Science: • R, SAS, some Python • Statistical analysis • Explain data • Visualize data • Often small data sets • Database, data warehouse data store; local files Machine Learning: • Python • Deep learning and specialized GPU hardware • Create predictive models • Deploy models to prod • Often big data sets • Unstructured data in files
  • 54. Data Warehouse vs Data Lake Which appeals to each user? Business Intelligence Data Science Machine Learning Data Warehouse Data Lake
  • 55. What ML Needs from a Lakehouse
  • 56. How Is ML Different? • Operates on unstructured data like text and images • Can require learning from massive data sets, not just analysis of a sample • Uses open source tooling to manipulate data as “DataFrames” rather than with SQL • Outputs are models rather than data or reports • Sometimes needs special hardware
  • 57. What Does ML Need from a Lakehouse? Access to Unstructured Data • Images, text, audio, custom formats • Libraries understand files, not tables • Must scale to petabytes Open Source Libraries • OSS dominates ML tooling (TensorFlow, scikit-learn, xgboost, R, etc.) • Must be able to apply these in Python, R Specialized Hardware, Distributed Compute • Scalability of algorithms • GPUs, for deep learning • Cloud elasticity to manage that cost! Model Lifecycle Management • Outputs are model artifacts • Artifact lineage • Productionization of model
  • 58. Architecture Sketches Data Warehouse Data Lakehouse
  • 59. MLOps
  • 60. MLOps and the Lakehouse • Applying open tools in-place to data in the lakehouse is a win for training • Applying them for operating models is important too! • "Models are data too" • Need to apply models to data • MLflow for MLOps on the lakehouse • Track and manage model data, lineage, inputs • Deploy models as lakehouse "services"
  • 61. Example: Chest X-Ray Classification
  • 62. Classifying Chest X-rays • 45,000 X-ray images • About 50GB • Includes correct doctor diagnosis • From the National Institutes of Health • Relatively easy deep learning problem • If you have access to the data • If you have deep learning open source software • If you have GPU hardware • (Don't diagnose at home!) https://databricks.com/p/webinar/operationalizing-machine-learning-at-scale
  • 63. Architecture
  • 64. Feature Stores
  • 65. A day (or 6 months) in the life of an ML model Raw Data Featurization Training Joins, Aggregates, Transforms, etc. csv csv Serving
  • 66. A day (or 6 months) in the life of an ML model csv csv Client Raw Data Featurization Training Joins, Aggregates, Transforms, etc. Serving
  • 67. A day (or 6 months) in the life of an ML model csv csv Client Raw Data Featurization Training Joins, Aggregates, Transforms, etc. Serving need to be equivalent
  • 68. Feature Stores for Model Inputs • Tables are OK for managing model input • Input often structured • Well understood, easy to access • … but not quite enough • Upstream lineage: how were features computed? • Downstream lineage: where is the feature used? • Model caller has to read, feed inputs • How to (also) access features in real time?
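The lineage questions above, and the earlier point that featurization "need[s] to be equivalent" between training and serving, can be sketched with a toy feature registry. This is a minimal sketch under stated assumptions, not the Databricks Feature Store API. The registry dictionaries, the `feature` decorator, the feature names, and the telco-style record are all hypothetical.

```python
FEATURES = {}   # feature name -> function that computes it (upstream lineage)
CONSUMERS = {}  # feature name -> models that use it (downstream lineage)


def feature(name):
    """Register a feature computation under a name."""
    def register(fn):
        FEATURES[name] = fn
        CONSUMERS[name] = set()
        return fn
    return register


@feature("avg_monthly_charge")
def avg_monthly_charge(raw):
    return raw["total_charges"] / max(raw["tenure_months"], 1)


@feature("is_long_tenure")
def is_long_tenure(raw):
    return raw["tenure_months"] >= 24


def featurize(raw, names, model):
    """Compute the named features for one record and note which model used them."""
    for name in names:
        CONSUMERS[name].add(model)
    return {name: FEATURES[name](raw) for name in names}


names = ["avg_monthly_charge", "is_long_tenure"]
record = {"total_charges": 1200.0, "tenure_months": 24}

training_row = featurize(record, names, model="churn_v1")  # batch training path
serving_row = featurize(record, names, model="churn_v1")   # online serving path
```

Because both paths call the same registered functions, the training and serving inputs cannot drift apart, and the two dictionaries answer "how was this computed?" and "where is it used?".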
  • 69. Example: Telco Churn Classification
  • 70. Bonus: Auto ML
  • 71. Learn More
  • 72. The data lakehouse offers a better path Lake-first approach that builds upon where the freshest, most complete data resides AI/ML from the ground up Single approach to managing data Support for all use cases on a single platform: • Data engineering • Data warehousing • Real time streaming • Data science and ML Built on open source and open standards High reliability and performance Multi-cloud, work with your cloud of choice Integrated and collaborative role-based experiences with open APIs Data processing and management built on open source and open standards Common security, governance, and administration Cloud Data Lake Structured, semi-structured, and unstructured data Data Engineering BI and SQL Analytics Data Science and ML
  • 74. Thank you