The Databricks
Platform Introduction
All your data, analytics and
AI on one platform
Alex Ivanichev
March 2022
What is Databricks?
Databricks is a unified & open Data
and Analytics Platform
Modern Data
Teams
Data Engineers Data Scientists
Data Analysts
What does data management look
like today?
Data management complexity
Siloed stacks increase data architecture complexity and decrease productivity
Disconnected systems and proprietary data formats make integration difficult
• Data Warehousing (Data Analysts): Amazon Redshift, Azure Synapse, Snowflake, SAP, Teradata,
Google BigQuery, IBM Db2, Oracle Autonomous Data Warehouse
• Data Engineering (Data Engineers): Hadoop, Apache Airflow, Amazon EMR, Apache Spark,
Google Dataproc, Cloudera
• Streaming (Data Engineers): Apache Kafka, Apache Spark, Apache Flink, Amazon Kinesis,
Azure Stream Analytics, Google Dataflow, Tibco Spotfire, Confluent
• Data Science and ML (Data Scientists): Jupyter, Amazon SageMaker, Azure ML Studio, MatLAB,
Domino Data Labs, SAS, TensorFlow, PyTorch
[Figure: each stack runs its own pipeline over its own store: a data warehouse over structured
data (extract, load, transform; data marts; analytics and BI), data lakes over structured,
semi-structured and unstructured data (data prep; data science; machine learning), and a
streaming data engine with a real-time database over streaming data sources.]
Data Warehouse vs. Data Lake
Warehouses and lakes create complexity
• Two separate copies of the data: warehouses are proprietary, lakes are open
• Incompatible interfaces: warehouses use SQL, lakes use Python
• Incompatible security and governance models: warehouses secure tables, lakes secure files
Data Lakehouse
[Figure: the Data Lakehouse combines the Data Warehouse and the Data Lake, serving streaming
analytics, BI, data science, and machine learning over structured, semi-structured and
unstructured data.]
One platform to unify all of
your data, analytics, and AI workloads
Why choose Databricks?
The data lakehouse offers a better path
Data processing and management built on open source and open
standards
Common security, governance, and administration
Modern Data
Engineering
Analytics and Data
Warehousing
Data Science
and ML
Integrated and collaborative role-based experiences with
open APIs
Cloud Data Lake
Structured, semi-structured, and unstructured data
Lake-first approach that builds upon where
the freshest, most complete data resides
AI/ML from the ground up
High reliability and performance
Single approach to managing data
Support for all use cases on a single
platform:
• Data engineering
• Data warehousing
• Real-time streaming
• Data science and ML
Built on open source and open standards
Multi-cloud, work with your cloud of choice
The Data Lakehouse
Foundation
©2021 Databricks Inc. All rights reserved
An open approach to bringing data
management and governance to data
lakes
Better reliability with transactions
48x faster data processing with indexing
Data governance at scale with fine-grained
access control lists
Data
Warehouse
Data
Lake
What is Delta Lake?
● An open source project that enables building a Lakehouse architecture on top of data lakes.
● A storage layer that brings scalable ACID transactions to Apache Spark and other big-data
engines.
● Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and
batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS.
Key Features:
● ACID Transactions
● Scalable Metadata Handling
● Time Travel (data versioning)
● Open Format
● Delta Lake change data feed
● Unified Batch and Streaming Source and Sink
● Schema Enforcement
● Schema Evolution
● Audit History
● Updates and Deletes
● 100% Compatible with Apache Spark API
● Data Clean-up
https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
Delta Lake solves challenges with data lakes
● RELIABILITY & QUALITY: ACID transactions
● PERFORMANCE & LATENCY: Advanced indexing & caching
● GOVERNANCE: Governance with Data Catalogs
Delta Lake key feature - ACID transactions
Every commit to the Delta Lake transaction log is recorded as a set of actions:
● Add File: adds a data file.
● Remove File: removes a data file.
● Update Metadata: updates the table metadata.
● Set Transaction: records that a Structured Streaming job has committed a micro-batch with
the given ID.
● Change Protocol: enables new features by switching the Delta Lake transaction log to the
newest software protocol.
● Commit Info: contains information about the commit.
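The actions listed above can be replayed in order to work out which data files make up the current table. The following is a minimal sketch of that idea in plain Python, not Delta Lake's implementation: the JSON shapes are simplified and the file names are hypothetical.

```python
import json

# Hypothetical commits; each commit is a list of JSON actions (simplified shapes).
commits = [
    # version 0: create the table and add two data files
    ['{"metaData": {"schema": "city string, year int"}}',
     '{"add": {"path": "part-0000.parquet"}}',
     '{"add": {"path": "part-0001.parquet"}}'],
    # version 1: rewrite part-0000 as part-0002 in a single atomic commit
    ['{"remove": {"path": "part-0000.parquet"}}',
     '{"add": {"path": "part-0002.parquet"}}'],
]

def replay(commits):
    """Apply every action in commit order and return (schema, live data files)."""
    schema = None
    live_files = set()
    for actions in commits:
        for line in actions:
            action = json.loads(line)
            if "add" in action:
                live_files.add(action["add"]["path"])
            elif "remove" in action:
                live_files.discard(action["remove"]["path"])
            elif "metaData" in action:
                schema = action["metaData"]["schema"]
    return schema, sorted(live_files)

schema, files = replay(commits)
print(schema, files)
```

Because a reader only ever sees whole commits, the rewrite in version 1 is either fully visible or not visible at all, which is the atomicity part of ACID.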
State Recomputing With Checkpoint Files
● By default, Delta Lake automatically generates a checkpoint file every 10 commits
● Delta Lake saves a checkpoint file in Parquet format in the same _delta_log subdirectory.
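A rough sketch of why checkpoints help: a reader can start from the latest checkpoint and replay only the commits written after it. The zero-padded file names and the rule that version 0 gets no checkpoint are assumptions of this sketch; the 10-commit interval comes from the slide.

```python
CHECKPOINT_INTERVAL = 10  # commits between checkpoints, as stated on the slide

def commit_file(version):
    # Assumed naming: 20-digit zero-padded version number plus .json
    return f"{version:020d}.json"

def checkpoint_file(version):
    # Assumed naming: same version prefix with a .checkpoint.parquet suffix
    return f"{version:020d}.checkpoint.parquet"

def files_to_read(latest_version, interval=CHECKPOINT_INTERVAL):
    """List the _delta_log files needed to rebuild the state at latest_version."""
    last_checkpoint = (latest_version // interval) * interval
    if last_checkpoint == 0:
        # No checkpoint exists yet, so every commit has to be replayed.
        return [commit_file(v) for v in range(latest_version + 1)]
    newer_commits = range(last_checkpoint + 1, latest_version + 1)
    return [checkpoint_file(last_checkpoint)] + [commit_file(v) for v in newer_commits]

print(files_to_read(12))
```

With this scheme, rebuilding the state at version 12 reads one checkpoint and two commit files instead of thirteen commit files.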
Building the foundation of a Lakehouse
Greatly improve the quality of your data for end users
[Figure: data quality increases from left to right.]
Sources: Kinesis; CSV, JSON, TXT…; Data Lake
→ BRONZE: Raw Ingestion and History
→ SILVER: Filtered, Cleaned, Augmented
→ GOLD: Business-level Aggregates
→ Consumers: BI & Reporting, Streaming Analytics, Data Science & ML
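The bronze, silver, and gold layers above can be illustrated with a small plain-Python sketch. The records and the cleaning rules below are made up for illustration; on Databricks each layer would typically be a Delta table rather than a Python list.

```python
from collections import defaultdict

# BRONZE: raw records exactly as ingested (made-up sample values).
bronze = [
    {"city": "Avalon", "year": "2020", "population": "1000"},
    {"city": "Avalon", "year": "2021", "population": "not available"},
    {"city": " Brookfield ", "year": "2021", "population": "500"},
    {"city": "Brookfield", "year": "2021", "population": "500"},
]

def to_silver(rows):
    """Filter unparseable rows, clean whitespace, cast types, drop duplicates."""
    seen, silver = set(), []
    for row in rows:
        try:
            record = (row["city"].strip(), int(row["year"]), int(row["population"]))
        except ValueError:
            continue  # bad rows stay in bronze only
        if record not in seen:
            seen.add(record)
            silver.append({"city": record[0], "year": record[1], "population": record[2]})
    return silver

def to_gold(rows):
    """Business-level aggregate: total population per year."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["year"]] += row["population"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

Keeping the untouched bronze records means the silver and gold layers can be recomputed whenever the cleaning rules change.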
But the reality is not so simple
Maintaining data quality and reliability at scale is complex and brittle
[Figure: the same sources (CSV, JSON, TXT…; Data Lake; Kinesis) feeding BI & Reporting,
Streaming Analytics, and Data Science & ML.]
Modern data engineering on the lakehouse
Data Engineering on the Databricks Lakehouse Platform
• Data Sources: Databases, Streaming Sources, Cloud Object Stores, SaaS Applications, NoSQL,
On-premises Systems
• Platform: Data ingestion, Data transformation, Data quality management, Scheduling &
orchestration, Automatic deployment & operations, Open format storage
• Across the platform: Observability, lineage, and end-to-end pipeline visibility
• Data Consumers: BI / Reporting, Dashboarding, Machine Learning / Data Science, Data & ML
Sharing, Data Products
Data Science & Engineering Workspace
Databricks Workspaces: Clusters
A cluster is a set of computation resources on which a developer can run Data Analytics,
Data Science, or Data Engineering workloads.
Workloads can be executed as a set of commands written in a notebook.
Databricks Workspaces: Notebooks
A notebook is a web interface where a developer can write and execute code. It contains
a sequence of runnable cells that help a developer work with files, manipulate
tables, create visualizations, and add narrative text.
Databricks Workspaces: Auto Loader
Auto Loader incrementally and efficiently processes new data files as they arrive in
cloud storage. Auto Loader can load data files from Google Cloud Storage (GCS, gs://) in
addition to Databricks File System (DBFS, dbfs:/).
Supported file formats: JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE.
// Placeholder: the slide never defines upload_path. Point it at the
// directory where new CSV files arrive.
val upload_path = "<path-to-upload-directory>"
val checkpoint_path = "/tmp/delta/population_data/_checkpoints"
val write_path = "/tmp/delta/population_data"

// Set up the stream to begin reading incoming files from the
// upload_path location.
val df = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("header", "true")
  .schema("city string, year int, population long")
  .load(upload_path)

// Start the stream.
// Use the checkpoint_path location to keep a record of all files that
// have already been uploaded to the upload_path location.
// For those that have been uploaded since the last check,
// write the newly-uploaded files' data to the write_path location.
df.writeStream.format("delta")
  .option("checkpointLocation", checkpoint_path)
  .start(write_path)
https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html
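The checkpoint's role in the stream above (remember which files were already ingested, so each run picks up only new arrivals) can be modelled without Spark. This is a conceptual sketch only: the helper name and the JSON checkpoint format are hypothetical and say nothing about how Auto Loader stores its state.

```python
import json
import tempfile
from pathlib import Path

def discover_new_files(upload_dir, checkpoint_file):
    """Return CSV files not seen on earlier runs and record them in the checkpoint."""
    seen = set(json.loads(checkpoint_file.read_text())) if checkpoint_file.exists() else set()
    new_files = sorted(p.name for p in upload_dir.glob("*.csv") if p.name not in seen)
    # A real pipeline would load and write the new files here before committing.
    checkpoint_file.write_text(json.dumps(sorted(seen | set(new_files))))
    return new_files

with tempfile.TemporaryDirectory() as tmp:
    upload_dir = Path(tmp) / "upload"
    upload_dir.mkdir()
    checkpoint_file = Path(tmp) / "checkpoint.json"

    (upload_dir / "2020.csv").write_text("city,year,population\n")
    first_run = discover_new_files(upload_dir, checkpoint_file)

    (upload_dir / "2021.csv").write_text("city,year,population\n")
    second_run = discover_new_files(upload_dir, checkpoint_file)

    third_run = discover_new_files(upload_dir, checkpoint_file)

print(first_run, second_run, third_run)
```

Each file is reported exactly once, and a run with no new arrivals does no work.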
Databricks Workspaces: Jobs
Jobs allow a user to run notebooks on a schedule. A job is a way of executing or
automating specific tasks such as ETL, model building, and more.
The steps of an ML workflow can be organized into jobs so that they run sequentially,
one after another.
Databricks Workspaces: Delta Live Tables
Delta Live Tables is a framework for declaratively defining, deploying, testing, and
upgrading data pipelines. It eliminates the operational burden associated with the
management of such pipelines.
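To illustrate what "declarative" means here: the author states what each table is and what it depends on, and the framework works out the execution order. The sketch below is a plain-Python analogy using the standard library's `graphlib`, not the Delta Live Tables API; the table names and functions are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical declarations: table name -> (upstream tables, function that builds it).
pipeline = {
    "event_total": (("clean_events",), lambda clean: sum(clean)),
    "clean_events": (("raw_events",), lambda raw: [x for x in raw if x >= 0]),
    "raw_events": ((), lambda: [3, -1, 4, -1, 5]),
}

def run(pipeline):
    """Build every table after its upstream tables, whatever the declaration order."""
    graph = {name: set(deps) for name, (deps, _) in pipeline.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        deps, build = pipeline[name]
        results[name] = build(*(results[dep] for dep in deps))
    return results

results = run(pipeline)
print(results["clean_events"], results["event_total"])
```

The tables are declared in reverse order on purpose: the dependency graph, not the order of declaration, decides what runs first.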
Databricks Workspaces: Repos
To support ML application development, Repos provides repository-level
integration with Git-based hosting providers such as GitHub, GitLab, Bitbucket,
and Azure DevOps.
Developers can write code in a notebook and sync it with the hosting provider,
which lets them clone repositories, manage branches, push changes, pull changes,
etc.
Databricks Workspaces: Models
A model here is a trained ML model registered in the MLflow Model Registry,
a centralized model store that manages the entire life cycle of MLflow models.
The MLflow Model Registry records model lineage, model versioning, current state,
workflow, and stage transitions (for example, whether a version was promoted to
production or archived).
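The versioning and stage transitions described above can be sketched with a small in-memory registry. This is an illustration, not the MLflow client API: the class, its methods, and the choice to archive the previous production version automatically are assumptions of the sketch.

```python
STAGES = ("None", "Staging", "Production", "Archived")

class ModelRegistry:
    """Toy registry: every registered artifact becomes a new numbered version."""

    def __init__(self):
        self._versions = {}  # model name -> list of version records

    def register(self, name, source):
        versions = self._versions.setdefault(name, [])
        versions.append({"version": len(versions) + 1, "stage": "None", "source": source})
        return versions[-1]["version"]

    def transition(self, name, version, stage):
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if stage == "Production":
            # Design choice of this sketch: only one version serves production.
            for record in self._versions[name]:
                if record["stage"] == "Production":
                    record["stage"] = "Archived"
        self._versions[name][version - 1]["stage"] = stage

    def stage_of(self, name, version):
        return self._versions[name][version - 1]["stage"]

registry = ModelRegistry()
v1 = registry.register("churn", "runs:/hypothetical-run-1/model")
v2 = registry.register("churn", "runs:/hypothetical-run-2/model")
registry.transition("churn", v1, "Production")
registry.transition("churn", v2, "Production")
print(registry.stage_of("churn", v1), registry.stage_of("churn", v2))
```

Keeping the `source` of each version is what gives lineage: any deployed version can be traced back to the run that produced it.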
Governance requirements for
data are quickly evolving
Governance is hard to enforce on data lakes
[Figure: structured, semi-structured, unstructured, and streaming data spread across
Cloud 1, Cloud 2, and Cloud 3.]
The problem is getting bigger
Enterprises need a way to share and govern a wide variety of data products
Files Dashboards Models Tables
Unity Catalog for Lakehouse Governance
• Centrally catalog, search, and discover data and AI assets
• Simplify governance with a unified cross-cloud governance model
• Easily integrate with your existing enterprise data catalogs
• Securely share live data across platforms with Delta Sharing
Delta Sharing on Databricks
[Figure: a Data Provider exposes a Delta Lake Table through a Delta Sharing Server, which
applies access permissions; a Data Recipient reads the table with any sharing client over
the Delta Sharing Protocol.]
Machine Learning Workspace
ML Architecture: Data Warehouse vs. Data Lakehouse
[Figure: ML architecture on a Data Warehouse compared with ML architecture on a Data Lakehouse.]
Open Multi-Cloud Data Lakehouse and Feature Store
Collaborative Multi-Language Notebooks
← Full ML Lifecycle →
Model Tracking
and Registry
Model Training
and Tuning
Model Serving
and Monitoring
Automation and
Governance
Data Science and Machine Learning
A data-native and collaborative solution for the full ML lifecycle
What Does ML Need from a Lakehouse?
Access to Unstructured Data
• Images, text, audio, custom formats
• Libraries understand files, not tables
• Must scale to petabytes
Open Source Libraries
• OSS dominates ML tooling (TensorFlow, scikit-learn,
xgboost, R, etc.)
• Must be able to apply these in Python, R
Specialized Hardware, Distributed Compute
• Scalability of algorithms
• GPUs, for deep learning
• Cloud elasticity to manage that cost!
Model Lifecycle Management
• Outputs are model artifacts
• Artifact lineage
• Productionization of model
Three Data Users
Business Intelligence
• SQL and BI tools
• Prepare and run reports
• Summarize data
• Visualize data
• (Sometimes) Big Data
• Data Warehouse data store
Data Science
• R, SAS, some Python
• Statistical analysis
• Explain data
• Visualize data
• Often small data sets
• Database, data warehouse data store; local files
Machine Learning
• Python
• Deep learning and specialized GPU hardware
• Create predictive models
• Deploy models to prod
• Often big data sets
• Unstructured data in files
How Is ML Different?
• Operates on unstructured data like text
and images
• Can require learning from massive
data sets, not just analysis of a sample
• Uses open source tooling to
manipulate data as “DataFrames”
rather than with SQL
• Outputs are models rather than data or
reports
• Sometimes needs special hardware
MLOps and the Lakehouse
• Applying open tools in-place to data in
the lakehouse is a win for training
• Applying them for operating models is
important too!
• "Models are data too"
• Need to apply models to data
• MLflow for MLOps on the lakehouse
• Track and manage model data,
lineage, inputs
• Deploy models as lakehouse "services"
Feature Stores for Model Inputs
• Tables are OK for managing model input
• Input often structured
• Well understood, easy to access
• … but not quite enough
• Upstream lineage: how were
features computed?
• Downstream lineage: where is the
feature used?
• Model caller has to read and feed the inputs
• How can features (also) be accessed in real
time?
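The two lineage questions above (how was a feature computed, and where is it used) can be sketched as a small catalog. This is an illustrative data structure with hypothetical names, not the Databricks Feature Store API.

```python
from collections import defaultdict

class FeatureCatalog:
    """Toy catalog that records upstream and downstream lineage for features."""

    def __init__(self):
        self.upstream = {}                  # feature -> source tables it is computed from
        self.downstream = defaultdict(set)  # feature -> models that consume it

    def register_feature(self, feature, sources):
        self.upstream[feature] = tuple(sources)

    def register_model(self, model, features):
        for feature in features:
            if feature not in self.upstream:
                raise KeyError(f"unknown feature: {feature}")
            self.downstream[feature].add(model)

    def impacted_models(self, source):
        """Models whose inputs depend, through some feature, on the given source table."""
        return sorted({
            model
            for feature, sources in self.upstream.items() if source in sources
            for model in self.downstream[feature]
        })

catalog = FeatureCatalog()
catalog.register_feature("avg_basket_value", ["orders"])
catalog.register_feature("days_since_signup", ["customers"])
catalog.register_model("churn_model", ["avg_basket_value", "days_since_signup"])
catalog.register_model("upsell_model", ["avg_basket_value"])
print(catalog.impacted_models("orders"))
```

With both directions recorded, a change to a source table can be traced forward to every model that would be affected.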
SQL Analytics Workspace
Query data lake data using familiar ANSI SQL, and find and share new insights faster
with the built-in SQL query editor, alerts, visualizations, and interactive dashboards.
Databricks Workspaces: Queries
Provides a simplified, SQL-only editor for querying the data
Databricks Workspaces: Dashboards
A Databricks SQL dashboard lets you combine visualizations and text boxes that provide
context with your data.
Databricks Workspaces: Alerts
Alerts notify you when a field returned by a scheduled query meets a threshold.
Alerts complement scheduled queries, but their criteria are checked after every
execution.
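A minimal sketch of the alert logic described above: after each execution of a scheduled query, one field of the result is compared with a threshold. Evaluating only the first returned row, the column name, and the sample results are assumptions of this sketch.

```python
import operator

COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}

def alert_triggered(rows, column, op, threshold):
    """Check the alert criteria against the first row of a query result."""
    if not rows:
        return False
    return COMPARATORS[op](rows[0][column], threshold)

# Results of three consecutive executions of a hypothetical scheduled query.
executions = [
    [{"failed_jobs": 0}],
    [{"failed_jobs": 2}],
    [{"failed_jobs": 7}],
]

# The criteria are re-checked after every execution.
states = [alert_triggered(rows, "failed_jobs", ">", 5) for rows in executions]
print(states)
```

Only the third execution crosses the threshold, so only then would a notification go out.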
Databricks Workspaces: Query History
The query history shows SQL queries performed using SQL endpoints.
Thank you

AI in customer support Use cases solutions development and implementation.pdf
 
DESIGN AND MANUFACTURE OF CEILING BOARD USING SAWDUST AND WASTE CARTON MATERI...
DESIGN AND MANUFACTURE OF CEILING BOARD USING SAWDUST AND WASTE CARTON MATERI...DESIGN AND MANUFACTURE OF CEILING BOARD USING SAWDUST AND WASTE CARTON MATERI...
DESIGN AND MANUFACTURE OF CEILING BOARD USING SAWDUST AND WASTE CARTON MATERI...
 
Open Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surfaceOpen Channel Flow: fluid flow with a free surface
Open Channel Flow: fluid flow with a free surface
 
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUESAN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
 
Object Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOADObject Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOAD
 
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdfSri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
 
EV BMS WITH CHARGE MONITOR AND FIRE DETECTION.pptx
EV BMS WITH CHARGE MONITOR AND FIRE DETECTION.pptxEV BMS WITH CHARGE MONITOR AND FIRE DETECTION.pptx
EV BMS WITH CHARGE MONITOR AND FIRE DETECTION.pptx
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
 
Ericsson LTE Throughput Troubleshooting Techniques.ppt
Ericsson LTE Throughput Troubleshooting Techniques.pptEricsson LTE Throughput Troubleshooting Techniques.ppt
Ericsson LTE Throughput Troubleshooting Techniques.ppt
 

Databricks Platform.pptx

  • 1. The Databricks Platform Introduction All your data, analytics and AI on one platform Alex Ivanichev March 2022
  • 2. What is Databricks? Databricks is a unified and open data and analytics platform
  • 3. Modern Data Teams Data Engineers Data Scientists Data Analysts
  • 4. What does data management look like today?
  • 5. Data management complexity Siloed stacks increase data architecture complexity Data Warehousing Data Engineering Streaming decrease productivity Data Science and ML Data Analysts Data Engineers Data Engineers Disconnected systems and proprietary data formats make integration difficult Data Scientists Amazon Redshift Azure Synapse Snowflake SAP Teradata Google BigQuery IBM Db2 Oracle Autonomous Data Warehouse Hadoop Apache Airflow Apache Kafka Apache Spark Jupyter Amazon SageMaker Amazon EMR Apache Spark Apache Flink Amazon Kinesis Azure ML Studio MATLAB Google Dataproc Cloudera Azure Stream Analytics Google Dataflow Domino Data Labs SAS Tibco Spotfire Confluent TensorFlow PyTorch Extract Load Transform Real-time Database Analytics and BI Data marts Data prep Machine Learning Data Science Streaming Data Engine Data Lake Data Lake Data warehouse Structured, semi-structured and unstructured data Structured, semi-structured and unstructured data Structured data Streaming data sources
  • 7. Warehouses and lakes create complexity Two separate copies of the data Warehouses Proprietary Lakes Open Incompatible interfaces Warehouses SQL Lakes Python Incompatible security and governance models Warehouses Tables Lakes Files
  • 8. Data Warehouse Data Lake Streaming Analytics BI Data Science Machine Learning Structured, Semi-Structured and Unstructured Data Data Lakehouse One platform to unify all of your data, analytics, and AI workloads
  • 10. The data lakehouse offers a better path Data processing and management built on open source and open standards Common security, governance, and administration Modern Data Engineering Analytics and Data Warehousing Data Science and ML Integrated and collaborative role-based experiences with open APIs Cloud Data Lake Structured, semi-structured, and unstructured data Lake-first approach that builds upon where the freshest, most complete data resides AI/ML from the ground up High reliability and performance Single approach to managing data Support for all use cases on a single platform: • Data engineering • Data warehousing • Real time streaming • Data science and ML Built on open source and open standards Multi-cloud, work with your cloud of choice
  • 12. An open approach to bringing data management and governance to data lakes Better reliability with transactions 48x faster data processing with indexing Data governance at scale with fine-grained access control lists Data Warehouse Data Lake
  • 13. What is Delta Lake? ● An open source project that enables building a Lakehouse architecture on top of data lakes. ● A storage layer that brings scalable ACID transactions to Apache Spark and other big-data engines. ● Delta Lake provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS. Key Features: ● ACID Transactions ● Scalable Metadata Handling ● Time Travel (data versioning) ● Open Format ● Delta Lake change data feed ● Unified Batch and Streaming Source and Sink ● Schema Enforcement ● Schema Evolution ● Audit History ● Updates and Deletes ● 100% Compatible with Apache Spark API ● Data Clean-up https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
  • 14. Delta Lake solves challenges with data lakes RELIABILITY & QUALITY PERFORMANCE & LATENCY GOVERNANCE ACID transactions Advanced indexing & caching Governance with Data Catalogs
  • 15. Delta Lake key feature - ACID transactions Every change to a table is recorded in the transaction log as one or more of these actions: ● Add File: adds a data file. ● Remove File: removes a data file. ● Update Metadata: updates the table metadata. ● Set Transaction: records that a Structured Streaming job has committed a micro-batch with a given ID. ● Change Protocol: enables new features by switching the Delta Lake transaction log to the newest protocol version. ● Commit Info: contains information about the commit, such as which operation was made and when.
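
The action types on this slide can be read as an append-only log that is replayed in order to obtain the current table state. The following is a minimal sketch in plain Python, assuming a simplified in-memory representation: the `replay` function and the dictionary layout are illustrative only and are not Delta Lake's API or its exact on-disk schema.

```python
from typing import Iterable


def replay(commits: Iterable[list]) -> dict:
    """Replay commits in order and return the resulting table state.

    Each commit is a list of actions; each action is a dict with a single
    key naming the action type (a simplification assumed for this sketch).
    """
    state = {"files": set(), "metadata": {}}
    for actions in commits:
        for action in actions:
            if "add" in action:
                # Add File: the data file becomes part of the table.
                state["files"].add(action["add"]["path"])
            elif "remove" in action:
                # Remove File: the data file is no longer part of the table.
                state["files"].discard(action["remove"]["path"])
            elif "metaData" in action:
                # Update Metadata: the latest metadata wins.
                state["metadata"] = action["metaData"]
            # Other action types (commit info, set transaction, change
            # protocol) are ignored in this sketch.
    return state


# Hypothetical three-commit history: create, append, rewrite one file.
commits = [
    [{"metaData": {"schema": "city string, year int, population long"}},
     {"add": {"path": "part-0000.parquet"}}],
    [{"add": {"path": "part-0001.parquet"}}],
    [{"remove": {"path": "part-0000.parquet"}},
     {"add": {"path": "part-0002.parquet"}}],
]
state = replay(commits)
print(sorted(state["files"]))  # ['part-0001.parquet', 'part-0002.parquet']
```

Because a commit is applied either completely or not at all, a reader that replays the log never sees the intermediate state in which `part-0000.parquet` has been removed but its replacement has not yet been added.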
  • 16. State Recomputing With Checkpoint Files ● By default, Delta Lake automatically generates a checkpoint file every 10 commits. ● Delta Lake saves the checkpoint file in Parquet format in the same _delta_log subdirectory as the commit files.
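
To illustrate why checkpoints speed up state recomputation, here is a minimal sketch of which log files a reader would need in order to reconstruct a given table version. It assumes that commit files are named with a 20-digit zero-padded version and a `.json` suffix, that single-file checkpoints use a `.checkpoint.parquet` suffix, and that a checkpoint exists at every multiple of the interval. The helper names `log_file_name` and `files_to_read` are hypothetical.

```python
def log_file_name(version: int, checkpoint: bool = False) -> str:
    """Build a _delta_log file name for a version (naming assumed as above)."""
    suffix = "checkpoint.parquet" if checkpoint else "json"
    return f"{version:020d}.{suffix}"


def files_to_read(target_version: int, checkpoint_interval: int = 10) -> list:
    """List the log files needed to rebuild the state at target_version.

    Start from the most recent checkpoint at or before the target version,
    then apply only the JSON commits written after that checkpoint.
    """
    last_checkpoint = (target_version // checkpoint_interval) * checkpoint_interval
    files = []
    start = 0
    if last_checkpoint > 0:
        files.append(log_file_name(last_checkpoint, checkpoint=True))
        start = last_checkpoint + 1
    files.extend(log_file_name(v) for v in range(start, target_version + 1))
    return files


for name in files_to_read(12):
    print(name)
```

For version 12 the reader loads one checkpoint (version 10) and two commit files (versions 11 and 12) instead of replaying all 13 commits from version 0.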
  • 17. Building the foundation of a Lakehouse Filtered, Cleaned, Augmented Business-level Aggregates Greatly improve the quality of your data for end users BRONZE SILVER GOLD Raw Ingestion and History Kinesis CSV, JSON, TXT… Data Lake Quality BI & Reporting Streaming Analytics Data Science & ML
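
The bronze, silver and gold layers on this slide can be sketched as three successive transformations. The example below uses plain Python lists in place of Delta tables so that it runs anywhere; the row values, the cleaning rules and the function names `to_silver` and `to_gold` are assumptions made for illustration and are not taken from the deck.

```python
from collections import defaultdict

# Bronze: raw ingested rows, kept as-is (all strings, possibly dirty).
bronze = [
    {"city": "Springfield", "year": "2020", "population": "1000"},
    {"city": "Shelbyville", "year": "2020", "population": "500"},
    {"city": "Springfield", "year": "2021", "population": "1100"},
    {"city": "", "year": "2021", "population": "300"},            # missing city
    {"city": "Shelbyville", "year": "2021", "population": "n/a"},  # bad number
]


def to_silver(bronze_rows):
    """Filter and clean: drop rows with a missing city or non-numeric fields."""
    silver = []
    for row in bronze_rows:
        city = row.get("city", "").strip()
        try:
            year = int(row["year"])
            population = int(row["population"])
        except (KeyError, ValueError):
            continue  # unparseable rows stay in bronze only
        if city:
            silver.append({"city": city, "year": year, "population": population})
    return silver


def to_gold(silver_rows):
    """Business-level aggregate: total population per year."""
    totals = defaultdict(int)
    for row in silver_rows:
        totals[row["year"]] += row["population"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {2020: 1500, 2021: 1100}
```

Keeping the raw rows in bronze means the silver and gold layers can be recomputed if the cleaning rules change.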
  • 18. But the reality is not so simple Maintaining data quality and reliability at scale is complex and brittle CSV, JSON, TXT… Data Lake Kinesis BI & Reporting Streaming Analytics Data Science & ML
  • 19. Modern data engineering on the lakehouse Data Engineering on the Databricks Lakehouse Platform Open format storage Data transformation Scheduling & orchestration Automatic deployment & operations BI / Reporting Dashboarding Machine Learning / Data Science Data & ML Sharing Data Products Databases Streaming Sources Cloud Object Stores SaaS Applications NoSQL On-premises Systems Data Sources Data Consumers Observability, lineage, and end-to-end pipeline visibility Data quality management Data ingestion
  • 20. Data Science & Engineering Workspace
  • 21. Databricks Workspaces: Clusters A cluster is a set of computation resources on which a developer can run Data Analytics, Data Science, or Data Engineering workloads. The workloads can be executed as a set of commands written in a notebook.
  • 22. Databricks Workspaces: Notebooks A notebook is a web interface where a developer can write and execute code. It contains a sequence of runnable cells that help a developer work with files, manipulate tables, create visualizations, and add narrative text.
  • 23. Databricks Workspaces: Auto Loader Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. Auto Loader can load data files from Google Cloud Storage (GCS, gs://) in addition to Databricks File System (DBFS, dbfs:/). Supports the JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats.

      // upload_path is not defined on the slide; it is assumed to hold the
      // directory where incoming source files are uploaded.
      val checkpoint_path = "/tmp/delta/population_data/_checkpoints"
      val write_path = "/tmp/delta/population_data"

      // Set up the stream to begin reading incoming files from the
      // upload_path location.
      val df = spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .schema("city string, year int, population long")
        .load(upload_path)

      // Start the stream.
      // Use the checkpoint_path location to keep a record of all files that
      // have already been uploaded to the upload_path location.
      // For those that have been uploaded since the last check,
      // write the newly-uploaded files' data to the write_path location.
      df.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .start(write_path)

    https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html
  • 24. Databricks Workspaces: Jobs Jobs allow a user to run notebooks on a schedule. They are a way of executing or automating specific tasks such as ETL, model building, and more. An ML workflow pipeline can be organized into jobs so that its steps run sequentially, one after another.
  • 25. Databricks Workspaces: Delta Live Tables Delta Live Tables is a framework for declaratively defining, deploying, testing and upgrading data pipelines, which eliminates the operational burden of managing such pipelines.
  • 26. Databricks Workspaces: Repos To support ML application development, Repos provide repository-level integration with Git-based hosting providers such as GitHub, GitLab, Bitbucket, and Azure DevOps. Developers can write code in a notebook and sync it with the hosting provider, which lets them clone repositories, manage branches, push changes, pull changes, etc.
  • 27. Databricks Workspaces: Models A model here is a developer's ML model registered in the MLflow Model Registry, a centralized model store that manages the entire life cycle of MLflow models. The MLflow Model Registry provides information about model lineage, model versioning, current state, workflow, and stage transitions (whether promoted to production or archived).
  • 28. Governance requirements for data are quickly evolving
  • 29. Governance is hard to enforce on data lakes Cloud 1 Cloud 2 Cloud 3 Structured Semi-structured Unstructured Streaming
  • 30. The problem is getting bigger Enterprises need a way to share and govern a wide variety of data products Files Dashboards Models Tables
  • 31. Unity Catalog for Lakehouse Governance • Centrally catalog, search, and discover data and AI assets • Simplify governance with a unified cross-cloud governance model • Easily integrate with your existing enterprise data catalogs • Securely share live data across platforms with Delta Sharing
  • 32. Delta Sharing on Databricks Delta Lake Table Delta Sharing Server Delta Sharing Protocol Data Provider Data Recipient Any Sharing Client Access permissions
  • 34. ML Architecture: Data Warehouse VS Data Lakehouse Data Warehouse Data Lakehouse
  • 35. Open Multi-Cloud Data Lakehouse and Feature Store Collaborative Multi-Language Notebooks ← Full ML Lifecycle → Model Tracking and Registry Model Training and Tuning Model Serving and Monitoring Automation and Governance Data Science and Machine Learning A data-native and collaborative solution for the full ML lifecycle
  • 36. What Does ML Need from a Lakehouse? Access to Unstructured Data • Images, text, audio, custom formats • Libraries understand files, not tables • Must scale to petabytes Open Source Libraries • OSS dominates ML tooling (TensorFlow, scikit-learn, xgboost, R, etc) • Must be able to apply these in Python, R Specialized Hardware, Distributed Compute • Scalability of algorithms • GPUs, for deep learning • Cloud elasticity to manage that cost! Model Lifecycle Management • Outputs are model artifacts • Artifact lineage • Productionization of model
  • 37. Three Data Users Business Intelligence: • SQL and BI tools • Prepare and run reports • Summarize data • Visualize data • (Sometimes) Big Data • Data Warehouse data store Data Science: • R, SAS, some Python • Statistical analysis • Explain data • Visualize data • Often small data sets • Database, data warehouse data store; local files Machine Learning: • Python • Deep learning and specialized GPU hardware • Create predictive models • Deploy models to prod • Often big data sets • Unstructured data in files
  • 38. How Is ML Different? • Operates on unstructured data like text and images • Can require learning from massive data sets, not just analysis of a sample • Uses open source tooling to manipulate data as “DataFrames” rather than with SQL • Outputs are models rather than data or reports • Sometimes needs special hardware
  • 39. MLOps and the Lakehouse • Applying open tools in-place to data in the lakehouse is a win for training • Applying them for operating models is important too! • "Models are data too" • Need to apply models to data • MLflow for MLOps on the lakehouse • Track and manage model data, lineage, inputs • Deploy models as lakehouse "services"
  • 40. Feature Stores for Model Inputs • Tables are OK for managing model input • Input often structured • Well understood, easy to access • … but not quite enough • Upstream lineage: how were features computed? • Downstream lineage: where is the feature used? • Model caller has to read, feed inputs • How to (also) access features in real time?
  • 41. SQL Analytics Workspace Query data lake data using familiar ANSI SQL, and find and share new insights faster with the built-in SQL query editor, alerts, visualizations, and interactive dashboards.
  • 42. Databricks Workspaces: Queries Provides a simplified, SQL-only interface for querying the data.
  • 43. Databricks Workspaces: Dashboards A Databricks SQL dashboard lets you combine visualizations and text boxes that provide context with your data.
  • 44. Databricks Workspaces: Alerts Alerts notify you when a field returned by a scheduled query meets a threshold. Alerts complement scheduled queries, but their criteria are checked after every execution.
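
The alert behaviour described on this slide can be sketched as a threshold check that runs after every execution of a scheduled query. This is a plain-Python illustration and not the Databricks SQL API: the function names are hypothetical, and checking only the first returned row is an assumption of this sketch.

```python
import operator

# Comparison operators an alert condition may use (set chosen for this sketch).
COMPARATORS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}


def alert_triggered(rows, field, op, threshold):
    """Return True if the field in the first result row meets the condition."""
    if not rows:
        return False  # no result rows, nothing to compare
    return COMPARATORS[op](rows[0][field], threshold)


def run_scheduled_query_with_alert(run_query, field, op, threshold, notify):
    """Execute the query once, then evaluate the alert criteria on its result."""
    rows = run_query()
    if alert_triggered(rows, field, op, threshold):
        notify(f"{field} {op} {threshold} (value: {rows[0][field]})")
        return True
    return False


# Hypothetical usage: the "query" is a stub returning one row.
messages = []
fired = run_scheduled_query_with_alert(
    run_query=lambda: [{"failed_jobs": 7}],
    field="failed_jobs", op=">", threshold=5,
    notify=messages.append,
)
print(fired, messages)
```

Separating `alert_triggered` from the scheduling wrapper keeps the condition easy to test independently of how and when the query runs.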
  • 45. Databricks Workspaces: Query History The query history shows SQL queries performed using SQL endpoints.