Building End-to-End Delta Pipelines on GCP

Himanish Kushary, Practice Lead @ Databricks
Molly Nagamuthu, Sr Resident Solutions Architect @ Databricks

Agenda
▪ Delta Lake Overview
▪ Delta Lake on GCP
▪ Reference Architecture
▪ Demo
▪ Questions

Who are we?
Himanish Kushary, Practice Lead @ Databricks
Molly Nagamuthu, Sr Resident Solutions Architect @ Databricks

Reliability Challenges
● Hard to append data
● Modification of existing data difficult
● Jobs failing midway
● Real-time operations hard
● Costly to keep historical versions of the data

Performance Challenges
● Difficult to handle large metadata
● Hard to get great performance
● "Too many files" problems

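The "too many files" problem above is usually addressed by rewriting many small files as fewer large ones. The sketch below only illustrates the planning step with a greedy first-fit-decreasing grouping. The function name `plan_compaction` and the 128 MB default target are assumptions made for illustration; this is not Delta Lake's own compaction algorithm.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group file sizes into bins of at most target_mb (first-fit decreasing).

    Illustrative only: the name and the 128 MB default are assumptions,
    not part of any Delta Lake API.
    """
    bins = []  # each bin lists the files that would be rewritten as one file
    for size in sorted(file_sizes_mb, reverse=True):
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)  # fits into an existing bin
                break
        else:
            bins.append([size])     # no bin had room; open a new one
    return bins
```

For example, twenty 10 MB files with a 100 MB target are planned as two output files of ten inputs each, so a reader opens 2 files instead of 20.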
Governance and Data Quality Challenges
● Lack of transactional traceability
● Fine-grained access control difficult
● Data quality issues

A New Standard for Building Data Lakes
● Open Source & Open Format
● Adds Reliability & Performance
● Best of data warehouses and data lakes
● Fully Compatible with Spark APIs

Data Lake: Reality to the Ideal
● Efficient Upserts
● Data Versioning
● Time Travel
● ACID Transactions
● Schema Enforcement
● Data Quality Checks
● Scalable Metadata Handling
● Streaming/Batch Unification
● Small file compaction
DATA LAKE → DELTA LAKE
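Several of the capabilities listed above (upserts, data versioning, time travel, all-or-nothing commits, schema enforcement) follow from one idea: every write is recorded as a new committed version. Delta Lake keeps a transaction log alongside the data files for this purpose; the toy class below only mimics the idea in memory. `ToyVersionedTable` and its methods are hypothetical names invented for this sketch and are not the Delta Lake API.

```python
from copy import deepcopy


class ToyVersionedTable:
    """In-memory toy table with one snapshot per commit (illustration only)."""

    def __init__(self, schema, key):
        self.schema = dict(schema)   # column name -> expected Python type
        self.key = key               # column used to match rows on upsert
        self._versions = [{}]        # version 0 is the empty table

    @property
    def latest_version(self):
        return len(self._versions) - 1

    def _validate(self, row):
        # Schema enforcement: reject unexpected columns and wrong types.
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match the schema")
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise ValueError(f"column {column!r} expects {expected.__name__}")

    def upsert(self, rows):
        rows = list(rows)
        # Validate the whole batch first so a bad row commits nothing.
        for row in rows:
            self._validate(row)
        snapshot = deepcopy(self._versions[-1])
        for row in rows:
            snapshot[row[self.key]] = dict(row)  # update if present, else insert
        self._versions.append(snapshot)          # the commit creates a new version
        return self.latest_version

    def read(self, version=None):
        # Time travel: read any committed version; the default is the latest.
        v = self.latest_version if version is None else version
        if not 0 <= v <= self.latest_version:
            raise IndexError(f"no such version: {v}")
        return [dict(row) for _, row in sorted(self._versions[v].items())]
```

With a table created as `ToyVersionedTable({"id": int, "city": str}, key="id")`, two successive `upsert` calls produce versions 1 and 2, and `read(version=1)` still returns the rows as they were after the first commit. Copying the full snapshot on every commit is a simplification that keeps the sketch short; it would not scale to real data.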

Lakehouse
One platform to unify all of your data, analytics, and AI workloads
[Diagram labels: Data Warehouse, Data Lake, Lakehouse]

Lakehouse Architecture
[Diagram labels]
Data types: Structured, Semi-structured, Unstructured, Streaming
Workloads: Data Engineering; BI & SQL Analytics; Real-time Data Applications; Data Science & Machine Learning
Data Management & Governance
Open Data Lake

Build a Modern Lakehouse Architecture
[Diagram labels]
Data Lake: Raw Data (Structured, Semi-Structured, Unstructured Data), Curated Data
Workloads: Data Engineering; BI and SQL based Analytics; Dashboards and reporting; Data Science and Machine Learning; Streaming Analytics

Open, reliable and performant foundation
[Diagram labels]
Data Lake: Raw Data (Semi-Structured), Curated Data
Workloads: Data Engineering; BI and SQL based Analytics; Dashboards and reporting; Data Science and Machine Learning; Streaming Analytics

Select the Right Tool
[Diagram labels]
Data Lake: Raw Data (Semi-Structured), Curated Data
Workloads: Streaming Analytics; BI and SQL based Analytics; Dashboards and Reporting; Data Science and Machine Learning; Data Engineering

Simplify the Architecture
[Diagram labels]
Data Lake: Raw Data (Structured, Semi-Structured, Unstructured Data), Curated Data
ETL/ELT, Streaming, SQL Analytics on Data Lake, Data Science and ML
SQL Analytics and BI
Reporting and Dashboards

Delta Lake Reference Architecture
Unified architecture that facilitates:
● Reliability and performance of batch and streaming ETL and ML at scale
● Interactive data science and production machine learning
● Streamlined collaboration across the whole data ecosystem

Databricks on Google Cloud
A jointly developed service for data engineering, data science and analytics
Optimized Performance
Databricks on Google Kubernetes Engine is the first Kubernetes-based Databricks runtime, built for performance and scalability.
Streamlined Integrations
Faster, easier data access with built-in connectors across Google Cloud, including integrations with Google data analytics services (BigQuery, Pub/Sub, Looker) as well as cloud infrastructure and storage.
Integrated Security & Administration
One-click access to Databricks from the Google Cloud Console with SSO, Identity Passthrough and SCIM support
Innovate Faster with a Unified, Fully Integrated Platform for all Analytics
Make data lakes more Reliable and Scalable
Collaborate & Simplify DS/ML at Scale

