Building End-to-End Delta Pipelines
on GCP
Himanish Kushary,
Practice Lead @Databricks
Molly Nagamuthu,
Sr Resident Solutions Architect @Databricks
Agenda
▪ Delta Lake Overview
▪ Delta Lake on GCP
▪ Reference Architecture
▪ Demo
▪ Questions
Who are we?
Himanish Kushary, Practice Lead @Databricks
Molly Nagamuthu, Sr Resident Solutions Architect @Databricks
Delta Lake Overview
Reliability Challenges
● Hard to append data
● Modification of existing data is difficult
● Jobs failing midway
● Real-time operations are hard
● Costly to keep historical versions of the data
Performance Challenges
● Difficult to handle large metadata
● Hard to get great performance
● "Too many files" problems
Governance and Data Quality Challenges
● Lack of transactional traceability
● Fine-grained access control is difficult
● Data quality issues
A New Standard for Building Data Lakes
● Open Source & Open Format
● Adds Reliability & Performance
● Best of data warehouses and data lakes
● Fully Compatible with Spark APIs
Data Lake: Reality to the Ideal
● Efficient Upserts
● Data Versioning
● Time Travel
● ACID Transactions
● Schema Enforcement
● Data Quality Checks
● Scalable Metadata Handling
● Streaming/Batch Unification
● Small file compaction
DATA LAKE DELTA LAKE
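
Several of the capabilities listed above (upserts, data versioning, time travel, schema enforcement, all-or-nothing commits) can be illustrated with a small in-memory toy. The sketch below is only a conceptual model written for this purpose: the class `ToyVersionedTable` and its methods are hypothetical names, it is not Delta Lake's API, and it keeps a full copy of the table per version, which a real system would not do.

```python
import copy
from typing import Dict, List, Optional

Row = Dict[str, object]


class ToyVersionedTable:
    """Hypothetical in-memory stand-in for a versioned table (not Delta Lake's API)."""

    def __init__(self, schema: Dict[str, type], key: str) -> None:
        self.schema = schema
        self.key = key
        # One full snapshot per commit, keyed by primary key; version 0 is empty.
        self._snapshots: List[Dict[object, Row]] = [{}]

    def _validate(self, row: Row) -> None:
        # Schema enforcement: reject unexpected columns and wrong value types.
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match the schema")
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise ValueError(f"column {column!r} expects {expected.__name__}")

    def upsert(self, rows: List[Row]) -> int:
        """Update existing keys and insert new ones as one all-or-nothing commit."""
        for row in rows:
            self._validate(row)  # validate every row before changing any state
        snapshot = copy.deepcopy(self._snapshots[-1])
        for row in rows:
            snapshot[row[self.key]] = dict(row)
        self._snapshots.append(snapshot)
        return len(self._snapshots) - 1  # the new version number

    def read(self, version: Optional[int] = None) -> List[Row]:
        """Read the latest version, or an earlier one ("time travel")."""
        latest = len(self._snapshots) - 1
        index = latest if version is None else version
        if not 0 <= index <= latest:
            raise IndexError(f"version {index} does not exist")
        snapshot = self._snapshots[index]
        return [dict(snapshot[k]) for k in sorted(snapshot)]


table = ToyVersionedTable(schema={"id": int, "city": str}, key="id")
v1 = table.upsert([{"id": 1, "city": "Paris"}, {"id": 2, "city": "Rome"}])
v2 = table.upsert([{"id": 2, "city": "Milan"}, {"id": 3, "city": "Oslo"}])
latest_rows = table.read()        # id 2 updated, id 3 inserted
original_rows = table.read(v1)    # earlier version is still readable
```

Because every row is validated before the new snapshot is built, a batch containing one bad row leaves the table at its previous version, which is the behaviour the "ACID Transactions" and "Schema Enforcement" bullets describe at a high level.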
Lakehouse
One platform to unify all of your data, analytics, and AI workloads
[Diagram labels: Data Warehouse, Data Lake, Lakehouse]
Lakehouse Architecture
● Data Engineering
● BI & SQL Analytics
● Real-time Data Applications
● Data Science & Machine Learning
● Data Management & Governance
● Open Data Lake
● Structured, Semi-structured, Unstructured, and Streaming data
Delta Lake on GCP
Build a Modern Lakehouse Architecture
[Diagram labels. Data: Structured, Semi-Structured, Unstructured. Data Lake: Raw Data, Curated Data. Workloads: Data Engineering, BI and SQL-based Analytics, Dashboards and Reporting, Data Science and Machine Learning, Streaming Analytics.]
Open, reliable and performant foundation
[Diagram labels. Data: Structured, Semi-Structured, Unstructured. Data Lake: Raw Data, Curated Data. Workloads: Data Engineering, BI and SQL-based Analytics, Dashboards and Reporting, Data Science and Machine Learning, Streaming Analytics.]
Select the Right Tool
[Diagram labels. Data: Structured, Semi-Structured, Unstructured. Data Lake: Raw Data, Curated Data. Workloads: Streaming Analytics, BI and SQL-based Analytics, Dashboards and Reporting, Data Science and Machine Learning, Data Engineering.]
Simplify the Architecture
[Diagram labels. Data: Structured, Semi-Structured, Unstructured. Data Lake: Raw Data, Curated Data. Workloads: ETL/ELT, Streaming, SQL Analytics on Data Lake, Data Science and ML; SQL Analytics and BI; Reporting and Dashboards.]
Reference Architecture
Delta Lake Reference Architecture
Unified architecture that facilitates:
● Reliability and performance of batch and streaming ETL and ML at scale
● Interactive data science and production machine learning
● Streamlined collaboration across the whole data ecosystem
Demo Overview
Demo Architecture
Demo
Additional Resources
Databricks on Google Cloud
A jointly developed service for data engineering, data science, and analytics
Optimized Performance
Databricks on Google Kubernetes Engine is the first Kubernetes-based Databricks runtime, built to be performant and scalable
Streamlined Integrations
Faster, easier data access with built-in connectors across Google Cloud, including integrations with Google data analytics services (BigQuery, Pub/Sub, Looker) as well as cloud infrastructure and storage
Integrated Security & Administration
1-click access to Databricks from the Google Cloud Console, with SSO, identity passthrough, and SCIM support
Innovate Faster with a Unified, Fully Integrated Platform for All Analytics
Make Data Lakes More Reliable and Scalable
Collaborate & Simplify DS/ML at Scale
Thank you
