Delta has powered many production pipelines at scale in the Data and AI space since its introduction a few years ago.
Built on open standards, Delta provides data reliability and enhances storage and query performance to support big data use cases (both batch and streaming), fast interactive queries for BI, and machine learning. Delta has matured over the past couple of years on both AWS and Azure and has become the de facto standard for organizations building their Data and AI pipelines.
In today’s talk, we will explore building end-to-end pipelines on the Google Cloud Platform (GCP). Through presentation, code examples and notebooks, we will build the Delta pipeline from ingest to consumption using our Delta Bronze-Silver-Gold architecture pattern and show examples of consuming the Delta files using the BigQuery connector.
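To make the Bronze-Silver-Gold pattern concrete, here is a minimal plain-Python sketch of what each layer is responsible for. It is a toy model, not the pipeline from the talk: in practice each layer would be a Delta table written with Spark. The record format (`order_id,country,amount`) and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

def to_bronze(raw_lines):
    """Bronze: keep every input record as-is, tagged with its arrival order."""
    return [{"seq": i, "raw": line} for i, line in enumerate(raw_lines)]

def to_silver(bronze):
    """Silver: parse, drop malformed rows, keep the latest row per order_id."""
    latest = {}
    for rec in bronze:
        parts = rec["raw"].split(",")
        if len(parts) != 3:
            continue  # malformed row stays in bronze only
        order_id, country, amount = (p.strip() for p in parts)
        try:
            amount = float(amount)
        except ValueError:
            continue  # unparsable amount stays in bronze only
        latest[order_id] = {"order_id": order_id, "country": country, "amount": amount}
    return list(latest.values())

def to_gold(silver):
    """Gold: aggregate revenue per country, ready for reporting."""
    totals = defaultdict(float)
    for row in silver:
        totals[row["country"]] += row["amount"]
    return dict(totals)

# Hypothetical raw input (assumed format: order_id,country,amount).
raw = [
    "o1,US,10.0",
    "o2,DE,5.5",
    "bad row",
    "o1,US,12.0",  # later correction of order o1
    "o3,DE,abc",   # unparsable amount
]
bronze = to_bronze(raw)
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

The point of the split is that bronze never loses data, so silver and gold can always be rebuilt from it if the cleaning or aggregation rules change.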
5. Reliability Challenges
Hard to append data
Modification of existing data difficult
Jobs failing midway
Real-time operations hard
Costly to keep historical versions of the data
6. Performance Challenges
Difficult to handle large metadata
Hard to get great performance
"Too many files" problems
7. Governance and Data Quality Challenges
Lack of transactional traceability
Fine-grained access control difficult
Data quality issues
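One common way to address data quality issues like those listed above is to validate every batch at write time. The sketch below is a plain-Python illustration of that idea, assuming a schema given as a column-to-type mapping and quality checks given as predicates. `SchemaMismatch` and `validate_batch` are hypothetical names, not part of any Delta API.

```python
class SchemaMismatch(Exception):
    """Raised when a row does not match the expected columns or types."""

def validate_batch(rows, schema, checks=()):
    """Reject the whole batch on a schema violation, then split rows by quality checks.

    rows:   list of dicts
    schema: dict of column name -> expected Python type
    checks: iterable of predicates taking a row and returning True if it passes
    Returns (good_rows, quarantined_rows).
    """
    checks = list(checks)
    # Schema enforcement: all-or-nothing, so a bad batch never lands partially.
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            raise SchemaMismatch(f"row {i}: columns {sorted(row)} != {sorted(schema)}")
        for col, typ in schema.items():
            if not isinstance(row[col], typ):
                raise SchemaMismatch(f"row {i}: column {col!r} is not {typ.__name__}")
    # Quality checks: well-formed rows that fail a rule are set aside, not dropped.
    good, quarantined = [], []
    for row in rows:
        (good if all(check(row) for check in checks) else quarantined).append(row)
    return good, quarantined

# Hypothetical schema and rule, for illustration only.
schema = {"id": int, "amount": float}
non_negative = lambda r: r["amount"] >= 0

good, quarantined = validate_batch(
    [{"id": 1, "amount": 9.5}, {"id": 2, "amount": -1.0}],
    schema,
    checks=[non_negative],
)

try:
    validate_batch([{"id": "3", "amount": 1.0}], schema)
    rejected = False
except SchemaMismatch:
    rejected = True
```

The design choice here is to treat structural problems (wrong columns or types) as fatal for the batch, while rows that merely fail a business rule are quarantined for inspection.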
8. Open Source & Open Format
Adds Reliability & Performance
Best of data warehouses and data lakes
Fully Compatible with Spark APIs
A New Standard for Building Data Lakes
9. Data Lake: Reality to the Ideal
● Efficient Upserts
● Data Versioning
● Time Travel
● ACID Transactions
● Schema Enforcement
● Data Quality Checks
● Scalable Metadata Handling
● Streaming/Batch Unification
● Small file compaction
DATA LAKE DELTA LAKE
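Several of the features listed above (efficient upserts, data versioning, time travel, all-or-nothing commits) follow from one idea: every write produces a new table version, and old versions stay readable. The toy class below sketches that idea in memory. It is a deliberate simplification, keeping a full snapshot per version, whereas Delta records file-level changes in a transaction log. `ToyVersionedTable` and its methods are hypothetical names, not a Delta API.

```python
import copy

class ToyVersionedTable:
    """In-memory table that keeps one immutable snapshot per committed version."""

    def __init__(self):
        self._snapshots = [{}]  # version 0 is the empty table

    @property
    def version(self):
        return len(self._snapshots) - 1

    def read(self, version=None):
        """Return rows keyed by primary key as of `version` (default: latest)."""
        v = self.version if version is None else version
        return copy.deepcopy(self._snapshots[v])

    def upsert(self, rows, key):
        """Update rows whose key exists, insert the rest, commit as one new version."""
        staged = copy.deepcopy(self._snapshots[-1])
        for row in rows:
            staged[row[key]] = dict(row)  # raises KeyError before anything is published
        self._snapshots.append(staged)    # the single step that makes the change visible
        return self.version

table = ToyVersionedTable()
v1 = table.upsert([{"id": 1, "city": "Paris"}, {"id": 2, "city": "Rome"}], key="id")
v2 = table.upsert([{"id": 2, "city": "Milan"}, {"id": 3, "city": "Oslo"}], key="id")

latest = table.read()                 # current state
as_of_v1 = table.read(version=v1)     # "time travel" to the first commit

# A write that fails midway leaves no trace: the second row has no key.
try:
    table.upsert([{"id": 4, "city": "Lima"}, {"city": "missing key"}], key="id")
except KeyError:
    pass
version_after_failure = table.version
```

Because readers only ever see committed snapshots, a job that fails midway cannot expose partial results, and an earlier version can be queried or restored at any time.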
11. Lakehouse Architecture
Structured, Semi-structured, Unstructured, Streaming
Data Engineering
BI & SQL Analytics
Real-time Data Applications
Data Science & Machine Learning
Data Management & Governance
Open Data Lake
13. Build a Modern Lakehouse Architecture
Structured, Semi-Structured, Unstructured Data
Data Lake
Raw Data
Curated Data
Data Engineering
BI and SQL-based Analytics
Dashboards and Reporting
Data Science and Machine Learning
Streaming Analytics
14. Open, reliable and performant foundation
Structured, Semi-Structured, Unstructured Data
Data Lake
Raw Data
Curated Data
Data Engineering
BI and SQL-based Analytics
Dashboards and Reporting
Data Science and Machine Learning
Streaming Analytics
15. Select the Right Tool
Structured, Semi-Structured, Unstructured Data
Data Lake
Raw Data
Curated Data
Streaming Analytics
BI and SQL-based Analytics
Dashboards and Reporting
Data Science and Machine Learning
Data Engineering
16. Simplify the Architecture
Structured, Semi-Structured, Unstructured Data
Data Lake
Raw Data
Curated Data
ETL/ELT, Streaming, SQL Analytics on Data Lake, Data Science and ML
SQL Analytics and BI
Reporting and Dashboards
18. Delta Lake Reference Architecture
Unified architecture that facilitates:
● Reliability and performance of batch and streaming ETL and ML at scale
● Interactive data science and production machine learning
● Streamlined collaboration across the whole data ecosystem
23. Databricks on Google Cloud
A jointly developed service for data engineering, data science and analytics
Optimized Performance
Databricks on Google Kubernetes Engine is the first Kubernetes-based Databricks runtime, built for performance and scalability
Streamlined Integrations
Faster, easier data access with built-in connectors across Google Cloud, including integrations with Google data analytics services (BigQuery, Pub/Sub, Looker) as well as cloud infrastructure and storage.
Integrated Security & Administration
1-click access to Databricks from the Google Cloud Console with SSO, Identity Passthrough and SCIM support
Innovate Faster with a Unified, Fully Integrated Platform for all Analytics
Make data lakes more Reliable and Scalable
Collaborate & Simplify DS/ML at Scale