Massive Data Processing in Adobe Using Delta Lake

Massive Data Processing in Adobe using Delta Lake
Yeshwanth Vijayakumar, Sr. Engineering Manager/Architect @ Adobe
Agenda
§ Introduction
§ What are we storing?
§ Data Representation and Nested Schema Evolution
§ Writer Worries and How to Wipe them Away
§ Staging Tables FTW
§ Datalake Replication Lag Tracking
§ Performance Time!
Unified Profile Data Ingestion
[Diagram: Experience Data Model records from Adobe Campaign, AEM, Adobe Analytics, and Adobe AdCloud are ingested into the Unified Profile; a change feed streams into stats generation; the flow spans single-tenant and multi-tenant components and links identities.]
Data Layout At a Glance
An idea of how the graph linkages are stored:

primaryId | relatedIds | field1 | field2 | ... | field1000
123       | 123        | a      | b      | ... | c
456       | 456        | d      | e      | ... | f
123       | 123        | d      | e      | ... | l
789       | 789,101    | x      | y      | ... | z
101       | 789,101    | x      | u      | ... | p

Conditions
• primaryId does not change
• relatedIds can change
New Record comes in
Indicating a new linkage, causing a change in graph membership.

New record comes in linking 103 with 789 and 101:

primaryId | relatedIds  | field1 | field2 | ... | field1000
103       | 103,789,101 | q      | w      | ... | r

This causes a cascading change in the rows of 789 and 101:

primaryId | relatedIds  | field1 | field2 | ... | field1000
103       | 103,789,101 | q      | w      | ... | r
789       | 103,789,101 | x      | y      | ... | z
101       | 103,789,101 | x      | y      | ... | z
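A tiny PySpark sketch of this cascading membership update (table and column names are illustrative, not the production schema): when 103 arrives carrying relatedIds 103,789,101, every existing row whose primaryId is in that set must pick up the new membership.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

new_related = ["103", "789", "101"]      # membership carried by the incoming record

# Hypothetical profiles table with primaryId, relatedIds, field1..field1000.
profiles = spark.table("profiles")

# Rows for 789 and 101 (and the new 103 row) all receive the updated relatedIds value.
updated = profiles.withColumn(
    "relatedIds",
    F.when(F.col("primaryId").isin(new_related), F.lit(",".join(new_related)))
     .otherwise(F.col("relatedIds")))
```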
Main Access Pattern
Multiple queries (Query 1, Query 2, Query 3, ..., Query 1000) run over one consolidated row.
Complexities?
• Nested Fields
• a.b.c.d[*].e nested hairiness!
• Arrays!
• MapType
• Every Tenant has a different Schema!
• Schema evolves constantly
• Fields can get deleted or updated
• Multiple Sources
• Streaming
• Batch
Scale?
• Tenants have 10+ billion rows
• PBs of data
• Million RPS peak across the system
• Triggers multiple downstream applications
• Segmentation
• Activation
What is Delta Lake?
From delta.io: Delta Lake is an open-source project that enables building a Lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS.

Key Features
• ACID Transactions
• Time Travel (data versioning)
• Uses Parquet underneath
• Schema Enforcement and Schema Evolution
• Audit History
• Updates and Deletes Support
Delta Lake in Practice
• UPSERT support (MERGE)
• SQL compatible
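As a minimal sketch (PySpark, with illustrative table and column names rather than the production schema), an upsert with Delta Lake boils down to a MERGE:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# `updates` is a batch of new/changed rows; `raw_records` is an existing Delta table.
updates = spark.table("staging_batch")
raw = DeltaTable.forName(spark, "raw_records")

(raw.alias("t")
    .merge(updates.alias("s"), "t.primaryId = s.primaryId")
    .whenMatchedUpdateAll()      # existing rows get overwritten with the new values
    .whenNotMatchedInsertAll()   # brand-new primaryIds get inserted
    .execute())
```

The same operation can be written as plain SQL with MERGE INTO, which is what "SQL compatible" buys you.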
Writer Worries and How to Wipe them Away
• Concurrency Conflicts
• Column size
  • When individual column data exceeds 2 GB, we see degradation in writes or OOM
• Update frequency
  • Too-frequent updates cause underlying filestore metadata issues
  • This is because every transaction on an individual Parquet file causes copy-on-write (CoW); more updates => more rewrites on HDFS
  • Too many small files!
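A hedged sketch of the usual mitigation for the small-files problem: batch updates into fewer, larger commits and compact periodically (OPTIMIZE is available on Databricks and in newer OSS Delta releases; table names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Coalesce each micro-batch so a commit produces a handful of reasonably sized files
# instead of hundreds of tiny ones.
batch_df = spark.table("cdc_batch").coalesce(16)
batch_df.write.format("delta").mode("append").saveAsTable("staging_table")

# Periodically compact whatever small files still accumulate.
spark.sql("OPTIMIZE staging_table")
```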
CDC (existing)
[Diagram: Batch Ingestion / Streaming Ingestion / API-based Ingest go through Mutation Apps, which (1) send the request to CosmosDB and (2) receive an ack; CosmosDB then (3) emits CDC. The CDC feed is consumed by Stats, Edge, etc.]
Dataflow with Delta Lake
[Diagram: the CosmosDB change feed (CDC) is consumed by a CDC Dumper, a long-running streaming application (plus a Backfill path), which writes APPEND-only into a Staging Table partitioned by tenant and 15-minute time intervals. A Processor checks for work every X minutes, fetches records to process, and UPSERTs/DELETEs into the per-tenant Raw Table, coordinated by a TenantLock in Redis.]

Row shapes along the way:

Full row in CosmosDB:
primaryId | relatedIds  | field1 | field2 | ... | field1000
103       | 103,789,101 | q      | w      | ... | r
789       | 103,789,101 | x      | y      | ... | z
101       | 103,789,101 | x      | y      | ... | z

CDC record (changed fields only):
primaryId | relatedIds  | field1 | field1000
103       | 103,789,101 | q      | r

Row in the Delta tables (Staging/Raw), with the payload stored as a JSON string:
primaryId | relatedIds  | jsonString
103       | 103,789,101 | <jsonStr>
789       | 103,789,101 | <jsonStr>
101       | 103,789,101 | <jsonStr>
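A rough sketch of the CDC Dumper's append-only path, assuming a Kafka-backed change feed and hypothetical topic, column, and path names (the real application is a long-running Structured Streaming job):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

cdc = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")    # placeholder
       .option("subscribe", "cosmos-change-feed")            # placeholder topic
       .load())

staged = (cdc
    .selectExpr("CAST(value AS STRING) AS jsonString", "timestamp")
    .withColumn("tenantId", F.get_json_object("jsonString", "$.tenantId"))
    # 15-minute buckets form the second partition key.
    .withColumn("tsBucket", F.window("timestamp", "15 minutes").getField("start")))

(staged.writeStream
    .format("delta")
    .outputMode("append")                              # APPEND only: no write conflicts
    .option("checkpointLocation", "/tmp/chk/staging")  # placeholder path
    .partitionBy("tenantId", "tsBucket")
    .toTable("staging_table"))
```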
Staging Tables FTW
Fan-in pattern vs fan-out
• Multiple source writers issue solved
  • By centralizing all reads from the CDC, since ALL writes generate a CDC event
• Staging Table in APPEND ONLY mode
  • No conflicts while writing to it
  • Filter out bad data above thresholds before it makes it into the Raw Table
• Batch writes by reading larger blocks of data from the Staging Table (see the processor sketch below)
  • Since it acts as a time-aware message buffer
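A rough sketch of the Processor's fan-in loop under these assumptions (per-tenant raw tables, primaryId/timestamp pulled out as columns in the staging table; all names are hypothetical). Deletes and the Redis TenantLock are omitted for brevity.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

def process_tenant(tenant_id: str, last_bucket):
    """Read a larger block of staged CDC rows for one tenant and merge them into its raw table."""
    staged = (spark.table("staging_table")
              .where((F.col("tenantId") == tenant_id) & (F.col("tsBucket") > last_bucket)))

    if staged.limit(1).count() == 0:
        return last_bucket                      # nothing new to do

    # Keep only the newest CDC record per primaryId before merging.
    w = Window.partitionBy("primaryId").orderBy(F.col("timestamp").desc())
    latest = (staged.withColumn("rn", F.row_number().over(w))
                    .where("rn = 1").drop("rn"))

    raw = DeltaTable.forName(spark, f"raw_records_{tenant_id}")   # per-tenant raw table
    (raw.alias("t")
        .merge(latest.alias("s"), "t.primaryId = s.primaryId")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    return staged.agg(F.max("tsBucket")).first()[0]
```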
Staging Table Logical View
[Diagram: Staging Table logical view with a ProgressMap.]
Why choose JSON String format?
§ We are doing a lazy schema-on-read approach.
  ▪ Yes, this is an anti-pattern.
§ Nested schema evolution was not supported on update in Delta in 2020.
  ▪ Supported with the latest version.
§ We want to apply conflict resolution before upserting.
  ▪ E.g. resolveAndMerge(newData, oldData)
  ▪ UDFs are strict on types; with the plethora of different schemas, it is crazy to manage a UDF per org in a multi-tenant fashion.
  ▪ Now we just have simple JSON merge UDFs (a rough sketch follows below).
  ▪ We use json-iter, which is very efficient at loading partial bits of JSON and manipulating them.
§ Don't you lose predicate pushdown?
  ▪ We have pulled out all main push-down filters into individual columns.
  ▪ E.g. timestamp, recordType, id, etc.
  ▪ Profile workloads are mainly scan based since we can run 1000s of queries at a single time.
  ▪ Reading the whole JSON string from the datalake is much faster and cheaper than reading 20% of all fields from Cosmos.
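The production merge UDFs use json-iter on the JVM; purely as an illustration of the idea, here is a minimal Python stand-in for resolveAndMerge that does a shallow last-writer-wins merge of the JSON strings (the real conflict resolution is richer):

```python
import json
from pyspark.sql import functions as F, types as T

@F.udf(returnType=T.StringType())
def resolve_and_merge(new_json: str, old_json: str) -> str:
    # Shallow merge: fields present in the new record win, everything else is kept.
    old = json.loads(old_json) if old_json else {}
    new = json.loads(new_json) if new_json else {}
    old.update(new)
    return json.dumps(old)

# Usage sketch inside the MERGE (update only the payload column):
# .whenMatchedUpdate(set={"jsonString": resolve_and_merge(F.col("s.jsonString"),
#                                                         F.col("t.jsonString"))})
```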
Schema On Read is a more future-safe approach for raw data
§ Wrangling Spark Structs is not user friendly
§ JSON schema is messy
  ▪ Crazy nesting
  ▪ Add maps to the equation, and just the schema will be in MBs
§ Schema on read using json-iter means we can read what we need on a row-by-row basis
§ Materialized views WILL have structs!
Partition Scheme of Raw Records
• RawRecords Delta Table, partitioned by:
  • recordType
  • sourceId
  • timestamp (key-value records will use a DEFAULT value)
• Z-ORDER on primaryId
  • Z-ordering colocates column information in the same set of files using locality-preserving space-filling curves (see the sketch below).
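A hedged sketch of that layout (illustrative names; OPTIMIZE ... ZORDER BY is available on Databricks and in newer OSS Delta releases):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
  CREATE TABLE IF NOT EXISTS raw_records (
    primaryId  STRING,
    relatedIds STRING,
    jsonString STRING,
    recordType STRING,
    sourceId   STRING,
    timestamp  TIMESTAMP   -- key-value records carry a DEFAULT value here
  )
  USING DELTA
  PARTITIONED BY (recordType, sourceId, timestamp)
""")

# Colocate rows with nearby primaryId values in the same files.
spark.sql("OPTIMIZE raw_records ZORDER BY (primaryId)")
```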
Replication Lag – 2 types
• CDC lag from Kafka
  • Tells us how much more work we need to do to catch up on writes to the Staging Table
• How we track lag on a per-tenant basis
  • We track Max(TimeStamp) in the CDC per org
  • We track Max(TSKEY) processed in the Processor
  • The difference gives us a rough replication lag (see the sketch below)
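A small sketch of that per-tenant lag computation, with hypothetical table and column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Newest CDC event time that has landed in the staging table, per tenant.
cdc_max = (spark.table("staging_table")
           .groupBy("tenantId")
           .agg(F.max("timestamp").alias("max_cdc_ts")))

# Newest timestamp key the processor has actually applied to the raw table, per tenant.
processed_max = (spark.table("processor_progress")
                 .groupBy("tenantId")
                 .agg(F.max("tskey").alias("max_processed_ts")))

# Rough replication lag per tenant, in seconds.
lag = (cdc_max.join(processed_max, "tenantId")
       .withColumn("lag_seconds",
                   F.col("max_cdc_ts").cast("long") - F.col("max_processed_ts").cast("long")))
```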
Merge/UPSERT Performance
Live traffic use case: how long does it take X CDC messages to get upserted into the Raw Table?

Action: UPSERT CDC stage into fragment                            | Time Taken
170K CDC records (maps to 100K rows in the Raw Table)             | 15 seconds
1.7 million CDC records (maps to 1 million rows in the Raw Table) | 61 seconds
Job Performance Time!

                     | Hot Store (NoSQL Store) | Delta Lake
Size of Data         | 1 TB                    | 64 GB
Number of Partitions | 80                      | 189
Job Cores Used       | 112                     | 112
Job Runtime          | 3 hours                 | 25 mins
Takeaways
• Scan IO speed from the datalake >>> reads from the Hot Store
• Reasonably fast, eventually consistent replication within minutes
• More partitions means better Spark executor core utilization
• Potential to aggressively TTL data in the hot store
• More downstream materialization!
• Incremental computation framework thanks to staging tables!
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.