At Salesforce, our customers use High Velocity Sales to intelligently convert leads and create new opportunities. To support this, we built the Engagement Activity Platform to automatically capture and store user engagement activities using Delta Lake. It is a key component behind Einstein Analytics, for creating powerful reports and dashboards, and Sales Cloud Einstein, for training machine learning models.
Converting leads and creating new opportunities requires our engagement activity delta lake to handle data mutations at scale. In this presentation, we will share the challenges and lessons learned from building a high-performance mutable data lake on Delta Lake, including:
Independent Stream Process to Support Engagement Data Lifecycle
Downstream Incremental Read
High Throughput Transactions in Engagement ID Mutation
Detect Cascading ID Mutation with Graph
Data Skipping and Z-Order with I/O Pruning
High Data Consistency and Integrity
Exactly-Once Writes Across Tables
Global Synchronization and Ordering
2. Agenda
● Engagement Delta Lake
● Pipeline Requirements
● Pipeline Design
● Performance Benchmarking Results
● Q & A
3. What’s Engagement Activity Delta Lake
● At Salesforce, our customers use High Velocity Sales (HVS) to intelligently convert leads and create new opportunities
● We built the Engagement Activity Platform (EAP) to capture and store user engagement activities
● The engagement activity delta lake is the key component of EAP
● This volume of data can only scale with the engagement activity delta lake, built on top of Delta Lake
4. Key Use Cases of Engagement Delta Lake
● Use engagement metrics/rates to identify which cadences and templates are more effective
● Use engagement signals such as open/reply rates to identify which customers are more engaged
● Leverage the engagement dashboard to drive intelligence into sales productivity
5. Delta Lake Requirements
● Independent Stream Process to Support Engagement Data Lifecycle
● Downstream Batch/Incremental Read
● High Throughput Transactions in Engagement ID Mutation
● High Data Consistency and Integrity
7. Downstream Batch/Incremental Read
● Created a separate table, the Notification Table, partitioned by organization ID and ingestion timestamp
● Downstream consumers can use streaming mode
○ Pull from the Notification Table to get delta-change metadata (table name / orgId / timestamp)
○ Use that metadata to pull engagement data from the Data Table
● Consumers can also pull directly from the Data Table in batch mode (see the sketch below)
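A minimal sketch of the streaming consumer side of this pattern. The table paths and column names (orgId, ingestTs) are assumptions for illustration; the deck does not show the actual schema:

```scala
// Hypothetical sketch: stream delta-change metadata from the Notification
// Table, then pull the matching engagement rows from the Data Table.
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("IncrementalRead").getOrCreate()

val notifications = spark.readStream
  .format("delta")
  .load("/delta/notification_table")   // rows of (tableName, orgId, ingestTs)

notifications.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // For every notified (orgId, ingestTs) pair, read the actual data.
    batch.select("orgId", "ingestTs").distinct().collect().foreach { row =>
      val engagements = spark.read.format("delta")
        .load("/delta/data_table")
        .where(s"orgId = '${row.getString(0)}' AND ingestTs = '${row.getString(1)}'")
      // ... hand the engagement rows to the downstream consumer
    }
  }
  .start()
```

The metadata-first hop keeps downstream readers from scanning the large Data Table for changes; they only read the partitions they were notified about.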
8. Downstream Batch/Incremental Read
● We extended this design pattern to the mutation/TTL/GDPR jobs
● We keep insert/update/delete counters per batch for auditing (see the sketch below)
(Diagram: Notification Table)
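One way to source those per-batch counters, sketched against Delta Lake's commit history; reading operationMetrics this way is our assumption about how the audit counters could be gathered, not necessarily how EAP does it:

```scala
// Hypothetical sketch: pull insert/update/delete counters for the latest
// commit from Delta Lake's operationMetrics, for writing to an audit table.
import io.delta.tables.DeltaTable

val lastCommit = DeltaTable.forPath(spark, "/delta/engagement_data")
  .history(1)   // most recent commit only
  .selectExpr(
    "version",
    "operationMetrics['numTargetRowsInserted'] as inserted",
    "operationMetrics['numTargetRowsUpdated']  as updated",
    "operationMetrics['numTargetRowsDeleted']  as deleted")

lastCommit.show()
```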
9. High Throughput Transactions in Engagement ID Mutation
Engagement ID Mutation
● Supports: Convert, Merge, and Delete
● Convert: a lead L could become a contact C with a new ID, and all engagements that belong to L get a new Engagement ID
10. High Throughput Transactions in Engagement ID Mutation
Engagement ID Mutation
(Diagram: Mutation Request Table feeding the EngagementData Table; Id: string)
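The mutation itself maps naturally onto a Delta MERGE. A minimal sketch, assuming hypothetical paths and columns (oldId, newId, requestType); the real mutation job likely does more, such as cascading-ID handling:

```scala
// Hypothetical sketch: apply ID-mutation requests (convert/delete) to the
// EngagementData Table with a single Delta MERGE per micro-batch.
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col

val mutations = spark.read.format("delta").load("/delta/mutation_requests")

DeltaTable.forPath(spark, "/delta/engagement_data").as("e")
  .merge(
    mutations.as("m"),
    col("e.orgId") === col("m.orgId") && col("e.engagementId") === col("m.oldId"))
  .whenMatched(col("m.requestType") === "CONVERT")
  .updateExpr(Map("engagementId" -> "m.newId"))   // lead L's engagements move to contact C
  .whenMatched(col("m.requestType") === "DELETE")
  .delete()
  .execute()
```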
12. High Throughput Transactions in Engagement ID Mutation
Partitioned by OrgId and Z-Ordered by Engagement Date
● Keeps the data of the Engagement table evenly distributed across reasonably-sized files
● Data are written to a per-org partition directory and clustered by the Z-Order column
● Benefits (see the sketch below):
○ Manages granularity: small files per org/date are merged into bigger ones, which reduces the number of small files
○ Output file size can be tuned with spark.databricks.delta.optimize.maxFileSize
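A minimal sketch of this layout, assuming a hypothetical engagements DataFrame and table path (OPTIMIZE ... ZORDER BY is the Databricks Delta syntax):

```scala
// Hypothetical sketch: write per-org partitions, then compact and Z-Order
// by engagement date with a 128 MB target file size.
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 134217728L) // 128 MB

engagements.write
  .format("delta")
  .partitionBy("orgId")
  .mode("append")
  .save("/delta/engagement_data")

spark.sql("OPTIMIZE delta.`/delta/engagement_data` ZORDER BY (engagementDate)")
```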
13. High Throughput Transactions in Engagement ID Mutation
Query by I/O Pruning -- Data Skipping and Z-Order
● Data Skipping
○ Delta Lake automatically maintains the min and max values for up to 32 fields in a Delta table and stores them as part of the metadata
○ By leveraging these min-max ranges, Delta Lake can skip files that fall outside the range of the queried field values
● Z-Order
○ To make skipping effective, data can be clustered by the Z-Order columns so that min-max ranges are narrow and, ideally, non-overlapping (see the sketch below)
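For example, a pruned query might look like the following; the orgId value and column names are hypothetical:

```scala
// Hypothetical sketch: with the table Z-Ordered by engagementDate, files whose
// min/max engagementDate range falls entirely outside the predicate are never
// read, so the scan touches only a small fraction of the table.
val recentEngagements = spark.read.format("delta")
  .load("/delta/engagement_data")
  .where("orgId = '00D000000000001' AND engagementDate >= '2021-01-01'")
```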
15. High Data Consistency and Integrity
Exactly-Once Writes Across Tables
● A Checkpoint Store that stores the start offset, end offset, Kafka metadata, and last job state for a given checkpoint
● We created a batch metadata store that stores the job name, batch ID (the last succeeded batch ID provided by the Spark foreachBatch API), process name, and last-modified timestamp (see the sketch below)
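A minimal sketch of the idempotent-write check this metadata enables; the store interface and names (BatchMetadataStore, writeExactlyOnce) are assumptions for illustration:

```scala
// Hypothetical sketch: inside foreachBatch, each process writes a batch only
// if its ID is newer than the last one recorded in the batch metadata store,
// so replayed micro-batches are skipped instead of being written twice.
import org.apache.spark.sql.DataFrame

trait BatchMetadataStore {
  def lastBatchId(jobName: String, processName: String): Long
  def commit(jobName: String, batchId: Long, processName: String, ts: Long): Unit
}

def writeExactlyOnce(batch: DataFrame,
                     batchId: Long,
                     processName: String,
                     store: BatchMetadataStore): Unit = {
  if (batchId <= store.lastBatchId("Ingestion", processName)) {
    return   // replayed batch: this process already committed it
  }
  batch.write.format("delta").mode("append").save(s"/delta/$processName")
  store.commit("Ingestion", batchId, processName, System.currentTimeMillis())
}
```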
16. High Data Consistency and Integrity
Exactly-Once Writes Across Tables
Happy Path Flow
job_name    batch_id  process_name       last_modified
Ingestion   10        data_ingestion     1611683367
Ingestion   10        data_notification  1611673367
17. High Data Consistency and Integrity
Exactly-Once Writes Across Tables
Unhappy Path Flow
job_name    batch_id  process_name       last_modified
Ingestion   10        data_ingestion     1611683367
Ingestion   9         data_notification  1611673367
Here data_notification is one batch behind: when Spark replays batch 10, data_ingestion sees it has already committed that batch and skips the write, while data_notification applies it, restoring consistency.
18. High Data Consistency and Integrity
Global Synchronization and Ordering
● Avoid conflicting commits
● Ensure the engagement lifecycle order:
○ Ingestion -> Mutation -> Deletion
● Applied per micro-batch
19. High Data Consistency and Integrity
Global Synchronization and Ordering
● Global Synchronization
○ ZK Distributed Lock
● Ordering
○ Compare & Swap
20. High Data Consistency and Integrity
Global Synchronization and Ordering
1. The streaming job starts and the Job Coordinator is initialized with Zookeeper.
2. The streaming job pulls data from Kafka periodically and starts a micro-batch process when a message arrives in Kafka.
3. Within a micro-batch process, the Job Coordinator first tries to obtain a distributed lock with the resource name set in job.coordinator.lock.name. If it cannot obtain the lock within a given time, it gives up; the next pull will start from the last checkpoint.
4. Once it obtains the lock, it reads the Predecessor field and compares it with the expected one set in job.coordinator.predecessor. (4.1) If the predecessor is not the expected one, it gives up this turn and releases the lock; the next pull will start from the last checkpoint. (4.2) If the predecessor is the expected one, it registers its name set in job.coordinator.name.
5. The micro-batch process starts.
6. The checkpoint is saved.
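A minimal sketch of those steps, assuming Apache Curator for the ZooKeeper lock and a znode holding the predecessor name. Only the config keys come from the slide; the class and method names are ours:

```scala
// Hypothetical sketch of the Job Coordinator flow: take the distributed lock,
// check the predecessor (compare & swap), register our name, run the batch.
import java.util.concurrent.TimeUnit
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.locks.InterProcessMutex
import org.apache.curator.retry.ExponentialBackoffRetry

class JobCoordinator(zkConnect: String,
                     lockName: String,            // job.coordinator.lock.name
                     expectedPredecessor: String, // job.coordinator.predecessor
                     myName: String) {            // job.coordinator.name
  private val client =
    CuratorFrameworkFactory.newClient(zkConnect, new ExponentialBackoffRetry(1000, 3))
  client.start()

  private val lock = new InterProcessMutex(client, s"/locks/$lockName")
  private val predecessorPath = s"/coordinator/$lockName/predecessor"

  /** Runs `microBatch` only if the lock is acquired and the predecessor
    * matches; returns false when this turn is given up, so the next pull
    * restarts from the last checkpoint. */
  def runOrdered(microBatch: () => Unit): Boolean = {
    if (!lock.acquire(30, TimeUnit.SECONDS)) return false        // step 3
    try {
      val predecessor = new String(client.getData.forPath(predecessorPath))
      if (predecessor != expectedPredecessor) return false       // step 4.1
      client.setData().forPath(predecessorPath, myName.getBytes) // step 4.2
      microBatch()                                               // step 5
      true                                                       // step 6: caller checkpoints
    } finally {
      lock.release()
    }
  }
}
```

The compare & swap on the predecessor znode is what enforces the Ingestion -> Mutation -> Deletion order: each job only runs after the job it expects has registered its name.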
21. Engagement ID Mutation Performance Benchmarking Result
● 28 million update/delete requests processed within 8 minutes
● Cluster: 32 × i3.8xlarge instances
● spark.databricks.delta.optimize.maxFileSize = 128 MB
22. Resources
● Engagement Activity Delta Lake
○ Blog, Video
● Boost Delta Lake Performance with Data Skipping and Z-Order
○ Blog, Video
● Global Synchronization and Ordering in Delta Lake
○ Blog, Video