At Salesforce, our customers use High Velocity Sales to intelligently convert leads and create new opportunities. To support this, we built the Engagement Activity Platform to automatically capture and store user engagement activities using Delta Lake. It is a key component behind Einstein Analytics, for creating powerful reports and dashboards, and Sales Cloud Einstein, for training machine learning models.
Converting leads and creating new opportunities requires our engagement activity delta lake to handle data mutations at scale. In this presentation, we will share the challenges and lessons learned from building a high-performance mutable data lake on Delta Lake, including:
Independent Stream Process to Support Engagement Data Lifecycle
Downstream Incremental Read
High Throughput Transactions in Engagement ID Mutation
Detect Cascading ID Mutation with Graph
Data Skipping and Z-Order with I/O Pruning
High Data Consistency and Integrity
Exactly-Once Writes Across Tables
Global Synchronization and Ordering
2. Agenda
● Engagement Delta Lake
● Pipeline Requirements
● Pipeline Design
● Performance Benchmarking Results
● Q & A
3. What’s Engagement Activity Delta Lake
● At Salesforce, our customers use High Velocity Sales (HVS) to intelligently convert leads and create new opportunities
● We built the Engagement Activity Platform (EAP) to capture and store user engagement activities
● The engagement activity delta lake is the key component of EAP
● This volume of data can only scale with the engagement activity delta lake, built on top of Delta Lake
4. Key Use Cases of Engagement Delta Lake
● Use engagement metrics/rates to identify which cadences and templates are more effective
● Use engagement signals such as open/reply rates to identify which customers are more engaged
● Leverage the engagement dashboard to drive intelligence into sales productivity
5. Delta Lake Requirements
● Independent Stream Process to Support Engagement Data Lifecycle
● Downstream Batch/Incremental Read
● High Throughput Transactions in Engagement ID Mutation
● High Data Consistency and Integrity
7. Downstream Batch/Incremental Read
● Created a separate table, the Notification Table, partitioned by organization ID and ingestion timestamp
● Downstream consumers can use streaming mode
○ Pull from the Notification Table to get delta-change metadata (table name / orgId / timestamp)
○ Use that metadata to pull engagement data from the Data Table
● Consumers can also pull directly from the Data Table in batch mode (see the sketch below)
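A minimal sketch of the streaming consumer side of this pattern. The table paths and column names (orgId, ingestTs) are assumptions for illustration; the deck does not show the actual schema:

```scala
// Hypothetical sketch: stream delta-change metadata from the Notification
// Table, then pull the matching engagement rows from the Data Table.
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("IncrementalRead").getOrCreate()

val notifications = spark.readStream
  .format("delta")
  .load("/delta/notification_table")   // rows of (tableName, orgId, ingestTs)

notifications.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // For every notified (orgId, ingestTs) pair, read the actual data.
    batch.select("orgId", "ingestTs").distinct().collect().foreach { row =>
      val engagements = spark.read.format("delta")
        .load("/delta/data_table")
        .where(s"orgId = '${row.getString(0)}' AND ingestTs = '${row.getString(1)}'")
      // ... hand the engagement rows to the downstream consumer
    }
  }
  .start()
```

The metadata-first hop keeps downstream readers from scanning the large Data Table for changes; they only read the partitions they were notified about.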
8. Downstream Batch/Incremental Read
● We extended this design pattern to the mutation/TTL/GDPR jobs
● We keep insert/update/delete counters per batch for auditing (see the sketch below)
(Diagram: Notification Table)
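One way to source those per-batch counters, sketched against Delta Lake's commit history; reading operationMetrics this way is our assumption about how the audit counters could be gathered, not necessarily how EAP does it:

```scala
// Hypothetical sketch: pull insert/update/delete counters for the latest
// commit from Delta Lake's operationMetrics, for writing to an audit table.
import io.delta.tables.DeltaTable

val lastCommit = DeltaTable.forPath(spark, "/delta/engagement_data")
  .history(1)   // most recent commit only
  .selectExpr(
    "version",
    "operationMetrics['numTargetRowsInserted'] as inserted",
    "operationMetrics['numTargetRowsUpdated']  as updated",
    "operationMetrics['numTargetRowsDeleted']  as deleted")

lastCommit.show()
```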
9. High Throughput Transactions in Engagement ID Mutation
Engagement ID Mutation
● Supports: Convert, Merge, and Delete
● Convert: a lead L could become a contact C with a new ID, and all engagements that belong to L get a new Engagement ID
10. High Throughput Transactions in Engagement ID Mutation
Engagement ID Mutation
(Diagram: Mutation Request Table feeding the EngagementData Table; Id: string)
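The mutation itself maps naturally onto a Delta MERGE. A minimal sketch, assuming hypothetical paths and columns (oldId, newId, requestType); the real mutation job likely does more, such as cascading-ID handling:

```scala
// Hypothetical sketch: apply ID-mutation requests (convert/delete) to the
// EngagementData Table with a single Delta MERGE per micro-batch.
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col

val mutations = spark.read.format("delta").load("/delta/mutation_requests")

DeltaTable.forPath(spark, "/delta/engagement_data").as("e")
  .merge(
    mutations.as("m"),
    col("e.orgId") === col("m.orgId") && col("e.engagementId") === col("m.oldId"))
  .whenMatched(col("m.requestType") === "CONVERT")
  .updateExpr(Map("engagementId" -> "m.newId"))   // lead L's engagements move to contact C
  .whenMatched(col("m.requestType") === "DELETE")
  .delete()
  .execute()
```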
12. High Throughput Transactions in Engagement ID Mutation
Partitioned by OrgId and Z-Ordered by Engagement Date
● Keeps the data of the Engagement table evenly distributed across reasonably-sized files
● Data are written to a per-org partition directory and clustered by the Z-Order column
● Benefits (see the sketch below):
○ Manages granularity: small files per org/date are merged into bigger ones, which reduces the number of small files
○ Output file size can be tuned with spark.databricks.delta.optimize.maxFileSize
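A minimal sketch of this layout, assuming a hypothetical engagements DataFrame and table path (OPTIMIZE ... ZORDER BY is the Databricks Delta syntax):

```scala
// Hypothetical sketch: write per-org partitions, then compact and Z-Order
// by engagement date with a 128 MB target file size.
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 134217728L) // 128 MB

engagements.write
  .format("delta")
  .partitionBy("orgId")
  .mode("append")
  .save("/delta/engagement_data")

spark.sql("OPTIMIZE delta.`/delta/engagement_data` ZORDER BY (engagementDate)")
```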
13. High Throughput Transactions in Engagement ID Mutation
Query by I/O Pruning -- Data Skipping and Z-Order
● Data Skipping
○ Delta Lake automatically maintains the min and max values for up to 32 fields in a Delta table and stores them as part of the metadata
○ By leveraging these min-max ranges, Delta Lake can skip files that fall outside the range of the queried field values
● Z-Order
○ To make skipping effective, data can be clustered by the Z-Order columns so that min-max ranges are narrow and, ideally, non-overlapping (see the sketch below)
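For example, a pruned query might look like the following; the orgId value and column names are hypothetical:

```scala
// Hypothetical sketch: with the table Z-Ordered by engagementDate, files whose
// min/max engagementDate range falls entirely outside the predicate are never
// read, so the scan touches only a small fraction of the table.
val recentEngagements = spark.read.format("delta")
  .load("/delta/engagement_data")
  .where("orgId = '00D000000000001' AND engagementDate >= '2021-01-01'")
```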
15. High Data Consistency and Integrity
Exactly-Once Writes Across Tables
● A Checkpoint Store that stores the start offset, end offset, Kafka metadata, and last job state for a given checkpoint
● We created a batch metadata store that stores the job name, batch ID (the last succeeded batch ID provided by the Spark foreachBatch API), process name, and last-modified timestamp (see the sketch below)
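A minimal sketch of the idempotent-write check this metadata enables; the store interface and names (BatchMetadataStore, writeExactlyOnce) are assumptions for illustration:

```scala
// Hypothetical sketch: inside foreachBatch, each process writes a batch only
// if its ID is newer than the last one recorded in the batch metadata store,
// so replayed micro-batches are skipped instead of being written twice.
import org.apache.spark.sql.DataFrame

trait BatchMetadataStore {
  def lastBatchId(jobName: String, processName: String): Long
  def commit(jobName: String, batchId: Long, processName: String, ts: Long): Unit
}

def writeExactlyOnce(batch: DataFrame,
                     batchId: Long,
                     processName: String,
                     store: BatchMetadataStore): Unit = {
  if (batchId <= store.lastBatchId("Ingestion", processName)) {
    return   // replayed batch: this process already committed it
  }
  batch.write.format("delta").mode("append").save(s"/delta/$processName")
  store.commit("Ingestion", batchId, processName, System.currentTimeMillis())
}
```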
16. High Data Consistency and Integrity
Exactly-Once Writes Across Tables
Happy Path Flow
job_name    batch_id  process_name       last_modified
Ingestion   10        data_ingestion     1611683367
Ingestion   10        data_notification  1611673367
17. High Data Consistency and Integrity
Exactly-Once Writes Across Tables
Unhappy Path Flow
job_name    batch_id  process_name       last_modified
Ingestion   10        data_ingestion     1611683367
Ingestion   9         data_notification  1611673367
Here data_notification is one batch behind: when Spark replays batch 10, data_ingestion sees it has already committed that batch and skips the write, while data_notification applies it, restoring consistency.
18. High Data Consistency and Integrity
Global Synchronization and Ordering
● Avoid conflicting commits
● Ensure the engagement lifecycle order:
○ Ingestion -> Mutation -> Deletion
● Applied per micro-batch
19. High Data Consistency and Integrity
Global Synchronization and Ordering
● Global Synchronization
○ ZK Distributed Lock
● Ordering
○ Compare & Swap
20. High Data Consistency and Integrity
Global Synchronization and Ordering
1. The streaming job starts and the Job Coordinator is initialized with Zookeeper.
2. The streaming job pulls data from Kafka periodically and starts a micro-batch process when a message arrives in Kafka.
3. Within a micro-batch process, the Job Coordinator first tries to obtain a distributed lock with the resource name set in job.coordinator.lock.name. If it cannot obtain the lock within a given time, it gives up; the next pull will start from the last checkpoint.
4. Once it obtains the lock, it reads the Predecessor field and compares it with the expected one set in job.coordinator.predecessor. (4.1) If the predecessor is not the expected one, it gives up this turn and releases the lock; the next pull will start from the last checkpoint. (4.2) If the predecessor is the expected one, it registers its name set in job.coordinator.name.
5. The micro-batch process starts.
6. The checkpoint is saved.
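A minimal sketch of those steps, assuming Apache Curator for the ZooKeeper lock and a znode holding the predecessor name. Only the config keys come from the slide; the class and method names are ours:

```scala
// Hypothetical sketch of the Job Coordinator flow: take the distributed lock,
// check the predecessor (compare & swap), register our name, run the batch.
import java.util.concurrent.TimeUnit
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.locks.InterProcessMutex
import org.apache.curator.retry.ExponentialBackoffRetry

class JobCoordinator(zkConnect: String,
                     lockName: String,            // job.coordinator.lock.name
                     expectedPredecessor: String, // job.coordinator.predecessor
                     myName: String) {            // job.coordinator.name
  private val client =
    CuratorFrameworkFactory.newClient(zkConnect, new ExponentialBackoffRetry(1000, 3))
  client.start()

  private val lock = new InterProcessMutex(client, s"/locks/$lockName")
  private val predecessorPath = s"/coordinator/$lockName/predecessor"

  /** Runs `microBatch` only if the lock is acquired and the predecessor
    * matches; returns false when this turn is given up, so the next pull
    * restarts from the last checkpoint. */
  def runOrdered(microBatch: () => Unit): Boolean = {
    if (!lock.acquire(30, TimeUnit.SECONDS)) return false        // step 3
    try {
      val predecessor = new String(client.getData.forPath(predecessorPath))
      if (predecessor != expectedPredecessor) return false       // step 4.1
      client.setData().forPath(predecessorPath, myName.getBytes) // step 4.2
      microBatch()                                               // step 5
      true                                                       // step 6: caller checkpoints
    } finally {
      lock.release()
    }
  }
}
```

The compare & swap on the predecessor znode is what enforces the Ingestion -> Mutation -> Deletion order: each job only runs after the job it expects has registered its name.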
21. Engagement ID Mutation Performance Benchmarking Result
● 28 million update/delete requests processed within 8 minutes
● Cluster: 32 × i3.8xlarge instances
● spark.databricks.delta.optimize.maxFileSize = 128 MB
22. Resources
● Engagement Activity Delta Lake
○ Blog, Video
● Boost Delta Lake Performance with Data Skipping and Z-Order
○ Blog, Video
● Global Synchronization and Ordering in Delta Lake
○ Blog, Video