Mutable data @ scale
1. Mutable Data @ Scale
afinkelstein@salesforce.com
Alexey Finkelstein, Software Engineer
2. Datorama At-A-Glance
Private & Confidential
● Founded in 2012 by Ran Sarig, Efi Cohen & Katrin Ribant
● Acquired in October 2018
● 450+ employees & growing quickly
● 19 offices worldwide
● 2000+ brands
● 300+ agencies
● 50+ publishers
● 23 industry verticals
3. Broad blue-chip customer base
● +2000 Brands
● 300 Agencies
● +23 Verticals
Every agency holding group that has run an RFP for a global client reporting
solution in the last 3 years has selected Datorama as their platform of record.
4. Datorama
Enable cross-platform marketing intelligence
● Connect & Unify Marketing Data Sources: integrate, cleanse, and classify data into a unified view using AI
● Visualize AI-Powered Insights: surface insights to optimize channel and campaign performance in real time
● Report Across Channels and Campaigns: powerful one-click dashboards, custom visualizations, and shareable reports
● Collaborate and Act to Drive ROI: make every insight actionable with cross-platform alerts and activations
5. Spend Your Time Wisely
+80% On Insights
+80% On Preparations
Time to Insight
10. DatoLakes (Datorama Data Lakes)
Granular data support at reduced cost
● Your granular data together with your aggregated data in one view
● Aimed at raw data, including ETL, storage, SQL access and reporting
● Aimed at supporting data that is accessed less frequently and at low concurrency, at lower cost
● Raw data can later be aggregated and joined with the rest of the data model
11. DatoLakes (Datorama Data Lakes)
Challenges
● Managing a data lake is a big hassle (ETL, queries & other controls)
● Merging granular and aggregated sources is a must
● Datorama to provide “lake as a service”
12. Data is NOT immutable
● External vendors have reconciliation windows (up to 6 months)
● Our users want to update/delete specific rows or sets of rows
● Our users love to backdate
● Most (if not all) big data solutions are append-only, and updating the data is considered a heavy process
● Transactional updates are required
14. Requirements
● Separation of compute and storage - MUST
● MPP query engine - MUST
● ANSI SQL - MUST
● JDBC (for external clients) - MUST
● Transactional and not append only - MUST
● Cloud Vendor Agnostic - MUST
● Linear Scale - MUST
The solution we decided on was Presto and S3/Azure Storage
15. High Level Update Flow
1. Read the input file
2. Determine what data segments it operates on
3. Read the corresponding segments of the table from storage
4. Update the segments with input data
5. Store to a new location with the new version number
6. Add the updated partitions to Hive
7. Outdated partitions are cleaned in the background
[Figure: table segments A, B, C and their updated versions A*, B*, C*]
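The seven steps above can be sketched as a copy-on-write update. This is only an illustration under stated assumptions, not the actual Datorama implementation: `storage` and `active_versions` are hypothetical in-memory stand-ins for S3/Azure Storage and the Hive partition catalog, and the names `segment_path`, `apply_update`, `clean_outdated`, `day` and `id` are invented for the example.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for object storage and the partition catalog.
storage = {}          # path -> list of row dicts
active_versions = {}  # (table, segment key) -> active version number


def segment_path(table, key, version):
    """Build a versioned location such as 'events/20190101_002'."""
    return f"{table}/{key}_{version:03d}"


def apply_update(table, input_rows, key_col="day", id_col="id"):
    """Rewrite every segment touched by input_rows; return the outdated paths."""
    # Steps 1-2: read the input and determine which segments it operates on.
    rows_by_key = defaultdict(list)
    for row in input_rows:
        rows_by_key[row[key_col]].append(row)

    new_versions = {}
    for key, rows in rows_by_key.items():
        old_version = active_versions.get((table, key), 0)
        # Step 3: read the current segment (empty if it does not exist yet).
        current = storage.get(segment_path(table, key, old_version), [])
        # Step 4: apply the input; rows with the same id replace the old ones.
        merged = {r[id_col]: r for r in current}
        merged.update({r[id_col]: r for r in rows})
        # Step 5: store the result at a new location with the next version.
        new_version = old_version + 1
        storage[segment_path(table, key, new_version)] = list(merged.values())
        new_versions[(table, key)] = new_version

    # Step 6: publish the new versions (the talk registers partitions in Hive).
    outdated = [
        segment_path(t, k, active_versions[(t, k)])
        for (t, k) in new_versions
        if (t, k) in active_versions
    ]
    active_versions.update(new_versions)
    return outdated


def clean_outdated(paths):
    """Step 7: remove superseded segments, e.g. from a background job."""
    for path in paths:
        storage.pop(path, None)
```

Because the old segment stays in place until `clean_outdated` runs, readers that started before the update can keep reading the previous version.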
16. Mutable Data - Swap Partition Requirements
● The ETL process should trigger a swap of one or more partitions at the end of the process
● The swap needs to be transactional (to avoid dirty reads)
● It needs to support a transactional change of multiple partitions in multiple tables at the same time
18. Solution #1 - First Attempt (Past)
1. Partition the table by a “key_version” field
a. key = actual column value
b. version = incremental number
c. e.g. 20190101_009
2. Create an external metastore that holds the active version of each partition (per table)
3. Commit the changes at the end of the ETL (cross-partition / cross-table) to support a transactional process
4. Expose the metastore table in Hive and include a subquery against it in every generated query
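As a rough illustration of steps 2 to 4, the sketch below uses Python's built-in `sqlite3` as a stand-in for both the external metastore and the query engine. The table and column names (`events`, `active_partitions`, `part_key`, `clicks`) and the helper functions are assumptions made for the example; the real system ran the subquery through Presto and Hive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (key_version TEXT, id INTEGER, clicks INTEGER);
CREATE TABLE active_partitions (
    table_name TEXT,
    part_key   TEXT,
    version    INTEGER,
    PRIMARY KEY (table_name, part_key)
);
""")


def commit_swap(conn, changes):
    """Activate new versions for many partitions/tables in one transaction.

    changes: iterable of (table_name, part_key, new_version) tuples.
    """
    with conn:  # commits on success, rolls back if any statement fails
        conn.executemany(
            "INSERT OR REPLACE INTO active_partitions VALUES (?, ?, ?)",
            changes,
        )


# Every generated query filters through the metastore table with a subquery.
TOTAL_CLICKS = """
SELECT SUM(clicks) FROM events
WHERE key_version IN (
    SELECT part_key || '_' || printf('%03d', version)
    FROM active_partitions
    WHERE table_name = 'events'
)
"""


def total_clicks(conn):
    return conn.execute(TOTAL_CLICKS).fetchone()[0]


# Two versions of the same partition coexist in storage.
with conn:
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [("20190101_001", 1, 5), ("20190101_002", 1, 7)],
    )

commit_swap(conn, [("events", "20190101", 1)])
before = total_clicks(conn)  # reads only version 001
commit_swap(conn, [("events", "20190101", 2)])
after = total_clicks(conn)   # reads only version 002
```

Readers see either the old set of active versions or the new one, never a mix, because all rows in `changes` are committed together or not at all.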
19. Solution #2 - Present
Problem with Solution #1: the inline subquery did not trigger partition pruning in Presto
1. Query the metastore while generating the query to get the list of partitions relevant to the query
2. Inline the filter in the query
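A minimal sketch of this two-step generation follows. It assumes the metastore can be read as a mapping from (table, key) to the active version; the function names, the `clicks` column and the date-range filter are hypothetical. The point is that the generated SQL contains only literal partition values, which a query planner can use for pruning.

```python
def active_partitions(metastore, table, key_from, key_to):
    """Step 1: look up the active key_version values for a key range."""
    return sorted(
        f"{key}_{version:03d}"
        for (tbl, key), version in metastore.items()
        if tbl == table and key_from <= key <= key_to
    )


def build_query(table, partitions, partition_col="key_version"):
    """Step 2: inline the partition list as literals in the generated SQL."""
    if not partitions:
        # No active partition matches: return a query that selects nothing.
        return f"SELECT SUM(clicks) AS clicks FROM {table} WHERE 1 = 0"
    literals = ", ".join("'" + p.replace("'", "''") + "'" for p in partitions)
    return (
        f"SELECT SUM(clicks) AS clicks FROM {table} "
        f"WHERE {partition_col} IN ({literals})"
    )


# Hypothetical metastore content: (table, key) -> active version.
metastore = {
    ("events", "20190101"): 9,
    ("events", "20190102"): 2,
    ("costs", "20190101"): 1,
}
parts = active_partitions(metastore, "events", "20190101", "20190131")
sql = build_query("events", parts)
```

The trade-off is the one the next slide names: the client cannot send plain SQL, since something has to consult the metastore before each query is built.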
20. Solution #3 - Future
Problem with Solution #2: the process requires 2 steps (query the metastore, then query Presto) and does not give clients direct SQL access
1. Update the Hive metastore database (MySQL) directly in a transactional manner, just as we updated our own metastore
2. Refresh the Presto/Hive caches so they pick up the metastore changes
21. Retrospective
● We were able to check off every item on our requirements list
○ Separation of compute and storage, MPP query engine, ANSI SQL, JDBC, Transactional, Cloud Vendor Agnostic & Linear Scale
● Data is stored in ORC files (due to the nature of our queries it was a big performance boost)
● Everybody is happy :)
Talk Track: (added by Idit)
Datorama was started 6 years ago, in 2012 (by Ran, Efi and Katrin), focusing on marketers and marketers only
Datorama is a SaaS (software as a service) platform that gives marketers everything they need to connect all of their data sources together into a single source of truth for analysis and insights.
Has 17 offices around the globe and over 380 employees, and keeps growing
Let’s talk about the challenge we solve. If you’re a modern marketer you’re engaging audiences with your brand across different regions, using different campaigns. By definition you’re using a lot of different technologies to do that. Bringing everything together – all the data that is extremely siloed across those different technologies – is a real operational problem.
Talk track for this Flash slide:
We had a lot of great customers even before joining Salesforce
We solve a painful problem that exists at scale
Call out IBM, Salesforce, EA, Ticketmaster etc
Agency groups have been quick to adopt the platform at scale – we are the preferred supplier for 4 of the top 5 groups…
This is not a coincidence – we are the best at solving this
70-30 split but evolving….
This is where the power of Datorama comes in. Datorama enables cross-platform marketing intelligence. What does that mean? It means one single place to:
•Connect and unify all of your marketing data and insights in one centralized place across Marketing Cloud technologies and any tools and technologies in the market – all clicks, no code.
•Visualize AI-powered insights across all your data so you can take action at scale to achieve your KPIs
•Easily report across all your channels and campaigns so every stakeholder in your organization has the right information at their fingertips
•And collaborate and take action to drive ROI to bring your organization together towards common goals
This helps marketers hold every investment and activity accountable!
Talk Track: (added by Idit)
Scalable - horizontal scale in every module / service
The biggest challenge with growing channels, customers and processing jobs is having a scalable solution
Multi-tenancy is a big challenge
S3 usage in TB
Total Rows - customer data
API streams - connections to external customers' accounts with updated data