Mutable Data @ Scale
afinkelstein@salesforce.com
Alexey Finkelstein, Software Engineer
Private & Confidential
Datorama At-A-Glance
● Founded in 2012 by Ran Sarig, Efi Cohen & Katrin Ribant
● 450+ employees & growing quickly
● Acquired in October 2018
● 19 offices worldwide
● 2000+ brands
● 300+ agencies
● 50+ publishers
● 23 industry verticals
Broad blue-chip customer base
● +2000 Brands
● 300 Agencies
● +23 Verticals
Every agency holding group that has run an RFP for a global client reporting
solution in the last 3 years has selected Datorama as their platform of record.
Datorama
Connect & Unify Marketing Data Sources
Integrate, cleanse, and classify data into a unified view using AI
Visualize AI-Powered Insights
Surface insights to optimize channel and campaign performance in real time
Report Across Channels and Campaigns
Powerful one-click dashboards, custom visualizations, and shareable reports
Collaborate and Act to Drive ROI
Make every insight actionable with cross-platform alerts and activations
Enable cross-platform marketing intelligence
Spend Your Time Wisely
+80% On Insights
+80% On Preparations
Time to Insight
Data To Insights In Minutes
Scale - in numbers
• 3.5M interactive analytical queries served per day
• 700,000 Data Streams processed daily
• 100,000 Reports generated daily
• 25,000 Workspaces
• 30,000 Users
• 1.5 PB of Data available for interactive querying
• 99.9% Uptime
• 4 different, fully redundant geographical deployments
• ~600 Servers
• >50 microservices
Salesforce Acquisition
August 2018 - $850M
Data Lake
DatoLakes (Datorama Data Lakes)
Granular data support at reduced cost
● Your granular data together with your aggregated data in one view
● Aimed at raw data, including ETL, storage, SQL access and reporting
● Aimed at data that is accessed less frequently and at low concurrency, at lower cost
● Raw data can later be aggregated and joined with the rest of the data model
DatoLakes (Datorama Data Lakes)
● Managing a data lake is a big hassle (ETL, queries & other controls)
● Merging granular and aggregated sources is a must
● Datorama provides “lake as a service”
Challenges
Data is NOT immutable
● External vendors have reconciliation windows (up to 6 months)
● Our users want to update/delete specific rows or sets of rows
● Our users love to backdate
● Most (if not all) big data solutions are append-only, and updating the data is considered a heavy process
● Transactional updates are required
The Solution
Requirements
● Separation of compute and storage - MUST
● MPP query engine - MUST
● ANSI SQL - MUST
● JDBC (for external clients) - MUST
● Transactional and not append only - MUST
● Cloud Vendor Agnostic - MUST
● Linear Scale - MUST
The solution we decided on was Presto and S3/Azure Storage
High Level Update Flow
1. Read the input file
2. Determine what data segments it operates on
3. Read the corresponding segments of the table from storage
4. Update the segments with input data
5. Store to a new location with the new version number
6. Add the updated partitions to Hive
7. Outdated partitions are cleaned in the background
[Diagram: table segments A, B, C and their updated versions A*, B*, C*]
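The update flow above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the production implementation: the `VersionedTable` class, the row shape, and the use of plain dictionaries in place of S3/Azure storage and Hive are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical row shape: (segment_key, row_id, value).
Row = Tuple[str, str, int]


class VersionedTable:
    """Toy stand-in for object storage plus a partition catalog."""

    def __init__(self) -> None:
        # storage[(segment_key, version)] -> {row_id: value}
        self.storage: Dict[Tuple[str, int], Dict[str, int]] = {}
        # active[segment_key] -> version currently visible to queries
        self.active: Dict[str, int] = {}

    def apply_update(self, input_rows: List[Row]) -> None:
        # Step 2: determine which segments the input touches.
        touched = sorted({key for key, _, _ in input_rows})
        staged: Dict[str, int] = {}
        for key in touched:
            old_version = self.active.get(key, 0)
            # Step 3: read the current copy of the segment (empty if new).
            rows = dict(self.storage.get((key, old_version), {}))
            # Step 4: merge the input rows into the copy.
            for seg, row_id, value in input_rows:
                if seg == key:
                    rows[row_id] = value
            # Step 5: write to a new location under the next version number.
            new_version = old_version + 1
            self.storage[(key, new_version)] = rows
            staged[key] = new_version
        # Step 6: publish all new versions together.
        self.active.update(staged)

    def cleanup(self) -> int:
        # Step 7: drop every stored version that is no longer active.
        stale = [k for k in self.storage if self.active.get(k[0]) != k[1]]
        for k in stale:
            del self.storage[k]
        return len(stale)

    def read(self, key: str) -> Dict[str, int]:
        return self.storage[(key, self.active[key])]
```

Old versions stay readable until `cleanup` runs, so a query that started before the update can still finish against the segments it began with.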
Mutable Data - Swap Partition Requirements
● The ETL process should trigger a partition swap (one or more partitions) at the end of the process
● The swap needs to be transactional (to avoid dirty reads)
● It needs to support a transactional change of multiple partitions in multiple tables at the same time
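One way to picture these requirements is a single database transaction that flips the active version of several partitions across several tables at once. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the metastore database; the `active_partitions` table, its columns, and the function names are assumptions for illustration only.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical swap record: (table_name, partition_key, new_version).
Swap = Tuple[str, str, int]


def open_metastore() -> sqlite3.Connection:
    """Create an in-memory stand-in for the external metastore."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE active_partitions ("
        " table_name TEXT NOT NULL,"
        " partition_key TEXT NOT NULL,"
        " version INTEGER NOT NULL CHECK (version > 0),"
        " PRIMARY KEY (table_name, partition_key))"
    )
    return conn


def swap_partitions(conn: sqlite3.Connection, swaps: Iterable[Swap]) -> None:
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so readers see either all of the new versions or none of them.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO active_partitions VALUES (?, ?, ?)",
            list(swaps),
        )


def active_versions(conn: sqlite3.Connection) -> List[Tuple[str, str, int]]:
    cursor = conn.execute(
        "SELECT table_name, partition_key, version "
        "FROM active_partitions ORDER BY table_name, partition_key"
    )
    return cursor.fetchall()
```

If any row in a batch is rejected (here, a version of 0 violates the CHECK constraint), the whole batch is rolled back and the previously active versions remain visible.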
Architecture
[Diagram components: S3 / Azure Blob Storage, Meta Store, ETL, Queue, Resource Manager, Query]
Solution #1 - First Attempt (Past)
1. Partition the table by “key_version” field
a. key = actual column value
b. version = incremental number
c. e.g. 20190101_009
2. Create an external metastore that holds the active versions of each partition (per table)
3. Commit the changes at the end of the ETL (cross-partition / cross-table) to support a transactional process
4. Connect the metastore table to Hive and include a subquery in every generated query.
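A rough sketch of the naming scheme and of the generated query shape follows. The three-digit zero padding is inferred from the example value 20190101_009; the `active_versions` table name, its `table_name` column, and the helper functions are hypothetical, and table and column names are assumed to come from trusted code rather than user input.

```python
from typing import Tuple


def partition_value(key: str, version: int) -> str:
    # Zero-padded to three digits to match the example 20190101_009.
    return f"{key}_{version:03d}"


def split_partition_value(value: str) -> Tuple[str, int]:
    # Split on the last underscore so keys may contain underscores too.
    key, _, version = value.rpartition("_")
    return key, int(version)


def query_with_subquery(table: str, select_list: str) -> str:
    # `active_versions` is a hypothetical metastore table exposed through Hive.
    return (
        f"SELECT {select_list} FROM {table} "
        f"WHERE key_version IN ("
        f"SELECT key_version FROM active_versions WHERE table_name = '{table}')"
    )
```

For example, `query_with_subquery("clicks", "SUM(cost)")` produces a query whose partition filter is resolved by the engine at run time rather than when the query text is generated.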
Solution #2 - Present
Problem with Solution #1: the inline subquery did not trigger partition pruning in Presto
1. Query the metastore while generating the query to get the list of relevant partitions
2. Inline the filter in the query
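The two steps can be sketched as a query builder that receives the active partition values (already fetched from the metastore) and writes them into the WHERE clause as literals, so the engine sees constant partition values. The function names and the `key_version` filter shape are assumptions for illustration.

```python
from typing import Iterable


def quote_literal(value: str) -> str:
    # Standard SQL escapes a single quote by doubling it.
    return "'" + value.replace("'", "''") + "'"


def query_with_inlined_partitions(
    table: str, select_list: str, active_values: Iterable[str]
) -> str:
    # Step 1 happened before this call: `active_values` came from the metastore.
    values = sorted(set(active_values))
    if not values:
        # No active partitions: return a query that matches nothing.
        return f"SELECT {select_list} FROM {table} WHERE 1 = 0"
    # Step 2: inline the partition values as literals in the filter.
    in_list = ", ".join(quote_literal(v) for v in values)
    return f"SELECT {select_list} FROM {table} WHERE key_version IN ({in_list})"
```

The trade-off is that the query text is only valid for the versions that were active when it was generated, which is why the metastore lookup has to happen on every query.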
Solution #3 - Future
Problem with Solution #2: the process requires two steps (query the metastore, then query Presto) and does not support direct SQL access for clients
1. Update the Hive metastore database (MySQL) directly in a transactional manner, just as we updated our own metastore.
2. Refresh the Presto/Hive caches so they pick up the metastore changes
Retrospective
● We’re able to “check” all the required items from our requirements
○ Separation of compute and storage, MPP query engine, ANSI SQL, JDBC, Transactional, Cloud Vendor Agnostic & Linearly Scaled
● Data is stored in ORC files (due to the nature of our queries it was a big performance boost)
● Everybody is happy :)
We’re Hiring!
Contact us at
http://datorama.com/join-us
https://engineering.datorama.com/
Mutable data @ scale

More Related Content

What's hot

Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Databricks
 
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
Databricks
 
Building a Machine Learning Recommendation Engine in SQL
Building a Machine Learning Recommendation Engine in SQLBuilding a Machine Learning Recommendation Engine in SQL
Building a Machine Learning Recommendation Engine in SQL
SingleStore
 
Personalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud StreamingPersonalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud Streaming
Databricks
 
Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...
Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...
Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...
Databricks
 
Real-Time Analytics with Spark and MemSQL
Real-Time Analytics with Spark and MemSQLReal-Time Analytics with Spark and MemSQL
Real-Time Analytics with Spark and MemSQL
SingleStore
 
Converging Database Transactions and Analytics
Converging Database Transactions and Analytics Converging Database Transactions and Analytics
Converging Database Transactions and Analytics
SingleStore
 
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Spark Summit
 
Presto Summit 2018 - 08 - FINRA
Presto Summit 2018  - 08 - FINRAPresto Summit 2018  - 08 - FINRA
Presto Summit 2018 - 08 - FINRA
kbajda
 
How Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and AnalyticsHow Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and Analytics
SingleStore
 
Whoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix
Whoops, The Numbers Are Wrong! Scaling Data Quality @ NetflixWhoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix
Whoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix
DataWorks Summit
 
ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1
Mark Kromer
 
Building the Foundation for a Latency-Free Life
Building the Foundation for a Latency-Free LifeBuilding the Foundation for a Latency-Free Life
Building the Foundation for a Latency-Free Life
SingleStore
 
Achieving Real-Time Analytics at Hermes | Zulf Qureshi, HVR and Dr. Stefan Ro...
Achieving Real-Time Analytics at Hermes | Zulf Qureshi, HVR and Dr. Stefan Ro...Achieving Real-Time Analytics at Hermes | Zulf Qureshi, HVR and Dr. Stefan Ro...
Achieving Real-Time Analytics at Hermes | Zulf Qureshi, HVR and Dr. Stefan Ro...
HostedbyConfluent
 
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Databricks
 
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
Databricks
 
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
HostedbyConfluent
 
Dealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data LakeDealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data Lake
Pat Patterson
 
How a Data Mesh is Driving our Platform | Trey Hicks, Gloo
How a Data Mesh is Driving our Platform | Trey Hicks, GlooHow a Data Mesh is Driving our Platform | Trey Hicks, Gloo
How a Data Mesh is Driving our Platform | Trey Hicks, Gloo
HostedbyConfluent
 
Building Pinterest Real-Time Ads Platform Using Kafka Streams
Building Pinterest Real-Time Ads Platform Using Kafka Streams Building Pinterest Real-Time Ads Platform Using Kafka Streams
Building Pinterest Real-Time Ads Platform Using Kafka Streams
confluent
 

What's hot (20)

Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
 
Building a Machine Learning Recommendation Engine in SQL
Building a Machine Learning Recommendation Engine in SQLBuilding a Machine Learning Recommendation Engine in SQL
Building a Machine Learning Recommendation Engine in SQL
 
Personalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud StreamingPersonalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud Streaming
 
Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...
Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...
Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...
 
Real-Time Analytics with Spark and MemSQL
Real-Time Analytics with Spark and MemSQLReal-Time Analytics with Spark and MemSQL
Real-Time Analytics with Spark and MemSQL
 
Converging Database Transactions and Analytics
Converging Database Transactions and Analytics Converging Database Transactions and Analytics
Converging Database Transactions and Analytics
 
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
 
Presto Summit 2018 - 08 - FINRA
Presto Summit 2018  - 08 - FINRAPresto Summit 2018  - 08 - FINRA
Presto Summit 2018 - 08 - FINRA
 
How Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and AnalyticsHow Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and Analytics
 
Whoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix
Whoops, The Numbers Are Wrong! Scaling Data Quality @ NetflixWhoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix
Whoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix
 
ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1
 
Building the Foundation for a Latency-Free Life
Building the Foundation for a Latency-Free LifeBuilding the Foundation for a Latency-Free Life
Building the Foundation for a Latency-Free Life
 
Achieving Real-Time Analytics at Hermes | Zulf Qureshi, HVR and Dr. Stefan Ro...
Achieving Real-Time Analytics at Hermes | Zulf Qureshi, HVR and Dr. Stefan Ro...Achieving Real-Time Analytics at Hermes | Zulf Qureshi, HVR and Dr. Stefan Ro...
Achieving Real-Time Analytics at Hermes | Zulf Qureshi, HVR and Dr. Stefan Ro...
 
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
 
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
 
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
 
Dealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data LakeDealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data Lake
 
How a Data Mesh is Driving our Platform | Trey Hicks, Gloo
How a Data Mesh is Driving our Platform | Trey Hicks, GlooHow a Data Mesh is Driving our Platform | Trey Hicks, Gloo
How a Data Mesh is Driving our Platform | Trey Hicks, Gloo
 
Building Pinterest Real-Time Ads Platform Using Kafka Streams
Building Pinterest Real-Time Ads Platform Using Kafka Streams Building Pinterest Real-Time Ads Platform Using Kafka Streams
Building Pinterest Real-Time Ads Platform Using Kafka Streams
 

Similar to Mutable data @ scale

World2016_T5_S7_TeradataFunctionalOverview
World2016_T5_S7_TeradataFunctionalOverviewWorld2016_T5_S7_TeradataFunctionalOverview
World2016_T5_S7_TeradataFunctionalOverview
Farah Omer
 
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETLAjith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith Kumar Pampatti
 
Informaticapowercenter pennon soft
Informaticapowercenter pennon softInformaticapowercenter pennon soft
Informaticapowercenter pennon soft
PennonSoft
 
Informatica PowerCenter
Informatica PowerCenterInformatica PowerCenter
Informatica PowerCenter
Ramy Mahrous
 
Data exposure in Azure - production use-case
Data exposure in Azure - production use-caseData exposure in Azure - production use-case
Data exposure in Azure - production use-case
Alexander Laysha
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
markgrover
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
Karthik Murugesan
 
Mohd_Shaukath_5_Exp_Datastage
Mohd_Shaukath_5_Exp_DatastageMohd_Shaukath_5_Exp_Datastage
Mohd_Shaukath_5_Exp_Datastage
Mohammed Shaukath
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
Databricks
 
Resume
ResumeResume
Resume
rajeswari p
 
Proposed Solution Design (PD) - Reporting & Analytics Solution_v1.0.pptx
Proposed Solution Design (PD) -  Reporting & Analytics Solution_v1.0.pptxProposed Solution Design (PD) -  Reporting & Analytics Solution_v1.0.pptx
Proposed Solution Design (PD) - Reporting & Analytics Solution_v1.0.pptx
AtanuMandal39
 
Shaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M ResumeShaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M
 
introduction to datawarehouse
introduction to datawarehouseintroduction to datawarehouse
introduction to datawarehouse
kiran14360
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
Amazon Web Services
 
Analyti x mapping manager product overview presentation
Analyti x mapping manager product overview presentationAnalyti x mapping manager product overview presentation
Analyti x mapping manager product overview presentation
AnalytixDataServices
 
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Databricks
 
How to transport PeopleSoft Crystal to BIP via automation_M.... (1).pptx
How to transport PeopleSoft Crystal to BIP via automation_M.... (1).pptxHow to transport PeopleSoft Crystal to BIP via automation_M.... (1).pptx
How to transport PeopleSoft Crystal to BIP via automation_M.... (1).pptx
ssuser225811
 
Speeding Time to Insight with a Modern ELT Approach
Speeding Time to Insight with a Modern ELT ApproachSpeeding Time to Insight with a Modern ELT Approach
Speeding Time to Insight with a Modern ELT Approach
Databricks
 
HBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseHBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBase
Cloudera, Inc.
 
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
Cloudera, Inc.
 

Similar to Mutable data @ scale (20)

World2016_T5_S7_TeradataFunctionalOverview
World2016_T5_S7_TeradataFunctionalOverviewWorld2016_T5_S7_TeradataFunctionalOverview
World2016_T5_S7_TeradataFunctionalOverview
 
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETLAjith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETL
 
Informaticapowercenter pennon soft
Informaticapowercenter pennon softInformaticapowercenter pennon soft
Informaticapowercenter pennon soft
 
Informatica PowerCenter
Informatica PowerCenterInformatica PowerCenter
Informatica PowerCenter
 
Data exposure in Azure - production use-case
Data exposure in Azure - production use-caseData exposure in Azure - production use-case
Data exposure in Azure - production use-case
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
Mohd_Shaukath_5_Exp_Datastage
Mohd_Shaukath_5_Exp_DatastageMohd_Shaukath_5_Exp_Datastage
Mohd_Shaukath_5_Exp_Datastage
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Resume
ResumeResume
Resume
 
Proposed Solution Design (PD) - Reporting & Analytics Solution_v1.0.pptx
Proposed Solution Design (PD) -  Reporting & Analytics Solution_v1.0.pptxProposed Solution Design (PD) -  Reporting & Analytics Solution_v1.0.pptx
Proposed Solution Design (PD) - Reporting & Analytics Solution_v1.0.pptx
 
Shaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M ResumeShaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M Resume
 
introduction to datawarehouse
introduction to datawarehouseintroduction to datawarehouse
introduction to datawarehouse
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
Analyti x mapping manager product overview presentation
Analyti x mapping manager product overview presentationAnalyti x mapping manager product overview presentation
Analyti x mapping manager product overview presentation
 
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
 
How to transport PeopleSoft Crystal to BIP via automation_M.... (1).pptx
How to transport PeopleSoft Crystal to BIP via automation_M.... (1).pptxHow to transport PeopleSoft Crystal to BIP via automation_M.... (1).pptx
How to transport PeopleSoft Crystal to BIP via automation_M.... (1).pptx
 
Speeding Time to Insight with a Modern ELT Approach
Speeding Time to Insight with a Modern ELT ApproachSpeeding Time to Insight with a Modern ELT Approach
Speeding Time to Insight with a Modern ELT Approach
 
HBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseHBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBase
 
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
 

Recently uploaded

"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
Fwdays
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
Vadym Kazulkin
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
Fwdays
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
christinelarrosa
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
Mydbops
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 

Recently uploaded (20)

"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 

Mutable data @ scale

  • 1. Mutable Data @ Scale. Alexey Finkelstein, Software Engineer (afinkelstein@salesforce.com)
  • 2. Datorama At-A-Glance (Private & Confidential)
      ● Founded in 2012 by Ran Sarig, Efi Cohen & Katrin Ribant
      ● Acquired in October 2018
      ● 450+ employees & growing quickly
      ● 19 offices worldwide
      ● 2000+ brands
      ● 300+ agencies
      ● 50+ publishers
      ● 23 industry verticals
  • 3. +20 Verticals. Broad blue-chip customer base: +23 Verticals, 300 Agencies, +2000 Brands. Every agency holding group that has run an RFP for a global client reporting solution in the last 3 years has selected Datorama as their platform of record.
  • 4. Datorama: Enable cross-platform marketing intelligence
      ● Connect & Unify Marketing Data Sources: Integrate, cleanse, and classify data into a unified view using AI
      ● Visualize AI-Powered Insights: Surface insights to optimize channel and campaign performance in real-time
      ● Report Across Channels and Campaigns: Powerful one-click dashboards, custom visualizations, and shareable reports
      ● Collaborate and Act to Drive ROI: Make every insight actionable with cross-platform alerts and activations
  • 5. Spend Your Time Wisely +80% On Insights +80% On Preparations Time to Insight
  • 6. Data To Insights In Minutes
  • 7. Scale - in numbers
      ● 3.5M interactive analytical queries served per day
      ● 700,000 Data Streams processed daily
      ● 100,000 Reports generated daily
      ● 25,000 Workspaces
      ● 30,000 Users
      ● 1.5 PB of data available for interactive querying
      ● 99.9% Uptime
      ● 4 different, fully redundant geographical deployments
      ● ~600 Servers
      ● >50 microservices
  • 10. DatoLakes (Datorama Data Lakes): Granular data support at reduced cost
      ● Your granular data together with your aggregated data in one view
      ● Aimed at raw data, including ETL, storage, SQL access and reporting
      ● Aimed at supporting data that is accessed less frequently and with low concurrency, at lower cost
      ● Raw data can later be aggregated and joined with the rest of the data model
  • 11. DatoLakes (Datorama Data Lakes): Challenges
      ● Managing a data lake is a big hassle (ETL, queries & other controls)
      ● Merging granular and aggregate sources is a must
      ● Datorama to provide “lake as a service”
  • 12. Data is NOT immutable
      ● External vendors have reconciliation windows (up to 6 months)
      ● Our users want to update/delete specific rows or sets of rows
      ● Our users love to backdate
      ● Most (if not all) big data solutions are append-only, and updating the data is considered a heavy process
      ● Transactional updates are required
  • 14. Requirements
      ● Separation of compute and storage - MUST
      ● MPP query engine - MUST
      ● ANSI SQL - MUST
      ● JDBC (for external clients) - MUST
      ● Transactional and not append-only - MUST
      ● Cloud Vendor Agnostic - MUST
      ● Linear Scale - MUST
      The solution we decided on was Presto and S3/Azure Storage
  • 15. High Level Update Flow
      1. Read the input file
      2. Determine which data segments it operates on
      3. Read the corresponding segments of the table from storage
      4. Update the segments with the input data
      5. Store them to a new location with a new version number
      6. Add the updated partitions to Hive
      7. Outdated partitions are cleaned in the background
      [Diagram labels: A B C A B A* B* C*]
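The update flow on slide 15 can be sketched as a copy-on-write routine. This is only an illustration under stated assumptions: plain dicts stand in for object storage and for the partitions registered in Hive, and the names `Table`, `apply_update`, `cleanup` and the row layout (`partition`, `id`, `value`) are hypothetical, not taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # location -> rows; an in-memory stand-in for files in S3/Azure Storage (assumption)
    storage: dict = field(default_factory=dict)
    # partition key -> (version, location); stand-in for partitions registered in Hive (assumption)
    active: dict = field(default_factory=dict)


def apply_update(table, input_rows):
    """Apply input rows copy-on-write and return the locations that became outdated."""
    # Steps 1-2: read the input and determine which partitions it touches.
    touched = {row["partition"] for row in input_rows}
    new_active = {}
    for key in touched:
        version, loc = table.active.get(key, (0, None))
        # Step 3: read the current contents of the partition (empty if it is new).
        rows = {row["id"]: row for row in table.storage.get(loc, [])}
        # Step 4: merge the input rows into the partition by row id.
        for row in input_rows:
            if row["partition"] == key:
                rows[row["id"]] = row
        # Step 5: write the result to a new location with the next version number.
        new_loc = f"{key}_{version + 1:03d}"
        table.storage[new_loc] = list(rows.values())
        new_active[key] = (version + 1, new_loc)
    # Step 6: switch the active partitions to the new versions.
    outdated = [table.active[key][1] for key in touched if key in table.active]
    table.active.update(new_active)
    return outdated


def cleanup(table, outdated):
    # Step 7: remove outdated locations; in the talk this happens in the background.
    for loc in outdated:
        table.storage.pop(loc, None)
```

Because old versions are never modified in place, a reader that already resolved the previous location keeps seeing consistent data until the cleanup runs.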
  • 16. Mutable Data - Swap Partition Requirements
      ● The ETL process should trigger a swap of the partition(s) at the end of the process
      ● The swap needs to be transactional (to avoid dirty reads)
      ● It needs to support a transactional change of multiple partitions in multiple tables at the same time
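A minimal sketch of what a transactional swap across several tables could look like, assuming the active versions are kept in a relational store. SQLite is used here only as a self-contained stand-in; the table `active_versions`, its columns, and the function names are hypothetical.

```python
import sqlite3


def open_metastore():
    """Create an in-memory metastore holding the active version of each partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE active_versions ("
        " table_name TEXT NOT NULL,"
        " partition_key TEXT NOT NULL,"
        " version INTEGER NOT NULL,"
        " PRIMARY KEY (table_name, partition_key))"
    )
    return conn


def swap_partitions(conn, swaps):
    """Activate new versions for many (table, partition) pairs in one transaction.

    swaps: iterable of (table_name, partition_key, new_version) tuples.
    """
    # The connection context manager commits on success and rolls back on any
    # exception, so readers see either all of the new versions or none of them.
    with conn:
        for table_name, partition_key, version in swaps:
            if version < 1:
                raise ValueError(f"invalid version for {table_name}/{partition_key}")
            conn.execute(
                "INSERT OR REPLACE INTO active_versions VALUES (?, ?, ?)",
                (table_name, partition_key, version),
            )


def active_version(conn, table_name, partition_key):
    row = conn.execute(
        "SELECT version FROM active_versions"
        " WHERE table_name = ? AND partition_key = ?",
        (table_name, partition_key),
    ).fetchone()
    return row[0] if row else None
```

If any swap in the batch fails, the whole batch is rolled back, which is the property the slide asks for when it rules out dirty reads.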
  • 18. Solution #1 - First Attempt (Past)
      1. Partition the table by a “key_version” field
          a. key = actual column value
          b. version = incremental number
          c. e.g. 20190101_009
      2. Create an external metastore that holds the active versions of each partition (per table)
      3. Commit the changes at the end of the ETL (cross partition / cross table) to support a transactional process
      4. Connect the metastore table to Hive and include a subquery in every generated query
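As an illustration of the “key_version” scheme, the helpers below build and parse values such as 20190101_009 and generate a query that filters through a metastore subquery. The three-digit padding is inferred from the slide's example; the table and column names (`active_versions`, `table_name`) are assumptions made for this sketch.

```python
def partition_value(key, version):
    """Combine a partition key and a version number, e.g. ('20190101', 9) -> '20190101_009'."""
    return f"{key}_{version:03d}"


def parse_partition_value(value):
    """Split a key_version value back into its key and integer version."""
    key, _, version = value.rpartition("_")
    return key, int(version)


def query_with_subquery(table, metastore_table="active_versions"):
    """Build a query that keeps only active partitions via a subquery.

    The metastore table layout is hypothetical; it is assumed to expose one
    key_version value per active partition of each table.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE key_version IN ("
        f"SELECT key_version FROM {metastore_table} "
        f"WHERE table_name = '{table}')"
    )
```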
  • 19. Solution #2 - Present
      Problem with Solution #1: the inline SQL subquery did not trigger partition pruning in Presto
      1. Query the metastore while generating the query to get the list of partitions relevant to the query
      2. Inline the filter in the query
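The two steps of Solution #2 can be sketched as follows: the active versions are looked up first, and the resulting partition values are written into the query as literals so the engine sees a constant filter on the partition column. The mapping `active_versions` and the function names are hypothetical, and keys are assumed to be date strings such as 20190101 that compare correctly as text.

```python
def relevant_partitions(active_versions, first_key, last_key):
    """Step 1: pick the active key_version values inside the requested key range.

    active_versions maps a partition key to its active version number and
    stands in for the result of a metastore lookup (assumption).
    """
    return [
        f"{key}_{version:03d}"
        for key, version in sorted(active_versions.items())
        if first_key <= key <= last_key
    ]


def build_query(table, active_versions, first_key, last_key):
    """Step 2: inline the partition list as literals in the generated query."""
    partitions = relevant_partitions(active_versions, first_key, last_key)
    if not partitions:
        # No active partition in range: return a query that matches nothing.
        return f"SELECT * FROM {table} WHERE 1 = 0"
    literals = ", ".join(f"'{p}'" for p in partitions)
    return f"SELECT * FROM {table} WHERE key_version IN ({literals})"
```

For example, with active versions `{"20190101": 9, "20190102": 3}` and the range `20190101` to `20190131`, the generated filter lists `'20190101_009'` and `'20190102_003'` explicitly.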
  • 20. Solution #3 - Future
      Problem with Solution #2: the process requires 2 steps (query the metastore, then query Presto) and does not give clients direct SQL access
      1. Update the Hive database (MySQL) directly, in a transactional manner, just as we updated our own metastore
      2. Refresh the Presto/Hive caches to pick up the metastore changes
  • 21. Retrospective
      ● We’re able to “check” all the required items from our requirements
          ○ Separation of compute and storage, MPP query engine, ANSI SQL, JDBC, Transactional, Cloud Vendor Agnostic & Linearly Scaled
      ● Data is stored in ORC files (due to the nature of our queries it was a big performance boost)
      ● Everybody is happy :)
  • 22. We’re Hiring! Contact us at http://datorama.com/join-us https://engineering.datorama.com/

Editor's Notes

  1. Talk Track (added by Idit): Datorama was started 6 years ago, in 2012, by Ran, Efi and Katrin, focusing on marketers and marketers only. Datorama is a SaaS (software as a service) platform that gives marketers everything they need to connect all of their data sources together into a single source of truth for analysis and insights. It has 17 offices around the globe and over 380 employees, and keeps growing. Let’s talk about the challenge we solve. If you’re a modern marketer, you’re engaging audiences with your brand across different regions, using different campaigns. By definition you’re using a lot of different technologies to do that. Bringing everything together, all the data that is extremely siloed across those different technologies, is a real operational problem.
  2. Talk track for this Flash slide: We had a lot of great customers even before joining Salesforce. We solve a painful problem that exists at scale. Call out IBM, Salesforce, EA, Ticketmaster, etc. Agency groups have been quick to adopt the platform at scale: we are the preferred supplier for 4 of the top 5 groups… This is not a coincidence: we are the best at solving this. 70-30 split but evolving….
  3. This is where the power of Datorama comes in. Datorama enables cross-platform marketing intelligence. What does that mean? It means one single place to:
      • Connect and unify all of your marketing data and insights in one centralized place across Marketing Cloud technologies and any tools and technologies in the market: all clicks, no code.
      • Visualize AI-powered insights across all your data so you can take action at scale to achieve your KPIs.
      • Easily report across all your channels and campaigns so every stakeholder in your organization has the right information at their fingertips.
      • Collaborate and take action to drive ROI, bringing your organization together towards common goals.
     This helps marketers hold every investment and activity accountable!
  4. Talk Track (added by Idit): Scalable: horizontal scale in every module / service. The biggest challenge with all the growing channels, customers and processing jobs is to have a scalable solution. Multi-tenancy is a big challenge. S3 TB usage. Total Row - is customer Data. API streams - connection to external customers’ accounts with updated data.