[LinkedIn Live]: Notion's Journey through different stages of scale
Dec 13th, 2023
Thomas Chow, Nathan Louie
Notion's Data
Everything in Notion is a “Block”
Data Scale
Doubling Rate: 6 months - 1 year
2021 start: 20B block rows
2022 end: 70B block rows
2023 end: >200B block rows
File size at rest: 10TB -> ~50TB (compressed)
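A quick check of the doubling rate against the row counts above. This is a minimal sketch that assumes smooth exponential growth between the stated data points and treats ">200B" as exactly 200B, so the second result is an upper bound on the doubling time.

```python
import math

def doubling_time_years(start_rows: float, end_rows: float, elapsed_years: float) -> float:
    """Doubling time implied by exponential growth between two observations."""
    return elapsed_years * math.log(2) / math.log(end_rows / start_rows)

# Row counts taken from the slide; elapsed times are read off the labels
# ("2021 start" to "2022 end" is two years, "2022 end" to "2023 end" is one).
early = doubling_time_years(20e9, 70e9, 2.0)
late = doubling_time_years(70e9, 200e9, 1.0)

print(f"2021-2022: doubling every {early * 12:.1f} months")
print(f"2023:      doubling every {late * 12:.1f} months")
```

Under these assumptions the table doubled roughly every 13 months across 2021-2022 and roughly every 8 months in 2023, broadly in line with the stated 6 months - 1 year range.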
Timeline
Single Postgres OLTP - before 2020
Postgres Sharding - H2 2020
https://www.notion.so/blog/sharding-postgres-at-notion
15 logical shards per database instance
32 database instances (480 logical shards in total)
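A minimal routing sketch for this layout. It assumes the 15 logical shards are per instance (15 x 32 = 480 in total) and that rows are keyed by a workspace UUID. The modulo mapping and the `route` helper are hypothetical illustrations, not Notion's actual routing code.

```python
import uuid

INSTANCES = 32
SHARDS_PER_INSTANCE = 15
LOGICAL_SHARDS = INSTANCES * SHARDS_PER_INSTANCE  # 480

def route(workspace_id: str) -> tuple[int, int]:
    """Return (logical_shard, database_instance) for a workspace UUID."""
    # Hypothetical mapping: interpret the UUID as an integer and take it modulo
    # the number of logical shards, then pack 15 consecutive shards per instance.
    logical_shard = uuid.UUID(workspace_id).int % LOGICAL_SHARDS
    instance = logical_shard // SHARDS_PER_INSTANCE
    return logical_shard, instance

shard, instance = route("12345678-1234-5678-1234-567812345678")
print(f"logical shard {shard} lives on instance {instance}")
```

Keeping the logical shard count fixed and mapping shards to instances separately means instances can be added later by moving whole logical shards, without re-hashing individual rows.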
Data Warehouse Architecture - 2021
Data Warehouse Architecture - Challenges
1% upserts day over day
>90% of upserts are updates
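To put the upsert ratio in absolute terms, here is a small worked calculation. It assumes the 1% applies to the total row count and uses the 20B rows from the start of 2021 as the base. Both are assumptions for illustration; the slide does not state which base the percentage refers to.

```python
def daily_upsert_volume(total_rows: int) -> dict:
    """Split one day's upserts into updates and inserts using the slide's ratios."""
    upserts = total_rows // 100            # "1% upserts day over day"
    updates_at_least = upserts * 9 // 10   # ">90% of upserts are updates" (lower bound)
    return {
        "upserts": upserts,
        "updates_at_least": updates_at_least,
        "inserts_at_most": upserts - updates_at_least,
        "unchanged_rows": total_rows - upserts,
    }

stats = daily_upsert_volume(20_000_000_000)  # assumed base: 20B rows
for name, value in stats.items():
    print(f"{name}: {value:,}")
```

Under these assumptions about 200M rows change per day while 19.8B stay untouched, so a pipeline that rewrites full snapshots moves roughly 100x more data than actually changed.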
Why Hudi?
● Incremental processing
○ Random upserts
● Out-of-box CDC (Debezium)
● Good with indexing (bloom filter)
● Directory partitioning
● Open Source velocity and relationships
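The points above map onto a handful of Hudi write options. The sketch below only builds the options dictionary, so it runs without Spark. The option keys are standard Hudi datasource settings, while the table name and field names (`block`, `id`, `last_updated_at`, `shard_id`) are placeholders chosen for illustration, not the configuration used in the talk.

```python
def hudi_upsert_options(table_name: str, record_key: str,
                        precombine_field: str, partition_field: str) -> dict:
    """Build Hudi datasource options for an upsert write with a bloom index."""
    return {
        "hoodie.table.name": table_name,
        # Upsert: update rows whose key already exists, insert the rest.
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        # When a batch holds several versions of a key, keep the largest value.
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Directory partitioning: one directory per partition value.
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Bloom filters narrow down which files may contain an incoming key.
        "hoodie.index.type": "BLOOM",
    }

options = hudi_upsert_options("block", "id", "last_updated_at", "shard_id")

# With a SparkSession and a DataFrame `changes` of CDC rows, a write would
# typically look like this (not executed here):
# changes.write.format("hudi").options(**options).mode("append").save(base_path)
```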
Data Lake
Data Lake Architecture - 2022
Hudi Incremental Processing
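The idea behind incremental processing can be shown with a toy model: each upsert batch is a commit, and a downstream job reads only records written after its last checkpoint. `ToyTable` is a hypothetical in-memory stand-in for illustration and does not reflect how Hudi stores data or its timeline.

```python
class ToyTable:
    """Keeps the latest version of each record plus the commit that wrote it."""

    def __init__(self):
        self.rows = {}   # record key -> (commit_time, payload)
        self.commit = 0

    def upsert(self, batch: dict) -> int:
        """Apply a batch of key -> payload as one commit; return its commit time."""
        self.commit += 1
        for key, payload in batch.items():
            self.rows[key] = (self.commit, payload)
        return self.commit

    def incremental_read(self, after_commit: int) -> dict:
        """Return only records written by commits later than `after_commit`."""
        return {k: p for k, (c, p) in self.rows.items() if c > after_commit}

table = ToyTable()
checkpoint = table.upsert({"a": "v1", "b": "v1", "c": "v1"})  # initial load
table.upsert({"b": "v2", "d": "v1"})                          # later changes
changed = table.incremental_read(checkpoint)
print(changed)
```

The downstream job sees only the updated record `b` and the new record `d`, and never rescans `a` or `c`.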
Learnings
Tuning file size for write amplification: ~300MB
Sort key on last_updated_at
● Recently changed records are clustered together
Consistent sharding scheme
● Borrow sharding from Postgres
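A small simulation of why sorting on last_updated_at helps with write amplification. It assumes a copy-on-write layout where updating any record rewrites its whole file, and that updates mostly hit recently changed records. The record counts, the "hot" window, and the random seed are made-up parameters for illustration.

```python
import random

RECORDS = 100_000
RECORDS_PER_FILE = 1_000   # 100 files in total
UPDATES = 1_000            # 1% of records change in one batch
HOT_RECORDS = 5_000        # assumption: updates hit the 5,000 most recent records

def files_touched(file_of_rank: list, updated_ranks: list) -> int:
    """Number of distinct files that contain at least one updated record."""
    return len({file_of_rank[rank] for rank in updated_ranks})

rng = random.Random(0)
# Rank 0 is the least recently updated record, rank RECORDS - 1 the most recent.
updated = rng.sample(range(RECORDS - HOT_RECORDS, RECORDS), UPDATES)

# Layout 1: each file holds contiguous ranks, i.e. data sorted by last_updated_at.
sorted_layout = [rank // RECORDS_PER_FILE for rank in range(RECORDS)]
# Layout 2: the same files, but records scattered with no regard to recency.
scattered_layout = sorted_layout[:]
rng.shuffle(scattered_layout)

sorted_touched = files_touched(sorted_layout, updated)
scattered_touched = files_touched(scattered_layout, updated)
print(f"sorted by recency: {sorted_touched} files rewritten")
print(f"scattered:         {scattered_touched} files rewritten")
```

With the sorted layout the batch can touch at most 5 of the 100 files, while the scattered layout spreads the same updates across nearly all of them. File size matters for the same reason, since every touched file is rewritten in full; in Hudi the target base file size is controlled by settings such as `hoodie.parquet.max.file.size`.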
Improvements
● Net saving: $1.25M/year
● Fivetran full re-sync dropped from 1 week to 2 hours
● Historical Fivetran re-sync can be done without maxing out resources on live DBs
● Reliable incremental sync every 4 hours
Product Use Case Spotlight: Notion AI Q&A
● Ask Notion AI questions in chat interface
● Get response based on your Notion pages and databases
AI Product Architecture
● Generate embeddings from user data in an offline batch job
● Load into Vector DB
● Continuously update embeddings as updates come in, via an online Kafka job
[Diagram placeholder: offline and online paths]
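The two paths can be sketched as one index with two writers: a batch loader that embeds a full snapshot, and an event handler that re-embeds a single block when it changes. Everything here is a hypothetical stand-in: `toy_embed` replaces a real embedding model, `ToyVectorIndex` replaces the vector DB, and the event dictionary replaces a Kafka message.

```python
import hashlib

def toy_embed(text: str, dims: int = 4) -> list[float]:
    """Stand-in for a real embedding model: deterministic floats from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]

class ToyVectorIndex:
    """In-memory stand-in for a vector DB keyed by block id."""

    def __init__(self):
        self.vectors = {}

    def upsert(self, block_id: str, vector: list[float]) -> None:
        self.vectors[block_id] = vector

def offline_batch_load(index: ToyVectorIndex, snapshot: dict) -> None:
    """Offline path: embed every block in a snapshot and load it into the index."""
    for block_id, text in snapshot.items():
        index.upsert(block_id, toy_embed(text))

def on_update_event(index: ToyVectorIndex, event: dict) -> None:
    """Online path: re-embed one block when an update event arrives."""
    index.upsert(event["block_id"], toy_embed(event["text"]))

index = ToyVectorIndex()
offline_batch_load(index, {"b1": "meeting notes", "b2": "roadmap"})
before = index.vectors["b1"]
on_update_event(index, {"block_id": "b1", "text": "meeting notes, revised"})
after = index.vectors["b1"]
```

Both paths write through the same `upsert`, so a block edited after the batch run overwrites its batch-generated vector instead of creating a duplicate.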
AI Embeddings: Hudi Usage in Batch Indexing
● How many vectors do we generate in the offline batch?
● Once per day
● 4-hour Hudi update cadence enables us to index and catch up quickly
● How many rows (vectors) do we write per batch?
● How long does the full pipeline take?
[Diagram placeholder: data lake -> derived Hudi table of embeddings -> Spark load to Pinecone]
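The final load step usually comes down to sending embedding rows to the vector DB in fixed-size batches. The sketch below shows only that batching logic. `upsert_fn` is a hypothetical callable standing in for the vector DB client, and the batch size of 100 is an arbitrary choice, not a figure from the talk.

```python
from typing import Callable, Iterable, Iterator

def chunked(rows: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

def load_embeddings(rows: Iterable, upsert_fn: Callable[[list], None],
                    batch_size: int = 100) -> int:
    """Send rows to the sink in batches; return the number of upsert calls made."""
    calls = 0
    for batch in chunked(rows, batch_size):
        upsert_fn(batch)
        calls += 1
    return calls

# Usage with a list standing in for the vector DB.
sent = []
rows = [(f"block-{i}", [0.0, 1.0]) for i in range(250)]
calls = load_embeddings(rows, sent.append, batch_size=100)
print(calls, [len(batch) for batch in sent])
```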
Thanks to the Onehouse team
Vinoth Chandar, Alexey Kudinkin, Ethan Guo, Bhavani Sudha Saktheeswaran, Kyle Weller
Thanks!
Questions?

A Hudi Live Event: Notion's journey through different stages of data scale
