In this Hudi Live Event, Notion's Thomas Chow and Nathan Louie will talk about how their data infrastructure transformed to support Notion's exponential growth and novel product use cases.
Notion has experienced exponential user growth that led them to rethink their technical infrastructure. A common challenge they faced is write-heavy changes that spread randomly across millions of document trees. Check out the live event slides to see how Hudi addresses these workload complexities, and to understand the considerations and design strategies that drove the evolution of Notion's data infrastructure.
14. Learnings
Tuning file size for write amplification: ~300MB
Sort key on last_updated_at
● Recently changed records are clustered together
Consistent sharding scheme
● Borrow the sharding scheme from Postgres (see the config sketch below)
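The slides don't include the actual write configuration, but the learnings above map directly onto standard Hudi write options. Below is a minimal, hypothetical PySpark sketch: ~300 MB target file size, clustering sorted on last_updated_at, and the Postgres shard id reused as the partition path. Table, column, and path names are illustrative, not Notion's actual schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-write-sketch").getOrCreate()

# Stand-in for a batch of changed rows pulled from the replication stream.
changed_rows_df = spark.createDataFrame(
    [("block-1", "2024-05-01T12:00:00Z", 7, "hello world")],
    ["block_id", "last_updated_at", "shard_id", "content"],
)

hudi_options = {
    "hoodie.table.name": "blocks",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "block_id",
    "hoodie.datasource.write.precombine.field": "last_updated_at",
    # Reuse the Postgres shard id so writes stay consistently bucketed.
    "hoodie.datasource.write.partitionpath.field": "shard_id",
    # Target ~300 MB base files to keep write amplification in check.
    "hoodie.parquet.max.file.size": str(300 * 1024 * 1024),
    # Cluster periodically, sorted by last_updated_at, so recently changed
    # records land next to each other.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.plan.strategy.sort.columns": "last_updated_at",
}

(changed_rows_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://data-lake/hudi/blocks"))
```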
15. Improvements
● Net savings: $1.25M/year
● Fivetran full re-sync dropped from 1 week to 2 hours
● Historical Fivetran re-syncs can be done without maxing out resources on the live DBs
● Reliable incremental sync every 4 hours
16. Product Use Case Spotlight: Notion AI Q&A
● Ask Notion AI questions in a chat interface
● Get responses based on your Notion pages and databases
17. AI Product Architecture
● Generate embeddings from user data in an offline batch job
● Load them into a vector DB
● Continuously update embeddings as changes come in via an online Kafka job (see the sketch below)
[Diagram: offline and online embedding paths]
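To make the online path concrete, here is a minimal sketch (not Notion's code) of what a streaming updater could look like: consume page-update events from Kafka, re-embed the changed text, and upsert the vector into the vector DB. The topic name, event schema, embedding model, and index name are all assumptions for illustration.

```python
import json

from kafka import KafkaConsumer                    # kafka-python
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone

model = SentenceTransformer("all-MiniLM-L6-v2")    # placeholder embedding model
index = Pinecone(api_key="...").Index("notion-embeddings")  # hypothetical index

consumer = KafkaConsumer(
    "page-updates",                                # hypothetical topic of document changes
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v),
)

for event in consumer:
    page = event.value
    vector = model.encode(page["text"]).tolist()
    # Upsert keeps the vector DB in sync as edits stream in online.
    index.upsert(vectors=[(page["page_id"], vector, {"workspace": page["workspace_id"]})])
```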
18. AI Embeddings: Hudi Usage in Batch Indexing
● How many vectors do we generate in the offline batch?
● Once per day
● The 4-hour Hudi update cadence enables us to index and catch up quickly
● How many rows (vectors) do we write per batch?
● How long does the full pipeline take?
[Diagram: data lake -> derived Hudi table of embeddings -> Spark load to Pinecone (sketched below)]
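The batch path in the diagram can be approximated with a short PySpark job: incrementally read the derived Hudi table of embeddings and bulk-upsert it into Pinecone. This is a sketch under assumptions, not the actual pipeline; the table path, column names, index name, and begin instant are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("embeddings-to-pinecone").getOrCreate()

# Incremental read pulls only embeddings committed since the last run,
# which is what lets the daily index job catch up quickly.
embeddings = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("s3://data-lake/hudi/page_embeddings"))

def upsert_partition(rows):
    # Create the client on the executor; batch upserts to stay within request limits.
    from pinecone import Pinecone
    index = Pinecone(api_key="...").Index("notion-embeddings")  # hypothetical index
    batch = [(r["page_id"], list(r["embedding"])) for r in rows]
    for i in range(0, len(batch), 100):
        index.upsert(vectors=batch[i:i + 100])

embeddings.select("page_id", "embedding").foreachPartition(upsert_partition)
```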
19. Thanks to the Onehouse team
Vinoth Chandar, Alexey Kudinkin, Ethan Guo, Bhavani Sudha Saktheeswaran, Kyle Weller