In this Hudi Live Event, Notion's Thomas Chow and Nathan Louie will talk about how their data infrastructure transformed to support Notion's exponential growth and novel product use cases.
Notion has experienced exponential user growth that led them to rethink their technical infrastructure. A common challenge they faced is write-heavy changes that spread randomly across millions of document trees. Check out the live event slides to see how Hudi addresses these workload complexities, and to understand the considerations and design strategies that drove the evolution of Notion's data infrastructure.
14. Learnings
Tuning file size for write amplification: ~300MB
Sort key on last_updated_at
● Recently changed records are clustered together
Consistent sharding scheme
● Borrow the sharding scheme from Postgres (see the config sketch below)
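The slides don't include the actual write configuration, but the learnings above map directly onto standard Hudi write options. Below is a minimal, hypothetical PySpark sketch: ~300 MB target file size, clustering sorted on last_updated_at, and the Postgres shard id reused as the partition path. Table, column, and path names are illustrative, not Notion's actual schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-write-sketch").getOrCreate()

# Stand-in for a batch of changed rows pulled from the replication stream.
changed_rows_df = spark.createDataFrame(
    [("block-1", "2024-05-01T12:00:00Z", 7, "hello world")],
    ["block_id", "last_updated_at", "shard_id", "content"],
)

hudi_options = {
    "hoodie.table.name": "blocks",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "block_id",
    "hoodie.datasource.write.precombine.field": "last_updated_at",
    # Reuse the Postgres shard id so writes stay consistently bucketed.
    "hoodie.datasource.write.partitionpath.field": "shard_id",
    # Target ~300 MB base files to keep write amplification in check.
    "hoodie.parquet.max.file.size": str(300 * 1024 * 1024),
    # Cluster periodically, sorted by last_updated_at, so recently changed
    # records land next to each other.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.plan.strategy.sort.columns": "last_updated_at",
}

(changed_rows_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://data-lake/hudi/blocks"))
```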
15. Improvements
● Net savings: $1.25M/year
● Fivetran full re-sync dropped from 1 week to 2 hours
● Historical Fivetran re-syncs can be done without maxing out resources on the live DBs
● Reliable incremental sync every 4 hours
16. Product Use Case Spotlight: Notion AI Q&A
● Ask Notion AI questions in a chat interface
● Get responses based on your Notion pages and databases
17. AI Product Architecture
● Generate embeddings from user data in an offline batch job
● Load them into a vector DB
● Continuously update embeddings as changes come in via an online Kafka job (see the sketch below)
[Diagram: offline and online embedding paths]
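To make the online path concrete, here is a minimal sketch (not Notion's code) of what a streaming updater could look like: consume page-update events from Kafka, re-embed the changed text, and upsert the vector into the vector DB. The topic name, event schema, embedding model, and index name are all assumptions for illustration.

```python
import json

from kafka import KafkaConsumer                    # kafka-python
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone

model = SentenceTransformer("all-MiniLM-L6-v2")    # placeholder embedding model
index = Pinecone(api_key="...").Index("notion-embeddings")  # hypothetical index

consumer = KafkaConsumer(
    "page-updates",                                # hypothetical topic of document changes
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v),
)

for event in consumer:
    page = event.value
    vector = model.encode(page["text"]).tolist()
    # Upsert keeps the vector DB in sync as edits stream in online.
    index.upsert(vectors=[(page["page_id"], vector, {"workspace": page["workspace_id"]})])
```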
18. AI Embeddings: Hudi Usage in Batch Indexing
● How many vectors do we generate in the offline batch?
● Once per day
● The 4-hour Hudi update cadence enables us to index and catch up quickly
● How many rows (vectors) do we write per batch?
● How long does the full pipeline take?
[Diagram: data lake -> derived Hudi table of embeddings -> Spark load to Pinecone (sketched below)]
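The batch path in the diagram can be approximated with a short PySpark job: incrementally read the derived Hudi table of embeddings and bulk-upsert it into Pinecone. This is a sketch under assumptions, not the actual pipeline; the table path, column names, index name, and begin instant are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("embeddings-to-pinecone").getOrCreate()

# Incremental read pulls only embeddings committed since the last run,
# which is what lets the daily index job catch up quickly.
embeddings = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("s3://data-lake/hudi/page_embeddings"))

def upsert_partition(rows):
    # Create the client on the executor; batch upserts to stay within request limits.
    from pinecone import Pinecone
    index = Pinecone(api_key="...").Index("notion-embeddings")  # hypothetical index
    batch = [(r["page_id"], list(r["embedding"])) for r in rows]
    for i in range(0, len(batch), 100):
        index.upsert(vectors=batch[i:i + 100])

embeddings.select("page_id", "embedding").foreachPartition(upsert_partition)
```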
19. Thanks to the Onehouse team
Vinoth Chandar, Alexey Kudinkin, Ethan Guo, Bhavani Sudha Saktheeswaran, Kyle Weller