A Hudi Live Event: A Deep Dive into Hudi's Autogenerated Keys

Primary keys have always been mandatory configurations that the Hudi user needs to set. To enhance ease of use, Hudi 0.14.0 introduces autogenerated keys where primary key columns are not readily available. In this live session, we’ll get a deep dive into the design principles and rationale as to why the team introduced this new feature and how it can improve usability.

Apache Hudi - Auto Record Key Generation
Sivabalan Narayanan {sivabalan@onehouse.ai}

Agenda
- Primary keys
- Can we live without primary keys?
- Auto generation - Requirements
- Different Approaches
- Can it replace all use-cases?
- Wrap up

Primary keys
● Inspiration from RDBMS world
● Uniquely identify records
● EmployeeId (employee dataset), Trip Id (Uber trips dataset), etc.
● Ensure uniqueness across entire dataset
● Helps with deduping, indexing, etc.
● Also assists with read queries
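
As a rough illustration of how a primary key helps with deduping, the sketch below keeps one record per key. It is a toy in-memory example, not how any particular database does it; the `updated_at` ordering field and the sample rows are assumptions made up for the illustration.

```python
from typing import Dict, Iterable, List


def dedupe_by_key(records: Iterable[dict], key_field: str, ordering_field: str) -> List[dict]:
    """Keep one record per key: the one with the largest ordering value."""
    latest: Dict[object, dict] = {}
    for rec in records:
        key = rec[key_field]
        # A later record wins on ties, so re-sent rows replace earlier ones.
        if key not in latest or rec[ordering_field] >= latest[key][ordering_field]:
            latest[key] = rec
    return list(latest.values())


# Hypothetical employee rows; employeeId plays the role of the primary key.
employees = [
    {"employeeId": 1, "name": "Ana", "updated_at": 10},
    {"employeeId": 2, "name": "Ben", "updated_at": 11},
    {"employeeId": 1, "name": "Ana M.", "updated_at": 12},
]
deduped = dedupe_by_key(employees, "employeeId", "updated_at")
```

Without a key field there is nothing to group on, which is why deduping depends on records being uniquely identifiable.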

Apache Hudi - Primary keys
● Uniquely identify records
● Plays a major role during indexing (write path)
a. Differentiate inserts vs. updates (and deletes)
● Provides an aggregated view rather than making users stitch together multiple versions of a record
● Enables efficient writes using different indexes such as bloom filter, record-level index (RLI), etc.
● De-duping, compaction, log merging (MOR), etc.
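
The role of the key on the write path can be pictured with a small sketch: incoming records are tagged as inserts or updates depending on whether their key is already known. This is only a conceptual illustration; a plain Python set stands in for a real index, and `tag_records` and the `tripId` field are hypothetical names, not Hudi APIs.

```python
from typing import Dict, List, Set


def tag_records(existing_keys: Set[str], incoming: List[dict], key_field: str) -> Dict[str, List[dict]]:
    """Split incoming records into inserts and updates by key lookup."""
    tagged: Dict[str, List[dict]] = {"insert": [], "update": []}
    for rec in incoming:
        # The set lookup stands in for an index lookup on the record key.
        op = "update" if rec[key_field] in existing_keys else "insert"
        tagged[op].append(rec)
    return tagged


# Keys already present in the (hypothetical) table.
existing = {"trip-001", "trip-002"}
batch = [
    {"tripId": "trip-002", "fare": 18.5},  # key exists, so this is an update
    {"tripId": "trip-003", "fare": 7.0},   # new key, so this is an insert
]
tagged = tag_records(existing, batch, "tripId")
```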

Primary keys Configuration
● Simple (single field)
● Complex (multi field)
Record key config: "hoodie.datasource.write.recordkey.field"
Key generator config: "hoodie.datasource.write.keygenerator.class"
Example: "hoodie.datasource.write.recordkey.field" = "employeeId"
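
A minimal sketch of how these two configs might be assembled for a Spark write, shown as plain Python dictionaries. The two option names come from the slide; the table name, the second key field `departmentId`, and the output path are assumptions for illustration, and the write call is left as a comment because it needs a Spark session with the Hudi bundle available.

```python
# Simple key: a single field identifies each record.
simple_key_options = {
    "hoodie.table.name": "employees",  # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "employeeId",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
}

# Complex key: several comma-separated fields together identify a record.
complex_key_options = {
    **simple_key_options,
    "hoodie.datasource.write.recordkey.field": "employeeId,departmentId",  # departmentId is hypothetical
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
}

# With a Spark DataFrame `df`, the write would look roughly like:
# df.write.format("hudi").options(**simple_key_options).mode("append").save("/tmp/hudi/employees")
```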

Apache Hudi - Auto Generation of record keys

Auto generation of record keys - Requirements
● Global uniqueness across entire dataset
● Highly compressible
● Efficient to encode and decode
● Resilient to partial task failures
● Compatible/reusable implementation across different engines

Auto generation of record keys - Different Approaches
● Auto increment within batch
● Random Id generation
● Monotonically increasing id
● Format = “<instantTime>-<partitionId>-<rowId>”
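
The format on this slide can be sketched in a few lines. This is an illustration of the idea only, not Hudi's implementation: `generate_keys` is a hypothetical helper, and the instant time is assumed to be a timestamp-like string identifying the write.

```python
from typing import List


def generate_keys(instant_time: str, partition_id: int, num_rows: int) -> List[str]:
    """Build keys of the form <instantTime>-<partitionId>-<rowId>."""
    # rowId is simply the position of the row within its partition.
    return [f"{instant_time}-{partition_id}-{row_id}" for row_id in range(num_rows)]


instant = "20231001123045123"  # assumed instant time for one write
first_attempt = generate_keys(instant, partition_id=0, num_rows=3)
retried_attempt = generate_keys(instant, partition_id=0, num_rows=3)
other_partition = generate_keys(instant, partition_id=1, num_rows=3)
```

Because each key is derived only from the instant time, the partition id, and the row position, a retried task that processes the same rows in the same order reproduces the same keys, while different partitions never collide. That is one way to read the uniqueness and partial-failure requirements together.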

Auto generation of record keys - Different Approaches
Different approaches to encode
"<instantTime>-<partitionId>-<rowId>"
- Original string as is
- UUID (versions 6 and 7)
- Base64 encoding
- Base64 encoding
- ASCII encoding
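
To make the encoding options concrete, here is a hedged sketch of one possible Base64 variant next to the plain string. The fixed-width binary layout (8-byte instant time, 4-byte partition id, 8-byte row id) is an assumption chosen for the example; the slides do not specify how the components are packed.

```python
import base64
import struct
from typing import Tuple

# Assumed layout: big-endian unsigned 8-byte, 4-byte, and 8-byte integers (20 bytes total).
_LAYOUT = ">QIQ"


def encode_key(instant_time: int, partition_id: int, row_id: int) -> str:
    """Pack the three components into 20 bytes and Base64-encode them."""
    packed = struct.pack(_LAYOUT, instant_time, partition_id, row_id)
    return base64.b64encode(packed).decode("ascii")


def decode_key(encoded: str) -> Tuple[int, int, int]:
    """Reverse encode_key and recover (instant_time, partition_id, row_id)."""
    return struct.unpack(_LAYOUT, base64.b64decode(encoded))


plain = "20231001123045123-7-42"                # original string as is
encoded = encode_key(20231001123045123, 7, 42)  # fixed-length Base64 form
decoded = decode_key(encoded)
```

With this layout every encoded key is 28 characters long regardless of the values, whereas the plain string grows with the number of digits. Which form stores and compresses better depends on the data.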

Storage Comparison - Different Approaches

Run time to encode - Different Approaches
