A primary key has always been a mandatory configuration that Hudi users must set. To improve ease of use, Hudi 0.14.0 introduces autogenerated keys for datasets where primary key columns are not readily available. In this live session, we’ll take a deep dive into the design principles behind this feature, the rationale for introducing it, and how it improves usability.
A Hudi Live Event: A Deep Dive into Hudi's Autogenerated Keys
1. Apache Hudi - Auto Record Key Generation
● Sivabalan Narayanan {sivabalan@onehouse.ai}
2. Agenda
- Primary keys
- Can we live w/o primary keys?
- Auto generation - Requirements
- Different Approaches
- Can it replace all use-cases?
- Wrap up
4. Primary keys
● Inspiration from RDBMS world
● Uniquely identify records
● EmployeeId (employee dataset), Trip Id (Uber trips dataset), etc.
● Ensure uniqueness across entire dataset
● Helps with deduping, indexing, etc.
● Also assists on read queries
6. Apache Hudi - Primary keys
● Uniquely identify records
● Plays a major role during indexing (write path)
a. Differentiate inserts vs. updates (& deletes)
● Provides an aggregated view rather than making users stitch together multiple versions of records
● Enables efficient writes using different indexes such as bloom filters, the record-level index (RLI), etc.
● De-duping, compaction, log merging (MOR), etc.
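The insert-vs-update distinction on the write path can be illustrated with a minimal sketch. It assumes a hypothetical in-memory index (a plain dict from record key to file-group name) and a hypothetical `tag_records` helper; neither is Hudi's actual index API.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, dict]  # (record key, row payload)


def tag_records(
    index: Dict[str, str], batch: Iterable[Record]
) -> Tuple[List[Record], List[Tuple[str, dict, str]]]:
    """Split a batch into inserts and updates by looking up each key.

    `index` is a hypothetical stand-in for a key index: it maps a
    record key to the file group that already holds that record.
    """
    inserts: List[Record] = []
    updates: List[Tuple[str, dict, str]] = []
    for key, row in batch:
        location = index.get(key)
        if location is None:
            # Key never seen before: treat the record as a new insert.
            inserts.append((key, row))
        else:
            # Key already present: route the record to its existing file group.
            updates.append((key, row, location))
    return inserts, updates


index = {"trip-001": "file-group-A", "trip-002": "file-group-B"}
batch = [("trip-002", {"fare": 12.5}), ("trip-003", {"fare": 8.0})]
inserts, updates = tag_records(index, batch)
```

Without a record key there is nothing to look up, which is why a key (user-supplied or autogenerated) is needed before any index can be used.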
10. Auto generation of record keys - Requirements
● Global uniqueness across entire dataset
● Highly compressible
● Efficient to encode and decode
● Resilient to partial task failures
● Compatible/reusable implementation across different engines
12. Auto generation of record keys - Different Approaches
● Auto increment w/in batch
● Random Id generation
● Monotonically increasing id
● Format = “<instantTime>-<partitionId>-<rowId>”
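A key in the “<instantTime>-<partitionId>-<rowId>” format could be produced along these lines. This is a rough sketch, not Hudi's implementation: `generate_keys` is a hypothetical helper, the instant time value is made up, and the row id is assumed to be the row's position within its task's partition.

```python
from typing import Iterable, Iterator, Tuple


def generate_keys(
    instant_time: str, partition_id: int, rows: Iterable[dict]
) -> Iterator[Tuple[str, dict]]:
    """Pair each row with a key of the form <instantTime>-<partitionId>-<rowId>."""
    for row_id, row in enumerate(rows):
        # row_id is only unique within one partition; the instant time and
        # partition id prefixes make the full key unique across the dataset.
        yield f"{instant_time}-{partition_id}-{row_id}", row


instant = "20231001123045123"  # hypothetical commit instant time
task0 = [k for k, _ in generate_keys(instant, 0, [{"v": 1}, {"v": 2}])]
task1 = [k for k, _ in generate_keys(instant, 1, [{"v": 3}])]
# A retried task that sees the same rows in the same order regenerates the same keys.
task1_retry = [k for k, _ in generate_keys(instant, 1, [{"v": 3}])]
```

Because the key depends only on the commit, the partition, and the row position, a retried task does not need to coordinate with other tasks, provided its input order is deterministic.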
13. Auto generation of record keys - Different Approaches
Different approaches to encode “<instantTime>-<partitionId>-<rowId>”:
- Original string as is
- UUID v6 / v7
- Base64 encoding
- ASCII encoding