SlideShare a Scribd company logo
1 of 16
Download to read offline
Apache Hudi
- Auto Record Key
Generation
● - Sivabalan Narayanan {sivabalan@onehouse.ai}
●
Agenda
- Primary keys
- Can we live w/o Primary keys
- Auto generation - Requirements
- Different Approaches
- Can it replace all use-cases?
- Wrap up
Primary keys
Primary keys
● Inspiration from RDBMS world
● Uniquely identify records
● EmployeeId (employee dataset), Trip Id (Uber trips dataset), etc.
● Ensure uniqueness across entire dataset
● Helps with deduping, indexing etc
● Also assists on read queries
Apache Hudi - Primary keys
Apache Hudi - Primary keys
● Uniquely identify records
● Plays a major role during indexing (write path)
a. Differentiate inserts vs updates(& deletes)
● Provides an aggregated view rather than letting users to stitch multiple versions of
records
● Enables efficient writes using different indexes like bloom filter, RLI etc.
● De-duping, compaction, log merging (MOR), etc.
Apache Hudi - Configuring
Primary keys
Primary keys Configuration
● Simple (single field)
● Complex (multi field)
Record key Config : “hoodie.datasource.write.recordkey.field”
Key Generator config: “hoodie.datasource.write.keygenerator.class”
Example: “hoodie.datasource.write.recordkey.field” = “employeeId”
Apache Hudi - Auto
Generation of record keys
Auto generation of record keys -
Requirements
● Global uniqueness across entire dataset
● Highly compressible
● Efficient to encode and decode
● Resilient to Partial task failures
● Compatible/reusable implementation across different engines
Apache Hudi - Auto
Generation of record keys
Auto generation of record keys -
Different Approaches
● Auto increment w/in batch
● Random Id generation
● Mototonically increasing id
● Format = “<instantTime>-<partitionId>-<rowId>”
Auto generation of record keys -
Different Approaches
Different approaches to encode
“<instantTime>-<partitionId>-<rowId>”
- Original string as is
- UUID6,7
- Base64 encoding
- ASCII encoding
Storage Comparison - Different
Approaches
Run time to encode - Different
Approaches
Thanks!
Questions?

More Related Content

Similar to A Hudi Live Event: A Deep Dive into Hudi's Autogenerated Keys

SQL or NoSQL - how to choose
SQL or NoSQL - how to chooseSQL or NoSQL - how to choose
SQL or NoSQL - how to chooseLars Thorup
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow ManagementRomi Kuntsman
 
Drupal Content Management System
Drupal Content Management SystemDrupal Content Management System
Drupal Content Management SystemAdhoura Academy
 
Barcamp Hong Kong 2014 - Commercial Use of OSS Web Content Management System
Barcamp Hong Kong 2014 - Commercial Use of OSS Web Content Management SystemBarcamp Hong Kong 2014 - Commercial Use of OSS Web Content Management System
Barcamp Hong Kong 2014 - Commercial Use of OSS Web Content Management SystemWong Hoi Sing Edison
 
Lua as a business logic language in high load application
Lua as a business logic language in high load applicationLua as a business logic language in high load application
Lua as a business logic language in high load applicationIlya Martynov
 
PostgreSQL - Object Relational Database
PostgreSQL - Object Relational DatabasePostgreSQL - Object Relational Database
PostgreSQL - Object Relational DatabaseMubashar Iqbal
 
Creating a custom API for a headless Drupal
Creating a custom API for a headless DrupalCreating a custom API for a headless Drupal
Creating a custom API for a headless DrupalExove
 
Big SQL NYC Event December by Virender
Big SQL NYC Event December by VirenderBig SQL NYC Event December by Virender
Big SQL NYC Event December by Virendervithakur
 
WSO2 Presentation Layer
WSO2 Presentation LayerWSO2 Presentation Layer
WSO2 Presentation LayerNuwan Bandara
 
Engage 2020 - Best Practices for analyzing Domino Applications
Engage 2020 - Best Practices for analyzing Domino ApplicationsEngage 2020 - Best Practices for analyzing Domino Applications
Engage 2020 - Best Practices for analyzing Domino Applicationspanagenda
 
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop Neo4j
 
Storing User Files with Express, Stormpath, and Amazon S3
Storing User Files with Express, Stormpath, and Amazon S3Storing User Files with Express, Stormpath, and Amazon S3
Storing User Files with Express, Stormpath, and Amazon S3Stormpath
 
Apache Jena Elephas and Friends
Apache Jena Elephas and FriendsApache Jena Elephas and Friends
Apache Jena Elephas and FriendsRob Vesse
 

Similar to A Hudi Live Event: A Deep Dive into Hudi's Autogenerated Keys (19)

IT glossary
IT glossaryIT glossary
IT glossary
 
Linux Internals - Part I
Linux Internals - Part ILinux Internals - Part I
Linux Internals - Part I
 
SQL or NoSQL - how to choose
SQL or NoSQL - how to chooseSQL or NoSQL - how to choose
SQL or NoSQL - how to choose
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow Management
 
Drupal Content Management System
Drupal Content Management SystemDrupal Content Management System
Drupal Content Management System
 
Barcamp Hong Kong 2014 - Commercial Use of OSS Web Content Management System
Barcamp Hong Kong 2014 - Commercial Use of OSS Web Content Management SystemBarcamp Hong Kong 2014 - Commercial Use of OSS Web Content Management System
Barcamp Hong Kong 2014 - Commercial Use of OSS Web Content Management System
 
Lua as a business logic language in high load application
Lua as a business logic language in high load applicationLua as a business logic language in high load application
Lua as a business logic language in high load application
 
PostgreSQL - Object Relational Database
PostgreSQL - Object Relational DatabasePostgreSQL - Object Relational Database
PostgreSQL - Object Relational Database
 
Creating a custom API for a headless Drupal
Creating a custom API for a headless DrupalCreating a custom API for a headless Drupal
Creating a custom API for a headless Drupal
 
NodeJS
NodeJSNodeJS
NodeJS
 
Big SQL NYC Event December by Virender
Big SQL NYC Event December by VirenderBig SQL NYC Event December by Virender
Big SQL NYC Event December by Virender
 
WSO2 Presentation Layer
WSO2 Presentation LayerWSO2 Presentation Layer
WSO2 Presentation Layer
 
Engage 2020 - Best Practices for analyzing Domino Applications
Engage 2020 - Best Practices for analyzing Domino ApplicationsEngage 2020 - Best Practices for analyzing Domino Applications
Engage 2020 - Best Practices for analyzing Domino Applications
 
HBase introduction talk
HBase introduction talkHBase introduction talk
HBase introduction talk
 
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
 
Storing User Files with Express, Stormpath, and Amazon S3
Storing User Files with Express, Stormpath, and Amazon S3Storing User Files with Express, Stormpath, and Amazon S3
Storing User Files with Express, Stormpath, and Amazon S3
 
Apache Arrow
Apache ArrowApache Arrow
Apache Arrow
 
Empathic API-Design
Empathic API-DesignEmpathic API-Design
Empathic API-Design
 
Apache Jena Elephas and Friends
Apache Jena Elephas and FriendsApache Jena Elephas and Friends
Apache Jena Elephas and Friends
 

Recently uploaded

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 

Recently uploaded (20)

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 

A Hudi Live Event: A Deep Dive into Hudi's Autogenerated Keys

  • 1. Apache Hudi - Auto Record Key Generation ● - Sivabalan Narayanan {sivabalan@onehouse.ai} ●
  • 2. Agenda - Primary keys - Can we live w/o Primary keys - Auto generation - Requirements - Different Approaches - Can it replace all use-cases? - Wrap up
  • 4. Primary keys ● Inspiration from RDBMS world ● Uniquely identify records ● EmployeeId (employee dataset), Trip Id (Uber trips dataset), etc. ● Ensure uniqueness across entire dataset ● Helps with deduping, indexing etc ● Also assists on read queries
  • 5. Apache Hudi - Primary keys
  • 6. Apache Hudi - Primary keys ● Uniquely identify records ● Plays a major role during indexing (write path) a. Differentiate inserts vs updates(& deletes) ● Provides an aggregated view rather than letting users to stitch multiple versions of records ● Enables efficient writes using different indexes like bloom filter, RLI etc. ● De-duping, compaction, log merging (MOR), etc.
  • 7. Apache Hudi - Configuring Primary keys
  • 8. Primary keys Configuration ● Simple (single field) ● Complex (multi field) Record key Config : “hoodie.datasource.write.recordkey.field” Key Generator config: “hoodie.datasource.write.keygenerator.class” Example: “hoodie.datasource.write.recordkey.field” = “employeeId”
  • 9. Apache Hudi - Auto Generation of record keys
  • 10. Auto generation of record keys - Requirements ● Global uniqueness across entire dataset ● Highly compressible ● Efficient to encode and decode ● Resilient to Partial task failures ● Compatible/reusable implementation across different engines
  • 11. Apache Hudi - Auto Generation of record keys
  • 12. Auto generation of record keys - Different Approaches ● Auto increment w/in batch ● Random Id generation ● Mototonically increasing id ● Format = “<instantTime>-<partitionId>-<rowId>”
  • 13. Auto generation of record keys - Different Approaches Different approaches to encode “<instantTime>-<partitionId>-<rowId>” - Original string as is - UUID6,7 - Base64 encoding - ASCII encoding
  • 14. Storage Comparison - Different Approaches
  • 15. Run time to encode - Different Approaches