Hudi architecture, fundamentals and capabilities

Learn about Hudi's architecture, concurrency control mechanisms, table services and tools.

By : Abhishek Modi, Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal


  1. 1. Apache Hudi Learning Series
  2. 2. Hudi Intro Apache Hudi ingests & manages storage of large analytical datasets over DFS (HDFS or cloud stores). Hudi brings stream processing to big data, providing fresh data while being an order of magnitude more efficient than traditional batch processing. Incremental Database Ingestion De-duping Log Events Storage Management Transactional Writes Faster Derived/ETL Data Compliance/Data Deletions Unique key constraints Late data handling
  3. 3. Industry/Cloud Solutions
  4. 4. Data Consistency Datacenter agnostic, xDC replication, strong consistency Data Freshness < 15 min of freshness on Lake & warehouse Hudi for Data Application Feature store for ML Incremental Processing for all Easy on-boarding, monitoring & debugging Adaptive Data Layout Stitch files, Optimize layout, Prune columns, Encrypt rows/columns on demand through a standardized interface Efficient Query Execution Column indexes for improved query planning & execution Compute & Storage Efficiency Do more with less CPU, Storage, Memory Data Accuracy Semantic validations for columns: NotNull, Range etc Hudi@Uber
  5. 5. UseCase (Latency, Scale..) Batch / Stream (Spark/Flink/Presto/...) Source A Table API Incremental Stream Pulls & Joins Consumer Derived Table A delta Source B delta Source N delta ... Table A delta Table B delta Table N delta ... UseCase (Latency, Scale..) Table API Incremental Stream Pulls & Joins Consumer Derived Table B Data Processing : Incremental Streams Batch / Stream (Spark/Flink/Presto/...) *source = {Kafka, CSV, DFS, Hive table, Hudi table etc}
  6. 6. 500B+ records/day 150+ PB Transactional Data Lake 8,000+ Tables Hudi@Uber Facts and figures
  7. 7. Read/Write Client APIs
  8. 8. 01 Write Client 02 Read Client 03 Supported Engines 04 Q&A Agenda
  9. 9. Hudi APIs Highlights Snapshot Isolation Readers will not see partial writes from writers. Atomic Writes Writes happen either in full or not at all; partial writes (e.g. from killed processes) are not valid. Read / Write Optimized Depending on the required SLA, writes or reads can be made faster (at the other’s expense). Incremental Reads/Writes Readers can choose to only read new records from some timestamp. This makes efficient incremental pipelines possible. Point In Time Queries (aka Time-Travel) Readers can read snapshot views at either the latest time, or some past time. Table Services Table management services such as clustering or compaction (covered later in this series).
  10. 10. Insert ● Similar to INSERT in databases ● Insert records without checking for duplicates. Hudi Write APIs Upsert ● Similar to UPDATE or INSERT paradigms in databases ● Uses an index to find existing records to update and avoids duplicates. ● Slower than Insert.
  11. 11. Hudi Write APIs Bulk Insert ● Similar to Insert. ● Handles large amounts of data - best for bootstrapping use-cases. ● Does not guarantee file sizing Insert Overwrite ● Overwrite a partition with new data. ● Useful for backfilling use-cases. Insert Upsert
  12. 12. Bulk Insert Hudi Write APIs Delete ● Similar to DELETE in databases. ● Soft Deletes / Hard Deletes Hive Registration ● Sync the changes to your dataset to Hive. Insert Overwrite Insert Upsert
  13. 13. Hudi Write APIs Rollback / Restore ● Rollback inserts/upserts etc to restore the dataset to some past state. ● Useful when mistakes happen. Bulk Insert Hive Registration Insert Upsert Insert Overwrite Delete
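The write APIs above are exposed through a single "operation" option on the Spark datasource. A minimal sketch, not from the deck, assuming the documented hoodie.datasource.write.* keys and an inputDF prepared as in the demo slides that follow:

    // Minimal sketch: selecting a write API via the datasource "operation" option.
    // Assumes inputDF, tableName and basePath exist as in the demo slides.
    inputDF.write.
      format("org.apache.hudi").
      option("hoodie.table.name", tableName).
      option("hoodie.datasource.write.recordkey.field", "uuid").
      option("hoodie.datasource.write.partitionpath.field", "ts").
      // one of: insert, upsert, bulk_insert, insert_overwrite, delete
      option("hoodie.datasource.write.operation", "upsert").
      mode("append").
      save(basePath)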
  14. 14. Hudi Read APIs Snapshot Read ● This is the typical read pattern ● Read data at latest time (standard) ● Read data at some point in time (time travel) Incremental Read ● Read records modified only after a certain time or operation. ● Can be used in incremental processing pipelines.
  15. 15. Hudi Metadata Client Get Latest Snapshot Files Get the list of files that contain the latest snapshot data. This is useful for backing up / archiving datasets. Globally Consistent Meta Client Get X-DC consistent views at the cost of freshness. Get Partitions / Files Mutated Since Get a list of partitions or files mutated since some timestamp. This is also useful for incremental backup / archiving. There is a read client for Hudi Table Metadata as well. Here are some API highlights:
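A hedged sketch of what reading table metadata can look like with HoodieTableMetaClient; the builder-style construction shown here is from recent Hudi releases and the exact API may differ by version:

    // Hedged sketch: listing completed commits on the timeline via the meta client.
    import org.apache.hudi.common.table.HoodieTableMetaClient

    val metaClient = HoodieTableMetaClient.builder()
      .setConf(spark.sparkContext.hadoopConfiguration)
      .setBasePath(basePath)
      .build()

    // Completed commit instants, oldest first
    metaClient.getActiveTimeline
      .getCommitsTimeline
      .filterCompletedInstants()
      .getInstants
      .forEach(instant => println(instant.getTimestamp))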
  16. 16. Hudi Table Services Compaction Convert files on disk into read optimized files (see Merge on Read in the next section). Clustering Clustering can make reads more efficient by changing the physical layout of records across files. (see section 3) Clean Remove Hudi data files that are no longer needed. (see section 3) Archiving Archive Hudi metadata files that are no longer being actively used. (see section 3)
  17. 17. Code Examples val dataGenerator = new DataGenerator val generatedJson = convertToStringList(dataGenerator.generateInserts(100)) val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2))
  18. 18. Code Examples val dataGenerator = new DataGenerator val generatedJson = convertToStringList(dataGenerator.generateInserts(100)) val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2)) This is a data gen class provided by Hudi for testing We’ll be using SPARK for this demo
  19. 19. Code Examples: Generate Data val dataGenerator = new DataGenerator val generatedJson = convertToStringList(dataGenerator.generateInserts(100)) val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2)) generatedDataDF.show() +--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+ | begin_lat| begin_lon| driver| end_lat| end_lon| fare| geo| rider| ts| uuid| +--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+ | 0.47269058795|0.461578584504|driver-213| 0.75480340700| 0.967115994201|34.1582847163|americas/brazi...|rider-213|1611908339000|500ca486-9323-46f...| | 0.61000705621| 0.87794022954|driver-213| 0.340787050592| 0.503079814229| 43.49238112|americas/brazi...|rider-123|1614586739000|cf0767fe-2afa-4b0...| | 0.57318354079| 0.49234796529|driver-213|0.0898858178093|0.4252089969871| 64.276962958|americas/unite...|rider-214|1617326300000|a7fc67fd-8026-4c9...| |0.216241503676|0.142850512594|driver-213| 0.589094962481| 0.096682383192| 93.560181152|americas/unite...|rider-417|1612167539000|77572226-6edd-4fb...| | 0.406135109| 0.56440921390|driver-213| 0.79870630494|0.0269835922718|17.8511352550| asia/india/c...|rider-249|1617326301000|2227d696-bc2f-490...| | 0.87420415264| 0.75282681532|driver-213| 0.919782712888| 0.36246477087|19.1791391066|americas/unite...|rider-351|1618301939000|90a4dae9-e21d-4a4...| | 0.18564880850| 0.96945864178|driver-213|0.3818636703720|0.2525265221447| 33.922164839|americas/unite...|rider-491|1611908339000|bb742f4d-1ab8-42e...| | 0.07505887600|0.038441044444|driver-213|0.0437635335453| 0.634604006761| 66.620843664|americas/brazi...|rider-481|1617351482000|4735f5e6-e746-49d...| | 0.6510585056| 0.81928686877|driver-213|0.2071489600291|0.0622403109582| 41.062909290| asia/india/c...|rider-471|1617325991000|16db8d5d-955a-4d1...| |0.114883931570| 0.62732122024|driver-213| 0.745467853751| 0.395493986490| 27.794786885|americas/unite...|rider-591|1611908339000|115c2738-9059-4be...| +--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+
  20. 20. Code Examples: Generate Data val dataGenerator = new DataGenerator val generatedJson = convertToStringList(dataGenerator.generateInserts(100)) val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2)) generatedDataDF.show() +--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+ | begin_lat| begin_lon| driver| end_lat| end_lon| fare| geo| rider| ts| uuid| +--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+ | 0.47269058795|0.461578584504|driver-213| 0.75480340700| 0.967115994201|34.1582847163|americas/brazi...|rider-213|1611908339000|500ca486-9323-46f...| | 0.61000705621| 0.87794022954|driver-213| 0.340787050592| 0.503079814229| 43.49238112|americas/brazi...|rider-123|1614586739000|cf0767fe-2afa-4b0...| | 0.57318354079| 0.49234796529|driver-213|0.0898858178093|0.4252089969871| 64.276962958|americas/unite...|rider-214|1617326300000|a7fc67fd-8026-4c9...| |0.216241503676|0.142850512594|driver-213| 0.589094962481| 0.096682383192| 93.560181152|americas/unite...|rider-417|1612167539000|77572226-6edd-4fb...| | 0.406135109| 0.56440921390|driver-213| 0.79870630494|0.0269835922718|17.8511352550| asia/india/c...|rider-249|1617326301000|2227d696-bc2f-490...| | 0.87420415264| 0.75282681532|driver-213| 0.919782712888| 0.36246477087|19.1791391066|americas/unite...|rider-351|1618301939000|90a4dae9-e21d-4a4...| | 0.18564880850| 0.96945864178|driver-213|0.3818636703720|0.2525265221447| 33.922164839|americas/unite...|rider-491|1611908339000|bb742f4d-1ab8-42e...| | 0.07505887600|0.038441044444|driver-213|0.0437635335453| 0.634604006761| 66.620843664|americas/brazi...|rider-481|1617351482000|4735f5e6-e746-49d...| | 0.6510585056| 0.81928686877|driver-213|0.2071489600291|0.0622403109582| 41.062909290| asia/india/c...|rider-471|1617325991000|16db8d5d-955a-4d1...| |0.114883931570| 0.62732122024|driver-213| 0.745467853751| 0.395493986490| 27.794786885|americas/unite...|rider-591|1611908339000|115c2738-9059-4be...| +--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+ and this for hoodie record key. We’ll use this for partition key.
  21. 21. Code Examples: Writes Opts val dataGenerator = new DataGenerator val generatedJson = convertToStringList(dataGenerator.generateInserts(100)) val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2)) val hudiWriteOpts = Map( "hoodie.table.name" -> (tableName), "hoodie.datasource.write.recordkey.field" -> "uuid", "hoodie.datasource.write.partitionpath.field" -> "ts", "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator", "hoodie.deltastreamer.keygen.timebased.timestamp.type" -> "UNIX_TIMESTAMP", "hoodie.deltastreamer.keygen.timebased.output.dateformat" -> "yyyy/MM/dd", )
  22. 22. Code Examples: Write val dataGenerator = new DataGenerator val generatedJson = convertToStringList(dataGenerator.generateInserts(100)) val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2)) val hudiWriteOpts = Map( "hoodie.table.name" -> (tableName), "hoodie.datasource.write.recordkey.field" -> "uuid", "hoodie.datasource.write.partitionpath.field" -> "ts", "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator", "hoodie.deltastreamer.keygen.timebased.timestamp.type" -> "UNIX_TIMESTAMP", "hoodie.deltastreamer.keygen.timebased.output.dateformat" -> "yyyy/MM/dd", ) generatedDataDF.write. format("org.apache.hudi"). options(hudiWriteOpts). save(basePath)
  23. 23. Code Examples: Hive Registration val hiveSyncConfig = new HiveSyncConfig() hiveSyncConfig.databaseName = databaseName hiveSyncConfig.tableName = tableName hiveSyncConfig.basePath = basePath hiveSyncConfig.partitionFields = List("ts") val hiveConf = new HiveConf() val dfs = (new Path(basePath)).getFileSystem(new Configuration()) val hiveSyncTool = new HiveSyncTool(hiveSyncConfig, hiveConf, dfs) hiveSyncTool.syncHoodieTable() Not to be confused with cross-dc hive sync Can be called manually, or you can configure HudiWriteOpts to trigger it automatically.
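For the automatic route mentioned on the slide, Hive sync can also be driven from the write path. A hedged sketch using the hoodie.datasource.hive_sync.* options (exact key names may vary slightly across releases); hudiWriteOpts, databaseName and tableName are the values defined in the earlier examples:

    // Hedged sketch: triggering Hive sync from the datasource writer itself.
    val hiveSyncOpts = Map(
      "hoodie.datasource.hive_sync.enable" -> "true",
      "hoodie.datasource.hive_sync.database" -> databaseName,
      "hoodie.datasource.hive_sync.table" -> tableName,
      "hoodie.datasource.hive_sync.partition_fields" -> "ts",
      "hoodie.datasource.hive_sync.jdbcurl" -> "jdbc:hive2://hiveserver:10000"
    )

    generatedDataDF.write.
      format("org.apache.hudi").
      options(hudiWriteOpts ++ hiveSyncOpts).
      save(basePath)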
  24. 24. Code Examples: Snapshot Read val readDF = spark.sql("select uuid, driver, begin_lat, begin_lon from " + databaseName + "." + tableName) val readDF = spark.read.format("org.apache.hudi") .load(basePath) .select("uuid", "driver", "begin_lat", "begin_lon") readDF.show() +--------------------+----------+--------------------+--------------------+ | uuid| driver| begin_lat| begin_lon| +--------------------+----------+--------------------+--------------------+ |57d559d0-e375-475...|driver-284|0.014159831486388885| 0.42849372303000655| |fd51bc6e-1303-444...|driver-284| 0.1593867607188556|0.010872312870502165| |e8033c1e-a6e5-490...|driver-284| 0.2110206104048945| 0.2783086084578943| |d619e592-0b41-4c8...|driver-284| 0.08528650347654165| 0.4006983139989222| |799f7e50-27bc-4c9...|driver-284| 0.6570857443423376| 0.888493603696927| |c22ba7e5-68b5-4eb...|driver-284| 0.18294079059016366| 0.19949323322922063| |fbb80816-fe18-4e2...|driver-284| 0.7340133901254792| 0.5142184937933181| |3dfeb884-41fd-4ea...|driver-284| 0.4777395067707303| 0.3349917833248327| |034e0576-f59f-4e9...|driver-284| 0.7180196467760873| 0.13755354862499358| |e9c6e3b1-1ed4-43b...|driver-284| 0.16603428449020086| 0.6999655248704163| |18b39bef-9ebb-4b5...|driver-213| 0.1856488085068272| 0.9694586417848392| |653a4cb6-3c94-4ee...|driver-213| 0.11488393157088261| 0.6273212202489661| |11fbfce7-a10b-4d1...|driver-213| 0.21624150367601136| 0.14285051259466197| |0199a292-1702-47f...|driver-213| 0.4726905879569653| 0.46157858450465483| |5e1d80ce-e95b-4ef...|driver-213| 0.5731835407930634| 0.4923479652912024| |5d51b234-47ab-467...|driver-213| 0.651058505660742| 0.8192868687714224| |ff2e935b-a403-490...|driver-213| 0.0750588760043035| 0.03844104444445928| |bc644743-0667-48b...|driver-213| 0.6100070562136587| 0.8779402295427752| |026c7b79-3012-414...|driver-213| 0.8742041526408587| 0.7528268153249502| |9a06d89d-1921-4e2...|driver-213| 0.40613510977307| 0.5644092139040959| +--------------------+----------+--------------------+--------------------+ only showing top 20 rows Two ways of querying the same Hudi Dataset
  25. 25. Code Examples: Incremental Read val newerThanTimestamp = "20200728232543" val readDF = spark.read.format("org.apache.hudi") .option(QUERY_TYPE_OPT_KEY,QUERY_TYPE_INCREMENTAL_OPT_VAL) .option(BEGIN_INSTANTTIME_OPT_KEY, newerThanTimestamp) .load(basePath) .filter(col("_hoodie_commit_time") > newerThanTimestamp) .select("uuid", "driver", "begin_lat", "begin_lon")
  26. 26. Code Examples: Incremental Read val newerThanTimestamp = "20200728232543" val readDF = spark.read.format("org.apache.hudi") .option(QUERY_TYPE_OPT_KEY,QUERY_TYPE_INCREMENTAL_OPT_VAL) .option(BEGIN_INSTANTTIME_OPT_KEY, newerThanTimestamp) .load(basePath) .filter(col("_hoodie_commit_time") > newerThanTimestamp) .select("uuid", "driver", "begin_lat", "begin_lon") This is simply 2020/07/28 23:25:43
  27. 27. Supported Engines Spark Flink Hive Presto Impala Athena (AWS)
  28. 28. Table Data Format
  29. 29. 01 Table Types 02 Table Layout 03 Log File Format 04 Q&A Agenda
  30. 30. ● Partitions are directories on disk ○ Date based partitions: 2021/01/01 2021/01/02 …. ● Data is written as records in data-files within partitions 2021/ 01/ 01/ fd83af1d-1b18-45fe-9a8c-a19efd091994-0_49-36-47881_20210102020825.parquet ● Each record has a schema and should contain a partition path and a unique record key ● Each data-file is versioned and newer versions contain the latest data ● Supported data-file formats: Parquet, ORC (under development) Basics
  31. 31. ● fd83af1d-1b18-45fe-9a8c-a19efd091994-0_49-36-47881_20210102020825.parquet fileID (UUID) writeToken version (time of commit) file-format ● fd83af1d-1b18-45fe-9a8c-a19efd091994-0_49-36-47881_20210103102345.parquet (newer version as timestamp is greater) ● A record with a particular hoodie-key will exist in only one fileID. Basics
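As an illustration of the naming convention above (not part of the deck), the pieces can be pulled apart by splitting on the underscores:

    // Illustration only: decomposing a data-file name into fileID, writeToken and commit time.
    val fileName = "fd83af1d-1b18-45fe-9a8c-a19efd091994-0_49-36-47881_20210102020825.parquet"

    val Array(fileId, writeToken, rest) = fileName.split("_")
    val Array(commitTime, fileFormat) = rest.split("\\.")

    println(s"fileId=$fileId writeToken=$writeToken commitTime=$commitTime format=$fileFormat")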
  32. 32. Copy On Write (Read-optimized format) Updates to existing records lead to a newer version of the data-file. How are inserts processed? Inserts are partitioned and written to multiple new data-files. How are updates processed? All records are read from the latest version of the data-file, updates are applied in memory, and a new version of the data-file is written.
  33. 33. Key1 .....……... ... Key2 …..……... ... Key3 …..……... ... Key4 …..……... ... Batch 1 (ts1) upsert Key1 C1 .. Key3 C2 .. Version at C2 (ts2) Version at C1 (ts1) Version at C1 (ts1) File 2 Key1 C1 .. Key3 C1 .. Key2 C1 .. Key4 C1 .. File 1 Queries HUDI Copy On Write: Explained Batch 2 (ts2) Key3 ... .....……...
  34. 34. Copy On Write: Benefits Latest Data The latest version of the data can always be read from the latest data-files. Performance Native columnar file format; reads are served from read-optimized base files with no merge overhead. Limited Updates Very performant for insert-only workloads with occasional updates.
  35. 35. Copy On Write: Challenges Write Amplification Small batches lead to huge reads and rewrites of parquet files. Ingestion Latency Cannot ingest batches very frequently due to huge IO and compute overhead. File sizes Cannot control file sizes very well; the larger the file size, the more IO for a single record update.
  36. 36. Merge On Read (Write-optimized format) Updates to existing records are written to a “log-file” (similar to WAL) How are Inserts processed Inserts are partitioned and written to multiple new data-files How are updates processed Updates are written to a LogBlock Write the LogBlock to the log-file Log-file format is optimized to support appends (HDFS only) but also works with Cloud Stores (new versions created)
  37. 37. upsert Key1 .....……... ... Key2 …..……... ... Key3 …..……... ... Key4 …..……... ... Batch 1 (ts1) Parquet + Log Files Key1 .....……... ... Key2 …..……... ... Key3 …..……... ... Batch 2 (ts2) K1 C2 ... ... K2 C2 ... Key1 C1 .. Key3 C1 .. Key2 C1 .. Key4 C1 .. K3 C2 Read Optimized Queries HUDI Merge On Read: Explained Data file at C1 (ts1) (parquet) Data file at C1 (ts1) (parquet) Unmerged log file at ts2 Unmerged log file at ts2 Real Time Queries
  38. 38. Merge On Read: Benefits Low Ingestion latency Writes are very fast Write Amplification Low write amplification as merge is over multiple ingestion batches Read vs Write Optimization Merge data-file and delta-file to create new version of data-file. “Compaction” operation creates new version of data-file, can be scheduled asynchronously in a separate pipeline without stopping Ingestion or Readers. New data-files automatically used after Compaction completes.
  39. 39. Merge On Read: Challenges Freshness is impacted Freshness may be worse if the read uses only the Read Optimized View (only data files). Increased query cost If reading from data-files and delta-files together (due to merge overhead). This is called Real Time View. Compaction required to bound merge cost Need to create and monitor additional pipeline(s)
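A hedged sketch of how the two table types and the two read views are selected through the Spark datasource; the option values shown (COPY_ON_WRITE / MERGE_ON_READ, snapshot / read_optimized) follow the documented configs, and hudiWriteOpts is the map from the earlier code examples:

    // Hedged sketch: choosing the table type at write time ...
    inputDF.write.
      format("org.apache.hudi").
      options(hudiWriteOpts).
      option("hoodie.datasource.write.table.type", "MERGE_ON_READ"). // or COPY_ON_WRITE
      mode("append").
      save(basePath)

    // ... and the view at read time.
    // Read-optimized view: base files only, cheaper but potentially less fresh.
    val readOptimizedDF = spark.read.format("org.apache.hudi").
      option("hoodie.datasource.query.type", "read_optimized").
      load(basePath)

    // Real-time (snapshot) view: merges base files with log files at query time.
    val realTimeDF = spark.read.format("org.apache.hudi").
      option("hoodie.datasource.query.type", "snapshot").
      load(basePath)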
  40. 40. ● Made up of multiple LogBlocks ● Each LogBlock is made up of: ○ A header with timestamp, schema and other details of the operation ○ Serialized records which are part of the operation ○ LogBlock can hold any format, typically AVRO or Parquet ● Log-File is also versioned ○ S3 and cloud stores do not allow appends ○ Versioning helps to assemble all updates Log File Format fileID (UUID) version (time of commit) file-format writeToken 3215eafe-72cb-4547-929a-0e982be3f45d-0_20210119233138.log.1_0-26-5305
  41. 41. Table Metadata Format
  42. 42. 01 Action Types 02 Hudi Metadata Table 03 Q&A Agenda
  43. 43. No online component - all state is read and updated from HDFS State saved as “actions” files within a directory (.hoodie) .hoodie/ 20210122133804.commit 20210122140222.clean hoodie.properties 20210122140222.commit when-action-happened what-action-was-taken Sorted list of all actions is called “HUDI Timeline” Basics
  44. 44. Action Types 20210102102345.commit COW Table: Insert or Updates MOR Table: data-files merged with delta-files 20210102102345.rollback Older commits rolled-back (data deleted) 20210102102345.delta-commit MOR Only: Insert or Updates 20210102102345.replace data-files clustered and re-written 20210102102345.clean Older versions of data-files and delta-files deleted 20210102102345.restore Restore dataset to a previous point in time
  45. 45. 1. Mark the intention to perform an action a. Create the file .hoodie/20210102102345.commit.requested 2. Pre-processing and validations (e.g. what files to update / delete) 3. Mark the starting of action a. Create the file .hoodie/20210102102345.commit.inflight b. Add the action plan to the file so we can rollback changes due to failures 4. Perform the action as per plan 5. Mark the end of the action a. Create the file .hoodie/20210102102345.commit How is an action performed ?
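These marker files can be observed directly on storage. A small sketch (assuming a readable basePath; the instant time is the hypothetical example from the slide) using the plain Hadoop FileSystem API:

    // Sketch: listing the .hoodie directory to see requested -> inflight -> completed markers.
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    val timelinePath = new Path(basePath, ".hoodie")
    val fs = timelinePath.getFileSystem(new Configuration())

    fs.listStatus(timelinePath)
      .map(_.getPath.getName)
      .filter(_.startsWith("20210102102345")) // hypothetical instant from the slide
      .sorted
      .foreach(println)
    // Expected progression: 20210102102345.commit.requested, .commit.inflight, .commit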
  46. 46. Challenges Before each operation HUDI needs to find the state of the dataset: list all action files from the .hoodie directory, read one or more of the action files, and list one or more partitions to get the latest data-files and log-files. HUDI operations therefore lead to a large number of ListStatus calls to the NameNode, and ListStatus is slow and resource intensive for the NameNode.
  47. 47. ● ListStatus data is cached in an internal table (Metadata Table) ● What is cached? ○ List of all partitions ○ List of files in each partition ○ Minimal required information on each file - file size ● Internal table is a HUDI MOR Table ○ Updated when any operation changes files (commit, clean, etc) ○ Updates written to log-files and compacted periodically ● Very fast lookups from the Metadata Table HUDI File Listing Enhancements (0.7 release)
  48. 48. ● Reduced load on NameNode ● Reduce time for operations which list partitions ● Metadata Table is a HUDI MOR Table (.hoodie/metadata) ○ Can be queried like a regular HUDI Table ○ Helps in debugging issues Benefits
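A hedged sketch of how the feature is switched on; hoodie.metadata.enable is the documented 0.7 config, and reading .hoodie/metadata directly is possible per the slide above, though the exact read options may vary by release:

    // Hedged sketch: enabling the internal metadata table so file listings avoid NameNode scans.
    inputDF.write.
      format("org.apache.hudi").
      options(hudiWriteOpts).
      option("hoodie.metadata.enable", "true").
      mode("append").
      save(basePath)

    // The metadata table is itself a Hudi MOR table under .hoodie/metadata and can be queried:
    val metadataDF = spark.read.format("org.apache.hudi").load(basePath + "/.hoodie/metadata")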
  49. 49. Indexing
  50. 50. Recap Writing data Bulk_insert, insert, upsert, insert_overwrite Querying data Hive, Spark, Presto etc Copy-On-Write: Columnar Format Simple & ideal for analytics use-cases (limited updates) Merge-On-Read: Write ahead log Complex, but reduces write amplification with updates Provides 2 views : Read-Optimized, Realtime Timeline Metadata Track information about actions taken on table Incremental Processing Efficiently propagate changes across tables
  51. 51. Table Service: Indexing
  52. 52. Concurrency Control MVCC Multi Version Concurrency Control File versioning Writes create a newer version, while concurrent readers access an older version. For simplicity, we will refer to hudi files as (fileId)-(timestamp) ● f1-t1, f1-t2 ● f2-t1, f2-t2 Lock Free Read and write transactions are isolated without any need for locking. Use timestamp to determine state of data to read. Data Lake Feature Guarantees Atomic multi-row commits Snapshot isolation Time travel
  53. 53. How is index used ? Key1 ... Key2 ... Key3 ... Key4 ... upsert Tag Location Using Index And Timeline Key1 partition, f1 ... Key2 partition, f2 ... Key3 partition, f1 ... Key4 partition, f2 ... Batch at t2 with index metadata Key1, Key3 Key2, Key4 f1-t2 (data/log) f2-t2 (data/log) Key1 C1 .. Key3 C1 .. Key2 C1 .. Key4 C1 .. Batch at t2 f1-t1 f2-t1
  54. 54. Indexing Scope Global index Enforce uniqueness of keys across all partitions of a table Maintain mapping for record_key to (partition, fileId) Update/delete cost grows with size of the table O(size of table) Local index Enforce this constraint only within a specific partition. Writer to provide the same consistent partition path for a given record key Maintain mapping (partition, record_key) -> (fileId) Update/delete cost O(number of records updated/deleted)
  55. 55. Types of Indexes Bloom Index (default) Employs bloom filters built out of the record keys, optionally also pruning candidate files using record key ranges. Ideal workload: Late arriving updates Simple Index Performs a lean join of the incoming update/delete records against keys extracted from the table on storage. Ideal workload: Random updates/deletes to a dimension table HBase Index Manages the index mapping in an external Apache HBase table. Ideal workload: Global index Custom Index Users can provide custom index implementation
  56. 56. Indexing Configurations Property: hoodie.index.type Type of index to use. Default is Local Bloom filter (including dynamic bloom filters) Property: hoodie.index.class Full path of user-defined index class and must be a subclass of HoodieIndex class. It will take precedence over the hoodie.index.type configuration if specified Property: hoodie.bloom.index.parallelism Dynamically computed, but may need tuning for some cases for bloom index Property hoodie.simple.index.parallelism Tune parallelism for simple index
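A hedged sketch putting those properties to use at write time; the index type values listed in the comment are the commonly documented ones and availability may differ by release:

    // Hedged sketch: selecting and tuning the index through write options.
    inputDF.write.
      format("org.apache.hudi").
      options(hudiWriteOpts).
      option("hoodie.index.type", "BLOOM").            // e.g. BLOOM, GLOBAL_BLOOM, SIMPLE, GLOBAL_SIMPLE, HBASE
      option("hoodie.bloom.index.parallelism", "200"). // only relevant when a bloom index is in use
      mode("append").
      save(basePath)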
  57. 57. Indexing Limitations Indexing only works on the primary key today; work is in progress to make this available as a secondary index on other columns. Index information is only used by the writer today; using it in the read path would improve query performance. Also planned: move the index info from parquet metadata into the metadata table.
  58. 58. Storage Management
  59. 59. Storage Management Compaction Convert files on disk into read optimized files. Clustering Optimizing data layout, stitching small files Cleaning Remove Hudi data files that are no longer needed. Hudi Rewriter Pruning columns, encrypting columns and other rewriting use-cases Savepoint & Restore Bring table back to a correct/old state Archival Archive Hudi metadata files that are no longer being actively used.
  60. 60. Table Service: Compaction Main motivations behind Merge-On-Read is to reduce data latency when ingesting records Data is stored using a combination of base files and log files Compaction is a process to produce new versions of base files by merging updates Compaction is performed in 2 steps Compaction Scheduling Pluggable Strategies for compaction This is done inline. In this step, Hudi scans the partitions and selects base and log files to be compacted. A compaction plan is finally written to Hudi timeline. Compaction Execution Inline - Perform compaction inline, right after ingestion Asynchronous - A separate process reads the compaction plan and performs compaction of file slices.
  61. 61. K1 T3 .. K3 T3 .. Version at T3 K1 T4 ... Version of Log atT4 Real-time View Real-time View Real-time View Compaction Example Hudi Managed Dataset Version at T1 Key1 .....……... ... Key3 …..……... ... Batch 1 T1 Key1 .………... ... Key3 …..……... ... Batch 2 T2 upsert K1 T2 ... ... Unmerged update K1 T1 .. K3 T1 .. K3 T2 Version of Log at T2 Phantom File Schedule Compaction Commit Timeline Key1 . .…… T4 Batch 3 T3 Unmerged update done T2 Commit 2 done T4 Commit 4 done T3 Compact done T1 Commit 1 Read Optimized View Read Optimized View PARQUET T3 Compaction inflight T4 Commit 4 inflight HUDI
  62. 62. Code Examples: Inline compaction df.write.format("org.apache.hudi"). option(PRECOMBINE_FIELD_OPT_KEY, "ts"). option(RECORDKEY_FIELD_OPT_KEY, "uuid"). option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). option(TABLE_NAME, tableName). option("hoodie.parquet.small.file.limit", "0"). option("hoodie.compact.inline", "true"). option("hoodie.compact.inline.max.delta.commits", "4"). option("hoodie.compaction.strategy", "org.apache.hudi.io.compact.strategy.LogFileSizeBasedCompactionStrategy"). mode(Append). save(basePath);
  63. 63. Table Service: Clustering Ingestion and query engines are optimized for different things FileSize Ingestion prefers small files to improve freshness. Small files => increase in parallelism Query engines (and HDFS) perform poorly when there are a lot of small files Data locality Ingestion typically groups data based on arrival time Queries perform better when data frequently queried together is co-located Clustering is a new framework introduced in Hudi 0.7 Improve query performance without compromising on ingestion speed Run inline or in an async pipeline Pluggable strategy to rewrite data Provides two in-built strategies to 1) ‘stitch’ files and 2) ‘sort’ data on a list of columns Superset of Compaction. Follows MVCC like other Hudi operations Provides snapshot isolation, time travel etc. Update index/metadata as needed Disadvantage: Incurs additional rewrite cost
  64. 64. Clustering: efficiency gain Before clustering: 20M rows scanned After clustering: 100K rows scanned
  65. 65. Code Examples: Inline clustering df.write.format("org.apache.hudi"). option(PRECOMBINE_FIELD_OPT_KEY, "ts"). option(RECORDKEY_FIELD_OPT_KEY, "uuid"). option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). option(TABLE_NAME, tableName). option("hoodie.parquet.small.file.limit", "0"). option("hoodie.clustering.inline", "true"). option("hoodie.clustering.inline.max.commits", "4"). option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824"). option("hoodie.clustering.plan.strategy.small.file.limit", "629145600"). option("hoodie.clustering.plan.strategy.sort.columns", ""). //optional, if sorting is needed mode(Append). save(basePath);
  66. 66. Table Service: Cleaning Delete older data files that are no longer needed Different configurable policies supported. Cleaner runs inline after every commit. Criteria#1: Time to detect (TTD) data quality issues Provide sufficient time to detect data quality issues. Multiple versions of data are stored. Earlier versions can be used as a backup. Table can be rolled back to an earlier version as long as the cleaner has not deleted those files. Criteria#2: Long running queries Provide sufficient time for your long running jobs to finish running. Otherwise, the cleaner could delete a file that is being read by the job and will fail the job. Criteria#3: Incremental queries If you are using the incremental pull feature, then ensure you configure the cleaner to retain a sufficient number of recent commits to rewind.
  67. 67. Cleaning Policies Partition structure f1_t1.parquet, f2_t1.parquet, f3_t1.parquet f1_t2.parquet, f2_t2.parquet, f4_t2.parquet f1_t3.parquet, f3_t3.parquet Keep N latest versions N=2, retain 2 versions for each file group At t3: Only f1_t1 can be removed Keep N latest commits N=2, retain all data for t2, t3 commits At t3: f1_t1, f2_t1 can be removed. f3_t1 cannot be removed
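The two policies above map onto cleaner configs. A hedged sketch, with key names per the documented hoodie.cleaner.* namespace:

    // Hedged sketch: configuring the cleaner at write time.
    // Keep N latest commits:
    df.write.format("org.apache.hudi").
      options(hudiWriteOpts).
      option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS").
      option("hoodie.cleaner.commits.retained", "10").
      mode("append").
      save(basePath)

    // Alternatively, keep N latest versions per file group:
    //   option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS")
    //   option("hoodie.cleaner.fileversions.retained", "2")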
  68. 68. Table Service: Archiving Delete older metadata State saved as “actions” files within a directory (.hoodie) .hoodie/20210122133804.commit .hoodie/20210122140222.clean .hoodie/hoodie.properties Over time, many small files are created Moves older metadata to commits.archived sequence file Easy Configurations Set “hoodie.keep.min.commits” and “hoodie.keep.max.commits” Incremental queries only work on ‘active’ timeline
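A hedged sketch of the archival knobs named on the slide:

    // Hedged sketch: bounding the active timeline; max must be greater than min,
    // and both should exceed the cleaner's retained commits.
    df.write.format("org.apache.hudi").
      options(hudiWriteOpts).
      option("hoodie.keep.min.commits", "20").
      option("hoodie.keep.max.commits", "30").
      mode("append").
      save(basePath)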
  69. 69. Table Service: Savepoints & Restore Some common questions in production systems What if a bug resulted in incorrect data pushed to the ingestion system ? What if an upstream system incorrectly marked column values as null ? Hudi addresses these concerns for you Ability to restore the table to the last known correct time Restore to well known state Logically “rollback” multiple commits. Savepoints - checkpoints at different instants of time Pro - optimizes number of versions needed to store and minimizes disk space Con - Not available for Merge_On_Read table types
  70. 70. Tools & Capabilities
  71. 71. 01 Ingestion frameworks 02 Hudi CLI 03 < 5 mins ingestion latency 04 Onboarding existing tables to Hudi 05 Testing Infra 06 Observability 07 Q&A Agenda
  72. 72. Hudi offers standalone utilities to connect with data sources, inspect a dataset, and register a table with HMS. Ingestion framework Hudi Utilities Source DFS compatible stores (HDFS, AWS, GCP etc) Data Lake Ingest Data DeltaStreamer SparkDataSource Query Engines Register Table with HMS: HiveSyncTool Inspect table metadata: Hudi CLI Execution framework *source = {Kafka, CSV, DFS, Hive table, Hudi table etc} *Readers = {Hive, Presto, Spark SQL, Impala, AWS Athena}
  73. 73. Input formats Input data could be available as a HDFS file, Kafka source or as an input stream. Run Exactly Once Performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Continuous Mode Runs an infinite loop with each round performing one ingestion round as described in Run Once Mode. The frequency of data ingestion can be controlled by the configuration Record Types Support json, avro or a custom record type for the incoming data Checkpoint, rollback and recovery Automatically takes care of checkpointing of input data, rollback and recovery. Avro Schemas Leverage Avro schemas from DFS or a schema registry service. DeltaStreamer
  74. 74. HoodieDeltaStreamer Example More info at https://hudi.apache.org/docs/writing_data.html#deltastreamer spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --props file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider --source-class org.apache.hudi.utilities.sources.AvroKafkaSource --source-ordering-field impresssiontime --target-base-path file:///tmp/hudi-deltastreamer-op --target-table uber.impressions --op BULK_INSERT HoodieDeltaStreamer is used to ingest from a Kafka source into a Hudi table Details on how to use the tool are available here
  75. 75. Spark Datasource API The hudi-spark module offers the DataSource API to write (and read) a Spark DataFrame into a Hudi table. Structured Spark Streaming Hudi also supports spark streaming to ingest data from a streaming source to a Hudi table. Flink Streaming Hudi added support for the Flink execution engine, in the latest 0.7.0 release. Execution Engines inputDF.write() .format("org.apache.hudi") .options(clientOpts) // any of the Hudi client opts can be passed in as well .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition") .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp") .option(HoodieWriteConfig.TABLE_NAME, tableName) .mode(SaveMode.Append) .save(basePath); More info at https://hudi.apache.org/docs/writing_data.html#deltastreamer
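For the Structured Streaming path mentioned above, a hedged sketch; option names mirror the batch datasource, while streamingDF and the checkpoint location are assumptions for illustration:

    // Hedged sketch: continuously writing a streaming DataFrame into a Hudi table.
    val query = streamingDF.writeStream.
      format("org.apache.hudi").
      options(hudiWriteOpts).
      option("checkpointLocation", basePath + "/.checkpoints"). // Spark streaming checkpoint dir (assumed path)
      outputMode("append").
      start(basePath)

    query.awaitTermination()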
  76. 76. Hudi CLI Create table Connect with table Inspect commit metadata File System View Inspect Archived Commits Clean, Rollback commits More info at https://hudi.apache.org/docs/deployment.html
  77. 77. Hive Registration Tools Hive Sync tools enables syncing of the table’s latest schema and updated partitions to the Hive metastore. cd hudi-hive ./run_sync_tool .sh --jdbc-url jdbc:hive2: //hiveserver:10000 --user hive --pass hive --partitioned-by partition --base-path <basePath> --database default --table <tableName> Hive Ecosystem Hive Meta Store (HMS) HiveSyncTool registers the Hudi table and updates on schema, partition changes Query Planner Query Executor HoodieInputFormat exposes the Hudi datafiles present on the DFS. More info at https://hudi.apache.org/docs/writing_data.html#syncing-to-hive Hudi Dataset Presto Spark SQL Spark Data Source Hudi HoodieInputFormat is integrated with Datasource API, without any dependency on HMS Query Engines
  78. 78. Write Amplification COW tables receiving many updates have a large write amplification, as the files are rewritten as new snapshots, even if a single record in the data file were to change. Amortized rewrites MOR reduces this write amplification by writing the updates to a log file and periodically merging the log files with base data files, thus amortizing the cost of rewriting the data file at the time of compaction. Read Optimized vs Real Time view Data freshness experienced by the reader is affected by whether the read requests are served from compacted base files or by merging the base files with log files in real time, just before serving the reads. Small vs Large Data files Creating smaller data files (< 50MB) can be done in under a few minutes. However, creating lots of small files would put pressure on the NameNode during the HDFS listing (and other metadata) operations. Creating larger data files (1GB) takes longer to write to disk (10+ mins). However, maintaining larger files reduces the NameNode pressure. Achieving ingestion latency of < 5 mins With clustering and compaction
  79. 79. Achieving ingestion latency of < 5 mins Managing write amplification with Clustering INSERTS UPDATES DELETES Ingestion Commit C10 Partition P1 F5_W1_C5.parquet [F1_C1, F2_C2, F3_C2, F4_C5] Partition P2 F12_W1_C5.parquet [F10_C1, F11_C3 ...] Commit C10 Commit C9 Commit C8 Commit C7 Commit C6 Commit C5 Commit C4 Commit C3 Commit C2 Commit C1 Commit C0 Background clustering process periodically rewrites the small base files created by ingestion process into larger base files, amortizing the cost to reduce pressure on the nameNode. Clustered large 1GB files Clustering/ compaction commit Ingestion process writes to Small < 50MB base files. Small base files help in managing the write amplification and the latency. Query on real-time table at commit C10 Contents of: 1. All base files are available to the readers Freshness is updated at every ingestion commit. F6_W1_C6.parquet F6_W1_C6.parquet F11_W1_C10.parquet F6_W1_C6.parquet F13_W1_C7.parquet
  80. 80. INSERTS UPDATES DELETES Ingestion Commit C10 Partition P1 F1_W1_C5.parquet F1_W1_C10.log Partition P2 F2_W1_C2.parquet Commit C10 Commit C9 Commit C8 Commit C7 Commit C6 Commit C5 Commit C4 Commit C3 Commit C2 Commit C1 Commit C0 F1_W1_C7.log F2_W1_C6.log Columnar basefile Compaction commit Row based append log Updates and deletes are written to a row based append log, by the ingestion process. Later the async compaction process merges the log files to the base fiile. Query on read optimized table at commit C10 Query on real time table at commit C10 Contents of: 1. Base file F1_W1_C5.parquet 2. Base file F2_W1_C2.parquet Contents of: 1. Base file F1_W1_C5.parquet is merged with append log files F1_W1_C7.log and F1_W1_c10.log. 2. Base file F2_W1_C2.parquet is merged with append log file F2_W1_C6.log. Timeline Achieving ingestion latency of < 5 mins Managing write amplification with merge-on-read
  81. 81. Legacy data When legacy data is available in parquet format and the table needs to be converted to a Hudi table, all the parquet files would have to be rewritten as Hudi data files. Fast Migration Process With Hudi Fast Migration, Hudi will keep the legacy data files (in parquet format) and generate a skeleton file containing Hudi specific metadata, with a special “BOOTSTRAP_TIMESTAMP”. Querying legacy partitions When executing a query involving legacy partitions, Hudi will return the legacy data file to the query engines. (Query engines can handle serving the query using non-hudi regular parquet/data files). Onboarding your table to Hudi val bootstrapDF = spark.emptyDataFrame bootstrapDF.write .format("hudi") .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test") .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL) .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key") .option(..).. .mode(SaveMode.Overwrite) .save(basePath)
  82. 82. Hudi unit-testing framework Hudi offers a unit testing framework that allows developers to write unit tests that mimic real-world scenarios and run these tests every time the code is recompiled. This enables increased developer velocity and more robust code changes. Hudi-test-suite Hudi-test-suite makes use of the hudi utilities to create an end-to-end testing framework to simulate complex workloads, schema evolution scenarios and version compatibility tests. Hudi A/B testing Hudi offers A/B style testing to ensure that data produced with a new change/build matches the exact same workload in production. With “Hudi union file-system”, a production Hudi table can be used as a read-only reference file system. Any commit from the production hudi-timeline can be replayed, using a different Hudi build, to compare the results of the “replay run” against the “production run”. Hudi Testing infrastructure
  83. 83. Hudi Test Suite Build Complex Workloads Define a complex workload that reflects production setups. Test Version Compatibility Pause the workload, upgrade dependency version, then resume the workload. Cassandra DBEvents MySql DBEvents Schemaless DBEvents User Application Heat pipe Unified Ingestion pipeline source /sink specific DAGs HDFS Hive/Presto/ Spark SQL Evolve Workloads Simulate changing elements such as schema changes. Simulate Ingestion Mock Data Generator Launch Queries
  84. 84. Production workload as read-only file system Hudi A/B testing INSERTS UPDATES DELETES Ingestion Commit C10 Partition P1 Partition P2 Commit C10 Commit C9 Commit C8 Commit C7 Commit C6 Commit C5 Commit C4 Commit C3 Commit C2 Commit C1 Commit C0 F6_W1_C6.parquet F6_W1_C6.parquet F11_W1_C10.parquet F6_W1_C6.parquet F13_W1_C7.parquet F6_W1_C6.parquet F5_W1_C5.parquet F6_W1_C6.parquet F12_W1_C5.parquet Write enabled test file system Partner write enabled Partition P1 F11_W1_C10.parquet Commit C10 Ensure commit produced by the test matches original commit metadata Ensure data files produced by the “commit replay” test matches with the original base/log data files in production.
  85. 85. Hudi Observability Insights on a specific ingestion run Collect key insights around storage efficiency, ingestion performance and surface bottlenecks at various stages. These insights can be used to automate fine-tuning of ingestion jobs by the feedback based tuning jobs. Identifying outliers At large scale, across thousands of tables, when a bad node/executor is involved, identifying the bad actor takes time, requires coordination across teams and involves lots of our production on-call resources. By reporting normalized stats, that are independent of the job size or workload characteristics, bad executor/nodes can be surfaced as outliers that warrant a closer inspection. Insights on Parallelism When managing thousands of Hudi tables in the data-lake, ability to visualize the parallelism applied at each stage of the job, would enable insights into the bottlenecks and allow the job to be fine-tuned at granular level.
  86. 86. On-Going & Future Work
  87. 87. ➔ Concurrent Writers [RFC-22] & [PR-2374] ◆ Multiple Writers to Hudi tables with file level concurrency control ➔ Hudi Observability [RFC-23] ◆ Collect metrics such as Physical vs Logical, Users, Stage Skews ◆ Use to feedback jobs for auto-tuning ➔ Point index [RFC-08] ◆ Target usage for primary key indexes, eg. B+ Tree ➔ ORC support [RFC] ◆ Support for ORC file format ➔ Range Index [RFC-15] ◆ Target usage for column ranges and pruning files/row groups (secondary/column indexes) ➔ Enhance Hudi on Flink [RFC-24] ◆ Full feature support for Hudi on Flink version 1.11+ ◆ First class support for Flink ➔ Spark-SQL extensions [RFC-25] ◆ DML/DDL operations such as create, insert, merge etc ◆ Spark DatasourceV2 (Spark 3+) On-Going Work
  88. 88. ➔ Native Schema Evolution ◆ Support remove and rename columns ➔ Apache Calcite SQL integration ◆ DML/DDL support for other engines besides Spark Future Work (Upcoming RFCs)
  89. 89. Thank you dev@hudi.apache.org @apachehudi https://hudi.apache.org
