SF Big Analytics 20190612: Building highly efficient data lakes using Apache Hudi

Building highly efficient data lakes using Apache Hudi (Incubating)

Even with the exponential growth in data volumes, ingesting, storing, and managing big data remains unstandardized and inefficient. Data lakes are a common architectural pattern for organizing big data and democratizing access across the organization. In this talk, we will discuss different aspects of building data lake architectures, pinpointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how the upserts and incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages growth and file sizes of the resulting data lake using purely open-source file formats, while also providing optimized query performance and file-system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.
Speaker: Vinoth Chandar (Uber)
Vinoth is a technical lead on the Uber Data Infrastructure team.


  1. Building highly efficient data lakes using Apache Hudi (Incubating)
     Vinoth Chandar | Sr. Staff Engineer, Uber
     Apache®, Apache Hudi, and the Hudi logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
  2. Data Architectures: Lakes, Marts, Silos
  3. Simple… Right?
     [Diagram: Database / Events / Service Mesh / External Sources (real-time/OLTP) -> Extract-Transform-Load -> Tables on DFS/Cloud Storage -> Queries (analytics/OLAP)]
  4. OK… maybe not that simple…
     [Diagram: Database / Events / Service Mesh / External Sources (real-time/OLTP) -> Ingestion (Extract-Load) -> Data Lake on DFS/Cloud Storage with Raw Tables and Derived Tables, plus Schemas and Data Audit -> Queries (analytics/OLAP)]
  5. Data Lake Implementation: it’s actually hard…
  6. Requirement #1: Incremental Database Ingestion
     High-value data
     - User information in RDBMS
     - Trip, transaction logs in NoSQL
     Replicate CRUD operations
     - Strict ordering guarantees
     - Zero data loss
     Bulk loads don’t scale
     - Adds more load to the database
     - Wasteful re-writing of data
     [Diagram: MySQL users table (userID int, country string, last_mod long, …) replicating inserts, updates, and deletes into the data lake]
  7. Requirement #2: De-Duping Log Events
     High-scale time-series data
     - Several billions/day; few millions/sec
     - Heavily aggregated
     Cause of duplicates
     - Client retries/failures/network errors
     - At-least-once data pipes
     Overcounting problems
     - More impressions => more $
     - Low-fidelity data
     [Diagram: impression events (event_id string, datestr string, time long, …) produced and replicated into the data lake without duplicates]
  8. Requirement #3: Transactional Writes
     Atomic publish of data
     - Ingestion can fail midway
     - Roll back bad data
     Consistency guarantees
     - No partial data exposed
     - Repeatable queries
     Snapshot isolation
     - Time-travel queries
     - Concurrent writers/readers
     Strong durability
     - No data loss
  9. Requirement #4: Unique Key Constraints
     Data model parity
     - Enforce upstream primary keys
     - 1-1 mapping with the source table
     - Great data quality!
     Transaction processing
     - e.g. settling orders, fraud detection
     - Lakes are well-suited for large-scale processes
  10. Requirement #5: Faster Derived Data
      Multi-stage ETL DAGs
      - Very common in batch analytics
      - Large amounts of data
      Derived/ETL tables
      - Keep fresh with new/changed raw data
      - Star schema/warehousing
      Scaling challenges
      - Intelligent recomputations
      - Window-based joins
      [Diagram: raw_trips raw table (id, datestr, currency, fare, …) transformed via standardize_fare(row) into the std_trips derived table (id, datestr, std_fare, …)]
  11. Requirement #6: File Management
      Small files = big problem
      - Slow queries
      - Stress filesystem metadata
      Big files = large delays
      - Writing a 2GB Parquet file => ~5-10 mins
      File stitching?
      - Band-aid for a bullet wound
      - Consistency? Standardization?
  12. Requirement #7: Scalable DFS/Storage RPCs
      Ingestion/queries all list the DFS
      - List folders/files, take action
      - Single-threaded vs parallel
      Subtle gotchas/differences
      - Cloud storage => no append()
      - S3 => eventual consistency
      - S3 => rename() = copy()
      - Large directory listings
      - HDFS NameNode bottlenecks
  13. Requirement #8: Incremental Copy to Data Marts
      Data marts
      - Specialized, often MPP OLAP databases
      - e.g. Redshift, Vertica
      Online serving
      - Sync ML features to databases
      - Throttle the syncing rate
      Need to sync Lake => Mart
      - Full data refresh is often very expensive
      - Need for incremental egress
  14. Requirement #9: Legal Requirements / Data Deletions
      Strict rules on data retention
      - Delete records
      - Correct data
      - Raw + derived tables
      Need efficient delete()
      - “Needle in a haystack”
      - Indexed on write (point-ish lookup)
      - Still optimized for scans
      - Propagate deleted records downstream
  15. Requirement #10: Late Data Handling
      Data often arrives late
      - Minutes, hours, even days
      - e.g. credit card txn settlement
      Not implicitly complete
      - Can lead to large data quality issues
      - Trigger recomputation of derived tables
      Data arrival tracking
      - First-class, audit log
      - Flexible, rewindable windowing
  16. Apache Hudi: At a Glance
  17. Apache Hudi (Incubating): Overview
  18. Apache Hudi (Incubating): Storage
      ● Snapshot isolation between writers & queries
      ● upsert() support with pluggable indexes
      ● Atomically publish data with rollback support
      ● Savepoints for data recovery
      ● Manages file sizes and layout using statistics
      ● Async compaction of new & old data
      ● Timeline metadata to track lineage
  19. Apache Hudi (Incubating): Queries/Views of Data
      ● Three logical views on a single physical dataset
      ● Read Optimized View
        ○ Provides excellent query performance
        ○ Replaces plain Apache Parquet tables
      ● Incremental View
        ○ Change stream to feed downstream jobs/ETLs
      ● Near-Real-time Table
        ○ Provides queries on real-time data
        ○ Combination of Apache Parquet & Apache Avro data
      [Diagram: the read-optimized and real-time views plotted on a cost vs. latency trade-off]
  20. Hudi: Upserts + Incremental Changes (incrementalize batch jobs)
      [Diagram: incoming changes flow into a Hudi dataset via upsert; outgoing changes flow to consumers via Hudi incremental pull]
      - upsert(RDD<Record>): updates records if they are already present, or inserts them into their corresponding partitions
      - RDD<Record> pullDelta(startTs, endTs): gets all the records that changed (were updated or inserted) between the start and end times; the delta can span any number of partitions
  21. Apache Hudi @ Uber: foundation for the vast data lake
      - >1 trillion records/day
      - 10s of PB across the entire data lake
      - 1000s of pipelines/tables
  22. Apache Hudi Data Lake: Meeting the Requirements
  23. Data Lake built on Apache Hudi
      [Diagram: same architecture as slide 4, with the ingestion step performing upsert()/insert() into raw tables, and derived tables built via incremental pull]
  24. #1: upsert() database changelogs
      Step 1: Extract new changes to the users table in MySQL as Avro data files on DFS, (or) use a data integration tool of choice to feed db changelogs to Kafka/an event queue.

      // Command to extract incrementals using sqoop
      bin/sqoop import \
        -Dmapreduce.job.user.classpath.first=true \
        --connect jdbc:mysql://localhost/users \
        --username root --password ******* \
        --table users \
        --as-avrodatafile \
        --target-dir s3:///tmp/sqoop/import-1/users

      Step 2: Use your favorite datasource to read the extracted data and directly "upsert" the users table on DFS/Hive, (or) use the Hudi DeltaStreamer tool.

      // Spark Datasource
      import com.uber.hoodie.DataSourceWriteOptions._
      // Use Spark datasource to read avro
      Dataset<Row> inputDataset = spark.read.avro("s3://tmp/sqoop/import-1/users/*");
      // save it as a Hudi dataset
      inputDataset.write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
        .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
        .option(PARTITIONPATH_FIELD_OPT_KEY(), "country")
        .option(PRECOMBINE_FIELD_OPT_KEY(), "last_mod")
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs");
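      As a quick sanity check (a minimal sketch, not from the deck): the freshly upserted table can be read back through the same datasource. The read-optimized view is the default, and the "/*/*" partition glob assumes a single partition level (country).

      // Read the upserted users table back via the Hudi Spark datasource (read-optimized view).
      val usersDF = spark.read.format("com.uber.hoodie").load("/path/on/dfs/*/*")
      usersDF.createOrReplaceTempView("hoodie_users")
      spark.sql("select country, count(*) from hoodie_users group by country").show()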
  25. #2: Filter out duplicate events

      // Deltastreamer command to ingest kafka events, dedupe, ingest
      spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
        /path/to/hoodie-utilities-bundle-*.jar \
        --props s3://path/to/kafka-source.properties \
        --schemaprovider-class com.uber.hoodie.utilities.schema.SchemaRegistryProvider \
        --source-class com.uber.hoodie.utilities.sources.AvroKafkaSource \
        --source-ordering-field time \
        --target-base-path s3:///hoodie-deltastreamer/impressions \
        --target-table uber.impressions \
        --op BULK_INSERT \
        --filter-dupes

      // kafka-source.properties
      include=base.properties
      # Key fields, for kafka example
      hoodie.datasource.write.recordkey.field=id
      hoodie.datasource.write.partitionpath.field=datestr
      # schema provider configs
      hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/impressions-value/versions/latest
      # Kafka Source
      hoodie.deltastreamer.source.kafka.topic=impressions
      # Kafka props
      metadata.broker.list=localhost:9092
      auto.offset.reset=smallest
      schema.registry.url=http://localhost:8081
  26. #3: Timeline consistency
      Atomic multi-row commits
      - Mask partial failures using the timeline
      - Rollback/savepoint support
      Timeline
      - Special .hoodie folder
      - Actions are instantaneous
      MVCC-based isolation
      - Between queries/ingestion
      - Between ingestion/compaction
      Future
      - Unlimited timeline lookback
  27. #4: Keyed update/insert() operations
      Ingested record tagging
      - Merge updates
      - Log inserts
      - HoodieRecordPayload interface to support complex merges
      Pluggable indexing (see the sketch below)
      - Built-in: Bloom/range based, HBase
      - Scales with long-term data growth
      - Handles data skews
      Future
      - Support via SQL
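      A hedged sketch of picking the index, reusing the users example from slide 24. The hoodie.index.type key and its BLOOM/HBASE values are assumptions to verify against HoodieIndexConfig in the Hudi version you run:

      // Choose the index implementation used to tag incoming records on write.
      inputDataset.write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
        .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
        .option(PARTITIONPATH_FIELD_OPT_KEY(), "country")
        .option("hoodie.index.type", "BLOOM") // assumed key; "HBASE" for very large or skewed tables
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs")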
  28. #5: Incremental ETL/Data Pipelines
      Bring streaming APIs to the data lake
      Incrementally pull
      - Avoid recomputes!
      - Orders of magnitude faster
      Transform + upsert
      - Avoid rewriting all data
      Future
      - Incremental pull on logs
      - Watermark APIs

      // Spark Datasource
      import com.uber.hoodie.{DataSourceWriteOptions, DataSourceReadOptions}._
      // Use Spark datasource to read the incremental view
      Dataset<Row> hoodieIncViewDF = spark.read().format("com.uber.hoodie")
        .option(VIEW_TYPE_OPT_KEY(), VIEW_TYPE_INCREMENTAL_OPT_VAL())
        .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), commitInstantFor8AM)
        .load("s3://tables/raw_trips");
      Dataset<Row> stdDF = standardize_fare(hoodieIncViewDF);
      // save it as a Hudi dataset
      stdDF.write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.std_trips")
        .option(RECORDKEY_FIELD_OPT_KEY(), "id")
        .option(PARTITIONPATH_FIELD_OPT_KEY(), "datestr")
        .option(PRECOMBINE_FIELD_OPT_KEY(), "time")
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs");
  29. #6: File Sizing & Fast Ingestion
      Enforce file size on write (see the sketch below)
      - Pay the cost upfront to keep queries healthy
      - Set hoodie.parquet.max.file.size & hoodie.parquet.small.file.limit
      - See docs for the full list
      Near-real-time log ingest
      - Asynchronously compact & write columnar data
      Future
      - Support for split/collapse
      - Auto-tune compression ratio etc.
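      A minimal sketch (not from the deck) of wiring up the two sizing configs named above, again on the slide-24 users write; the byte values are illustrative assumptions, not recommendations:

      // hoodie.parquet.max.file.size caps the target file size; files smaller than
      // hoodie.parquet.small.file.limit are candidates to receive new inserts.
      inputDataset.write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
        .option("hoodie.parquet.max.file.size", (128 * 1024 * 1024).toString)   // ~128 MB target files
        .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString) // pad files under ~100 MB
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs")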
  30. #7: Optimized Timeline/FileSystem APIs
      Embedded timeline server (see the sketch below)
      - Zero listings from Spark executors
      - Incremental file-system views on the Spark driver
      Consistency guards
      - Mask eventual consistency on S3
      - No data file renames; in-place writing
      - Storage-aware "append" usage
      - Graceful MVCC design to handle various failures
      Future
      - Standalone timeline server
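      Both behaviors are toggled through write configs. The keys below (hoodie.embed.timeline.server, hoodie.consistency.check.enabled) are assumptions based on later Hudi releases; verify them against HoodieWriteConfig for your version:

      // Hedged sketch: enable the embedded timeline server and S3 consistency checks.
      inputDataset.write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
        .option("hoodie.embed.timeline.server", "true")     // assumed key: serve file-system views from the driver
        .option("hoodie.consistency.check.enabled", "true") // assumed key: guard against S3 eventual consistency
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs")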
  31. #8: Data Dispersal out of the Lake
      Incremental pull as the sync mechanism (see the sketch below)
      - Only copy updated ML features
      - Only copy affected data ranges
      Decoupled from ETL writing
      - Shock absorber between lake & mart
      - Enables throttling, retrying, rewinding
      Future
      - Support Lake => Mart in the DeltaStreamer tool
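      One way to realize this, sketched here under assumptions (the feature table path, JDBC endpoint, mart table, and lastSyncedInstant checkpoint are all hypothetical), is an incremental read followed by a plain Spark JDBC append into the mart, using the imports from slide 28:

      // Pull only the rows that changed since the last synced instant, then append them
      // to the data mart over JDBC (Redshift/Vertica driver assumed to be on the classpath).
      val changedDF = spark.read.format("com.uber.hoodie")
        .option(VIEW_TYPE_OPT_KEY(), VIEW_TYPE_INCREMENTAL_OPT_VAL())
        .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), lastSyncedInstant)
        .load("s3://tables/ml_features")
      changedDF.write.format("jdbc")
        .option("url", "jdbc:redshift://mart-host:5439/analytics") // hypothetical endpoint
        .option("dbtable", "ml_features")
        .option("user", martUser)
        .option("password", martPassword)
        .mode(SaveMode.Append)
        .save()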
  32. #9: Efficient/Fast Deletes
      Soft deletes (see the sketch below)
      - upsert(k, null)
      - Propagates seamlessly via incremental pull
      Hard deletes
      - Using EmptyHoodieRecordPayload
      Indexing
      - 7-10x faster than using regular joins
      Future
      - Standardized tooling
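      A hedged sketch of both delete flavors, reusing the users table from slide 24. The nullOutNonKeyColumns helper is hypothetical, and the payload-class option key and EmptyHoodieRecordPayload class path are assumptions to verify against your Hudi version:

      // Soft delete: upsert the same keys with non-key columns nulled out, so the
      // tombstoned rows flow to downstream consumers via incremental pull.
      val softDeletes = nullOutNonKeyColumns(recordsToDelete) // hypothetical helper
      softDeletes.write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
        .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs")

      // Hard delete: upsert just the keys with an empty payload so the records are physically removed.
      recordsToDelete.select("userID", "country", "last_mod").write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
        .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
        .option("hoodie.datasource.write.payload.class",                      // assumed option key
                "com.uber.hoodie.common.model.EmptyHoodieRecordPayload")      // assumed class path
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs")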
  33. #10: Safe Reprocessing
      Identify late data
      - Timeline tracks all write activity
      - e.g. obtain bounds on lateness
      Adjust incremental pull windows (see the sketch below)
      - Still much more efficient than bulk recomputation
      Future
      - Support parrival(data, window) APIs in the TimelineServer
      - Apache Beam support for composing safe, incremental pipelines
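      For example (a sketch, not from the deck), a derived table can be repaired for just the window in which late data landed by bounding the incremental pull on both ends. The instant values are placeholders, and END_INSTANTTIME_OPT_KEY is assumed to exist alongside the begin-instant option shown on slide 28:

      // Re-pull only the commits between two instants on the timeline and upsert the
      // recomputed rows, instead of recomputing the whole derived table.
      val lateWindowDF = spark.read.format("com.uber.hoodie")
        .option(VIEW_TYPE_OPT_KEY(), VIEW_TYPE_INCREMENTAL_OPT_VAL())
        .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), "20190610080000") // placeholder instants
        .option(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY(), "20190611080000")
        .load("s3://tables/raw_trips")
      standardize_fare(lateWindowDF).write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.std_trips")
        .option(RECORDKEY_FIELD_OPT_KEY(), "id")
        .option(PARTITIONPATH_FIELD_OPT_KEY(), "datestr")
        .option(PRECOMBINE_FIELD_OPT_KEY(), "time")
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs")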
  34. Open Source: Roadmap, Community, and the Future
  35. Current Status: where we are at
      ● Committed to an open, vendor-neutral data lake standard
      ● 2+ yrs of OSS community support
      ● First Apache release imminent
      ● EMIS Health, Yields.io + more in production
      ● A bunch of companies trying it out
      ● Production tested on cloud
      ● hudi.apache.org/community.html
  36. 2019 Roadmap: key initiatives
      Bootstrapping tables into Hudi
      - With indexing benefits
      - Convenient tooling
      Standalone timeline server
      - Eliminate fs listings for query planning/ingestion
      - Track column-level statistics for queries
      Smart storage layouts
      - Increase file sizes for older data
      - Re-clustering data for queries
  37. Thank you
      dev@hudi.apache.org | @apachehudi | https://hudi.apache.org
  38. ?
  39. Proprietary and confidential © 2019 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.
