SF Big Analytics 20190612: Building highly efficient data lakes using Apache Hudi

Building highly efficient data lakes using Apache Hudi (Incubating)

Even with the exponential growth in data volumes, ingesting, storing, and managing big data remains unstandardized and inefficient. Data lakes are a common architectural pattern for organizing big data and democratizing access across the organization. In this talk, we will discuss different aspects of building data lake architectures, pinpointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how the upserts and incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages growth and file sizes of the resulting data lake using purely open-source file formats, while also providing optimized query performance and file-system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.
Speaker: Vinoth Chandar (Uber)
Vinoth is a technical lead on the Uber Data Infrastructure team.


  1. Building highly efficient data lakes using Apache Hudi (Incubating)
     Vinoth Chandar | Sr. Staff Engineer, Uber
     Apache®, Apache Hudi, and the Hudi logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
  2. Data Architectures: Lakes, Marts, Silos
  3. Simple… Right?
     [Diagram: Database / Events / Service Mesh / External Sources (real-time/OLTP) -> Extract-Transform-Load -> Tables on DFS/Cloud Storage -> Queries (analytics/OLAP)]
  4. OK… maybe not that simple…
     [Diagram: Database / Events / Service Mesh / External Sources (real-time/OLTP) -> Ingestion (Extract-Load) -> Data Lake on DFS/Cloud Storage with Raw Tables and Derived Tables, plus Schemas and Data Audit -> Queries (analytics/OLAP)]
  5. Data Lake Implementation: it’s actually hard…
  6. Requirement #1: Incremental Database Ingestion
     High-value data
     - User information in RDBMS
     - Trip, transaction logs in NoSQL
     Replicate CRUD operations
     - Strict ordering guarantees
     - Zero data loss
     Bulk loads don’t scale
     - Adds more load to the database
     - Wasteful re-writing of data
     [Diagram: MySQL users table (userID int, country string, last_mod long, …) replicating inserts, updates, and deletes into the data lake]
  7. Requirement #2: De-Duping Log Events
     High-scale time-series data
     - Several billions/day; few millions/sec
     - Heavily aggregated
     Cause of duplicates
     - Client retries/failures/network errors
     - At-least-once data pipes
     Overcounting problems
     - More impressions => more $
     - Low-fidelity data
     [Diagram: impression events (event_id string, datestr string, time long, …) produced and replicated into the data lake without duplicates]
  8. Requirement #3: Transactional Writes
     Atomic publish of data
     - Ingestion can fail midway
     - Roll back bad data
     Consistency guarantees
     - No partial data exposed
     - Repeatable queries
     Snapshot isolation
     - Time-travel queries
     - Concurrent writers/readers
     Strong durability
     - No data loss
  9. Requirement #4: Unique Key Constraints
     Data model parity
     - Enforce upstream primary keys
     - 1-1 mapping with the source table
     - Great data quality!
     Transaction processing
     - e.g. settling orders, fraud detection
     - Lakes are well-suited for large-scale processes
  10. Requirement #5: Faster Derived Data
      Multi-stage ETL DAGs
      - Very common in batch analytics
      - Large amounts of data
      Derived/ETL tables
      - Keep fresh with new/changed raw data
      - Star schema/warehousing
      Scaling challenges
      - Intelligent recomputations
      - Window-based joins
      [Diagram: raw_trips raw table (id, datestr, currency, fare, …) transformed via standardize_fare(row) into the std_trips derived table (id, datestr, std_fare, …)]
  11. Requirement #6: File Management
      Small files = big problem
      - Slow queries
      - Stress filesystem metadata
      Big files = large delays
      - Writing a 2GB Parquet file => ~5-10 mins
      File stitching?
      - Band-aid for a bullet wound
      - Consistency? Standardization?
  12. Requirement #7: Scalable DFS/Storage RPCs
      Ingestion/queries all list the DFS
      - List folders/files, take action
      - Single-threaded vs parallel
      Subtle gotchas/differences
      - Cloud storage => no append()
      - S3 => eventual consistency
      - S3 => rename() = copy()
      - Large directory listings
      - HDFS NameNode bottlenecks
  13. Requirement #8: Incremental Copy to Data Marts
      Data marts
      - Specialized, often MPP OLAP databases
      - e.g. Redshift, Vertica
      Online serving
      - Sync ML features to databases
      - Throttle the syncing rate
      Need to sync Lake => Mart
      - Full data refresh is often very expensive
      - Need for incremental egress
  14. Requirement #9: Legal Requirements / Data Deletions
      Strict rules on data retention
      - Delete records
      - Correct data
      - Raw + derived tables
      Need efficient delete()
      - “Needle in a haystack”
      - Indexed on write (point-ish lookup)
      - Still optimized for scans
      - Propagate deleted records downstream
  15. Requirement #10: Late Data Handling
      Data often arrives late
      - Minutes, hours, even days
      - e.g. credit card txn settlement
      Not implicitly complete
      - Can lead to large data quality issues
      - Trigger recomputation of derived tables
      Data arrival tracking
      - First-class, audit log
      - Flexible, rewindable windowing
  16. Apache Hudi: At a Glance
  17. Apache Hudi (Incubating): Overview
  18. Apache Hudi (Incubating): Storage
      ● Snapshot isolation between writers & queries
      ● upsert() support with pluggable indexes
      ● Atomically publish data with rollback support
      ● Savepoints for data recovery
      ● Manages file sizes and layout using statistics
      ● Async compaction of new & old data
      ● Timeline metadata to track lineage
  19. Apache Hudi (Incubating): Queries/Views of Data
      ● Three logical views on a single physical dataset
      ● Read Optimized View
        ○ Provides excellent query performance
        ○ Replaces plain Apache Parquet tables
      ● Incremental View
        ○ Change stream to feed downstream jobs/ETLs
      ● Near-Real-time Table
        ○ Provides queries on real-time data
        ○ Combination of Apache Parquet & Apache Avro data
      [Diagram: the read-optimized and real-time views plotted on a cost vs. latency trade-off]
  20. Hudi: Upserts + Incremental Changes (incrementalize batch jobs)
      [Diagram: incoming changes flow into a Hudi dataset via upsert; outgoing changes flow to consumers via Hudi incremental pull]
      - upsert(RDD<Record>): updates records if they are already present, or inserts them into their corresponding partitions
      - RDD<Record> pullDelta(startTs, endTs): gets all the records that changed (were updated or inserted) between the start and end times; the delta can span any number of partitions
  21. Apache Hudi @ Uber: foundation for the vast data lake
      - >1 trillion records/day
      - 10s of PB across the entire data lake
      - 1000s of pipelines/tables
  22. Apache Hudi Data Lake: Meeting the Requirements
  23. Data Lake built on Apache Hudi
      [Diagram: same architecture as slide 4, with the ingestion step performing upsert()/insert() into raw tables, and derived tables built via incremental pull]
  24. #1: upsert() database changelogs
      Step 1: Extract new changes to the users table in MySQL as Avro data files on DFS, (or) use a data integration tool of choice to feed db changelogs to Kafka/an event queue.

      // Command to extract incrementals using sqoop
      bin/sqoop import \
        -Dmapreduce.job.user.classpath.first=true \
        --connect jdbc:mysql://localhost/users \
        --username root --password ******* \
        --table users \
        --as-avrodatafile \
        --target-dir s3:///tmp/sqoop/import-1/users

      Step 2: Use your favorite datasource to read the extracted data and directly "upsert" the users table on DFS/Hive, (or) use the Hudi DeltaStreamer tool.

      // Spark Datasource
      import com.uber.hoodie.DataSourceWriteOptions._
      // Use Spark datasource to read avro
      Dataset<Row> inputDataset = spark.read.avro("s3://tmp/sqoop/import-1/users/*");
      // save it as a Hudi dataset
      inputDataset.write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
        .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
        .option(PARTITIONPATH_FIELD_OPT_KEY(), "country")
        .option(PRECOMBINE_FIELD_OPT_KEY(), "last_mod")
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs");
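      As a quick sanity check (a minimal sketch, not from the deck): the freshly upserted table can be read back through the same datasource. The read-optimized view is the default, and the "/*/*" partition glob assumes a single partition level (country).

      // Read the upserted users table back via the Hudi Spark datasource (read-optimized view).
      val usersDF = spark.read.format("com.uber.hoodie").load("/path/on/dfs/*/*")
      usersDF.createOrReplaceTempView("hoodie_users")
      spark.sql("select country, count(*) from hoodie_users group by country").show()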
  25. #2: Filter out duplicate events

      // Deltastreamer command to ingest kafka events, dedupe, ingest
      spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
        /path/to/hoodie-utilities-bundle-*.jar \
        --props s3://path/to/kafka-source.properties \
        --schemaprovider-class com.uber.hoodie.utilities.schema.SchemaRegistryProvider \
        --source-class com.uber.hoodie.utilities.sources.AvroKafkaSource \
        --source-ordering-field time \
        --target-base-path s3:///hoodie-deltastreamer/impressions \
        --target-table uber.impressions \
        --op BULK_INSERT \
        --filter-dupes

      // kafka-source.properties
      include=base.properties
      # Key fields, for kafka example
      hoodie.datasource.write.recordkey.field=id
      hoodie.datasource.write.partitionpath.field=datestr
      # schema provider configs
      hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/impressions-value/versions/latest
      # Kafka Source
      hoodie.deltastreamer.source.kafka.topic=impressions
      # Kafka props
      metadata.broker.list=localhost:9092
      auto.offset.reset=smallest
      schema.registry.url=http://localhost:8081
  26. #3: Timeline consistency
      Atomic multi-row commits
      - Mask partial failures using the timeline
      - Rollback/savepoint support
      Timeline
      - Special .hoodie folder
      - Actions are instantaneous
      MVCC-based isolation
      - Between queries/ingestion
      - Between ingestion/compaction
      Future
      - Unlimited timeline lookback
  27. #4: Keyed update/insert() operations
      Ingested record tagging
      - Merge updates
      - Log inserts
      - HoodieRecordPayload interface to support complex merges
      Pluggable indexing (see the sketch below)
      - Built-in: Bloom/range based, HBase
      - Scales with long-term data growth
      - Handles data skews
      Future
      - Support via SQL
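      A hedged sketch of picking the index, reusing the users example from slide 24. The hoodie.index.type key and its BLOOM/HBASE values are assumptions to verify against HoodieIndexConfig in the Hudi version you run:

      // Choose the index implementation used to tag incoming records on write.
      inputDataset.write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
        .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
        .option(PARTITIONPATH_FIELD_OPT_KEY(), "country")
        .option("hoodie.index.type", "BLOOM") // assumed key; "HBASE" for very large or skewed tables
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs")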
  28. #5: Incremental ETL/Data Pipelines
      Bring streaming APIs to the data lake
      Incrementally pull
      - Avoid recomputes!
      - Orders of magnitude faster
      Transform + upsert
      - Avoid rewriting all data
      Future
      - Incremental pull on logs
      - Watermark APIs

      // Spark Datasource
      import com.uber.hoodie.{DataSourceWriteOptions, DataSourceReadOptions}._
      // Use Spark datasource to read the incremental view
      Dataset<Row> hoodieIncViewDF = spark.read().format("com.uber.hoodie")
        .option(VIEW_TYPE_OPT_KEY(), VIEW_TYPE_INCREMENTAL_OPT_VAL())
        .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), commitInstantFor8AM)
        .load("s3://tables/raw_trips");
      Dataset<Row> stdDF = standardize_fare(hoodieIncViewDF);
      // save it as a Hudi dataset
      stdDF.write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.std_trips")
        .option(RECORDKEY_FIELD_OPT_KEY(), "id")
        .option(PARTITIONPATH_FIELD_OPT_KEY(), "datestr")
        .option(PRECOMBINE_FIELD_OPT_KEY(), "time")
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs");
  29. #6: File Sizing & Fast Ingestion
      Enforce file size on write (see the sketch below)
      - Pay the cost upfront to keep queries healthy
      - Set hoodie.parquet.max.file.size & hoodie.parquet.small.file.limit
      - See docs for the full list
      Near-real-time log ingest
      - Asynchronously compact & write columnar data
      Future
      - Support for split/collapse
      - Auto-tune compression ratio etc.
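      A minimal sketch (not from the deck) of wiring up the two sizing configs named above, again on the slide-24 users write; the byte values are illustrative assumptions, not recommendations:

      // hoodie.parquet.max.file.size caps the target file size; files smaller than
      // hoodie.parquet.small.file.limit are candidates to receive new inserts.
      inputDataset.write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
        .option("hoodie.parquet.max.file.size", (128 * 1024 * 1024).toString)   // ~128 MB target files
        .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString) // pad files under ~100 MB
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs")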
  30. #7: Optimized Timeline/FileSystem APIs
      Embedded timeline server (see the sketch below)
      - Zero listings from Spark executors
      - Incremental file-system views on the Spark driver
      Consistency guards
      - Mask eventual consistency on S3
      - No data file renames; in-place writing
      - Storage-aware "append" usage
      - Graceful MVCC design to handle various failures
      Future
      - Standalone timeline server
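      Both behaviors are toggled through write configs. The keys below (hoodie.embed.timeline.server, hoodie.consistency.check.enabled) are assumptions based on later Hudi releases; verify them against HoodieWriteConfig for your version:

      // Hedged sketch: enable the embedded timeline server and S3 consistency checks.
      inputDataset.write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
        .option("hoodie.embed.timeline.server", "true")     // assumed key: serve file-system views from the driver
        .option("hoodie.consistency.check.enabled", "true") // assumed key: guard against S3 eventual consistency
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs")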
  31. #8: Data Dispersal out of the Lake
      Incremental pull as the sync mechanism (see the sketch below)
      - Only copy updated ML features
      - Only copy affected data ranges
      Decoupled from ETL writing
      - Shock absorber between lake & mart
      - Enables throttling, retrying, rewinding
      Future
      - Support Lake => Mart in the DeltaStreamer tool
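      One way to realize this, sketched here under assumptions (the feature table path, JDBC endpoint, mart table, and lastSyncedInstant checkpoint are all hypothetical), is an incremental read followed by a plain Spark JDBC append into the mart, using the imports from slide 28:

      // Pull only the rows that changed since the last synced instant, then append them
      // to the data mart over JDBC (Redshift/Vertica driver assumed to be on the classpath).
      val changedDF = spark.read.format("com.uber.hoodie")
        .option(VIEW_TYPE_OPT_KEY(), VIEW_TYPE_INCREMENTAL_OPT_VAL())
        .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), lastSyncedInstant)
        .load("s3://tables/ml_features")
      changedDF.write.format("jdbc")
        .option("url", "jdbc:redshift://mart-host:5439/analytics") // hypothetical endpoint
        .option("dbtable", "ml_features")
        .option("user", martUser)
        .option("password", martPassword)
        .mode(SaveMode.Append)
        .save()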
  32. #9: Efficient/Fast Deletes
      Soft deletes (see the sketch below)
      - upsert(k, null)
      - Propagates seamlessly via incremental pull
      Hard deletes
      - Using EmptyHoodieRecordPayload
      Indexing
      - 7-10x faster than using regular joins
      Future
      - Standardized tooling
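      A hedged sketch of both delete flavors, reusing the users table from slide 24. The nullOutNonKeyColumns helper is hypothetical, and the payload-class option key and EmptyHoodieRecordPayload class path are assumptions to verify against your Hudi version:

      // Soft delete: upsert the same keys with non-key columns nulled out, so the
      // tombstoned rows flow to downstream consumers via incremental pull.
      val softDeletes = nullOutNonKeyColumns(recordsToDelete) // hypothetical helper
      softDeletes.write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
        .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs")

      // Hard delete: upsert just the keys with an empty payload so the records are physically removed.
      recordsToDelete.select("userID", "country", "last_mod").write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
        .option(RECORDKEY_FIELD_OPT_KEY(), "userID")
        .option("hoodie.datasource.write.payload.class",                      // assumed option key
                "com.uber.hoodie.common.model.EmptyHoodieRecordPayload")      // assumed class path
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs")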
  33. #10: Safe Reprocessing
      Identify late data
      - Timeline tracks all write activity
      - e.g. obtain bounds on lateness
      Adjust incremental pull windows (see the sketch below)
      - Still much more efficient than bulk recomputation
      Future
      - Support parrival(data, window) APIs in the TimelineServer
      - Apache Beam support for composing safe, incremental pipelines
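      For example (a sketch, not from the deck), a derived table can be repaired for just the window in which late data landed by bounding the incremental pull on both ends. The instant values are placeholders, and END_INSTANTTIME_OPT_KEY is assumed to exist alongside the begin-instant option shown on slide 28:

      // Re-pull only the commits between two instants on the timeline and upsert the
      // recomputed rows, instead of recomputing the whole derived table.
      val lateWindowDF = spark.read.format("com.uber.hoodie")
        .option(VIEW_TYPE_OPT_KEY(), VIEW_TYPE_INCREMENTAL_OPT_VAL())
        .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), "20190610080000") // placeholder instants
        .option(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY(), "20190611080000")
        .load("s3://tables/raw_trips")
      standardize_fare(lateWindowDF).write.format("com.uber.hoodie")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie.std_trips")
        .option(RECORDKEY_FIELD_OPT_KEY(), "id")
        .option(PARTITIONPATH_FIELD_OPT_KEY(), "datestr")
        .option(PRECOMBINE_FIELD_OPT_KEY(), "time")
        .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
        .mode(SaveMode.Append)
        .save("/path/on/dfs")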
  34. Open Source: Roadmap, Community, and the Future
  35. Current Status: where we are at
      ● Committed to an open, vendor-neutral data lake standard
      ● 2+ yrs of OSS community support
      ● First Apache release imminent
      ● EMIS Health, Yields.io + more in production
      ● A bunch of companies trying it out
      ● Production tested on cloud
      ● hudi.apache.org/community.html
  36. 2019 Roadmap: key initiatives
      Bootstrapping tables into Hudi
      - With indexing benefits
      - Convenient tooling
      Standalone timeline server
      - Eliminate fs listings for query planning/ingestion
      - Track column-level statistics for queries
      Smart storage layouts
      - Increase file sizes for older data
      - Re-clustering data for queries
  37. Thank you
      dev@hudi.apache.org | @apachehudi | https://hudi.apache.org
  38. ?
  39. Proprietary and confidential © 2019 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.
