Hoodie - DataEngConf 2017

Hoodie
An Open Source Incremental Processing Framework

Who Am I
Vinoth Chandar
- Founding engineer/architect of the data team at Uber.
- Previously,
  - Lead on LinkedIn’s Voldemort key value store.
  - Oracle Database replication, Stream Processing.
  - HPC & Grid Computing

Agenda
• Data @ Uber
• Motivation
• Concepts
• Deep Dive
• Use-Cases
• Comparisons
• Open Source

Data @ Uber
Quick Recap of how Uber’s data ecosystem has evolved!

Circa 2014
Reliability
- JSON data, breaking pipelines
- Word-of-mouth schema
Scalability
- Kafka7, No Hadoop
- Growing data volumes
- Multi-datacenter data merges
Inefficiencies
- Several hours of data delays
- Bulk data copies stressing OLTP systems
- Single choice of query engine

Re-Architecture
Schemafication
- Avro as data lingua franca
- Schema enforcement at producers
Horizontally Scalable
- Kafka8
- Hadoop (many PBs & 1000s servers)
- Scalable data pipelines
- Multi-DC Aware data flow
Performant
- 1-3 hrs data latency
- Columnar queries via parquet
- Multiple query engines

Data Users
Analytics
- Dashboards
- Federated Querying
- Interactive Analysis
Data Apps
- Machine Learning
- Fraud Detection
- Incentive Spends
Data Warehousing
- Traditional ETL
- Curated data feeds
- Data Lake => Data Mart

Query Engines
Presto
> 100K queries/day
Spark
100s of Apps
Hive
20K Pipelines & Queries

Hoodie : Motivations
Use-cases & business needs that led to the birth of the project

Motivating Use-Case: Late Arriving Updates
[Figure: trips table partitioned by trip start date, with day level partitions (2010-2014, 2015/XX/XX, 2016/XX/XX, 2017/(01-03)/XX, 2017/04/16). New/updated trips arrive as an incremental update every 30 min; partitions are marked as new data, updated data, or unaffected data.]
DB Ingestion: Status Quo
[Figure: a changelog is turned into a snapshot of the trips table (Parquet), which feeds derived tables. Snapshot job: Jan: 6 hr (500 executors), Apr: 8 hr (800 executors), Aug: 10 hr (1000 executors). Also labeled: 12-18+ hr.]

How can we fix this?
Query HBase?
- Bad Fit for scans
- Lack of support for nested data
- Significant operational overhead
Specialized Analytical DBs?
- Joins with other datasets in HDFS
- Not all data will fit into memory
- Lambda architecture & data copies
Don’t support Snapshots, Only Logs
- Logs ultimately need to be compacted anyway
- Merging done inconsistently & inefficiently by users
Data Modelling Tricks?
- Does not change fundamental nature of problem

Pivotal Question
What do we need to solve this directly on top of a
petabyte scale Hadoop Data Lake?

Let’s Go A Decade Back
How did RDBMS-es solve this?
• Update existing row with new value (Transactions)
• Consume a log of changes downstream (Redo log)
• Update again downstream
[Figure: an update is applied to MySQL (Server A); changes are pulled from its redo log, transformed, and applied as an update to MySQL (Server B).]
Important Differences
• Columnar file formats
• Read-heavy analytical workloads
• Petabytes & 1000s of servers

Pivotal Question
What do we need to solve this directly on top of a
petabyte scale Hadoop Data Lake?
Answer: upserts & incrementals

Challenging Status Quo: upserts & incr pull
[Figure: the snapshot path (6 hr (500), 8 hr (800), 10 hr (1000); 12-18+ hr) contrasted with a path based on upserts and incremental pull. Labels: Changelog, Replicated Trip Rows, New/updated trip rows, 8 hr, 1 hr.]

Hoodie : Concepts
Incremental Processing Foundations & why it’s important

Anatomy Of Data Pipelines
Core Operations
• Projections (Easy)
• Filtering (Easy)
• Aggregations (Tricky)
• Window (Tricky)
• Joins (Hard)
Operational Levers (Google DataFlow)
• Latency
• Completeness
• Cost
Typically Pick 2/3
[Figure: Source -> Data Pipeline -> Sink]

It’s A Spectrum
- Very common use-cases tolerating few mins of latency
- 100x more batch pipelines than streaming pipelines

Incremental Processing : What?
Run Mini Batch Pipelines
- Provide higher completeness than streaming pipelines
- By supporting things like multi-table joins seamlessly
In Streaming Fashion
- Provide lower latency than typical batch pipeline
- By only consuming new input & ability to update old results

Incremental Processing : Increased Efficiency
Less IO, On-Demand Resource Allocation

Incremental Processing : Leverage Hadoop SQL
- Good support for joins
- Columnar File Formats
- Cover wide range of use cases - exploratory, interactive

Incremental Processing : Simplify Architecture
- Efficient pipelines on same batch infrastructure
- Consolidation of storage & compute

Incremental Processing : Primitives
Upsert (Primitive #1)
- Modify processed results
- Like state stores in stream processing
Incremental Pull (Primitive #2)
- Log stream of changes, avoid costly scans
- Enable chaining processing in DAG

Introducing: Hoodie
(Hadoop Upserts anD Incrementals)
Storage Abstraction to
- Apply mutations to dataset
- Pull changelog incrementally
Spark Library
- Scales horizontally like any job
- Stores dataset directly on HDFS
Open Source
- https://github.com/uber/hoodie
- https://eng.uber.com/hoodie
[Figure: a changelog is upserted (Spark) into a Hoodie dataset, which is read either as a normal table (Hive/Spark/Presto) or through incremental pull (Hive/Spark/Presto) that yields a changelog.]

Hoodie: Overview
[Figure: the Hoodie WriteClient (Spark) stores and indexes data into a dataset on HDFS made up of an index, data files, and timeline metadata, laid out according to the storage type. Hive queries, Presto queries, and Spark DAGs read the data through views.]

Hoodie: Storage Types & Views
Storage Type (How is Data stored?) | Views (How is Data Read?)
Copy On Write | Read Optimized, LogView
Merge On Read | Read Optimized, RealTime, LogView

Hoodie : Deep Dive
Design & Implementation of incremental processing primitives

Storage: Basic Idea
[Figure: input changelog batches (200 GB, 30 min batch; 10 GB, 5 min batch) are applied to a Hoodie dataset with day partitions 2017/02/15, 2017/02/16, and 2017/02/17, each with an index. File1 appears as File1_v1.parquet, File1_v2.parquet, and File1.avro.log.]
● 1825 Partitions (365 days * 5 yrs)
● 100 GB Partition Size
● 128 MB File Size
● ~800 Files Per Partition
● Skew spread - 0.5 % (single batch)
  New Files - 0.005 % (single batch)
● 7300 Files rewritten
  ~ 8 new Files
● 20 seconds to re-write 1 File (shuffle)
● 100 executors
  10 executors
● 24 minutes to write
  ~2 minutes to write
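
The first column of figures above can be reproduced with simple arithmetic. The sketch below is only a back-of-the-envelope model: the class and method names are made up for illustration, and it assumes every executor rewrites one file at a time. Only the input numbers come from the slide.

```java
// Back-of-the-envelope model of the file rewrite cost quoted on the slide.
// All names are illustrative; only the input numbers come from the slide.
class RewriteCostEstimate {

    // Number of files in one partition, given partition and file sizes in MB.
    static long filesPerPartition(long partitionSizeMb, long fileSizeMb) {
        return partitionSizeMb / fileSizeMb;
    }

    // Files touched by a batch that spreads over the given fraction of all files.
    static long filesTouched(long partitions, long filesPerPartition, double fraction) {
        return Math.round(partitions * filesPerPartition * fraction);
    }

    // Wall-clock seconds, assuming each executor rewrites one file at a time.
    static long writeSeconds(long files, long secondsPerFile, long executors) {
        return files * secondsPerFile / executors;
    }

    public static void main(String[] args) {
        long partitions = 365L * 5;                                      // 1825 partitions
        long perPartition = filesPerPartition(100L * 1024, 128);         // 100 GB / 128 MB
        long rewritten = filesTouched(partitions, perPartition, 0.005);  // 0.5 % skew spread
        long seconds = writeSeconds(rewritten, 20, 100);                 // 20 s per file, 100 executors
        System.out.println(rewritten + " files, ~" + (seconds / 60) + " min");
    }
}
```

With the slide's inputs this gives 800 files per partition, 7300 files rewritten, and 1460 seconds, which is roughly the 24 minutes quoted.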

Index and Storage
Index
- Tag ingested record as update or insert
- Index is immutable (record key to File mapping never changes)
- Pluggable
  - Bloom Filter
  - HBase
Storage
- HDFS or Compatible Filesystem or Cloud Storage
- Block aligned files
- ROFormat (Apache Parquet) & WOFormat (Apache Avro)
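
To make the tagging step concrete, here is a minimal sketch of how a Bloom filter based index could classify an incoming record key as an update or an insert. It is not Hoodie's implementation: the classes, the two-probe filter, and the in-memory key sets are all assumptions chosen to keep the example small.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: none of these classes are part of Hoodie.
class BloomIndexSketch {

    // Tiny Bloom filter: two hash probes into a fixed-size bit set.
    static final class Bloom {
        private static final int SIZE = 1 << 16;
        private final BitSet bits = new BitSet(SIZE);

        private static int probe(String key, int seed) {
            return Math.floorMod(key.hashCode() * 31 + seed * 0x9E3779B9, SIZE);
        }

        void add(String key) {
            bits.set(probe(key, 1));
            bits.set(probe(key, 2));
        }

        boolean mightContain(String key) {
            return bits.get(probe(key, 1)) && bits.get(probe(key, 2));
        }
    }

    // One data file: its ID, the keys it really holds, and a Bloom filter over them.
    static final class DataFile {
        final String fileId;
        final Set<String> keys = new HashSet<>();
        final Bloom bloom = new Bloom();

        DataFile(String fileId, String... recordKeys) {
            this.fileId = fileId;
            for (String k : recordKeys) {
                keys.add(k);
                bloom.add(k);
            }
        }
    }

    // Returns the ID of the file holding the key (an update), or null (an insert).
    static String tag(String recordKey, Iterable<DataFile> files) {
        for (DataFile f : files) {
            // The filter can give false positives, so confirm against the real keys.
            if (f.bloom.mightContain(recordKey) && f.keys.contains(recordKey)) {
                return f.fileId;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<DataFile> files = Arrays.asList(
            new DataFile("file1", "trip-1", "trip-2"),
            new DataFile("file2", "trip-3"));
        System.out.println("trip-3 -> " + tag("trip-3", files)); // existing key: update
        System.out.println("trip-9 -> " + tag("trip-9", files)); // unknown key: insert
    }
}
```

The point of the filter is that most files can be ruled out without reading their keys; only files whose filter says "maybe" need the confirming lookup.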

Concurrency
● Multi-row atomicity
● Strong consistency (Same as HDFS guarantees)
● Single Writer - Multiple Consumer pattern
● MVCC for isolation
○ Queries run concurrently with ingestion

Data Skew
Why is skew a problem?
- Spark 2GB Remote Shuffle Block limit
- Straggler problem
Hoodie handles data skew automatically
- Index lookup skew
- Data write skew handled by auto sub partitioning based on history

Compaction
Essential for Query performance
- Merge Write Optimized row format with Scan Optimized column format
Scheduled asynchronously to Ingestion
- Ingestion already groups updates per File Id
- Locks down versions of log files to compact
- Pluggable strategy to prioritize compactions
- Base File to Log file size ratio
- Recent partitions compacted first

Failure recovery
Automatic recovery via Spark RDD
- Resilient Distributed Datasets!!
No Partial writes
- Commit is atomic
- Auto rollback last failed commit
Rollback specific commits
Savepoints/Snapshots
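
One way to picture atomic commits and automatic rollback is a timeline in which files only become visible once their commit is marked complete. The toy model below is an in-memory illustration under that assumption; the class and its methods (begin, write, complete) are hypothetical and are not Hoodie's timeline or write client.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory timeline; illustrative only, not Hoodie's implementation.
class CommitTimelineSketch {

    enum State { INFLIGHT, COMPLETED }

    // Commit time -> state, in the order commits were started.
    private final Map<String, State> timeline = new LinkedHashMap<>();
    // Commit time -> files written under that commit.
    private final Map<String, List<String>> filesByCommit = new LinkedHashMap<>();

    // Starts a commit, first rolling back any commit that never completed.
    void begin(String commitTime) {
        rollbackInflight();
        timeline.put(commitTime, State.INFLIGHT);
        filesByCommit.put(commitTime, new ArrayList<>());
    }

    void write(String commitTime, String file) {
        filesByCommit.get(commitTime).add(file);
    }

    // A single state flip makes the whole batch visible at once.
    void complete(String commitTime) {
        timeline.put(commitTime, State.COMPLETED);
    }

    // Drops every file written by commits that are still inflight.
    void rollbackInflight() {
        timeline.entrySet().removeIf(e -> {
            if (e.getValue() == State.INFLIGHT) {
                filesByCommit.remove(e.getKey());
                return true;
            }
            return false;
        });
    }

    // Readers only ever see files from completed commits.
    List<String> visibleFiles() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, State> e : timeline.entrySet()) {
            if (e.getValue() == State.COMPLETED) {
                out.addAll(filesByCommit.get(e.getKey()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        CommitTimelineSketch t = new CommitTimelineSketch();
        t.begin("1006"); t.write("1006", "File1_v1"); t.complete("1006");
        t.begin("1008"); t.write("1008", "File1_v2"); // writer fails before completing
        System.out.println(t.visibleFiles());
        t.begin("1009"); t.write("1009", "File1_v3"); t.complete("1009");
        System.out.println(t.visibleFiles());
    }
}
```

Because visibility depends on one state change per commit, a reader never observes half of a batch, and a failed batch leaves nothing behind once the next commit starts.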

Hoodie Write API
// WriteConfig contains basePath of hoodie dataset (among other configs)
HoodieWriteClient(JavaSparkContext jsc, HoodieWriteConfig clientConfig)
// Start a commit and get a commit time to atomically upsert a batch of records
String startCommit()
// Upsert the RDD<Records> into the hoodie dataset
JavaRDD<WriteStatus> upsert(JavaRDD<HoodieRecord<T>> records, final String commitTime)
// Choose to commit
boolean commit(String commitTime, JavaRDD<WriteStatus> writeStatuses)
// Rollback
boolean rollback(final String commitTime) throws HoodieRollbackException

Hoodie Record
HoodieRecordPayload
// Get the Avro IndexedRecord for the dataset schema
IndexedRecord getInsertValue(Schema schema);
// Combine Existing value with New incoming value and return the combined value
IndexedRecord combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema);
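
The two payload methods above split the write into "what to store for a new key" and "how to merge with what is already stored". The sketch below mimics that split with plain maps instead of Avro records so that it runs without any dependencies. The class, its method names, and the "updated_at" field are assumptions for illustration; a real payload would implement HoodieRecordPayload against IndexedRecord.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified analogue of a record payload; uses Map instead of Avro IndexedRecord.
class LatestWinsPayloadSketch {

    // Incoming record as field name -> value. Assumes a numeric "updated_at" field.
    private final Map<String, Object> incoming;

    LatestWinsPayloadSketch(Map<String, Object> incoming) {
        this.incoming = incoming;
    }

    // Analogue of getInsertValue: the record to write when the key is new.
    Map<String, Object> insertValue() {
        return new HashMap<>(incoming);
    }

    // Analogue of combineAndGetUpdateValue: keep the stored record if it is newer,
    // otherwise overlay the incoming fields on top of the stored ones.
    Map<String, Object> combineWith(Map<String, Object> current) {
        long currentTs = ((Number) current.get("updated_at")).longValue();
        long incomingTs = ((Number) incoming.get("updated_at")).longValue();
        if (currentTs > incomingTs) {
            return new HashMap<>(current);
        }
        Map<String, Object> merged = new HashMap<>(current);
        merged.putAll(incoming);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new HashMap<>();
        stored.put("id", "trip-1");
        stored.put("fare", 10.0);
        stored.put("updated_at", 5L);

        Map<String, Object> update = new HashMap<>();
        update.put("id", "trip-1");
        update.put("fare", 12.5);
        update.put("updated_at", 7L);

        System.out.println(new LatestWinsPayloadSketch(update).combineWith(stored).get("fare"));
    }
}
```

Keeping the merge rule inside the payload is what lets late-arriving updates be applied in any order without overwriting newer data.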

Hoodie Views
[Figure: query execution time plotted against data latency, with the REALTIME and READ OPTIMIZED views marked.]
3 Logical views Of Dataset
Read Optimized View
- Raw Parquet Query Performance
- Targets existing Hive tables
Real Time View
- Hybrid of row & columnar data
- Brings near-real time tables
Log View
- Stream of changes to dataset
- Enables Incr. Pull

Hoodie Views
[Figure: Hive queries a Read Optimized Table and a Real Time Table over day partitions 2017/02/15, 2017/02/16, and 2017/02/17, each with an index. An input changelog (10 GB, 5 min batches) is written for File1 as File1_v1.parquet, File1_v2.parquet, and File1.avro.log. File1.parquet and an Incremental Log table are also shown.]

Read Optimized View
InputFormat picks only Compacted Columnar Files
Optimized for faster query runtime over data latency
- Plug into query plan generation to filter out older versions
- All optimizations done to read parquet apply (Vectorized etc)
Works out of the box with Presto and Apache Spark
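
Filtering out older versions amounts to keeping, for each file ID, only the newest columnar file written by a completed commit. The following sketch shows that selection in isolation. It is not the actual InputFormat: the classes are invented, and commit times are assumed to be fixed-width timestamp strings that compare correctly as text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only: not Hoodie's InputFormat.
class ReadOptimizedViewSketch {

    // One physical columnar file: a version of a logical file ID.
    static final class FileVersion {
        final String fileId;
        final String commitTime; // assumed fixed-width, so text order equals time order
        final String path;

        FileVersion(String fileId, String commitTime, String path) {
            this.fileId = fileId;
            this.commitTime = commitTime;
            this.path = path;
        }
    }

    // Keeps, per file ID, the version with the greatest completed commit time.
    static List<String> latestPaths(List<FileVersion> versions, Set<String> completedCommits) {
        Map<String, FileVersion> latest = new TreeMap<>();
        for (FileVersion v : versions) {
            if (!completedCommits.contains(v.commitTime)) {
                continue; // skip files from inflight or failed commits
            }
            FileVersion seen = latest.get(v.fileId);
            if (seen == null || v.commitTime.compareTo(seen.commitTime) > 0) {
                latest.put(v.fileId, v);
            }
        }
        List<String> paths = new ArrayList<>();
        for (FileVersion v : latest.values()) {
            paths.add(v.path);
        }
        return paths;
    }

    public static void main(String[] args) {
        List<FileVersion> versions = Arrays.asList(
            new FileVersion("File1", "20170217100500", "File1_v1.parquet"),
            new FileVersion("File1", "20170217101000", "File1_v2.parquet"),
            new FileVersion("File2", "20170217100500", "File2_v1.parquet"));
        Set<String> completed = new HashSet<>(Arrays.asList("20170217100500", "20170217101000"));
        System.out.println(latestPaths(versions, completed));
    }
}
```

Since the query engine is handed ordinary Parquet paths, all of its existing Parquet read optimizations stay in effect.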

Presto Read Optimized Performance

Real Time View
InputFormat merges Columnar with Row Log at query execution
- Data Latency can approach speed of HDFS appends
Custom RecordReader
- Logs are grouped per FileID
- Single split is usually a single FileID in Hoodie (Block Aligned files)
Works out of the box with Presto and Apache Spark
- Specialized parquet read path optimizations not supported
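
The merge performed at query time can be sketched as overlaying the row log on the rows read from the columnar base file. The example below is a deliberately small model with string rows and an in-memory log; the class names are hypothetical and it does not reflect the actual RecordReader.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: not Hoodie's RecordReader.
class RealTimeMergeSketch {

    // One appended log entry: a record key and its new row value.
    static final class LogEntry {
        final String key;
        final String row;

        LogEntry(String key, String row) {
            this.key = key;
            this.row = row;
        }
    }

    // Overlays log entries, oldest first, on the rows read from the columnar base file.
    static Map<String, String> merge(Map<String, String> baseRows, List<LogEntry> log) {
        Map<String, String> merged = new LinkedHashMap<>(baseRows);
        for (LogEntry e : log) {
            merged.put(e.key, e.row); // later entries for the same key win
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> base = new LinkedHashMap<>();
        base.put("trip-1", "fare=10");
        base.put("trip-2", "fare=20");
        List<LogEntry> log = Arrays.asList(
            new LogEntry("trip-2", "fare=22"),
            new LogEntry("trip-3", "fare=30"));
        System.out.println(merge(base, log));
    }
}
```

Grouping logs per file ID keeps this merge local to one split, at the price of doing row-level work that the read optimized view avoids.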

Incremental Log View
[Figure: the trips table partitioned by trip start date (2010-2014, 2015/XX/XX, 2016/XX/XX, 2017/(01-03)/XX, 2017/04/16), with new, updated, and unaffected data marked. New/updated trips arrive as an incremental update every 5 min, and the log view serves them through incremental pull.]
Incremental Log View
Pull ONLY changed records in a time range using SQL
- _hoodie_commit_time > 'startTs' AND _hoodie_commit_time < 'endTs'
Avoid full table/partition scan
Do not rely on a custom sequence ID to tail
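
The reason a commit-time range avoids a full scan is that changes can be looked up by commit rather than by row. The sketch below models that with a sorted map from commit time to the keys each commit changed. It is only an illustration: the class and methods are invented, both bounds are treated as exclusive, and commit times are assumed to be fixed-width strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: not Hoodie's incremental pull implementation.
class IncrementalPullSketch {

    // Commit time -> record keys written by that commit, kept sorted by commit time.
    private final NavigableMap<String, List<String>> changesByCommit = new TreeMap<>();

    void recordCommit(String commitTime, List<String> changedKeys) {
        changesByCommit.put(commitTime, new ArrayList<>(changedKeys));
    }

    // Keys changed by commits with startTs < commitTime < endTs (both bounds exclusive).
    List<String> pull(String startTs, String endTs) {
        List<String> out = new ArrayList<>();
        // Only commits inside the range are visited; everything else is skipped.
        for (List<String> keys : changesByCommit.subMap(startTs, false, endTs, false).values()) {
            out.addAll(keys);
        }
        return out;
    }

    public static void main(String[] args) {
        IncrementalPullSketch log = new IncrementalPullSketch();
        log.recordCommit("1006", Arrays.asList("trip-1", "trip-2"));
        log.recordCommit("1008", Arrays.asList("trip-3"));
        log.recordCommit("1009", Arrays.asList("trip-1"));
        System.out.println(log.pull("1006", "1010"));
    }
}
```

A downstream job would remember the last commit time it consumed and pass it as the next startTs, which is what removes the need for a custom sequence ID.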

Hoodie : Use Cases
How is it being used in real production environments?

Use Cases
Near Real-Time ingestion / stream into HDFS
- Replicate online state in HDFS within few minutes
- Offload analytics to HDFS
Incremental Data Pipelines
- Don't trade off correctness to do incremental processing
- Hoodie integration with Scheduler
Unified Analytical Serving Layer
- Eliminate your specialized serving layer, if latency tolerated is > 5 min
- Simplify serving with HDFS for the entire dataset

Adoption @ Uber
Powering ~1000 data ingestion feeds
- Every 30 mins today, several TBs per hour
- Towards < 10 min in the next few months
Incremental ETL for dimension tables
- Data warehouse at large
Reduced resource usage by 10x
- In production for last 6 months
- Hardened across rolling restarts, data node reboots

Hoodie : Comparisons
What trade-offs does hoodie offer compared to other systems?

Comparison: Analytical Storage - Scans
Source: (CERN Blog) Performance comparison of different file formats and storage engines in the Hadoop ecosystem

Comparison: Analytical Storage - Write Rate
Source: (CERN Blog) Performance comparison of different file formats and storage engines in the Hadoop ecosystem

Comparison
 | Apache HBase | Apache Kudu | Hoodie
Write Latency | Milliseconds | Seconds (streaming) | ~5m update, ~1m insert**
Scan Performance | Not optimal | Optimized via columnar | State of Art Hadoop Formats
Query Engines | Hive* | Impala/Spark* | Hive, Presto, Spark at scale
Deployment | Extra Region servers | Specialized Storage Servers | Spark Jobs on HDFS
Multi Row Commit/Rollback | No | No | Yes
Incremental Pull | No | No | Yes
Automatic Hotspot Handling | No | No | Yes

Hoodie : Open Source
How to get involved, roadmap..

Community
Shopify evaluating for use
- Incremental DB ingestion onto GCS
- Early interest from multiple companies
Engage with us on Github (uber/hoodie)
- Look for “beginner-task” tagged issues
- Try out tools & utilities
Uber is hiring for “Hoodie”
- “Software Engineer - Data Processing Platform (Hoodie)”

Future Plans
Merge On Read (Project #1)
- Active development, Productionizing, Shipping!
Global Index (Project #2)
- Fast, lightweight index to map key to fileID, globally (not just partitions)
Spark Datasource (Issue #7) & Presto Plugins (Issue #81)
- Native support for incremental SQL (e.g., where _hoodie_commit_time > ... )
Beam Runner (Issue #8)
- Build incremental pipelines that also port across batch or streaming modes

Takeaways
Fills a big void in Hadoop land
- Upserts & Faster data
Plays well with Hadoop ecosystem & deployments
- Leverage Spark vs re-inventing yet-another storage silo
Designed for Incremental Processing
- Incremental Pull is a ‘Hoodie’ special

Questions?
Office Hours after talk
5:00pm–5:45pm

[Figure: Hive reads a Realtime View and a Read Optimized View over day partitions 2017/02/15, 2017/02/16, and 2017/02/17, each with an index. A change log (200 GB; 10 GB) is written for File1 as File1_v1.parquet, File1_v2.parquet, and File1.avro.log. File1.parquet is also shown.]

Hoodie Write Path
[Figure: the change log goes through an index lookup that splits it into updates and inserts, over partitions 2017-03-10 to 2017-03-14. In 2017-03-11, File Id1 was compacted at 10:05, and its log file shows commit (10:06), failed commit (10:08), commit (10:08) as Version 1, and commit (10:09) as Version 2. File Id2 sits in 2017-03-14. Other labels: Commit Time: 10:10, Empty.]

Hoodie Write Path
Spark Application

Spark SQL Performance Comparison

Petabytes to Exabytes
Greater need for Incremental Processing

Exponential Growth is fun ..
Also extremely hard to keep up with ...
- Long waits for queue
- Disks running out of space
Common Pitfalls
- Massive re-computations
- Batch jobs are too big to fail
