Hoodie: An Open Source Incremental Processing Framework
DATA
Who Am I
Vinoth Chandar
- Founding engineer/architect of the data team at Uber.
- Previously,
  - Lead on LinkedIn’s Voldemort key value store.
  - Oracle Database replication, Stream Processing.
  - HPC & Grid Computing
Agenda
• Motivation
• Concepts
• Deep Dive
• Use-Cases
• Comparisons
• Open Source
Hoodie : Motivations
Use-cases & business needs that led to the birth of the project
Query Engines
- Presto: > 100K queries/day
- Spark: 100s of Apps
- Hive: 20K Pipelines & Queries
Need For Faster Data!!!
Motivating Use-Case: Late Arriving Updates
[Diagram: trips table partitioned by trip start date, with day level partitions 2010-2014, 2015/XX/XX, 2016/XX/XX, 2017/(01-03)/XX, and 2017/04/16. An incremental update of new/updated trips arrives every 30 min. Legend: New Data, Unaffected Data, Updated Data.]
DB Ingestion: Status Quo
[Diagram labels: Database; Changelog of new/updated trip rows; Kafka; logging; upsert; HBase holding replicated trip rows (>100TB); Snapshot into trips (Parquet), taking Jan: 6 hr (500 executors), Apr: 8 hr (800 executors), Aug: 10 hr (1000 executors); Batch Recompute; Derived Tables; Presto; 12-18+ hr; 8 hr Approximation.]
How can we fix this?
Query HBase?
- Bad Fit for scans
- Lack of support for nested data
- Significant operational overhead
Specialized Analytical DBs?
- Joins with other datasets in HDFS
- Not all data will fit into memory
- Lambda architecture & data copies
Don’t support Snapshots, Only Logs
- Logs ultimately need to be compacted anyway
- Merging done inconsistently & inefficiently by users
Data Modelling Tricks?
- Does not change fundamental nature of problem
Pivotal Question
What do we need to solve this directly on top of a
petabyte scale Hadoop Data Lake?
Let’s Go A Decade Back
How did RDBMS-es solve this?
• Update existing row with new value (Transactions)
• Consume a log of changes downstream (Redo log)
• Update again downstream
[Diagram: MySQL (Server A) receives an Update; MySQL (Server B) pulls the Redo Log of changes, applies a Transformation, and is updated in turn.]
Important Differences
• Columnar file formats
• Read-heavy analytical workloads
• Petabytes & 1000s of servers
Pivotal Question
What do we need to solve this directly on top of a
petabyte scale Hadoop Data Lake?
Answer: upserts & incrementals
Challenging Status Quo: upserts & incr pull
[Diagram labels: Database; Changelog of new/updated trip rows; Kafka; logging; upsert; HBase holding replicated trip rows (>100TB); snapshot at 6 hr (500), 8 hr (800), 10 hr (1000); upsert() at 30 min (100) - Today, 5 min (50) - Q2 ‘17; trips (parquet); Batch Recompute; 12-18+ hr; 8 hr Approximation; incrPull() [2 mins to pull]; 1 hr; 1 hr - 3 hr (10x less resources); Accurate!!!; Derived Tables; Presto.]
Hoodie : Concepts
Incremental Processing Foundations & why it’s important
Anatomy Of Data Pipelines
Core Operations
• Projections (Easy)
• Filtering (Easy)
• Aggregations (Tricky)
• Window (Tricky)
• Joins (Hard)
Operational Levers (Google DataFlow)
• Latency
• Completeness
• Cost
Typically Pick 2/3
Source → Data Pipeline → Sink
An Artificial Dichotomy
It’s A Spectrum
- Very common use-cases tolerating few mins of latency
- 100x more batch pipelines than streaming pipelines
Incremental Processing : What?
Run Mini Batch Pipelines
- Provide higher completeness than streaming pipelines
- By supporting things like multi-table joins seamlessly
In Streaming Fashion
- Provide lower latency than typical batch pipeline
- By only consuming new input & ability to update old results
Incremental Processing : Increased Efficiency
For more, “Case For Incremental Processing on Hadoop” (link)
Less IO, On-Demand Resource Allocation
Incremental Processing : Leverage Hadoop SQL
- Good support for joins
- Columnar File Formats
- Cover wide range of use cases - exploratory, interactive
Incremental Processing : Simplify Architecture
- Efficient pipelines on same batch infrastructure
- Consolidation of storage & compute
Incremental Processing : Primitives
Upsert (Primitive #1)
- Modify processed results
- Like state stores in stream processing
Incremental Pull (Primitive #2)
- Log stream of changes, avoid costly scans
- Enable chaining processing in DAG
Introducing: Hoodie
(Hadoop Upserts anD Incrementals)
Storage Abstraction to
- Apply mutations to dataset
- Pull changelog incrementally
Spark Library
- Scales horizontally like any job
- Stores dataset directly on HDFS
Open Source
- https://github.com/uber/hoodie
- https://eng.uber.com/hoodie
[Diagram: Changelog → Upsert (Spark) → Large HDFS Dataset → Incr Pull (Hive/Spark/Presto) → Changelog; the dataset is also queried as a Normal Table (Hive/Spark/Presto).]
Hoodie: Overview
[Diagram: the Hoodie WriteClient (Spark) stores and indexes data in the Dataset On HDFS, which holds the Index, Data Files, and Timeline Metadata and has a Storage Type. Hive Queries, Presto Queries, and Spark DAGs read data through Views.]
Hoodie: How Do I Ingest?
// Configure the write client: target path, schema, parallelism, index, storage, and compaction.
HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
    .withPath(path)
    .withSchema(schema)
    .withParallelism(500)
    .withIndexConfig(HoodieIndexConfig.newBuilder()
        .withIndexType(HoodieIndex.IndexType.BLOOM).build())
    .withStorageConfig(HoodieStorageConfig.newBuilder()
        .defaultStorage().build())
    .withCompactionConfig(HoodieCompactionConfig.newBuilder()
        .withCompactionStrategy(new BoundedIOCompactionStrategy()).build())
    .build();

JavaRDD<HoodieRecord> inputRecords = … // input data
HoodieWriteClient client = new HoodieWriteClient(sc, cfg);

// Upsert the batch, then commit only if the caller's own check finds no failures; otherwise roll back.
// commitTime and inspectResultFailures are supplied by the caller and are not shown on the slide.
JavaRDD<WriteStatus> result = client.upsert(inputRecords, commitTime);
boolean toCommit = inspectResultFailures(result);
if (toCommit) {
    client.commit(commitTime, result);
} else {
    client.rollback(commitTime);
}
Hoodie: Spark DAG
Hoodie: How Do I Query?
SparkSession spark = SparkSession.builder()
    .appName("Hoodie SparkSQL")
    .config("spark.sql.hive.convertMetastoreParquet", false)
    .enableHiveSupport()
    .getOrCreate();

// real time query
spark.sql("select fare, begin_lon, begin_lat, timestamp from hoodie.trips_rt "
    + "where fare > 100.0").show();

// read optimized query
spark.sql("select fare, begin_lon, begin_lat, timestamp from hoodie.trips_ro "
    + "where fare > 100.0").show();

// Spark Datasource (WIP)
Dataset<Row> dataset = sqlContext.read().format(HOODIE_SOURCE_NAME)
    .option("query", "SELECT driverUUID, riderUUID FROM trips").load();
Hoodie : Deep Dive
Design & Implementation of incremental processing primitives
Hoodie: Storage Types & Views
Storage Type (How is Data stored?) | Views (How is Data Read?)
Copy On Write                      | Read Optimized, LogView
Merge On Read                      | Read Optimized, RealTime, LogView
Storage: Basic Idea
[Diagram labels: partitions 2017/02/15, 2017/02/16, 2017/02/17; Index; File1.parquet and File1_v2.parquet; a 200 GB, 30min batch; File1 with File1_v1.parquet and File1.avro.log; 10 GB, 5min batches.]
● 1825 Partitions (365 days * 5 yrs)
● 100 GB Partition Size
● 128 MB File Size
● ~800 Files Per Partition
● Skew spread - 0.5 % (single batch); New Files - 0.005 % (single batch)
● 7300 Files rewritten, versus ~8 new Files
● 20 seconds to re-write 1 File (shuffle)
● 100 executors, versus 10 executors
● 24 minutes to write, versus ~2 minutes to write
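The rewrite figures on this slide can be checked with a short calculation. The sketch below is only a back-of-envelope estimate: the class and method names are made up for illustration, and it assumes that work divides evenly across executors, with each executor rewriting one file at a time.

```java
// Back-of-envelope check of the slide's figures; not part of Hoodie.
final class RewriteEstimate {
    /** Files touched by one batch = total files * fraction of files receiving updates. */
    static long filesTouched(long partitions, long filesPerPartition, double touchedFraction) {
        return Math.round(partitions * filesPerPartition * touchedFraction);
    }

    /** Wall-clock minutes, assuming files are spread evenly over the executors. */
    static double minutesToWrite(long files, long secondsPerFile, long executors) {
        return (files * secondsPerFile) / (double) executors / 60.0;
    }

    public static void main(String[] args) {
        long files = filesTouched(1825, 800, 0.005);      // 0.5 % of 1,460,000 files
        double minutes = minutesToWrite(files, 20, 100);  // 20 s per file, 100 executors
        // prints "7300 files, about 24 minutes"
        System.out.println(files + " files, about " + Math.round(minutes) + " minutes");
    }
}
```

With 1825 partitions of about 800 files each and a 0.5 % skew spread, one batch touches 7300 files; at 20 seconds per file on 100 executors that is roughly 24 minutes, matching the slide.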
[Diagram labels: Input Changelog, Hoodie Dataset]
Index and Storage
Index
- Tag ingested record as update or insert
- Index is immutable (record key to File mapping never changes)
- Pluggable
- Bloom Filter
- HBase
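The tagging step can be pictured as a key lookup. The sketch below is a minimal illustration, not Hoodie's implementation: an in-memory map stands in for the pluggable index (Bloom filter or HBase), and the names IndexTagging and Tagged are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IndexTagging {
    /** Hypothetical tagged record: fileId is null when the key is new (an insert). */
    static final class Tagged {
        final String key;
        final String fileId;
        Tagged(String key, String fileId) { this.key = key; this.fileId = fileId; }
        boolean isUpdate() { return fileId != null; }
    }

    /** Looks each incoming key up in the index; known keys become updates routed to their file. */
    static List<Tagged> tag(List<String> incomingKeys, Map<String, String> keyToFileId) {
        List<Tagged> out = new ArrayList<>();
        for (String key : incomingKeys) {
            out.add(new Tagged(key, keyToFileId.get(key)));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> index = new HashMap<>();
        index.put("trip-1", "File1"); // the mapping is never changed once written
        for (Tagged t : tag(Arrays.asList("trip-1", "trip-2"), index)) {
            System.out.println(t.key + " -> " + (t.isUpdate() ? "update in " + t.fileId : "insert"));
        }
    }
}
```

Because the record key to file mapping never changes, an update can always be routed to the file that already holds the key.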
Storage
- HDFS or Compatible Filesystem or Cloud Storage
- Block aligned files
- ROFormat (Apache Parquet) & WOFormat (Apache Avro)
Concurrency
● Multi-row atomicity
● Strong consistency (Same as HDFS guarantees)
● Single Writer - Multiple Consumer pattern
● MVCC for isolation
○ Queries run concurrently with ingestion
Data Skew
Why is skew a problem?
- Spark 2GB Remote Shuffle Block limit
- Straggler problem
Hoodie handles data skew automatically
- Index lookup skew
- Data write skew handled by auto sub partitioning based on history
Compaction
Essential for Query performance
- Merge Write Optimized row format with Scan Optimized column format
Scheduled asynchronously to Ingestion
- Ingestion already groups updates per File Id
- Locks down versions of log files to compact
- Pluggable strategy to prioritize compactions
- Base File to Log file size ratio
- Recent partitions compacted first
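One way to picture a prioritization strategy is a greedy pick under an IO budget. The sketch below is an illustration under assumptions, not the actual BoundedIOCompactionStrategy: it ranks file groups by log-to-base size ratio, and CompactionPlanner, FileGroup, and the budget parameter are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class CompactionPlanner {
    /** Hypothetical description of one file group: a base file plus its pending log. */
    static final class FileGroup {
        final String fileId;
        final long baseBytes;
        final long logBytes;
        FileGroup(String fileId, long baseBytes, long logBytes) {
            this.fileId = fileId;
            this.baseBytes = baseBytes;
            this.logBytes = logBytes;
        }
        double logToBaseRatio() { return (double) logBytes / baseBytes; }
        long ioBytes() { return baseBytes + logBytes; }
    }

    /** Picks groups with the highest log-to-base ratio first, skipping any that exceed the remaining budget. */
    static List<String> plan(List<FileGroup> groups, long ioBudgetBytes) {
        List<FileGroup> sorted = new ArrayList<>(groups);
        sorted.sort(Comparator.comparingDouble(FileGroup::logToBaseRatio).reversed());
        List<String> chosen = new ArrayList<>();
        long remaining = ioBudgetBytes;
        for (FileGroup g : sorted) {
            if (g.ioBytes() <= remaining) {
                chosen.add(g.fileId);
                remaining -= g.ioBytes();
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<FileGroup> groups = Arrays.asList(
            new FileGroup("A", 100, 10),
            new FileGroup("B", 100, 50),
            new FileGroup("C", 100, 30));
        System.out.println(plan(groups, 300)); // prints [B, C]
    }
}
```

Ordering recent partitions first would be a different comparator over the same loop, which is the sense in which the strategy is pluggable.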
Failure recovery
Automatic recovery via Spark RDD
- Resilient Distributed Datasets!!
No Partial writes
- Commit is atomic
- Auto rollback last failed commit
Rollback specific commits
Savepoints/Snapshots
Hoodie Views
[Chart: REALTIME and READ OPTIMIZED views positioned by Query execution time and Data Latency]
3 Logical views Of Dataset
Read Optimized View
- Raw Parquet Query Performance
- Targets existing Hive tables
Real Time View
- Hybrid of row & columnar data
- Brings near-real time tables
Log View
- Stream of changes to dataset
- Enables Incr. Pull
Hoodie Views
[Diagram: Hive exposes a Read Optimized Table, a Real Time Table, and an Incremental Log table over one dataset. Labels: partitions 2017/02/15, 2017/02/16, 2017/02/17; Index; File1.parquet; File1 with File1_v1.parquet, File1_v2.parquet, and File1.avro.log; Input Changelog in 10 GB, 5 min batches.]
Read Optimized View
InputFormat picks only Compacted Columnar Files
Optimized for faster query runtime over data latency
- Plug into query plan generation to filter out older versions
- All Optimizations done to read parquet applies (Vectorized etc)
Works out of the box with Presto and Apache Spark
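Filtering out older versions amounts to keeping one columnar file per file ID. The sketch below illustrates that idea only: it models versions as plain integers, and LatestVersionFilter and DataFile are hypothetical names rather than Hoodie's InputFormat classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LatestVersionFilter {
    /** Hypothetical descriptor: one columnar file version belonging to a file ID. */
    static final class DataFile {
        final String fileId;
        final int version;
        final String path;
        DataFile(String fileId, int version, String path) {
            this.fileId = fileId;
            this.version = version;
            this.path = path;
        }
    }

    /** Keeps only the highest version of each file ID, so a query never scans a superseded file. */
    static Map<String, String> latestPaths(List<DataFile> files) {
        Map<String, DataFile> latest = new TreeMap<>();
        for (DataFile f : files) {
            DataFile seen = latest.get(f.fileId);
            if (seen == null || f.version > seen.version) {
                latest.put(f.fileId, f);
            }
        }
        Map<String, String> paths = new TreeMap<>();
        for (DataFile f : latest.values()) {
            paths.put(f.fileId, f.path);
        }
        return paths;
    }

    public static void main(String[] args) {
        List<DataFile> files = Arrays.asList(
            new DataFile("File1", 1, "File1_v1.parquet"),
            new DataFile("File1", 2, "File1_v2.parquet"));
        System.out.println(latestPaths(files)); // prints {File1=File1_v2.parquet}
    }
}
```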
Presto Read Optimized Performance
Real Time View
InputFormat merges Columnar with Row Log at query execution
- Data Latency can approach speed of HDFS appends
Custom RecordReader
- Logs are grouped per FileID
- Single split is usually a single FileID in Hoodie (Block Aligned files)
Works out of the box with Presto and Apache Spark
- Specialized parquet read path optimizations not supported
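Merging at query time can be sketched as replaying a log over base rows. The example below is a simplified illustration, assuming records are keyed strings and that later log entries win; RealTimeMerge is a hypothetical name and does not reflect the actual RecordReader.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RealTimeMerge {
    /** Applies log entries in order over a copy of the base rows; later entries win for the same key. */
    static Map<String, String> merge(Map<String, String> baseRows,
                                     List<Map.Entry<String, String>> logEntries) {
        Map<String, String> merged = new LinkedHashMap<>(baseRows);
        for (Map.Entry<String, String> e : logEntries) {
            merged.put(e.getKey(), e.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> base = new LinkedHashMap<>();   // rows from the columnar file
        base.put("trip-1", "fare=10.0");
        base.put("trip-2", "fare=20.0");
        List<Map.Entry<String, String>> log = Arrays.asList( // rows from the row-based log
            new AbstractMap.SimpleEntry<String, String>("trip-2", "fare=25.0"),
            new AbstractMap.SimpleEntry<String, String>("trip-3", "fare=30.0"));
        System.out.println(merge(base, log));
    }
}
```

Grouping logs per FileID keeps this merge local to one split, since a reader needs only that file's base rows and that file's log.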
Incremental Log View
[Diagram: trips table partitioned by trip start date (2010-2014, 2015/XX/XX, 2016/XX/XX, 2017/(01-03)/XX, 2017/04/16). An incremental update of new/updated trips arrives every 5 min; Incr Pull reads it through the Log View. Legend: New Data, Unaffected Data, Updated Data.]
Pull ONLY changed records in a time range using SQL
- ‘startTs’ < _hoodie_commit_time < ‘endTs’
Avoid full table/partition scan
Do not rely on a custom sequence ID to tail
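The time-range pull can be sketched as a filter on commit time, reading the range as startTs < _hoodie_commit_time < endTs. The example below is an in-memory illustration only: IncrementalPull and Row are hypothetical names, and it assumes commit times are fixed-width timestamp strings, so that string order matches time order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IncrementalPull {
    /** Hypothetical row carrying the commit time that last wrote it. */
    static final class Row {
        final String key;
        final String commitTime;
        Row(String key, String commitTime) { this.key = key; this.commitTime = commitTime; }
    }

    /** Keeps rows whose commit time lies strictly between startTs and endTs. */
    static List<Row> pull(List<Row> rows, String startTs, String endTs) {
        List<Row> out = new ArrayList<>();
        for (Row r : rows) {
            if (r.commitTime.compareTo(startTs) > 0 && r.commitTime.compareTo(endTs) < 0) {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("trip-1", "20170416100000"),
            new Row("trip-2", "20170416103000"),
            new Row("trip-3", "20170416110000"));
        for (Row r : pull(rows, "20170416100000", "20170416110000")) {
            System.out.println(r.key + " @ " + r.commitTime); // prints trip-2 @ 20170416103000
        }
    }
}
```

A downstream job would remember the last commit time it consumed and pass it as the next startTs, which is what removes the need for a custom sequence ID.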
Hoodie : Use Cases
How is it being used in real production environments?
Use Cases
Near Real-Time ingestion / streaming into HDFS
- Replicate online state in HDFS within a few minutes
- Offload analytics to HDFS
Incremental Data Pipelines (Incremental ETL)
- Don't trade off correctness to do incremental processing
- Hoodie integration with Scheduler
Unified Analytical Serving Layer
- Eliminate your specialized serving layer, if the tolerated latency is > 5 min
- Simplify serving with HDFS for the entire dataset
Adoption @ Uber
Powering ~1000 data ingestion feeds
- Every 30 mins today, several TBs per hour
- Towards < 10 min in the next few months
Incremental ETL for dimension tables
- Data warehouse at large
Reduced resource usage by 10x
- In production for last 6 months
- Hardened across rolling restarts, data node reboots
Hoodie : Comparisons
What trade-offs does hoodie offer compared to other systems?
Comparison: Analytical Storage - Scans
Comparison: Analytical Storage - Write Rate
[Charts not reproduced. Source: (CERN Blog) Performance comparison of different file formats and storage engines in the Hadoop ecosystem]
Comparison
                           | Apache HBase         | Apache Kudu                 | Hoodie
Write Latency              | Milliseconds         | Seconds (streaming)         | ~5m update, ~1m insert**
Scan Performance           | Not optimal          | Optimized via columnar      | State of Art Hadoop Formats
Query Engines              | Hive*                | Impala/Spark*               | Hive, Presto, Spark at scale
Deployment                 | Extra Region servers | Specialized Storage Servers | Spark Jobs on HDFS
Multi Row Commit/Rollback  | No                   | No                          | Yes
Incremental Pull           | No                   | No                          | Yes
Automatic Hotspot Handling | No                   | No                          | Yes
Hoodie : Open Source
How to get involved, roadmap..
Community
Shopify evaluating for use
- Incremental DB ingestion onto GCS
- Early interest from multiple companies (DoubleVerify,..)
Engage with us on Github (uber/hoodie)
- Look for “beginner-task” tagged issues
- Try out tools & utilities
Uber is hiring for “Hoodie”
- “Software Engineer - Data Processing Platform (Hoodie)”
Future Plans
Merge On Read (Project #1)
- Productionizing, Shipping!
Global Index (Project #2)
- Fast, lightweight index to map key to fileID, globally (not just partitions)
Spark Datasource (Issue #7) & Presto Plugins (Issue #81)
- Native support for incremental SQL (e.g: where _hoodie_commit_time > ... )
Beam Runner (Issue #8)
- Build incremental pipelines that also port across batch or streaming modes
Takeaways
Fills a big void in Hadoop land
- Upserts & Faster data
Play well with Hadoop ecosystem & deployments
- Leverage Spark vs re-inventing yet-another storage silo
Designed for Incremental Processing
- Incremental Pull is a ‘Hoodie’ special
Questions?
source

More Related Content

What's hot

Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Sumeet Singh
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleMapR Technologies
 
Expand data analysis tool at scale with Zeppelin
Expand data analysis tool at scale with ZeppelinExpand data analysis tool at scale with Zeppelin
Expand data analysis tool at scale with ZeppelinDataWorks Summit
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container EraSadayuki Furuhashi
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
More Algorithms and Tools for Genomic Analysis on Apache Spark with Ryan Will...
More Algorithms and Tools for Genomic Analysis on Apache Spark with Ryan Will...More Algorithms and Tools for Genomic Analysis on Apache Spark with Ryan Will...
More Algorithms and Tools for Genomic Analysis on Apache Spark with Ryan Will...Databricks
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterDon Drake
 
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang PengBuilding Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang PengDatabricks
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionDataWorks Summit
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomyDongmin Yu
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkDongwon Kim
 
Hw09 Sqoop Database Import For Hadoop
Hw09   Sqoop Database Import For HadoopHw09   Sqoop Database Import For Hadoop
Hw09 Sqoop Database Import For HadoopCloudera, Inc.
 
Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015N Masahiro
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageSATOSHI TAGOMORI
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKSkills Matter
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
 

What's hot (20)

Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scale
 
Expand data analysis tool at scale with Zeppelin
Expand data analysis tool at scale with ZeppelinExpand data analysis tool at scale with Zeppelin
Expand data analysis tool at scale with Zeppelin
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
More Algorithms and Tools for Genomic Analysis on Apache Spark with Ryan Will...
More Algorithms and Tools for Genomic Analysis on Apache Spark with Ryan Will...More Algorithms and Tools for Genomic Analysis on Apache Spark with Ryan Will...
More Algorithms and Tools for Genomic Analysis on Apache Spark with Ryan Will...
 
RubyKaigi 2014: ServerEngine
RubyKaigi 2014: ServerEngineRubyKaigi 2014: ServerEngine
RubyKaigi 2014: ServerEngine
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang PengBuilding Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
 
Hw09 Sqoop Database Import For Hadoop
Hw09   Sqoop Database Import For HadoopHw09   Sqoop Database Import For Hadoop
Hw09 Sqoop Database Import For Hadoop
 
Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 

Similar to SF Big Analytics meetup : Hoodie From Uber

Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Amazon Web Services
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hiveDavid Kaiser
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/HudiVinoth Chandar
 
Realtime Data Analytics
Realtime Data AnalyticsRealtime Data Analytics
Realtime Data AnalyticsBo Yang
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Databricks
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Lessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at VintedLessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at VintedDainius Jocas
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 

Similar to SF Big Analytics meetup : Hoodie From Uber (20)

Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hive
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
 
Realtime Data Analytics
Realtime Data AnalyticsRealtime Data Analytics
Realtime Data Analytics
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Lessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at VintedLessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at Vinted
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 

SF Big Analytics meetup : Hoodie From Uber

  • 1. An Open Source Incremental Processing Framework Hoodie DATA
  • 2. Who Am I Vinoth Chandar - Founding engineer/architect of the data team at Uber. - Previously, - Lead on Linkedin’s Voldemort key value store. - Oracle Database replication, Stream Processing. - HPC & Grid Computing
  • 3. Agenda • Motivation • Concepts • Deep Dive • Use-Cases • Comparisons • Open Source
  • 4. Hoodie : Motivations Use-cases & business needs that led to the birth of the project
  • 5. Query Engines Presto > 100K queries/day Spark 100s of Apps Hive 20K Pipelines & Queries Need For Faster Data!!!
  • 6. Partitioned by trip start date 2010-2014 New Data Unaffected Data Updated Data Incremental update 2015/XX/XX Every 30 min Day level partitions Motivating Use-Case: Late Arriving Updates 2016/XX/XX 2017/(01-03)/XX 2017/04/16 New/Updated Trips
  • 7. Jan: 6 hr (500 executors) Snapshot DB Ingestion: Status Quo Database trips (Parquet) Replicated Trip Rows (>100TB) HBase New /updated trip rows Changelog 12-18+ hr Kafka upsert Presto Derived Tables logging 8 hr Approximation Batch Recompute Apr: 8 hr (800 executors) Aug: 10 hr (1000 executors)
  • 8. How can we fix this? Query HBase? - Bad Fit for scans - Lack of support for nested data - Significant operational overhead Specialized Analytical DBs? - Joins with other datasets in HDFS - Not all data will fit into memory - Lambda architecture & data copies Don’t support Snapshots, Only Logs - Logs ultimately need to be compacted anyway - Merging done inconsistently & inefficiently by users Data Modelling Tricks? - Does not change fundamental nature of problem
  • 9. Pivotal Question What do we need to solve this directly on top of a petabyte scale Hadoop Data Lake?
  • 10. Let’s Go A Decade Back How did RDBMS-es solve this? • Update existing row with new value (Transactions) • Consume a log of changes downstream (Redo log) • Update again downstream MySQL (Server A) MySQL (Server B) Update Update Pull Redo Log Transformation Important Differences • Columnar file formats • Read-heavy analytical workloads • Petabytes & 1000s of servers Changes
  • 11. Pivotal Question What do we need to solve this directly on top of a petabyte scale Hadoop Data Lake? Answer: upserts & incrementals
  • 12. 10 hr (1000) 8 hr (800) 6 hr (500) snapshot Batch Recompute Challenging Status Quo: upserts & incr pull trips (parquet) 12-18+ hr Presto Derived Tables 8 hr Approximation upsert() 30 min (100) - Today 5 min (50) - Q2 ‘17 1 hr incrPull() [2 mins to pull] 1 hr - 3 hr (10x less resources) Accurate!!! Database Replicated Trip Rows (>100TB) HBase New /updated trip rows Changelog Kafka upsert logging
  • 13. Hoodie : Concepts Incremental Processing Foundations & why it’s important
  • 14. Anatomy Of Data Pipelines Core Operations • Projections (Easy) • Filtering (Easy) • Aggregations (Tricky) • Window (Tricky) • Joins (Hard) Operational Levers (Google DataFlow) • Latency • Completeness • Cost Typically Pick 2/3 Source Sink Data Pipeline
  • 16. It’s A Spectrum - Very common use-cases tolerate a few mins of latency - 100x more batch pipelines than streaming pipelines
  • 17. Incremental Processing : What? Run Mini Batch Pipelines - Provide higher completeness than streaming pipelines - By supporting things like multi-table joins seamlessly In Streaming Fashion - Provide lower latency than a typical batch pipeline - By consuming only new input & being able to update old results
  • 18. Incremental Processing : Increased Efficiency For more, “Case For Incremental Processing on Hadoop” (link) Less IO, On-Demand Resource Allocation
  • 19. Incremental Processing : Leverage Hadoop SQL For more, “Case For Incremental Processing on Hadoop” (link) - Good support for joins - Columnar File Formats, - Cover wide range of use cases - exploratory, interactive
  • 20. Incremental Processing : Simplify Architecture For more, “Case For Incremental Processing on Hadoop” (link) - Efficient pipelines on same batch infrastructure - Consolidation of storage & compute
  • 21. Incremental Processing : Primitives Upsert (Primitive #1) - Modify processed results - Like state stores in stream processing Incremental Pull (Primitive #2) - Log stream of changes, avoid costly scans - Enable chaining processing in DAG
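The two primitives can be illustrated with a minimal in-memory sketch. This is not Hoodie's API: `ToyTable`, `upsert`, and `incrementalPull` are hypothetical names, a map stands in for HDFS storage, and commit times are plain longs chosen by the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: not Hoodie's API. Storage is an in-memory map.
class ToyTable {
    // Latest value for a key plus the commit that wrote it.
    static final class Row {
        final String key;
        final String value;
        final long commitTime;

        Row(String key, String value, long commitTime) {
            this.key = key;
            this.value = value;
            this.commitTime = commitTime;
        }

        @Override
        public String toString() {
            return key + "=" + value + "@" + commitTime;
        }
    }

    private final Map<String, Row> rows = new LinkedHashMap<>();

    // Primitive #1: insert new keys, overwrite existing ones.
    void upsert(Map<String, String> batch, long commitTime) {
        for (Map.Entry<String, String> e : batch.entrySet()) {
            rows.put(e.getKey(), new Row(e.getKey(), e.getValue(), commitTime));
        }
    }

    // Primitive #2: return only rows written after sinceCommit.
    List<Row> incrementalPull(long sinceCommit) {
        List<Row> changed = new ArrayList<>();
        for (Row r : rows.values()) {
            if (r.commitTime > sinceCommit) {
                changed.add(r);
            }
        }
        return changed;
    }

    int size() {
        return rows.size();
    }

    public static void main(String[] args) {
        ToyTable trips = new ToyTable();
        trips.upsert(Map.of("trip1", "fare=10", "trip2", "fare=20"), 1L);
        trips.upsert(Map.of("trip2", "fare=25", "trip3", "fare=30"), 2L);
        // A downstream job that already processed commit 1 pulls only trip2 and trip3.
        System.out.println(trips.incrementalPull(1L));
    }
}
```

A downstream pipeline would remember the last commit time it consumed and pass it to the next pull, which is what lets stages be chained without rescanning the whole table.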
  • 22. Introducing: Hoodie (Hadoop Upserts anD Incrementals) Storage Abstraction to - Apply mutations to dataset - Pull changelog incrementally Spark Library - Scales horizontally like any job - Stores dataset directly on HDFS Open Source - https://github.com/uber/hoodie - https://eng.uber.com/hoodie Large HDFS Dataset Upsert (Spark) Changelog Changelog Incr Pull (Hive/Spark/Presto) Normal Table (Hive/Spark/Presto)
  • 23. Hoodie: Overview Hoodie WriteClient (Spark) Index Data Files Timeline Metadata Hive Queries Dataset On HDFS Presto Queries Spark DAGs Store & Index Data Read data Storage Type Views
  • 24. Hoodie: How Do I Ingest?
        HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
            .withPath(path)
            .withSchema(schema)
            .withParallelism(500)
            .withIndexConfig(HoodieIndexConfig.newBuilder()
                .withIndexType(HoodieIndex.IndexType.BLOOM).build())
            .withStorageConfig(HoodieStorageConfig.newBuilder()
                .defaultStorage().build())
            .withCompactionConfig(HoodieCompactionConfig.newBuilder()
                .withCompactionStrategy(new BoundedIOCompactionStrategy()).build())
            .build();
        JavaRDD<HoodieRecord> inputRecords = … // input data
        HoodieWriteClient client = new HoodieWriteClient(sc, cfg);
        JavaRDD<WriteStatus> result = client.upsert(inputRecords, commitTime);
        boolean toCommit = inspectResultFailures(result);
        if (toCommit) {
            client.commit(commitTime, result);
        } else {
            client.rollback(commitTime);
        }
  • 26. Hoodie: How Do I Query?
        SparkSession spark = SparkSession.builder()
            .appName("Hoodie SparkSQL")
            .config("spark.sql.hive.convertMetastoreParquet", false)
            .enableHiveSupport()
            .getOrCreate();
        // real time query
        spark.sql("select fare, begin_lon, begin_lat, timestamp from hoodie.trips_rt where fare > 100.0").show();
        // read optimized query
        spark.sql("select fare, begin_lon, begin_lat, timestamp from hoodie.trips_ro where fare > 100.0").show();
        // Spark Datasource (WIP)
        Dataset<Row> dataset = sqlContext.read().format(HOODIE_SOURCE_NAME)
            .option("query", "SELECT driverUUID, riderUUID FROM trips").load();
  • 27. Hoodie : Deep Dive Design & Implementation of incremental processing primitives
  • 28. Hoodie: Storage Types & Views Storage Type (How is Data stored?) Views (How is Data Read?) Copy On Write Read Optimized, LogView Merge On Read Read Optimized, RealTime, LogView
  • 29. Storage: Basic Idea 2017/02/17 File1.parquet Index Index File1_v2.parquet 2017/02/15 2017/02/16 2017/02/17 File1.avro.log 200 GB 30min batch File1 10 GB 5min batch File1_v1.parquet 10 GB 5 min batch Input Changelog Hoodie Dataset ● 1825 Partitions (365 days * 5 yrs) ● 100 GB Partition Size ● 128 MB File Size ● ~800 Files Per Partition ● Skew spread - 0.5 % (single batch); New Files - 0.005 % (single batch) ● 7300 Files rewritten; ~8 new Files ● 20 seconds to re-write 1 File (shuffle) ● 100 executors; 10 executors ● 24 minutes to write; ~2 minutes to write
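The file-rewrite figures on this slide can be checked with back-of-the-envelope arithmetic. The sketch below assumes (this is not stated on the slide) that each executor rewrites one file at a time and that files are spread evenly across executors; `RewriteCostEstimate` and its methods are hypothetical names.

```java
// Back-of-the-envelope check of the slide's file-rewrite numbers.
// Assumption (not stated on the slide): each executor rewrites one file
// at a time and the files are spread evenly across executors.
class RewriteCostEstimate {
    // Files touched by one batch = total files * fraction of files touched.
    static long filesRewritten(long partitions, long filesPerPartition, double skewSpread) {
        return Math.round(partitions * filesPerPartition * skewSpread);
    }

    // Wall-clock minutes if the rewrites are spread evenly over the executors.
    static double minutesToWrite(long files, double secondsPerFile, int executors) {
        return files * secondsPerFile / executors / 60.0;
    }

    public static void main(String[] args) {
        long partitions = 365L * 5;      // 1825 day-level partitions
        long filesPerPartition = 800;    // 100 GB partition / 128 MB files
        long files = filesRewritten(partitions, filesPerPartition, 0.005); // 0.5 % skew spread
        System.out.println(files + " files rewritten");
        System.out.println(minutesToWrite(files, 20.0, 100) + " minutes to write");
    }
}
```

With the slide's inputs this gives 7300 files and roughly 24 minutes, matching the quoted figures.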
  • 30. Index and Storage Index - Tag ingested record as update or insert - Index is immutable (record key to File mapping never changes) - Pluggable - Bloom Filter - HBase Storage - HDFS or Compatible Filesystem or Cloud Storage - Block aligned files - ROFormat (Apache Parquet) & WOFormat (Apache Avro)
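A minimal sketch of how a Bloom filter can help tag an incoming record key as an update or an insert. It is a toy for a single file, not Hoodie's implementation: `ToyBloomIndex`, the two hash functions, and the in-memory key set (standing in for reading the file's actual keys) are all assumptions made for illustration.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Toy sketch of Bloom-filter-based tagging for one file; not Hoodie's implementation.
class ToyBloomIndex {
    private static final int BITS = 1 << 16;
    private final BitSet bits = new BitSet(BITS);
    // Stands in for reading the real keys stored in the data file.
    private final Set<String> keysInFile = new HashSet<>();

    private static int h1(String key) {
        return Math.floorMod(key.hashCode(), BITS);
    }

    private static int h2(String key) {
        return Math.floorMod(new StringBuilder(key).reverse().toString().hashCode(), BITS);
    }

    void add(String key) {
        bits.set(h1(key));
        bits.set(h2(key));
        keysInFile.add(key);
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(String key) {
        return bits.get(h1(key)) && bits.get(h2(key));
    }

    // Tag an incoming record key as UPDATE or INSERT.
    String tag(String key) {
        if (!mightContain(key)) {
            return "INSERT"; // filter rules the key out, no file lookup needed
        }
        // Bloom filters can give false positives, so confirm against the real keys.
        return keysInFile.contains(key) ? "UPDATE" : "INSERT";
    }

    public static void main(String[] args) {
        ToyBloomIndex index = new ToyBloomIndex();
        index.add("trip1");
        System.out.println("trip1 -> " + index.tag("trip1"));
        System.out.println("trip9 -> " + index.tag("trip9"));
    }
}
```

The point of the filter is that most new keys are rejected cheaply, so the expensive key lookup only happens for likely updates.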
  • 31. Concurrency ● Multi-row atomicity ● Strong consistency (Same as HDFS guarantees) ● Single Writer - Multiple Consumer pattern ● MVCC for isolation ○ Queries run concurrently with ingestion
  • 32. Data Skew Why is skew a problem? - Spark 2GB Remote Shuffle Block limit - Straggler problem Hoodie handles data skew automatically - Index lookup skew - Data write skew handled by auto sub partitioning based on history
  • 33. Compaction Essential for Query performance - Merge Write Optimized row format with Scan Optimized column format Scheduled asynchronously to Ingestion - Ingestion already groups updates per File Id - Locks down versions of log files to compact - Pluggable strategy to prioritize compactions - Base File to Log file size ratio - Recent partitions compacted first
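A prioritization strategy along these lines could be sketched as a comparator. This is a toy, not Hoodie's strategy interface: `ToyCompactionPlanner` and `Candidate` are hypothetical names, and ordering by the largest log-to-base size ratio is an assumption about how the ratio on the slide is meant to be used.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy sketch of a compaction-ordering strategy; not Hoodie's implementation.
class ToyCompactionPlanner {
    static final class Candidate {
        final String partition; // e.g. "2017/02/17"; sorts chronologically as a string
        final String fileId;
        final long baseBytes;
        final long logBytes;

        Candidate(String partition, String fileId, long baseBytes, long logBytes) {
            this.partition = partition;
            this.fileId = fileId;
            this.baseBytes = baseBytes;
            this.logBytes = logBytes;
        }

        double logToBaseRatio() {
            return (double) logBytes / baseBytes;
        }
    }

    // Recent partitions first; within a partition, largest log-to-base ratio first.
    static List<Candidate> prioritize(List<Candidate> candidates) {
        Comparator<Candidate> recentFirst =
                Comparator.comparing((Candidate c) -> c.partition).reversed();
        Comparator<Candidate> biggestLogFirst =
                Comparator.comparingDouble((Candidate c) -> c.logToBaseRatio()).reversed();
        List<Candidate> ordered = new ArrayList<>(candidates);
        ordered.sort(recentFirst.thenComparing(biggestLogFirst));
        return ordered;
    }

    public static void main(String[] args) {
        List<Candidate> plan = prioritize(List.of(
                new Candidate("2017/02/16", "f1", 100L, 90L),
                new Candidate("2017/02/17", "f2", 100L, 10L),
                new Candidate("2017/02/17", "f3", 100L, 50L)));
        for (Candidate c : plan) {
            System.out.println(c.partition + " " + c.fileId);
        }
    }
}
```

Because the ordering is an ordinary comparator, swapping in a different policy (for example, one that caps total IO per run) only means changing the sort key or truncating the list.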
  • 34. Failure recovery Automatic recovery via Spark RDD - Resilient Distributed Datasets!! No Partial writes - Commit is atomic - Auto rollback last failed commit Rollback specific commits Savepoints/Snapshots
  • 35. Hoodie: Overview Hoodie Concepts Hoodie WriteClient (Spark) Index Data Files Timeline Metadata Hive Queries Hoodie Dataset On HDFS Presto Queries Spark DAGs Store & Index Data Read data Storage Type Views
  • 36. Hoodie Views REALTIME READ OPTIMIZED Query execution time Data Latency 3 Logical views Of Dataset Read Optimized View - Raw Parquet Query Performance - Targets existing Hive tables Real Time View - Hybrid of row & columnar data - Brings near-real time tables Log View - Stream of changes to dataset - Enables Incr. Pull
  • 37. Hoodie Views Read Optimized Table Real Time Table Hive 2017/02/15 2017/02/16 2017/02/17 2017/02/16 File1.parquet Index Index File1_v2.parquet File1.avro.log File1 File1_v1.parquet 10 GB 5min batch 10 GB 5 min batch Input Changelog Incremental Log table
  • 38. Read Optimized View InputFormat picks only Compacted Columnar Files Optimized for faster query runtime over data latency - Plugs into query plan generation to filter out older versions - All optimizations for reading Parquet apply (vectorized reads, etc.) Works out of the box with Presto and Apache Spark
  • 39. Presto Read Optimized Performance
  • 40. Real Time View InputFormat merges Columnar with Row Log at query execution - Data Latency can approach speed of HDFS appends Custom RecordReader - Logs are grouped per FileID - Single split is usually a single FileID in Hoodie (Block Aligned files) Works out of the box with Presto and Apache Spark - Specialized parquet read path optimizations not supported
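The query-time merge can be sketched as replaying the row log over the columnar base file for one file ID. This is an in-memory toy, not Hoodie's RecordReader: `ToyRealTimeReader` and `mergeOnRead` are hypothetical names, and records are reduced to key/value strings.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of merging a base file with its log at read time; not Hoodie's RecordReader.
class ToyRealTimeReader {
    // baseFile: key -> value as of the last compaction.
    // logRecords: updates and inserts appended since then, oldest first.
    static Map<String, String> mergeOnRead(Map<String, String> baseFile,
                                           List<Map.Entry<String, String>> logRecords) {
        Map<String, String> merged = new LinkedHashMap<>(baseFile);
        for (Map.Entry<String, String> update : logRecords) {
            // Later log entries overwrite earlier values for the same key.
            merged.put(update.getKey(), update.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> base = new LinkedHashMap<>();
        base.put("trip1", "fare=10");
        base.put("trip2", "fare=20");
        List<Map.Entry<String, String>> log = List.of(
                Map.entry("trip2", "fare=25"),
                Map.entry("trip3", "fare=30"));
        System.out.println(mergeOnRead(base, log));
    }
}
```

The read-optimized view corresponds to returning `baseFile` unchanged, which is why it is faster to query but lags behind by whatever is still sitting in the log.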
  • 41. Incremental Log View Partitioned by trip start date 2010-2014 New Data Unaffected Data Updated Data Incremental update 2015/XX/XX Every 5 min 2016/XX/XX 2017/(01-03)/XX 2017/04/16 New/Updated Trips Log View Incr Pull
  • 42. Incremental Log View Pull ONLY changed records in a time range using SQL - ‘startTs’ < _hoodie_commit_time < ‘endTs’ Avoid full table/partition scan Do not rely on a custom sequence ID to tail
  • 43. Hoodie : Use Cases How is it being used in real production environments?
  • 44. Use Cases Near Real-Time ingestion / streaming into HDFS - Replicate online state in HDFS within a few minutes - Offload analytics to HDFS
  • 46. Use Cases Near Real-Time ingestion / streaming into HDFS - Replicate online state in HDFS within a few minutes - Offload analytics to HDFS Incremental Data Pipelines - Don't trade off correctness to do incremental processing - Hoodie integration with Scheduler
  • 48. Use Cases Near Real-Time ingestion / streaming into HDFS - Replicate online state in HDFS within a few minutes - Offload analytics to HDFS Incremental Data Pipelines - Don't trade off correctness to do incremental processing - Hoodie integration with Scheduler Unified Analytical Serving Layer - Eliminate your specialized serving layer, if latency tolerated is > 5 min - Simplify serving with HDFS for the entire dataset
  • 50. Adoption @ Uber Powering ~1000 data ingestion feeds - Every 30 mins today, several TBs per hour - Towards < 10 min in the next few months Incremental ETL for dimension tables - Data warehouse at large Reduced resource usage by 10x - In production for last 6 months - Hardened across rolling restarts, data node reboots
  • 51. Hoodie : Comparisons What trade-offs does hoodie offer compared to other systems?
  • 52. Source: (CERN Blog) Performance comparison of different file formats and storage engines in the Hadoop ecosystem Comparison: Analytical Storage - Scans
  • 53. Source: (CERN Blog) Performance comparison of different file formats and storage engines in the Hadoop ecosystem Comparison: Analytical Storage - Write Rate
  • 54. Comparison
        Feature                    | Apache HBase         | Apache Kudu                 | Hoodie
        Write Latency              | Milliseconds         | Seconds (streaming)         | ~5m update, ~1m insert**
        Scan Performance           | Not optimal          | Optimized via columnar      | State of Art Hadoop Formats
        Query Engines              | Hive*                | Impala/Spark*               | Hive, Presto, Spark at scale
        Deployment                 | Extra Region servers | Specialized Storage Servers | Spark Jobs on HDFS
        Multi Row Commit/Rollback  | No                   | No                          | Yes
        Incremental Pull           | No                   | No                          | Yes
        Automatic Hotspot Handling | No                   | No                          | Yes
  • 55. Hoodie : Open Source How to get involved, roadmap..
  • 56. Community Shopify evaluating for use - Incremental DB ingestion onto GCS - Early interest from multiple companies (DoubleVerify,..) Engage with us on Github (uber/hoodie) - Look for “beginner-task” tagged issues - Try out tools & utilities Uber is hiring for “Hoodie” - “Software Engineer - Data Processing Platform (Hoodie)”
  • 57. Future Plans Merge On Read (Project #1) - Productionizing, Shipping! Global Index (Project #2) - Fast, lightweight index to map key to fileID, globally (not just partitions) Spark Datasource (Issue #7) & Presto Plugins (Issue #81) - Native support for incremental SQL (e.g: where _hoodie_commit_time > ... ) Beam Runner (Issue #8) - Build incremental pipelines that also port across batch or streaming modes
  • 58. Takeaways Fills a big void in Hadoop land - Upserts & Faster data Plays well with Hadoop ecosystem & deployments - Leverages Spark vs re-inventing yet-another storage silo Designed for Incremental Processing - Incremental Pull is a ‘Hoodie’ special