Hoodie: How (and Why) Uber Built an Analytical Datastore on Spark
Prasanna Rajaperumal, Engineer, Uber
June 2017
What’s our problem?
Queryable State for Analytics
Queryable state == mutations
● Pure Dimension Table e.g. Users
● Fact Tables that can get super large and need a materialized view e.g. Trips
● Late Arriving Data
○ Event time vs Processing time
● Delete records (Compliance)
● Data correction upstream
Analytics == Big Scans
● Super fast scans on subset of columns
● Large time ranges - Lots of data
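The mutation cases listed above (upstream corrections, late-arriving data, deletes) can be sketched as a changelog applied to a keyed table. This is a minimal illustration, not Hoodie code: the `Change` class, the null-fare delete marker and the event-time comparison are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: apply a changelog of upserts and deletes to a keyed table.
class ChangelogUpsert {
    // One change record. A null fare marks a delete (an assumption of this sketch).
    static final class Change {
        final String tripId;
        final Double fare;
        final long eventTime;

        Change(String tripId, Double fare, long eventTime) {
            this.tripId = tripId;
            this.fare = fare;
            this.eventTime = eventTime;
        }
    }

    // Returns a new table with the batch applied; the input table is not modified.
    static Map<String, Change> apply(Map<String, Change> table, List<Change> batch) {
        Map<String, Change> result = new LinkedHashMap<>(table);
        for (Change c : batch) {
            Change current = result.get(c.tripId);
            // Late-arriving data: skip a change whose event time is older than the stored row.
            if (current != null && current.eventTime > c.eventTime) {
                continue;
            }
            if (c.fare == null) {
                result.remove(c.tripId);   // delete (e.g. compliance)
            } else {
                result.put(c.tripId, c);   // insert or update
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Change> trips = apply(new LinkedHashMap<>(), Arrays.asList(
                new Change("t1", 10.0, 100L),
                new Change("t2", 20.0, 100L)));
        trips = apply(trips, Arrays.asList(
                new Change("t1", 12.5, 200L),   // data correction upstream
                new Change("t1", 11.0, 150L),   // late arrival, older than the stored row
                new Change("t2", null, 300L))); // delete
        System.out.println(trips.keySet() + " fare=" + trips.get("t1").fare);
    }
}
```

The sketch keeps no tombstones, so an old change arriving after a delete would reinsert the row; a real store needs a policy for that case.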
Okay, so what did we want?
OLAP Database
● Scale and complexity
○ Scale horizontally [Petabytes]
○ Support Nested columns
○ Batch ingest and Analytical scans
● Latency
○ Ingest Latency ~ 10 minutes
○ Query Latency ~ up to 2 minutes
● Multi tenant - High throughput
● Transactional - ACID
● Self Healing
○ Fewer tunable knobs
○ Handle data skew
○ Auto scale with load
○ Failure Recovery
○ Rollback and Savepoints
Okay, Could you do ...?
Solutions that did not work for us
● OLAP RDBMS
○ Petabyte scale
○ Elastic scaling of compute
● No/New SQL (LSM)
○ Scan performance
○ Operations involved - Compaction
● Hack around it
○ Dump LSM Snapshot
○ Rewrite partitions too costly
○ Watermark - Approximations
● Hive Transactions
○ Hive specific solution
○ Hash bucketing - tuning?
● Apache Kudu
○ Separate storage server
○ Ecosystem support
Let's design what we want.
We have 20 minutes.
“Software Engineer engineers
the illusion of simplicity”
Grady Booch, UML Creator
Pick the area in RUM triangle
Design choices
● RUM Conjecture
○ Optimize 2 at the expense of the third
● Fast data - Write Optimized
○ Control Read Amplification
○ Query execution cost
● Fast Scans - Read Optimized
○ Control Write Amplification
○ Ingestion cost
● Choice per client/query
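The trade-off between the two bullets above can be made concrete with a little arithmetic. The sketch below is only illustrative: the file and batch sizes are made-up numbers, and the two helper functions are hypothetical definitions of write and read amplification chosen for this example.

```java
// Hypothetical arithmetic sketch of the trade-off; all sizes are illustrative assumptions.
class RumTradeoff {
    // Write amplification: MB physically written per MB of incoming changes.
    static double writeAmplification(long writtenMb, long ingestedMb) {
        return (double) writtenMb / ingestedMb;
    }

    // Read amplification: MB physically read per MB of base data the query needs.
    static double readAmplification(long readMb, long neededMb) {
        return (double) readMb / neededMb;
    }

    public static void main(String[] args) {
        long baseFileMb = 1024; // existing columnar file
        long batchMb = 64;      // incoming batch of updates to rows in that file

        // Read optimized: ingest rewrites the whole columnar file, queries read only that file.
        System.out.println("read optimized  WA=" + writeAmplification(baseFileMb, batchMb)
                + " RA=" + readAmplification(baseFileMb, baseFileMb));

        // Write optimized: ingest appends the batch to a log, queries read the file plus the log.
        System.out.println("write optimized WA=" + writeAmplification(batchMb, batchMb)
                + " RA=" + readAmplification(baseFileMb + batchMb, baseFileMb));
    }
}
```

With these assumed sizes the read optimized path pays 16x on write and 1x on read, while the write optimized path pays 1x on write and about 1.06x on read, growing as the log grows until it is compacted.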
Pick Framework
Leveraging Spark’s Elasticity + Scalability + Speed
● Spark + DFS vs Storage Server
○ Batch engine vs MPP engine
■ Throughput vs Latency
■ Flexibility to go batch or streaming
■ Dynamic Resource Allocation
○ Complexity
■ Static Partitioning
■ Dedicated resources
■ Consensus
○ Scaling
■ Auto Scaling with load using Spark
○ Resiliency and Recovery (RDD)
■ Simplify Application Abstraction
■ Self Healing
○ Simplified API Layer
Correctness - ACID
Design choices
● Atomic ingest of a batch
○ Based on Processing time
○ Cross row atomicity
● Strong consistency
● Single Writer | Multiple Reader
● High query concurrency
○ Query Isolation using Snapshot
● Time travel
○ Temporal queries
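One way to picture snapshot isolation and time travel together is a commit timeline over versioned files. The sketch below is an assumption-laden illustration, not Hoodie's implementation: `SnapshotTimeline`, its method names and the sortable string commit times are all hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: readers only see file versions written by completed commits.
class SnapshotTimeline {
    private final Set<String> completedCommits = new HashSet<>();
    // Versions of a single file, keyed by the commit time that wrote them.
    // Commit times are assumed to be strings that sort chronologically.
    private final TreeMap<String, String> versions = new TreeMap<>();

    // The single writer stages a new version; it is not yet visible to readers.
    void writeVersion(String commitTime, String fileName) {
        versions.put(commitTime, fileName);
    }

    // Publishing the commit makes every version it wrote visible at once.
    void commit(String commitTime) {
        completedCommits.add(commitTime);
    }

    // Latest committed version at or before asOf; pass an older instant for time travel.
    String visibleFile(String asOf) {
        for (Map.Entry<String, String> e
                : versions.headMap(asOf, true).descendingMap().entrySet()) {
            if (completedCommits.contains(e.getKey())) {
                return e.getValue();
            }
        }
        return null; // nothing committed yet at that instant
    }

    public static void main(String[] args) {
        SnapshotTimeline timeline = new SnapshotTimeline();
        timeline.writeVersion("20170215", "File1_v1.parquet");
        timeline.commit("20170215");
        timeline.writeVersion("20170216", "File1_v2.parquet"); // still in flight
        System.out.println(timeline.visibleFile("20170217"));  // readers stay on v1
        timeline.commit("20170216");
        System.out.println(timeline.visibleFile("20170217"));  // now v2
        System.out.println(timeline.visibleFile("20170215"));  // time travel back to v1
    }
}
```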
Storage
Design choices
● Hybrid Storage
○ Row based - Recent data
○ Column based - Cold data
● Compactor
● Insert vs Update during Ingest
○ Need for Index
● Ingest parallelism vs Query parallelism
○ Max file size
Partitioning
Implementation choices
● DFS - Directory Partitioning
○ Coarse grained
○ Need finer grained
■ Hash Bucket
■ Auto create partition on insert
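The two levels of partitioning above can be sketched as a date directory plus a hash bucket inside it. This is a minimal sketch under assumptions: the `yyyy/MM/dd` layout, the bucket function and the `PartitionRouter` name are illustrative and not taken from Hoodie, and the local filesystem stands in for a DFS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch: coarse-grained directory partition plus a finer-grained hash bucket.
class PartitionRouter {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // Coarse-grained: one directory per day.
    static String partitionPath(LocalDate day) {
        return day.format(DAY);
    }

    // Finer-grained: a stable bucket in [0, numBuckets) derived from the record key.
    static int bucket(String recordKey, int numBuckets) {
        return Math.floorMod(recordKey.hashCode(), numBuckets);
    }

    // Auto create partition on insert: create the directory if it does not exist yet.
    static Path ensurePartition(Path basePath, LocalDate day) {
        try {
            return Files.createDirectories(basePath.resolve(partitionPath(day)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2017, 2, 17);
        Path base = Paths.get(System.getProperty("java.io.tmpdir"), "hoodie-sketch");
        System.out.println(ensurePartition(base, day) + " bucket=" + bucket("trip-42", 8));
    }
}
```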
Introducing Hoodie
Hadoop Upsert anD Incrementals
https://github.com/uber/hoodie
https://eng.uber.com/hoodie
How do I ingest?
Show me the code !!!
// Describe the dataset: location, schema, write parallelism, index, storage and compaction
HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
    .withPath(path)
    .withSchema(schema)
    .withParallelism(500)
    .withIndexConfig(HoodieIndexConfig.newBuilder()
        .withIndexType(HoodieIndex.IndexType.BLOOM).build())
    .withStorageConfig(HoodieStorageConfig.newBuilder()
        .defaultStorage().build())
    .withCompactionConfig(HoodieCompactionConfig.newBuilder()
        .withCompactionStrategy(new BoundedIOCompactionStrategy()).build())
    .build();
JavaRDD<HoodieRecord> inputRecords = … // input data
HoodieWriteClient client = new HoodieWriteClient(sc, cfg);
// commitTime identifies this batch; it is assumed to be defined earlier
JavaRDD<WriteStatus> result = client.upsert(inputRecords, commitTime);
// inspectResultFailures is the application's own check of the per-record write statuses
boolean toCommit = inspectResultFailures(result);
if (toCommit) {
    client.commit(commitTime, result);   // publish the batch atomically
} else {
    client.rollback(commitTime);         // discard the partially written batch
}
Spark DAG
How is that graph looking?
Storage & Index
Implementation choices
Storage RDD
Every columnar file has one or more “redo” logs
● Row based Log Format - Apache Avro
○ Append block
○ Rollback
● Columnar Format - Apache Parquet
○ Predicate pushdown
○ Columnar compression
○ Vectorized Reading
Index RDD - Insert vs Update during Ingest
● Embedded
○ Bloom Filter
● External
○ Key Value store
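The embedded index option can be sketched with a small Bloom filter: a key the filter has never seen is certainly an insert, while a hit only means a possible update that still has to be confirmed against the file. The class below is a minimal sketch, not the filter Hoodie embeds; the sizes, hash functions and the `tag` helper are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical sketch of a per-file Bloom filter used to separate inserts from updates.
class KeyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    KeyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Record a key that is stored in the file.
    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // False means the key is definitely absent; true means it may be present.
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Tag an incoming record key during ingest.
    String tag(String key) {
        return mightContain(key) ? "POSSIBLE_UPDATE" : "INSERT";
    }

    // Derive the i-th bit position by combining two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = fnv1a(key);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // 32-bit FNV-1a over the UTF-8 bytes of the key.
    private static int fnv1a(String key) {
        int hash = 0x811c9dc5;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    public static void main(String[] args) {
        KeyBloomFilter filter = new KeyBloomFilter(1024, 3);
        System.out.println(filter.tag("trip-1")); // empty filter, so a definite insert
        filter.add("trip-1");
        System.out.println(filter.tag("trip-1")); // now a possible update
    }
}
```

A Bloom filter never reports a stored key as absent, so the only cost of a false positive is an unnecessary lookup in the file.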
Correctness
Implementation choices
● Commit File on DFS
○ Atomic rename to publish data
○ Consume from downstream Spark job
● Query Isolation
○ Multiple versions of data file
○ Query hook - InputFormat via SparkSQL
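Publishing a batch through an atomic rename can be sketched with `java.nio.file`. This is a sketch under assumptions: the local filesystem stands in for the DFS, and the `.inflight` and `.commit` file names and the `CommitPublisher` class are illustrative choices for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: a commit becomes visible only through one atomic rename.
class CommitPublisher {
    // Write the commit metadata under a temporary name, then rename it into place.
    static Path publish(Path metaDir, String commitTime, String metadata) {
        try {
            Files.createDirectories(metaDir);
            Path inflight = metaDir.resolve(commitTime + ".inflight");
            Path commit = metaDir.resolve(commitTime + ".commit");
            Files.write(inflight, metadata.getBytes(StandardCharsets.UTF_8));
            // Readers see either no commit file or the complete one, never a partial write.
            return Files.move(inflight, commit, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Downstream jobs and queries treat a batch as published once its commit file exists.
    static boolean isCommitted(Path metaDir, String commitTime) {
        return Files.exists(metaDir.resolve(commitTime + ".commit"));
    }

    public static void main(String[] args) {
        Path metaDir = Paths.get(System.getProperty("java.io.tmpdir"), "hoodie-sketch-meta");
        publish(metaDir, "20170217103000", "{\"files\": [\"File1_v2.parquet\"]}");
        System.out.println(isCommitted(metaDir, "20170217103000"));
    }
}
```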
Compaction
Implementation
● Compaction
○ Background Spark job
○ Lock log files
○ Minor
■ IO Bound Strategy
■ Improve Query Performance
○ Major
■ No log left behind
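The difference between minor and major compaction can be sketched as a selection problem under an IO budget. The planner below is a hypothetical illustration, not the logic of `BoundedIOCompactionStrategy`: the cost model (read the base file and its log, write a new base file) and the largest-log-first ordering are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: choose which file groups to compact within an IO budget.
class CompactionPlanner {
    static final class FileGroup {
        final String id;
        final long baseMb; // size of the columnar base file
        final long logMb;  // size of the row-based log waiting to be merged

        FileGroup(String id, long baseMb, long logMb) {
            this.id = id;
            this.baseMb = baseMb;
            this.logMb = logMb;
        }

        // Assumed cost: read base and log, then write a new base of about the same size.
        long compactionIoMb() {
            return 2 * baseMb + logMb;
        }
    }

    // Minor compaction: greedily pick the largest logs that still fit the budget.
    // Major compaction: pass an unbounded budget so that no log is left behind.
    static List<String> plan(List<FileGroup> groups, long ioBudgetMb) {
        List<FileGroup> sorted = new ArrayList<>(groups);
        sorted.sort(Comparator.comparingLong((FileGroup g) -> g.logMb).reversed());
        List<String> selected = new ArrayList<>();
        long usedMb = 0;
        for (FileGroup g : sorted) {
            if (g.logMb == 0) {
                continue; // nothing to merge
            }
            if (usedMb + g.compactionIoMb() <= ioBudgetMb) {
                usedMb += g.compactionIoMb();
                selected.add(g.id);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<FileGroup> groups = Arrays.asList(
                new FileGroup("A", 100, 50),
                new FileGroup("B", 100, 10),
                new FileGroup("C", 100, 30));
        System.out.println("minor: " + plan(groups, 500));
        System.out.println("major: " + plan(groups, Long.MAX_VALUE));
    }
}
```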
How can I query?
Show me the code !!!
SparkSession spark = SparkSession.builder()
    .appName("Hoodie SparkSQL")
    // let the Hive InputFormat, not Spark's built-in Parquet reader, resolve the table files
    .config("spark.sql.hive.convertMetastoreParquet", false)
    .enableHiveSupport()
    .getOrCreate();
// real time query
spark.sql("select fare, begin_lon, begin_lat, timestamp from hoodie.trips_rt where fare > 100.0").show();
// read optimized query
spark.sql("select fare, begin_lon, begin_lat, timestamp from hoodie.trips_ro where fare > 100.0").show();
// Spark Datasource (WIP)
Dataset<Row> dataset = spark.read().format(HOODIE_SOURCE_NAME)
    .option("query", "SELECT driverUUID, riderUUID FROM trips").load();
Query
Design choices
RUM Conjecture - Moving within the chosen area
Read Optimized View
- Pick only columnar files for querying
- Raw Parquet Query Performance
- Freshness of Major Compaction
Real Time View
- Hybrid of row and columnar data
- Brings near-real time tables
- SparkSQL with spark.sql.hive.convertMetastoreParquet=false
[Figure: query execution time vs. data latency for the REALTIME and READ OPTIMIZED views]
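The two views can be sketched as two ways of reading the same file group. This is a simplified illustration under assumptions: plain maps stand in for the Parquet base file and the Avro log, and the `HoodieViewsSketch` class and its methods are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: one file group read through the two views.
class HoodieViewsSketch {
    // Read optimized view: only the columnar base file, as fresh as the last compaction.
    static Map<String, Double> readOptimized(Map<String, Double> baseFile) {
        return new TreeMap<>(baseFile);
    }

    // Real time view: base file merged with the row log at query time; later log entries win.
    static Map<String, Double> realTime(Map<String, Double> baseFile,
                                        List<Map.Entry<String, Double>> log) {
        Map<String, Double> merged = new TreeMap<>(baseFile);
        for (Map.Entry<String, Double> entry : log) {
            merged.put(entry.getKey(), entry.getValue());
        }
        return merged;
    }

    // Stand-in for "where fare > 100.0" in the queries.
    static long countFaresAbove(Map<String, Double> view, double threshold) {
        return view.values().stream().filter(fare -> fare > threshold).count();
    }

    public static void main(String[] args) {
        Map<String, Double> base = new TreeMap<>();
        base.put("t1", 90.0);
        base.put("t2", 120.0);
        List<Map.Entry<String, Double>> log = Arrays.asList(
                new SimpleEntry<>("t1", 150.0),  // update not yet compacted
                new SimpleEntry<>("t3", 200.0)); // insert not yet compacted
        System.out.println("read optimized: " + countFaresAbove(readOptimized(base), 100.0));
        System.out.println("real time:      " + countFaresAbove(realTime(base, log), 100.0));
    }
}
```

The read optimized view does less work per query but misses the two logged changes; the real time view pays for the merge and sees them.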
[Figure: Hoodie dataset layout. An input changelog is ingested into a Hoodie dataset partitioned by date (2017/02/15, 2017/02/16, 2017/02/17) with an index; a file group holds File1_v1.parquet, File1_v2.parquet and File1.avro.log (labels: 10 GB, 5 min batch); Spark SQL queries the Realtime View and the Read Optimized View.]
Community
Share love and code
Shopify evaluating for use
- Incremental DB ingestion onto GCS
- Early interest from multiple companies
Engage with us on Github (uber/hoodie)
- Look for “beginner-task” tagged issues
- Try out tools & utilities
Uber is hiring for “Hoodie”
- Staff Engineer
Future Plans
Aim high
● Productionizing on AWS S3/EFS, GCP
● Spark Datasource
● Structured Streaming Sink
● Performance in Read Path
○ Presto plugin
○ Impala
● Spark Caching and integration with Apache Arrow
● Beam Runner
Questions?

04242024_CCC TUG_Joins and Relationships
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 

Hoodie: How (And Why) We built an analytical datastore on Spark

  • 1. Prasanna Rajaperumal, Engineer, Uber Hoodie How (and Why) Uber built an Analytical datastore On Spark June, 2017
  • 2. What’s our problem? Queryable State for Analytics Queryable state == mutations ● Pure Dimension Table e.g. Users ● Fact Tables that can get super large and need a materialized view e.g. Trips ● Late Arriving Data ○ Event time vs Processing time ● Delete records (Compliance) ● Data correction upstream Analytics == Big Scans ● Super fast scans on subset of columns ● Large time ranges - Lots of data source
  • 3. Okay, so what did we want? OLAP Database ● Scale and complexity ○ Scale horizontally [Petabytes] ○ Support Nested columns ○ Batch ingest and Analytical scans ● Latency ○ Ingest Latency ~ 10 minutes ○ Query Latency ~ up to 2 minutes ● Multi tenant - High throughput ● Transactional - ACID ● Self Healing ○ Fewer tunable knobs ○ Handle data skew ○ Auto scale with load ○ Failure Recovery ○ Rollback and Savepoints source
  • 4. Okay, Could you do ...? Solutions that did not work for us ● OLAP RDBMS ○ Petabyte scale ○ Elastic scaling of compute ● No/New SQL (LSM) ○ Scan performance ○ Operations involved - Compaction ● Hack around it ○ Dump LSM Snapshot ○ Rewrite partitions too costly ○ Watermark - Approximations ● Hive Transactions ○ Hive specific solution ○ Hash bucketing - tuning? ● Apache Kudu ○ Separate storage server ○ Eco system support source
  • 5. Let's design what we want. We have 20 minutes.
  • 6. “Software Engineer engineers the illusion of simplicity” Grady Booch, UML Creator
  • 7. Pick the area in RUM triangle Design choices ● RUM Conjecture ○ Optimize 2 at the expense of the third ● Fast data - Write Optimized ○ Control Read Amplification ○ Query execution cost ● Fast Scans - Read Optimized ○ Control Write Amplification ○ Ingestion cost ● Choice per client/query source
  • 8. Pick Framework Leveraging Spark’s Elasticity + Scalability + Speed ● Spark + DFS vs Storage Server ○ Batch engine vs MPP engine ■ Throughput vs Latency ■ Flexibility to go batch or streaming ■ Dynamic Resource Allocation ○ Complexity ■ Static Partitioning ■ Dedicated resources ■ Consensus ○ Scaling ■ Auto Scaling with load using Spark ○ Resiliency and Recovery (RDD) ■ Simplify Application Abstraction ■ Self Healing ○ Simplified API Layer
  • 9. Correctness - ACID Design choices ● Atomic ingest of a batch ○ Based on Processing time ○ Cross row atomicity ● Strong consistency ● Single Writer | Multiple Reader ● High query concurrency ○ Query Isolation using Snapshot ● Time travel ○ Temporal queries source
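The "Query Isolation using Snapshot" and "Time travel" bullets on slide 9 can be pictured with a small sketch. It is not Hoodie code: the class `FileVersions` and its methods are hypothetical, and it assumes commit times are fixed-width timestamp strings, so that string order matches time order. A reader pinned to a commit time only sees the newest file version committed at or before that time.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: versions of one data file, keyed by the commit time that produced them. */
final class FileVersions {
    // Assumption: commit times are fixed-width strings such as "20170217103000",
    // so lexicographic order equals chronological order.
    private final TreeMap<String, String> committed = new TreeMap<>();

    /** Record that the given commit published the given file name. */
    void publish(String commitTime, String fileName) {
        committed.put(commitTime, fileName);
    }

    /** Snapshot read: newest version committed at or before asOfCommitTime, or null if none. */
    String fileAsOf(String asOfCommitTime) {
        Map.Entry<String, String> entry = committed.floorEntry(asOfCommitTime);
        return entry == null ? null : entry.getValue();
    }
}
```

A query that started before a later commit keeps resolving to the older file, which is one way to read "multiple readers, single writer" without locking readers out.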
  • 10. Storage Design choices ● Hybrid Storage ○ Row based - Recent data ○ Column based - Cold data ● Compactor ● Insert vs Update during Ingest ○ Need for Index ● Ingest parallelism vs Query parallelism ○ Max file size
  • 11. Partitioning Implementation choices ● DFS - Directory Partitioning ○ Coarse grained ○ Need finer grained ■ Hash Bucket ■ Auto create partition on insert
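As a rough illustration of the "Hash Bucket" idea on slide 11, the sketch below places a record into a finer-grained bucket under a coarse date directory. The path layout, the `bucket-N` naming and the class `BucketPath` are assumptions made for this example only, not Hoodie's actual layout.

```java
/** Hypothetical sketch: finer-grained hash buckets inside a coarse DFS directory partition. */
final class BucketPath {
    /** Map a record key to a bucket in [0, numBuckets). floorMod keeps the result non-negative. */
    static int bucketOf(String recordKey, int numBuckets) {
        return Math.floorMod(recordKey.hashCode(), numBuckets);
    }

    /** Build an illustrative path: basePath / date partition / bucket-N. */
    static String pathFor(String basePath, String datePartition, String recordKey, int numBuckets) {
        return basePath + "/" + datePartition + "/bucket-" + bucketOf(recordKey, numBuckets);
    }
}
```

Because the bucket is derived from the key alone, an update for an existing key lands in the same bucket as the original insert, which keeps the lookup for "insert vs update" local to one bucket.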
  • 12. Introducing Hoodie Hadoop Upsert anD Incrementals https://github.com/uber/hoodie https://eng.uber.com/hoodie
  • 13. How do I ingest? Show me the code !!!

        // Write configuration: target path, schema, parallelism, index, storage and compaction
        HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
            .withPath(path)
            .withSchema(schema)
            .withParallelism(500)
            .withIndexConfig(HoodieIndexConfig.newBuilder()
                .withIndexType(HoodieIndex.IndexType.BLOOM).build())
            .withStorageConfig(HoodieStorageConfig.newBuilder()
                .defaultStorage().build())
            .withCompactionConfig(HoodieCompactionConfig.newBuilder()
                .withCompactionStrategy(new BoundedIOCompactionStrategy()).build())
            .build();

        JavaRDD<HoodieRecord> inputRecords = … // input data
        HoodieWriteClient client = new HoodieWriteClient(sc, cfg);

        // Upsert the batch under a single commit time
        JavaRDD<WriteStatus> result = client.upsert(inputRecords, commitTime);

        // Publish the batch only if the write results are acceptable; otherwise undo it
        boolean toCommit = inspectResultFailures(result);
        if (toCommit) {
            client.commit(commitTime, result);
        } else {
            client.rollback(commitTime);
        }
  • 14. Spark DAG How is that graph looking?
  • 15. Storage & Index Implementation choices Storage RDD Every columnar file has one or more “redo” log ● Row based Log Format - Apache Avro ○ Append block ○ Rollback ● Columnar Format - Apache Parquet ○ Predicate pushdown ○ Columnar compression ○ Vectorized Reading Index RDD - Insert vs Update during Ingest ● Embedded ○ Bloom Filter ● External ○ Key Value store source
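To make the "Insert vs Update during Ingest" role of an embedded Bloom filter concrete, here is a minimal sketch. It is not the filter Hoodie embeds in its files: `KeyBloomFilter`, its CRC32-based hashing and its sizing are assumptions chosen to keep the example self-contained. The useful property is one-sided: a negative answer means the key is definitely new (an insert), while a positive answer only means "possibly present", so the file still has to be checked before treating the record as an update.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.zip.CRC32;

/** Hypothetical sketch of a per-file Bloom filter over record keys. */
final class KeyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    KeyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Derive the i-th bit position by hashing the key together with a seed. */
    private int position(String key, int seed) {
        CRC32 crc = new CRC32();
        crc.update((seed + ":" + key).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numBits); // getValue() is non-negative
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    /** False means "definitely absent"; true means "possibly present" (false positives can occur). */
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    /** A key the filter has never seen can be routed straight to the insert path. */
    boolean isDefinitelyInsert(String key) {
        return !mightContain(key);
    }
}
```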
  • 16. Correctness Implementation choices ● Commit File on DFS ○ Atomic rename to publish data ○ Consume from downstream Spark job ● Query Isolation ○ Multiple versions of data file ○ Query hook - InputFormat via SparkSQL source
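The "Atomic rename to publish data" bullet can be sketched with the standard `java.nio.file` API. This is an illustration on a local filesystem standing in for the DFS; the class `CommitPublisher` and the `.inflight` / `.commit` file suffixes are assumptions for the example. The point is that readers only ever look for the final commit file, so a batch is either fully visible or not visible at all.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch: publish a batch by atomically renaming a marker file. */
final class CommitPublisher {
    /** Write the commit metadata under a temporary name, then rename it in one atomic step. */
    static Path publish(Path metaDir, String commitTime, String payload) {
        try {
            Path inflight = metaDir.resolve(commitTime + ".inflight");
            Path commit = metaDir.resolve(commitTime + ".commit");
            Files.write(inflight, payload.getBytes(StandardCharsets.UTF_8));
            return Files.move(inflight, commit, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Readers treat a batch as visible only once its commit file exists. */
    static boolean isCommitted(Path metaDir, String commitTime) {
        return Files.exists(metaDir.resolve(commitTime + ".commit"));
    }

    /** Helper for trying the sketch out: a scratch directory playing the role of the metadata folder. */
    static Path newTempMetaDir() {
        try {
            return Files.createTempDirectory("commit-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A downstream job can then list the commit files to find out which batches it has not consumed yet.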
  • 17. Compaction Implementation ● Compaction ○ Background Spark job ○ Lock log files ○ Minor ■ IO Bound Strategy ■ Improve Query Performance ○ Major ■ No log left behind source
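One way to picture an "IO Bound Strategy" for minor compaction is a greedy planner with a byte budget. The sketch below is not the `BoundedIOCompactionStrategy` shipped with Hoodie; the class names, the cost model (base file bytes plus log bytes) and the largest-log-first ordering are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: choose file groups to compact without exceeding an IO budget. */
final class IoBoundedCompactionPlanner {
    /** One columnar base file plus the total size of its row-based logs. */
    static final class FileGroup {
        final String id;
        final long baseBytes;
        final long logBytes;

        FileGroup(String id, long baseBytes, long logBytes) {
            this.id = id;
            this.baseBytes = baseBytes;
            this.logBytes = logBytes;
        }
    }

    /** Greedy plan: largest logs first, skipping any group that no longer fits the budget. */
    static List<String> plan(List<FileGroup> groups, long ioBudgetBytes) {
        List<FileGroup> sorted = new ArrayList<>(groups);
        sorted.sort(Comparator.comparingLong((FileGroup g) -> g.logBytes).reversed());

        List<String> chosen = new ArrayList<>();
        long remaining = ioBudgetBytes;
        for (FileGroup g : sorted) {
            long cost = g.baseBytes + g.logBytes; // assumed cost: bytes read to merge the group
            if (g.logBytes > 0 && cost <= remaining) {
                chosen.add(g.id);
                remaining -= cost;
            }
        }
        return chosen;
    }
}
```

A major compaction in this picture would simply run with an unbounded budget, so that no log is left behind.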
  • 18. How can I query? Show me the code !!!

        SparkSession spark = SparkSession.builder()
            .appName("Hoodie SparkSQL")
            // Read through the Hive input format instead of Spark's built-in Parquet reader
            .config("spark.sql.hive.convertMetastoreParquet", false)
            .enableHiveSupport()
            .getOrCreate();

        // real time query
        spark.sql("select fare, begin_lon, begin_lat, timestamp from hoodie.trips_rt where fare > 100.0").show();

        // read optimized query
        spark.sql("select fare, begin_lon, begin_lat, timestamp from hoodie.trips_ro where fare > 100.0").show();

        // Spark Datasource (WIP)
        Dataset<Row> dataset = sqlContext.read().format(HOODIE_SOURCE_NAME)
            .option("query", "SELECT driverUUID, riderUUID FROM trips").load();
  • 19. Query Design choices RUM Conjecture - Moving within the chosen area Read Optimized View - Pick only columnar files for querying - Raw Parquet Query Performance - Freshness of Major Compaction Real Time View - Hybrid of row and columnar data - Brings near-real time tables - SparkSQL with spark.sql.hive.convertMetastoreParquet=false [Figure: REALTIME vs READ OPTIMIZED; axes: query execution time, data latency]
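The difference between the two views on slide 19 can be sketched in a few lines. This is a toy model, not Hoodie's reader: `TripViews`, the use of a map from trip id to fare as the "columnar" data, and a list of key/value pairs as the "row log" are assumptions for the example. The read optimized view looks only at the compacted columnar data, while the real time view overlays the not-yet-compacted log records on top of it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: two views over the same file group (trip id -> fare). */
final class TripViews {
    /** Read optimized: columnar data only, so it is as fresh as the last compaction. */
    static Map<String, Double> readOptimized(Map<String, Double> columnarBase) {
        return new LinkedHashMap<>(columnarBase);
    }

    /** Real time: columnar data with the row log applied in order, later records winning. */
    static Map<String, Double> realTime(Map<String, Double> columnarBase,
                                        List<Map.Entry<String, Double>> logRecords) {
        Map<String, Double> merged = new LinkedHashMap<>(columnarBase);
        for (Map.Entry<String, Double> record : logRecords) {
            merged.put(record.getKey(), record.getValue());
        }
        return merged;
    }
}
```

The merge is the extra work the real time view pays at query time in exchange for lower data latency.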
  • 20. [Diagram labels: Input Changelog (5 min batch, 10 GB) → Hoodie Dataset with date partitions 2017/02/15, 2017/02/16, 2017/02/17, each with an Index; a FileGroup containing File1_v1.parquet, File1_v2.parquet and File1.avro.log; Realtime View and Read Optimized View, queried through Spark SQL]
  • 21. Shopify evaluating for use - Incremental DB ingestion onto GCS - Early interest from multiple companies Engage with us on Github (uber/hoodie) - Look for “beginner-task” tagged issues - Try out tools & utilities Uber is hiring for “Hoodie” - Staff Engineer Community Share love and code
  • 22. ● Productionizing on AWS S3/EFS, GCP ● Spark Datasource ● Structured Streaming Sink ● Performance in Read Path ○ Presto plugin ○ Impala ● Spark Caching and integration with Apache Arrow ● Beam Runner Future Plans Aim high