Data Infra Meetup | Uber's Data Storage Evolution

Data Storage Evolution at Uber
Jing Zhao, Uber

Data informs every decision at Uber
Marketplace
Pricing
Community
Operations
Growth Marketing
Data Science
Compliance
Eats

Uber’s Batch Data Stack
● Total Data Footprint: 1.5+ EB
● Presto® Queries Daily: 500K+
● Apache Spark® Apps Daily: 370K+

Apache Hadoop®/HDFS @ Uber
● 30 Clusters
● 2 Regions
● 1.5EB Data Footprint
● 11K Nodes

Scalability and Modernization
● HDFS Router-based Federation (2019 ~ 2020)
● Containerization and Automation (2020 ~ 2023)

HDFS Router-based Federation
● R/W routers + read-only routers
● In production at Uber since 2019
● Greatly improved HDFS scalability
● Distributes traffic across 30 HDFS clusters
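
A router in a federated setup has to decide which cluster owns a given path. The following is a minimal sketch of longest-prefix matching against a mount table. The mount points and nameservice names are hypothetical, and this is not the HDFS Router implementation, only an illustration of the idea.

```python
from typing import Dict

# Hypothetical mount table: logical path prefix -> HDFS nameservice (cluster).
MOUNT_TABLE: Dict[str, str] = {
    "/": "ns-default",
    "/user": "ns-user",
    "/data/warehouse": "ns-warehouse",
}


def resolve(path: str, table: Dict[str, str] = MOUNT_TABLE) -> str:
    """Return the nameservice whose mount point is the longest prefix of path."""
    best = None
    for mount in table:
        prefix = mount.rstrip("/")
        # Match whole path components only, so "/username" does not hit "/user".
        if path == mount or path.startswith(prefix + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    return table[best]


print(resolve("/user/alice/part-0000"))
print(resolve("/data/warehouse/trips/date=2024-01-01"))
print(resolve("/tmp/scratch"))
```

Clients keep using one logical namespace while the table decides which of the clusters serves each subtree.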

Containerization and Automation
● Containerized across data plane and control plane
○ Including NameNode with 300+ GB heap size
● Fully automated cluster management
○ Managing 11K nodes
○ NameNode (NN) + JournalNode (JN)

Data Storage Efficiency
● Erasure Coding: reduce storage overhead (2020 ~ 2022)
● High-Density HDD: reduce storage unit cost (2022 ~ 2023)

HDFS Erasure Coding
[Architecture diagram. Components: Clients (Hadoop 2.x), HDFS Router, EC Access Proxy, HDFS Hot Clusters, HDFS EC Clusters (Hadoop 3), Data Correctness Scanner, Replicated Data Detector, Offline EC Converter. Edge labels: RPC, Data Transfer.]
● 50% storage saving with Reed–Solomon(6, 3)
● EC access proxy
○ Seamless access for Hadoop 2.x clients
○ Avoids a Hadoop version upgrade
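
The 50% figure can be checked with a short calculation. This sketch assumes the baseline is 3-way replication, which the slide does not state explicitly.

```python
def storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per logical byte under Reed-Solomon(data, parity)."""
    return (data_units + parity_units) / data_units


# Assumption: the data was previously stored with 3 full replicas.
REPLICATION_FACTOR = 3

ec_overhead = storage_overhead(6, 3)            # 9 units stored per 6 data units
saving = 1 - ec_overhead / REPLICATION_FACTOR   # fraction of raw space freed

print(f"RS(6,3) overhead: {ec_overhead:.1f}x, saving vs 3x replication: {saving:.0%}")
```

Under that assumption, Reed–Solomon(6, 3) stores 1.5 bytes per logical byte instead of 3, while each stripe still survives the loss of any three of its nine units.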

Adopting High-Density HDD in HDFS
● Capacity per Host: 4TB * 24 → 16TB * 35
● Efficiency: >50% HW cost reduction
● Challenges
○ DataNode IO performance
○ HDFS reliability (blast radius)
● Opportunities
○ Traffic concentrates on a small group of extremely hot blocks
○ Top 10K blocks attracted >90% of read traffic
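
A quick calculation shows why the denser hosts raise both challenges. The sketch assumes that a disk's IO capability does not grow with its capacity, so disks per TB serves as a rough proxy for available IO. That proxy is an illustration, not a figure from the talk.

```python
def host_capacity_tb(disk_tb: int, disks: int) -> int:
    """Raw capacity of one host in TB."""
    return disk_tb * disks


old_tb = host_capacity_tb(4, 24)    # previous host layout
new_tb = host_capacity_tb(16, 35)   # high-density host layout

# Assumption: per-disk IO is roughly constant, so disks per TB approximates
# how much IO is available for each TB of stored data.
old_disks_per_tb = 24 / old_tb
new_disks_per_tb = 35 / new_tb

print(f"capacity per host: {old_tb} TB -> {new_tb} TB ({new_tb / old_tb:.2f}x)")
print(f"disks per TB: {old_disks_per_tb:.4f} -> {new_disks_per_tb:.4f}")
```

Each host holds almost six times more data, so a single host failure affects far more blocks (the blast radius), and each TB is backed by a quarter of the spindles.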

DataNode Local Cache
● Build a local cache within DataNode
○ 4TB NVMe SSD
○ Based on DataNode local traffic
● Leverage Alluxio for cache management
○ Page-level cache
○ 1MB default page size matches traffic pattern
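
To make "page-level cache" concrete, here is a minimal in-memory sketch that caches fixed 1 MB pages of a block and evicts the least recently used page. The class, the `read_from_disk` callback, and the LRU policy are assumptions for illustration. This is not Alluxio's API or the DataNode integration described in the talk.

```python
from collections import OrderedDict
from typing import Callable, Tuple

PAGE_SIZE = 1 << 20  # 1 MB, the page size mentioned on the slide


class PageCache:
    """Hypothetical LRU cache keyed by (block_id, page_index)."""

    def __init__(self, capacity_pages: int,
                 read_from_disk: Callable[[str, int, int], bytes]):
        self._capacity = capacity_pages
        self._read_from_disk = read_from_disk  # (block_id, offset, length) -> bytes
        self._pages: "OrderedDict[Tuple[str, int], bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _get_page(self, block_id: str, page: int) -> bytes:
        key = (block_id, page)
        if key in self._pages:
            self._pages.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._pages[key]
        self.misses += 1
        data = self._read_from_disk(block_id, page * PAGE_SIZE, PAGE_SIZE)
        self._pages[key] = data
        if len(self._pages) > self._capacity:
            self._pages.popitem(last=False)  # evict least recently used page
        return data

    def read(self, block_id: str, offset: int, length: int) -> bytes:
        """Serve a byte range by stitching together whole cached pages."""
        end = offset + length
        out = bytearray()
        for page in range(offset // PAGE_SIZE, (end - 1) // PAGE_SIZE + 1):
            start = page * PAGE_SIZE
            data = self._get_page(block_id, page)
            out += data[max(offset, start) - start:min(end, start + PAGE_SIZE) - start]
        return bytes(out)
```

Because a small set of blocks receives most reads, even a cache far smaller than the HDD capacity can absorb a large share of read traffic from the disks.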

Cloud Migration
2023 ~ Present
● Replacing HDFS with Cloud Object Storage
● Hybrid Cloud and Multi-Cloud Architectures

Cloud Object Storage
● Migrating Batch Data Processing Stack to Google® Cloud Platform (GCP)
● Replace HDFS with Google® Cloud Storage (GCS)
● Logical namespace to abstract out internal bucket layout
● Performance optimizations

Performance optimizations (Perf/Func area → Optimizations)
● IO capacity limits
○ Traffic balancing and bucket pre-splits
● Write throughput
○ gVNIC adoption for aggregated throughput improvement: 20 Gbps → 32 Gbps
○ Parallel composite uploads for single-writer throughput improvement
● Read/Listing latencies
○ gRPC APIs for better performance consistency
○ Presto: local SSD cache
○ Hive/Spark: parallel listing for partitioned data
○ Hudi: performance improvements with 0.14 features
● Rename
○ Failure handling and Python library enhancement
○ Spark optimized file output committer
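
The idea behind a parallel composite upload is to split one large object into parts, upload the parts concurrently, and then compose them into the final object. The sketch below shows that flow against an in-memory stand-in. `FakeObjectStore` and the `.part-N` naming are hypothetical and do not represent a real GCS client library.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class FakeObjectStore:
    """In-memory stand-in for an object store (illustration only)."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, name: str, data: bytes) -> None:
        self.objects[name] = data

    def compose(self, sources: List[str], dest: str) -> None:
        # Concatenate the source objects, in order, into the destination.
        self.objects[dest] = b"".join(self.objects[s] for s in sources)

    def delete(self, name: str) -> None:
        del self.objects[name]


def parallel_composite_upload(store: FakeObjectStore, dest: str, data: bytes,
                              part_size: int, workers: int = 4) -> None:
    """Upload data as concurrent parts, then compose them into dest."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)] or [b""]
    names = [f"{dest}.part-{i:05d}" for i in range(len(parts))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any upload error.
        list(pool.map(store.put, names, parts))
    store.compose(names, dest)
    for name in names:  # clean up the temporary part objects
        store.delete(name)


store = FakeObjectStore()
payload = bytes(range(256)) * 100
parallel_composite_upload(store, "warehouse/trips/file.parquet", payload, part_size=1000)
print(sorted(store.objects), len(store.objects["warehouse/trips/file.parquet"]))
```

A single writer is no longer limited to one upload stream, at the cost of extra part objects that have to be composed and cleaned up afterwards.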

Hybrid Cloud Architecture (WIP)
● One logical DataLake on unified data storage
○ Across on-prem HDFS and Cloud object storage
○ Logical paths to abstract out internal details
● Optimizations for
○ Ingress/Egress traffic cost
○ Data storage cost
Tables and Blobs: Unified Multi-Cloud Storage (Future)
● Tables and Blobs
● Multi-Cloud architecture
○ Google Cloud Platform (GCP)
○ Oracle® Cloud Infrastructure (OCI)
● Data orchestration and caching

"Apache®, Apache Hadoop®, Hadoop®, and Apache Spark® are either registered trademarks or trademarks of the Apache
Software Foundation® in the United States and/or other countries. No endorsement by The Apache Software Foundation® is
implied by the use of these marks."
"Google®, Google Cloud Platform®, and Google Cloud Storage® are either registered trademarks or trademarks of Google LLC in
the United States and/or other countries. No endorsement by Google LLC is implied by the use of these marks."
"Oracle® is a registered trademark of Oracle Corporation. No endorsement by Oracle Corporation is implied by the use of the
mark."
"Presto® is a registered trademark of LF Projects, LLC. No endorsement by LF Projects, LLC is implied by the use of the mark."
