Data Storage Evolution in Uber
Jing Zhao, Uber
Data informs every decision at Uber
Marketplace
Pricing
Community
Operations
Growth Marketing Data Science
Compliance
Eats
Uber’s Batch Data Stack
● Total Data Footprint: 1.5+ EB
● Presto® Queries Daily: 500K+
● Apache Spark® Apps Daily: 370K+
Apache Hadoop®/HDFS @ Uber
● 30 Clusters
● 2 Regions
● 1.5EB Data Footprint
● 11K Nodes
Scalability and Modernization
● HDFS Router-based Federation (2019 ~ 2020)
● Containerization and Automation (2020 ~ 2023)
HDFS Router-based Federation
● R/W routers + read-only routers
● Rolled out to Uber’s production since 2019
● Greatly improved HDFS scalability
● Distributes traffic across 30 HDFS clusters
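As a rough illustration of how a router can spread one namespace over many clusters, the sketch below resolves a client path against a mount table by longest-prefix match. The mount entries, cluster names, and the `resolve` helper are hypothetical and are not taken from the talk or from Uber's configuration.

```python
from typing import Dict, Tuple

# Hypothetical mount table: logical prefix -> (cluster name, prefix on that cluster).
MOUNT_TABLE: Dict[str, Tuple[str, str]] = {
    "/": ("cluster-default", "/"),
    "/warehouse": ("cluster-a", "/warehouse"),
    "/warehouse/logs": ("cluster-b", "/data/logs"),
}


def resolve(path: str, mounts: Dict[str, Tuple[str, str]] = MOUNT_TABLE) -> Tuple[str, str]:
    """Return (cluster, physical path) for the longest mount prefix covering `path`."""
    best = None
    for prefix in mounts:
        # A prefix matches only on whole path components, so "/warehouse"
        # does not capture "/warehouse2".
        is_match = prefix == "/" or path == prefix or path.startswith(prefix + "/")
        if is_match and (best is None or len(prefix) > len(best)):
            best = prefix
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    cluster, target = mounts[best]
    remainder = path[len(best):].lstrip("/")
    physical = target.rstrip("/") + "/" + remainder if remainder else target
    return cluster, physical


print(resolve("/warehouse/logs/2024/01"))  # ('cluster-b', '/data/logs/2024/01')
```

Clients keep using one logical path while the table decides which cluster serves it, so data can be moved between clusters by changing the table rather than the clients.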
Containerization and Automation
● Containerized across data plane and control plane
○ Including NameNode with 300+ GB heap size
● Fully automated cluster management
○ Managing 11K nodes
○ NameNode (NN) + JournalNode (JN)
Data Storage Efficiency
● Erasure Coding: reduce storage overhead (2020 ~ 2022)
● High-Density HDD: reduce storage unit cost (2022 ~ 2023)
HDFS Erasure Coding
[Architecture diagram. Components shown: Clients (Hadoop 2.x), HDFS Router, EC Access Proxy, HDFS Hot Clusters, HDFS EC Clusters (Hadoop 3), Data Correctness Scanner, Replicated Data Detector, Offline EC Converter. Connections are labeled RPC and Data Transfer.]
● 50% storage saving with Reed–Solomon(6, 3)
● EC access proxy
○ Seamless access for Hadoop 2.x clients
○ Avoid Hadoop version upgrade
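The 50% figure can be checked with a short calculation. It assumes the baseline is 3x replication (the HDFS default replication factor); the slide does not state the baseline, so treat that as an assumption.

```python
def storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per byte of user data for an RS(data, parity) layout."""
    return (data_units + parity_units) / data_units


replication_overhead = 3.0            # assumed baseline: 3 full replicas
rs_6_3 = storage_overhead(6, 3)       # 9 stored units per 6 data units = 1.5
saving = 1 - rs_6_3 / replication_overhead

print(rs_6_3)   # 1.5
print(saving)   # 0.5
```

RS(6, 3) still tolerates the loss of any three units in a stripe, which is why it can stand in for three replicas at half the raw footprint.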
Adopting High-Density HDD in HDFS
● Capacity per Host: 4TB * 24 → 16TB * 35
● Efficiency: >50% HW cost reduction
● Challenges
○ DataNode IO performance
○ HDFS reliability (blast radius)
● Opportunities
○ Traffic focuses on a small group of extremely hot blocks
○ Top 10K blocks attracted >90% read traffic
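To make the numbers on this slide concrete, the sketch below computes raw capacity per host before and after the change, and shows one way the "top K blocks" read share could be measured from an access log. The helper names and the toy access list are illustrative only; they are not Uber's tooling or data.

```python
from collections import Counter
from typing import Iterable


def raw_capacity_tb(disk_tb: int, disks: int) -> int:
    """Raw capacity of one host in TB."""
    return disk_tb * disks


def top_k_read_share(block_reads: Iterable[str], k: int) -> float:
    """Fraction of all reads that hit the k most-read blocks."""
    counts = Counter(block_reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


old_host = raw_capacity_tb(4, 24)     # 96 TB
new_host = raw_capacity_tb(16, 35)    # 560 TB
print(old_host, new_host)

# Toy access log: one hot block and one cold block.
print(top_k_read_share(["blk_hot"] * 9 + ["blk_cold"], k=1))  # 0.9
```

Each host now holds roughly 5.8x more data, so a single host failure affects more blocks and each spindle serves more requests. These are the blast-radius and IO concerns listed above, and the read skew is what makes a small cache effective.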
DataNode Local Cache
● Build a local cache within DataNode
○ 4TB NVMe SSD disk
○ Based on DataNode local traffic
● Leverage Alluxio for cache management
○ Page-level cache
○ 1MB default page size matches traffic pattern
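The idea of a page-level cache can be sketched as follows: reads are split into fixed-size pages keyed by (block, page index), and only the touched pages are kept on fast storage. This is a minimal in-memory illustration, not Alluxio's API. The `PageCache` class, the LRU eviction policy, and the `read_from_disk` callback are all assumptions.

```python
from collections import OrderedDict
from typing import Callable, Tuple

PAGE_SIZE = 1 << 20  # 1 MB, the default page size mentioned on the slide


class PageCache:
    """Tiny LRU cache of fixed-size pages (illustrative only)."""

    def __init__(self, capacity_pages: int,
                 read_from_disk: Callable[[str, int, int], bytes]) -> None:
        self.capacity_pages = capacity_pages
        self.read_from_disk = read_from_disk  # (block_id, offset, length) -> bytes
        self.pages: "OrderedDict[Tuple[str, int], bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _get_page(self, block_id: str, page_index: int) -> bytes:
        key = (block_id, page_index)
        if key in self.pages:
            self.pages.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.pages[key]
        self.misses += 1
        page = self.read_from_disk(block_id, page_index * PAGE_SIZE, PAGE_SIZE)
        self.pages[key] = page
        if len(self.pages) > self.capacity_pages:
            self.pages.popitem(last=False)  # evict the least recently used page
        return page

    def read(self, block_id: str, offset: int, length: int) -> bytes:
        """Serve a byte range by stitching together the pages that cover it."""
        out = bytearray()
        pos, end = offset, offset + length
        while pos < end:
            page_index = pos // PAGE_SIZE
            page = self._get_page(block_id, page_index)
            start = pos - page_index * PAGE_SIZE
            chunk = page[start:start + min(end - pos, PAGE_SIZE - start)]
            if not chunk:
                break  # reached the end of the block
            out += chunk
            pos += len(chunk)
        return bytes(out)
```

Caching at page granularity means a hot 1 MB region of a block can stay on SSD without pulling in the whole block.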
Cloud Migration
2023 ~ Present
● Replacing HDFS with Cloud Object Storage
● Hybrid Cloud and Multi-Cloud Architectures
Cloud Object Storage
● Migrating Batch Data Processing Stack to Google® Cloud Platform (GCP)
● Replace HDFS with Google® Cloud Storage (GCS)
● Logical namespace to abstract out internal bucket layout
● Performance optimizations
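The slide does not say how logical paths map to buckets. One possible scheme, sketched here purely as an assumption, hashes the leading path components to pick a bucket, so callers never see the physical layout. The bucket names, `shard_depth`, and `to_gcs_uri` are hypothetical.

```python
import hashlib
from typing import Sequence

# Hypothetical bucket pool; real names and counts are not given in the talk.
BUCKETS = ["example-lake-00", "example-lake-01", "example-lake-02", "example-lake-03"]


def to_gcs_uri(logical_path: str, buckets: Sequence[str] = BUCKETS,
               shard_depth: int = 2) -> str:
    """Map a logical path to a gs:// URI, sharding on its first `shard_depth` components."""
    parts = [p for p in logical_path.split("/") if p]
    shard_key = "/".join(parts[:shard_depth])
    # Use a stable hash; Python's built-in hash() is salted per process.
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    bucket = buckets[int.from_bytes(digest[:8], "big") % len(buckets)]
    return f"gs://{bucket}/" + "/".join(parts)


print(to_gcs_uri("/warehouse/trips/datestr=2024-01-01/part-0.parquet"))
```

Because only the resolver knows the bucket layout, buckets can be added or rebalanced without changing the paths that jobs and tables reference.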
Performance optimizations (Perf/Func → Optimizations)
● IO capacity limits
○ Traffic balancing and bucket pre-splits
● Write throughput
○ gVNIC adoption for aggregated throughput improvement: 20 Gbps → 32 Gbps
○ Parallel composite uploads for single writer throughput improvement
● Read/Listing latencies
○ gRPC APIs for better performance consistency
○ Presto: local SSD cache
○ Hive/Spark parallel listing for partitioned data
○ Hudi: performance improvements with 0.14 features
● Rename
○ Failure handling and Python library enhancement
○ Spark optimized file output committer
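Parallel listing of partitioned data can be sketched with a thread pool that issues one listing call per partition instead of walking them one by one. The `list_fn` callback stands in for whatever storage client does the real listing. It and `list_partitions_parallel` are illustrative names, not Hive, Spark, or GCS APIs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def list_partitions_parallel(partitions: Iterable[str],
                             list_fn: Callable[[str], List[str]],
                             max_workers: int = 16) -> Dict[str, List[str]]:
    """List every partition concurrently and return {partition: [files]}."""
    partition_list = list(partitions)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with partition_list.
        results = pool.map(list_fn, partition_list)
        return dict(zip(partition_list, results))


# Usage with a fake in-memory "object store".
fake_store = {
    "datestr=2024-01-01": ["part-0.parquet", "part-1.parquet"],
    "datestr=2024-01-02": ["part-0.parquet"],
}
print(list_partitions_parallel(fake_store.keys(), lambda p: fake_store[p]))
```

Listing calls against object storage are dominated by request latency, so issuing them concurrently tends to reduce planning time roughly in proportion to the worker count, up to the storage service's request limits.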
Hybrid Cloud Architecture (WIP)
● One logical DataLake on unified data storage
○ Across on-prem HDFS and Cloud object storage
○ Logical paths to abstract out internal details
● Optimizations for
○ Ingress/Egress traffic cost
○ Data storage cost
Tables and Blobs: Unified Multi-Cloud Storage (Future)
● Tables and Blobs
● Multi-Cloud architecture
○ Google Cloud Platform (GCP)
○ Oracle® Cloud Infrastructure (OCI)
● Data orchestration and caching
"Apache®, Apache Hadoop®, Hadoop®, and Apache Spark® are either registered trademarks or trademarks of the Apache
Software Foundation® in the United States and/or other countries. No endorsement by The Apache Software Foundation® is
implied by the use of these marks."
"Google®, Google Cloud Platform®, and Google Cloud Storage® are either registered trademarks or trademarks of Google LLC in
the United States and/or other countries. No endorsement by Google LLC is implied by the use of these marks."
"Oracle® is a registered trademark of Oracle Corporation. No endorsement by Oracle Corporation is implied by the use of the
mark."
"Presto® is a registered trademark of LF Projects, LLC. No endorsement by LF Projects, LLC is implied by the use of the mark."

Data Infra Meetup | Uber's Data Storage Evolution