Practices of Ozone at Shopee
Shopee Data Infrastructure
Zhou Yiyang
Private & Confidential
CONTENT
▪ Ozone at Shopee
▪ Problems we met & solved
▪ Future Plans
Ozone at Shopee
▪ 2021
▪ Storage
o HDFS
o Ozone
▪ Small file
o Spark event logs (Dr Elephant)
▪ Volume
o 1000:1
Ozone at Shopee
▪ 2022
▪ S3 Clients
▪ S3 Protocol
o CLI
o SDK
• Java
• Go
• …
o REST API
▪ Advantages
o S3 compatible
o Low refactoring
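
The "low refactoring" point can be illustrated with a short sketch: an existing S3 application usually only needs its endpoint pointed at the Ozone S3 gateway. The example uses Python with the third-party boto3 SDK (the slides name Java and Go; Python is used here only for brevity). The endpoint host, credentials, and bucket name are placeholders, and 9878 is the gateway's default HTTP port in a stock Ozone configuration. All of these are assumptions for illustration, not Shopee's actual setup.

```python
def ozone_s3_client_kwargs(endpoint_url, access_key, secret_key):
    """Build keyword arguments for an S3 client that targets an Ozone S3 gateway.

    Only endpoint_url differs from a client that talks to a regular S3 service.
    """
    return {
        "endpoint_url": endpoint_url,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        # Placeholder region: the SDK needs one to sign requests.
        "region_name": "us-east-1",
    }


def list_keys(bucket, client_kwargs):
    """List object keys in a bucket through the gateway.

    Requires boto3 to be installed and a reachable gateway.
    """
    import boto3  # third-party dependency: pip install boto3

    s3 = boto3.client("s3", **client_kwargs)
    response = s3.list_objects_v2(Bucket=bucket)
    return [obj["Key"] for obj in response.get("Contents", [])]


# Hypothetical usage (host, credentials, and bucket are placeholders):
# kwargs = ozone_s3_client_kwargs("http://s3g.example.internal:9878", "ACCESS", "SECRET")
# print(list_keys("example-bucket", kwargs))
```

The rest of the application code stays the same as for any other S3-compatible endpoint, which is what keeps the refactoring cost low.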
Ozone at Shopee
▪ S3 Clients
o S3 gateway
▪ Ozone Clients
Ozone at Shopee
▪ Volumes: tens
▪ Buckets: tens
▪ Keys: 100M*
▪ Datanodes: tens
▪ Storage: 1 PB*
Problems we met & solved
▪ Recon
Symptom | Root cause | Solutions
Incorrect container count | Recon didn’t count deleted containers | HDDS-5235
Incorrect hostname of DN after hostname change | Recon persisted DatanodeDetails | HDDS-5418
Get delta update incurred full GC of OM | Trying to retrieve too much data from OM | HDDS-6147 (OM side), HDDS-6215 (Recon side), HDDS-6333 (Metrics)
Slow syncing data with OM | Loop costs too much time: (1) a table of 90m records needs 70s for each loop; (2) 100 deletes need 100 loops; (3) 1 sync needs about 2 hours; (4) sync interval: 10m -> 2h, causing full GC of OM | HDDS-6312 (Waiting for Review)
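
The "about 2 hours" figure for a slow Recon sync follows from the other numbers in that row. A small worked calculation in Python, using only the values quoted on the slide (the function name is made up for illustration):

```python
def sync_duration_seconds(num_deletes, seconds_per_loop):
    """Total sync time when every delete triggers one full loop over the table."""
    return num_deletes * seconds_per_loop


# Values from the slide: 100 deletes, 70 s per loop over a 90m-record table.
total_seconds = sync_duration_seconds(100, 70)
hours = total_seconds / 3600
print(f"{total_seconds} s = {hours:.2f} h")  # 7000 s = 1.94 h

# Compared with the intended 10-minute (600 s) sync interval:
print(f"{total_seconds / 600:.1f}x the 10-minute interval")  # 11.7x the 10-minute interval
```

A sync that takes close to two hours cannot keep a 10-minute cadence, which matches the slide's note that the effective interval grew from 10m to 2h.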
Problems we met & solved
▪ OM
Symptom | Root cause | Solutions
Implement HA | Not implemented | Manually sync
Full GC | Versioning of the file | HDDS-5243, HDDS-5472, HDDS-5461
Get delta update incurred full GC of OM | Trying to retrieve too much data from OM | HDDS-6147, HDDS-6215
Couldn’t decide leader node | Specify leader node for OM failover | HDDS-6743 (Waiting for Review)
Problems we met & solved
▪ SCM
Symptom | Root cause | Solutions
HA | Not implemented | Upgrade from 1.1 to 1.2, bootstrap
ContainerBalancer doesn’t read configs from ozone-site.xml | ContainerBalancer didn’t follow rules of ConfigurationSource | HDDS-6070
Incorrect timeout of ContainerBalancer | Incorrect implementation to check timeout | HDDS-6553
ContainerBalancer becomes slower | Empty chunk file | HDDS-6235
Slow and repeated ContainerBalancer | hdds.datanode.replication.streams.limit: N nodes write to 1 node, replication can’t complete before timeout | Increase config
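
Why raising hdds.datanode.replication.streams.limit helps when N nodes write to 1 node can be sketched with a toy model. The model below is an assumption, not Ozone's actual scheduling logic: it supposes the target handles at most `streams_limit` transfers at once, that every transfer takes the same time, and that the move fails if the whole batch is not done before a timeout. All numbers are hypothetical.

```python
import math


def minutes_to_drain(num_sources, streams_limit, minutes_per_replica):
    """Minutes for one target to receive one replica from each source,
    with at most `streams_limit` transfers running at once (toy model)."""
    waves = math.ceil(num_sources / streams_limit)
    return waves * minutes_per_replica


def completes_before_timeout(num_sources, streams_limit,
                             minutes_per_replica, timeout_minutes):
    """True if the whole batch finishes within the timeout under the toy model."""
    total = minutes_to_drain(num_sources, streams_limit, minutes_per_replica)
    return total <= timeout_minutes


# Hypothetical numbers: 30 sources, 10 minutes per replica, 25-minute timeout.
print(completes_before_timeout(30, 10, 10, 25))  # False: 3 waves take 30 minutes
print(completes_before_timeout(30, 20, 10, 25))  # True: 2 waves take 20 minutes
```

Under these assumptions a low limit forces transfers into more sequential waves, so some time out and are retried, which would show up as a slow and repeated balancer run.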
Problems we met & solved
▪ S3g
Symptom | Root cause | Solutions
No metrics of S3g | Not implemented | HDDS-6481
Error logs of S3g while checking S3g | Favicon request from browser | HDDS-6497
No audit log of S3g | Not implemented | HDDS-6525
No read audit log | Read audit log disabled by default | HDDS-6525 (exclude operations), HDDS-6535
Need to restart service to reload exclude operations | Dynamically refresh debug operations for audit log | HDDS-6603 (Waiting for Review)
Problems we met & solved
▪ Better support
o Users
o SREs
Future Plans
▪ S3g
o Performance
• HDDS-4440
▪ Client, DN
o Write streaming
• HDDS-4454
o EC
▪ SCM
o Multi-DC
▪ Dependencies
o Hadoop
o Ratis
o RocksDB
Q&A
Thank you
