3. Private & Confidential
Ozone at Shopee
▪ 2021
▪ Storage
o HDFS
o Ozone
▪ Small file
o Spark event logs (Dr Elephant)
▪ Volume
o 1000:1
Ozone at Shopee
▪ 2022
▪ S3 Clients
▪ S3 Protocol
o CLI
o SDK
• Java
• Go
• …
o REST API
▪ Advantages
o S3 compatible
o Low refactoring
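The "low refactoring" point can be illustrated with a small sketch: an S3 client mostly needs to be pointed at a different endpoint. The snippet below only builds a path-style object URL of the kind a REST client would call. The host name, bucket, and key are hypothetical, and 9878 is assumed to be the gateway's HTTP port (the Ozone default). Request signing, which the CLI and SDKs normally handle, is omitted.

```python
from urllib.parse import quote

# Hypothetical gateway address; 9878 is assumed to be the S3 Gateway HTTP port.
S3G_ENDPOINT = "http://s3g.example.com:9878"


def object_url(endpoint: str, bucket: str, key: str) -> str:
    """Build a path-style URL of the form <endpoint>/<bucket>/<key>."""
    # quote() percent-encodes unsafe characters but leaves "/" intact,
    # so nested keys keep their directory-like shape.
    return f"{endpoint.rstrip('/')}/{quote(bucket)}/{quote(key)}"


url = object_url(S3G_ENDPOINT, "logs", "spark/app 1.log")
print(url)  # http://s3g.example.com:9878/logs/spark/app%201.log
```

Only `S3G_ENDPOINT` is specific to the Ozone deployment here; the bucket and key handling is the same as for any other S3-compatible store.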
Ozone at Shopee
▪ Volumes: Tens of
▪ Buckets: Tens of
▪ Keys: 100m*
▪ Datanodes: Tens of
▪ Storage: 1 PB*
Problems we met & solved
▪ Recon
Symptom | Root cause | Solutions
Incorrect containers number | Recon didn’t count deleted containers | HDDS-5235
Incorrect Hostname of DN after hostname change | Recon persisted DatanodeDetails | HDDS-5418
Get delta update incurred full GC of OM | Trying to retrieve too much data from OM | HDDS-6147 (OM side), HDDS-6215 (Recon side), HDDS-6333 (Metrics)
Slow syncing data with OM | Loop costs too much time: (1) table of 90m records needs 70s for each loop; (2) 100 deletes needs 100 loops; (3) 1 sync needs about 2 hours; (4) sync interval: 10m -> 2h, causing full GC of OM | HDDS-6312 (Waiting for Review)
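The "about 2 hours" figure for one sync follows from the other numbers in the slow-sync row. A quick check, assuming each delete triggers exactly one full pass over the table and each pass costs the quoted 70 s (the constant and function names are illustrative):

```python
# Figures taken from the slide; the one-pass-per-delete model is an assumption.
SECONDS_PER_LOOP = 70   # one pass over a table of ~90m records
DELETES_PER_SYNC = 100  # each delete needs its own pass


def sync_duration_seconds(deletes: int, seconds_per_loop: int) -> int:
    """Total time spent looping when every delete costs one full pass."""
    return deletes * seconds_per_loop


total = sync_duration_seconds(DELETES_PER_SYNC, SECONDS_PER_LOOP)
print(f"{total} s = {total / 3600:.2f} h")  # 7000 s = 1.94 h
```

With a sync taking this long, a 10-minute sync interval cannot be sustained, which matches the interval stretching to roughly 2 hours.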
Problems we met & solved
▪ OM
Symptom | Root cause | Solutions
Implement HA | Not implemented | Manually sync
Full GC | Versioning of the file | HDDS-5243, HDDS-5472, HDDS-5461
Get delta update incurred full GC of OM | Trying to retrieve too much data from OM | HDDS-6147, HDDS-6215
Couldn’t decide leader node | Specify leader node for OM failover | HDDS-6743 (Waiting for Review)
Problems we met & solved
▪ SCM
Symptom | Root cause | Solutions
HA | Not implemented | Upgrade from 1.1 to 1.2, bootstrap
ContainerBalancer doesn’t read configs from ozone-site.xml | ContainerBalancer didn’t follow rules of ConfigurationSource | HDDS-6070
Incorrect timeout of ContainerBalancer | Incorrect implementation to check timeout | HDDS-6553
ContainerBalancer becomes slower | Empty chunk file | HDDS-6235
Slow and repeated ContainerBalancer | hdds.datanode.replication.streams.limit: N nodes write to 1 node, replication can’t complete before timeout | Increase config
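The "Increase config" fix in the last row could look like the fragment below, assuming the property is set in ozone-site.xml. The property name comes from the slide; the value 20 is purely illustrative and not a recommendation.

```xml
<!-- ozone-site.xml fragment; the value is a placeholder chosen for illustration -->
<property>
  <name>hdds.datanode.replication.streams.limit</name>
  <value>20</value>
</property>
```

A higher limit lets a target datanode accept more concurrent replication streams, so that N source nodes writing to one node are more likely to finish before the balancer timeout.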
Problems we met & solved
▪ S3g
Symptom | Root cause | Solutions
No metrics of S3g | Not implemented | HDDS-6481
Error logs of S3g while checking S3g | Favicon request from Browser | HDDS-6497
No audit log of S3g | Not implemented | HDDS-6525
No read audit log | Read audit log disabled by default | HDDS-6525 (exclude operations), HDDS-6535
Need to restart service to reload exclude operations | Dynamically refresh debug operations for audit log | HDDS-6603 (Waiting for Review)