SlideShare a Scribd company logo
Petabyte Scale Object Storage Service using Ceph
in a Private Cloud
Flipkart Internet Pvt Ltd, Bangalore, India
1
• About Flipkart
• Flipkart Cloud Platform(FCP)
• Virtualized Infrastructure
• Object Store Services
• FCP Storage Team - Tech Stack
• Ceph Configuration and Deployment
• QoS(Server Side, RPS based)
• Cluster Administration and Health Monitoring
Agenda
2
About Flipkart
•Flipkart is India’s leading e-commerce marketplace offering
• Over 80 million products across 70+ product categories
• 100 million registered users
• 100,000 Sellers
• 8 million shipments/month
• 13 million daily visitors
•Developed a highly scalable and reliable storage infrastructure
•Object storage service backed by Ceph for ~3 years
•Runs two data centers having more than 20,000 hosts
•Completely virtualized infrastructure with over 60,000 VMs
3
Object Storage as a Service
• Scale-out object storage service built on CEPH for Flipkart
• Simple HTTP API - A subset of AWS S3 API
• Multiple clusters behind a single endpoint
• Key features
• Focus on durability
• Elastic for performance and capacity
• Multi-tenant with user SLAs
• Five 9's of reliability
4
FCP Storage - SLAs Offered
Shared-Active Workloads
● Latency sensitive e.g.
Debian packages, product
image uploads &
near-term invoices.
● Optimized for small to
medium sized objects
Backup Workloads
● Bandwidth optimized e.g.
MySQL DB backup
● Optimized for very large
objects.
● Best suited for taking
backups with limited
retention period.
Archival Workloads
● Long-term (multi-year)
retention optimized e.g.
legal compliance data
● Moderate throughput and
high latency
● Optimized for medium to
very large objects
● Explicit verification targets
for durability
5
Hot Cold
Shared Active Service
•Stores ~1.5 billion objects
•~750 OSDs
•Provides ~1 GB/s bandwidth utilization
•Handles ~ 500 RGW requests/sec
•~2PB of raw storage
•Hybrid deployment with SSDs and HDDs
•200+ internal customers
•Growing ~50 million objects/month
6
Shared Active Tech Stack
7
Elastic Load
Balancer
Nginx Nginx
VIP
Shared Active Tech Stack
Elastic Load
Balancer
ELB
(Read)
ELB
(Write)
RGW
(Write)
RGW
(Write)
RGW
(Read)
RGW
(Read)
Nginx Nginx
VIP VIP
VIP
Shared Active Tech Stack
9
Elastic Load
Balancer
ELB
(Read)
ELB
(Write)
RGW
(Write)
RGW
(Write)
RGW
(Read)
RGW
(Read)
Nginx Nginx
Mon
OSD (SSD) OSD HybridRADOS Cluster
Bare Metal H/W (Debian)
Debian VMs
Cache/Index Base Tier
VIP VIP
VIP
Shared Active Tech Stack
10
Clients(Invoice Service)
Elastic Load
Balancer
ELB
(Read)
ELB
(Write)
RGW
(Write)
RGW
(Write)
RGW
(Read)
RGW
(Read)
Nginx Nginx
Mon
OSD (SSD) OSD HybridRADOS Cluster
Bare Metal H/W (Debian)
Debian VMs
Clients(CDN) Clients(Repo)
Clients(Backup)
Clients(Catalogue)
Cache/Index Base Tier
VIP VIP
VIP
Boto
AWS SDK
for Java
Time Series
Based Health
Monitoring
Grafana
Open
TSDB
Alerts
Ceph Deployment: HW Instance
• HW VM instances for Ceph components
• Monitors
• Higher CPU cores and RAM
• RGWs
• Higher Bandwidth and CPUs
• An instance group of N RGWs
• Base Tier OSDs
• Higher Capacity, Hybrid(SSD+HDD)
• Cache Tier OSDs(SSDs)
• Index OSDs (SSDs+More RAM)
• Low latency, large capacity SSDs
• An instance has the following:
• Host OS (Debian)
• Network
• CPU (Hyper Thread cores)
• Disks (Direct attached)
• RAM
11
12
QoS
QoS
RGW Rate Limit
● Per user quota policy
● Requests are rate limited per
GET, PUT, CREATE, DELETE
● Deployed per RGW instance
Elastic Load Balancer
Spread load across RGW/Nginx
Nginx Rd/Wr Separation
● Independent sets of RGW
instances
● Handle GET and PUT in
isolation
● Maintained in Nginx conf
CRUSH
Spread load across physically
separated tiers across different PDU
Cluster Administration and Health Monitoring
13
SA Service: QoS(Grafana)
14
SA Service: QoS(Grafana)
15
16
Challenges faced so far...
Issue Timeouts due to lack of iops
Reason 1. Increased client and rebalancing load
2. Split and merges in filestore at the same time
3. Unable to serve I/O requests
Mitigation 1. Migrate journals to SSDs
2. Introduce SSD writeback cache tier in front of HDD OSDs to
absorb iops
3. Introduce rate limits for clients
4. Two set of RGWs to serve read and write operations
17
Challenges faced so far...
Issue Service outage due to rebalancing in large metadata pools (index, log
etc)
Reason 1. Buckets with more than 100 million objects
2. RGW usage pool had high number of entries and was enabled
3. Unable to write when rebalancing happens on large metadata pools
Mitigation 1. Manual sharding of large bucket indexes
2. Disable RGW log pool
3. Introduce bucket quota
18
Challenges faced so far...
Issue Write timeouts due to high and concurrent bucket create calls
Reason Contention in Leveldb to due high number of concurrent updates to a
bucket’s metadata
Mitigation Enable quota on bucket creation calls
19
Challenges faced so far...
Issue Data leak
Reason Leaks with incomplete multipart uploads
Mitigation 1. Fixed : https://github.com/ceph/ceph/pull/15630
2. radosgw-admin orphan find did not scale with millions of objects
3. Developed an offline tool to delete leaked objects
20
Challenges faced so far...
Issue Data loss
Reason Race in RGWCompleteMultipart
Mitigation 1. Fixed : https://github.com/ceph/ceph/pull/16732
2. Accounted loss by scanning all objects for missing parts.
21
Challenges that still persist...
• Upgrades are still a pain
• Upgrade from Hammer to Jewel, issue with chown.
• Upgrading to Luminous:
• Filestore has split and merges problem.
• BlueStore solves them but upgrade path is not smooth
• Adding new osds with BlueStore backend.
• Occasional PG incomplete issue and resulting data loss to
bring PGs back online.
• http://tracker.ceph.com/issues/17002
•End-to-End QOS
Questions?
22
Thank You
Backup
24
Build Service Alert ServiceCosmos Monitoring Service
Repo Service Authorization AuthN/ AuthZ Service
Elastic Load Balancer Object Storage Service
Infrastructure as a Service (IAAS)
25
Flipkart Cloud Platform: Building Blocks
Object Storage As a Service
• Policy based services
• Shared Active
• Low latency, replicated, high throughput
• Hybrid Deployment of HDDs and SSDs
• Data Backup and Disaster Recovery Service
• Replicated on HDDs
• Data Archival Service
• Erasure Coded on HDDs
• Multi-tenant
• Five 9's reliable
• 200+ internal customers
26
Cluster Administration and Monitoring
•Nagios alerts
•Ceph Periodic stats
• Has more than 100 metrics
• Integrates to Cosmos service
• Stores data on Cosmos's OpenTS DB
• Complements Ceph manager
•Grafana
• Visual representation of Ceph Periodic Stats
• Historic and near real-time status of the service
27
Use Cases
28

More Related Content

What's hot

MySQL on Ceph
MySQL on CephMySQL on Ceph
MySQL on Ceph
Kyle Bader
 
RBD: What will the future bring? - Jason Dillaman
RBD: What will the future bring? - Jason DillamanRBD: What will the future bring? - Jason Dillaman
RBD: What will the future bring? - Jason Dillaman
Ceph Community
 
Troubleshooting redis
Troubleshooting redisTroubleshooting redis
Troubleshooting redis
DaeMyung Kang
 
Accelerating Ceph Performance with High Speed Networks and Protocols - Qingch...
Accelerating Ceph Performance with High Speed Networks and Protocols - Qingch...Accelerating Ceph Performance with High Speed Networks and Protocols - Qingch...
Accelerating Ceph Performance with High Speed Networks and Protocols - Qingch...
Ceph Community
 
Redis Day Keynote Salvatore Sanfillipo Redis Labs
Redis Day Keynote Salvatore Sanfillipo Redis LabsRedis Day Keynote Salvatore Sanfillipo Redis Labs
Redis Day Keynote Salvatore Sanfillipo Redis Labs
Redis Labs
 
MySQL Head to Head Performance
MySQL Head to Head PerformanceMySQL Head to Head Performance
MySQL Head to Head Performance
Kyle Bader
 
Red Hat Ceph Storage Roadmap: January 2016
Red Hat Ceph Storage Roadmap: January 2016Red Hat Ceph Storage Roadmap: January 2016
Red Hat Ceph Storage Roadmap: January 2016
Red_Hat_Storage
 
Ceph Goes on Online at Qihoo 360 - Xuehan Xu
Ceph Goes on Online at Qihoo 360 - Xuehan XuCeph Goes on Online at Qihoo 360 - Xuehan Xu
Ceph Goes on Online at Qihoo 360 - Xuehan Xu
Ceph Community
 
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu ChaiRADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
Ceph Community
 
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong TangAccelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
Ceph Community
 
Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer Spotlight
Colleen Corrice
 
HIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoProHIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoPro
Redis Labs
 
RedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman
RedisConf17 - Lyft - Geospatial at Scale - Daniel HochmanRedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman
RedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman
Redis Labs
 
Day 2 General Session Presentations RedisConf
Day 2 General Session Presentations RedisConfDay 2 General Session Presentations RedisConf
Day 2 General Session Presentations RedisConf
Redis Labs
 
Linux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Linux Block Cache Practice on Ceph BlueStore - Junxin ZhangLinux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Linux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Ceph Community
 
Which Hypervisor is Best?
Which Hypervisor is Best?Which Hypervisor is Best?
Which Hypervisor is Best?
Kyle Bader
 
Redis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs TalksRedis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs Talks
Redis Labs
 
RGW Beyond Cloud: Live Video Storage with Ceph - Shengjing Zhu, Yiming Xie
RGW Beyond Cloud: Live Video Storage with Ceph - Shengjing Zhu, Yiming XieRGW Beyond Cloud: Live Video Storage with Ceph - Shengjing Zhu, Yiming Xie
RGW Beyond Cloud: Live Video Storage with Ceph - Shengjing Zhu, Yiming Xie
Ceph Community
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ Twitter
Redis Labs
 
librados
libradoslibrados
librados
Patrick McGarry
 

What's hot (20)

MySQL on Ceph
MySQL on CephMySQL on Ceph
MySQL on Ceph
 
RBD: What will the future bring? - Jason Dillaman
RBD: What will the future bring? - Jason DillamanRBD: What will the future bring? - Jason Dillaman
RBD: What will the future bring? - Jason Dillaman
 
Troubleshooting redis
Troubleshooting redisTroubleshooting redis
Troubleshooting redis
 
Accelerating Ceph Performance with High Speed Networks and Protocols - Qingch...
Accelerating Ceph Performance with High Speed Networks and Protocols - Qingch...Accelerating Ceph Performance with High Speed Networks and Protocols - Qingch...
Accelerating Ceph Performance with High Speed Networks and Protocols - Qingch...
 
Redis Day Keynote Salvatore Sanfillipo Redis Labs
Redis Day Keynote Salvatore Sanfillipo Redis LabsRedis Day Keynote Salvatore Sanfillipo Redis Labs
Redis Day Keynote Salvatore Sanfillipo Redis Labs
 
MySQL Head to Head Performance
MySQL Head to Head PerformanceMySQL Head to Head Performance
MySQL Head to Head Performance
 
Red Hat Ceph Storage Roadmap: January 2016
Red Hat Ceph Storage Roadmap: January 2016Red Hat Ceph Storage Roadmap: January 2016
Red Hat Ceph Storage Roadmap: January 2016
 
Ceph Goes on Online at Qihoo 360 - Xuehan Xu
Ceph Goes on Online at Qihoo 360 - Xuehan XuCeph Goes on Online at Qihoo 360 - Xuehan Xu
Ceph Goes on Online at Qihoo 360 - Xuehan Xu
 
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu ChaiRADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
 
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong TangAccelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
 
Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer Spotlight
 
HIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoProHIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoPro
 
RedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman
RedisConf17 - Lyft - Geospatial at Scale - Daniel HochmanRedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman
RedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman
 
Day 2 General Session Presentations RedisConf
Day 2 General Session Presentations RedisConfDay 2 General Session Presentations RedisConf
Day 2 General Session Presentations RedisConf
 
Linux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Linux Block Cache Practice on Ceph BlueStore - Junxin ZhangLinux Block Cache Practice on Ceph BlueStore - Junxin Zhang
Linux Block Cache Practice on Ceph BlueStore - Junxin Zhang
 
Which Hypervisor is Best?
Which Hypervisor is Best?Which Hypervisor is Best?
Which Hypervisor is Best?
 
Redis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs TalksRedis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs Talks
 
RGW Beyond Cloud: Live Video Storage with Ceph - Shengjing Zhu, Yiming Xie
RGW Beyond Cloud: Live Video Storage with Ceph - Shengjing Zhu, Yiming XieRGW Beyond Cloud: Live Video Storage with Ceph - Shengjing Zhu, Yiming Xie
RGW Beyond Cloud: Live Video Storage with Ceph - Shengjing Zhu, Yiming Xie
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ Twitter
 
librados
libradoslibrados
librados
 

Similar to Petabyte Scale Object Storage Service Using Ceph in A Private Cloud - Varada Kari

Turning object storage into vm storage
Turning object storage into vm storageTurning object storage into vm storage
Turning object storage into vm storage
wim_provoost
 
A Tale of 2 Systems
A Tale of 2 SystemsA Tale of 2 Systems
A Tale of 2 Systems
David Newman
 
TechTalk: Connext DDS 5.2.
TechTalk: Connext DDS 5.2.TechTalk: Connext DDS 5.2.
TechTalk: Connext DDS 5.2.
Real-Time Innovations (RTI)
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
Jurriaan Persyn
 
Red Hat Storage Day LA - Persistent Storage for Linux Containers
Red Hat Storage Day LA - Persistent Storage for Linux Containers Red Hat Storage Day LA - Persistent Storage for Linux Containers
Red Hat Storage Day LA - Persistent Storage for Linux Containers
Red_Hat_Storage
 
Introducing Venice - Strata NYC 2017
Introducing Venice - Strata NYC 2017Introducing Venice - Strata NYC 2017
Introducing Venice - Strata NYC 2017
Felix GV
 
Drinking our own Champagne: How Woot, an Amazon subsidiary, uses AWS (ARC212)...
Drinking our own Champagne: How Woot, an Amazon subsidiary, uses AWS (ARC212)...Drinking our own Champagne: How Woot, an Amazon subsidiary, uses AWS (ARC212)...
Drinking our own Champagne: How Woot, an Amazon subsidiary, uses AWS (ARC212)...
Amazon Web Services
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
DataWorks Summit
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
DataWorks Summit
 
GlobalsDB: Its significance for Node.js Developers
GlobalsDB: Its significance for Node.js DevelopersGlobalsDB: Its significance for Node.js Developers
GlobalsDB: Its significance for Node.js Developers
Rob Tweed
 
Running database infrastructure on containers
Running database infrastructure on containersRunning database infrastructure on containers
Running database infrastructure on containers
MariaDB plc
 
Webinar - DreamObjects/Ceph Case Study
Webinar - DreamObjects/Ceph Case StudyWebinar - DreamObjects/Ceph Case Study
Webinar - DreamObjects/Ceph Case Study
Ceph Community
 
Introducing Venice
Introducing VeniceIntroducing Venice
Introducing Venice
Yan Yan
 
ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)
Mathew Beane
 
12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL
Konstantin Gredeskoul
 
Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)
Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)
Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)
Jens Hadlich
 
CloudStack Day Japan 2015 - Hypervisor Selection in CloudStack 4.5
CloudStack Day Japan 2015 - Hypervisor Selection in CloudStack 4.5CloudStack Day Japan 2015 - Hypervisor Selection in CloudStack 4.5
CloudStack Day Japan 2015 - Hypervisor Selection in CloudStack 4.5Tim Mackey
 
Arquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWS
Arquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWSArquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWS
Arquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWS
Amazon Web Services LATAM
 
Trove Updates - Liberty Edition
Trove Updates - Liberty EditionTrove Updates - Liberty Edition
Trove Updates - Liberty Edition
OpenStack Foundation
 
Openstack trove-updates
Openstack trove-updatesOpenstack trove-updates
Openstack trove-updates
Jesse Wiles
 

Similar to Petabyte Scale Object Storage Service Using Ceph in A Private Cloud - Varada Kari (20)

Turning object storage into vm storage
Turning object storage into vm storageTurning object storage into vm storage
Turning object storage into vm storage
 
A Tale of 2 Systems
A Tale of 2 SystemsA Tale of 2 Systems
A Tale of 2 Systems
 
TechTalk: Connext DDS 5.2.
TechTalk: Connext DDS 5.2.TechTalk: Connext DDS 5.2.
TechTalk: Connext DDS 5.2.
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Red Hat Storage Day LA - Persistent Storage for Linux Containers
Red Hat Storage Day LA - Persistent Storage for Linux Containers Red Hat Storage Day LA - Persistent Storage for Linux Containers
Red Hat Storage Day LA - Persistent Storage for Linux Containers
 
Introducing Venice - Strata NYC 2017
Introducing Venice - Strata NYC 2017Introducing Venice - Strata NYC 2017
Introducing Venice - Strata NYC 2017
 
Drinking our own Champagne: How Woot, an Amazon subsidiary, uses AWS (ARC212)...
Drinking our own Champagne: How Woot, an Amazon subsidiary, uses AWS (ARC212)...Drinking our own Champagne: How Woot, an Amazon subsidiary, uses AWS (ARC212)...
Drinking our own Champagne: How Woot, an Amazon subsidiary, uses AWS (ARC212)...
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
GlobalsDB: Its significance for Node.js Developers
GlobalsDB: Its significance for Node.js DevelopersGlobalsDB: Its significance for Node.js Developers
GlobalsDB: Its significance for Node.js Developers
 
Running database infrastructure on containers
Running database infrastructure on containersRunning database infrastructure on containers
Running database infrastructure on containers
 
Webinar - DreamObjects/Ceph Case Study
Webinar - DreamObjects/Ceph Case StudyWebinar - DreamObjects/Ceph Case Study
Webinar - DreamObjects/Ceph Case Study
 
Introducing Venice
Introducing VeniceIntroducing Venice
Introducing Venice
 
ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)
 
12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL
 
Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)
Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)
Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)
 
CloudStack Day Japan 2015 - Hypervisor Selection in CloudStack 4.5
CloudStack Day Japan 2015 - Hypervisor Selection in CloudStack 4.5CloudStack Day Japan 2015 - Hypervisor Selection in CloudStack 4.5
CloudStack Day Japan 2015 - Hypervisor Selection in CloudStack 4.5
 
Arquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWS
Arquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWSArquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWS
Arquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWS
 
Trove Updates - Liberty Edition
Trove Updates - Liberty EditionTrove Updates - Liberty Edition
Trove Updates - Liberty Edition
 
Openstack trove-updates
Openstack trove-updatesOpenstack trove-updates
Openstack trove-updates
 

Recently uploaded

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 

Recently uploaded (20)

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 

Petabyte Scale Object Storage Service Using Ceph in A Private Cloud - Varada Kari

  • 1. Petabyte Scale Object Storage Service using Ceph in a Private Cloud Flipkart Internet Pvt Ltd, Bangalore, India 1
  • 2. • About Flipkart • Flipkart Cloud Platform(FCP) • Virtualized Infrastructure • Object Store Services • FCP Storage Team - Tech Stack • Ceph Configuration and Deployment • QoS(Server Side, RPS based) • Cluster Administration and Health Monitoring Agenda 2
  • 3. About Flipkart •Flipkart is India’s leading e-commerce marketplace offering • Over 80 million products across 70+ product categories • 100 million registered users • 100,000 Sellers • 8 million shipments/month • 13 million daily visitors •Developed a highly scalable and reliable storage infrastructure •Object storage service backed by Ceph for ~3 years •Runs two data centers having more than 20,000 hosts •Completely virtualized infrastructure with over 60,000 VMs 3
  • 4. Object Storage as a Service • Scale-out object storage service built on CEPH for Flipkart • Simple HTTP API - A subset of AWS S3 API • Multiple clusters behind a single endpoint • Key features • Focus on durability • Elastic for performance and capacity • Multi-tenant with user SLAs • Five 9's of reliability 4
  • 5. FCP Storage - SLAs Offered Shared-Active Workloads ● Latency sensitive e.g. Debian packages, product image uploads & near-term invoices. ● Optimized for small to medium sized objects Backup Workloads ● Bandwidth optimized e.g. MySQL DB backup ● Optimized for very large objects. ● Best suited for taking backups with limited retention period. Archival Workloads ● Long-term (multi-year) retention optimized e.g. legal compliance data ● Moderate throughput and high latency ● Optimized for medium to very large objects ● Explicit verification targets for durability 5 Hot Cold
  • 6. Shared Active Service •Stores ~1.5 billion objects •~750 OSDs •Provides ~1 GB/s bandwidth utilization •Handles ~ 500 RGW requests/sec •~2PB of raw storage •Hybrid deployment with SSDs and HDDs •200+ internal customers •Growing ~50 million objects/month 6
  • 7. Shared Active Tech Stack 7 Elastic Load Balancer Nginx Nginx VIP
  • 8. Shared Active Tech Stack Elastic Load Balancer ELB (Read) ELB (Write) RGW (Write) RGW (Write) RGW (Read) RGW (Read) Nginx Nginx VIP VIP VIP
  • 9. Shared Active Tech Stack 9 Elastic Load Balancer ELB (Read) ELB (Write) RGW (Write) RGW (Write) RGW (Read) RGW (Read) Nginx Nginx Mon OSD (SSD) OSD HybridRADOS Cluster Bare Metal H/W (Debian) Debian VMs Cache/Index Base Tier VIP VIP VIP
  • 10. Shared Active Tech Stack 10 Clients(Invoice Service) Elastic Load Balancer ELB (Read) ELB (Write) RGW (Write) RGW (Write) RGW (Read) RGW (Read) Nginx Nginx Mon OSD (SSD) OSD HybridRADOS Cluster Bare Metal H/W (Debian) Debian VMs Clients(CDN) Clients(Repo) Clients(Backup) Clients(Catalogue) Cache/Index Base Tier VIP VIP VIP Boto AWS SDK for Java Time Series Based Health Monitoring Grafana Open TSDB Alerts
  • 11. Ceph Deployment: HW Instance • HW VM instances for Ceph components • Monitors • Higher CPU cores and RAM • RGWs • Higher Bandwidth and CPUs • An instance group of N RGWs • Base Tier OSDs • Higher Capacity, Hybrid(SSD+HDD) • Cache Tier OSDs(SSDs) • Index OSDs (SSDs+More RAM) • Low latency, large capacity SSDs • An instance has the following: • Host OS (Debian) • Network • CPU (Hyper Thread cores) • Disks (Direct attached) • RAM 11
  • 12. 12 QoS QoS RGW Rate Limit ● Per user quota policy ● Requests are rate limited per GET, PUT, CREATE, DELETE ● Deployed per RGW instance Elastic Load Balancer Spread load across RGW/Nginx Nginx Rd/Wr Separation ● Independent sets of RGW instances ● Handle GET and PUT in isolation ● Maintained in Nginx conf CRUSH Spread load across physically separated tiers across different PDU
  • 13. Cluster Administration and Health Monitoring 13
  • 16. 16 Challenges faced so far... Issue Timeouts due to lack of iops Reason 1. Increased client and rebalancing load 2. Split and merges in filestore at the same time 3. Unable to serve I/O requests Mitigation 1. Migrate journals to SSDs 2. Introduce SSD writeback cache tier in front of HDD OSDs to absorb iops 3. Introduce rate limits for clients 4. Two set of RGWs to serve read and write operations
  • 17. 17 Challenges faced so far... Issue Service outage due to rebalancing in large metadata pools (index, log etc) Reason 1. Buckets with more than 100 million objects 2. RGW usage pool had high number of entries and was enabled 3. Unable to write when rebalancing happens on large metadata pools Mitigation 1. Manual sharding of large bucket indexes 2. Disable RGW log pool 3. Introduce bucket quota
  • 18. 18 Challenges faced so far... Issue Write timeouts due to high and concurrent bucket create calls Reason Contention in Leveldb to due high number of concurrent updates to a bucket’s metadata Mitigation Enable quota on bucket creation calls
  • 19. 19 Challenges faced so far... Issue Data leak Reason Leaks with incomplete multipart uploads Mitigation 1. Fixed : https://github.com/ceph/ceph/pull/15630 2. radosgw-admin orphan find did not scale with millions of objects 3. Developed an offline tool to delete leaked objects
  • 20. 20 Challenges faced so far... Issue Data loss Reason Race in RGWCompleteMultipart Mitigation 1. Fixed : https://github.com/ceph/ceph/pull/16732 2. Accounted loss by scanning all objects for missing parts.
  • 21. 21 Challenges that still persist... • Upgrades are still a pain • Upgrade from Hammer to Jewel, issue with chown. • Upgrading to Luminous: • Filestore has split and merges problem. • BlueStore solves them but upgrade path is not smooth • Adding new osds with BlueStore backend. • Occasional PG incomplete issue and resulting data loss to bring PGs back online. • http://tracker.ceph.com/issues/17002 •End-to-End QOS
  • 25. Build Service Alert ServiceCosmos Monitoring Service Repo Service Authorization AuthN/ AuthZ Service Elastic Load Balancer Object Storage Service Infrastructure as a Service (IAAS) 25 Flipkart Cloud Platform: Building Blocks
  • 26. Object Storage As a Service • Policy based services • Shared Active • Low latency, replicated, high throughput • Hybrid Deployment of HDDs and SSDs • Data Backup and Disaster Recovery Service • Replicated on HDDs • Data Archival Service • Erasure Coded on HDDs • Multi-tenant • Five 9's reliable • 200+ internal customers 26
  • 27. Cluster Administration and Monitoring •Nagios alerts •Ceph Periodic stats • Has more than 100 metrics • Integrates to Cosmos service • Stores data on Cosmos's OpenTS DB • Complements Ceph manager •Grafana • Visual representation of Ceph Periodic Stats • Historic and near real-time status of the service 27