Nikos Kormpakis – nkorb@noc.grnet.gr
Open Source Storage at
Scale: Ceph @ GRNET
Object Storage
Source: https://www.slideshare.net/alohamora/ceph-introduction-2017
Independent, unique entities containing data plus some metadata.
● Simple, without hierarchy
● Useful for unstructured data
● Abstracts away lower layers (blocks, files, sectors)
● Lets “users” build applications on top of it
So, Ceph is...
“Object Storage”?
What is Ceph?
ceph.com states: Ceph is a unified, distributed storage system
designed for excellent performance, reliability and scalability.
● Free and Open Source Software
● Started as a research project at UCSC, now a Red Hat (IBM) product
● Software Defined Storage (sic)
● Runs on commodity hardware
● Implements Object Storage internally, provides all three types: Block, Object, File
Components
Source: https://www.slideshare.net/sageweil1/a-crash-course-in-crush
RADOS
Reliable Autonomic Distributed Object Store
● Storage layer where all objects live
● Based on CRUSH (Controlled, Scalable, Decentralized Placement of Replicated Data)
● Maintains the physical topology of the cluster
● Handles placement of objects
● Monitors cluster health
● Two types of daemons: OSDs and MONs (plus some more)
● Instead of relying on a central directory, each client computes for itself where to find or place objects (see the sketch below)
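To make the “no central directory” idea concrete, here is a deliberately simplified, hypothetical sketch of hash-based placement. It is not the real CRUSH algorithm (which walks a hierarchy of buckets with straw2 hashing), but it shows how every client can derive the same placement from nothing but the object name and a shared map:

```python
import hashlib

# Simplified illustration only: every client maps an object name to a
# placement group (PG) and then to a set of OSDs using just the name and a
# shared cluster map. Real Ceph uses CRUSH (hierarchical buckets, straw2);
# this is NOT the actual algorithm, just the "clients compute it" idea.

PG_COUNT = 128                 # hypothetical number of PGs in the pool
OSDS = list(range(12))         # hypothetical flat list of OSD ids
REPLICAS = 3

def object_to_pg(obj_name: str) -> int:
    """Hash the object name into a PG id (Ceph uses rjenkins; we use sha1)."""
    h = int.from_bytes(hashlib.sha1(obj_name.encode()).digest()[:4], "big")
    return h % PG_COUNT

def pg_to_osds(pg: int) -> list[int]:
    """Deterministically pick REPLICAS distinct OSDs for a PG."""
    chosen = []
    attempt = 0
    while len(chosen) < REPLICAS:
        h = int.from_bytes(
            hashlib.sha1(f"{pg}:{attempt}".encode()).digest()[:4], "big")
        osd = OSDS[h % len(OSDS)]
        if osd not in chosen:
            chosen.append(osd)
        attempt += 1
    return chosen

if __name__ == "__main__":
    pg = object_to_pg("rbd_data.1234.0000000000000042")
    print(f"PG {pg} -> OSDs {pg_to_osds(pg)}")  # same result on every client
```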
OSDs
● 10s to 1000s per cluster
● 1 per disk
● Serve data to clients
● Peer with each other for replication and recovery
MONs
● 3 or 5 per cluster
● Maintain cluster state
● Paxos for decisions
● Do not handle data
More daemons
● mgr, mds and more...
Source: https://www.slideshare.net/sageweil1/a-crash-course-in-crush
Daemons
Ways of storing data
Pool Types
Replicated
● Each object has size replicas (>=3 recommended)
● At least min_size replicas must be available for I/O to proceed
● Faster than EC
● Larger space overhead than EC (see the comparison below)
Erasure Coding
● Each object gets divided into k data chunks plus m coding chunks
● An object can be recovered from any k chunks
● More CPU intensive
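A quick back-of-the-envelope comparison of the raw-space overhead of the two pool types (plain arithmetic, no Ceph API involved; the size=3 and k=6, m=3 values mirror the pools shown later in this deck):

```python
def replicated_overhead(size: int) -> float:
    """Raw bytes stored per byte of user data with `size` full replicas."""
    return float(size)

def ec_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data with k data + m coding chunks."""
    return (k + m) / k

print(replicated_overhead(3))  # 3.0 -> 1TB of data occupies 3TB raw
print(ec_overhead(6, 3))       # 1.5 -> 1TB of data occupies 1.5TB raw,
                               #        while still surviving 3 lost chunks
```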
Objectstore
Filestore
● POSIX filesystem (XFS)
● Each object is a file + xattrs
● Has external journal
● LevelDB for metadata
● Deprecated
Bluestore
● Raw block device
● RocksDB for metadata
● No journals
● RocksDB can be moved to fast disks
● Faster for most workloads
● Checksumming
● Compression
Physical Topology of Cluster
● Leaves: OSDs
● Inner nodes: buckets, i.e. physical locations (rack, PDU, DC, chassis, etc.)
● Custom placement policies (e.g., send the secondary replica to another DC)
● Place data on different disk types
● Custom failure domain (bucket type)
● Replicas of the same object are spread across different buckets of the failure domain
● OSDs have weights, typically proportional to their size (but configurable)
Source: http://yauuu.me/ride-around-ceph-crush-map.html
CRUSH map
Example: With size=3 and failure domain = rack, you can lose two entire racks without data becoming unavailable or lost. Any client can compute (or query) the placement of a given object, as sketched below.
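One way to see the CRUSH map in action is to ask Ceph where an object maps. A minimal sketch wrapping the `ceph osd map` command (the pool and object names here are hypothetical):

```python
import subprocess

def where_is(pool: str, obj: str) -> str:
    """Return the PG and acting OSD set for an object, as reported by Ceph.

    `ceph osd map <pool> <object>` only computes the placement, the same way
    any client would; no data is read or written.
    """
    out = subprocess.run(
        ["ceph", "osd", "map", pool, obj],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()

# Hypothetical pool/object names:
print(where_is("rbd", "rbd_data.1234.0000000000000042"))
# Prints something like:
#   osdmap e1234 pool 'rbd' (2) object '...' -> pg 2.5e -> up ([12,45,81], p12) acting ([12,45,81], p12)
```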
librados
● Low-level interface to RADOS
● Simply read, write and manipulate objects
● Useful for custom apps (dovecot-ceph-plugin, Archipelago, etc.)
● All other public interfaces use librados internally
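The official Python binding for librados (the `rados` module shipped with Ceph) makes the “just read, write and manipulate objects” point concrete. A minimal sketch, assuming a reachable cluster, a usable keyring and a pool named mypool (the pool name is an assumption):

```python
import rados

# Connect using the local ceph.conf and the default client credentials.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

try:
    ioctx = cluster.open_ioctx("mypool")       # pool name is an assumption
    try:
        # Write a whole object, attach a metadata xattr, read both back.
        ioctx.write_full("greeting", b"hello from librados")
        ioctx.set_xattr("greeting", "owner", b"nkorb")
        print(ioctx.read("greeting"))           # b'hello from librados'
        print(ioctx.get_xattr("greeting", "owner"))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```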
RBD
RADOS Block Device
● Provides block devices to clients
● Each block device (image) gets split into multiple RADOS objects (4MB each by default)
● librbd (or kernel RBD) calculates offsets on the fly, and thus the target object (see the sketch below)
● Has a lot of fancy features (object maps, clones, snapshots, mirroring)
● 2 ways to access RBD images: librbd or kernel RBD (shipped with the mainline kernel)
● RBD volumes can be mirrored to a different cluster
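A small sketch of the offset-to-object arithmetic, followed by image creation through the official librbd Python binding (the `rbd` module). The pool handle `ioctx` is an open pool as in the librados example above, and the image name is hypothetical:

```python
import rbd

OBJECT_SIZE = 4 * 1024 * 1024          # default RBD object size: 4MB

def target_object(image_offset: int) -> int:
    """Index of the RADOS object backing a given byte offset in the image."""
    return image_offset // OBJECT_SIZE

print(target_object(10 * 1024 * 1024))  # byte 10MiB lives in object #2

def create_and_write(ioctx, name="myimage", size=10 * 1024**3):
    """Create a 10GiB image and write to it through librbd (hypothetical names)."""
    rbd.RBD().create(ioctx, name, size)
    with rbd.Image(ioctx, name) as image:
        image.write(b"hello rbd", 0)    # librbd maps this write to object #0
        return image.size()
```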
Provides Ceph through the S3 and OpenStack Swift APIs
● A RESTful gateway to Ceph
● Internally maps each S3/Swift bucket/container/object to RADOS objects and metadata
● Runs a built-in HTTP server (civetweb)
● Can be multi-zoned
● Has multiple auth backends (Keystone, LDAP, etc.)
Source: http://docs.ceph.com/docs/mimic/radosgw/
RADOS Gateway
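Since RGW speaks plain S3, standard clients work unmodified. A minimal sketch with boto3; the endpoint URL and credentials are placeholders (in practice keys come from something like `radosgw-admin user create`):

```python
import boto3

# Placeholder endpoint and credentials for an RGW deployment.
s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.grnet.gr",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello from rgw")
print(s3.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read())
```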
POSIX-compliant Shared Filesystem
● Exposes a shared filesystem (think NFS) to clients
● Can be mounted using Ceph clients (FUSE, kernel) or exposed as an NFS filesystem (nfs-ganesha)
● Metadata (folders, filenames, etc.) is stored in separate pools and managed by the MDS
Source: https://www.sebastien-han.fr/blog/2012/12/03/ceph-and-mds/
CephFS
Ceph @ GRNET
Ceph Infrastructure
Interesting numbers
● 6 clusters (4 prod, 2 staging)
● 900 OSDs
● 81 hosts
● 2.5PB total raw storage
● 100 million objects
● 1500 RBD volumes
● 400,000 Swift objects
● 2 major outages
● 0 bytes corrupted or lost
Facts
● librados, RBD, rgw (Swift)
● Variety of hardware
● Spine-leaf network topology
● Each cluster lives in only one DC
● Mix of Ceph versions and setups
● Filestore & Bluestore
● Failure domain: host
● No mixed clusters
● 4/6 clusters are IPv6-only :)
Ceph Clusters
Name         rd0           rd1              rd2           rd3
Version      Jewel         Luminous         Luminous      Luminous
Location     YPEPTH        KNOSSOS          YPEPTH        KNOSSOS
Services     librados      librados, RBD    RBD           rgw (Swift)
Used by      ~okeanos      ~okeanos, ViMa   ViMa          ESA Copernicus
Pools        repl. size=2  repl. size=3     repl. size=3  EC 6+3, size=3 on SSDs
Objectstore  Filestore     Filestore        Bluestore     Bluestore
Capacity     350TB         700TB            540TB         1PB
Usage        60%           25%              25%           22%
OSDs         186           192              192           350
Hosts        31            16               12            22
Open Source Tooling
● FAI for fully automated bare-metal server provisioning
● ceph-ansible for provisioning, plus some custom scripts
● Puppet for configuration management
● Ansible and Python tooling for maintenance, operations and upgrades
● Icinga for alerting/healthchecks (a minimal check is sketched below)
● Prometheus for Ceph and node metrics
● ELK for log aggregation
We have also started an effort to open-source our tooling (always GPL!) and provide it to the community.
https://github.com/grnet/cephtools
Ansible playbooks, helper scripts, health checks and more to come!
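In the spirit of the helper scripts in grnet/cephtools, here is a minimal health check of the kind one could wire into Icinga. The exit codes follow the Nagios/Icinga convention; treating the `status` field of `ceph health --format json` as authoritative is an assumption about the Luminous-era output:

```python
#!/usr/bin/env python3
"""Tiny Icinga/Nagios-style check: OK/WARNING/CRITICAL based on `ceph health`."""
import json
import subprocess
import sys

def main() -> int:
    out = subprocess.run(
        ["ceph", "health", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    # Assumption: the JSON output carries a top-level "status" field such as
    # HEALTH_OK / HEALTH_WARN / HEALTH_ERR.
    status = json.loads(out.stdout).get("status", "UNKNOWN")
    print(f"Ceph health: {status}")
    if status == "HEALTH_OK":
        return 0        # OK
    if status == "HEALTH_WARN":
        return 1        # WARNING
    return 2            # CRITICAL (HEALTH_ERR or anything unexpected)

if __name__ == "__main__":
    sys.exit(main())
```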
Outages
2 major outages
● 4-day outage caused by flapping hosts
– https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
● Not exactly an “outage”: huge performance degradation due to a single faulty QSFP
– https://blog.noc.grnet.gr/2018/08/29/a-performance-story-how-a-faulty-qsfp-crippled-a-whole-ceph-cluster/
Future of Ceph @ GRNET
● Provide S3/Swift as a Service
● Use Ceph for OpenStack clouds
● Automate ourselves out of daily operations!
● Improve performance monitoring
● Contribute to Ceph with patches and docs
● Experiment with new features and tunings
● …
Pros & Cons
Pros
● Free Software
● Cheaper than vendor solutions
● Scales easily
● Fast and resilient
● Each release gets more stable, faster and easier to manage
● Great community
● Provides a lot of services out of the box without much hassle
● Super easy upgrades & ops
● No central directory, no SPOFs
● Customizable
● No major problems so far
Cons
● Requires more technical insight than closed-source vendor solutions
● Ceph hates unstable networks!
● Benchmarking is not easy
● Increased latency in some scenarios
● Recovery is not always fast
● Wrong configuration can cause trouble
● Not many FOSS tools for daily ops: you might have to implement your own
Questions?
  • 20. Pros & Cons Pros ● Free Software ● Cheaper than vendor solutions ● Can easily scale ● Fast and resilient ● Each release is getting more stable, faster and easier to manage ● Great community ● Provides a lot of services out of the box without much hassle ● Super easy upgrades & ops ● No central directory, no SPOFs ● Customizable ● No major problems so far Cons ● Requires more technical insight than closed-source vendor solutions ● Ceph hates unstable networks! ● Benchmarking is not easy ● Increased latency in some scenarios ● Recovery is not always fast ● Wrong configuration can cause trouble ● Not FOSS tools for daily ops: you might have to implement your own