SF Ceph Users Jan. 2014

Ceph, being a distributed storage system, is highly reliant on the network for resiliency and performance. In addition, it is crucial that the network topology beneath a Ceph cluster be designed in such a way to facilitate easy scaling without service disruption. After an introduction to Ceph itself this talk will dive into the design of Ceph client and cluster network topologies.

SF Ceph Users Jan. 2014 Presentation Transcript

  • SF BAY AREA CEPH USERS GROUP INAUGURAL MEETUP · Thursday, January 16, 2014
  • AGENDA · Intro to Ceph · Ceph Networking · Public Topologies · Cluster Topologies · Network Hardware
  • THE FORECAST · By 2020 over 39 ZB of data will be stored. 1.5 ZB are stored today.
  • THE PROBLEM · Growth of data: existing systems don’t scale · IT storage budget: increasing cost and complexity · Need to invest in new platforms ahead of time · [chart: data growth vs. IT storage budget, 2010-2020]
  • THE SOLUTION · PAST: SCALE UP · FUTURE: SCALE OUT
  • CEPH
  • INTRO TO CEPH · Distributed storage system · Horizontally scalable · No single point of failure · Self-healing and self-managing · Runs on commodity hardware · LGPL v2.1 license
  • ARCHITECTURE
  • SERVICE COMPONENTS (PART 1) · MONITOR: PAXOS for consensus; maintain cluster state; typically 3-5 nodes; NOT in write path · OSD: object storage interface; gossips with peers; data lives here
  • SERVICE COMPONENTS (PART 2) · RADOS GATEWAY: provides S3/Swift compatibility; scale out · METADATA: object storage interface; gossips with peers; dynamic subtree partitioning
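    A minimal sketch of what the S3/Swift compatibility above means in practice: the RADOS Gateway speaks the S3 wire protocol, so a stock S3 client such as boto can simply be pointed at it. The endpoint hostname, credentials and bucket name below are placeholders, not values from the talk.

      # Hypothetical example: plain boto S3 client talking to a RADOS Gateway.
      import boto
      import boto.s3.connection

      conn = boto.s3.connection.S3Connection(
          aws_access_key_id='ACCESS_KEY',          # RGW user's access key
          aws_secret_access_key='SECRET_KEY',      # RGW user's secret key
          host='rgw.example.com',                  # your RADOS Gateway endpoint
          is_secure=False,
          calling_format=boto.s3.connection.OrdinaryCallingFormat(),
      )

      bucket = conn.create_bucket('demo-bucket')     # CRUD: create a bucket
      key = bucket.new_key('hello.txt')
      key.set_contents_from_string('hello ceph')     # CRUD: write an object
      print(key.get_contents_as_string())            # CRUD: read it back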
  • CRUSH · Ceph uses CRUSH for data placement · Aware of cluster topology · Statistically even distribution across a pool · Supports asymmetric nodes and devices · Hierarchical weighting
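    To give a feel for what "statistically even distribution" and "hierarchical weighting" buy you, here is a toy weighted rendezvous-hash placement function. It is not the real CRUSH algorithm (CRUSH also walks the cluster hierarchy and respects failure domains); it only illustrates the core idea that every client computes placement deterministically from a shared map, with no central lookup table. The OSD names and weights are made up.

      import hashlib
      import math

      # Hypothetical OSD map: name -> weight (e.g. proportional to disk size).
      osds = {'osd.0': 1.0, 'osd.1': 1.0, 'osd.2': 2.0, 'osd.3': 1.0}

      def score(obj, osd, weight):
          # Deterministic pseudo-random draw for each (object, OSD) pair.
          h = int(hashlib.md5(('%s/%s' % (obj, osd)).encode()).hexdigest(), 16)
          u = ((h & 0xffffffff) + 1) / float(2 ** 32 + 1)   # uniform in (0, 1)
          return -weight / math.log(u)                      # weighted rendezvous score

      def place(obj, replicas=2):
          # Highest-scoring OSDs win; every client gets the same answer.
          return sorted(osds, key=lambda o: score(obj, o, osds[o]), reverse=True)[:replicas]

      print(place('rbd_data.1234'))   # same result on every client, no lookup table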
  • DATA PLACEMENT
  • POOLS · Groupings of OSDs · Both physical and logical · Volumes / Images · Hot SSD pool · Cold SATA pool · DMCrypt pool
  • REPLICATION · Original data durability mechanism · Ceph creates N replicas of each RADOS object · Uses CRUSH to determine replica placement · Required for mutable objects (RBD, CephFS) · More reasonable for smaller installations
  • ERASURE CODING (Firefly release) · (8:4) MDS code in example · 1.5x overhead · 8 units of client data to write · 4 parity units generated using FEC · All 12 units placed with CRUSH · 8/12 total units to satisfy a read
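    The 1.5x figure falls straight out of the (8:4) parameters on this slide; a quick back-of-envelope comparison against 3-way replication:

      # Overhead of the slide's (8:4) erasure-coded example vs. 3 full replicas.
      k, m = 8, 4                       # data chunks, parity chunks
      ec_overhead = (k + m) / float(k)  # 12/8 = 1.5x raw capacity per byte stored
      replica_overhead = 3.0            # 3-way replication = 3x raw capacity

      print('EC %d:%d overhead: %.1fx' % (k, m, ec_overhead))
      print('3x replication overhead: %.1fx' % replica_overhead)
      # Any k of the k+m chunks (here 8 of 12) are enough to satisfy a read,
      # so each object survives the loss of up to m = 4 of its chunks.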
  • CLIENT COMPONENTS · Native API: mutable object store; many language bindings; object classes · CephFS: Linux kernel CephFS client since 2.6.34; FUSE client; Hadoop JNI bindings
  • CLIENT COMPONENTS · Block storage: Linux kernel RBD client since 2.6.37+; KVM/QEMU integration; Xen integration · S3/Swift: RESTful interfaces (HTTP); CRUD operations; usage accounting for billing
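    As a concrete example of the native API on the previous slide, here is a minimal write/read round trip through the librados Python bindings. The pool name 'data' and the conffile path are assumptions about the local setup.

      import rados

      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()                      # client contacts the monitors and fetches maps
      try:
          ioctx = cluster.open_ioctx('data') # I/O context bound to a single pool
          ioctx.write_full('greeting', b'hello ceph')   # write a RADOS object
          print(ioctx.read('greeting'))                 # read it back
          ioctx.close()
      finally:
          cluster.shutdown()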
  • Ceph Networking
  • INFINIBAND · Currently only supported via IPoIB · Accelio (libxio) integration in Ceph is in early stages · Accelio supports multiple transports: RDMA, TCP and shared memory · Accelio supports multiple RDMA transports (IB, RoCE, iWARP)
  • ETHERNET · Tried and true · Proven at scale · Economical · Many suitable vendors
  • 10GbE or 1GbE · Cost of 10GbE trending downward · White box switches turning up heat on vendors · Twinax relatively inexpensive and low power · SFP+ is versatile with respect to distance · Single 10GbE for object · Dual 10GbE for block storage (public/cluster) · Bonding many 1GbE links adds lots of complexity
  • IPv4 or IPv6 Native · It’s 2014, is this really a question? · Ceph fully supports both modes of operation · Hierarchical allocation models allow “roll up” of routes · Optimal efficiency in the RIB · Some tools believe the earth is flat
  • LAYER 2 · Spanning tree · Switch table size · Broadcast domains (ARP) · MAC frame checksum · Storage protocols (FCoE, ATAoE) · TRILL, MLAG · Layer 2 DCI is crazy pants · Layer 2 tunneled over the internet is super crazy pants
  • LAYER 3 · Address and subnet planning · Proven scale at big web shops · Error detection only on TCP header · Equal cost multi-path (ECMP) · Reasonable for inter-site connectivity
  • Public Topologies
  • CLIENT TOPOLOGIES · Path diversity for resiliency · Minimize network diameter · Consistent hop count to minimize network long-tail latency · Ease of scaling · Tolerate adversarial traffic patterns (fan-in/fan-out)
  • FOLDED CLOS · Sometimes called Fat Tree or Spine and Leaf · Minimum 4 fixed switches, grows to 10k+ node fabrics · Rack or cluster oversubscription possible · Non-blocking also possible · Path diversity · [diagram: spine switches fully meshed to leaf switches 1..N]
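    A rough sizing sketch for the folded Clos above, built from a single model of fixed switch; the port count and oversubscription ratio are illustrative inputs, not numbers from the talk.

      # Two-tier spine-and-leaf fabric built from identical fixed switches.
      ports = 64        # ports per switch (example value)
      oversub = 3.0     # leaf downlink:uplink ratio; 1.0 = non-blocking

      uplinks = ports / (1 + oversub)   # leaf ports pointed at the spines
      downlinks = ports - uplinks       # leaf ports offered to hosts
      spines = uplinks                  # one uplink per spine gives full path diversity
      leaves = ports                    # each spine port terminates one leaf
      hosts = leaves * downlinks

      print('%d spines, %d leaves, %d host ports at %.1f:1' % (spines, leaves, hosts, oversub))
      # With oversub = 1.0 this reduces to ports**2 / 2 host ports,
      # e.g. 64-port switches -> 2048 non-blocking host ports.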
  • Cluster Topologies
  • REPLICA TOPOLOGIES · Replica and erasure fan-out · Recovery and remap impact on cluster bandwidth · OSD peering · Backfill served from primary · Tune backfills to avoid large fan-in
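    A small illustration of the replica fan-out point above: for an N-replica pool the primary OSD re-transmits every client write to N-1 peers over the cluster network, so the back-end network carries roughly (N-1)x the client ingest even before any recovery or backfill traffic. The numbers are made up.

      client_write_mbps = 1000.0   # aggregate client ingest on the public network, Mb/s
      replicas = 3                 # pool size (assumed)

      # The primary receives each write once (public network) and forwards
      # replicas - 1 copies to its peers (cluster network).
      cluster_traffic = client_write_mbps * (replicas - 1)
      print('Replication load on the cluster network: %.0f Mb/s' % cluster_traffic)
      # Recovery and backfill add to this, which is why the slide suggests
      # tuning backfills to avoid large fan-in at any one OSD.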
  • FOLDED CLOS · Sometimes called Fat Tree or Spine and Leaf · Minimum 4 fixed switches, grows to 10k+ node fabrics · Rack or cluster oversubscription possible · Non-blocking also possible · Path diversity · [diagram: spine switches fully meshed to leaf switches 1..N]
  • N-WAY PARTIAL MESH
  • EVALUATE · Replication · Erasure coding · Special purpose vs general purpose · Extra port cost
  • Network Hardware
  • FEATURES · Buffer sizes · Cut-through vs store-and-forward · Oversubscribed vs non-blocking · Automation and monitoring
  • FIXED · Fixed switches can easily build large clusters · Easier to source · Smaller failure domains · Fixed designs have many control planes · Virtual chassis... L3 split-brain hilarity?
  • LESS SKU · Utilize as few vendor SKUs as possible · If permitted, use the same fixed switch for spine and leaf · More affordable to keep spares (or more of them) on site · Quicker MTTR when gear is ready to go
  • Thanks to our host!
  • Kyle Bader · Sr. Solutions Architect · kyle@inktank.com