SF Ceph Users Jan. 2014

Ceph, being a distributed storage system, is highly reliant on the network for resiliency and performance. In addition, it is crucial that the network topology beneath a Ceph cluster be designed in a way that facilitates easy scaling without service disruption. After an introduction to Ceph itself, this talk will dive into the design of Ceph client and cluster network topologies.

Transcript

  • 1. SF BAY AREA CEPH USERS GROUP INAUGURAL MEETUP
  • 2. AGENDA: Intro to Ceph; Ceph Networking; Public Topologies; Cluster Topologies; Network Hardware
  • 3. THE FORECAST: By 2020, over 39 ZB of data will be stored; 1.5 ZB is stored today.
  • 4. THE PROBLEM: data growth is outpacing the IT storage budget (2010 to 2020 trend); existing systems don't scale; cost and complexity keep increasing; new platforms need investment ahead of time.
  • 5. THE SOLUTION. Past: scale up. Future: scale out.
  • 6. CEPH
  • 7. INTRO TO CEPH: distributed storage system; horizontally scalable; no single point of failure; self-healing and self-managing; runs on commodity hardware; LGPL v2.1 license.
  • 8. ARCHITECTURE (diagram slide)
  • 9. SERVICE COMPONENTS, PART 1. MONITOR: Paxos for consensus; maintains cluster state; typically 3-5 nodes; NOT in the write path. OSD: object storage interface; gossips with peers; data lives here.
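
    To make the monitor/OSD split concrete, here is a minimal sketch using the python-rados bindings (assumptions: python-rados is installed, /etc/ceph/ceph.conf lists the monitors, and the client has a valid keyring). The client bootstraps through the monitors to fetch the cluster maps; data I/O itself goes directly to the OSDs.

        import rados

        # Bootstrap through the monitors named in ceph.conf. The monitors hand out
        # the cluster maps (monmap, osdmap, CRUSH map) but are NOT in the write path.
        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()

        print("cluster fsid: %s" % cluster.get_fsid())
        print("pools: %s" % cluster.list_pools())

        cluster.shutdown()
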
  • 10. SERVICE COMPONENTS, PART 2. RADOS GATEWAY: provides S3/Swift compatibility; scales out. METADATA (MDS): manages the CephFS namespace; gossips with peers; dynamic subtree partitioning.
  • 11. CRUSH: Ceph uses CRUSH for data placement; aware of the cluster topology; statistically even distribution across a pool; supports asymmetric nodes and devices; hierarchical weighting.
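
    As an illustration of the hierarchy and weighting CRUSH works with, here is a sketch of a decompiled CRUSH map excerpt (host, rack, and item names are hypothetical; the syntax follows the Firefly-era text format produced by crushtool -d). The unequal item weights show asymmetric devices, and the rule spreads replicas across distinct hosts.

        # hypothetical excerpt of a decompiled CRUSH map
        host node-a {
                id -2
                alg straw
                hash 0                      # rjenkins1
                item osd.0 weight 1.000
                item osd.1 weight 0.500     # smaller device: asymmetric weighting
        }
        rack rack-1 {
                id -3
                alg straw
                hash 0
                item node-a weight 1.500    # bucket weights roll up the hierarchy
        }
        root default {
                id -1
                alg straw
                hash 0
                item rack-1 weight 1.500
        }
        rule replicated_ruleset {
                ruleset 0
                type replicated
                min_size 1
                max_size 10
                step take default
                step chooseleaf firstn 0 type host   # place each replica on a different host
                step emit
        }
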
  • 12. DATA PLACEMENT (diagram slide)
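
    Placement can be inspected for any object by asking the monitors what CRUSH computed for it; a small sketch via python-rados mon_command follows (the pool and object names are hypothetical, and the JSON command schema can vary between releases, so treat this as illustrative).

        import json
        import rados

        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()

        # Equivalent to `ceph osd map rbd greeting` on the CLI: reports the placement
        # group and the acting set of OSDs that CRUSH computed for this object.
        cmd = json.dumps({"prefix": "osd map", "pool": "rbd",
                          "object": "greeting", "format": "json"})
        ret, outbuf, outs = cluster.mon_command(cmd, b'')
        print("ret=%d out=%s" % (ret, outbuf))

        cluster.shutdown()
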
  • 13. POOLS: groupings of OSDs, both physical and logical; volumes / images; hot SSD pool; cold SATA pool; dmcrypt pool.
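
    A quick sketch of creating the kinds of pools the slide mentions through python-rados (pool names are hypothetical; pointing a pool at an SSD-only or SATA-only CRUSH rule is a separate step done against the CRUSH map).

        import rados

        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()

        # Pools are logical groupings of OSDs; which devices actually back a pool
        # is decided by the CRUSH rule the pool is pointed at.
        for name in ('rbd-ssd-hot', 'rbd-sata-cold'):
            if not cluster.pool_exists(name):
                cluster.create_pool(name)

        print(cluster.list_pools())
        cluster.shutdown()
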
  • 14. REPLICATION: the original data durability mechanism; Ceph creates N replicas of each RADOS object; uses CRUSH to determine replica placement; required for mutable objects (RBD, CephFS); more reasonable for smaller installations.
  • 15. ERASURE CODING (Firefly release): an (8:4) MDS (maximum distance separable) code in this example; 1.5x overhead; 8 units of client data to write; 4 parity units generated using FEC; all 12 units placed with CRUSH; any 8 of the 12 units satisfy a read.
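
    The 1.5x figure follows directly from the (k=8, m=4) parameters; a tiny arithmetic sketch contrasting it with the 3-way replication of the previous slide:

        def replication_raw_per_byte(replicas):
            # N full copies of every RADOS object: raw bytes stored per usable byte
            return float(replicas)

        def ec_raw_per_byte(k, m):
            # (k, m) erasure code: k data units plus m parity units per stripe
            return (k + m) / float(k)

        print(replication_raw_per_byte(3))   # 3.0x for three replicas
        print(ec_raw_per_byte(8, 4))         # 1.5x for the (8:4) code above
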
  • 16. CLIENT COMPONENTS. Native API: mutable object store; many language bindings; object classes. CephFS: Linux kernel CephFS client since 2.6.34; FUSE client; Hadoop JNI bindings.
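
    A minimal sketch of the native API through the Python binding (assumption: a pool named 'data' already exists). The same librados interface underpins RBD, the RADOS Gateway, and CephFS.

        import rados

        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()

        # RADOS objects are mutable: full writes, appends, and partial overwrites work.
        ioctx = cluster.open_ioctx('data')          # assumes a pool named 'data'
        ioctx.write_full('hello-object', b'hello from librados')
        print(ioctx.read('hello-object'))

        ioctx.close()
        cluster.shutdown()
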
  • 17. CLIENT COMPONENTS. Block storage: Linux kernel RBD client since 2.6.37; KVM/QEMU integration; Xen integration. S3/Swift (RADOS Gateway): RESTful interfaces (HTTP); CRUD operations; usage accounting for billing.
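
    And the block-storage side through the python-rbd binding, as a sketch (assumptions: python-rbd is installed, a pool named 'rbd' exists, and the image name and size are made up). The resulting image is the same format the kernel RBD client and QEMU attach to.

        import rados
        import rbd

        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()
        ioctx = cluster.open_ioctx('rbd')

        # Create a 1 GiB image; it is striped across many RADOS objects in the pool.
        rbd.RBD().create(ioctx, 'demo-image', 1024 ** 3)

        image = rbd.Image(ioctx, 'demo-image')
        image.write(b'\x00' * 4096, 0)              # write 4 KiB at offset 0
        image.close()

        ioctx.close()
        cluster.shutdown()
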
  • 18. Ceph Networking
  • 19. INFINIBAND: currently only supported via IPoIB; Accelio (libxio) integration in Ceph is in early stages; Accelio supports multiple transports (RDMA, TCP, and shared memory); Accelio supports multiple RDMA transports (IB, RoCE, iWARP).
  • 20. ETHERNET: tried and true; proven at scale; economical; many suitable vendors.
  • 21. 10GbE OR 1GbE: cost of 10GbE trending downward; white-box switches turning up the heat on vendors; Twinax is relatively inexpensive and low power; SFP+ is versatile with respect to distance; single 10GbE for object storage; dual 10GbE for block storage (public/cluster); bonding many 1GbE links adds lots of complexity.
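
    The public/cluster split mentioned above is configured per host in ceph.conf; a hedged sketch (the subnets are hypothetical, and option spelling and defaults should be checked against your release):

        [global]
            # client-facing traffic: monitors and client-to-OSD I/O
            public network  = 10.10.1.0/24
            # OSD-to-OSD traffic: replication, backfill, recovery
            cluster network = 10.10.2.0/24
            # ms bind ipv6 = true      # uncomment for an IPv6-native deployment (next slide)
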
  • 22. IPv4 OR IPv6 NATIVE: it's 2014, is this really a question? Ceph fully supports both modes of operation; a hierarchical allocation model allows "roll up" (aggregation) of routes; optimal efficiency in the RIB; some tools believe the earth is flat.
  • 23. LAYER 2: spanning tree; switch table size; broadcast domains (ARP); MAC frame checksum; storage protocols (FCoE, ATAoE); TRILL, MLAG; Layer 2 DCI is crazy pants; Layer 2 tunneled over the internet is super crazy pants.
  • 24. LAYER 3: address and subnet planning; proven scale at big web shops; error detection only on the TCP header; equal-cost multi-path (ECMP); reasonable for inter-site connectivity.
  • 25. Public Topologies
  • 26. CLIENT TOPOLOGIES: path diversity for resiliency; minimize network diameter; consistent hop count to minimize long-tail latency; ease of scaling; tolerate adversarial traffic patterns (fan-in/fan-out).
  • 27. FOLDED CLOS: sometimes called fat tree or spine-and-leaf; minimum of 4 fixed switches, grows to 10k+ node fabrics; rack or cluster oversubscription possible; non-blocking also possible; path diversity. (Diagram: two-tier spine-and-leaf fabric.)
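
    A back-of-the-envelope sizing sketch for a two-tier leaf/spine fabric, assuming identical fixed switches, one same-speed link from every leaf to every spine, and hypothetical port counts:

        def leaf_spine(ports_per_switch, num_spines):
            # One uplink from each leaf to each spine, all links at the same speed.
            uplinks_per_leaf = num_spines
            host_ports_per_leaf = ports_per_switch - uplinks_per_leaf
            max_leaves = ports_per_switch               # each spine gives one port per leaf
            max_hosts = max_leaves * host_ports_per_leaf
            oversubscription = host_ports_per_leaf / float(uplinks_per_leaf)
            return max_leaves, max_hosts, oversubscription

        # 32-port switches, 4 spines: up to 32 leaves, 896 hosts, 7:1 oversubscribed
        print(leaf_spine(32, 4))
        # 32-port switches, 16 spines: up to 32 leaves, 512 hosts, non-blocking (1:1)
        print(leaf_spine(32, 16))
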
  • 28. Cluster Topologies
  • 29. REPLICA TOPOLOGIES: replica and erasure fan-out; recovery and remap impact on cluster bandwidth; OSD peering; backfill served from the primary; tune backfills to avoid large fan-in.
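
    "Tune backfills" usually comes down to a handful of OSD options; a hedged ceph.conf sketch (these option names are the commonly used ones, but defaults and exact behavior vary by release, so treat the values as illustrative):

        [osd]
            # limit concurrent backfill operations per OSD to bound fan-in
            osd max backfills = 2
            # throttle concurrent recovery ops so client I/O keeps headroom
            osd recovery max active = 3
            # deprioritize recovery traffic relative to client operations
            osd recovery op priority = 2
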
  • 30. FOLDED CLOS: sometimes called fat tree or spine-and-leaf; minimum of 4 fixed switches, grows to 10k+ node fabrics; rack or cluster oversubscription possible; non-blocking also possible; path diversity. (Diagram: two-tier spine-and-leaf fabric.)
  • 31. N-WAY PARTIAL MESH (diagram slide)
  • 32. EVALUATE: replication; erasure coding; special purpose vs. general purpose; extra port cost.
  • 33. Network Hardware
  • 34. Features: buffer sizes; cut-through vs. store-and-forward; oversubscribed vs. non-blocking; automation and monitoring.
  • 35. FIXED: fixed switches can easily build large clusters; easier to source; smaller failure domains; fixed designs have many control planes; virtual chassis... L3 split-brain hilarity?
  • 36. FEWER SKUs: use as few vendor SKUs as possible; if permitted, use the same fixed switch for spine and leaf; more affordable to keep spares (or extra spares) on site; quicker MTTR when gear is ready to go.
  • 37. Thanks to our host!
  • 38. Kyle Bader, Sr. Solutions Architect, kyle@inktank.com