Erasure codes and storage tiers on gluster

In this session, we'll discuss new volume types in Red Hat Gluster Storage. We will talk about erasure codes and storage tiers, and how they can work together. Future directions will also be touched on, including rule based classifiers and data transformations.

You will learn about:

How erasure codes lower the cost of storage.
How to configure and manage an erasure coded volume.
How to tune Gluster and Linux to optimize erasure code performance.
Using erasure codes for archival workloads.
How to utilize an SSD inexpensively as a storage tier.
Gluster's erasure code and storage tiering design.


  1. Erasure Codes and Storage Tiers on Gluster
     Dan Lambright, SA summit, Sep 23, 2014
  2. AGENDA
     ● Why erasure codes (ec) in Gluster
     ● How ec works
     ● Brief peek at underlying mathematics
     ● Storage tiering in Gluster
     ● Demo
     ● “One more thing”
  3. Why erasure codes in Gluster?
     ● Desire protection from double failure
     ● RAID6 controllers are expensive
     ● Imagine a 64 node volume
     ● Each brick on a separate bare metal machine
     ● Cost is 64 x $ for LSI MegaRaid controller 20K =
  4. Why erasure codes in Gluster?
     ● Triplication (3-way replication) is expensive
     ● Two redundant disks for every data disk
     ● 200% overhead! :(
  5. Erasure codes
     ● Store m disks' worth of data on k disks (k > m)
     ● n redundant disks (n = k - m)
     ● Pick n to choose the failure tolerance
     ● A generalization of RAID6
     ● Distributed across nodes
  6. Overhead analysis
     ● Can also consider mean time before failure

     | k (total disks) | n (failures tolerated) | m (data disks) | Capacity overhead (n/k) | RAID level |
     |---|---|---|---|---|
     | 3 | 1 | 2 | 33.33% | 5 |
     | 5 | 1 | 4 | 20% | 5 |
     | 6 | 2 | 4 | 33.33% | 6 |
     | 7 | 3 | 4 | 42.86% | E |
     | 9 | 1 | 8 | 11.11% | 5 |
     | 10 | 2 | 8 | 20% | 6 |
     | 11 | 3 | 8 | 27.27% | E |
     | 12 | 4 | 8 | 33.33% | E |
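
The overhead column above can be recomputed directly from k and n. The sketch below does that, and also shows why slide 4 quotes 200% for 3-way replication: that figure is redundancy relative to data (n/m), whereas the table uses redundancy relative to total capacity (n/k). The function names are illustrative only and do not come from the slides.

```python
def capacity_overhead(k, n):
    """Fraction of raw capacity spent on redundancy: n/k, as in the table."""
    return n / k


def overhead_vs_data(k, n):
    """Redundant disks per data disk: n/m, where m = k - n."""
    return n / (k - n)


# (k, n) pairs taken from the overhead table on the slide.
rows = [(3, 1), (5, 1), (6, 2), (7, 3), (9, 1), (10, 2), (11, 3), (12, 4)]

for k, n in rows:
    m = k - n
    print(f"k={k:2d} n={n} m={m} overhead={capacity_overhead(k, n):.2%}")

# 3-way replication seen as a code with k=3, n=2, m=1:
# two redundant copies for every data disk.
print(f"replica-3, n/m = {overhead_vs_data(3, 2):.0%}")
```

Measured the same way as the table (n/k), 3-way replication spends two thirds of raw capacity on redundancy, against one third for the k=6, n=2 layout that tolerates the same two failures.
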
  7. ERASURE CODES PRIMER
  8. ERASURE CODE TERMS
     ● m data disks
     ● n parity disks
     ● k total number of disks = m + n
     ● Symbol – smallest data unit, w bits (typically w = 8, a byte)
     ● Chunk (aka fragment) – r symbols per disk
     ● Stripe – collection of m + n chunks across k disks
     ● Unit of manipulation for recovery
     ● Also known as a “slice”
  9. ERASURE CODE TERMS
     [Figure: example with r=6, m=4, n=2, k=6, w=1; labels: symbol, fragment, “stripe” of 6 fragments; example bit pattern 011010]
  10. Systematic
     ● m data chunks, n coding chunks
     ● (can stripe parity and data chunks on the same disk)
     ● Reads are simple, only decode on repairs
     [Figure: Slice 1, Slice 2, Slice 3]
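
To make "reads are simple, only decode on repairs" concrete, here is a minimal sketch of a systematic code. It uses a single XOR parity chunk (n = 1, the RAID 5 style special case) and is not the Reed-Solomon code described later in the deck. All function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_systematic(chunks):
    """Return the m data chunks unchanged, followed by one parity chunk."""
    parity = reduce(xor_bytes, chunks)
    return list(chunks) + [parity]


def read(stripe, m):
    """A healthy read just concatenates the first m chunks; no decoding."""
    return b"".join(stripe[:m])


def repair(stripe, lost):
    """Rebuild the chunk at index `lost` by XOR-ing all surviving chunks."""
    survivors = [c for i, c in enumerate(stripe) if i != lost]
    return reduce(xor_bytes, survivors)


stripe = encode_systematic([b"abcd", b"efgh", b"ijkl"])  # m = 3, n = 1
print(read(stripe, 3))
print(repair(stripe, 1))
```

The data chunks are stored verbatim, so the arithmetic only runs when a chunk has to be rebuilt.
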
  11. Non-Systematic
     ● All k chunks in a stripe are coded
     ● Does not distinguish data servers from code servers
     ● Encode/decode on writes and reads
     [Figure: Slice 1, Slice 2, Slice 3]
  12. Encoding / Decoding Overhead
     ● Network RTT dominates the encode/decode overhead
     ● Packages exist to implement the math
     ● Intel has fast routines for inverse, dot product, encoding, decoding, etc.
     ● Jerasure library from academia
     ● Gluster's is purpose-built and fast
  13. GLUSTER IMPLEMENTATION
  14. GlusterFS “Disperse Volumes”
     ● Developed at Datalab corp. by Xavier Hernandez
     ● Use case: archiving medical records
     ● Developed over the last 2 years
     ● Now part of Gluster upstream
  15. CLI
     Two new options have been added to the 'create' command of the CLI:
         gluster volume create <name> disperse <count> redundancy <count>
     ● Disperse is “k” (total number of volumes)
     ● Redundancy is “n”
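
As a rough illustration of how the two options relate to k and n, the sketch below assembles the argument list for the create command. The slide elides everything after the redundancy count. The brick arguments, the `server:/path` brick format, and the rule that exactly k bricks are supplied are assumptions of this sketch, as is the helper name `build_create_command`. Nothing is executed.

```python
def build_create_command(name, disperse, redundancy, bricks):
    """Build (but do not run) a disperse-volume create command.

    disperse   -> k, the total number of fragments per stripe
    redundancy -> n, the number of failures tolerated
    bricks     -> assumed here to be exactly k "server:/path" strings
    """
    if not 0 < redundancy < disperse:
        raise ValueError("need 0 < redundancy < disperse")
    if len(bricks) != disperse:
        # Simplifying assumption of this sketch: one brick per fragment.
        raise ValueError("this sketch expects exactly `disperse` bricks")
    return [
        "gluster", "volume", "create", name,
        "disperse", str(disperse),
        "redundancy", str(redundancy),
        *bricks,
    ]


# Hypothetical hosts and paths, for illustration only.
bricks = [f"server{i}:/bricks/b{i}" for i in range(1, 7)]
cmd = build_create_command("archive", disperse=6, redundancy=2, bricks=bricks)
print(" ".join(cmd))
```

With k = 6 and n = 2 this corresponds to the m = 4 row of the overhead table.
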
  16. “Disperse volumes” design choices
     ● The “symbols” are bytes: w = 8
     ● The fragment size r = 128
     ● Algorithm: Reed-Solomon
     ● Generator matrix: Vandermonde
     ● Non-systematic
     ● Encoding / decoding done on client side
     ● Modeled after AFR
     ● Concurrent writes must be processed in order
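
The combination listed above (byte symbols, Reed-Solomon, Vandermonde generator, non-systematic) can be sketched in a few dozen lines. This is a teaching sketch, not Gluster's implementation. The field polynomial 0x11d, the evaluation points 1..k, and the layout of one symbol per fragment per m-byte word are assumptions. The fragment size r = 128 is ignored.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8), reducing by x^8+x^4+x^3+x^2+1 (0x11d, assumed)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return r


def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r


def gf_inv(a):
    """Inverse of a nonzero element: a^254, since a^255 = 1."""
    return gf_pow(a, 254)


def vandermonde(k, m):
    """k x m generator matrix; row i is [x^0, ..., x^(m-1)] with x = i + 1."""
    return [[gf_pow(i + 1, j) for j in range(m)] for i in range(k)]


def gf_mat_inv(mat):
    """Gauss-Jordan inversion of a square matrix over GF(2^8)."""
    n = len(mat)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(mat)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = gf_inv(aug[col][col])
        aug[col] = [gf_mul(v, inv) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [v ^ gf_mul(f, p) for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def encode(data, m, k):
    """Encode bytes (length a multiple of m) into k coded fragments."""
    gen = vandermonde(k, m)
    frags = [bytearray() for _ in range(k)]
    for off in range(0, len(data), m):
        word = data[off:off + m]
        for i in range(k):
            s = 0
            for j in range(m):
                s ^= gf_mul(gen[i][j], word[j])
            frags[i].append(s)
    return [bytes(f) for f in frags]


def decode(available, m, k):
    """Recover the data from any m fragments, given as {index: bytes}."""
    idx = sorted(available)[:m]
    gen = vandermonde(k, m)
    inv = gf_mat_inv([gen[i] for i in idx])
    out = bytearray()
    for pos in range(len(available[idx[0]])):
        col = [available[i][pos] for i in idx]
        for j in range(m):
            s = 0
            for t in range(m):
                s ^= gf_mul(inv[j][t], col[t])
            out.append(s)
    return bytes(out)


data = b"glusterfs ec"                                 # 12 bytes, m = 4
frags = encode(data, m=4, k=6)                         # n = 2
survivors = {i: frags[i] for i in (0, 2, 3, 5)}        # fragments 1 and 4 lost
print(decode(survivors, m=4, k=6) == data)
```

No fragment holds plain data, so every read goes through `decode`, which matches the "encode/decode on writes and reads" property of non-systematic codes.
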
  17. STORAGE TIERS
  18. Storage Tiers
     ● Different “subvolume” tiers presented as a single volume
     ● HDD, SSD, tape, “persistent memory”, etc.
     ● Plug-in policy describes how data moves between tiers
     ● V1 policy: Cache
     ● Slow and fast tiers
     ● CLI to add/remove cache tier from existing volume
  19. Example: Erasure codes + SSD
     ● User sees one volume
     ● SSD “caches” ec data
     [Figure: tiered volume with a “cache” tier on SSD (hot) and an ec tier on HDD (cold); data is promoted to the hot tier and demoted to the cold tier]
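
The promote/demote arrows in the figure can be pictured with a toy cache policy. The slides do not say how the V1 cache policy decides what is hot. The hit-count threshold, the idle timeout, and the class itself are hypothetical choices made for this sketch.

```python
class CacheTierSketch:
    """Toy model: files start cold (ec on HDD) and may be promoted to SSD."""

    def __init__(self, promote_hits=3, demote_idle=60.0):
        self.promote_hits = promote_hits  # accesses needed before promotion
        self.demote_idle = demote_idle    # seconds of inactivity before demotion
        self.hot = {}                     # file name -> last access time
        self.hits = {}                    # file name -> accesses while cold

    def access(self, name, now):
        """Record an access and report which tier served it."""
        if name in self.hot:
            self.hot[name] = now
            return "hot"
        self.hits[name] = self.hits.get(name, 0) + 1
        if self.hits[name] >= self.promote_hits:
            del self.hits[name]
            self.hot[name] = now          # promote: cold -> hot
            return "promoted"
        return "cold"

    def demote_idle_files(self, now):
        """Demote files not touched for `demote_idle` seconds: hot -> cold."""
        idle = [n for n, t in self.hot.items() if now - t >= self.demote_idle]
        for n in idle:
            del self.hot[n]
        return idle


tier = CacheTierSketch(promote_hits=2, demote_idle=10.0)
print(tier.access("scan.dcm", now=0.0))
print(tier.access("scan.dcm", now=1.0))
print(tier.demote_idle_files(now=20.0))
```

Either way the user sees a single volume. Only the placement of the file changes.
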
  20. Future: Data classification (DC)
     ● Add rules to storage graph
     ● Rule determines subvolume
     ● File name
     ● Attribute (size, content)
     ● Etc.
     [Figure: decision rule “Filename = *.lock?” with Yes / No branches leading to “Secure / Encrypted” and “HDD”]
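
A rule that picks a subvolume from the file name might look like the following sketch. It assumes the "Yes" branch of the figure leads to the secure/encrypted subvolume and the "No" branch to HDD, which the flattened figure does not confirm. The subvolume names and the `classify` helper are invented for illustration.

```python
from fnmatch import fnmatch

# Ordered (pattern, subvolume) rules; the first match wins.
# The mapping of *.lock to the secure subvolume is an assumed reading
# of the slide's figure.
RULES = [("*.lock", "secure-encrypted")]
DEFAULT_SUBVOLUME = "hdd"


def classify(filename, rules=RULES, default=DEFAULT_SUBVOLUME):
    """Return the subvolume chosen by the first matching file-name rule."""
    for pattern, subvolume in rules:
        if fnmatch(filename, pattern):
            return subvolume
    return default


print(classify("db.lock"))
print(classify("notes.txt"))
```

Rules on other attributes such as size or content would slot into the same first-match loop.
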
  21. Future flexibility
     ● Many use cases
     ● Compliance
     ● Multi-tenancy
     ● Rack-aware placement (for performance)
     ● Policies described by language
     ● Arbitrary number of tiers, rules, subvolumes ...
     ● Template based
  22. DEMO
  23. ONE MORE THING...
  24. Bitrot
     ● A daemon that scans Gluster volumes
     ● Finds corrupted data
     ● Digest associated with each file
     ● Alert / recover on mismatch
     ● “Plug-ins” to daemon may do other things
     ● Tuning parameters to be non-intrusive to performance
     ● Encryption
     ● Compression
     ● Etc.
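
The scan-and-compare idea can be sketched as follows. The slide does not name a digest algorithm or say where digests are stored. SHA-256, the in-memory dictionaries, and the function names are assumptions of this sketch.

```python
import hashlib


def digest(data):
    """Digest of a file's contents (SHA-256 is assumed, not from the slide)."""
    return hashlib.sha256(data).hexdigest()


def sign(files):
    """Record a digest for every file; `files` maps name -> bytes."""
    return {name: digest(data) for name, data in files.items()}


def scrub(files, signatures):
    """Return names whose current digest differs from the recorded one.

    Files with no recorded digest are reported as mismatches too.
    """
    return sorted(
        name for name, data in files.items()
        if digest(data) != signatures.get(name)
    )


files = {"a.txt": b"hello", "b.txt": b"world"}
signatures = sign(files)
files["b.txt"] = b"w0rld"          # simulate silent corruption
print(scrub(files, signatures))
```

A real daemon would then alert or trigger recovery for each reported name, as the slide describes.
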
  25. Do it!
     ● Learn the math: http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-Tutorial.html
     ● Get the bits: https://forge.gluster.org/disperse
  26. Thank You!
     RED HAT CONFIDENTIAL – DO NOT DISTRIBUTE
     ● dlambright@redhat.com
     ● RHS: www.redhat.com/storage/
     ● GlusterFS: www.gluster.org
     ● @Glusterorg @RedHatStorage
     ● Slides available on Mojo
