
Losing Data in a Safe Way – Advanced Replication Strategies in Apache Hadoop Ozone


All distributed file systems and NoSQL databases work well during normal operation; the big differences show up when we examine their behaviour in an emergency. Data replication strategies and recovery algorithms are the key ingredients that let a distributed data store preserve data.

Recent studies have shown that random replication is not the safest choice for storing data, as it almost guarantees data loss in the common scenario of simultaneous node failures. The Copyset Replication method significantly reduces the frequency of data loss events by selecting the replica groups in a smart way.

In this talk we introduce the key elements of successful data replication and show how advanced replication strategies can help survive outages. We show how Apache Hadoop Ozone solves the problem with these techniques, and present the challenges of combining the Copyset algorithm with advanced cluster topology support.


Losing Data in a Safe Way – Advanced Replication Strategies in Apache Hadoop Ozone

  1. Losing Data in a Safe Way – Advanced Replication Strategies in Apache Hadoop Ozone
  2. DATA SCIENCE
  3. Marton Elek (elek@apache.org, @anzix): Apache Hadoop Ozone; RAFT library for Java; https://flokkr.github.io; https://github.com/elek/flekszible
  4. DATA SCIENCE
  5. Storage in the Hadoop world: hadoop-hdfs
  6. Storage in the Hadoop world: hadoop-hdfs, hadoop-aws/aliyun/azure
  7. Storage in the Hadoop world: hadoop-hdfs vs. hadoop-aws/aliyun/azure. The trade-offs: (eventual) consistency, cost at PB scale, slower access (no local data), ease of use / no management, handling of small files, tool support, and whether only the HadoopFS protocol is supported.
  8. Replication: split the files [diagram: file F1 is split into blocks B1, B2, B3]
  9. Replication: split the files [diagram: file F1 is split into blocks B1, B2, B3]
  10. Replication: split the files. Datanodes report the state of all the blocks they manage to the Namenode; the Namenode keeps the file -> []block mapping in memory.
  11. Inside the HDFS Namenode [diagram: file path -> blocks B1, B2, B3]
  12. Inside the HDFS Namenode [diagram: file path -> blocks B1, B2, B3; blocks -> datanodes: B1 -> DN1, DN2, DN3; B2 -> DN1, DN5, DN6]
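To make the two-level mapping concrete, here is a minimal sketch of the state the Namenode keeps in memory (hypothetical class and field names, not the actual org.apache.hadoop.hdfs types):

```java
import java.util.*;

// Minimal sketch of the Namenode's two in-memory maps (hypothetical
// names, not the real HDFS classes).
class NamenodeState {
    // File path -> ordered list of block ids (the file -> []block mapping).
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // Block id -> datanodes currently holding a replica, rebuilt from
    // the block reports sent by the datanodes.
    private final Map<Long, Set<String>> blockToDatanodes = new HashMap<>();

    void addBlock(String path, long blockId) {
        fileToBlocks.computeIfAbsent(path, p -> new ArrayList<>()).add(blockId);
    }

    // Called when a datanode reports the blocks it manages.
    void processBlockReport(String datanode, Collection<Long> reportedBlocks) {
        for (long blockId : reportedBlocks) {
            blockToDatanodes.computeIfAbsent(blockId, b -> new HashSet<>()).add(datanode);
        }
    }

    Set<String> locations(long blockId) {
        return blockToDatanodes.getOrDefault(blockId, Collections.emptySet());
    }
}
```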
  13. What is Ozone? Ozone provides object store semantics (S3) for Hadoop storage, built on top of a lower-level replication layer, HDDS (Hadoop Distributed Data Store). Consistent, fast, cloud native, easy to use.
  14. Bonus: reporting superblocks. Replicate blocks in bigger groups: can handle 10 billion blocks.
  15. Bonus: reporting superblocks. Replicate blocks in bigger groups: can handle 10 billion blocks.
  16. Open containers: replicated with RAFT (Apache Ratis), while the size is < 5 GB and everything is fine. Closed containers: full containers with immutable data; can be merged after deleting data.
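A rough sketch of that life cycle (hypothetical types; the real logic lives in Ozone's HDDS layer, and 5 GB is the threshold named on the slide):

```java
// Rough sketch of the open/closed container life cycle described above
// (hypothetical types, not Ozone's actual HDDS classes).
enum ContainerState { OPEN, CLOSED }

class Container {
    static final long SIZE_LIMIT = 5L * 1024 * 1024 * 1024; // ~5 GB

    ContainerState state = ContainerState.OPEN;
    long usedBytes = 0;

    // OPEN containers accept writes, replicated via RAFT (Apache Ratis).
    void write(long bytes) {
        if (state != ContainerState.OPEN) {
            throw new IllegalStateException("closed containers are immutable");
        }
        usedBytes += bytes;
        if (usedBytes >= SIZE_LIMIT) {
            state = ContainerState.CLOSED; // full: becomes immutable
        }
    }
}
```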
  17. Apache Hadoop Ozone: CSI, Hadoop FS, S3 protocol. hadoop.apache.org/ozone
  18. Roadmap: 0.2.1-alpha (first release); 0.3.0-alpha (s3g + stability); 0.4.0-alpha (security); 0.5.0-beta (HA)
  19. DATA SCIENCE
  20. Copysets: https://web.stanford.edu/~skatti/pubs/usenix13-copysets.pdf; Tiered replication: https://www.usenix.org/system/files/conference/atc15/atc15-paper-cidon.pdf
  21. Kyle Kingsbury: Anna Concurrenina #jepsen https://www.youtube.com/watch?v=eSaFVX4izsQ
  22. What could go wrong?
  23. Node failures: independent vs. correlated
  24. Node failures: independent vs. correlated
  25. Node failures: independent vs. correlated (topology-related)
  26. Node failures: independent vs. correlated (topology-related or topology-independent)
  27. Correlated failures: topology-independent
  28. Replication: split the files [diagram: file F1 is split into blocks B1, B2, B3]
  29. Replication: split the files [diagram: file F1 is split into blocks B1, B2, B3]
  30. Replication: split the files [diagram: file F1 with a single block B1]
  31. Replication [diagram: block B1 of file F1 replicated on three of the datanodes DN1–DN6; block B2 of file F2 on another three]
  32. Random datanode failure: #2
  33. Replication [diagram: B1 and B2 replicas across DN1–DN6]
  34. Random datanode failure: #3
  35. Replication [diagram: B1 and B2 replicas across DN1–DN6]
  36. Random datanode failure: #5
  37. Replication [diagram: B1 and B2 replicas across DN1–DN6]
  38. Failure of 3 nodes -> data loss with 100% probability
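A quick way to verify that claim is a Monte Carlo sketch (illustrative only, not from the talk): place 2000 three-way-replicated blocks on 6 datanodes at random, then kill 3 random nodes.

```java
import java.util.*;

// Monte Carlo check: with random 3-way replication on 6 datanodes,
// how often does a random 3-node failure destroy all replicas of
// at least one block?
public class RandomFailureSim {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int nodes = 6, blocks = 2000, trials = 10_000, losses = 0;

        // Place each block on a random 3-node subset.
        List<Set<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            Set<Integer> replicas = new HashSet<>();
            while (replicas.size() < 3) replicas.add(rnd.nextInt(nodes));
            placement.add(replicas);
        }

        for (int t = 0; t < trials; t++) {
            Set<Integer> failed = new HashSet<>();
            while (failed.size() < 3) failed.add(rnd.nextInt(nodes));
            // Data is lost if some block's replicas are all on failed nodes.
            if (placement.stream().anyMatch(failed::containsAll)) losses++;
        }
        System.out.printf("data loss in %.1f%% of trials%n", 100.0 * losses / trials);
        // Prints ~100.0%: with 2000 blocks, every 3-node subset holds some block.
    }
}
```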
  39. Replication [diagram: B1 and B2 replicas across DN1–DN6]
  40. [diagram: 6 datanodes with blocks 1–5 and their replicas]
  41. [diagram: datanodes 1–6] Block B is placed on datanodes 1, 2, 3.
  42. Block B is placed on its datanodes. A ReplicaSet is a set of 3 datanodes; ReplicaSet (1,2,4) manages the replicas of block B.
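In code, a replicaset is just an unordered set of datanode ids; a minimal sketch with hypothetical names (not Ozone's actual pipeline/container classes):

```java
import java.util.*;

// A replicaset: the fixed set of (here 3) datanodes that together
// hold all replicas of some group of blocks.
final class ReplicaSet {
    private final Set<Integer> datanodes;

    ReplicaSet(int... ids) {
        if (ids.length != 3) throw new IllegalArgumentException("3-way replication");
        Set<Integer> s = new TreeSet<>();
        for (int id : ids) s.add(id);
        this.datanodes = Collections.unmodifiableSet(s);
    }

    // Every replica of a block managed by this set is lost exactly
    // when all of its datanodes fail at the same time.
    boolean destroyedBy(Set<Integer> failedNodes) {
        return failedNodes.containsAll(datanodes);
    }

    @Override public boolean equals(Object o) {
        return o instanceof ReplicaSet && datanodes.equals(((ReplicaSet) o).datanodes);
    }
    @Override public int hashCode() { return datanodes.hashCode(); }
}
```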
  43.–48. [animation: blocks are assigned one by one to random 3-node replicasets over datanodes 1–6, e.g. block 1 -> (1,2,4), block 2 -> (2,3,5), block 3 -> (3,5,6), block 4 -> (2,3,6); some replicasets end up holding several blocks]
  49. All 20 possible 3-node replicasets over 6 datanodes: (1 2 3) (1 2 4) (1 2 5) (1 2 6) (1 3 4) (1 3 5) (1 3 6) (1 4 5) (1 4 6) (1 5 6) (2 3 4) (2 3 5) (2 3 6) (2 4 5) (2 4 6) (2 5 6) (3 4 5) (3 4 6) (3 5 6) (4 5 6)
  50. [diagram: B1 and B2 replicas across DN1–DN6]
  51. [the 20 possible 3-node replicasets again]
  52. Failure of 3 independent nodes = killing one replicaset
  53. Random replication: 6 datanodes -> 20 different 3-node sets; 2000 blocks (20 × ~100) -> each set, e.g. (1 2 3), (1 2 4), (1 2 5), ..., holds ~100 blocks
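The arithmetic behind the slide, assuming uniformly random placement:

```latex
\binom{6}{3} = 20 \ \text{possible 3-node replicasets}, \qquad
\frac{2000 \ \text{blocks}}{20 \ \text{sets}} \approx 100 \ \text{blocks per set}.
```

The chance that any particular set stays empty is (19/20)^2000 ≈ e^-100, so effectively every one of the 20 sets holds data, and any 3 simultaneous node failures destroy the ~100 blocks of one set.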
  54.–55. [the 20 possible replicasets; with random replication essentially all of them are in use]
  56. Scale it up? Doesn't help. 600 datanodes -> ~35,000,000 different 3-node replicasets; 35,000,000,000 blocks (1000 × 35,000,000) -> each replicaset still holds ~1000 blocks
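Filling in the binomial coefficient behind the slide's numbers:

```latex
\binom{600}{3} = \frac{600 \cdot 599 \cdot 598}{6} = 35{,}820{,}200 \approx 35 \cdot 10^{6}.
```

With 35 billion blocks spread over ~35 million possible replicasets, each set still holds ~1000 blocks, so any 3 simultaneous failures again lose data with near certainty; scaling up only shrinks the amount lost per event.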
  57. How can we do it better?
  58. [diagram: B1 of F1 replicated on DN1, DN2, DN3; B2 of F2 and B3 of F3 replicated on DN4, DN5, DN6: replicas confined to fixed groups]
  59. [the 20 possible replicasets, with the few actually used ones highlighted]
  60. 2 groups: 6 datanodes -> 2 copysets [1 2 3], [4 5 6]; 2000 blocks -> ~1000 blocks per copyset; a 3-DN failure causes data loss in 2 of the 20 possible cases. Random: 6 datanodes -> 20 copysets [1 3 6], [4 7 9], ...; 2000 blocks -> ~100 blocks per copyset; a 3-DN failure causes data loss in 20 of 20 cases.
  61. 2 groups: as above, and a loss event destroys ~1000 blocks. Random: as above, and a loss event destroys ~100 blocks.
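Note that, conditioned on 3 simultaneous failures, the expected amount of lost data is identical; grouping only reshapes the distribution from certain-but-small to rare-but-large losses:

```latex
E[\text{loss}]_{\text{2 groups}} = \frac{2}{20} \cdot 1000 = 100 \ \text{blocks}, \qquad
E[\text{loss}]_{\text{random}} = \frac{20}{20} \cdot 100 = 100 \ \text{blocks}.
```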
  62. 2 groups: 10% chance of data loss, 50% of the data lost. Random: 100% chance of data loss, 10% of the data lost.
  63. 2 groups: 10% chance of data loss, 50% of the data lost. Random: 100% chance of data loss, 10% of the data lost.
  64. Problem?
  65.–68. [animation: B1 is replicated on DN1, DN2 and DN3; DN2 fails and is replaced by DN2'; the missing replica of B1 is copied to DN2' from the surviving replicas]
  69.–71. [diagrams: the 20 possible 3-node replicasets, the sets touched by the failure and recovery highlighted, and the 6 datanodes]
  72. CopySets
  73. [diagram: 9 datanodes DN1–DN9 arranged in a 3×3 grid]
  74. [9 datanodes in a 3×3 grid] Number of replicasets: 6 groups (3 columns + 3 rows). 3 random node failures can happen in 84 ways; chance of data loss: 6/84 ≈ 7%. Sources for replication: 4 datanodes.
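The slide's numbers, spelled out:

```latex
\binom{9}{3} = \frac{9 \cdot 8 \cdot 7}{3!} = 84, \qquad
P(\text{data loss} \mid 3 \ \text{failures}) = \frac{6}{84} \approx 7\%.
```

Each datanode belongs to exactly one row group and one column group, so after a single failure its blocks can be re-read from its 2 + 2 = 4 group peers.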
  75. Summary:
      Strategy              Chance of data loss (3 failures)   Sources for replication (1 failure)
      Random                100%                               all
      2 groups              3.5%                               2
      6 groups / copysets   7%                                 4
  76. Summary (Copysets). Two main forces: the number of distinct sets (copysets), and the number of source datanodes available to recover data. Random replication is not the most effective way; we can do better, trading a lower chance of losing data against more data lost per event. A sketch of generating such sets follows below.
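Here is the permutation idea from the Copysets paper linked above, as a minimal illustration (not Ozone's actual placement code): shuffle the nodes, cut the permutation into chunks of R, and repeat P times to tune the number of copysets (and with it the recovery scatter width).

```java
import java.util.*;

// Minimal sketch of permutation-based copyset generation, in the
// spirit of the Copysets paper (usenix13-copysets.pdf).
// replicationFactor = R; permutations = P; scatter width ~ P * (R - 1).
public class CopysetGenerator {
    static List<Set<Integer>> generate(int nodes, int replicationFactor,
                                       int permutations, Random rnd) {
        List<Set<Integer>> copysets = new ArrayList<>();
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < nodes; i++) ids.add(i);

        for (int p = 0; p < permutations; p++) {
            Collections.shuffle(ids, rnd); // one random permutation
            for (int i = 0; i + replicationFactor <= nodes; i += replicationFactor) {
                // Each consecutive chunk of the permutation becomes a copyset.
                copysets.add(new HashSet<>(ids.subList(i, i + replicationFactor)));
            }
        }
        return copysets;
    }

    public static void main(String[] args) {
        // 9 nodes, 3-way replication, 2 permutations -> 6 copysets,
        // comparable to the 3 rows + 3 columns of the grid example above.
        System.out.println(generate(9, 3, 2, new Random(1)));
    }
}
```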
  77. Apache Hadoop Ozone: S3-compatible object store for Hadoop; on top of the Hadoop distributed storage layer; advanced replication; advanced block report handling (we ❤ small files); easy to use and run in containerized environments
  78. Q&A
  79. Apache Hadoop Ozone: CSI, Hadoop FS, S3 protocol. hadoop.apache.org/ozone
  80. Note: in-place upgrade. An upgrade path from existing HDFS with no (or minimal) data movement; the math is the same: group blocks into the same replicaset and report them in groups (containers). Better: fewer replicasets with more blocks in each. Smart preprocessing/balancing may be required. A sketch of the grouping step follows below.
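The grouping step can be sketched as a pure bookkeeping pass (hypothetical types, not the actual upgrade tooling): blocks whose replicas already sit on the same set of datanodes form one container and can be reported together without moving any data.

```java
import java.util.*;

// Sketch of the grouping step for an in-place HDFS -> Ozone upgrade:
// blocks whose replicas already live on the same datanode set are
// assigned to the same container, so no data has to move.
public class BlockGrouper {
    static Map<Set<String>, List<Long>> groupByReplicaSet(
            Map<Long, Set<String>> blockLocations) {
        Map<Set<String>, List<Long>> containers = new HashMap<>();
        blockLocations.forEach((blockId, datanodes) ->
            containers.computeIfAbsent(datanodes, dn -> new ArrayList<>())
                      .add(blockId));
        return containers;
    }

    public static void main(String[] args) {
        Map<Long, Set<String>> locations = new HashMap<>();
        locations.put(1L, Set.of("dn1", "dn2", "dn3"));
        locations.put(2L, Set.of("dn1", "dn2", "dn3")); // same replicaset -> same container
        locations.put(3L, Set.of("dn4", "dn5", "dn6"));
        System.out.println(groupByReplicaSet(locations));
    }
}
```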
