HBase Read High Availability Using Timeline-Consistent Region Replicas

Speakers: Enis Soztutar and Devaraj Das (Hortonworks)

HBase has ACID semantics within a row that make it a perfect candidate for a lot of real-time serving workloads. However, single homing a region to a server implies some periods of unavailability for the regions after a server crash. Although the mean time to recovery has improved a lot recently, for some use cases, it is still preferable to do possibly stale reads while the region is recovering. In this talk, you will get an overview of our design and implementation of region replicas in HBase, which provide timeline-consistent reads even when the primary region is unavailable or busy.

Published in: Software, Technology
  • Nice work. That said, according to slide 24 you are already giving up on consistency, so one would expect really high performance gains for READs. Have you tested or experimented with YCSB to see what throughput gain this approach yields? On consistency again: without a quorum of replicas in place, such as in Cassandra, a data value can potentially diverge at every single replica, so a client doing READs needs to know each replica's replication offset with respect to the primary in order to make a judicious decision about which value to use or discard. That increases communication and, in the end, may still leave the client with a not-so-consistent view of the system. My impression is that overcoming the performance limitations of strong consistency with enhanced availability is acceptable but not resilient for every use case. Our previous work on HBase addressed Timeline Consistency as potential future work, but more in terms of data semantics, as you can read here: Towards quality-of-service driven consistency for Big Data management.

  1. HBase Read High Availability Using Timeline-Consistent Region Replicas. Enis Soztutar (enis@hortonworks.com), Devaraj Das (ddas@hortonworks.com). © Hortonworks Inc. 2011 – 2014. All Rights Reserved.
  2. About Us
     • Enis Soztutar: committer and PMC member in Apache HBase and Hadoop since 2007; HBase team @Hortonworks; Twitter @enissoz
     • Devaraj Das: committer and PMC member in Hadoop since 2006; committer at HBase; co-founder @Hortonworks
  3. Outline of the talk
     • PART I: Use case and semantics
       - CAP recap
       - Use case and motivation
       - Region replicas
       - Timeline consistency
       - Semantics
     • PART II: Implementation and next steps
       - Server side
       - Client side
       - Data replication
       - Next steps & Summary
  4. Part I: Use case and semantics
  5. CAP reCAP (diagram): Consistency, Availability, Partition tolerance. Pick two. HBase is CP.
  6. CAP reCAP (diagram: Consistency, Availability, Partition tolerance; pick two; HBase is CP)
     • In a distributed system you cannot NOT have P
     • C vs A is about what happens if there is a network partition!
     • A and C are NEVER binary values, always a range
     • Different operations in the system can have different A / C choices
     • HBase cannot be simplified as CP
  7. HBase consistency model
     • For a single row, HBase is strongly consistent within a data center
     • Across rows, HBase is not strongly consistent (but available!)
     • When a RegionServer (RS) goes down, only the regions on that server become unavailable. Other regions are unaffected.
     • HBase multi-DC replication is “eventually consistent”
     • HBase applications should carefully design the schema for the correct semantics / performance tradeoff
  8. Use cases and motivation
     • More and more applications are looking for a “0 down time” platform
       - 30 seconds of downtime (an aggressive MTTR) is too much
     • Certain classes of apps are willing to tolerate decreased consistency guarantees in favor of availability
       - Especially for READs
     • Some build wrappers around the native API to be able to handle failures of destination servers
       - Multi-DC: when one server is down in one DC, the client switches to a different one
     • Can we do something in HBase natively?
       - Within the same cluster?
  9. Use cases and motivation
     • Designing the application requires careful tradeoff consideration
       - In schema design, since single-row operations are strongly consistent but there are no multi-row transactions
       - Multi-datacenter replication (active-passive, active-active, backups, etc.)
     • It is good to be able to give the application the flexibility to pick and choose
       - Higher availability vs stronger consistency
     • Read vs Write
       - Different consistency models for read vs write
       - Read-repair, latest-timestamp-wins vs linearizable updates
  10. Initial goals
      • Support applications talking to a single cluster really well
        - No perceived downtime
        - Only for READs
      • If apps want to tolerate cluster failures
        - Use HBase replication
        - Combine that with wrappers in the application
  11. Introducing…
      • Region Replicas in HBase
      • Timeline Consistency in HBase
  12. Region replicas
      • For every region of the table, there can be more than one replica
        - Every region replica has an associated “replica_id”, starting from 0
        - Each region replica is hosted by a different region server
      • Tables can be configured with a REGION_REPLICATION parameter
        - Default is 1
        - No change in the current behavior
      • One replica per region is the “default” or “primary”
        - Only this replica can accept WRITEs
        - All reads from this region replica return the most recent data
      • Other replicas, also called “secondaries”, follow the primary
        - They see only committed updates
  13. Region replicas
      • Secondary region replicas are read-only
        - No writes are routed to secondary replicas
        - Data is replicated to secondary regions (more on this later)
        - They serve data from the same data files as the primary
        - They may not have received the most recent data
        - Reads and Scans can be performed, returning possibly stale data
      • Region replica placement is done to maximize the availability of any particular region
        - Region replicas are not co-located on the same region servers
        - Nor on the same racks (if possible)
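
The placement rule on this slide can be illustrated with a toy assignment routine. This is only a sketch under stated assumptions: the class `ReplicaPlacementSketch`, its `place` method and the server and rack names are hypothetical, and the real HBase load balancer weighs many more factors.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not HBase code.
class ReplicaPlacementSketch {
    /**
     * Picks one server per replica of a single region. A server is never
     * reused for the same region; a rack is reused only when no unused rack
     * has a free server. serverToRack maps server name to rack name.
     */
    static List<String> place(Map<String, String> serverToRack, int regionReplication) {
        if (regionReplication > serverToRack.size()) {
            throw new IllegalArgumentException("need at least one server per replica");
        }
        List<String> chosen = new ArrayList<>();   // index == replica_id, 0 is the primary
        Set<String> usedRacks = new HashSet<>();
        for (int replicaId = 0; replicaId < regionReplication; replicaId++) {
            String pick = null;
            for (Map.Entry<String, String> e : serverToRack.entrySet()) {
                if (chosen.contains(e.getKey())) {
                    continue;                      // never co-locate two replicas
                }
                if (pick == null) {
                    pick = e.getKey();             // fallback if every rack is already used
                }
                if (!usedRacks.contains(e.getValue())) {
                    pick = e.getKey();             // prefer a rack with no replica yet
                    break;
                }
            }
            chosen.add(pick);
            usedRacks.add(serverToRack.get(pick));
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, String> cluster = new LinkedHashMap<>();
        cluster.put("rs1", "rackA");
        cluster.put("rs2", "rackA");
        cluster.put("rs3", "rackB");
        System.out.println(place(cluster, 3));     // prints [rs1, rs3, rs2]
    }
}
```

With three replicas and only two racks, the third replica has to share a rack, which matches the "if possible" qualifier on the slide.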
  14. (Diagram) A client reads from and writes to a region (rows of rowkey, column:value cells, plus a memstore) hosted on a RegionServer. The region's data blocks (b1, b2, b9) are stored on HDFS DataNodes.
  15. (Diagram) The same layout with a region replica added: a second RegionServer hosts a read-only replica of the region with its own memstore, while the client continues to read and write through the primary.
  16. TIMELINE Consistency
      • Introduced a Consistency enum
        - STRONG
        - TIMELINE
      • Consistency.STRONG is the default
      • Consistency can be set per read operation (per-get or per-scan)
      • Timeline-consistent read RPCs are sent to more than one replica
      • The semantics differ somewhat from the eventual consistency model
  17. TIMELINE Consistency
      public enum Consistency { STRONG, TIMELINE }

      Get get = new Get(row);
      get.setConsistency(Consistency.TIMELINE);
      ...
      Result result = table.get(get);
      ...
      if (result.isStale()) {
        ...
      }
  18. TIMELINE Consistency Semantics
      • Can be thought of as in-cluster active-passive replication
      • Single-homed and ordered updates
        - All writes are handled and ordered by the primary region
        - All writes have STRONG consistency
      • Secondaries apply the mutations in order
      • Only get/scan requests go to secondaries
      • The Get/Scan Result can be inspected to see whether the result came from possibly stale data
  19. TIMELINE Consistency Example (diagram): Client1 writes X=1 to the primary (replica_id=0). The edit lands in the primary's WAL and data, and replication carries it to replica_id=1 and replica_id=2.
  20. TIMELINE Consistency Example (diagram): all three replicas hold X=1, so a read returns X=1 from any of them.
  21. TIMELINE Consistency Example (diagram): Client1 writes X=2. The primary and replica_id=1 now hold X=2; replication has not yet reached replica_id=2, which still holds X=1.
  22. TIMELINE Consistency Example (diagram): reads return X=2 from the primary, X=2 from replica_id=1 and X=1 from replica_id=2.
  23. TIMELINE Consistency Example (diagram): Client1 writes X=3. Only the primary holds X=3; replica_id=1 still holds X=2 and replica_id=2 still holds X=1.
  24. TIMELINE Consistency Example (diagram): reads return X=3 from the primary, X=2 from replica_id=1 and X=1 from replica_id=2.
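
The example on slides 19 to 24 can be replayed with a small in-memory model. This is a hedged sketch, not HBase code: `TimelineSketch` and its methods are hypothetical names, replication is driven by hand, and the assumption that every answer from a secondary is flagged as possibly stale is a simplification made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one key (X) on a primary and its secondaries.
class TimelineSketch {
    /** The primary's ordered log of committed values for X. */
    private final List<Integer> log = new ArrayList<>();
    /** Number of log entries each replica has applied; index 0 is the primary. */
    private final int[] applied;

    TimelineSketch(int replicas) {
        applied = new int[replicas];
    }

    /** All writes go through the primary, which fixes their order. */
    void write(int value) {
        log.add(value);
        applied[0] = log.size();
    }

    /** A secondary applies mutations strictly in log order, up to `upTo` entries. */
    void replicate(int replicaId, int upTo) {
        if (replicaId == 0) {
            throw new IllegalArgumentException("the primary is not a replication target");
        }
        applied[replicaId] = Math.max(applied[replicaId], Math.min(upTo, log.size()));
    }

    /** Value seen by a replica, or null if it has applied nothing yet. */
    Integer read(int replicaId) {
        int n = applied[replicaId];
        return n == 0 ? null : log.get(n - 1);
    }

    /** Assumption of this sketch: any read served by a secondary is possibly stale. */
    boolean isStale(int replicaId) {
        return replicaId != 0;
    }

    public static void main(String[] args) {
        TimelineSketch x = new TimelineSketch(3);
        x.write(1); x.replicate(1, 1); x.replicate(2, 1);   // slides 19-20
        x.write(2); x.replicate(1, 2);                      // slides 21-22
        x.write(3);                                         // slides 23-24
        System.out.println(x.read(0) + " " + x.read(1) + " " + x.read(2)); // prints 3 2 1
    }
}
```

Because each secondary only ever applies a prefix of the primary's log, a replica can lag but never shows writes out of order, which is the property that separates timeline consistency from a general eventual-consistency model.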
  25. PART II: Implementation and next steps
  26. Region replicas – recap
      • Every region replica has an associated “replica_id”, starting from 0
      • Each region replica is hosted by a different region server
        - All replicas can serve READs
      • One replica per region is the “default” or “primary”
        - Only this replica can accept WRITEs
        - All reads from this region replica return the most recent data
  27. Updates in the Master
      • Replica creation
        - Replicas are created during table creation
      • No distinction between primary and secondary replicas
      • The meta table contains all information in one row
      • Load balancer improvements
        - LB made aware of replicas
        - Makes a best effort to place replicas on machines/racks so as to maximize availability
      • Alter table support
        - For adjusting the number of replicas
  28. Updates in the RegionServer
      • Treats non-default replicas as read-only
      • Storefile management
        - Keeps itself up to date with store file creations and deletions
  29. IPC layer high-level flow (diagram)
      • The client sends the READ to the primary and waits for a response
      • Response within the timeout (10 millis)?
        - YES: use the primary's response
        - NO: send the READ to all secondaries and poll for a response
      • Take the first successful response; cancel the others
      • Similar flow for GET/Batch-GET/Scan, except that a Scan is sticky to the server it sees success from
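
The flow on this slide resembles a hedged-request pattern, which can be sketched with `CompletableFuture` from the Java standard library. Everything named here (`ReplicaReadSketch`, `timelineRead`, the supplier list) is hypothetical and stands in for the client's RPC machinery; only the 10 ms primary timeout comes from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical illustration of the client-side flow, not the HBase client.
class ReplicaReadSketch {
    /**
     * replicaCalls.get(0) starts the read on the primary; the remaining
     * suppliers start the same read on each secondary.
     */
    static <T> T timelineRead(List<Supplier<CompletableFuture<T>>> replicaCalls,
                              long primaryTimeoutMillis) throws Exception {
        List<CompletableFuture<T>> inFlight = new ArrayList<>();
        CompletableFuture<T> primary = replicaCalls.get(0).get();
        inFlight.add(primary);
        try {
            // Step 1: give the primary a short head start.
            return primary.get(primaryTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException slowOrFailed) {
            // Step 2: the primary is slow or failed, so fan out to the secondaries.
        }
        for (int i = 1; i < replicaCalls.size(); i++) {
            inFlight.add(replicaCalls.get(i).get());
        }
        // Step 3: complete `winner` with the first successful answer,
        // or with an error once every call has failed.
        CompletableFuture<T> winner = new CompletableFuture<>();
        AtomicInteger failures = new AtomicInteger();
        for (CompletableFuture<T> call : inFlight) {
            call.whenComplete((value, error) -> {
                if (error == null) {
                    winner.complete(value);
                } else if (failures.incrementAndGet() == inFlight.size()) {
                    winner.completeExceptionally(error);
                }
            });
        }
        try {
            return winner.get();
        } finally {
            // Step 4: cancel whatever is still outstanding. This only marks the
            // futures cancelled; a real client would also abort the RPCs.
            for (CompletableFuture<T> call : inFlight) {
                call.cancel(true);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Supplier<CompletableFuture<String>> stuckPrimary = () -> new CompletableFuture<String>();
        Supplier<CompletableFuture<String>> secondary =
                () -> CompletableFuture.completedFuture("possibly stale value");
        System.out.println(timelineRead(List.of(stuckPrimary, secondary), 10));
    }
}
```

The sketch keeps the primary call in the race after the timeout, so a primary that answers late can still win over slower secondaries.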
  30. Performance and Testing
      • No significant performance issues discovered
        - Added interrupt handling in the RPCs to cancel unneeded replica RPCs
      • Deeper performance testing is still in progress
      • Tested via integration tests (IT)
        - A test fails if a response is not received within a certain time
  31. Next steps
      • What has been described so far is in “Phase-1” of the project
      • Phase-2
        - WAL replication
        - Handling of Merges and Splits
        - Latency guarantees: cancellation of RPCs server side; promotion of one Secondary to Primary, and recruiting a new Secondary
      • Use the infrastructure to implement consensus protocols for read/write within a single datacenter
  32. Data Replication
      • Data should be replicated from primary regions to secondary regions
      • A region's data = data files on HDFS + in-memory data in memstores
      • Data files MUST be shared. We do not want to store multiple copies
      • Do not cause more writes than necessary
      • Two solutions:
        - Region snapshots: share only data files
        - Async WAL replication: share data files; every region replica has its own in-memory data
  33. Data Replication – Region Snapshots
      • Primary region works as usual
        - Buffers up mutations in the memstore
        - Flushes to disk when full
        - Compacts files when needed
        - Deleted files are kept in the archive directory for some time
      • Secondary regions periodically look for new files in the primary region
        - When a new flushed file is seen, just open it and start serving data from there
        - When a compaction is seen, open the new file and close the files that are gone
        - Good for read-only, bulk-loaded or less frequently updated data
      • Implemented in Phase 1
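
The periodic refresh described on this slide amounts to a set difference between the files a secondary has open and the files the primary currently lists. The following is a minimal sketch under that assumption; `StoreFileRefreshSketch` and the file names are hypothetical, and real store files would be HFiles on HDFS rather than strings.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a secondary's store file refresh.
class StoreFileRefreshSketch {
    /** Files this secondary currently has open for serving reads. */
    private final Set<String> open = new TreeSet<>();

    /**
     * Called periodically with the file names now present for the primary
     * region. Opens files that appeared (flushes, compaction outputs) and
     * closes files that are gone (compaction inputs).
     */
    void refresh(Set<String> primaryFiles) {
        Set<String> toOpen = new TreeSet<>(primaryFiles);
        toOpen.removeAll(open);
        Set<String> toClose = new TreeSet<>(open);
        toClose.removeAll(primaryFiles);
        open.addAll(toOpen);       // start serving from the new files
        open.removeAll(toClose);   // stop serving from files that were compacted away
    }

    /** Snapshot of the files currently being served, in sorted order. */
    Set<String> openFiles() {
        return new TreeSet<>(open);
    }

    public static void main(String[] args) {
        StoreFileRefreshSketch secondary = new StoreFileRefreshSketch();
        secondary.refresh(Set.of("f1"));          // first flush on the primary
        secondary.refresh(Set.of("f1", "f2"));    // second flush
        secondary.refresh(Set.of("f3"));          // f1 and f2 compacted into f3
        System.out.println(secondary.openFiles()); // prints [f3]
    }
}
```

Data still sitting in the primary's memstore never shows up in this scheme until it is flushed, which is why the slide recommends it for read-only or rarely updated data.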
  34. Data Replication – Async WAL Replication
      • Being implemented in Phase 2
      • Uses a replication source to tail the WAL files from the RS
        - Plugs in a custom replication sink to replay the edits on the secondaries
        - Flush and compaction events are written to the WAL. Secondaries pick up new files when they see the entry
      • A secondary region open will:
        - Open the region files of the primary region
        - Set up a replication queue based on the last seen seqId
        - Accumulate edits in the memstore (memory management issues are on the next slide)
        - Mimic flushes and compactions from the primary region
  35. Memory management & flushes
      • Memory snapshots-based approach
        - The secondaries look for the WAL-edit entries Start-Flush and Commit-Flush
        - They mimic what the primary does in terms of taking snapshots: when a flush is successful, the snapshot is let go
        - If the RegionServer hosting a secondary is under memory pressure: make some other primary region flush
      • Flush-based approach
        - Treat the secondary regions as regular regions
        - Allow them to flush as usual
        - Flush to the local disk, and clean the files up periodically or on certain events; treat them as normal store files for serving reads
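
The snapshot-based approach can be pictured as a secondary memstore that reacts to the Start-Flush and Commit-Flush markers it sees while replaying the WAL. This is a rough sketch under assumptions: `SecondaryMemstoreSketch`, its method names and the string-typed edits are all hypothetical, and aborted flushes and memory-pressure handling are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a secondary mimicking the primary's flushes.
class SecondaryMemstoreSketch {
    /** Edits replayed since the last Start-Flush marker. */
    private final List<String> active = new ArrayList<>();
    /** Edits frozen by a Start-Flush marker and not yet committed. */
    private final List<String> snapshot = new ArrayList<>();
    /** Flushed files the secondary has picked up from the primary. */
    private final List<String> storeFiles = new ArrayList<>();

    /** A regular WAL edit: keep it in memory so reads can see it. */
    void replayEdit(String edit) {
        active.add(edit);
    }

    /** Start-Flush marker: freeze the current edits, as the primary did. */
    void startFlush() {
        snapshot.addAll(active);
        active.clear();
    }

    /** Commit-Flush marker: the flushed file now holds the snapshot, so let it go. */
    void commitFlush(String flushedFile) {
        storeFiles.add(flushedFile);
        snapshot.clear();
    }

    int editsInMemory() {
        return active.size() + snapshot.size();
    }

    List<String> storeFiles() {
        return new ArrayList<>(storeFiles);
    }

    public static void main(String[] args) {
        SecondaryMemstoreSketch secondary = new SecondaryMemstoreSketch();
        secondary.replayEdit("put row1");
        secondary.replayEdit("put row2");
        secondary.startFlush();
        secondary.replayEdit("put row3");              // arrives while the flush is in progress
        System.out.println(secondary.editsInMemory()); // prints 3
        secondary.commitFlush("hfile-1");
        System.out.println(secondary.editsInMemory()); // prints 1
    }
}
```

The secondary frees memory only when the primary commits a flush, which is why the slide suggests making a primary flush when the server hosting a secondary is under memory pressure.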
  36. Summary
      • Pros
        - High availability for read-only tables
        - High availability for stale reads
        - Very low latency for the above
      • Cons
        - Increased memory from the memstores of the secondaries
        - Increased blockcache usage
        - Extra network traffic for the replica calls
        - Increased number of regions to manage in the cluster
  37. References
      • Apache branch hbase-10070 (https://github.com/apache/hbase/tree/hbase-10070)
      • HDP-2.1 comes with experimental support for Phase-1
      • More on the use cases for this work is in Sudarshan’s (Bloomberg) talk
        - “Case Studies” track, titled “HBase at Bloomberg: High Availability Needs for the Financial Industry”
  38. Thanks. Q & A
