HBASEATBLOOMBERG//
HBASE AT BLOOMBERG: HIGH AVAILABILITY NEEDS FOR THE FINANCIAL INDUSTRY
MAY // 05 // 2014
BLOOMBERG
LEADING DATA AND ANALYTICS PROVIDER TO THE FINANCIAL INDUSTRY
BLOOMBERG DATA – DIVERSITY
DATA MANAGEMENT AT BLOOMBERG
• Data is our business
• Bloomberg doesn’t have a “big data” problem. It has a “medium data” problem…
• Speed and availability are paramount
• Hundreds of thousands of users with expensive requests
Among the systems we’ve built (we had to!)
• A relational database based on Berkeley DB and SQLite
• A shared-memory-based key-value store
• In-memory data cubes for real-time security universe screening
We are consolidating many of our systems around open platforms.
TIME SERIES
• The Pricehistory service serves up all end-of-day time series data at Bloomberg
• Single-security requests drive most charting functionality
• Multi-security requests drive applications such as Portfolio Analytics
• > 5 billion requests a day serving terabytes of data
• 100K queries per second on average and 500K per second at peak
SECURITY   FIELD      DATE       VALUE
IBM        VOLUME     20140321   12,535,281
IBM        VOLUME     20140320   5,062,629
IBM        VOLUME     20140319   4,323,930
GOOG       CLOSE PX   20140321   1,183.04
GOOG       CLOSE PX   20140320   1,197.16
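A lookup keyed by [security, field, date] maps naturally onto an HBase row key. A minimal sketch of one plausible encoding (this is illustrative, not Bloomberg's actual schema): concatenate the three parts with a separator and reverse the date so a prefix scan returns the newest points first.

```python
# Illustrative row-key encoding for [security, field, date] lookups.
# NOT Bloomberg's actual schema; a sketch of the general technique.
MAX_DATE = 99999999  # sentinel for reverse-ordering YYYYMMDD dates

def row_key(security: str, field: str, yyyymmdd: int) -> bytes:
    # Reversing the date makes keys for the same security/field sort
    # newest-first, which suits "latest N points" charting queries.
    reversed_date = MAX_DATE - yyyymmdd
    return f"{security}\x00{field}\x00{reversed_date:08d}".encode()

# Keys for one security/field are contiguous and sort newest date first:
k1 = row_key("IBM", "VOLUME", 20140321)
k2 = row_key("IBM", "VOLUME", 20140320)
assert k1 < k2  # 20140321 sorts before 20140320 under the reversed date
```

A prefix scan on `IBM\x00VOLUME\x00` then serves the single-security charting case with one contiguous read.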
TIME SERIES AND HBASE
• Time series data fetches are embarrassingly parallel
• Simple data types and models mean we do not require rich type support or query capabilities
• No need for joins; lookups only by [security, field, date]
• Data sets are large enough to require manual sharding… administrative overhead
• Require a commodity framework to consolidate various disparate systems built over time and bring about simplicity
• Frameworks bring the benefit of additional analytical tools
HBase is an excellent fit for this problem domain
OUR REQUIREMENTS FOR HBASE
• Read performance – fast with low variance
• High availability
• Operational simplicity
• Efficient use of our hardware
[16 cores, 100+ GB RAM, SSD storage]
• Bloomberg has been investing in all these aspects of HBase
• In the rest of this talk, we’ll focus on High Availability
>>>>>>>>>>>>>>
HIGH AVAILABILITY
DISASTER RECOVERY – THE MODEL AT BLOOMBERG
Like any other good engineering organization, we take service uptime very seriously…
• Applications in multiple data centers share the workload
• Clusters must have excess capacity to absorb load from the loss of a peer data center...
• ...and still have excess capacity to account for failures & upgrades
• Latency penalty for failover to a different data center
• Read vs. Write Availability
MTTR IN HBASE – THE MANY STAGES TO RECOVERY
• Failure detection by ZooKeeper
• Region re-assignment by Master
• Log split and HFile creation
MTTR – A BRIEF HISTORY*
• Distributed log split [HBASE-1364]
• Routing around datanode failures via the HDFS stale state [HDFS-3912, HDFS-4350]
• Assignment manager enhancements [HBASE-7247]
• Multicast notifications to clients with the list of failed Region Servers
• Distributed log replay [HBASE-7006]
…
All this phenomenal work means HBase MTTR is now on the order of tens of seconds
MTTR IN HBASE – GAPS AND OPPORTUNITIES
• … but one minute of downtime is still high for certain classes of applications
• Even if recovery time is optimized down to zero, we still have to wait to detect the failure before we can do anything
• Lowering the ZK session timeout introduces false positives
• What if the threshold for read unavailability were 1 sec or lower?
• Reads must be serviceable while recovery is still in progress.
SOLUTION LANDSCAPE
• The requirement is to be able to read data from elsewhere after a pre-configured timeout
• Where could that be?
• Another cluster in another DC?
• Another cluster in the same DC?
• Two tables – primary and a shadow kept in the same HBase instance?
• HOYA? Multiple HBase instances on the same physical YARN cluster?
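Whatever the second copy is, the client-side pattern is the same: give the primary a short latency budget, then issue the read against the other copy. A toy sketch of that fallback (the fetch functions here are hypothetical stand-ins, not a real HBase client):

```python
# Timeout-based fallback read: try the primary within a budget, then
# race a secondary copy against it. Fetch functions are hypothetical.
import concurrent.futures
import time

PRIMARY_TIMEOUT = 0.05  # 50 ms budget before falling back

def read_with_fallback(fetch_primary, fetch_secondary, key):
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(fetch_primary, key)
        try:
            return primary.result(timeout=PRIMARY_TIMEOUT)
        except concurrent.futures.TimeoutError:
            # Primary is slow or down: issue the same read elsewhere
            # and take whichever answers first (possibly slightly stale).
            secondary = pool.submit(fetch_secondary, key)
            done, _ = concurrent.futures.wait(
                [primary, secondary],
                return_when=concurrent.futures.FIRST_COMPLETED)
            return done.pop().result()

slow = lambda k: (time.sleep(0.2), "primary:" + k)[1]  # hung primary
fast = lambda k: "secondary:" + k                      # healthy copy
result = read_with_fallback(slow, fast, "IBM/VOLUME")
assert result == "secondary:IBM/VOLUME"
```

The answer from the secondary may lag the primary by a few edits; that trade-off is exactly what the rest of this talk is about.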
SOLUTION LANDSCAPE
All these solutions work by having more than one copy of the data and being able to access it quickly.
…
But why keep more than one copy of the data at the HBase level when there is already more than one copy at the HDFS level?
WARM STANDBYS – ARCHITECTURE
[Architecture diagram]
WARM STANDBYS
• The idea is to have more than one Region Server be responsible for serving data for a region
• All Region Servers are primary for some regions and standby for others
• The standby is read-only and rejects any writes accidentally sent to it
• How do standbys serve up data? Remember, there are 3 copies of the HFiles in HDFS
• Even with 1 node down, the standbys should be able to serve up data from a different datanode
WARM STANDBYS – THE OPTIONS
• The standbys can fetch data from HFiles
• What about writes that are only in the memstore?
• Depending on the flush size/interval, the standbys could be quite behind
• Should we flush more often?*
• Are updates in the memstore also kept somewhere?
Yes, in the WAL (which is on HDFS)
WARM STANDBYS – THE WAL
• Reading the WAL will help standbys keep up with the primary
• Option #1: the standby can “tail” the WAL
• Option #2: the primary sends the WAL edits to the standby using mechanisms similar to what is done for async replication
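Either way, the standby's job is the same: apply WAL edits strictly in sequence order on top of the state it already has from flushed HFiles. A toy model of option #1 (these classes are illustrative, not HBase's actual implementation):

```python
# Toy model of a standby region "tailing" the WAL. Not HBase's actual
# classes; a sketch of applying edits in order on top of HFile state.
class StandbyRegion:
    def __init__(self, hfile_snapshot):
        self.state = dict(hfile_snapshot)  # data already flushed to HFiles
        self.applied_seq = 0               # last WAL sequence id applied

    def tail_wal(self, wal_entries):
        # Each entry is (sequence_id, row, value). Apply strictly in
        # sequence order; skip anything already seen, so replays are safe.
        for seq, row, value in sorted(wal_entries):
            if seq > self.applied_seq:
                self.state[row] = value
                self.applied_seq = seq

standby = StandbyRegion({"IBM/VOLUME/20140320": 5_062_629})
standby.tail_wal([(1, "IBM/VOLUME/20140321", 12_535_281)])
standby.tail_wal([(1, "IBM/VOLUME/20140321", 12_535_281)])  # replay: ignored
assert standby.state["IBM/VOLUME/20140321"] == 12_535_281
```

The sequence-id check is what keeps a standby consistent across WAL replays and reconnects: an edit is applied at most once, and never out of order.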
WHY NOT EVENTUAL CONSISTENCY?
• Standbys are behind the primary in updates
• If an application can tolerate this, why not use an eventually consistent store?
• In our design, each record is mastered at a single server that decides the order of updates
• All standbys process updates in the same order
• Reads at a given replica ALWAYS move forward in time
• Enter… Timeline Consistency
• Consider two operations (from the PNUTS paper):
• Remove mother from the access list of a shared picture album
• Upload spring break pictures
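The point of the PNUTS example: under timeline consistency every replica sees some prefix of the master's single update order, so no replica can ever show the pictures without the earlier access-list change. A small sketch of that invariant:

```python
# Timeline consistency, sketched with the PNUTS example. The master
# fixes ONE global order of updates; each replica has applied some
# prefix of it, so update 2 is never visible without update 1.
master_log = [
    ("acl", "mother removed"),       # update 1
    ("album", "spring break pics"),  # update 2
]

def replica_view(prefix_len):
    # A replica that has applied the first prefix_len updates.
    return dict(master_log[:prefix_len])

for n in range(len(master_log) + 1):
    view = replica_view(n)
    # Invariant: if the pictures are visible, so is the ACL change.
    if "album" in view:
        assert view["acl"] == "mother removed"
```

An eventually consistent store offers no such prefix guarantee: a replica could apply update 2 before update 1, exposing the pictures to exactly the wrong viewer.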
HBASE-10070
Targets applications that care about write ordering, but can tolerate brief periods of read inconsistency
We are confident this will not take HBase too far from its roots
All of this isn’t theoretical… work is actively underway in HBASE-10070
Shout out to Devaraj Das, Enis Soztutar, and the entire HBase team for being great partners in this effort
WARM STANDBYS – EXTENSIONS
• Combine warm standbys with favored nodes, allowing standbys to run on the secondary and tertiary datanodes
• On RS failure, the standby running on a favored node is promoted to primary, rather than being chosen randomly
• This benefits post-recovery read performance
• Could also be combined with a WAL per region, making it easier to do region re-assignments without co-location constraints
• A WAL per region makes MTTR faster - it obviates the need for a log split (or makes log replay faster)
>>>>>>>>>>>>>>
AND WITH AN EYE TO THE FUTURE…
HBASE – CHALLENGE THROW DOWN
• Performance
• Lowering average read latency AND latency variation is critical for HBase to be the leader in the low-latency NoSQL space.
• And GC appears to be the single largest blocker to that.
• Multi-tenancy
• HBase doesn’t have a good story for multi-tenancy yet.
• A single HBase instance ought to be able to support multiple workloads with proper resource isolation.
• Why? To consolidate disparate applications that work on the same datasets and still achieve some degree of QoS for each individual app.
QUESTIONS?
http://www.openbloomberg.com/
https://github.com/bloomberg
