High Availability and High Frequency Big Data Analytics
Esther Kundin
Bloomberg LP
10/15/2015
#GHC15
Outline
• The Problem Space
• High Availability
• High Frequency
• Takeaways
• Questions
The Problem Space
• The Problem Space
• High Availability
• High Frequency
• Takeaways
• Questions
The Problem Space
• Total data set: 2 TB – roughly 2×10¹³ data points
− “medium data”
• Average write: 4 billion data points a day
• Average read: 140 trillion data points a day
• Read/write latency: 50 ms
• Read throughput: 3 trillion points in the peak minute – 2,000 bulk requests (per-second rates below)
• Allowable downtime < read latency
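As a back-of-the-envelope check (my arithmetic, not a figure from the talk), the stated daily and peak volumes translate into roughly these per-second rates:

    % Approximate rates implied by the volumes above
    \frac{4 \times 10^{9}\ \text{writes/day}}{86{,}400\ \text{s/day}} \approx 4.6 \times 10^{4}\ \text{writes/s}
    \qquad
    \frac{1.4 \times 10^{14}\ \text{reads/day}}{86{,}400\ \text{s/day}} \approx 1.6 \times 10^{9}\ \text{reads/s}
    \qquad
    \frac{3 \times 10^{12}\ \text{reads}}{60\ \text{s}} = 5 \times 10^{10}\ \text{reads/s at peak}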
High Availability – Pain Points and Solutions
• The Problem Space
• High Availability
• High Frequency
• Takeaways
• Questions
High Availability – Major Points of Failure
[Architecture diagram: a Client talks to the Meta Region Server and three RegionServers, which sit on HDFS; each of these is a potential point of failure.]
High Availability – Solution: HBASE-10070
[Architecture diagram: the Client, Meta Region Server, and RegionServers 1–3 as before, now each backed by a Secondary RegionServer (1–3) and a Secondary Meta Region Server, all on HDFS.]
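HBASE-10070 (timeline-consistent region replicas) lets a client opt into reads from those secondary region servers per request. A minimal client-side sketch, assuming an HBase 1.x Java client and a hypothetical table named "prices" (per the editor's notes, the feature also has to be enabled in hbase-site.xml and at the table level):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Consistency;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimelineReadSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("prices"))) {  // hypothetical table
          Get get = new Get(Bytes.toBytes("some-row-key"));               // hypothetical row key
          // TIMELINE lets the read be served by a secondary region replica when the
          // primary is slow or unavailable (e.g. during a GC pause or region move).
          get.setConsistency(Consistency.TIMELINE);
          Result result = table.get(get);
          boolean stale = result.isStale();  // true if served by a secondary replica,
                                             // which may slightly lag the primary
        }
      }
    }

Server-side, the table is created (or altered) with region replication greater than 1 so that each region has the secondaries shown in the diagram.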
High Availability Across Data Centers
• 3 Options
− HBASE-12259 – HydraBase integration – HBase + Raft – in progress
− Cloudera BDR in Cloudera Enterprise 5 – not open source
− Roll Your Own!
Replication Across Data Centers
[Diagram: two clusters, HBase 1 and HBase 2, each with its own Writer and Reader; replication runs between the clusters, coordinated through a global ZooKeeper.]
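In the roll-your-own option this diagram boils down to standard HBase cluster-to-cluster replication with writers on both sides. Below is a minimal sketch of registering one peer and scoping a column family for replication, assuming an HBase 1.x client; the cluster key, peer id, table, and family names are hypothetical, and the talk does not describe the actual configuration used:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.replication.ReplicationAdmin;
    import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CrossDcReplicationSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // points at the local cluster (HBase 1)

        // Register the remote cluster (HBase 2) as a replication peer. The cluster key is
        // its ZooKeeper quorum, client port, and znode parent (hypothetical hosts).
        try (ReplicationAdmin replication = new ReplicationAdmin(conf)) {
          ReplicationPeerConfig peer = new ReplicationPeerConfig()
              .setClusterKey("zk1.dc2,zk2.dc2,zk3.dc2:2181:/hbase");
          replication.addPeer("dc2", peer, null);  // null = all replication-scoped families
        }

        // Mark the column family as globally scoped so its edits ship to the peer.
        TableName table = TableName.valueOf("prices");  // hypothetical table
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          HTableDescriptor desc = admin.getTableDescriptor(table);
          HColumnDescriptor family = desc.getFamily(Bytes.toBytes("d"));  // hypothetical family
          family.setScope(HConstants.REPLICATION_SCOPE_GLOBAL);
          admin.modifyTable(table, desc);  // depending on version, the table may need to be
                                           // disabled first
        }
      }
    }

Running the mirror-image setup on the second cluster gives the bidirectional, last-writer-wins behavior described in the editor's notes.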
High Frequency – Pain Points and Solutions
• The Problem Space
• High Availability
• High Frequency
• Takeaways
• Questions
HA to remove fat tails
[Chart: "Avg Latency per-Get Distribution" – average per-Get latency in ms (y-axis, 0–12) against percentile (x-axis, 50th to 99th).]
High Frequency – Pain Points
• Speed bounded by the slowest-responding region server
• Garbage Collection causes spikes in latency
The Art of Fine Tuning
• Use data to set your heuristics
− Identify repeatable baseline tests (see the timing-harness sketch below)
− Identify performance parameters
− Tweak one setting at a time
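A repeatable baseline can be as small as replaying a fixed key set and reporting the same latency percentiles before and after each single-setting change. A hypothetical harness sketch (not the speaker's actual test code; the table and key names are made up):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetLatencyBaseline {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        List<Long> latenciesMicros = new ArrayList<>();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("prices"))) {  // hypothetical table
          for (int i = 0; i < 10_000; i++) {                              // fixed, repeatable key set
            Get get = new Get(Bytes.toBytes("row-" + (i % 1_000)));
            long start = System.nanoTime();
            table.get(get);
            latenciesMicros.add((System.nanoTime() - start) / 1_000);
          }
        }
        Collections.sort(latenciesMicros);
        for (int p : new int[] {50, 90, 95, 99}) {                        // same percentiles every run
          long v = latenciesMicros.get(latenciesMicros.size() * p / 100);
          System.out.printf("p%d = %d us%n", p, v);
        }
      }
    }

Holding the key set, request count, and reported percentiles fixed is what makes runs comparable when only one setting changes between them.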
Tuning Your DB – Garbage Collection
• What did not work:
− Stop The World
− Small memory footprint – 4 GB
− Synchronized GC via coprocessors
• What worked for us:
− CMS – shorter pauses
− Very large memory footprint – 28 GB
− Read from backup RS when GC in progress
Takeaways
• The Problem Space
• High Availability
• High Frequency
• Takeaways
• Questions
Takeaways
• High Availability can solve most availability and latency concerns
• Multiple Data Center Support Needed
• Tune those settings!
Questions?
• The Problem Space
• High Availability
• High Frequency
• Takeaways
• Questions
Resources:
Tuning Your DB – What to Tweak
• Key Design
• Column Family Design
• hbase-site.xml – lots of configuration to try! (a few knobs sketched below)
• Bloom Filters
• Short-Circuit Reads
• Block Cache
• Scheduling Major Compactions Judiciously
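Several of these knobs can be expressed through the admin/client API as well as in hbase-site.xml. The sketch below uses illustrative values only – not recommendations from the talk – with a hypothetical table and column family, assuming an HBase 1.x client:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.regionserver.BloomType;

    public class TuningKnobsSketch {
      public static void main(String[] args) {
        // Site-level knobs (normally set in hbase-site.xml / hdfs-site.xml).
        Configuration conf = HBaseConfiguration.create();
        conf.setBoolean("dfs.client.read.shortcircuit", true);  // short-circuit reads (needs HDFS-side setup too)
        conf.setFloat("hfile.block.cache.size", 0.4f);          // fraction of RegionServer heap for block cache
        conf.setLong("hbase.hregion.majorcompaction", 0L);      // 0 disables automatic major compactions,
                                                                // so they can be scheduled off-peak instead

        // Per-table / per-column-family knobs (key and column family design live here too).
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("prices"));  // hypothetical table
        HColumnDescriptor family = new HColumnDescriptor("d");                       // hypothetical family
        family.setBloomFilterType(BloomType.ROW);  // row-level bloom filter to skip irrelevant HFiles
        family.setBlockCacheEnabled(true);         // keep hot blocks in the block cache
        table.addFamily(family);
      }
    }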
Got Feedback?
Rate and review the session on our mobile app
Download at http://ddut.ch/ghc15
or search GHC 2015 in the app store

Editor's Notes

  • #9 Added back into major release 2.2 based on feedback from Bloomberg. HBASE-10070: MTTR, GC, throughput – all fixed! Soon: rack-aware placement; HBASE-7509 is the same thing but at the HDFS level. Need to enable it in hbase-site.xml and at the table level, and update your Get requests with get.setConsistency(Consistency.TIMELINE).
  • #11 Consistency will be the same as with one data center – last writer wins, just like in one-cloud HBase. Latency would be the same as with any multi-writer, multi-datacenter setup.
  • #15 This is where most of the grunt work is. Exhaustive testing that tweaks one parameter at a time was needed to figure out the best settings to use. Very data-driven process. Still a work in progress.
  • #21 This is the last slide and must be included in the slide deck