
Time-Series Apache HBase

Vladimir Rodionov (Hortonworks)

Time-series applications (sensor data, application/system logging events, user interactions, etc.) present a new set of data storage challenges: very high velocity and very high volume of data. This talk presents recent developments in Apache HBase that make it a good fit for time-series applications.

  1. TIME SERIES IN HBASE
     Vladimir Rodionov, Staff Software Engineer
  2. TIME SERIES
     • A sequence of data points
     • Triplet: [ID][TIME][VALUE] – the basic form
     • Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE]
     • Examples: stock closing values (DJIA), user behavior (web clicks), credit card transactions, health data, fitness indicators, sensor data (IoT), application and system metrics (e.g. Facebook ODS)
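The triplet model maps directly onto an HBase row key. Below is a minimal Java sketch (hypothetical, not from the talk) of one common encoding: a fixed-width series id followed by a big-endian timestamp, so that all points of one series sort contiguously in time order, which is what temporal locality relies on.

    import java.nio.ByteBuffer;

    public class TsRowKey {
        // Hypothetical row-key layout for the [ID][TIME][VALUE] model:
        // 8-byte series id, then 8-byte event time. ByteBuffer writes
        // big-endian by default, so rows sort by (id, time).
        public static byte[] encode(long seriesId, long timestampMs) {
            return ByteBuffer.allocate(16)
                    .putLong(seriesId)     // series identifier
                    .putLong(timestampMs)  // event time in millis
                    .array();
        }
    }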
  3. TIME SERIES DB REQUIREMENTS
     • The data store MUST preserve temporal locality of data for better in-memory caching
       – Facebook ODS: 85% of queries are for the last 26 hours
     • The data store MUST provide efficient compression
       – Time series are highly compressible (less than 2 bytes per data point in some cases)
       – Facebook's custom compression codec produces less than 1.4 bytes per data point
     • The data store MUST provide automatic time-based rollup aggregations (sum, count, avg, min, max, etc.) by minute, hour, day and so on, all configurable – most of the time it is the aggregated data we are interested in
     • Efficient caching policy (RAM/SSD), presumably FIFO
     • SQL API (nice to have, but optional)
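Rollup aggregation depends on bucketing timestamps by resolution. A minimal Java sketch of the bucketing step, assuming a one-minute resolution (illustrative, not from the talk):

    public class Rollup {
        static final long MINUTE_MS = 60_000L;

        // Truncate an event timestamp to its minute boundary; all points
        // sharing the same bucket are aggregated together.
        // e.g. 12:34:56.789 -> 12:34:00.000
        public static long minuteBucket(long timestampMs) {
            return timestampMs - (timestampMs % MINUTE_MS);
        }
    }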
  4. OPENTSDB 2.x
     • Data store MUST preserve temporal locality of data for better in-memory caching – NO
       – Size-tiered HBase compaction does not preserve temporal locality of data: major compaction creates a single file where recent data is stored next to data that is months or years old
       – Compaction also trashes the block cache, which decreases read performance and increases latencies
     • Data store MUST provide efficient compression – NO
       – OpenTSDB supports compression, but it is very heavy (runs externally) and users usually disable it in production
     • Data store MUST provide automatic time-based rollup aggregations – NOT IMPLEMENTED
     • SQL – not supported
  5. IDEAL HBASE TIME SERIES DB
     • Keeps raw data for hours
     • Does not compact raw data at all
     • Preserves raw data in the memory cache for periodic compaction and time-based rollup aggregation
     • Stores full-resolution data only in compressed form
     • Has a different TTL for each aggregation resolution:
       – Days for by_min, by_10min, etc.
       – Months or years for by_hour
     • Compaction should preserve temporal locality of both full-resolution and aggregated data
     • FIFO block cache
     • Integration with Phoenix (SQL)
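A sketch of what such a schema could look like with the HBase 1.x client API. The family names mirror the architecture slide that follows; the TTL values are illustrative, not prescribed by the talk.

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.io.compress.Compression;

    public class TsdbSchema {
        // One column family per tier, each with its own TTL, mirroring
        // the deck's CF:Raw / CF:Compressed / CF:Aggregates layout.
        public static HTableDescriptor build() {
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("tsdb"));

            HColumnDescriptor raw = new HColumnDescriptor("raw");
            raw.setTimeToLive(24 * 3600);              // raw data lives hours

            HColumnDescriptor compressed = new HColumnDescriptor("compressed");
            compressed.setCompressionType(Compression.Algorithm.GZ);
            compressed.setTimeToLive(90 * 24 * 3600);  // days to months

            HColumnDescriptor byHour = new HColumnDescriptor("agg_by_hour");
            byHour.setTimeToLive(2 * 365 * 24 * 3600); // months to years

            table.addFamily(raw);
            table.addFamily(compressed);
            table.addFamily(byHour);
            return table;
        }
    }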
  6. TIME SERIES DB ARCHITECTURE
     [Diagram: raw events flow into a region server holding three column families, with Compressor and Aggregator coprocessors; store files live on HDFS]
     • CF:Raw – TTL hours
     • CF:Compressed – TTL days/months
     • CF:Aggregates – TTL months/years (one CF per resolution)
  7. HBASE-14468: FIFO COMPACTION
     • First-in-first-out: no compaction at all
     • TTL-expired data just gets archived
     • Ideal for raw data storage
     • No compaction, so no block cache trashing
     • Raw data can be cached on write or on read
     • Sustains 100s of MB/s of write throughput per region server
     • Available in 0.98.17, 1.2+ and HDP 2.4+; can easily be back-ported to 1.0/1.1
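FIFO compaction is a per-family setting. The policy class and configuration key below are the ones introduced by HBASE-14468; the family name and TTL are illustrative.

    import org.apache.hadoop.hbase.HColumnDescriptor;

    public class FifoRawFamily {
        public static HColumnDescriptor build() {
            HColumnDescriptor raw = new HColumnDescriptor("raw");
            // Replace the default exploring policy with FIFO for this family.
            raw.setConfiguration(
                "hbase.hstore.defaultengine.compactionpolicy.class",
                "org.apache.hadoop.hbase.regionserver.compactions.FIFOCompactionPolicy");
            // FIFO requires a finite TTL: expired store files are simply archived.
            raw.setTimeToLive(24 * 3600);
            return raw;
        }
    }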
  8. EXPLORING (SIZE-TIERED) COMPACTION
     • Does not preserve temporal locality of data
     • Compaction trashes the block cache, so no efficient caching of data is possible
     • Hurts the most-recent-is-most-valuable data access pattern
     • Makes compression/aggregation very heavy: reading recent raw data back to run it through the compressor requires many IO operations, because we cannot guarantee that recent data is still in the block cache
  9. HBASE-15181: DATE TIERED COMPACTION
     • DateTieredCompactionPolicy, based on Cassandra's design (CASSANDRA-6602)
     • Works better for time series than ExploringCompactionPolicy
     • Better temporal locality helps with reads
     • A good choice for compressed full-resolution data and aggregated data
     • Available in 0.98.17 and 1.2+; HDP 2.4 will have it as well
     • But: too many knobs to control
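A sketch of switching a family to the date-tiered store engine. The engine class and keys are the ones documented in the HBase reference guide; the window values are illustrative (these are the "knobs" the slide warns about).

    import org.apache.hadoop.hbase.HColumnDescriptor;

    public class DateTieredFamily {
        public static HColumnDescriptor build() {
            HColumnDescriptor cf = new HColumnDescriptor("compressed");
            cf.setConfiguration("hbase.hstore.engine.class",
                "org.apache.hadoop.hbase.regionserver.DateTieredStoreEngine");
            // Base window of 6 hours, 4 windows per tier, and stop
            // compacting files older than roughly one year.
            cf.setConfiguration("hbase.hstore.compaction.date.tiered.base.window.millis",
                String.valueOf(6L * 3600 * 1000));
            cf.setConfiguration("hbase.hstore.compaction.date.tiered.windows.per.tier", "4");
            cf.setConfiguration("hbase.hstore.compaction.date.tiered.max.storefile.age.millis",
                String.valueOf(365L * 24 * 3600 * 1000));
            return cf;
        }
    }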
  10. DATE TIERED COMPACTION POLICY
      [Diagram illustrating the date-tiered compaction policy]
  11. EXPLORING COMPACTION + MAX SIZE (ECPM)
      • Set hbase.hstore.compaction.max.size
      • This effectively emulates date-tiered compaction
      • Preserves temporal locality of data; compaction does not trash the block cache
      • Efficient caching of recent data becomes possible: good for the most-recent-is-most-valuable access pattern
      • Use it for compressed and aggregated data; it helps keep recent data in the block cache
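A sketch of the ECPM setup: capping the size of files eligible for compaction, so that large, old files are never rewritten. The 512 MB cap is an illustrative value, not a recommendation from the talk.

    import org.apache.hadoop.hbase.HColumnDescriptor;

    public class EcpmFamily {
        public static HColumnDescriptor build() {
            HColumnDescriptor cf = new HColumnDescriptor("agg_by_hour");
            // Files larger than this are excluded from minor compaction,
            // which keeps old data in place and preserves temporal locality.
            cf.setConfiguration("hbase.hstore.compaction.max.size",
                String.valueOf(512L * 1024 * 1024));
            return cf;
        }
    }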
  12. HBASE-14496: DELAYED COMPACTION
      • Files become eligible for minor compaction only once their age exceeds a configured delay
      • Good for applications where the most recent data is the most valuable
      • Prevents the block cache from being trashed by frequent minor compactions of fresh store files
      • Will enable this feature for ExploringCompactionPolicy; improves read latency for the most recent data
      • ECPM + delay of 1–2 days (ECPMD) is a good option for compressed full-resolution and aggregated data
      • Patch available; targets HBase 1.0+ (can be back-ported to 0.98)
  13. TIME SERIES DB ARCHITECTURE, WITH COMPACTION POLICIES
      [Same diagram as slide 6, now annotated with a compaction policy per column family]
      • CF:Raw – TTL hours, FIFO
      • CF:Compressed – TTL days/months, ECPM(D)
      • CF:Aggregates – TTL months/years (one CF per resolution), ECPM(D)
  14. HBASE BLOCK CACHE
      • The current policy (LRU) is not optimal for time-series applications; we need something similar to FIFO, both in RAM and on SSD
      • We need support for TB-sized RAM/SSD-based caches; the current off-heap bucket cache does not scale well (it keeps keys in the Java heap)
      • For an SSD cache we could simply mirror the most recent store files, providing FIFO semantics without the complexity of disk-based cache management
      • Today:
        – Disable caching for raw data
        – Enable cache-on-write/cache-on-read for compressed data and aggregations
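A sketch of the "today" advice expressed as per-family cache settings; the family roles are from the deck, the names illustrative.

    import org.apache.hadoop.hbase.HColumnDescriptor;

    public class CachePolicy {
        public static void apply(HColumnDescriptor raw, HColumnDescriptor compressed) {
            // Raw data is write-mostly and short-lived: keep it out of the cache.
            raw.setBlockCacheEnabled(false);
            // Compressed/aggregated data: cache on write so fresh blocks are
            // hot without requiring a read miss first.
            compressed.setBlockCacheEnabled(true);
            compressed.setCacheDataOnWrite(true);
        }
    }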
  15. SUMMARY
      • Disable major compaction
      • Do not run the HDFS balancer
      • Disable HBase automatic region balancing: balance_switch false
      • Disable region splits (DisabledRegionSplitPolicy) and pre-split the table in advance
      • Keep separate column families for raw, compressed and aggregated data (each aggregation resolution gets its own family)
      • Increase hbase.hstore.blockingStoreFiles for all column families
      • FIFO for raw data; ECPM(D) or DTCP for compressed and aggregated data
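A sketch of the table-level items from this checklist; the split keys and the blocking-store-files value are illustrative.

    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreateTsdbTable {
        public static void create(Admin admin, HTableDescriptor table) throws Exception {
            // Disable region splits entirely; rely on pre-splitting instead.
            table.setValue("SPLIT_POLICY",
                "org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy");
            // Raise the blocking-store-files limit so uncompacted raw files
            // do not stall writes.
            table.setConfiguration("hbase.hstore.blockingStoreFiles", "200");
            // Pre-split on series-id boundaries (illustrative keys).
            byte[][] splits = {
                Bytes.toBytes(1000L), Bytes.toBytes(2000L), Bytes.toBytes(3000L)
            };
            admin.createTable(table, splits);
        }
    }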
  16. SUMMARY (CONTINUED)
      • Periodically run an internal job (coprocessor) to compress raw data and produce the time-based rollup aggregations
      • Do not cache raw data; enable write/read caching for the other families (if using ECPM(D))
      • Enable WAL compression; use maximum compression for raw data
  17. THANK YOU
      Q&A
