Time-Series Apache HBase
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
TIME SERIES
Sequence of data points
Triplet: [ID][TIME][VALUE] – basic
Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE]
Stock Closing Value DJIA
User behavior (web clicks)
Credit card transactions
Health data
Fitness indicators
Sensor data (IoT)
Application and system metrics - ODS
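The triplet and multiplet layouts above can be sketched as a small record type. This is only an illustration of the data model: the class name `DataPoint`, the helper `make_point`, the choice of epoch seconds, and the sample values are all assumptions made for the example, not something the slides define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    """One sample laid out as [ID][TIME][TAG1]...[TAGN][VALUE]."""
    series_id: str
    timestamp: int      # epoch seconds (assumed unit)
    value: float
    tags: tuple = ()    # sorted (key, value) pairs; empty for a plain triplet


def make_point(series_id, timestamp, value, **tags):
    # Sorting the tags gives every point of a series the same tag order.
    return DataPoint(series_id, timestamp, value, tuple(sorted(tags.items())))


# Made-up sample values, for illustration only.
triplet = make_point("sensor-17", 1456790400, 21.5)
multiplet = make_point("cpu.user", 1456790400, 42.5, host="web01", dc="dc1")
```

A triplet is simply a multiplet with no tags, so one type covers both shapes.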
Time Series DB requirements
Data Store MUST preserve temporal locality of data for better in-memory caching
– Facebook ODS: 85% of queries are for the last 26 hours
Data Store MUST provide efficient compression
– Time series are highly compressible (less than 2 bytes per data point in some cases)
– Facebook's custom compression codec produces less than 1.4 bytes per data point
Data Store MUST provide automatic time-based rollup aggregations: sum, count, avg,
min, max, etc., by minute, hour, day and so on (configurable). Most of the time it is
the aggregated data we are interested in.
Efficient caching policy (RAM/SSD), presumably FIFO
SQL API (nice to have, but optional)
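The rollup requirement above (sum, count, avg, min, max per time bucket) can be sketched in a few lines. This is a minimal in-memory illustration, assuming points arrive as `(timestamp_seconds, value)` pairs; the function name `rollup` and the sample points are made up for the example.

```python
from collections import defaultdict


def rollup(points, bucket_seconds):
    """Aggregate (timestamp, value) pairs into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    result = {}
    for start, values in sorted(buckets.items()):
        result[start] = {
            "sum": sum(values),
            "count": len(values),
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
    return result


# Made-up sample points: (epoch seconds, value).
points = [(0, 1.0), (30, 3.0), (60, 5.0), (3599, 7.0), (3600, 9.0)]
by_min = rollup(points, 60)
by_hour = rollup(points, 3600)
```

The same function serves every resolution; only `bucket_seconds` changes, which is what makes the resolutions configurable.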
OpenTSDB 2.x
Data Store MUST preserve temporal locality of data for better in-memory caching – NO
– Size-Tiered HBase compaction does not preserve temporal locality of data. For example, major
compaction creates a single file in which recent data is stored together with data that is months or years old.
– Compaction also trashes the block cache, which decreases read performance and increases latency.
Data Store MUST provide efficient compression – NO
– OpenTSDB supports compression, but it is very heavy (it runs externally), and users usually disable it in
production.
Data Store MUST provide automatic time-based rollup aggregations – NOT IMPLEMENTED
SQL – Not supported
Ideal HBase Time Series DB
Keeps raw data for hours
Does not compact raw data at all
Preserves raw data in memory cache for periodic compactions and time-based rollup
aggregations
Stores full resolution data only in compressed form
Has different TTL for different aggregation resolutions:
– Days for by_min, by_10min etc.
– Months, years for by_hour
Compaction should preserve temporal locality of both full-resolution data and
aggregated data.
FIFO block cache
Integration with Phoenix (SQL)
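To illustrate why storing full-resolution data in compressed form is attractive, here is a sketch of delta-of-delta encoding for timestamps. The slides do not say which codec the compressor would use, so this is an assumed, generic technique chosen for illustration; the function names and sample timestamps are hypothetical.

```python
def delta_of_delta_encode(timestamps):
    """Return [first, dod_1, dod_2, ...] for a list of integer timestamps."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], 0
    for ts in timestamps[1:]:
        delta = ts - prev
        out.append(delta - prev_delta)  # change in the sampling interval
        prev, prev_delta = ts, delta
    return out


def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    out = [encoded[0]]
    prev, prev_delta = encoded[0], 0
    for dod in encoded[1:]:
        prev_delta += dod
        prev += prev_delta
        out.append(prev)
    return out


# Made-up timestamps sampled roughly every 60 seconds.
timestamps = [1000, 1060, 1120, 1180, 1241, 1300]
encoded = delta_of_delta_encode(timestamps)
decoded = delta_of_delta_decode(encoded)
```

For regularly sampled series most encoded entries are zero or close to it, so a bit-packing step or a general-purpose compressor can store them in very little space.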
Time Series DB HBase
[Diagram: Raw Events; Region Server hosting CF:Raw, CF:Compressed and CF:Aggregates, with a Compressor Coprocessor (C) and an Aggregator Coprocessor (A); HDFS]
CF:Raw – TTL hours
CF:Compressed – TTL days/months
CF:Aggregates – TTL months/years (CF per resolution)
HBASE-14468 FIFO compaction
First-In-First-Out
No compaction at all
TTL-expired data just gets archived
Ideal for raw data storage
No compaction – no block cache trashing
Raw data can be cached on write or on read
Sustains 100s of MB/s write throughput per RegionServer (RS)
Available 0.98.17, 1.2+, HDP-2.4+
Can be easily backported to 1.0-1.1
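The FIFO behaviour described above can be modelled as a simple filter: store files are never merged or rewritten, and a file is dropped only once everything in it has passed its TTL. This is a toy model, not HBase code; `StoreFile`, `fifo_compact` and the sample numbers are hypothetical names and values for illustration.

```python
from dataclasses import dataclass


@dataclass
class StoreFile:
    name: str
    max_timestamp: int  # newest cell timestamp in the file (epoch seconds)


def fifo_compact(store_files, now, ttl_seconds):
    """Split files into (kept, expired); no file is ever merged or rewritten."""
    kept, expired = [], []
    for f in store_files:
        # A file can go only when even its newest cell is past the TTL.
        if f.max_timestamp + ttl_seconds < now:
            expired.append(f)
        else:
            kept.append(f)
    return kept, expired


# Made-up example: a 6-hour TTL for the raw column family.
ttl = 6 * 3600
files = [StoreFile("a", 70_000), StoreFile("b", 80_000), StoreFile("c", 99_000)]
kept, expired = fifo_compact(files, now=100_000, ttl_seconds=ttl)
```

Because surviving files are left untouched, blocks already cached from them stay valid, which is the "no block cache trashing" property the slide highlights.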
Exploring (Size-Tiered) Compaction
Does not preserve temporal locality of data.
Compaction trashes block cache
No efficient caching of data is possible
It hurts the most-recent-most-valuable data access pattern.
Compression/Aggregation is very heavy.
Reading back recent raw data and running it through the compressor requires many I/O
operations, because we cannot guarantee that recent data is in the block cache.
HBASE-15181 Date Tiered Compaction
DateTieredCompactionPolicy
CASSANDRA-6602
Works better for time series than ExploringCompactionPolicy
Better temporal locality helps with reads
Good choice for compressed full resolution and aggregated data.
Available in 0.98.17 and 1.2+; HDP-2.4 will have it as well
But it has too many knobs to control
Date Tiered Compaction Policy
Exploring Compaction + Max Size
Set hbase.hstore.compaction.max.size
This efficiently emulates Date-Tiered Compaction
Preserves temporal locality of data.
Compaction does not trash block cache
Efficient caching of recent data is possible
Good for most-recent-most-valuable data access pattern.
Use it for compressed and aggregated data
Helps to keep recent data in a block cache.
Abbreviation: ECPM (Exploring Compaction Policy + Max size)
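The effect of a maximum compaction size can be seen in a toy simulation. The model below is a deliberate simplification and an assumption of this sketch: it merges all eligible files as soon as there are `min_files` of them, whereas the real Exploring policy also applies ratio checks. The one rule taken from HBase is that files larger than `hbase.hstore.compaction.max.size` are excluded from compaction. The function name `simulate` and the sizes are hypothetical.

```python
def simulate(flush_sizes, max_size, min_files=3):
    """Return store files as (oldest_flush, newest_flush, size) tuples."""
    files = []
    for i, size in enumerate(flush_sizes):
        files.append((i, i, size))  # a new flush creates a new small file
        eligible = [f for f in files if f[2] <= max_size]
        if len(eligible) >= min_files:
            merged = (
                min(f[0] for f in eligible),
                max(f[1] for f in eligible),
                sum(f[2] for f in eligible),
            )
            # Files above max_size are frozen: never compacted again.
            files = [f for f in files if f[2] > max_size] + [merged]
    return files


flushes = [100] * 9  # nine equal flushes, made-up sizes
with_max = simulate(flushes, max_size=250)
without_max = simulate(flushes, max_size=float("inf"))
```

With the cap, each resulting file covers a narrow, contiguous range of flushes; without it, everything ends up in a single file that mixes the oldest and newest data.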
HBASE-14496 Delayed compaction
Files are eligible for minor compaction if their age > delay
Good for applications where the most recent data is the most valuable.
Prevents block cache trashing for recent data caused by frequent minor compactions
of fresh store files
Will enable this feature for Exploring Compaction Policy
Improves read latency for most recent data.
ECP + Max + Delay (1-2 days), abbreviated ECPMD, is a good option for compressed
full-resolution and aggregated data.
Patch available.
HBase 1.0+ (can be back-ported to 0.98)
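The eligibility rule above (age > delay), combined with the maximum-size rule, can be written as a small filter. This is a sketch only: the actual configuration key introduced by the HBASE-14496 patch is not given in the slides, so `delay_seconds`, `max_size_bytes`, the function name and the tuple layout are all hypothetical.

```python
DAY = 24 * 3600


def eligible_for_minor_compaction(files, now, delay_seconds, max_size_bytes):
    """files: list of (name, created_at, size_bytes); returns eligible names."""
    eligible = []
    for name, created_at, size_bytes in files:
        old_enough = now - created_at > delay_seconds  # the "delay" rule
        small_enough = size_bytes <= max_size_bytes    # the "max size" rule
        if old_enough and small_enough:
            eligible.append(name)
    return eligible


# Made-up store files: (name, creation time, size).
now = 10 * DAY
files = [
    ("big-old", 1 * DAY, 900),
    ("small-old", 7 * DAY, 100),
    ("small-fresh", 9 * DAY + 12 * 3600, 100),
]
selected = eligible_for_minor_compaction(
    files, now, delay_seconds=1 * DAY, max_size_bytes=500
)
```

Fresh files are left alone until they age past the delay, so blocks cached from them are not invalidated by early minor compactions.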
Time Series DB HBase
[Diagram: Raw Events; Region Server hosting CF:Raw, CF:Compressed and CF:Aggregates, with a Compressor Coprocessor (C) and an Aggregator Coprocessor (A); HDFS]
CF:Raw – TTL hours – FIFO
CF:Compressed – TTL days/months – ECPM(D)
CF:Aggregates – TTL months/years (CF per resolution) – ECPM(D)
HBase Block Cache
Current policy (LRU) is not optimal for time-series applications
We need something similar to FIFO (both in RAM and on SSD)
We need support for TB size RAM/SSD-based caches
Current off-heap bucket cache does not scale well (it keeps keys in Java heap)
For SSD cache we could mirror the most recent store files, thus providing FIFO semantics
without any of the complexity of disk-based cache management.
Today …
– Disable cache for raw data
– Enable cache on write/read for compressed data and aggregations
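The difference between LRU and FIFO eviction for time-series blocks can be shown with two tiny caches. These classes are illustrative toys built on `OrderedDict`, not HBase's block cache; the names `FifoCache` and `LruCache` and the block keys are made up.

```python
from collections import OrderedDict


class FifoCache:
    """Evicts the block that was inserted first; reads do not reorder."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._blocks = OrderedDict()

    def get(self, key):
        return self._blocks.get(key)

    def put(self, key, block):
        if key not in self._blocks and len(self._blocks) >= self.capacity:
            self._blocks.popitem(last=False)  # drop the oldest insertion
        self._blocks[key] = block

    def keys(self):
        return list(self._blocks)


class LruCache(FifoCache):
    """Evicts the least recently used block; reads and writes refresh a block."""

    def get(self, key):
        if key in self._blocks:
            self._blocks.move_to_end(key)
        return self._blocks.get(key)

    def put(self, key, block):
        super().put(key, block)
        self._blocks.move_to_end(key)


fifo, lru = FifoCache(3), LruCache(3)
for cache in (fifo, lru):
    for key in ("t1", "t2", "t3"):   # blocks in time order, t1 is oldest
        cache.put(key, b"...")
    cache.get("t1")                  # one read of the oldest block
    cache.put("t4", b"...")          # a newer block arrives
```

After this sequence the FIFO cache holds the three newest blocks, while the LRU cache has kept the old block `t1` and evicted the newer `t2`, which is the wrong trade-off when the most recent data is the most valuable.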
Summary
Disable major compaction
Do not run HDFS balancer
Disable HBase auto region balancing: balance_switch false
Disable region splits (DisabledRegionSplitPolicy)
Presplit table in advance.
Have separate column families for raw, compressed and aggregated data (each
aggregate resolution – its own family)
Increase hbase.hstore.blockingStoreFiles for all column families
FIFO for Raw, ECPM(D) or DTCP for compressed and aggregations
Summary (continued)
Periodically run an internal job (coprocessor) to compress data and produce time-based
rollup aggregations.
Do not cache raw data; enable cache on write/read for the others (if using ECPM(D))
Enable WAL Compression, use maximum compression for Raw data.
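Some of the recommendations in the summary map onto cluster-wide `hbase-site.xml` properties. The sketch below renders a few of them as XML. The property names are standard HBase settings, but the values (in particular `1000` for blocking store files) are illustrative assumptions, and per-table or per-column-family overrides may be more appropriate in practice. Shell-level steps such as `balance_switch false` and pre-splitting are not covered here.

```python
import xml.etree.ElementTree as ET

# Illustrative values only; tune for the actual cluster.
SETTINGS = {
    # 0 disables periodic (time-based) major compactions.
    "hbase.hregion.majorcompaction": "0",
    # Never split regions automatically; the table is pre-split instead.
    "hbase.regionserver.region.split.policy":
        "org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy",
    # Allow many store files before writes are blocked (assumed value).
    "hbase.hstore.blockingStoreFiles": "1000",
    # Enable WAL compression.
    "hbase.regionserver.wal.enablecompression": "true",
}


def to_hbase_site_xml(settings):
    """Render a dict of properties in the hbase-site.xml layout."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_hbase_site_xml(SETTINGS)
```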