Column Qualifier Encoding in Apache Phoenix
Samarth Jain
About Me
• Software Engineer @ Salesforce
• Previously, Software Engineer @ eBay
• Apache Phoenix PMC
• PHOENIX-1819 – Metrics collection framework
• PHOENIX-1504 – Altering views
• PHOENIX-1779 – 8x faster non-aggregate, non-ordered queries
• PHOENIX-914 – Row timestamp feature
• PHOENIX-1598 – Column qualifier encoding
Overview
• Data model
• Drawbacks
• Column qualifier encoding
• Benefits
• ORDER BY performance
• GROUP BY performance
Drawbacks
• Column names are used as HBase column qualifiers
• Size bloat: dense tables with long column names
• Inefficient column renaming
• Increased GC pressure, network I/O, and block cache usage
• Lack of control over column qualifiers prevents possible optimizations
Column Qualifier Encoding
• Simple
• Don’t use column names as HBase column qualifiers
• Use numbers (short/integer) as column qualifiers instead (sketch below)
• Controlled assignment of column qualifiers
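A minimal sketch of the idea in plain Java: a numeric qualifier is a fixed 4 bytes (or 2 for a short-based scheme) no matter how long the column name is. The class and method names below are illustrative, not Phoenix’s actual API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class QualifierEncodingSketch {
    // Name-based qualifier: as many bytes as the column name has characters.
    static byte[] nameQualifier(String columnName) {
        return columnName.getBytes(StandardCharsets.UTF_8);
    }

    // Number-based qualifier: always 4 bytes, independent of the name length.
    static byte[] encodedQualifier(int columnNumber) {
        return ByteBuffer.allocate(4).putInt(columnNumber).array();
    }

    public static void main(String[] args) {
        System.out.println(nameQualifier("CUSTOMER_SHIPPING_ADDRESS").length); // 25
        System.out.println(encodedQualifier(11).length);                       // 4
    }
}
```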
Column Renaming
• Currently, renaming a column isn’t possible without copying data using the new column qualifier (PHOENIX-2341)
• Phoenix stores table-related metadata in its SYSTEM.CATALOG table
• Store the mapping of column name to column qualifier there
• Renaming a column would then just involve updating a few metadata rows in SYSTEM.CATALOG (sketch below)
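A hedged sketch of why this makes renames cheap: if SYSTEM.CATALOG holds the name-to-qualifier mapping, a rename only rewrites that mapping and never touches the data cells. The in-memory map below just stands in for the catalog rows; it is not Phoenix code.

```java
import java.util.HashMap;
import java.util.Map;

public class RenameSketch {
    // Stand-in for the name -> encoded qualifier mapping kept in SYSTEM.CATALOG.
    final Map<String, Integer> nameToQualifier = new HashMap<>();

    void addColumn(String name, int qualifier) {
        nameToQualifier.put(name, qualifier);
    }

    // Renaming touches only the mapping; the encoded qualifier stored in the
    // HBase cells stays the same, so no data copy is needed.
    void renameColumn(String oldName, String newName) {
        Integer qualifier = nameToQualifier.remove(oldName);
        if (qualifier == null) {
            throw new IllegalArgumentException("Unknown column: " + oldName);
        }
        nameToQualifier.put(newName, qualifier);
    }

    public static void main(String[] args) {
        RenameSketch catalog = new RenameSketch();
        catalog.addColumn("OLD_NAME", 12);
        catalog.renameColumn("OLD_NAME", "NEW_NAME");
        System.out.println(catalog.nameToQualifier); // {NEW_NAME=12}
    }
}
```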
Packing Key Values in Immutable Tables
• Store all column values in a single column qualifier per column family (PHOENIX-2565)
• Uses a variable-width array format for storing values
• Column encoding provides the capability to index into the array to access the value of a key value column (sketch below)
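A rough sketch of the packing principle (the real PHOENIX-2565 serialization format differs): all values for a row live in one byte[], the concatenated values are followed by an offset table, and the encoded column number indexes straight into that table.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class PackedRowSketch {
    // Pack variable-width values, then append one 4-byte offset per value and a count.
    static byte[] pack(List<byte[]> values) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream data = new DataOutputStream(out);
        int[] offsets = new int[values.size()];
        int pos = 0;
        for (int i = 0; i < values.size(); i++) {
            offsets[i] = pos;
            data.write(values.get(i));
            pos += values.get(i).length;
        }
        for (int offset : offsets) {
            data.writeInt(offset);
        }
        data.writeInt(values.size()); // trailer: number of values
        return out.toByteArray();
    }

    // O(1) access: the column's encoded number selects its offset directly.
    static byte[] get(byte[] packed, int column) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        int count = buf.getInt(packed.length - 4);
        int offsetTable = packed.length - 4 - 4 * count;
        int start = buf.getInt(offsetTable + 4 * column);
        int end = (column + 1 < count) ? buf.getInt(offsetTable + 4 * (column + 1)) : offsetTable;
        return Arrays.copyOfRange(packed, start, end);
    }

    public static void main(String[] args) throws IOException {
        byte[] packed = pack(Arrays.asList(
                "alpha".getBytes(StandardCharsets.UTF_8),
                "b".getBytes(StandardCharsets.UTF_8),
                "gamma".getBytes(StandardCharsets.UTF_8)));
        System.out.println(new String(get(packed, 2), StandardCharsets.UTF_8)); // gamma
    }
}
```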
Performance Benefits
• Considerable disk size reduction
• More rows fit into the block cache
• Reduced GC pressure as garbage size goes down
• Replaces binary search over column qualifiers with an O(1) lookup
ORDER BY Overview
• Phoenix compiles the query into scans projecting the columns in <SELECT>, <ORDER BY>, and <WHERE>
• The Phoenix co-processor retrieves rows by asking HRegionScanner to fill a List<Cell>
• The List<Cell> is lexicographically sorted
• Rows are added to Phoenix’s sort data structure
• Sort keys are formed by binary searching the List<Cell> filled by HRegionScanner (sketch below)
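For contrast, a sketch of the non-encoded lookup: the row’s cells come back sorted by qualifier, so finding the value of a projected column means a binary search on qualifier bytes. This uses the HBase Cell and Bytes APIs; the surrounding class is illustrative, not Phoenix’s actual code.

```java
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.util.Bytes;

public class SortKeySketch {
    // Binary search the lexicographically sorted row cells for one qualifier.
    static Cell findByQualifier(List<Cell> sortedRowCells, byte[] qualifier) {
        int low = 0, high = sortedRowCells.size() - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            Cell cell = sortedRowCells.get(mid);
            int cmp = Bytes.compareTo(
                    cell.getQualifierArray(), cell.getQualifierOffset(), cell.getQualifierLength(),
                    qualifier, 0, qualifier.length);
            if (cmp < 0) low = mid + 1;
            else if (cmp > 0) high = mid - 1;
            else return cell;
        }
        return null; // column not present in this row
    }
}
```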
ORDER BY with Column Encoding
• Use numbers as column qualifiers
• Custom list implementation for HBase scanners to fill key values into
• Key values are added to the list at the index determined by converting the qualifier byte[] to an integer/short
• Replaces binary search with an O(1) lookup (sketch below)
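A minimal sketch of the encoded path, assuming 4-byte integer qualifiers: the scanner drops each cell into the slot named by its decoded qualifier, so later access is a plain array index instead of a binary search. Only the HBase Cell interface is real; the class itself is illustrative.

```java
import java.nio.ByteBuffer;
import org.apache.hadoop.hbase.Cell;

public class EncodedCellList {
    private final Cell[] slots;

    EncodedCellList(int maxQualifier) {
        this.slots = new Cell[maxQualifier + 1];
    }

    // Decode the 4-byte qualifier back to its column number and store the cell there.
    void add(Cell cell) {
        int qualifier = ByteBuffer.wrap(
                cell.getQualifierArray(), cell.getQualifierOffset(), cell.getQualifierLength()).getInt();
        slots[qualifier] = cell;
    }

    // O(1) lookup by encoded column number; no binary search over the row's cells.
    Cell get(int qualifier) {
        return slots[qualifier];
    }
}
```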
ORDER BY with Column Encoding
• Table with 50 columns
• Dense
• 20-character column names generated randomly
• ORDER BY columns selected were the first and last in lexicographical order
ORDER BY Performance Test Results
• Table size: 5x smaller (4-byte CQ vs. 20-byte)
• 2x faster (with and without block cache)
• Near-constant/slow growth as the number of projected columns increases
ORDER BY Test Results
[Chart: Encoded vs. Non-encoded; x-axis: Number of Columns Projected (25–50); y-axis scale 0–20,000]
GROUPED AGGREGATIONS Overview
• Queries are compiled to scans that project the key value columns in <SELECT>, <GROUP BY>, and <WHERE>
• Rows are aggregated in a GROUP BY map in Phoenix’s aggregate co-processor
• Map keys are formed by binary search in the List<Cell>
• Number-based column qualifiers and the custom list implementation come to the rescue again (sketch below)
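A hypothetical sketch of how the aggregate co-processor could assemble a GROUP BY map key once cells sit in qualifier-indexed slots: each grouping column is fetched by position and concatenated, with no binary search per column. All names here are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class GroupBySketch {
    // Build the aggregation key from the grouping columns by direct index access.
    static byte[] groupKey(byte[][] rowValuesByQualifier, int[] groupByQualifiers) {
        ByteArrayOutputStream key = new ByteArrayOutputStream();
        for (int qualifier : groupByQualifiers) {
            byte[] value = rowValuesByQualifier[qualifier]; // O(1), no binary search
            key.write(value, 0, value.length);
            key.write(0); // separator between grouping columns
        }
        return key.toByteArray();
    }

    public static void main(String[] args) {
        byte[][] row = {
                "US".getBytes(StandardCharsets.UTF_8),   // qualifier 0
                "2016".getBytes(StandardCharsets.UTF_8), // qualifier 1
                "42".getBytes(StandardCharsets.UTF_8)    // qualifier 2
        };
        Map<String, Long> counts = new HashMap<>();
        byte[] key = groupKey(row, new int[] {0, 1});
        counts.merge(Arrays.toString(key), 1L, Long::sum);
        System.out.println(counts);
    }
}
```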
GROUPED AGGREGATION
with Column Encoding
• PHOENIX-1940 – TPC-Q1
• TPC Data
• 60% smaller disk size
• Query time – with and without block cache
enabled – 25% faster
Heap and GC
Possible Performance Gains
(to be measured)
• Faster bulk load times because of smaller data
size
• Reduced index build times – both ASYNC and
SYNC
• Reduction in network I/O – faster UPSERT,
UPSERT SELECT
• Faster joins – smaller hash caches
Work in Progress..
• https://github.com/apache/phoenix/tree/encodecolumns
• 4.9 release
• Make joins take advantage of encoded
columns
• More encoding schemes (2 byte column
qualifiers)
• More perf. testing and tuning
Thank You!
