Optimizing Columnar Stores
StreamBright Data
2016-05-19
Introduction
Istvan Szukacs
CTO - StreamBright Data
Working with (big) data since 2009.
Building & optimizing data pipelines for
companies like:
Amazon, Riot Games, Symantec
StreamBright Data
“On Demand DevOps and Data Expertise”
Founded in 2015, serving US and Western
European clients
Building “Decision Pipelines” - end-to-end,
scalable solutions to get business insights
from data
Development base in Budapest - looking for
big data and data science competency
Row vs Column Oriented Data Stores
Row Oriented
PROS:
- easy to add/modify a record
- suitable for write-heavy loads (UPDATE, INSERT)
CONS:
- might read in unnecessary data
Column Oriented
PROS:
- only need to read in relevant data
- suitable for read-heavy analytical load (SELECT)
CONS:
- row writes require multiple accesses
Short History Of Columnar Stores
● "A Modular, Self-Describing Clinical Databank
System," Computers and Biomedical Research,
1975
● “An Overview Of Cantor: A New System For Data
Analysis” Karasalo, Svensson, SSDBM 1983
● “The Design of Cantor: A New System For Data
Analysis” Karasalo, Svensson, SSDBM 1986
Short History Of Columnar Stores
Fully transposed file reference:
On Searching Transposed Files, D. S. Batory, University of Toronto, 1979
“A transposed file is a collection of nonsequential files called
subfiles. Each subfile contains selected attribute data for all
records. It is shown that transposed file performance can be
enhanced by using a proper strategy to process queries. Analytic
cost expressions for processing conjunctive, disjunctive, and
batched queries are developed and an effective heuristic for
minimizing query processing costs is presented.”
Notable Features For Columnar Stores
● Data Encoding
● Efficient Compression
● Lazy Decompression
Data Encoding
From the smallest to the largest data types:
- Boolean (1 bit)
- Integer (1-8 bytes)
- Float (4-8 bytes)
- Datetime (3-8 bytes)
- String, UTF-8 (1-4 bytes per character; e.g., 64 chars -> 64-256 bytes)
- Complex Structures (depends)
Data Encoding
How can we save space?
- Let's address the widest columns first: strings
- Assigning an integer to each distinct value replaces a wide string with a small
integer in every row
- Real-world example: storing SHA2 hashes
- 64-512 bytes -> 1-8 bytes per row
- Storing the dictionary + encoded data << unchanged data
- This is called dictionary encoding (see the sketch below)
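ORC builds a dictionary like this automatically inside each stripe for string columns with few distinct values. Done by hand at the table level, the same idea looks roughly like the sketch below (table and column names such as events, some_sha2, sha2_dict, event_time are hypothetical):

-- Map each distinct SHA2 string to a small integer once...
CREATE TABLE sha2_dict STORED AS ORC AS
SELECT some_sha2, row_number() OVER (ORDER BY some_sha2) AS sha2_id
FROM (SELECT DISTINCT some_sha2 FROM events) d;

-- ...and keep only the integer in the wide table.
CREATE TABLE events_encoded STORED AS ORC AS
SELECT d.sha2_id, e.event_time, e.payload
FROM events e
JOIN sha2_dict d ON e.some_sha2 = d.some_sha2;

You rarely need to do this by hand; the point is that the dictionary plus an integer column is much smaller than the raw strings.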
Run Length Encoding
- If there is repetition in the data, store the value and the number of times
it is repeated
- A,A,A,A,A -> (5, A)
- This works best on sorted data (see the sketch after the example below)
- Sometimes multiple columns can be kept sorted within the same data block
Run Length Encoding
RLE example:

A  B  C        A       B       C
-------        ----------------------
a  3  e   =>   (4, a)  (2, 3)  (2, e)
a  3  e        (4, b)  (3, 2)  (2, g)
a  2  g                (3, 1)  (4, f)
a  2  g
b  2  f
b  1  f
b  1  f
b  1  f
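Whether long runs show up at all depends on how the data is laid out on disk. In Hive the layout can be controlled at load time; a minimal sketch (table and column names are hypothetical):

-- Sorting inside each output file lines up repeated values,
-- which maximizes run lengths for the leading sort columns.
INSERT OVERWRITE TABLE events_orc
SELECT *
FROM events
SORT BY event_type, country_code;  -- per-reducer sort in Hive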
Notable Features For Columnar Stores
● Data Encoding
● Compression
● Lazy Decompression
Compression
- Compression is applied on top of the encodings
- There is a trade-off between compression time and space
- Widely used codecs (see the sketch below):
- Snappy (fast, smaller space savings)
- Zlib (slower, better space efficiency)
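With Hive/ORC the codec is a per-table property; a minimal sketch (the table name is hypothetical, and only files written after the change pick up the new codec):

-- Fast writes, larger files:
ALTER TABLE events_orc SET TBLPROPERTIES ('orc.compress'='SNAPPY');

-- Slower writes, smaller files:
ALTER TABLE events_orc SET TBLPROPERTIES ('orc.compress'='ZLIB');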
Notable Features For Columnar Stores
● Data Encoding
● Compression
● Lazy Decompression
Lazy Decompression
- Lazy decompression means deferring decompression until the reader
actually needs the values
- It saves bandwidth and speeds up queries
- The dictionary has to be shipped to the reader
Hadoop/Hive Columnar Stores
RCFILE -- (Note: Available in Hive 0.6.0 and later)
ORC -- (Note: Available in Hive 0.11.0 and later)
PARQUET -- (Note: Available in Hive 0.13.0 and later)
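Choosing a format is a single clause in the DDL; a minimal sketch (table and column names are hypothetical):

-- ORC shorthand, available since Hive 0.11:
CREATE TABLE events_orc (
  user_id    BIGINT,
  event_type STRING,
  payload    STRING
)
STORED AS ORC;

-- The same table as Parquet (Hive 0.13+) would end with: STORED AS PARQUET;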
ORC 101
- Data is stored in stripes within a file
- Each stripe has its own index
- The index holds basic statistics (min, max) per column
- ORC also supports (see the sketch below):
- Predicate pushdown
- Bloom filters
- Lazy decompression
- Snappy and Zlib compression
- Bucketing (requires sorting)
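Most of these features are enabled per table and per session; a minimal sketch of the relevant properties (table and column names are hypothetical, and the bloom filter properties need a reasonably recent Hive/ORC version):

CREATE TABLE events_orc (
  user_id   BIGINT,
  some_sha2 STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'orc.create.index'='true',
  'orc.bloom.filter.columns'='some_sha2',
  'orc.bloom.filter.fpp'='0.05'
);

-- Let the reader use the indexes and bloom filters at query time:
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;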
Optimizing A Petabyte Scale DWH
- We know many of the moving parts; let's look at a real-world use case
- A client asked for performance improvements
- 1 TB/day, 83 columns in the table, 1 PB full size, Snappy compressed
- We could not change anything that would break the application using the table
- No explicit sorting anywhere
- A few extremely wide columns with high repetition
Optimizing A Petabyte Scale DWH
Our assumption:
The problem is IO bound, so decreasing the size
on disk will result in better performance.
How can we assume this?
Optimizing A Petabyte Scale DWH
- I'll spare you the iteration steps we took
- We ended up with the following changes:
- Explicit sorting for the widest column
- Explicit sorting for the other wide columns
- Snappy -> Zlib
- Buckets: 20 -> 256
- Stripe size: 64 MB -> 128 MB
Optimizing A Petabyte Scale DWH
CLUSTERED BY (
  some_sha2)
SORTED BY (
  some_sha2, some_other_sha2, some_md5)
INTO 256 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://cluster/orc_test'
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'orc.create.index'='true',
  'orc.stripe.size'='130023424',
  'orc.row.index.stride'='64000'
)
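For the bucketing and the sort order to actually take effect, the data has to be loaded with the right session settings; roughly (a sketch, the staging_events source table is hypothetical):

-- Make Hive honour the bucket count and the SORTED BY clause on insert
-- (newer Hive versions enforce this automatically):
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

INSERT OVERWRITE TABLE orc_test
SELECT * FROM staging_events;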
Optimizing A Petabyte Scale DWH
Baseline:
Time taken: 84.259 seconds, Size: 860.1 G
Improved:
Time taken: 27.697 seconds, Size: 205.7 G
Optimizing A Petabyte Scale DWH
Key changes & findings:
- Introduced explicit sorting, saving a huge amount of space
- Traded insertion speed for better compression, saving some more
space (this is a good trade-off)
- Space savings correspond almost linearly to query execution
speedups
- Disk IO is still the biggest bottleneck for large-scale DWHs
- Default settings are not good enough at petabyte scale
- Knowing the details of your columnar store helps you decide what to change
- You can change things around without breaking anything
Q & A