2. Introduction
Istvan Szukacs
CTO - StreamBright Data
Working with (big) data since 2009.
Building & optimizing data pipelines for
companies like:
Amazon, Riot Games, Symantec
StreamBright Data
“On Demand DevOps and Data Expertise”
Founded in 2015, serving US and Western
European clients
Building “Decision Pipelines” - end-to-end,
scalable solutions to get business insights
from data
Development base in Budapest - looking for
big data and data science competency
4. Row vs Column Oriented Data Stores
Row Oriented
PROS:
- easy to add/modify a record
- suitable for write heavy load (UPDATE, INSERT)
CONS:
- might read in unnecessary data
Column Oriented
PROS:
- only need to read in relevant data
- suitable for read-heavy analytical load (SELECT)
CONS:
- row writes require multiple accesses
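A minimal Python sketch of the two layouts, assuming a made-up three-column table (the names `rows`, `columns` and the sample values are illustrative, not from the talk). It shows why an aggregate over one field reads less data in the columnar layout, and why a single-row insert touches every column.

```python
# Row-oriented: all fields of a record are stored together.
rows = [
    {"id": 1, "country": "HU", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 25.5},
    {"id": 3, "country": "HU", "amount": 7.25},
]

# Column-oriented: all values of an attribute are stored together.
columns = {
    "id": [1, 2, 3],
    "country": ["HU", "US", "HU"],
    "amount": [10.0, 25.5, 7.25],
}


def total_amount_rows(table):
    # Every record is visited in full, although only one field is needed.
    return sum(record["amount"] for record in table)


def total_amount_columns(table):
    # Only the "amount" column is read.
    return sum(table["amount"])


def append_row(table, record):
    # A single-row insert needs one access per column.
    for name, value in record.items():
        table[name].append(value)
```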
5. Short History Of Columnar Stores
● "A Modular, Self-Describing Clinical Databank
System," Computers and Biomedical Research,
1975
● “An Overview Of Cantor: A New System For Data
Analysis” Karasalo, Svensson, SSDBM 1983
● “The Design of Cantor: A New System For Data
Analysis” Karasalo, Svensson, SSDBM 1986
7. Short History Of Columnar Stores
Fully transposed file reference:
On searching transposed files, Don Steve Batory, Univ. of
Toronto, Toronto, Ont., Canada, 1979
“A transposed file is a collection of nonsequential files called
subfiles. Each subfile contains selected attribute data for all
records. It is shown that transposed file performance can be
enhanced by using a proper strategy to process queries. Analytic
cost expressions for processing conjunctive, disjunctive, and
batched queries are developed and an effective heuristic for
minimizing query processing costs is presented.”
8. Notable Features For Columnar Stores
● Data Encoding
● Efficient Compression
● Lazy Decompression
10. Data Encoding
From the smallest to the largest data types:
- Boolean (1 bit)
- Integer (1-8 bytes)
- Float (4-8 bytes)
- Datetime (3-8 bytes)
- String, UTF-8 (1 to 4 bytes per character, 64 chars -> 64-256 bytes)
- Complex Structures (depends)
11. Data Encoding
How can we save space?
- Let’s address the widest columns, strings
- Assigning an integer to each distinct value could save us a few bytes
in every row
- Real world example: storing SHA2 hashes
- 64-128 bytes (hex-encoded SHA-256 to SHA-512) -> 1-8 bytes / row
- Size of dictionary + encoded data << size of the original data
- This is called dictionary encoding
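Dictionary encoding can be sketched in a few lines of Python. This is an illustration under simple assumptions (an in-memory list, codes assigned in order of first appearance); the function names are hypothetical and do not reflect how any particular columnar store implements it.

```python
import hashlib


def dictionary_encode(values):
    """Return (dictionary, codes): each distinct value gets a small integer."""
    dictionary = []  # code -> value
    index = {}       # value -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


# Five rows of 64-character SHA-256 hex digests, only two distinct values.
hashes = [hashlib.sha256(s.encode()).hexdigest() for s in ["a", "b", "a", "a", "b"]]
dictionary, codes = dictionary_encode(hashes)
```

Each 64-character string is stored once in `dictionary`; the column itself shrinks to the small integers in `codes`.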
12. Run Length Encoding
- If there is repetition in any sort of data, store the value and the number
of times it is repeated
- A,A,A,A,A -> A,5
- This works best on sorted data
- Sometimes multiple columns can be sorted in the same data block
13. Run Length Encoding
RLE example (each run is written as (count, value)):
 A  B  C          A       B       C
----------      ----------------------
 a  3  e    =>  (4, a)  (2, 3)  (2, e)
 a  3  e        (4, b)  (3, 2)  (2, g)
 a  2  g                (3, 1)  (4, f)
 a  2  g
 b  2  f
 b  1  f
 b  1  f
 b  1  f
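The eight-row example can be reproduced with a short Python sketch. The helper names are hypothetical; runs are emitted as (count, value) pairs to match the notation used in the example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (count, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (count, value) pairs back into the original sequence."""
    return [value for count, value in runs for _ in range(count)]


column_a = list("aaaabbbb")
column_b = [3, 3, 2, 2, 2, 1, 1, 1]
column_c = list("eeggffff")
```

Because only consecutive repeats are merged, the same values in a shuffled order would produce many more runs, which is why sorted data encodes best.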
15. Compression
- Compression is applied on top of the encodings
- There are tradeoffs between compression time and space
- Widely used compression codecs:
- Snappy (fast, smaller space saving)
- Zlib (slower, better space efficiency)
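The time versus space trade-off can be observed with a small Python sketch. Snappy is not part of the Python standard library, so this sketch assumes zlib at level 1 as a stand-in for a "fast" codec and zlib at level 9 for a "small" one; the payload is synthetic and the helper name is hypothetical.

```python
import time
import zlib


def compress_stats(data, level):
    """Return (compressed size in bytes, seconds spent compressing)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(compressed), elapsed


# Synthetic, repetitive payload standing in for an encoded column.
data = b"".join(
    b"event=%d;status=OK;region=eu-west\n" % (i % 1000) for i in range(50_000)
)

fast_size, fast_seconds = compress_stats(data, 1)    # favors speed
small_size, small_seconds = compress_stats(data, 9)  # favors size
```

On most inputs the higher level takes longer and produces a smaller result, but the exact numbers depend on the data and the machine.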
17. Lazy Decompression
- Lazy decompression is the notion of decompressing values at
the reader
- It saves bandwidth and speeds up queries
- Dictionary has to be sent to the reader
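One way to picture this is a filter that works on dictionary codes and decodes only the rows it returns. The sketch below is an interpretation under that assumption, with hypothetical function names; it is not the mechanism of any specific engine.

```python
def filter_encoded(dictionary, codes, predicate):
    """Evaluate the predicate once per distinct value, then scan integer codes."""
    matching_codes = {
        code for code, value in enumerate(dictionary) if predicate(value)
    }
    return [position for position, code in enumerate(codes) if code in matching_codes]


def materialize(dictionary, codes, positions):
    """Decode only the rows that survived the filter."""
    return [dictionary[codes[position]] for position in positions]


# The reader receives the dictionary together with the encoded column.
dictionary = ["error", "info", "warn"]
codes = [1, 1, 0, 2, 1, 0]

positions = filter_encoded(dictionary, codes, lambda value: value == "error")
decoded = materialize(dictionary, codes, positions)
```

Only two of the six rows are ever turned back into strings; the rest stay as integers for the whole query.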
18. Hadoop/Hive Columnar Stores
RCFILE -- (Note: Available in Hive 0.6.0 and later)
ORC -- (Note: Available in Hive 0.11.0 and later)
PARQUET -- (Note: Available in Hive 0.13.0 and later)
20. ORC 101
- Data is stored in stripes within a file
- Each stripe has its own index
- Index has basic statistics (min, max)
- ORC supports:
- Predicate pushdown
- Bloom filters
- Lazy decompression
- Snappy and Zlib as compression
- Bucketing (requires sorting)
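How min/max statistics enable predicate pushdown can be sketched as follows. The `Stripe` class and the helper functions are hypothetical simplifications for illustration; the real ORC file layout and reader are considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    values: list
    minimum: int
    maximum: int


def make_stripes(values, stripe_size):
    """Split a column into stripes and record min/max for each one."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append(Stripe(chunk, min(chunk), max(chunk)))
    return stripes


def scan_equal(stripes, target):
    """Return (matches, stripes_read), skipping stripes that cannot match."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        if target < stripe.minimum or target > stripe.maximum:
            continue  # the statistics prove the value is not in this stripe
        stripes_read += 1
        matches.extend(value for value in stripe.values if value == target)
    return matches, stripes_read


sorted_stripes = make_stripes(list(range(100)), 10)
matches, stripes_read = scan_equal(sorted_stripes, 42)
```

With sorted input the min/max ranges do not overlap, so only one of the ten stripes is read. On unsorted data the ranges overlap and far fewer stripes can be skipped.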
21. Optimizing A Petabyte Scale DWH
- Now that we know many of the moving parts, let’s look at a real-world use case
- One client asked for performance improvements
- 1TB/day, 83 columns in the table, 1PB full size, Snappy compressed
- Constraint: we cannot change anything that would break the application using the table
- No explicit sorting anywhere
- A few extremely wide columns with high repetition
22. Optimizing A Petabyte Scale DWH
Our assumption:
The problem is IO bound, hence decreasing the size
on disk will result in better performance.
How can we assume this?
24. Optimizing A Petabyte Scale DWH
- I will spare you the iteration steps we took
- Ended up with the following changes:
- Explicit sorting for the widest column
- Explicit sorting for other wide columns
- Snappy -> Zlib
- Number of buckets: 20 -> 256
- Stripe size: 64M -> 128M
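Why sorting a wide, repetitive column helps compression can be shown with a small experiment. This sketch assumes a synthetic column of SHA-256 hex digests (2,000 distinct values, 20,000 rows) and uses zlib from the Python standard library; the numbers it produces are illustrative and are not the client's data.

```python
import hashlib
import random
import zlib


def compressed_size(values):
    """Size in bytes of the newline-joined column after zlib compression."""
    return len(zlib.compress("\n".join(values).encode("ascii"), 6))


rng = random.Random(0)  # fixed seed so the experiment is repeatable
distinct = [hashlib.sha256(str(i).encode()).hexdigest() for i in range(2000)]
column = [rng.choice(distinct) for _ in range(20000)]

unsorted_size = compressed_size(column)
sorted_size = compressed_size(sorted(column))
```

In the unsorted column most repeats fall outside the compressor's 32 KB window and are stored again in full; after sorting, identical values sit next to each other and collapse into short back-references.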
25. Optimizing A Petabyte Scale DWH
CLUSTERED BY (some_sha2)
SORTED BY (some_sha2, some_other_sha2, some_md5)
INTO 256 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://cluter/orc_test'
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'orc.create.index'='true',
  'orc.stripe.size'='130023424',
  'orc.row.index.stride'='64000'
)
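Conceptually, CLUSTERED BY ... SORTED BY ... INTO 256 BUCKETS assigns each row to a bucket by hashing the clustering column and keeps every bucket sorted. The Python sketch below illustrates that idea only: it assumes `zlib.crc32` as a stand-in hash (Hive uses its own hash function), and the helper names are hypothetical.

```python
import zlib

NUM_BUCKETS = 256


def bucket_for(key, num_buckets=NUM_BUCKETS):
    # Stand-in hash for illustration; not the hash function Hive uses.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_buckets(rows, key_column, sort_columns, num_buckets=NUM_BUCKETS):
    """Group rows by hashed key, then sort the rows inside each bucket."""
    buckets = {}
    for row in rows:
        bucket_id = bucket_for(row[key_column], num_buckets)
        buckets.setdefault(bucket_id, []).append(row)
    for bucket_rows in buckets.values():
        bucket_rows.sort(key=lambda row: tuple(row[name] for name in sort_columns))
    return buckets
```

All rows with the same `some_sha2` value land in the same bucket, and sorting inside the bucket puts equal values next to each other, which is what the encodings and the compressor benefit from.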
26. Optimizing A Petabyte Scale DWH
Baseline:
Time taken: 84.259 seconds, Size: 860.1 G
Improved:
Time taken: 27.697 seconds, Size: 205.7 G
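The two ratios implied by these measurements can be worked out directly. The figures are the ones reported above; the dictionary keys are just illustrative names.

```python
baseline = {"seconds": 84.259, "size_g": 860.1}
improved = {"seconds": 27.697, "size_g": 205.7}

speedup = baseline["seconds"] / improved["seconds"]
size_reduction = baseline["size_g"] / improved["size_g"]

print(f"query speedup:  {speedup:.2f}x")         # 3.04x
print(f"size reduction: {size_reduction:.2f}x")  # 4.18x
```

The table became about 4.2 times smaller and the query about 3 times faster.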
27. Optimizing A Petabyte Scale DWH
Key changes & findings:
- Introduced explicit sorting, saving a huge amount of space
- Traded insertion speed for better compression, saving some more
space (this is a good trade-off)
- Space savings correspond almost linearly to query execution
speedups
- Disk IO is still the biggest bottleneck for large-scale DWHs
- Default settings are not good enough for petabyte scale
- Knowing the details of your columnar store helps you decide what to change
- You can change things around without breaking anything