© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Vertica
In Depth
Basic introduction
Samchu Li / Jan 3rd, 2013
Updated: Samchu Li / May 23rd, 2014
Agenda
• History
• Storage Model – compare DSM with NSM, PAX
• Column Store
• Compression
• Projection & record construction
• Joins
• Vertica SQL process (Hybrid Storage Model)
• Flex Zone
• 4C / Availability
• Query Execution Workflow
• UDx
• Eco-system
Vertica
History
C-Store & Vertica
Samchu Li / Jan 3rd, 2013
Vertica History
1. Miguel C. Ferreira, <<Compression and Query Execution within Column Oriented
Databases>>, Master of Engineering thesis in Computer Science and Electrical Engineering at
MIT; June 2005. Where C-Store comes from.
2. MIT's open-source project C-Store. <<C-Store: A Column-oriented DBMS>>, VLDB 2005
3. Vertica was founded in 2005, based on C-Store, in Billerica, Massachusetts, US. Co-founder:
Michael Stonebraker.
4. March 2011: HP acquired Vertica. <<The Vertica Analytic Database: C-Store 7 Years
Later>>, VLDB, 2012
Michael Stonebraker, a founding father of SQL Server/Sybase.
A renowned database scientist, he proposed the object-relational database model in 1992 and
was a computer science professor at UC Berkeley for 25 years. During that time he created
Ingres, Illustra, Cohera, StreamBase Systems, Vertica, and other systems. Professor
Stonebraker has also served as CEO of Informix; he is currently an adjunct professor at MIT.
Professor Stonebraker led the post-Ingres project known as Postgres. The project was
enormously fruitful, contributing to many aspects of modern databases. He also did the
whole world a favor by placing Postgres under the BSD license. Postgres has since become
PostgreSQL, and it grows more capable by the day.
Around 1987, Sybase partnered with Microsoft to co-develop SQL Server, whose original
code has roots in Ingres. The partnership ended in 1994, leaving each company with an
identical copy of the SQL Server code base. In this sense, Professor Stonebraker can be
regarded as a founding figure of today's mainstream databases.
Ingres (Michael Stonebraker) → Informix (acquired by IBM in 2000)
→ Sybase → MS SQL Server (product sold to Microsoft in 1992)
→ NonStop SQL (Tandem was acquired by Compaq, which began a rewrite in 2000; HP acquired Compaq in 2002) → Neoview → SeaQuest
→ Postgres → Illustra (acquired by Informix in 1997)
→ PostgreSQL
C-Store → Vertica
Vertica Market Share
2012
Vertica
Storage Model
NSM, DSM, PAX
Samchu Li / Jan 3rd, 2013
Column Storage
Storage Model in DB
NSM
70s ~ 1985
DSM
1985, <<A Decomposition Storage Model>>, Copeland and
Khoshafian, SIGMOD
PAX
2001, <<Weaving Relations for Cache Performance>>, Ailamaki,
DeWitt, Hill, Skounakis, VLDB
NSM
DSM
PAX (Partition Attributes Across) - MonetDB
Why is a DSM/columnar DB so quick?
ID(PK, INT4) | Name(VARCHAR5) | Age(INT2)
0962 | Jane | 30
7658 | John | 45
3859 | Jim | 20
5523 | Susan | 52
… | … | …
SELECT NAME FROM TABLEName; NAME average length 4 bytes
1. NSM
Row length = 4 + 5 + 2 = 11 bytes
1 million rows * 11 bytes / 1024 = 10742.1875 KB
1 block = 32 KB | 1 block contains 2978 complete records
Block scans = 10742.1875 KB / 32 KB ≈ 336
2. DSM
Column value length = 4 bytes
1 million values * 4 bytes / 1024 = 3906.25 KB
1 block = 32 KB | 1 block contains 8192 values
Block scans ≈ 123
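The block-scan arithmetic above can be sketched as follows (a minimal illustration; function and variable names are hypothetical):

```python
import math

def blocks_scanned(num_values, bytes_per_value, block_size=32 * 1024):
    """Number of 32 KB blocks a full scan must read."""
    total_bytes = num_values * bytes_per_value
    return math.ceil(total_bytes / block_size)

# NSM: the whole 11-byte row must be read just to get the NAME column.
nsm = blocks_scanned(1_000_000, 11)  # -> 336
# DSM: only the NAME column (averaging 4 bytes) is read.
dsm = blocks_scanned(1_000_000, 4)   # -> 123
```

The roughly 3x fewer blocks read is exactly the I/O saving the column layout buys for this query.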
Why is a DSM/columnar DB so quick?
[Bar charts: block scans, NSM ≈ 336 vs DSM ≈ 123; records per block, NSM 2978 vs DSM 8192]
Weakness - Scan performance for example
Read-Optimized Databases, In Depth; Allison L. Holloway and David J. DeWitt; 2008, VLDB
Vertica
Compression
Samchu Li / Jan 3rd, 2013
Clustering & Compression
Compression
• Trades I/O for CPU
• Increased column-store opportunities:
• Higher data value locality in column stores
• Techniques such as run length encoding far more useful
• Can use extra space to store multiple copies of data in different sort orders
• Operating Directly on Compressed Data
• I/O - CPU tradeoff is no longer a tradeoff
• Reduces memory–CPU bandwidth requirements
• Opens up possibility of operating on multiple records at once
Run-length Encoding (RLE)
Integrating Compression and Execution in Column-Oriented Database Systems, SIGMOD, 2006
Bit-vector Encoding
Integrating Compression and Execution in Column-Oriented Database Systems, SIGMOD, 2006
Dictionary Encoding
Integrating Compression and Execution in Column-Oriented Database Systems, SIGMOD, 2006
Frame Of Reference Encoding
The sign (+/−) takes one bit; with 3 offset bits the maximum offset is 111 in binary = 7
Integrating Compression and Execution in Column-Oriented Database Systems, SIGMOD, 2006
Differential Encoding
A 3-bit delta: 100 in binary = 4
Integrating Compression and Execution in Column-Oriented Database Systems, SIGMOD, 2006
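Both offset-based schemes above can be sketched in a few lines (illustrative only; names are hypothetical): frame-of-reference stores a base value plus small per-value offsets, while differential encoding stores each value as a delta from its predecessor.

```python
def for_encode(values, bits=3):
    """Frame-of-reference: a base plus small per-value offsets.
    With `bits` bits per offset, the largest representable offset is
    2**bits - 1 (e.g. 0b111 == 7 for 3 bits)."""
    base = min(values)
    offsets = [v - base for v in values]
    if max(offsets) > 2**bits - 1:
        raise ValueError("values span more than the frame allows")
    return base, offsets

def delta_encode(values):
    """Differential encoding: first value, then successive deltas."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]
```

Both work best on sorted or clustered columns, where neighboring values are close together and the offsets/deltas stay tiny.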
What Compression Scheme To Use?
Integrating Compression and Execution in Column-Oriented Database Systems, SIGMOD, 2006
Group iteration
Traditionally, a database processes records one at a time with an iterator; in a
row-oriented DB, the NSM storage model makes this cache-inefficient.
In a column-oriented DB, one iteration can cover many records at once: with RLE,
a triple like (100, 1, 100) stands for one hundred records handled in a single step.
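Group iteration over RLE triples can be sketched like this (a toy model; the (value, start, length) layout follows the slide's example, and the helper names are hypothetical):

```python
# RLE triples: (value, start_position, run_length)
rle_column = [(100, 1, 100), (250, 101, 40), (100, 141, 60)]

def rle_sum(triples):
    """SUM over the column: one multiply per run instead of one
    addition per record."""
    return sum(value * length for value, _start, length in triples)

def rle_count_equal(triples, target):
    """COUNT(*) WHERE col = target: one comparison per run."""
    return sum(length for value, _start, length in triples if value == target)
```

Each aggregate touches three triples instead of 200 individual records, which is the point of operating directly on compressed data.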
Vertica
Projection & record
construction
Samchu Li / Jan 3rd, 2013
Logical Schema & Physical Schema
Logical Schema
In traditional database architectures, data is primarily stored in tables. Additionally,
secondary tuning structures such as indexes and materialized views are created
for improved query performance.
Physical Schema
In contrast, tables do not occupy any physical storage at all in Vertica. Instead,
physical storage consists of collections of table columns called projections.
Projection
Projection
Table:
ID(PK, INT4) | Name(VARCHAR5) | Age(INT2)
0962 | Jane | 30
7658 | John | 45
3859 | Jim | 20

Super Projection:
ID(PK, INT4) | Name(VARCHAR5) | Age(INT2)
3859 | Jim | 20
5523 | Susan | 52
7658 | John | 45
0962 | Jane | 30
… | … | …

Projection 1 (sorted by Name):
Name(VARCHAR5) | ID(PK, INT4) | Age(INT2)
Jane | 0962 | 30
Jim | 3859 | 20
John | 7658 | 45

Projection 2 (sorted by Age):
Age(INT2) | ID(PK, INT4)
20 | 3859
30 | 0962
45 | 7658
52 | 5523
… | …
Projection & Index
Vertica is designed for data warehousing / big data; it is not specifically designed for
single-row lookups.
No indexes
In a highly simplified view, you can think of a Vertica projection as a single-level,
densely packed, clustered index which stores the actual data values,
is never updated in place, and has no logging. Any "maintenance" such as
merging sorted chunks or purging deleted records is done as an automatic
background activity, not in the path of real-time loads. So yes,
projections are a type of native index if you will, but they are very different
from traditional indexes like bitmaps and B-trees.
Query Benefits of Storing Sorted Data
How does Vertica query huge volumes without indexes?
It’s easy… the data is sorted by column value, something we can do
because we wrote both our storage engine and execution engine from
scratch. We don’t store the data by insert order, nor do we limit sorting to
within a set of disk blocks. Instead, we have put significant engineering
effort into keeping the data totally sorted during its entire lifetime in
Vertica. It should be clear how sorted data increases compression ratios
(by putting similar values next to each other in the data stream), but it
might be less obvious at first how we use sorted data to increase query
speed as well.
An example
SELECT stock, price FROM ticks ORDER BY stock, price;

SELECT stock, price FROM ticks WHERE stock = 'IBM' ORDER BY price;
An example
One pass aggregation
SELECT stock, AVG(price) FROM ticks GROUP BY stock;
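One-pass aggregation over sorted data can be sketched as follows (a Python illustration with hypothetical names): because rows arrive already grouped by stock, each group is contiguous, so no hash table and no re-sort are needed.

```python
from itertools import groupby

# Rows already sorted by stock, as a projection would store them.
ticks = [("GE", 10.0), ("GE", 12.0),
         ("IBM", 100.0), ("IBM", 102.0), ("IBM", 104.0)]

def avg_price_per_stock(sorted_rows):
    """One-pass AVG(price) GROUP BY stock over sorted input: each group
    can be emitted as soon as the stock symbol changes."""
    out = {}
    for stock, rows in groupby(sorted_rows, key=lambda r: r[0]):
        prices = [price for _stock, price in rows]
        out[stock] = sum(prices) / len(prices)
    return out
```

groupby only yields contiguous runs, which is exactly why the sorted storage order makes the aggregation single-pass.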
Projection
How tuples are constructed
Two ways:
1. EM (Early Materialization)
Row-oriented databases (where projections are almost always performed as soon
as an attribute is no longer needed) suggest a natural tuple construction policy:
at each point at which a column is accessed, add the column to an intermediate
tuple representation if that column is needed by some later operator or is
included in the set of output columns.
1. Perform an inner join to construct the records the operator needs
2. Then send them to the operator, which operates on the real records
Materialization Strategies in a Column-Oriented DBMS, IEEE, 2007
How tuples are constructed
2. LM (Late Materialization)
a. First, scan a column's blocks and output the positions (ordinal offsets of
values within the column) that satisfy the predicate
b. Repeat with the other columns referenced by operations such as WHERE…
(these positions can take the form of ranges, lists, or a bitmap)
c. Use position-wise AND operations to intersect the position lists
d. Finally, re-access these columns, extract the values of records that satisfy
all predicates, and stitch these values together into output tuples
Materialization Strategies in a Column-Oriented DBMS, IEEE, 2007
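Steps a–d can be sketched with position sets (a toy illustration reusing the earlier sample columns; helper names are hypothetical):

```python
ids   = [962, 7658, 3859, 5523]
names = ["Jane", "John", "Jim", "Susan"]
ages  = [30, 45, 20, 52]

def positions(column, pred):
    """Steps a/b: scan one column, emit ordinal positions satisfying pred."""
    return {i for i, v in enumerate(column) if pred(v)}

# WHERE age > 25 AND name LIKE 'J%'
# Step c: position-wise AND (a set intersection here; could be a bitmap).
pos = positions(ages, lambda a: a > 25) & positions(names, lambda n: n.startswith("J"))

# Step d: re-access only the output columns and stitch tuples together.
result = [(ids[i], names[i]) for i in sorted(pos)]  # -> [(962, 'Jane'), (7658, 'John')]
```

Until step d, only positions flow between operators; full tuples are materialized as late as possible.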
When should tuples be constructed? - EM
Early Materialized – No join
When should tuples be constructed? -LM
Late Materialized – with Joins
When should tuples be constructed? -LM
Late Materialized – with Joins
When should tuples be constructed? -LM
Late Materialized – with Joins
When should tuples be constructed? -LM
Late Materialized – with Joins
EM with Joins
EM with Joins
LM with Joins
EM vs LM
A naïve LM join is about 2× slower than an EM join on typical queries (due to
random I/O)
This number is very dependent on:
the amount of memory available
the number of projected attributes
the join cardinality
But newer join algorithms let LM do better:
Invisible Join
Jive/Flash Join
Radix cluster/decluster join
Pre-join projections
Vertica supports prejoin projections which permit joining the projection’s anchor
table with any number of dimension tables via N:1 joins. This permits a
normalized logical schema, while allowing the physical storage to be
denormalized. The cost of storing physically denormalized data is much less
than in traditional systems because of the available encoding and compression.
Prejoin projections are not used as often in practice as we expected. This is
because Vertica’s execution engine handles joins with small dimension tables
very well (using highly optimized hash and merge join algorithms), so the
benefits of a prejoin for query execution are not as significant as we initially
predicted.
<<The Vertica Analytic Database: C-Store 7 Years Later>>, VLDB, 2012
Pre-join projections
In the case of joins involving a fact and a large dimension table or
two large fact tables where the join cost is high, most customers
are unwilling to slow down bulk loads to optimize such joins. In addition, joins
during load offer fewer optimization opportunities than joins during query
because the database knows nothing a priori about the data in the load stream.
Pre-join projections can have only inner joins between tables on their primary
and foreign key columns. Outer joins are not allowed.
<<The Vertica Analytic Database: C-Store 7 Years Later>>, VLDB, 2012
Vertica
Joins
Samchu Li / Jan 3rd, 2013
Invisible Join
Designed for typical joins when data is modeled using a star schema
One(“Fact”) table is joined with multiple dimension tables
select c_nation, s_nation, d_year,sum(lo_revenue) as revenue
from customer, lineorder, supplier, date
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and c_region = 'ASIA'
and s_region = 'ASIA'
and d_year >= 1992 and d_year <= 1997
group by c_nation, s_nation, d_year
order by d_year asc, revenue desc;
Invisible Join
Invisible Join
Invisible Join
Invisible Join
Bottom Line
Many data warehouses model data using star/snowflake schemas
Joins of one (fact) table with many dimension tables are common
Invisible join takes advantage of this by making sure that the table that can be accessed in
position order is the fact table for each join
Position lists from the fact table are then intersected (in position order)
This reduces the amount of data that must be accessed out of order from the dimension
tables
“Between-predicate rewriting” trick not relevant for this discussion
Invisible Join
Jive/Flash Join
Jive/Flash Join
Bottom Line
Instead of probing projected columns from the inner table out of order:
• Sort the join index
• Probe projected columns in order
• Sort the result using an added column
LM vs EM tradeoffs:
LM has the extra sorts (EM accesses all columns in order)
LM only has to fit join columns into memory (EM needs join columns and all
projected columns)
• Results in big memory and CPU savings
LM only has to materialize relevant columns
In many cases LM advantages outweigh disadvantages
LM would be a clear winner if not for those pesky sorts …
Radix Cluster/Decluster
The full sort from the Jive join is actually overkill
We just want to access the storage blocks in order (we don't mind random access
within a block)
So do a radix sort and stop early
By stopping early, data within each block is accessed out of order, but in the order
specified in the original join index
• Use this pseudo-order to accelerate the post-probe sort as well
Radix Sort
Pad all values to be compared (positive integers) to the same number of digits,
prefixing shorter values with zeros. Then sort digit by digit, starting from the least
significant digit. After sorting from the lowest digit through the highest digit, the
sequence is ordered.
Radix Sort Example - LSD
LSD (Least Significant Digit); positive integers
Start from the right side.
73, 22, 93, 43, 55, 14, 28, 65, 390, 81
First pass, by the units digit (buckets 0–9):
0: 390 | 1: 081 | 2: 022 | 3: 073, 093, 043 | 4: 014 | 5: 055, 065 | 8: 028
Result: 390, 081, 022, 073, 093, 043, 014, 055, 065, 028
Radix Sort Example - LSD
Second pass, by the tens digit:
390, 081, 022, 073, 093, 043, 014, 055, 065, 028
1: 014 | 2: 022, 028 | 4: 043 | 5: 055 | 6: 065 | 7: 073 | 8: 081 | 9: 390, 093
Result: 014, 022, 028, 043, 055, 065, 073, 081, 390, 093
Radix Sort Example - LSD
Last pass, by the hundreds digit:
014, 022, 028, 043, 055, 065, 073, 081, 390, 093
0: 014, 022, 028, 043, 055, 065, 073, 081, 093 | 3: 390
Result: 014, 022, 028, 043, 055, 065, 073, 081, 093, 390
Radix Sort Example - MSD
MSD (Most Significant Digit); positive integers
Start from the left side; suitable when values have many digits.
The process is the same.
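The LSD passes above can be condensed into a short implementation for non-negative integers (a sketch, not tuned code):

```python
def lsd_radix_sort(nums):
    """LSD radix sort: bucket by each decimal digit from least to most
    significant; because each pass is stable, earlier orderings survive."""
    width = len(str(max(nums)))  # number of decimal digits to process
    for d in range(width):
        buckets = [[] for _ in range(10)]
        for n in nums:
            buckets[(n // 10**d) % 10].append(n)
        # Concatenate buckets 0..9 to form the input of the next pass.
        nums = [n for bucket in buckets for n in bucket]
    return nums
```

Running it on the slide's input reproduces the result of the three passes shown above.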
Radix Cluster/Decluster
Bottom line
Both sorts from the Jive join can be significantly reduced in overhead
Only tested when there is sufficient memory for the entire join index to be stored
three times
The technique is likely applicable to larger join indexes, but its utility will drop a little
Only works if random access within a storage block is acceptable
Don't use radix cluster/decluster if you have variable-width column values or
compression schemes that can only be decompressed starting from the beginning
of the block
Tuple Construction Heuristics
For queries with selective predicates, aggregations, or compressed data,
use late materialization
For joins:
Research papers:
Always use late materialization
Commercial systems:
Inner table to a join often materialized before join (reduces system complexity):
Some systems will use LM only if columns from inner table can fit entirely in memory
Query Optimization
Almost all query optimization is automatic, or done via the Vertica Database
Designer; there is little to do manually. Three generations:
1. StarOpt
Only optimizes data-warehouse-style queries (star and snowflake).
2. StarifiedOpt
Adds optimization for non-star/snowflake queries.
3. V2Opt
Reworked query optimization: a set of extensible modules, new algorithms for
using statistics, …
Vertica SQL process
(Hybrid Storage Model)
Samchu Li / Jan 3rd, 2013
Vertica SQL process (Hybrid Storage Model)
Disk Physical Structure
http://blog.163.com/sonyericssonss/blog/static/109683969200911233723670/ (a detailed, easy-to-follow illustrated explanation of hard-disk structure)
I/O – Sequential I/O & Random I/O
Sequential vs. random refers to whether the starting sector address of this I/O matches (or is
close to) the ending sector address of the previous I/O. If it does, this I/O counts as sequential;
if the gap is large, it counts as random.
With sequential I/O, since the starting sector is very close to the previous ending sector, the
head barely needs to seek, or the seek time is very short. If the gap is large, the head needs a
long seek; with many random I/Os, the head keeps seeking and efficiency drops dramatically.
One of the most important aspects of database performance tuning is I/O. Roughly, a
15,000 RPM server disk can deliver about 75 non-sequential (random) I/O operations and about
150 sequential I/O operations per second. Such a disk is typically rated at around 100 MB/s,
but what actually limits a database server's throughput is those 75/150 I/Os per second.
Suppose each I/O operates on an 8 KB block; then:
75 random I/Os per second * 8 KB = 600 KB/s
150 sequential I/Os per second * 8 KB = 1200 KB/s — a huge gap from the rated 100 MB/s
In practice it is not quite this bad: each I/O moves more data than this, and there are
read-ahead and disk-cache mechanisms.
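The throughput arithmetic above, as a small sketch (names are hypothetical):

```python
def effective_throughput_kb_s(iops, io_size_kb=8):
    """Effective disk throughput when the bottleneck is I/O operations
    per second rather than the disk's rated transfer speed."""
    return iops * io_size_kb

random_kb_s = effective_throughput_kb_s(75)       # -> 600 KB/s
sequential_kb_s = effective_throughput_kb_s(150)  # -> 1200 KB/s
```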
B+ Tree
If an insert lands in the last leaf node, it is fast, because it is sequential I/O.
B+ Tree
But when a write operation (update, insert, delete) needs to touch many leaf
nodes, many random I/Os occur; time is wasted on disk seeks and efficiency
is low.
How to avoid this problem?
1. Give up some read performance
a. COLA (Cache-Oblivious Lookahead Array): TokuDB
b. LSM tree (Log-Structured Merge tree): Vertica, Cassandra, HBase, BDB
Java Edition, LevelDB, etc.
2. Memory / SSD
The Design and Implementation of a Log-Structured File System, 1996
LSM tree
The Design and Implementation of a Log-Structured File System, 1996
LSM tree
The Design and Implementation of a Log-Structured File System, 1996
Vertica SQL process (Hybrid Storage Model)
WOS
INSERT / COPY / DELETE / UPDATE → WOS (memory), for small data.
Data in the WOS is solely in memory, where column- or row-oriented doesn't
matter. When using the WOS, data goes straight into memory; no sorting,
clustering, or compressing is needed.

In-memory fragments:
Cust | Price
Andrew | $100.00

Cust | Price
Andrew | $98.00

Cust | Price
Nga | $90.00
Chuck | $87.00

Merge:
ID | Cust
1,2 | Andrew
3 | Chuck
4 | Nga

ID | Price
1 | $98.00
2 | $100.00
3 | $87.00
4 | $90.00

Tuple Mover: Moveout → ROS
Cust | Price
Andrew | $98.00
… | …
Merges multiple small files into large ones, with sorting, clustering, and compressing.
ROS
INSERT / COPY / DELETE / UPDATE → ROS, for large data.
Slow: needs sorting, clustering, compressing, and so on…
In the real world, prefer the WOS; it is fast, and suits large data divided into
batches of small jobs. Alternatively, tune the MoveOutInterval and
MoveOutSizePct parameters so the WOS moves data out more quickly. But be
careful: Vertica can go down if your workload is heavy and the WOS runs out
of memory.
Data Modifications and Delete Vectors
Data in Vertica is never modified in place. When a tuple
is deleted or updated from either the WOS or ROS, Vertica
creates a delete vector. A delete vector is a list of positions of
rows that have been deleted. Delete vectors are stored in the
same format as user data: they are first written to a DVWOS
in memory, then moved to DVROS containers on disk by the
tuple mover and stored using efficient compression mechanisms. There may be
multiple delete vectors for the WOS and multiple delete vectors for any
particular ROS container. SQL UPDATE is supported by deleting the row being
updated and then inserting a row containing the updated column values
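The delete-vector mechanics can be sketched as a toy model (this is not Vertica's actual storage format; names are hypothetical):

```python
# An immutable container of column values; deletions only record positions.
column = ["Andrew", "Chuck", "Nga"]
delete_vector = set()  # positions of logically deleted rows

def delete_where(pred):
    """DELETE: mark matching positions instead of rewriting storage."""
    delete_vector.update(i for i, v in enumerate(column) if pred(v))

def update(pos, new_value):
    """UPDATE = delete the old row + insert a row with the new value."""
    delete_vector.add(pos)
    column.append(new_value)

def visible_rows():
    """What a query sees: stored values minus deleted positions."""
    return [v for i, v in enumerate(column) if i not in delete_vector]
```

The stored data is never modified in place; only the delete vector and the tail of the container grow, which keeps all writes sequential.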
Historical Queries – EPOCH
An epoch is associated with each COMMIT; the current_epoch at the time of the
COMMIT is the epoch for that load.
Vertica supports historical queries, though it's not a common use case for
most customers. You can only query epochs that are after the current
AHM, which is kept aggressively current by default. Deleted data prior to
the AHM (Ancient History Mark) is eligible for being purged when a
mergeout or explicit purge happens. After it's purged, delete vectors no
longer need to be maintained. The Last Good Epoch is the epoch at which
all data has been written from WOS to ROS. Any data after the LGE will be
lost if the cluster shuts down abnormally from something like a power
loss or a set of exceptions across multiple nodes. Refresh Epoch - don't
worry about it, it doesn't get referenced in practice.
Example
dbadmin=> select current_epoch from system;
 current_epoch
---------------
 44
(1 row)

dbadmin=> insert into A values(1); commit;
 OUTPUT
--------
 1
(1 row)
COMMIT

dbadmin=> select current_epoch from system;
 current_epoch
---------------
 45

dbadmin=> insert into A values(2); commit;
 OUTPUT
--------
 1
(1 row)

dbadmin=> at epoch 46 select * from A;
 i
---
 1
 2
(2 rows)

dbadmin=> at epoch 45 select * from A;
 i
---
 1
(1 row)
Example
dbadmin=> select make_ahm_now();
 make_ahm_now
-----------------------------
 AHM set (New AHM Epoch: 46)
(1 row)

dbadmin=> at epoch 45 select * from A;
ERROR 2318: Can't run historical queries at epochs prior to the Ancient History Mark
Flex Zone
Samchu Li / May 23rd, 2013
Flex Zone – New with 7.0!
Easily load, explore, analyze and monetize semi-structured data such as
text, videos, call records
More information in Loading Data module
Vertica Analytics
Flex Zone Tables
Store
and
Explore
Columnar Tables
Daily Analytics
Vertica
4C / Availability
Samchu Li / Jan 3rd, 2013
4C
K-Safety – Clustering/MPP
Your database must have a minimum number of nodes to be able to have a K-
safety level greater than zero.
Note: Vertica does not officially support values of K higher than 2.
K-level | Number of nodes required
0 | 1+
1 | 3+
2 | 5+
K | 2K+1
K-Safety
K=1
Projection
Segmentation & Partition
Segmentation
Vertica's segmentation corresponds to Neoview's partitioning: it distributes a projection's rows across the nodes of the cluster.
• Hash segmentation
• Range segmentation
Partition
Within a single node, you can further divide a segmented table's data into parts (by range, typically on a date column) to improve performance and simplify data management.
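For example (table and column names are hypothetical), segmentation and partitioning are declared separately:

```sql
-- Partitioning: splits each node's local data, typically by date range
CREATE TABLE sales (
    sale_id   INT NOT NULL,
    cust_id   INT,
    sale_date DATE NOT NULL,
    amount    NUMERIC(10,2)
)
PARTITION BY EXTRACT(year FROM sale_date);

-- Segmentation: distributes a projection's rows across nodes by hash
CREATE PROJECTION sales_p AS
    SELECT * FROM sales
    ORDER BY sale_date
    SEGMENTED BY HASH(sale_id) ALL NODES;
```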
Segmentation & Partition
(Diagram: segmentation across nodes and partitioning within nodes.)
Vertica
Query Execution
Workflow
Samchu Li / Jan 3rd, 2013
(Diagram: MPP query execution workflow.)
Vertica
UDx
Samchu Li / Jan 3rd, 2013
UDx
User Defined Extension (UDx) refers to all extensions to Vertica developed
using the APIs in the Vertica SDK.
Five types of user-defined functions (UDFs):
• User Defined Scalar Functions (UDSFs)
• User Defined Transform Functions (UDTFs)
• User Defined Aggregate Functions (UDAFs)
• User Defined Analytic Functions (UDAnFs)
• User Defined Load (UDL)
Fenced Mode
Fenced mode runs UDx code outside of the main Vertica process in a separate
zygote process. UDx code that crashes while running in fenced mode does not
impact the core Vertica process. There is a small performance impact when
running UDx code in fenced mode. On average, using fenced mode adds about
10% more time to execution compared to unfenced mode.
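Fencing is chosen when the function is registered; a sketch of the SQL (library path and factory name are hypothetical):

```sql
-- Load a compiled UDx library into Vertica
CREATE LIBRARY mylib AS '/home/dbadmin/AddUdx.so';

-- Register the function to run in fenced mode (the default);
-- NOT FENCED would run it inside the main Vertica process
CREATE FUNCTION add2ints AS LANGUAGE 'C++'
    NAME 'Add2IntsFactory' LIBRARY mylib FENCED;
```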
Zygote process
The Vertica zygote process starts when Vertica starts; each node has a single zygote process. Side processes are created on demand: the zygote listens for requests and, when a user calls a UDx, spawns a side session that runs the UDx in fenced mode.
Vertica R
User Defined Functions developed in R always run in Fenced Mode in a
process outside of the main Vertica process.
You can create Scalar Functions and Transform Functions using the R
language. Other UDx types are not supported with the R language.
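Registering R functions follows the same SQL path as other UDxs (file and factory names are hypothetical; the factory is defined in the R source file):

```sql
-- Load an R source file as a library
CREATE LIBRARY rlib AS '/home/dbadmin/mul.R' LANGUAGE 'R';

-- Register a scalar function from its R factory; R UDxs always run fenced
CREATE FUNCTION mul AS LANGUAGE 'R' NAME 'mulFactory' LIBRARY rlib;
```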
R Packages
The Vertica R Language Pack includes the following R packages in addition to
the default packages bundled with R:
• Rcpp
• RInside
• lpSolve
• lpSolveAPI
Vertica R
The R programming language is quickly gaining popularity among data
scientists for performing statistical analyses. It is extensible and has a large
community of users, many of whom contribute packages to extend its
capabilities. However, it is single-threaded and limited by the amount of
RAM on the machine it is running on, which makes it challenging to run R
programs on big data.
There are efforts under way to remedy this situation, which essentially fall
into one of the following two categories:
• Integrate R into a parallel database, or
• Parallelize R so it can process big data
Running multiple instances of the R algorithm in parallel (query partitioned data)
The first major performance benefit of the Vertica R implementation comes from running multiple instances of an R algorithm in parallel, using queries that chunk the data independently.
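The chunking is expressed with an OVER (PARTITION BY ...) clause on a transform function, so each partition can be processed by a separate R instance (function, table, and column names are hypothetical):

```sql
-- One R instance can run per partition, in parallel across the cluster
SELECT kmeans_r(x, y) OVER (PARTITION BY region)
FROM points;
```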
(Chart: multiple instances of the R algorithm running in parallel on partitioned data.)
Leveraging column-store technology for optimized data exchange (query non-partitioned data)
Even for non-data-parallel tasks (functions whose input is one big chunk of non-partitioned data), Vertica's implementation provides better performance: the computation runs on the server instead of the client, and the data flow between the database and R is optimized (no need to parse the data again).
(Chart: data-exchange performance between Vertica and R for non-partitioned data.)
Leveraging column-store technology for optimized data exchange (query non-partitioned data)
As the chart above indicates, performance improvements are also achieved by optimizing the data transfers between Vertica and R. Since Vertica is a column store and R is vector-based, it is very efficient to move data from a Vertica column to R vectors in very large blocks.
Vertica eco-system
(Diagram: Vertica eco-system — the built-in engine handles unstructured data via Flex Table support, with RDBMS store/index/table, ETL, a Hadoop/Pig/HDFS/HCatalog connector, and reporting.)
Thank you
More Related Content

What's hot

Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for HadoopLearning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Someshwar Kale
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
DataWorks Summit
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
DataWorks Summit
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
OReillyStrata
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
Julian Hyde
 
HPE Vertica_7.0.x Administrators Guide
HPE Vertica_7.0.x Administrators GuideHPE Vertica_7.0.x Administrators Guide
HPE Vertica_7.0.x Administrators Guide
Andrey Karpov
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
DataWorks Summit
 
Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?
Ed Kohlwey
 
PGConf.ASIA 2019 Bali - Partitioning in PostgreSQL - Amit Langote
PGConf.ASIA 2019 Bali -  Partitioning in PostgreSQL - Amit LangotePGConf.ASIA 2019 Bali -  Partitioning in PostgreSQL - Amit Langote
PGConf.ASIA 2019 Bali - Partitioning in PostgreSQL - Amit Langote
Equnix Business Solutions
 
Data organization: hive meetup
Data organization: hive meetupData organization: hive meetup
Data organization: hive meetup
t3rmin4t0r
 
Greenplum Database Open Source December 2015
Greenplum Database Open Source December 2015Greenplum Database Open Source December 2015
Greenplum Database Open Source December 2015
PivotalOpenSourceHub
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0
enissoz
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
DataWorks Summit/Hadoop Summit
 
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
PivotalOpenSourceHub
 
SparkR best practices for R data scientist
SparkR best practices for R data scientistSparkR best practices for R data scientist
SparkR best practices for R data scientist
DataWorks Summit
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handyPraveen Sripati
 
Performance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storagePerformance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storage
DataWorks Summit
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
t3rmin4t0r
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
 
SQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQSQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQ
pivotalny
 

What's hot (20)

Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for HadoopLearning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
HPE Vertica_7.0.x Administrators Guide
HPE Vertica_7.0.x Administrators GuideHPE Vertica_7.0.x Administrators Guide
HPE Vertica_7.0.x Administrators Guide
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?
 
PGConf.ASIA 2019 Bali - Partitioning in PostgreSQL - Amit Langote
PGConf.ASIA 2019 Bali -  Partitioning in PostgreSQL - Amit LangotePGConf.ASIA 2019 Bali -  Partitioning in PostgreSQL - Amit Langote
PGConf.ASIA 2019 Bali - Partitioning in PostgreSQL - Amit Langote
 
Data organization: hive meetup
Data organization: hive meetupData organization: hive meetup
Data organization: hive meetup
 
Greenplum Database Open Source December 2015
Greenplum Database Open Source December 2015Greenplum Database Open Source December 2015
Greenplum Database Open Source December 2015
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
 
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
 
SparkR best practices for R data scientist
SparkR best practices for R data scientistSparkR best practices for R data scientist
SparkR best practices for R data scientist
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
 
Performance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storagePerformance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storage
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
 
SQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQSQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQ
 

Similar to Vertica

Things learned from OpenWorld 2013
Things learned from OpenWorld 2013Things learned from OpenWorld 2013
Things learned from OpenWorld 2013
Connor McDonald
 
Cloudera's Original Pitch Deck from 2008
Cloudera's Original Pitch Deck from 2008Cloudera's Original Pitch Deck from 2008
Cloudera's Original Pitch Deck from 2008
Accel
 
Informix warehouse and accelerator overview
Informix warehouse and accelerator overviewInformix warehouse and accelerator overview
Informix warehouse and accelerator overview
Keshav Murthy
 
SQLDay2013_MarcinSzeliga_SQLServer2012FastTrackDWReferenceArchitectures
SQLDay2013_MarcinSzeliga_SQLServer2012FastTrackDWReferenceArchitecturesSQLDay2013_MarcinSzeliga_SQLServer2012FastTrackDWReferenceArchitectures
SQLDay2013_MarcinSzeliga_SQLServer2012FastTrackDWReferenceArchitecturesPolish SQL Server User Group
 
Tcod a framework for the total cost of big data - december 6 2013 - winte...
Tcod   a framework for the total cost of big data  - december 6 2013  - winte...Tcod   a framework for the total cost of big data  - december 6 2013  - winte...
Tcod a framework for the total cost of big data - december 6 2013 - winte...
Richard Winter
 
Run Your First Hadoop 2.x Program
Run Your First Hadoop 2.x ProgramRun Your First Hadoop 2.x Program
Run Your First Hadoop 2.x Program
Skillspeed
 
HP flash optimized storage - webcast
HP flash optimized storage - webcastHP flash optimized storage - webcast
HP flash optimized storage - webcast
Calvin Zito
 
The New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the CloudThe New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the Cloud
Inside Analysis
 
Oracle Database Migration to Oracle Cloud Infrastructure
Oracle Database Migration to Oracle Cloud InfrastructureOracle Database Migration to Oracle Cloud Infrastructure
Oracle Database Migration to Oracle Cloud Infrastructure
SinanPetrusToma
 
Oracle Database 12c para la comunidad GeneXus - Engineered for clouds
Oracle Database 12c para la comunidad GeneXus - Engineered for cloudsOracle Database 12c para la comunidad GeneXus - Engineered for clouds
Oracle Database 12c para la comunidad GeneXus - Engineered for clouds
GeneXus
 
Guob consolidation implementation11gr2
Guob consolidation implementation11gr2Guob consolidation implementation11gr2
Guob consolidation implementation11gr2Rodrigo Almeida
 
EMC config Hadoop
EMC config HadoopEMC config Hadoop
EMC config Hadoop
solarisyougood
 
Snowflake’s Cloud Data Platform and Modern Analytics
Snowflake’s Cloud Data Platform and Modern AnalyticsSnowflake’s Cloud Data Platform and Modern Analytics
Snowflake’s Cloud Data Platform and Modern Analytics
Senturus
 
Oracle NoSQL Database release 3.0 overview
Oracle NoSQL Database release 3.0 overviewOracle NoSQL Database release 3.0 overview
Oracle NoSQL Database release 3.0 overviewPaulo Fagundes
 
NoSQL and MySQL
NoSQL and MySQLNoSQL and MySQL
NoSQL and MySQL
Ted Wennmark
 
Slides pentaho-hadoop-weka
Slides pentaho-hadoop-wekaSlides pentaho-hadoop-weka
Slides pentaho-hadoop-wekalucboudreau
 
What_to_expect_from_oracle_database_12c
What_to_expect_from_oracle_database_12cWhat_to_expect_from_oracle_database_12c
What_to_expect_from_oracle_database_12c
Maria Colgan
 

Similar to Vertica (20)

Things learned from OpenWorld 2013
Things learned from OpenWorld 2013Things learned from OpenWorld 2013
Things learned from OpenWorld 2013
 
Cloudera's Original Pitch Deck from 2008
Cloudera's Original Pitch Deck from 2008Cloudera's Original Pitch Deck from 2008
Cloudera's Original Pitch Deck from 2008
 
Informix warehouse and accelerator overview
Informix warehouse and accelerator overviewInformix warehouse and accelerator overview
Informix warehouse and accelerator overview
 
SQLDay2013_MarcinSzeliga_SQLServer2012FastTrackDWReferenceArchitectures
SQLDay2013_MarcinSzeliga_SQLServer2012FastTrackDWReferenceArchitecturesSQLDay2013_MarcinSzeliga_SQLServer2012FastTrackDWReferenceArchitectures
SQLDay2013_MarcinSzeliga_SQLServer2012FastTrackDWReferenceArchitectures
 
Tcod a framework for the total cost of big data - december 6 2013 - winte...
Tcod   a framework for the total cost of big data  - december 6 2013  - winte...Tcod   a framework for the total cost of big data  - december 6 2013  - winte...
Tcod a framework for the total cost of big data - december 6 2013 - winte...
 
Delphix2
Delphix2Delphix2
Delphix2
 
Delphix2
Delphix2Delphix2
Delphix2
 
Run Your First Hadoop 2.x Program
Run Your First Hadoop 2.x ProgramRun Your First Hadoop 2.x Program
Run Your First Hadoop 2.x Program
 
HP flash optimized storage - webcast
HP flash optimized storage - webcastHP flash optimized storage - webcast
HP flash optimized storage - webcast
 
Delphix
DelphixDelphix
Delphix
 
The New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the CloudThe New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the Cloud
 
Oracle Database Migration to Oracle Cloud Infrastructure
Oracle Database Migration to Oracle Cloud InfrastructureOracle Database Migration to Oracle Cloud Infrastructure
Oracle Database Migration to Oracle Cloud Infrastructure
 
Oracle Database 12c para la comunidad GeneXus - Engineered for clouds
Oracle Database 12c para la comunidad GeneXus - Engineered for cloudsOracle Database 12c para la comunidad GeneXus - Engineered for clouds
Oracle Database 12c para la comunidad GeneXus - Engineered for clouds
 
Guob consolidation implementation11gr2
Guob consolidation implementation11gr2Guob consolidation implementation11gr2
Guob consolidation implementation11gr2
 
EMC config Hadoop
EMC config HadoopEMC config Hadoop
EMC config Hadoop
 
Snowflake’s Cloud Data Platform and Modern Analytics
Snowflake’s Cloud Data Platform and Modern AnalyticsSnowflake’s Cloud Data Platform and Modern Analytics
Snowflake’s Cloud Data Platform and Modern Analytics
 
Oracle NoSQL Database release 3.0 overview
Oracle NoSQL Database release 3.0 overviewOracle NoSQL Database release 3.0 overview
Oracle NoSQL Database release 3.0 overview
 
NoSQL and MySQL
NoSQL and MySQLNoSQL and MySQL
NoSQL and MySQL
 
Slides pentaho-hadoop-weka
Slides pentaho-hadoop-wekaSlides pentaho-hadoop-weka
Slides pentaho-hadoop-weka
 
What_to_expect_from_oracle_database_12c
What_to_expect_from_oracle_database_12cWhat_to_expect_from_oracle_database_12c
What_to_expect_from_oracle_database_12c
 

Vertica

  • 1. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Vertica In Depth Basic introduction Samchu Li/ Jan 3rd, 2013 Updated: Samchu Li/ May 23rd, 2014
  • 2. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Agend a• History • Storage Model – compare DSM with NSM, PAX • Column Store • Compression • Projection & record construction • Joins • Vertica SQL process (Hybrid Storage Model) • Flex Zone • 4C / Availability • Query Execution Workflow • Udx • Eco-system
  • 3. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Vertica History C-Store & Vertica Samchu Li / Jan 3rd, 2013
  • 4. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Vertica History 1. Miguel C. Ferreira <<Compression and Query Execution within Column Oruented Database>> for Master of Engineering in Computer Science and Electrical Engineering at MIT; June 2005. Where C-Store comes from. 2. MIT’s open source project C-Store. <<C-Store: A Column-oriented DBMS>>, VLDB 2005 3. Vertica was set up in 2005 based on C-Store, Billerica, Massachusetts , US. Co-founder is Michael Stonebraker. 4. March, 2011, HP acquired Vertica. <<The Vertica Analytic Database: C-Store 7 Years Later>>, VLDB, 2012
  • 5. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Michael Stonebraker,SQLServer/Sysbase奠基人。 著名的数据库科学家,他在1992年提出对象关系数据库模型在加州伯克利分 校计算机教授达25年。在此期间他创作了Ingres,Illustra, Cohera, StreamBase Systems和Vertica等系统。Stonebraker教授也曾担任过Informix的CEO,目 前他是MIT麻省理工学院客席教授。 Stonebraker教授领导了称为Postgres的后Ingres项目。这个项目的成果非常 巨大,在现代数据库的许多方面都做出的大量的贡献。Stonebraker教授还做 出了一件造福全人类的事情,那就是把Postgres放在了BSD版权的保护下。 如今Postgres名字已经变成了PostgreSQL,功能也是日渐强大。 87年左右,Sybase联合了微软,共同开发SQLServer。原始代码的来源与 Ingres有些渊源。后来1994年,两家公司合作终止。此时,两家公司都拥有一 套完全相同的SQLServer代码。可以认为,Stonebraker教授是目前主流数据 库的奠基人。 Ingres(Michael Stonebraker)  Informix (2000 年被 IBM 收购)  Sybase MS SQLServer (1992年将产品卖给微软)  NonStop SQL (Tandem 被 Compaq 并购并在 2000 年开始重写,HP2002年收购Compad)  Neoview  SeaQuest  Postgres Illustra (1997 年被 Informix 收购)  PostgreSQL C-Store  Vertica
  • 6. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.6 Vertica Market Share 2012
  • 7. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Vertica Storage Model NSM, DSM, PAX Samchu Li / Jan 3rd, 2013
  • 8. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.8 Column Storage
  • 9. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.9 Storage Model in DB NSM 70s ~ 1985 DSM 1985, <<A decomposition storage paper>>, Copeland and Khoshafian, SIGMOD PAX 2001, <<Weaving Relations for Cache Performance>>, Ailamaki, DeWitt, Hill, Skounakis, VLDB
  • 10. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.10 NSM
  • 11. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.11 DSM
  • 12. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.12 PAX (Partition Attributes Across) - MonetDB
  • 13. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.13 Why DSM/Columnar DB so quick? ID(PK, INT4) Name(Va rchar5) Ag e(I NT 2) 0962 Jane 3 0 7658 John 4 5 3859 Jim 2 0 5523 Susan 5 2 … … … SELECT NAME FROM TABLEName, Average length 4 BYTE 1.NSM Row length = 4+5+2 =11 BYTE 100 Million * 11 BYTE /1024=10742.1875KB 1 block = 32KB | 1 block contains =2978 complete records Block scan = 10742.1875KB/32KB = 336 2. DSM Length = 4 BYTE 100 Million * 4 BYTE /1024=3906.25KB 1 block = 32KB | 1 block contains = 8192 complete records Block scan = 123
  • 14. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.14 Why DSM/Columnar DB so quick? 0 50 100 150 200 250 300 350 400 Block Scans NSM DSM 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 records/block NSM DSM
  • 15. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.15 Weakness - Scan performance for example Read-Optimized Databases, In Depth; Allison L. Holloway and David J. DeWitt; 2008, VLDB
  • 16. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Vertica Compression Samchu Li / Jan 3rd, 2013
  • 17. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.17 Clustering & Compression
  • 18. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.18 Compression • Trades I/O for CPU • Increased column-store opportunities: • Higher data value locality in column stores • Techniques such as run length encoding far more useful • Can use extra space to store multiple copies of data in different sort orders • Operating Directly on Compressed Data • I/O - CPU tradeoff is no longer a tradeoff • Reduces memory–CPU bandwidth requirements • Opens up possibility of operating on multiple records at once
  • 19. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.19 Run-length Encoding (RLE) Integrating Compression and Execution in Column-Oriented Database Systems, SIGMOD, 2006
  • 20. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.20 Bit-vector Encoding Integrating Compression and Execution in Column-Oriented Database Systems, SIGMOD, 2006
  • 21. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.21 Dictionary Encoding Integrating Compression and Execution in Column-Oriented Database Systems, SIGMOD, 2006
  • 22. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.22 Frame Of Reference Encoding +/1, one bit; the max 111=7 Integrating Compression and Execution in Column-Oriented Database Systems, SIGMOD, 2006
  • 23. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.23 Differential Encoding 100=4 Integrating Compression and Execution in Column-Oriented Database Systems, SIGMOD, 2006
  • 24. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.24 What Compression Scheme To Use? Integrating Compression and Execution in Column-Oriented Database Systems, SIGMOD, 2006
  • 25. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.25 Group iteration Normally, The DB deal with the records with iteration method once a time before, But in row-oriented DB, for its storage model NSM, the cache efficient is low. But in Row-oriented DB, we could using this method to read more records one time, like RLE, (100,1,100), 100 hundred records once a time.
  • 26. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Vertica Projection & record construction Samchu Li / Jan 3rd, 2013
  • 27. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.27 Logic Schema & Physical Schema Logic Schema In traditional database architectures, data is primarily stored in tables. Additionally, secondary tuning structures such as index and materialized view structures are created for improved query performance. Physical Schema But in contrast, tables do not occupy any physical storage at all in Vertica. Instead, physical storage consists of collections of table columns called projection.
  • 28. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.28 Projection
  • 29. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.29 Projection ID(PK, INT4) Name(Va rchar5) Ag e(I NT 2) 0962 Jane 3 0 7658 John 4 5 3859 Jim 2 0 ID(PK, INT4) Name(Va rchar5) Ag e(I NT 2) 3859 Jim 2 0 5523 Susan 5 2 7658 John 4 5 0962 Jane 3 0 … … … Name(V archar5 ) ID(PK, INT4) Ag e(I NT 2) Jane 0962 3 0 Jim 3859 2 0 John 7658 4 5 Age(IN T2) ID(PK, INT4) 20 3859 30 0962 45 7658 52 5523 … … Super Projection Projection1 Projection2
  • 30. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.30 Projection & Index Vertica is designed for Data warehouse/Big Data, no specific design for single data query. No Index In a highly simplified view, you can think of a Vertica projection as a single level, densely packed, clustered index which stores the actual data values, is never updated in place, and has no logging. Any “maintenance” such as merging sorted chunks or purging deleted records is done as automatic and background activity, not in the path of real-time loads. So yes, projections are a type of native index if you will, but they are very different from traditional indexes like Bitmap and Btrees.
  • 31. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.31 Query Benefits of Storing Sorted Data How does Vertica query huge volumes without indexes? It’s easy… the data is sorted by column value, something we can do because we wrote both our storage engine and execution engine from scratch. We don’t store the data by insert order, nor do we limit sorting to within a set of disk blocks. Instead, we have put significant engineering effort into keeping the data totally sorted during its entire lifetime in Vertica. It should be clear how sorted data increases compression ratios (by putting similar values next to each other in the data stream), but it might be less obvious at first how we use sorted data to increase query speed as well.
  • 32. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.32 An example SELECT stock, price FROM ticks ORDER BY stock, price; SELECT stock, price FROM ticks WHERE stock=’IBM’ ORDER BY price;
  • 33. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.33 An example One pass aggregation SELECT stock, AVG(price) FROM ticks ORDER BY stock, price;
  • 34. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.34 Projection
  • 35. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.35 How tuples be constructed Two ways: 1. EM (Early Materialization) Like the row-oriented databases(where projections are almost always performed as soon as an attribute is no longer needed) suggest a natural tuple construction policy: at each point at which a column is accessed, add the column to an intermediate tuple representation if that column is needed by some later operator or is included in the set of output columns. 1. Perform an inner join to constructed the record which the operator needed 2. Then send to the operator to operate on the real record Materialization Strategies in a Column-Oriented DBMS, IEEE, 2007
  • 36. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.36 How tuples be constructed 2. LM (Late Materialization) a. First, scan the column blocks and output its positions (ordinal offsets of values within the column) b. Repeat with other columns to output its positions which satisfy the operations like WHERE… (these position can take the form of ranges, lists, or a bitmap) c. Use position-wise AND operations to intersect the position lists. d. Finally, re-access these columns and extract the values of records that satisfy all predicates and stich these values together into output tuples Materialization Strategies in a Column-Oriented DBMS, IEEE, 2007
  • 37. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.37 When should tuples be constructed? - EM Early Materialized – No join
  • 38. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.38 When should tuples be constructed? -LM Late Materialized – with Joins
  • 39. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.39 When should tuples be constructed? -LM Late Materialized – with Joins
  • 40. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.40 When should tuples be constructed? -LM Late Materialized – with Joins
  • 41. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.41 When should tuples be constructed? -LM Late Materialized – with Joins
  • 42. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.42 EM with Joins
  • 43. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.43 EM with Joins
  • 44. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.44 LM with Joins
  • 45. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.45 EM vs LM A naïve LM join is about 2X slower than an EM join on typical queries (due to random I/O). This number is very dependent on: Amount of memory available Number of projected attributes Join cardinality But there are newer join algorithms that let LM do better: Invisible Join Jive/Flash Join Radix cluster/decluster join
  • 46. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.46 Pre-join projections Vertica supports prejoin projections which permit joining the projection’s anchor table with any number of dimension tables via N:1 joins. This permits a normalized logical schema, while allowing the physical storage to be denormalized. The cost of storing physically denormalized data is much less than in traditional systems because of the available encoding and compression. Prejoin projections are not used as often in practice as we expected. This is because Vertica’s execution engine handles joins with small dimension tables very well (using highly optimized hash and merge join algorithms), so the benefits of a prejoin for query execution are not as significant as we initially predicted. <<The Vertica Analytic Database: C-Store 7 Years Later>>, VLDB, 2012
  • 47. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.47 Pre-join projections In the case of joins involving a fact and a large dimension table or two large fact tables where the join cost is high, most customers are unwilling to slow down bulk loads to optimize such joins. In addition, joins during load offer fewer optimization opportunities than joins during query because the database knows nothing a priori about the data in the load stream. Pre-join projections can have only inner joins between tables on their primary and foreign key columns. Outer joins are not allowed. <<The Vertica Analytic Database: C-Store 7 Years Later>>, VLDB, 2012
  • 48. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Vertica Joins Samchu Li / Jan 3rd, 2013
  • 49. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.49 Invisible Join Designed for typical joins when data is modeled using a star schema One ("Fact") table is joined with multiple dimension tables select c_nation, s_nation, d_year, sum(lo_revenue) as revenue from customer, lineorder, supplier, date where lo_custkey = c_custkey and lo_suppkey = s_suppkey and lo_orderdate = d_datekey and c_region = 'ASIA' and s_region = 'ASIA' and d_year >= 1992 and d_year <= 1997 group by c_nation, s_nation, d_year order by d_year asc, revenue desc;
  • 50. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.50 Invisible Join
  • 51. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.51 Invisible Join
  • 52. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.52 Invisible Join
  • 53. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.53 Invisible Join Bottom Line Many data warehouses model data using star/snowflake schemas Joining one (fact) table with many dimension tables is common The invisible join takes advantage of this by making sure that the table accessed in position order is the fact table for each join Position lists from the fact table are then intersected (in position order) This reduces the amount of data that must be accessed out of order from the dimension tables The "between-predicate rewriting" trick is not relevant for this discussion
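A toy sketch of the invisible join idea in Python. It assumes the per-dimension predicates (e.g. c_region = 'ASIA') have already been evaluated into sets of qualifying dimension keys; the real technique uses hash structures or bitmaps, and the column names and data below are illustrative, not from a real schema:

```python
def invisible_join(fact, dims):
    """fact: dict of fact-table columns (lists), including FK columns.
    dims: {fk_column: set_of_dimension_keys_that_passed_the_predicate}.
    Each FK column of the fact table is scanned in position order,
    yielding one position set per join; the sets are then intersected,
    so only the dimension lookups happen out of order."""
    n = len(next(iter(fact.values())))
    surviving = set(range(n))
    for fk_col, good_keys in dims.items():
        surviving &= {i for i in range(n) if fact[fk_col][i] in good_keys}
    return sorted(surviving)

fact = {
    "lo_custkey": [1, 2, 3, 1],
    "lo_suppkey": [10, 10, 20, 30],
    "lo_revenue": [500, 700, 800, 900],
}
# Hypothetical predicate results: customers {1, 3} and suppliers {10, 30} qualify.
rows = invisible_join(fact, {"lo_custkey": {1, 3}, "lo_suppkey": {10, 30}})
print([fact["lo_revenue"][i] for i in rows])  # [500, 900]
```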
  • 54. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.54 Invisible Join
  • 55. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.55 Jive/Flash Join
  • 56. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.56 Jive/Flash Join Bottom Line Instead of probing projected columns from the inner table out of order: • Sort the join index • Probe the projected columns in order • Sort the result using an added sequence column LM vs EM tradeoffs: LM has the extra sorts (EM accesses all columns in order) LM only has to fit the join columns into memory (EM needs the join columns and all projected columns), which results in big memory and CPU savings (see part 3 for why there are CPU savings) LM only has to materialize relevant columns In many cases LM advantages outweigh the disadvantages LM would be a clear winner if not for those pesky sorts …
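The three steps (sort the join index, probe in order, re-sort by an added sequence column) can be sketched on a single inner column; names and data here are illustrative:

```python
def jive_join(inner_column, probe_positions):
    """Sketch of the Jive join's two-sort trick.

    probe_positions arrive in outer-table order, i.e. random with respect
    to the inner column. Instead of probing out of order, tag each
    position with its output slot, sort by position, probe the inner
    column sequentially, then sort the results back into output order.
    """
    tagged = sorted(enumerate(probe_positions), key=lambda t: t[1])  # sort the join index
    probed = [(slot, inner_column[pos]) for slot, pos in tagged]     # in-order probe
    probed.sort(key=lambda t: t[0])                                  # restore output order
    return [v for _, v in probed]

inner = ["a", "b", "c", "d", "e"]
print(jive_join(inner, [4, 0, 3, 1]))  # ['e', 'a', 'd', 'b']
```

The sequential probe is what turns random I/O on the inner projection into a single ordered scan; the price is the two sorts that the radix cluster/decluster variant then tries to cheapen.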
  • 57. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.57 Radix Cluster/Decluster The full sort from the Jive join is actually overkill We just want to access the storage blocks in order (we don't mind random access within a block) So do a radix sort and stop early By stopping early, data within each block is accessed out of order, but in the order specified in the original join index • Use this pseudo-order to accelerate the post-probe sort as well Radix Sort: pad all the values to be compared (positive integers) to the same number of digits, zero-filling shorter values at the front; then perform one sorting pass per digit, starting from the lowest digit. Once the passes from the lowest digit through the highest digit are complete, the sequence is fully ordered.
  • 58. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.58 Radix Sort Example - LSD LSD (Least Significant Digit); positive integers, processed from the right side. 73, 22, 93, 43, 55, 14, 28, 65, 390, 81 First pass, bucket by units digit: 0→390; 1→081; 2→022; 3→073, 093, 043; 4→014; 5→055, 065; 8→028. Result: 390, 081, 022, 073, 093, 043, 014, 055, 065, 028
  • 59. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.59 Radix Sort Example - LSD Second pass, bucket by tens digit: 390, 081, 022, 073, 093, 043, 014, 055, 065, 028 → 1→014; 2→022, 028; 4→043; 5→055; 6→065; 7→073; 8→081; 9→390, 093. Result: 014, 022, 028, 043, 055, 065, 073, 081, 390, 093
  • 60. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.60 Radix Sort Example - LSD Last pass, bucket by hundreds digit: 014, 022, 028, 043, 055, 065, 073, 081, 390, 093 → 0→014, 022, 028, 043, 055, 065, 073, 081, 093; 3→390. Result: 014, 022, 028, 043, 055, 065, 073, 081, 093, 390
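The three passes above are exactly LSD radix sort; a compact Python version that reproduces the slides' example:

```python
def radix_sort_lsd(nums, base=10):
    """LSD radix sort for non-negative integers, as on the slides:
    bucket by units digit, then tens, then hundreds, and so on.
    Each pass is stable, which is what makes the whole sort correct."""
    if not nums:
        return nums
    digit = 1
    while max(nums) // digit > 0:
        buckets = [[] for _ in range(base)]
        for n in nums:
            buckets[(n // digit) % base].append(n)  # stable within each bucket
        nums = [n for b in buckets for n in b]      # concatenate buckets in order
        digit *= base
    return nums

print(radix_sort_lsd([73, 22, 93, 43, 55, 14, 28, 65, 390, 81]))
# [14, 22, 28, 43, 55, 65, 73, 81, 93, 390]
```

Stopping this loop after fewer passes than the key width gives exactly the "radix cluster" pseudo-order the previous slide describes: values land in the right coarse bucket (block) but stay unsorted within it.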
  • 61. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.61 Radix Sort Example - MSD MSD (Most Significant Digit); positive integers, processed from the left side; suitable for keys with many digits. The process is otherwise the same.
  • 62. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.62 Radix Cluster/Decluster Bottom line Both sorts from the Jive join can be significantly reduced in overhead Only tested when there is sufficient memory for the entire join index to be stored three times The technique is likely applicable to larger join indexes, but its utility will go down a little Only works if random access within a storage block is acceptable Don't use radix cluster/decluster if you have variable-width column values or compression schemes that can only be decompressed starting from the beginning of the block
  • 63. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.63 Tuple Construction Heuristics For queries with selective predicates, aggregations, or compressed data, use late materialization For joins: Research papers: always use late materialization Commercial systems: the inner table of a join is often materialized before the join (reduces system complexity); some systems will use LM only if the columns from the inner table can fit entirely in memory
  • 64. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.64 Query Optimization Almost all query optimization is automatic (or done through the Vertica Database Designer); there is little for the user to tune by hand. Three generations: 1. StarOpt Only optimizes data-warehouse-style queries (star and snowflake schemas). 2. StarifiedOpt Adds the ability to optimize non-star/snowflake queries. 3. V2Opt Reworked the optimizer as a set of extensible modules, with new algorithms for using statistics …
  • 65. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Vertica SQL process (Hybrid Storage Model) Samchu Li / Jan 3rd, 2013
  • 66. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.66 Vertica SQL process (Hybrid Storage Model)
  • 67. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.67 Disk Physical Structure http://blog.163.com/sonyericssonss/blog/static/109683969200911233723670/ (an illustrated, easy-to-follow explanation of hard disk structure)
  • 68. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.68 I/O – Sequential I/O & Random I/O Sequential versus random refers to whether the starting sector address of this I/O is adjacent (or nearly adjacent) to the ending sector address of the previous I/O. If it is, this I/O counts as sequential; if the gap is large, it counts as random. For sequential I/O, since this I/O's starting sector is close to the previous I/O's ending sector, the head barely needs to seek, or the seek time is very short; if the gap is large, the head needs a long seek. Many random I/Os force the head to seek constantly, greatly reducing efficiency. One of the most important aspects of database performance tuning is I/O. Roughly speaking, a 15,000 RPM server disk can deliver about 75 non-sequential (random) I/O operations and about 150 sequential I/O operations per second. Such a disk's nominal transfer rate is around 100 MB/s, but what actually limits a database server's transfer rate is those 75/150 I/Os per second. Assuming each I/O operates on an 8 KB block: 75 random I/Os per second * 8 KB = 600 KB/s; 150 sequential I/Os per second * 8 KB = 1200 KB/s; a huge gap from the nominal 100 MB/s. In practice it is not quite this bad: each I/O moves more data than this, and read-ahead, disk caches, and similar mechanisms help.
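For reference, the slide's arithmetic as a snippet; the IOPS figures are the slide's rough estimates for a 15,000 RPM disk, not measurements:

```python
# Back-of-envelope throughput implied by per-second I/O counts, 8 KB per I/O.
block_kb = 8
random_iops, seq_iops = 75, 150

random_kb_s = random_iops * block_kb   # KB/s achievable with purely random I/O
seq_kb_s = seq_iops * block_kb         # KB/s achievable with purely sequential I/O
print(random_kb_s, seq_kb_s)           # 600 1200 -- far below the ~100 MB/s nominal rate
```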
  • 69. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.69 B+ Tree If an insert lands in the last leaf node, it is fast, because it is sequential I/O.
  • 70. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.70 B+ Tree But if a write operation (update, insert, delete) needs to read many leaf nodes, many random I/Os occur, wasting time on disk seeks; efficiency is low.
  • 71. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.71
  • 72. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.72
  • 73. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.73 How to avoid this problem? 1. Give up some read performance a. COLA (Cache-Oblivious Lookahead Array): TokuDB b. LSM Tree (Log-Structured Merge Tree): Vertica, Cassandra, HBase, BDB Java Edition, LevelDB etc. 2. Memory / SSD The Design and Implementation of a Log-Structured File System, 1996
  • 74. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.74 LSM tree The Design and Implementation of a Log-Structured File System, 1996
  • 75. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.75 LSM tree The Design and Implementation of a Log-Structured File System, 1996
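A minimal LSM-style store, sketched in Python to show the write path (buffer in a memtable, flush as a sorted immutable run) and the read path (newest structure first). This illustrates the general technique the slides describe, not Vertica's actual WOS/ROS code; the class and its parameters are invented for the sketch:

```python
import bisect

class TinyLSM:
    """Writes go to an in-memory memtable; when it fills, it is flushed
    as one sorted, immutable run (a single sequential write). Reads check
    the memtable first, then each run from newest to oldest."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []            # newest first; each run is a sorted list of (key, value)
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.runs.insert(0, sorted(self.memtable.items()))  # flush ("moveout")
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:                       # newest run wins
            i = bisect.bisect_left(run, (key,))     # binary search within the run
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM()
for i in range(10):
    db.put(i, i * i)
print(db.get(3), db.get(9), len(db.runs))  # 9 81 2
```

A real system adds what the editor's notes describe: Bloom filters to skip runs that cannot contain the key, and background compaction that merges small runs into larger ones.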
  • 76. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.76 Vertica SQL process (Hybrid Storage Model)
  • 77. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.77 WOS INSERT COPY DELETE UPDATE WOS (Memory) Small Data Data in the WOS is solely in memory, where column- or row-oriented doesn't matter. Cust Price Andrew $100.00 Cust Price Andrew $98.00 Cust Price Nga $90.00 Chuck $87.00 Merge When using the WOS, data goes directly into the WOS (memory); no sorting, clustering, or compressing is needed ID Cust 1,2 Andrew 3 Chuck 4 Nga ID Price 1 $98.00 2 $100.00 3 $87.00 4 $90.00 Tuple Mover: Moveout ROS Cust Price Andrew $98.00 … … Merge multiple small files into large ones with sorting, clustering, compressing
  • 78. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.78 ROS Slow: needs sorting, clustering, compressing and so on … In the real world, prefer the WOS; it is fast and well suited when a large load can be divided into many small batch jobs, or tune the MoveOutInterval and MoveOutSizePct parameters so the WOS moves data out quickly. But be careful: this can bring Vertica down if your workload is heavy and the WOS runs out of memory. INSERT COPY DELETE UPDATE Large Data ROS
  • 79. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.79 Data Modifications and Delete Vectors Data in Vertica is never modified in place. When a tuple is deleted or updated from either the WOS or ROS, Vertica creates a delete vector. A delete vector is a list of positions of rows that have been deleted. Delete vectors are stored in the same format as user data: they are first written to a DVWOS in memory, then moved to DVROS containers on disk by the tuple mover and stored using efficient compression mechanisms. There may be multiple delete vectors for the WOS and multiple delete vectors for any particular ROS container. SQL UPDATE is supported by deleting the row being updated and then inserting a row containing the updated column values
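A sketch of what a delete vector means at scan time; the function name and data are illustrative, and real delete vectors are compressed position lists per container rather than Python lists:

```python
def scan_with_delete_vectors(column, delete_vectors):
    """Data is never modified in place: a scan simply skips positions
    that appear in any delete vector for the container."""
    deleted = set().union(*delete_vectors) if delete_vectors else set()
    return [v for i, v in enumerate(column) if i not in deleted]

prices = [100, 98, 87, 90]
# An UPDATE of row 1 becomes a delete (vector [1]) plus an insert of the new value.
print(scan_with_delete_vectors(prices + [99], [[1]]))  # [100, 87, 90, 99]
```

Keeping deletes as side structures preserves the sorted, compressed ROS containers untouched; the cost is the extra filter on every scan until a purge removes the deleted rows for good.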
  • 80. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.80 Histories Query – EPOCH An epoch is associated with each COMMIT the current_epoch at the time of the COMMIT is the epoch for that load. Vertica supports historical queries, though it's not a common use case for most customers. You can only query epochs that are after the current AHM, which is kept aggressively current by default. Deleted data prior to the AHM (Ancient History Mark) is eligible for being purged when a mergeout or explicit purge happens. After it's purged, delete vectors no longer need to be maintained. The Last Good Epoch is the epoch at which all data has been written from WOS to ROS. Any data after the LGE will be lost if the cluster shuts down abnormally from something like a power loss or a set of exceptions across multiple nodes. Refresh Epoch - don't worry about it, it doesn't get referenced in practice.
  • 81. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.81 Example dbadmin=> select current_epoch from system; current_epoch --------------- 44 (1 row) dbadmin=> insert into A values(1); commit; OUTPUT -------- 1 (1 row) COMMIT dbadmin=> select current_epoch from system; current_epoch --------------- 45 dbadmin=> insert into A values(2); commit; OUTPUT -------- 1 (1 row) dbadmin=> at epoch 46 select * from A; i --- 1 2 (2 rows) dbadmin=> at epoch 45 select * from A; i --- 1 (1 row)
  • 82. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.82 Example dbadmin=> select make_ahm_now(); make_ahm_now ----------------------------- AHM set (New AHM Epoch: 46) (1 row) dbadmin=> at epoch 45 select * from A; ERROR 2318: Can't run historical queries at epochs prior to the Ancient History Mark
  • 83. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Flex Zone Samchu Li / May 23rd, 2013
  • 84. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.84 Flex Zone – New with 7.0! Easily load, explore, analyze and monetize semi-structured data such as text, videos, call records More information in the Loading Data module (Diagram: Vertica Analytics; Flex Zone tables for store and explore; columnar tables for daily analytics)
  • 85. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Vertica 4C / Availability Samchu Li / Jan 3rd, 2013
  • 86. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.86 4C
  • 87. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.87 K-Safety – Clustering/MPP Your database must have a minimum number of nodes to be able to have a K-safety level greater than zero. Note: Vertica does not officially support values of K higher than 2. K-level → Number of nodes required: 0 → 1+; 1 → 3+; 2 → 5+; K → 2K+1
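The node counts in the table follow a 2K+1 pattern (the cluster must stay quorate and keep a full copy of the data while K nodes are down); a one-liner makes the pattern explicit:

```python
def min_nodes(k: int) -> int:
    """Minimum node count for K-safety level k, following the table's pattern (2K+1)."""
    return 2 * k + 1

print([min_nodes(k) for k in (0, 1, 2)])  # [1, 3, 5]
```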
  • 88. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.88 K-Safety K=1
  • 89. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.89 Projection
  • 90. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.90 Segmentation & Partition Segmentation Here, Segmentation = Neoview's Partition: Hash Segmentation Range Segmentation Partition means that on a single node you can still divide a segmented table into different parts (by range) to improve performance.
  • 91. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.91 Segmentation & Partition
  • 92. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Vertica Query Execution Workflow Samchu Li / Jan 3rd, 2013
  • 93. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.93 - MPP
  • 94. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.94 - MPP
  • 95. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Vertica UDx Samchu Li / Jan 3rd, 2013
  • 96. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.96 UDx User Defined Extension (UDx) refers to all extensions to Vertica developed using the APIs in the Vertica SDK. Five types: • User Defined Scalar Functions (UDSFs) • User Defined Transform Functions (UDTFs) • User Defined Aggregate Functions (UDAFs) • User Defined Analytic Functions (UDAnFs) • User Defined Load (UDL)
  • 97. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.97 Fenced Mode Fenced mode runs UDx code outside of the main Vertica process in a separate zygote process. UDx code that crashes while running in fenced mode does not impact the core Vertica process. There is a small performance impact when running UDx code in fenced mode. On average, using fenced mode adds about 10% more time to execution compared to unfenced mode. Zygote process The Vertica zygote process starts when Vertica starts. Each node has a single zygote process. Side processes are created "on demand". The zygote listens for requests and spawns a UDx side session that runs the UDx in fenced mode when a UDx is called by the user.
  • 98. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.98 Vertica R User Defined Functions developed in R always run in fenced mode, in a process outside of the main Vertica process. You can create Scalar Functions and Transform Functions using the R language. Other UDx types are not supported with the R language. R Packages The Vertica R Language Pack includes the following R packages in addition to the default packages bundled with R: • Rcpp • RInside • lpSolve • lpSolveAPI
  • 99. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.99 Vertica R The R programming language is fast gaining popularity among data scientists to perform statistical analyses. It is extensible and has a large community of users, many of whom contribute packages to extend its capabilities. However, it is single-threaded and limited by the amount of RAM on the machine it is running on, which makes it challenging to run R programs on big data. There are efforts under way to remedy this situation, which essentially fall into one of the following two categories: • Integrate R into a parallel database, or • Parallelize R so it can process big data
  • 100. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.100 Running multiple instances of the R algorithm in parallel (query partitioned data) The first major performance benefit of Vertica's R implementation comes from running multiple instances of the R algorithm in parallel, with queries that chunk the data independently.
  • 101. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.101 Running multiple instances of the R algorithm in parallel (query partitioned data)
  • 102. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.102 Leveraging column-store technology for optimized data exchange (query non- partitioned data) It is important to note that even for non-data-parallel tasks (functions that operate on input that is basically one big chunk of non-partitioned data), Vertica's implementation provides better performance, since computation runs on the server instead of the client, and the data flow between the DB and R is optimized (no need to parse the data again).
  • 103. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.103 Leveraging column-store technology for optimized data exchange (query non- partitioned data)
  • 104. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.104 Leveraging column-store technology for optimized data exchange (query non- partitioned data) As the chart above indicates, performance improvements are also achieved by optimizing the data transfers between Vertica and R. Since Vertica is a column store and R is vector-based, it is very efficient to move data from a Vertica column to R vectors in very large blocks.
  • 105. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.105 Vertica eco-system Unstructured data Vertica built-in engine RDBMS Index Store/idx/table ETL Hadoop/Pig/HDFS/HCatalog connector Flex Table Support Reporting
  • 106. © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Thank you

Editor's Notes

  1. When C-Store first appeared, it was not very usable: the only join algorithm implemented was nested-loop join, record construction did an inner join over the storage model (too expensive), and it had B-tree storage management taken from Berkeley DB. The SQL compiler came from PostgreSQL and is still used in Vertica. An interesting detail: one of Ferreira's supervisors for this thesis was Michael Stonebraker. MIT then open-sourced the project and published a paper at VLDB in 2005.
  2. Ingres (Michael Stonebraker) → Informix (acquired by IBM in 2000); Sybase → MS SQL Server (Sybase sold the product to Microsoft in 1992); NonStop SQL (Tandem, later acquired by Compaq; a rewrite started in 2000, and the product is now HP's) → Neoview → SeaQuest; Postgres → Illustra (acquired by Informix in 1997) → PostgreSQL; C-Store → Vertica
  3. Projections store data in formats that optimize query execution. They share one similarity to materialized views in that they store data sets on disk rather than compute them each time they are used in a query (e.g. physical storage).  However, projections aren’t aggregated but rather store every row in a table, e.g. the full atomic detail. The data sets are automatically refreshed whenever data values are inserted, appended, or changed – again, all of this happens beneath the covers without user intervention – unlike materialized views. Projections provide the following benefits: • Projections are transparent to end-users and SQL. The Vertica query optimizer automatically picks the best projections to use for any query. • Projections allow for the sorting of data in any order (even if different from the source tables). This enhances query performance and compression. • Projections deliver high availability optimized for performance, since the redundant copies of data are always actively used in analytics.  We have the ability to automatically store the redundant copy using a different sort order.  This provides the same benefits as a secondary index in a more efficient manner. • Projections do not require a batch update window.  Data is automatically available upon loads. • Projections are dynamic and can be added/changed on the fly without stopping the database. For each table in the database, Vertica requires a minimum of one projection, called a “superprojection”. A superprojection is a projection for a single table that contains all the columns and rows in the table.  use Vertica’s nifty Database Designer™  to optimize your database.  Database Designer creates new projections that optimize your database based on its data statistics and the queries you use. Database Designer: 1. Analyzes your logical schema, sample data, and sample queries (optional). 2. 
Creates a physical schema design (projections) in the form of a SQL script that can be deployed automatically or manually. 3. Can be used by anyone without specialized database knowledge (even business users can run Database Designer). 4. Can be run and re-run anytime for additional optimization without stopping the database. ad-hoc query performance
  4. Clearly Vertica is off the hook to do any sort at runtime: data is just read off disk (with perhaps some merging) and we are done. Finding rows in storage (disk or memory) that match stock=’IBM’ is quite easy when the data is sorted, simply by applying your favorite search algorithm (no indexes are required!). Furthermore, it isn’t even necessary to sort the stock=’IBM’ rows because the predicate ensures the secondary sort becomes primary within the rows that match as illustrated
  5. In general, the aggregator operator does not know a priori how many distinct stocks there are nor in what order that they will be encountered. One common approach to computing the aggregation is to keep some sort of lookup table in memory with the partial aggregates for each distinct stock. When a new tuple is read by the aggregator, its corresponding row in the table is found (or a new one is made) and the aggregate is updated as shown below: Illustration of aggregation when data is not sorted on stock. The aggregator has processed the first 4 rows: It has updated HPQ three times with 100, 102 and 103 for an average of 101.66, and it has updated IBM once for an average of 100. Now it encounters ORCL and needs to make a new entry in the table. With Vertica, a second type of aggregation algorithm is possible because the data is already sorted, so every distinct stock symbol appears together in the input stream. In this case, the aggregator can easily find the average stock price for each symbol while keeping only one intermediate average at any point in time. Once it sees a new symbol, the same symbol will never be seen again and the current average may be generated. This is illustrated below: Illustration of aggregation when data is sorted on stock. The aggregator has processed the first 7 rows. It has already computed the final averages of stock A and of stock HPQ and has seen the first value of stock IBM resulting in the current average of 100. When the aggregator encounters the next IBM row with price 103 it will update the average to 101.5. When the ORCL row is encountered the output row IBM,101.5 is produced. Of course, one pass aggregation is used in other systems (often called SORT GROUP BY), but they require a sort at runtime to sort the data by stock. Forcing a sort before the aggregation costs execution time and it prevents pipelined parallelism because all the tuples must be seen by the sort before any can be sent on. 
Using an index is also a possibility, but that requires more I/O, both to get the index and then to get the actual values. This is a reasonable approach for systems that aren’t designed for reporting, such as those that are designed for OLTP, but for analytic systems that often handle queries that contain large numbers of groups it is a killer. Other: Another area where having pre-sorted data helps is the computation of SQL-99 analytics. We can optimize the PARTITON BY clause in a manner very similar to GROUP BY when the partition keys are sequential in the data stream. We can also optimize the analytic ORDER BY clause similarly to the normal SQL ORDER BY clause. The final area to consider is Merge-Join. Of course this is not a new idea, but other database systems typically have Sort-Merge-Join, whereby a large join can be performed by pre-sorting the data from both input relations according to the join keys. Since Vertica already has the data sorted, it is often possible to skip the costly sort and begin the join right away.
  6. This late materialization approach can potentially be more CPU-efficient because it requires fewer intermediate tuples to be stitched together (a relatively expensive operation, since it can be thought of as a join on position), and position lists are small, highly compressible data structures that can be operated on directly with very little overhead. Note, however, that one problem with late materialization is that it requires re-scanning the base columns to form tuples, which can be slow (though the columns are likely still in memory upon re-access if the query is properly pipelined). The goal is not to advocate one approach; rather, it is to systematically explore the trade-offs between different strategies and provide a foundation for choosing a strategy for a particular query. The focus is on standard warehouse-style queries: read-only workloads with selections, aggregations, and joins.
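A tiny sketch of the position-list idea (the column contents are invented): a predicate is evaluated against one column alone, producing a list of qualifying row positions, and the other columns are touched only for those positions:

```python
# Columns stored separately, as in a column store.
symbol = ["A", "HPQ", "HPQ", "IBM", "ORCL"]
price = [50, 100, 103, 101, 30]
volume = [10, 20, 30, 40, 50]

# Step 1: evaluate the predicate on the price column only.
# The result is a position list -- small and highly compressible.
positions = [i for i, p in enumerate(price) if p > 100]

# Step 2 (late materialization): stitch full tuples together only
# for the surviving positions, re-reading the base columns.
result = [(symbol[i], price[i], volume[i]) for i in positions]
print(result)  # [('HPQ', 103, 30), ('IBM', 101, 40)]
```

Early materialization would instead have built all five full tuples up front and then filtered them, paying the stitching cost for rows that are about to be discarded.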
  7. Front-side bus: 1600 MHz × 64 bit = 102,400 Mbit/s = 12,800 MB/s. SAS disk at 15,000 RPM ≈ 300 MB/s. Fibre Channel: 1 Gbit/s.
  8. In principle, every lookup in a B+ tree traverses the same path length (log_m(N/2)). In practice, the fan-out m of each index node is fairly large (tens to hundreds), which keeps the tree shallow: for a 70%-full B+ tree holding on the order of a billion records, the depth stays under 10, so locating a record theoretically requires only a handful of disk I/Os. For B+ trees used to organize file systems in particular, once a record in some node has been accessed, subsequent accesses are likely to fall in the same region; the buffer cache will already have placed the node in memory according to its replacement policy (LRU, etc.), effectively reducing the I/O count further.
Why, then, can a B+ tree be slow? As noted earlier, it depends on the workload. When inserts are random (unordered, scattered), accesses to any given leaf are temporally sparse: each insert must locate the target leaf on disk, read it into memory, and write it back, causing the node to be repeatedly swapped in and out. The buffer cache then cannot work effectively (in other words, the memory hit rate is very low and the page-fault rate very high). This phenomenon occurs for random queries, inserts, updates, and deletes alike. The LSM-tree structure exists precisely to reduce the disk I/O required by frequent inserts, updates, and deletes.
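The depth claim is easy to check numerically. A minimal sketch, using the log_m(N/2) estimate from the text with the 70% fill factor folded into the effective fan-out (the fan-out values are illustrative):

```python
import math

# Depth estimate log_m(N/2) for a billion records, with an effective
# fan-out of 0.7 * m to account for nodes being ~70% full.
N = 1_000_000_000
for m in (50, 100, 300):
    depth = math.log(N / 2, 0.7 * m)
    print(f"fan-out {m}: estimated depth ~ {depth:.1f}")
```

For any realistic fan-out the estimated depth comes out well under 10, consistent with the claim above.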
  9. It may seem that reads are what most systems should protect above all, so trading read performance for write performance looks like a bad deal. But consider: 1. Memory is more than 1000× faster than disk, and read performance depends mainly on the cache hit rate, not on the number of disk reads. 2. If writes do not consume disk I/O, reads get a longer share of the disk's I/O time, which can itself improve read efficiency. So although SSTables lower raw read performance, as long as the read hit rate is reasonable, read performance barely drops (and may even improve), while write performance typically increases by roughly 5–10×.
Now the details. At its core, a key-value store must solve exactly this problem: write as fast as possible, and read as fast as possible. Start from the write-fastest extreme, to illustrate one of the core components of a k-v store: the tree. Suppose we write 1000 records with random keys. The fastest write path for a disk is to append every write sequentially. But then queries are hopeless: finding one value requires scanning all the data, which makes read performance tragic. At the other extreme, the fastest reads come from keeping all data fully sorted, which is what a B-tree gives you. But B-tree writes are poor. To improve writes at the cost of some disk read performance, keep many small sorted structures: sort every m records in memory and write them out, then the next m, and so on, yielding N/m small sorted runs. On a query, since we do not know which run holds the key, we binary-search the newest small run first, then the next, until the key is found. It is easy to see that under this scheme the read complexity is (N/m)·log₂N, so read efficiency drops. This is the original LSM-tree idea.
That is still rather slow, so two refinements are introduced: 1. Bloom filters: a probabilistic bitmap that can quickly tell you whether a given small sorted run could contain the key. Instead of a binary search per run, a few simple hash computations reveal whether the key might be in that small set; efficiency improves at the cost of space. 2. Merging small trees into a large tree: the compaction process you often see, in which a background process continuously merges the small runs into a large tree, so that most older data can be found directly in log₂N time rather than (N/m)·log₂N.
That is the core idea of the LSM tree and its optimizations. One hidden caveat: an LSM tree cannot implement the database INSERT semantics with high performance, because INSERT means: within a transaction, first look up the key; if it exists, raise an error; if not, write it. That lookup drags down the whole write path.
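The small-sorted-runs idea above can be sketched as a toy class; this is a sketch of the concept only (no log, no compaction, no Bloom filters), and all names are ours:

```python
from bisect import bisect_left

class TinyLSM:
    """Toy LSM write path: buffer writes in an in-memory memtable,
    flush it as a sorted run when full, and search runs newest-first
    on reads."""
    def __init__(self, memtable_limit=4):
        self.limit = memtable_limit
        self.memtable = {}  # fast in-memory writes
        self.runs = []      # newest-first list of sorted (key, value) runs

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # "Flush": emit the memtable as one small sorted run.
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest run first, so newer values win
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM()
for k, v in [("b", 1), ("a", 2), ("d", 3), ("c", 4), ("a", 9)]:
    db.put(k, v)
print(db.get("a"), db.get("d"))  # 9 3
```

Writes are always fast (a dict insert plus an occasional sequential flush); reads pay the (N/m)·log₂N cost of probing each run, which is exactly what Bloom filters and compaction exist to reduce.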
  10. To recap: LSM builds many small structures, each sorted in memory so that lookups within a run can use efficient binary search; sorted data is obviously faster to search than unsorted data. That alone cannot deliver fast inserts and lookups, so LSM also introduces Bloom filters and small-to-large merging. A Bloom filter is a very space-efficient probabilistic data structure: it represents a set concisely with a bit array and can test whether an element belongs to the set. This efficiency has a price: when testing membership, it may mistakenly report that an element belongs to the set when it does not (a false positive). Bloom filters are therefore unsuitable for zero-error applications, but where a low error rate is tolerable, they trade a tiny error rate for enormous space savings. (For details on Bloom filters, search the literature.) In LSM, the Bloom filter's role is to decide which in-memory component may contain the key being queried, or which component a new key should be inserted into. Merging small trees into a large one saves memory, which developers know is precious, and also aids recovery: in HBase, delete and update are really inserts, a consequence of the LSM design. New data is written to a new location on disk, so old records are never overwritten, which is valuable during crash recovery: replaying the log in order is enough.
After all this, what exactly is an LSM tree? An LSM-Tree uses one memory-based component C0 and one or more disk-based components (C1, C2, …, CK); it defers and batches index changes and migrates updates to disk efficiently through merging. On an insert, a log record sufficient to recover that insert is first written to the log file, then the record itself is written and its index entry placed in the C0 tree; after some time, the index entry is migrated to C1. A lookup checks C0 first and then C1, where it is certain to hit. Because memory is limited, C0 cannot grow too large, so once it reaches a certain size, contiguous runs of C0 nodes must be merged into C1, as shown in the figure below.
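A minimal Bloom filter sketch illustrating the trade-off described above (sizes and hash scheme are illustrative choices, not from any particular LSM implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into an m-bit array.
    Membership tests may yield false positives, never false negatives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _probes(self, item):
        # Derive k probe positions from a salted cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._probes(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._probes(item))

bf = BloomFilter()
for key in ("IBM", "HPQ", "ORCL"):
    bf.add(key)
print(bf.might_contain("IBM"))   # True: added keys always pass
print(bf.might_contain("MSFT"))  # almost certainly False; True would be a false positive
```

In an LSM store, one such filter per run lets a read skip the binary search of most runs entirely, at the cost of m bits of memory per run.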
  11. When data is loaded into Vertica, it is loaded into all projections based on the source table. You cannot load only a superprojection and then have it feed the data to the other projections. Loading data into a flex table creates two tables and a view: 1. the flexible table itself (flex_table); 2. an associated keys table (flex_table_keys); 3. a default view over the main table (flex_table_view).
  12. User Defined Scalar Functions (UDSFs) take in a single row of data and return a single value. These functions can be used anywhere a native Vertica function can be used, except in CREATE TABLE PARTITION BY and SEGMENTED BY expressions. User Defined Transform Functions (UDTFs) operate on table segments and return zero or more rows of data. The data they return can form an entirely new table, unrelated to the schema of the input table, with its own ordering and segmentation expressions. They can only be used in the SELECT list of a query. For details see Using User Defined Transforms (page 421). User Defined Aggregate Functions (UDAFs) let you create custom aggregate functions specific to your needs. They read one column of data and return one output column. User Defined Analytic Functions (UDAnFs) are similar to UDSFs in that they read a row of data and return a single row; however, the function can read input rows independently of outputting rows, so output values can be calculated over several input rows. The User Defined Load (UDL) feature lets you create custom routines to load your data into Vertica: you build custom libraries with the Vertica SDK to handle the various steps of the loading process.
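As a rough analogy for the two most common UDx shapes (the real Vertica SDK is C++-based, so the function names and row representation here are entirely illustrative, not SDK API): a UDSF maps each row to one value, while a UDTF maps a whole partition to zero or more rows of a possibly different schema.

```python
# Illustrative analogy only -- not the Vertica SDK API.
def udsf_add_vat(row):
    """Scalar-function shape: one row in, one value out."""
    return round(row["price"] * 1.2, 2)

def udtf_explode_tags(partition):
    """Transform-function shape: a table segment in, zero or more
    rows out, with a schema unrelated to the input's."""
    for row in partition:
        for tag in row["tags"].split(","):
            yield {"id": row["id"], "tag": tag}

rows = [{"id": 1, "price": 10.0, "tags": "tech,hw"},
        {"id": 2, "price": 5.0, "tags": "sw"}]
print([udsf_add_vat(r) for r in rows])  # [12.0, 6.0]
print(list(udtf_explode_tags(rows)))
# [{'id': 1, 'tag': 'tech'}, {'id': 1, 'tag': 'hw'}, {'id': 2, 'tag': 'sw'}]
```

The key contrast: the scalar shape is usable wherever an expression is, while the transform shape changes the row count and schema, which is why Vertica restricts UDTFs to the SELECT list.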
  13. What can Vertica do?
• EDW: a centralized data center
• Integration with Hadoop for massive data computation
• R integration for localized data mining
• Unstructured data for sentiment analysis and location-based analytics