Introduction to columnar databases, using Infobright Community Edition (ICE) as an example
Row-based vs Column-based approach
In a row-based data store, data is written in the pages as rows:
1, Paul, Nielsen; 2, Josh, Jones; 3, Kalen, Delaney.
A columnar database writes data to the page as columns:
1, 2, 3; Paul, Josh, Kalen; Nielsen, Jones, Delaney.
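The two layouts can be compared with a small Python sketch. It uses the three sample records from the slide; the function name `to_columns` is made up for this illustration and says nothing about how any real storage engine lays out its pages.

```python
# Sample records from the slide, stored row by row.
rows = [
    (1, "Paul", "Nielsen"),
    (2, "Josh", "Jones"),
    (3, "Kalen", "Delaney"),
]


def to_columns(rows):
    """Transpose a list of rows into a list of columns."""
    return [list(column) for column in zip(*rows)]


columns = to_columns(rows)

# Row layout: one record after another.
print(rows)
# Column layout: all ids, then all first names, then all last names.
print(columns)
# Reading a single attribute touches one contiguous list only.
print(columns[1])
```

In the column layout, a query that needs only first names reads `columns[1]` and never touches the other two lists.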
So...
The difference between row-based and columnar databases is only in the storage engine layer.
The query engine layer can still use straight SQL and then pass the optimized query to the storage engine, which writes and reads either row-based or columnar tables.
Important factors
1. A column always has exactly one (!) data type. You can design compression algorithms for the data in the column and work with the compressed format instead of the original data.
2. The column store has a structure similar to a B-tree index, so there is no need for indexes.
3. Only the columns needed to satisfy a query are read (vs. the full row), which automatically eliminates unnecessary I/O.
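To make the first point concrete, here is a minimal Python sketch of a query that runs directly on compressed column data. Run-length encoding is used only because it is easy to follow; it is an assumption made for this example and is not claimed to be the compression scheme Infobright uses. The function names are hypothetical.

```python
from itertools import groupby


def rle_encode(column):
    """Compress a column into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def count_equal(encoded, target):
    """Count matching values without decompressing the column."""
    return sum(length for value, length in encoded if value == target)


# A single-typed column with long runs of repeated values compresses well.
states = ["NY", "NY", "NY", "CA", "CA", "TX", "TX", "TX", "TX"]
encoded = rle_encode(states)

print(encoded)                     # [('NY', 3), ('CA', 2), ('TX', 4)]
print(count_equal(encoded, "TX"))  # 4
```

The count is computed from three pairs instead of nine values, which is the sense in which the engine can work with the compressed format.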
Disadvantages
When you read multiple columns, you have to combine together the rows of the table using columns read from various parts of the disk, effectively identical to doing a bunch of joins.
For some workloads the columnar storage is a win, and for some workloads, row-based storage is the best bet.
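The cost described above can be sketched in Python: every requested column is fetched separately and the values are then stitched back together by position. The dictionary of lists stands in for columns stored in different places on disk; the names `columns` and `read_rows` are assumptions for this illustration.

```python
# Each column lives in its own list, standing in for a separate disk region.
columns = {
    "id": [1, 2, 3],
    "first_name": ["Paul", "Josh", "Kalen"],
    "last_name": ["Nielsen", "Jones", "Delaney"],
}


def read_rows(columns, wanted):
    """Rebuild rows by fetching each wanted column and aligning by position."""
    fetched = [columns[name] for name in wanted]  # one read per column
    return list(zip(*fetched))                    # positional "join"


print(read_rows(columns, ["id", "last_name"]))
```

The more columns a query asks for, the more separate reads and the more stitching it needs, which is why wide "give me the whole row" workloads tend to favor row-based storage.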
ICE
ICE is a specialized open source analytic database for load-and-read-only workloads, as opposed to a general-purpose database.
ICE is optimized for use under the following conditions:
● Up to 50 TB of data, supporting hundreds of millions of rows in a single table
● Data loading and storage, no DML support required
● 64-bit & 32-bit Linux environments and 64-bit & 32-bit Windows
Supported Data Types
The main data types from MySQL, and more.
Efficient Data Types
Some data types are identified as being more efficient within Infobright:
TINYINT, SMALLINT, MEDIUMINT, INT, BIGINT, DECIMAL, DATE, TIME
Special-case data types:
CHAR, VARCHAR: these types are covered in the Knowledge Grid, but where possible they should be replaced with numeric values, which are better optimized and faster to decompress.
Less optimized data types:
BINARY, VARBINARY, FLOAT, DOUBLE, TINYTEXT, TEXT
Creating the Database from an Existing Schema
ICE supports standard MySQL DDL (with some extensions and omissions):
● No indices
● No referential integrity checks
● No DEFAULT values
● Lookup fields
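As a rough illustration of what adapting an existing schema involves, the Python sketch below strips index definitions, foreign-key constraints and DEFAULT clauses from a CREATE TABLE statement. It is a simplification under stated assumptions: the input has one definition per line (as in typical dump output), the first and last lines are the opening and closing of the statement, and the function name `simplify_ddl` is hypothetical. It is not a tool shipped with ICE and it does not cover every MySQL DDL construct.

```python
import re

# Lines that define indexes or referential constraints (assumed one per line).
INDEX_OR_FK = re.compile(
    r"^\s*(PRIMARY\s+KEY|UNIQUE(\s+KEY)?|KEY|INDEX|FULLTEXT|CONSTRAINT|FOREIGN\s+KEY)\b",
    re.IGNORECASE,
)
# A DEFAULT clause followed by a quoted string or a single bare token.
DEFAULT_CLAUSE = re.compile(r"\s+DEFAULT\s+('(?:[^']|'')*'|\S+)", re.IGNORECASE)


def simplify_ddl(ddl):
    """Drop index/constraint lines and DEFAULT clauses from a CREATE TABLE."""
    lines = ddl.strip().splitlines()
    header, body, footer = lines[0], lines[1:-1], lines[-1]
    kept = []
    for line in body:
        if INDEX_OR_FK.match(line):
            continue  # omit indexes and referential integrity definitions
        line = line.rstrip().rstrip(",")  # commas are re-added when joining
        kept.append(DEFAULT_CLAUSE.sub("", line))
    return "\n".join([header, ",\n".join(kept), footer])


source_ddl = """
CREATE TABLE customers (
  id INT NOT NULL,
  state CHAR(2) DEFAULT 'NY',
  visits INT DEFAULT 0,
  PRIMARY KEY (id),
  KEY idx_state (state)
);
"""

print(simplify_ddl(source_ddl))
```

The output keeps only the three column definitions. Anything beyond the omissions listed on the slide (for example the storage engine clause) is left untouched and would still need to be reviewed by hand.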
Lookups
Lookups are a powerful tool that can reduce both storage requirements and query times when used effectively.
Lookups replace a CHAR or VARCHAR value with an integer value for a column where the total values to unique values ratio is more than 10:1.
This is particularly beneficial for fields like States and other fields that traditionally have a small range of acceptable values.
Lookups
So, if you had 100,000 values, and only 10,000 unique values, then you would have a good candidate for the lookup function.
The lookup function should not be used if this criterion is not met: a ratio less than 10:1 will result in slower queries and longer load times.
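The 10:1 rule of thumb can be checked before loading with a few lines of Python. This is a sketch, not part of ICE: the function names are hypothetical, and treating a ratio of exactly 10:1 as acceptable is an assumption based on the 100,000 / 10,000 example. The `encode_lookup` helper only illustrates the idea of replacing strings with integer codes; it does not describe Infobright's internal format.

```python
def lookup_ratio(values):
    """Ratio of total values to unique values (0.0 for an empty column)."""
    if not values:
        return 0.0
    return len(values) / len(set(values))


def is_lookup_candidate(values, threshold=10.0):
    """Apply the rule of thumb: total-to-unique ratio of at least 10:1."""
    return lookup_ratio(values) >= threshold


def encode_lookup(values):
    """Replace each string with a small integer code (dictionary encoding)."""
    dictionary = {}
    codes = [dictionary.setdefault(value, len(dictionary)) for value in values]
    return dictionary, codes


states = ["NY", "CA", "TX"] * 20             # 60 values, 3 unique
names = [f"customer_{i}" for i in range(60)]  # 60 values, all unique

print(lookup_ratio(states), is_lookup_candidate(states))  # 20.0 True
print(lookup_ratio(names), is_lookup_candidate(names))    # 1.0 False
print(encode_lookup(["NY", "CA", "NY"]))                  # ({'NY': 0, 'CA': 1}, [0, 1, 0])
```

A column like `states` repeats a handful of values many times and passes the check; a column of unique names does not.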
Data Loading
ICE includes a dedicated high-performance loader that differs from the standard MySQL loader.
    LOAD DATA INFILE '/full_path/file_name'
    INTO TABLE table_name
    [FIELDS
        [TERMINATED BY 'char']
        [ENCLOSED BY 'char']
        [ESCAPED BY 'char']
    ];
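For a filled-in example of the synopsis above, the Python sketch below assembles a LOAD DATA statement as text. The helper names, the sample path `/data/sales.csv` and the table name `sales` are all made up for illustration, and the quoting follows ordinary MySQL string-literal escaping, which is assumed to apply here as well. The sketch only builds the string; it does not connect to a database.

```python
import re


def sql_literal(text):
    """Quote a Python string as a MySQL-style string literal."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def build_load_statement(path, table, terminated_by=None,
                         enclosed_by=None, escaped_by=None):
    """Build a LOAD DATA INFILE statement with optional FIELDS clauses."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    parts = [f"LOAD DATA INFILE {sql_literal(path)} INTO TABLE {table}"]
    options = []
    if terminated_by is not None:
        options.append(f"TERMINATED BY {sql_literal(terminated_by)}")
    if enclosed_by is not None:
        options.append(f"ENCLOSED BY {sql_literal(enclosed_by)}")
    if escaped_by is not None:
        options.append(f"ESCAPED BY {sql_literal(escaped_by)}")
    if options:
        parts.append("FIELDS " + " ".join(options))
    return " ".join(parts) + ";"


statement = build_load_statement(
    "/data/sales.csv", "sales", terminated_by=";", enclosed_by='"'
)
print(statement)
```

With these inputs the sketch prints a single-line statement with a semicolon field terminator and double-quote enclosure; the FIELDS clause is left out entirely when no option is given.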
Optimized Queries
Best performance is achieved when Infobright executes SQL statements using its own query optimizer.
When a query cannot exploit Infobright’s proprietary retrieval methods, the query will invoke the MySQL database optimizer. In these cases, users should expect the query to perform as it would using the MySQL database engine itself.
Knowledge Grid
The Knowledge Grid is a set of Infobright metadata used by the Infobright storage engine (named “Brighthouse”) to optimize query execution.
The Knowledge Grid consists of Knowledge Nodes, which are optimization data for particular tables and columns.
Kinds of Knowledge Nodes
Histogram - Used by Infobright to enhance the speed of most queries involving numerical conditions (including date/time, decimal, etc.). Created automatically during data load.
Character Map - Used by Infobright to enhance the speed of most queries involving text conditions. Created automatically during data load.
Pack/Pack - Used to enhance joining of tables. Created or updated automatically while executing user queries.
DPN (Data Pack Nodes) - Statistical metadata that describes the content of the Data Pack. Used to assist in data access. Created automatically during data load.
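How statistical metadata of the DPN kind can assist data access is easiest to see in a toy model. The Python sketch below splits a column into small packs, records the minimum and maximum of each pack, and uses those two numbers to decide which packs a range condition needs to open. Everything here is a simplified assumption for illustration: the pack size of 4, the dictionary layout, the function names and the three category labels are chosen for this sketch and are not taken from the Brighthouse implementation.

```python
def build_pack_nodes(column, pack_size):
    """Split a column into packs and record min/max/count for each pack."""
    nodes = []
    for start in range(0, len(column), pack_size):
        pack = column[start:start + pack_size]
        nodes.append({"start": start, "min": min(pack),
                      "max": max(pack), "count": len(pack)})
    return nodes


def classify_packs(nodes, low, high):
    """Classify packs for the condition low <= value <= high using metadata only."""
    irrelevant, relevant, suspect = [], [], []
    for index, node in enumerate(nodes):
        if node["max"] < low or node["min"] > high:
            irrelevant.append(index)  # no value can match: skip the pack
        elif low <= node["min"] and node["max"] <= high:
            relevant.append(index)    # every value matches: no need to open it
        else:
            suspect.append(index)     # some values may match: must be read
    return irrelevant, relevant, suspect


column = [1, 2, 3, 4, 10, 11, 12, 13, 3, 9, 20, 25]
nodes = build_pack_nodes(column, pack_size=4)

# Condition: value BETWEEN 10 AND 15
print(classify_packs(nodes, 10, 15))  # ([0], [1], [2])
```

In this example only the third pack has to be decompressed and scanned. The first is ruled out and the second is fully covered, so a COUNT over it can be answered from the stored `count` alone.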