Column based Databases




      Shweta Agrawal




        

Column orientation - rotate your thinking 90 degrees


With ever-increasing data volumes and growing analytics requirements, a new breed of databases is becoming popular: column-based databases. Popular real-world examples are Sybase IQ, Vertica and, to some degree, Infobright, MySQL's column-based storage engine. These databases store data column-wise in pages instead of row-wise. This re-orientation claims to provide significant advantages over row-based storage for read-heavy analytics queries. In my talk, I will discuss the technicalities, benefits and motivating use cases for column-based databases. We shall also see why more indexing or partitioning in row-based storage won't achieve the same effect.

Published in: Technology, Business

  1. Column based Databases (Shweta Agrawal)
  2. What is a column based DB?

       ID  NAME           SEX  AGE  SALARY  ADDRESS  PHONE  PAN...
       1   Sunil Sharma   M    40   10,000  ...      ...    ...
       2   Neha Agarwal   F    25   12,000  ...      ...    ...
       3   Anant Agarwal  M    28   15,000  ...      ...    ...
       4   Vishal Mehta   M    30   8,000   ...      ...    ...

     One page of the table storage:

       Row based storage:
         1|Shweta Agrawal|M|40|10000...|2|Neha Agrawal|F|25|12000...|3|Anant Agarwal|M|28|15000...|4|Vishal Mehta|M|30|8000...

       Column based storage:
         1|2|3|4...|Shweta Agrawal|Neha Agrawal|Anant Agarwal|Vishal Mehta...|M|F|M|M...|40|25|28|30...|10000|12000|15000|8000...
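
The two page layouts on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the table fits in memory and that a "page" is a pipe-delimited string; the names `rows`, `row_layout` and `column_layout` are made up for this sketch and do not come from any product.

```python
# Hypothetical in-memory copy of the sample table (first five columns only).
rows = [
    (1, "Sunil Sharma", "M", 40, 10000),
    (2, "Neha Agarwal", "F", 25, 12000),
    (3, "Anant Agarwal", "M", 28, 15000),
    (4, "Vishal Mehta", "M", 30, 8000),
]


def row_layout(rows):
    """Serialize record by record: every field of row 1, then row 2, ..."""
    return "|".join(str(value) for row in rows for value in row)


def column_layout(rows):
    """Serialize column by column: all IDs, then all names, ..."""
    columns = zip(*rows)  # transpose the rows into columns
    return "|".join(str(value) for column in columns for value in column)


print(row_layout(rows))
print(column_layout(rows))
```

The same values are stored either way; only their order on the page changes, which is what decides how many pages a query touching two columns has to read.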
  3. Column stores

     1st page of each column store:

       ID:      1|2|3|4|5|6|....
       NAME:    Shweta Agrawal|Neha Agrawal|Anant Agarwal|Vishal Mehta|Srinivas Pathak|Rubina Mehta....
       SEX:     M|F|M|M|M|F|F...
       AGE:     40|25|28|30|45|20...
       SALARY:  10000|12000|15000|8000|15000|5000...
       ...
  4. Query processing on row store

       SELECT name, salary FROM employee WHERE age > 40

     - Evaluate condition age > 40, possibly using an index on age.
     - Get a found-set containing the row number/ID of the rows that satisfy the above condition.
     - Retrieve all rows in the above found-set.
     - Send only name and salary from those rows as the result to the client.
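
The row store steps can be sketched as follows. This is a minimal sketch, not how any particular engine works: it scans instead of using an index, holds the table as a list of dicts, and the names `employee_rows` and `query_row_store` are hypothetical. The sample values are taken from the column store slide.

```python
# Hypothetical row store: each entry is one complete row.
employee_rows = [
    {"id": 1, "name": "Shweta Agrawal", "sex": "M", "age": 40, "salary": 10000},
    {"id": 2, "name": "Neha Agrawal", "sex": "F", "age": 25, "salary": 12000},
    {"id": 3, "name": "Anant Agarwal", "sex": "M", "age": 28, "salary": 15000},
    {"id": 4, "name": "Vishal Mehta", "sex": "M", "age": 30, "salary": 8000},
    {"id": 5, "name": "Srinivas Pathak", "sex": "M", "age": 45, "salary": 15000},
    {"id": 6, "name": "Rubina Mehta", "sex": "F", "age": 20, "salary": 5000},
]


def query_row_store(rows):
    # Evaluate age > 40 and build the found-set of row positions
    # (a plain scan here; a real engine might use an index on age).
    found_set = [i for i, row in enumerate(rows) if row["age"] > 40]
    # Retrieve every row in the found-set in full, all columns included.
    fetched = [rows[i] for i in found_set]
    # Send only name and salary to the client.
    return [(row["name"], row["salary"]) for row in fetched]


result = query_row_store(employee_rows)
```

Note that the third step reads whole rows even though only two of their columns end up in the result.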
  5. Query processing on a column store

       SELECT name, salary FROM employee WHERE age > 40

     - Evaluate condition age > 40 on column age, using an index if present.
     - Get a found-set containing the row number/ID of the rows that satisfy the above condition.
     - Retrieve names from the name column store for all rows in the found-set.
     - Retrieve salaries from the salary column store for all rows in the found-set.
     - Associate name with salary by row ID/number for the final result.
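
The column store steps can be sketched the same way. This is a minimal sketch under the assumption that each column is a Python list and that the list position serves as the row number; `employee_columns` and `query_column_store` are hypothetical names, and the values come from the column store slide.

```python
# Hypothetical column store: one list per column, aligned by position.
employee_columns = {
    "id": [1, 2, 3, 4, 5, 6],
    "name": ["Shweta Agrawal", "Neha Agrawal", "Anant Agarwal",
             "Vishal Mehta", "Srinivas Pathak", "Rubina Mehta"],
    "age": [40, 25, 28, 30, 45, 20],
    "salary": [10000, 12000, 15000, 8000, 15000, 5000],
}


def query_column_store(columns):
    # Evaluate age > 40 on the age column only and build the found-set.
    found_set = [i for i, age in enumerate(columns["age"]) if age > 40]
    # Retrieve names and salaries for the found-set from their own columns.
    names = [columns["name"][i] for i in found_set]
    salaries = [columns["salary"][i] for i in found_set]
    # Associate name with salary by row position for the final result.
    return list(zip(names, salaries))


result = query_column_store(employee_columns)
```

Only the age, name and salary lists are touched; the id column (and any other column the table may have) is never read.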
  6. A quick calculation of IO
     - Table has 10 columns
     - 1 million rows
     - Each row is 100 bytes (so the table is about 100MB)
     - 30% of employees are above age 40
     - Total amount of data read in row based store = 100MB * 0.3 = 30MB
     - Total amount of data read in column based store = 100MB * 0.3 * 0.2 (only 2 of 10 columns, assuming equal-width columns) = 6MB
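
The arithmetic on this slide can be reproduced as a short script. It assumes decimal megabytes and equal-width columns, and, like the slide, it counts only the data fetched for matching rows; the constant names are made up for this sketch.

```python
ROWS = 1_000_000
ROW_BYTES = 100
TOTAL_COLUMNS = 10
SELECTED_COLUMNS = 2        # name and salary
MATCHING_PERCENT = 30       # employees above age 40
MB = 1_000_000              # decimal megabyte, as the slide's numbers imply

table_bytes = ROWS * ROW_BYTES                               # whole table
row_store_bytes = table_bytes * MATCHING_PERCENT // 100      # full rows
column_store_bytes = row_store_bytes * SELECTED_COLUMNS // TOTAL_COLUMNS

print(table_bytes // MB, "MB in the table")
print(row_store_bytes // MB, "MB read by the row store")
print(column_store_bytes // MB, "MB read by the column store")
```

Integer arithmetic is used on purpose so the results are exact.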
  7. Why is it important?
     - Wide fact tables in data warehouses
     - Analytics queries on a data warehouse tend to aggregate/analyse a few columns but a large number of rows.
     - Full table scans for analytics queries in row stores
     - Normalization means more joins
  8. An example star schema
  9. Benefits of column based DB
     - Fewer pages read = less IO = faster queries
     - Processing becomes CPU bound instead of IO bound
     - Compression
         - Page level compression
         - Column level compression (lookup tables)
     - Natural intra-query parallelism on conditions on different columns
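
One way to read "column level compression (lookup tables)" is dictionary encoding: each distinct value is stored once in a lookup table and the column keeps only small integer codes. The slide does not spell out the scheme, so the sketch below is an assumption, and `dictionary_encode` and `dictionary_decode` are hypothetical helper names.

```python
def dictionary_encode(values):
    """Replace each value by a small integer code plus a lookup table."""
    lookup = []           # code -> distinct value
    code_of = {}          # distinct value -> code
    encoded = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(lookup)
            lookup.append(value)
        encoded.append(code_of[value])
    return lookup, encoded


def dictionary_decode(lookup, encoded):
    """Rebuild the original column from the codes and the lookup table."""
    return [lookup[code] for code in encoded]


sex_column = ["M", "F", "M", "M", "M", "F"]
lookup, encoded = dictionary_encode(sex_column)
```

This works well on a column store because all values of one column sit together and usually repeat often, as with the sex column here.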
  10. Row based equivalents
     - Index every column?
         - Maintenance: updates/inserts/deletes
         - Storage
         - Most importantly: index is value=>id, column is id=>value
         - Useful for selective queries only
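
The "index is value=>id, column is id=>value" point can be illustrated with two small structures. This is a simplified sketch that uses a dict where a real index would typically be a tree, and list positions as row IDs; `age_column`, `build_index` and `age_index` are hypothetical names.

```python
from collections import defaultdict

# Column: row id (list position) -> value.
age_column = [40, 25, 28, 30, 45, 20]


def build_index(column):
    """Index: value -> list of row ids holding that value."""
    index = defaultdict(list)
    for row_id, value in enumerate(column):
        index[value].append(row_id)
    return dict(index)


age_index = build_index(age_column)

# The index answers "which rows have age 45?" directly.
rows_with_age_45 = age_index[45]
# The column answers "what is the age of row 4?" directly.
age_of_row_4 = age_column[4]
```

Fetching the name or salary for a found-set is an id=>value lookup, which the column answers directly and a value=>id index does not.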
  11. Row based equivalents
     - Vertical partitioning?
         - Joins (although fast ones)
         - Table overhead
         - Cannot use horizontal partitioning
         - Row based query engine not geared up to make use of the column based storage
  12. Summary
     - For ad hoc analytics queries, column based storage reduces IO and makes queries faster.
     - Column based query engines written from the ground up for analytics queries make good use of this storage.
     - Indexing every column, or vertical partitioning, is not the same as column based storage.
  13. References
     - Commercial products
         - Sybase IQ
         - Vertica
         - MySQL's InfoBright storage engine
     - To know more, read http://databasecolumn.vertica.com/