Column orientation - rotate your thinking 90 degrees

With ever-increasing data volumes and growing analytics requirements, a new breed of databases is becoming popular: column-based databases. Popular real-world examples are Sybase IQ, Vertica, and, to some degree, Infobright, MySQL's column-based storage engine. These databases store data "column-wise" in pages instead of "row-wise". This re-orientation claims to provide significant advantages over row-based storage for read-oriented analytics queries. In my talk, I will discuss the technicalities, benefits, and motivating use cases for column-based databases. We shall also see why more indexing or partitioning in row-based storage won't achieve the same effect.

Transcript

  • 1. Column based Databases (Shweta Agrawal)
  • 2. What is a column based DB?
    ID  NAME           SEX  AGE  SALARY  ADDRESS  PHONE  PAN...
    1   Sunil Sharma   M    40   10,000  ...      ...    ...
    2   Neha Agarwal   F    25   12,000  ...      ...    ...
    3   Anant Agarwal  M    28   15,000  ...      ...    ...
    4   Vishal Mehta   M    30   8,000   ...      ...    ...
    One page of the table storage:
    Row based storage:
      1|Shweta Agrawal|M|40|10000...|2|Neha Agrawal|F|25|12000...|3|Anant Agarwal|M|28|15000...|4|Vishal Mehta|M|30|8000...
    Column based storage:
      1|2|3|4...|Shweta Agrawal|Neha Agrawal|Anant Agarwal|Vishal Mehta...|M|F|M|M...|40|25|28|30...|10000|12000|15000|8000...
  • 3. Column stores
    1st page of each column store:
      ID:     1|2|3|4|5|6|....
      NAME:   Shweta Agrawal|Neha Agrawal|Anant Agarwal|Vishal Mehta|Srinivas Pathak|Rubina Mehta....
      SEX:    M|F|M|M|M|F|F...
      AGE:    40|25|28|30|45|20...
      SALARY: 10000|12000|15000|8000|15000|5000...
      ...
  • 4. Query processing on row store
    SELECT name, salary FROM employee WHERE age > 40
      • Evaluate the condition age > 40, possibly using an index on age.
      • Get a found-set containing the row numbers/IDs of rows that satisfy the condition.
      • Retrieve all rows in the found-set.
      • Send only name and salary from those rows to the client as the result.
  • 5. Query processing on a column store
    SELECT name, salary FROM employee WHERE age > 40
      • Evaluate the condition age > 40 on the age column, using an index if present.
      • Get a found-set containing the row numbers/IDs of rows that satisfy the condition.
      • Retrieve names from the name column store for all rows in the found-set.
      • Retrieve salaries from the salary column store for all rows in the found-set.
      • Associate each name with its salary by row ID/number for the final result.
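The two query plans on slides 4 and 5 can be compared with a small in-memory sketch. This is only an illustration, not how any of the products named in the talk are implemented: the function names are hypothetical, Python lists stand in for disk pages, and the six employees are the sample values shown on slide 3.

```python
# Row store: each record is kept together as (id, name, sex, age, salary).
ROWS = [
    (1, "Shweta Agrawal", "M", 40, 10000),
    (2, "Neha Agrawal", "F", 25, 12000),
    (3, "Anant Agarwal", "M", 28, 15000),
    (4, "Vishal Mehta", "M", 30, 8000),
    (5, "Srinivas Pathak", "M", 45, 15000),
    (6, "Rubina Mehta", "F", 20, 5000),
]

# Column store: one list per column, aligned by position (the row number).
ids, names, sexes, ages, salaries = (list(col) for col in zip(*ROWS))
COLUMNS = {"id": ids, "name": names, "sex": sexes, "age": ages, "salary": salaries}


def query_row_store(rows, min_age):
    """SELECT name, salary FROM employee WHERE age > min_age (row layout)."""
    # Build the found-set of row numbers that satisfy the condition.
    found = [i for i, row in enumerate(rows) if row[3] > min_age]
    # Retrieve the full rows: every column of each matching row is read.
    fetched = [rows[i] for i in found]
    # Send back only name and salary.
    return [(row[1], row[4]) for row in fetched]


def query_column_store(cols, min_age):
    """Same query, but only the age, name and salary columns are touched."""
    # Build the found-set from the age column alone.
    found = [i for i, age in enumerate(cols["age"]) if age > min_age]
    # Fetch each requested column separately for the found-set.
    found_names = [cols["name"][i] for i in found]
    found_salaries = [cols["salary"][i] for i in found]
    # Associate name with salary by row number.
    return list(zip(found_names, found_salaries))
```

Both functions return the same result; the difference is that the column version never reads the id or sex columns.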
  • 6. A quick calculation of IO
      • Table has 10 columns and 1 million rows.
      • Each row is 100 bytes, so the table is 100MB.
      • 30% of employees are above age 40.
      • Total amount of data read in a row based store = 100MB * 0.3 = 30MB
      • Total amount of data read in a column based store = 100MB * 0.3 * 0.2 (only 2 of the 10 columns) = 6MB
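The arithmetic on slide 6 can be reproduced with a few lines of Python. The function name and parameters are hypothetical. Like the slide, the sketch assumes all columns are equally wide (10 bytes each here) and does not count the IO spent evaluating the age condition itself.

```python
def estimated_bytes_read(num_rows, row_bytes, selectivity_pct, cols_needed, total_cols):
    """Return (row_store_bytes, column_store_bytes) for a selective query.

    Assumes equal-width columns, as the slide's 0.2 factor implies.
    Integer arithmetic avoids floating-point rounding.
    """
    table_bytes = num_rows * row_bytes
    # Row store: whole rows are read for every matching row.
    row_store = table_bytes * selectivity_pct // 100
    # Column store: only the requested columns of the matching rows are read.
    column_store = row_store * cols_needed // total_cols
    return row_store, column_store


row_store, column_store = estimated_bytes_read(
    num_rows=1_000_000, row_bytes=100, selectivity_pct=30, cols_needed=2, total_cols=10
)
print(row_store // 1_000_000, "MB vs", column_store // 1_000_000, "MB")
```

With the slide's numbers this gives 30MB for the row store and 6MB for the column store.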
  • 7. Why is it important?
      • Wide fact tables in data warehouses
      • Analytics queries on a data warehouse tend to aggregate/analyse a few columns but a large number of rows.
      • Full table scans for analytics queries in row stores
      • Normalization means more joins
  • 8. An example star schema    
  • 9. Benefits of column based DB
      • Fewer pages read = less IO = faster queries
      • Processing becomes CPU bound instead of IO bound
      • Compression
          • Page level compression
          • Column level compression (lookup tables)
      • Natural intra-query parallelism on conditions on different columns
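One common reading of "column level compression (lookup tables)" on slide 9 is dictionary encoding: each distinct value in a column is stored once in a lookup table, and the column itself holds only small integer codes. The sketch below assumes that reading; the function names are hypothetical and the talk does not say which scheme any particular product uses.

```python
def dictionary_encode(values):
    """Replace each value with an integer code into a lookup table."""
    lookup = []          # code -> value
    code_of = {}         # value -> code
    encoded = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(lookup)
            lookup.append(value)
        encoded.append(code_of[value])
    return lookup, encoded


def dictionary_decode(lookup, encoded):
    """Rebuild the original column from the lookup table and the codes."""
    return [lookup[code] for code in encoded]


sex_column = ["M", "F", "M", "M", "M", "F"]
lookup, encoded = dictionary_encode(sex_column)
```

This works well in a column store because a page holds values of a single column, which tend to repeat; a row store page mixes values of every column.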
  • 10. Row based equivalents
      • Index every column?
          • Maintenance: updates/inserts/deletes
          • Storage
          • Most importantly: an index maps value => id, a column maps id => value
          • Useful for selective queries only
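The "value => id" versus "id => value" distinction on slide 10 can be made concrete with a toy example. The names below are hypothetical, row ids are 0-based list positions, and a plain dictionary stands in for a real index structure such as a B-tree.

```python
from collections import defaultdict

# A column maps id => value: position in the list is the row id.
age_column = [40, 25, 28, 30, 45, 20]


def build_index(column):
    """An index maps value => ids: the row ids holding each value."""
    index = defaultdict(list)
    for row_id, value in enumerate(column):
        index[value].append(row_id)
    return index


age_index = build_index(age_column)

# The index answers "which rows have age 45?" without scanning.
rows_aged_45 = age_index[45]

# The column answers "what is the age of row 4?" directly by position.
age_of_row_4 = age_column[4]

# Getting a value for a known row id out of the index alone needs a search
# over its entries, which is why an index does not replace the column.
age_via_index = next(v for v, row_ids in age_index.items() if 4 in row_ids)
```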
  • 11. Row based equivalents
      • Vertical partitioning?
          • Joins (although fast ones)
          • Table overhead
          • Cannot use horizontal partitioning
          • Row based query engine is not geared up to make use of the column based storage.
  • 12. Summary
      • For ad hoc analytics queries, column based storage reduces IO and makes queries faster.
      • Column based query engines, written from the ground up for analytics queries, make good use of this storage.
      • Indexing every column, or vertical partitioning, is not the same as column based storage.
  • 13. References
      • Commercial products
          • Sybase IQ
          • Vertica
          • MySQL's Infobright storage engine
      • To know more, read http://databasecolumn.vertica.com/