SlideShare a Scribd company logo
Row or Columnar Database
1
©asquareb llc
If someone is evaluating database or data stores to use in their application, there are so many options to
choose from especially in the data ware house space. If narrowed down to the relational database (RDBMS)
paradigm, one of the choices to make is whether to use row based or columnar based database. Vendors
claim superiority of one over the other on whether their product is columnar or row based. So we looked into
the details about columnar and row databases to understand the fundamental differences. The following is the
summary.
Why the name?
Row Based RDBMS (R-RDBMS):
In a row based DBMS, data related to a tuple (row) i.e. all the column data are stored contiguously on disk.
For IO efficiency, disk reads and writes are done at block size, for e.g. a 4K (4096 byte) block size by
Operating Systems. Database management systems use “pages” of size which is a multiple of the block size to
read and write data to disk. In R-RDBMS rows of data are stored in data pages and if the row size is less than
half the page size, then multiple rows are stored in a single page. When a row is required by a query, the
whole page in which the row is stored is retrieved back from the disk into the memory for further processing.
The following is a representative page layout based on one of the RDBMS used in the industry.
As we can see the R-RDBMS stores additional information regarding each page and rows with in the page to
help with maintaining the ACID (Atomicity, Consistency, Isolation, Durability) property expected out of
RDBMS.
In order to improve query performance, R-RDBMS uses additional structure called indexes. Indexes store the
indexed column value, the page in which the row with the indexed column value is stored on disk and the
offset within the page to reach out to the particular row. If an index is not present, when a query is executed
against a table, the DBMS needs to read through all the pages from the disk pertaining to the table to find the
rows which satisfy the query. If the index is present and the query uses the indexed columns in its predicate,
the DBMS can use the index to identify the rows which satisfy the query and read the pages where the rows
are stored reducing the time to identify the rows. This also reduces the amount of data read from the disk i.e.
Data Page
Page Header
(20 Bytes)
Row
Header
(6 bytes)
(
Row Data
Row
Header
(6 bytes)
(
Trail
byte
Row
Pointer
Row
Pointer
Col 1
Col 1 Col 2
Col 2
…
Row Data
Row or Columnar Database
2
©asquareb llc
reducing the slowest operation in the query processing sequence which is disk IO. The following is a
representative index page in a typical DBMS. There can be other representations based on various index
structures like BTree, Bit map etc.
Columnar RDBMS (C-RDBMS):
As you may have guessed, columnar databases store each column data from all the tuples together. The
following diagram shows the translation of data storage between a row based DBMS and columnar DBMS.
Contrary to storing all the column data corresponding to a row sequentially in a page, values for each column
in rows are stored together in the same page. This results in data for each row getting stored in different
pages. When a query requires data for a row, column data for the row is pulled from all the pages storing the
column values, appends them together before returning it to the user as a single row. The sequence in which
the column values are stored in the database pages determines the row to which corresponds to. For e.g. the
second entry stored in the various pages “ID2”, “Mark”, “Waugh”, ”Researcher” corresponds to the same
row which is row 2.
By storing the columns separately, each column acts as an index since the sequence of storage identifies the
row. For e.g. if a query requests the row with first name ‘Steve’ the DBMS can identify the row number using
Index Page
Page Header
Col 2 Page Id + Offset
Col 2 Page Id + Offset
ID1
ID2
ID3
Mark Antony
Mark Waugh
Steve Aurelius
Engineer
Researcher
Engineer
ID1 ID2 ID3
…
Mark Mark Steve
…
Antony Waugh Aurelius
…
Engineer Researcher Engineer
…
Row DB page to Columnar DB page
Data Page
Page Header
(20 Bytes)
Row
Header
(6 bytes)
(
Row Data
Row
Header
(6 bytes)
(
Trail
byte
Row
Pointer
Row
Pointer
Col 1 Col 2
…
Row Data
Col 1 Col 2
Row or Columnar Database
3
©asquareb llc
the pages storing the column “First Name” which in this case is row 3. Then the DBMS can retrieve the third
entry from pages storing all the other column values stitch them together and return the row back to the user.
How they differ?
Given the understanding of the key difference between R and C-RDBMS, we can look at how they differ
operationally.
 If the usage pattern involves retrieval and update of all or most of the columns in a row like in an
OLTP application, then R-RDBMS is a better option than the C-RDBMS. The reason being that the
C-RDBMS needs to retrieve the columns values separately and stitch them together to return the row
as a response and this doesn’t provide the performance expected in an OLTP environment. Also in C-
RDBMS updates need to be made on multiple pages in contrast to updating a single page in R-RDBMS
which is inefficient. C-RDBMS is primarily suited for data ware housing where the usage pattern is read
only.
 If the usage pattern involves retrieval of all the columns in a row in bulk, then R-RDBMS is a better
option due to the same reason described above. But if the bulk retrieval involves only a small subset of
columns, then the C-RDBMS will perform better. The reason being that C-RDBMS can deal with the
subset of columns since they are stored separately while R-RDBMS need to bring in all the rows and
columns into memory from the disk and process it through CPU to eliminate the unwanted columns.
Some R-RDBMS products like Netezza may be able to eliminate the unwanted columns using
specialized hardware during disk read but still need to deal with all the rows columns.
 If the usage pattern involves aggregation on columns then C-RDBMS performs better than the R-
RDBMS since they can act on individual columns efficiently compared to R-RDBMS.
 C-RDBMS can implement optimization techniques like late materialization where conditions on
columns can be applied separately, identify the rows which satisfies all the conditions before retrieving
the columns to generate rows whereas R-RDBMS needs to retrieve rows much earlier to identify the
satisfying rows and to return to the user.
 Storage required for C-RDBMS will be less than the R-RDBMS since they don’t have the same page
and row overheads as R-RDBMS. Also they don’t need additional structures like indexes since the
columns themselves act as indexes.
 Compression on data is efficient on C-RDBMS since data which are similar are stored together
compared to R-RDBMS where mixed data in rows are stored together. This helps reduce space usage
in C-RDBMS and also improves the disk IO since the data is much compressed.
Can R-RDBMS implement C-RDBMS?
One can try to mimic C-RDBMS storage in an R-RDBMS using any of the following techniques
 Store columns as separate tables with a common identifier column to identify the row to which the
columns value corresponds to.
Row or Columnar Database
4
©asquareb llc
 Create indexes for each of the columns in a table so that queries can be satisfied by using the indexes
only.
Also there are commercial DBMS products which support both columnar and row based storage. Apart from
the increased (more than double) storage requirement to implement these techniques, research from MIT
database group shows that these techniques do not provide the same performance as the C-RDBMS for all
the usage patterns for which C-RDBMS is best suited for.
Summary
C-RDBMS are more suited for data warehousing use cases and it is how they are utilized currently in the
industry. Also C-RDBMS may perform much better when usage involves small set of column retrieval and
column aggregations. R-RDBMS are good for use cases where data is dealt at the row level and where
updates are often made on rows. C-RDBMS and R-RDBMS vendors may find ways to incorporate some of
the advantages of the other in their product. The key is to understand the data usage pattern and choose the
best product which matches the usage. Even though we have eliminated other complexities in a typical
DBMS system and looked only at the fundamental difference between R and C RDBMS, hope it helps you
choose the best option for your application.
bnair@asquareb.com
blog.asquareb.com
https://github.com/bijugs
@gsbiju
http://www.slideshare.net/bijugs

More Related Content

What's hot

Migrating Your Databases to AWS - Deep Dive on Amazon RDS and AWS Database Mi...
Migrating Your Databases to AWS - Deep Dive on Amazon RDS and AWS Database Mi...Migrating Your Databases to AWS - Deep Dive on Amazon RDS and AWS Database Mi...
Migrating Your Databases to AWS - Deep Dive on Amazon RDS and AWS Database Mi...
Amazon Web Services
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
Jaemun Jung
 
Cloud design principles
Cloud design principlesCloud design principles
Cloud design principles
Masashi Narumoto
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
Data Con LA
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Common MongoDB Use Cases
Common MongoDB Use CasesCommon MongoDB Use Cases
Common MongoDB Use Cases
DATAVERSITY
 
Snowflake Best Practices for Elastic Data Warehousing
Snowflake Best Practices for Elastic Data WarehousingSnowflake Best Practices for Elastic Data Warehousing
Snowflake Best Practices for Elastic Data Warehousing
Amazon Web Services
 
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data PipelinesPutting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
DATAVERSITY
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architecture
Sudheer Kondla
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Spark Summit
 
No SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageNo SQL- The Future Of Data Storage
No SQL- The Future Of Data Storage
Bethmi Gunasekara
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
Harri Kauhanen
 
Intro to Neo4j and Graph Databases
Intro to Neo4j and Graph DatabasesIntro to Neo4j and Graph Databases
Intro to Neo4j and Graph Databases
Neo4j
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Realtime Analytics on AWS
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWS
Sungmin Kim
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and Uses
Suvradeep Rudra
 
The Journey to Data Mesh with Confluent
The Journey to Data Mesh with ConfluentThe Journey to Data Mesh with Confluent
The Journey to Data Mesh with Confluent
confluent
 
Intro to Neo4j
Intro to Neo4jIntro to Neo4j
Intro to Neo4j
Neo4j
 
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...
Denodo
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Cathrine Wilhelmsen
 

What's hot (20)

Migrating Your Databases to AWS - Deep Dive on Amazon RDS and AWS Database Mi...
Migrating Your Databases to AWS - Deep Dive on Amazon RDS and AWS Database Mi...Migrating Your Databases to AWS - Deep Dive on Amazon RDS and AWS Database Mi...
Migrating Your Databases to AWS - Deep Dive on Amazon RDS and AWS Database Mi...
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
Cloud design principles
Cloud design principlesCloud design principles
Cloud design principles
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Common MongoDB Use Cases
Common MongoDB Use CasesCommon MongoDB Use Cases
Common MongoDB Use Cases
 
Snowflake Best Practices for Elastic Data Warehousing
Snowflake Best Practices for Elastic Data WarehousingSnowflake Best Practices for Elastic Data Warehousing
Snowflake Best Practices for Elastic Data Warehousing
 
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data PipelinesPutting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architecture
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
 
No SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageNo SQL- The Future Of Data Storage
No SQL- The Future Of Data Storage
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Intro to Neo4j and Graph Databases
Intro to Neo4j and Graph DatabasesIntro to Neo4j and Graph Databases
Intro to Neo4j and Graph Databases
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Realtime Analytics on AWS
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWS
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and Uses
 
The Journey to Data Mesh with Confluent
The Journey to Data Mesh with ConfluentThe Journey to Data Mesh with Confluent
The Journey to Data Mesh with Confluent
 
Intro to Neo4j
Intro to Neo4jIntro to Neo4j
Intro to Neo4j
 
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...
Denodo Data Virtualization Platform: Overview (session 1 from Architect to Ar...
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 

Viewers also liked

Using Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceUsing Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve Performace
Biju Nair
 
Project Risk Management
Project Risk ManagementProject Risk Management
Project Risk Management
Biju Nair
 
Concurrency
ConcurrencyConcurrency
Concurrency
Biju Nair
 
Websphere MQ (MQSeries) fundamentals
Websphere MQ (MQSeries) fundamentalsWebsphere MQ (MQSeries) fundamentals
Websphere MQ (MQSeries) fundamentals
Biju Nair
 
HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User Reference
Biju Nair
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload management
Biju Nair
 
Netezza fundamentals for developers
Netezza fundamentals for developersNetezza fundamentals for developers
Netezza fundamentals for developers
Biju Nair
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
Lars Hofhansl
 
NENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezzaNENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezza
Biju Nair
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
Biju Nair
 

Viewers also liked (10)

Using Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceUsing Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve Performace
 
Project Risk Management
Project Risk ManagementProject Risk Management
Project Risk Management
 
Concurrency
ConcurrencyConcurrency
Concurrency
 
Websphere MQ (MQSeries) fundamentals
Websphere MQ (MQSeries) fundamentalsWebsphere MQ (MQSeries) fundamentals
Websphere MQ (MQSeries) fundamentals
 
HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User Reference
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload management
 
Netezza fundamentals for developers
Netezza fundamentals for developersNetezza fundamentals for developers
Netezza fundamentals for developers
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
 
NENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezzaNENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezza
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
 

Similar to Row or Columnar Database

Choosing your NoSQL storage
Choosing your NoSQL storageChoosing your NoSQL storage
Choosing your NoSQL storage
Imteyaz Khan
 
Rdbms vs. no sql
Rdbms vs. no sqlRdbms vs. no sql
Rdbms vs. no sql
Amar Jagdale
 
Vertica
VerticaVertica
MapReduce and parallel DBMSs: friends or foes?
MapReduce and parallel DBMSs: friends or foes?MapReduce and parallel DBMSs: friends or foes?
MapReduce and parallel DBMSs: friends or foes?
Spyros Eleftheriadis
 
Database
DatabaseDatabase
Database
Zahid Soomro
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
Shahbaz Sidhu
 
Bigtable
Bigtable Bigtable
Rise of Column Oriented Database
Rise of Column Oriented DatabaseRise of Column Oriented Database
Rise of Column Oriented Database
Suvradeep Rudra
 
White paper on cassandra
White paper on cassandraWhite paper on cassandra
White paper on cassandra
Navanit Katiyar
 
NOSQL and MongoDB Database
NOSQL and MongoDB DatabaseNOSQL and MongoDB Database
NOSQL and MongoDB Database
Tariqul islam
 
2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptx2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptx
RushikeshChikane2
 
Column oriented Transactions
Column oriented TransactionsColumn oriented Transactions
Column oriented Transactions
Aerial Telecom Solutions (ATS) Pvt. Ltd.
 
Storage cassandra
Storage   cassandraStorage   cassandra
Storage cassandra
PL dream
 
Annotating search results from web databases-IEEE Transaction Paper 2013
Annotating search results from web databases-IEEE Transaction Paper 2013Annotating search results from web databases-IEEE Transaction Paper 2013
Annotating search results from web databases-IEEE Transaction Paper 2013
Yadhu Kiran
 
No sql databases
No sql databasesNo sql databases
No sql databases
Walaa Hamdy Assy
 
Databases and its representation
Databases and its representationDatabases and its representation
Databases and its representation
Ruhull
 
Mdb dn 2016_04_check_constraints
Mdb dn 2016_04_check_constraintsMdb dn 2016_04_check_constraints
Mdb dn 2016_04_check_constraints
Daniel M. Farrell
 
Chapter02
Chapter02Chapter02
Chapter02
sasa_eldoby
 
Beyond Aurora. Scale-out SQL databases for AWS
Beyond Aurora. Scale-out SQL databases for AWS Beyond Aurora. Scale-out SQL databases for AWS
Beyond Aurora. Scale-out SQL databases for AWS
Clustrix
 
Redis vs Memcached
Redis vs MemcachedRedis vs Memcached
Redis vs Memcached
Gaurav Agrawal
 

Similar to Row or Columnar Database (20)

Choosing your NoSQL storage
Choosing your NoSQL storageChoosing your NoSQL storage
Choosing your NoSQL storage
 
Rdbms vs. no sql
Rdbms vs. no sqlRdbms vs. no sql
Rdbms vs. no sql
 
Vertica
VerticaVertica
Vertica
 
MapReduce and parallel DBMSs: friends or foes?
MapReduce and parallel DBMSs: friends or foes?MapReduce and parallel DBMSs: friends or foes?
MapReduce and parallel DBMSs: friends or foes?
 
Database
DatabaseDatabase
Database
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 
Bigtable
Bigtable Bigtable
Bigtable
 
Rise of Column Oriented Database
Rise of Column Oriented DatabaseRise of Column Oriented Database
Rise of Column Oriented Database
 
White paper on cassandra
White paper on cassandraWhite paper on cassandra
White paper on cassandra
 
NOSQL and MongoDB Database
NOSQL and MongoDB DatabaseNOSQL and MongoDB Database
NOSQL and MongoDB Database
 
2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptx2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptx
 
Column oriented Transactions
Column oriented TransactionsColumn oriented Transactions
Column oriented Transactions
 
Storage cassandra
Storage   cassandraStorage   cassandra
Storage cassandra
 
Annotating search results from web databases-IEEE Transaction Paper 2013
Annotating search results from web databases-IEEE Transaction Paper 2013Annotating search results from web databases-IEEE Transaction Paper 2013
Annotating search results from web databases-IEEE Transaction Paper 2013
 
No sql databases
No sql databasesNo sql databases
No sql databases
 
Databases and its representation
Databases and its representationDatabases and its representation
Databases and its representation
 
Mdb dn 2016_04_check_constraints
Mdb dn 2016_04_check_constraintsMdb dn 2016_04_check_constraints
Mdb dn 2016_04_check_constraints
 
Chapter02
Chapter02Chapter02
Chapter02
 
Beyond Aurora. Scale-out SQL databases for AWS
Beyond Aurora. Scale-out SQL databases for AWS Beyond Aurora. Scale-out SQL databases for AWS
Beyond Aurora. Scale-out SQL databases for AWS
 
Redis vs Memcached
Redis vs MemcachedRedis vs Memcached
Redis vs Memcached
 

More from Biju Nair

Chef conf-2015-chef-patterns-at-bloomberg-scale
Chef conf-2015-chef-patterns-at-bloomberg-scaleChef conf-2015-chef-patterns-at-bloomberg-scale
Chef conf-2015-chef-patterns-at-bloomberg-scale
Biju Nair
 
HBase Internals And Operations
HBase Internals And OperationsHBase Internals And Operations
HBase Internals And Operations
Biju Nair
 
Apache Kafka Reference
Apache Kafka ReferenceApache Kafka Reference
Apache Kafka Reference
Biju Nair
 
Serving queries at low latency using HBase
Serving queries at low latency using HBaseServing queries at low latency using HBase
Serving queries at low latency using HBase
Biju Nair
 
Multi-Tenant HBase Cluster - HBaseCon2018-final
Multi-Tenant HBase Cluster - HBaseCon2018-finalMulti-Tenant HBase Cluster - HBaseCon2018-final
Multi-Tenant HBase Cluster - HBaseCon2018-final
Biju Nair
 
Cursor Implementation in Apache Phoenix
Cursor Implementation in Apache PhoenixCursor Implementation in Apache Phoenix
Cursor Implementation in Apache Phoenix
Biju Nair
 
Hadoop security
Hadoop securityHadoop security
Hadoop security
Biju Nair
 
Chef patterns
Chef patternsChef patterns
Chef patterns
Biju Nair
 

More from Biju Nair (8)

Chef conf-2015-chef-patterns-at-bloomberg-scale
Chef conf-2015-chef-patterns-at-bloomberg-scaleChef conf-2015-chef-patterns-at-bloomberg-scale
Chef conf-2015-chef-patterns-at-bloomberg-scale
 
HBase Internals And Operations
HBase Internals And OperationsHBase Internals And Operations
HBase Internals And Operations
 
Apache Kafka Reference
Apache Kafka ReferenceApache Kafka Reference
Apache Kafka Reference
 
Serving queries at low latency using HBase
Serving queries at low latency using HBaseServing queries at low latency using HBase
Serving queries at low latency using HBase
 
Multi-Tenant HBase Cluster - HBaseCon2018-final
Multi-Tenant HBase Cluster - HBaseCon2018-finalMulti-Tenant HBase Cluster - HBaseCon2018-final
Multi-Tenant HBase Cluster - HBaseCon2018-final
 
Cursor Implementation in Apache Phoenix
Cursor Implementation in Apache PhoenixCursor Implementation in Apache Phoenix
Cursor Implementation in Apache Phoenix
 
Hadoop security
Hadoop securityHadoop security
Hadoop security
 
Chef patterns
Chef patternsChef patterns
Chef patterns
 

Recently uploaded

Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 

Recently uploaded (20)

Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 

Row or Columnar Database

  • 1. Row or Columnar Database 1 ©asquareb llc If someone is evaluating database or data stores to use in their application, there are so many options to choose from especially in the data ware house space. If narrowed down to the relational database (RDBMS) paradigm, one of the choices to make is whether to use row based or columnar based database. Vendors claim superiority of one over the other on whether their product is columnar or row based. So we looked into the details about columnar and row databases to understand the fundamental differences. The following is the summary. Why the name? Row Based RDBMS (R-RDBMS): In a row based DBMS, data related to a tuple (row) i.e. all the column data are stored contiguously on disk. For IO efficiency, disk reads and writes are done at block size, for e.g. a 4K (4096 byte) block size by Operating Systems. Database management systems use “pages” of size which is a multiple of the block size to read and write data to disk. In R-RDBMS rows of data are stored in data pages and if the row size is less than half the page size, then multiple rows are stored in a single page. When a row is required by a query, the whole page in which the row is stored is retrieved back from the disk into the memory for further processing. The following is a representative page layout based on one of the RDBMS used in the industry. As we can see the R-RDBMS stores additional information regarding each page and rows with in the page to help with maintaining the ACID (Atomicity, Consistency, Isolation, Durability) property expected out of RDBMS. In order to improve query performance, R-RDBMS uses additional structure called indexes. Indexes store the indexed column value, the page in which the row with the indexed column value is stored on disk and the offset within the page to reach out to the particular row. If an index is not present, when a query is executed against a table, the DBMS needs to read through all the pages from the disk pertaining to the table to find the rows which satisfy the query. If the index is present and the query uses the indexed columns in its predicate, the DBMS can use the index to identify the rows which satisfy the query and read the pages where the rows are stored reducing the time to identify the rows. This also reduces the amount of data read from the disk i.e. Data Page Page Header (20 Bytes) Row Header (6 bytes) ( Row Data Row Header (6 bytes) ( Trail byte Row Pointer Row Pointer Col 1 Col 1 Col 2 Col 2 … Row Data
  • 2. Row or Columnar Database 2 ©asquareb llc reducing the slowest operation in the query processing sequence which is disk IO. The following is a representative index page in a typical DBMS. There can be other representations based on various index structures like BTree, Bit map etc. Columnar RDBMS (C-RDBMS): As you may have guessed, columnar databases store each column data from all the tuples together. The following diagram shows the translation of data storage between a row based DBMS and columnar DBMS. Contrary to storing all the column data corresponding to a row sequentially in a page, values for each column in rows are stored together in the same page. This results in data for each row getting stored in different pages. When a query requires data for a row, column data for the row is pulled from all the pages storing the column values, appends them together before returning it to the user as a single row. The sequence in which the column values are stored in the database pages determines the row to which corresponds to. For e.g. the second entry stored in the various pages “ID2”, “Mark”, “Waugh”, ”Researcher” corresponds to the same row which is row 2. By storing the columns separately, each column acts as an index since the sequence of storage identifies the row. For e.g. if a query requests the row with first name ‘Steve’ the DBMS can identify the row number using Index Page Page Header Col 2 Page Id + Offset Col 2 Page Id + Offset ID1 ID2 ID3 Mark Antony Mark Waugh Steve Aurelius Engineer Researcher Engineer ID1 ID2 ID3 … Mark Mark Steve … Antony Waugh Aurelius … Engineer Researcher Engineer … Row DB page to Columnar DB page Data Page Page Header (20 Bytes) Row Header (6 bytes) ( Row Data Row Header (6 bytes) ( Trail byte Row Pointer Row Pointer Col 1 Col 2 … Row Data Col 1 Col 2
  • 3. Row or Columnar Database 3 ©asquareb llc the pages storing the column “First Name” which in this case is row 3. Then the DBMS can retrieve the third entry from pages storing all the other column values stitch them together and return the row back to the user. How they differ? Given the understanding of the key difference between R and C-RDBMS, we can look at how they differ operationally.  If the usage pattern involves retrieval and update of all or most of the columns in a row like in an OLTP application, then R-RDBMS is a better option than the C-RDBMS. The reason being that the C-RDBMS needs to retrieve the columns values separately and stitch them together to return the row as a response and this doesn’t provide the performance expected in an OLTP environment. Also in C- RDBMS updates need to be made on multiple pages in contrast to updating a single page in R-RDBMS which is inefficient. C-RDBMS is primarily suited for data ware housing where the usage pattern is read only.  If the usage pattern involves retrieval of all the columns in a row in bulk, then R-RDBMS is a better option due to the same reason described above. But if the bulk retrieval involves only a small subset of columns, then the C-RDBMS will perform better. The reason being that C-RDBMS can deal with the subset of columns since they are stored separately while R-RDBMS need to bring in all the rows and columns into memory from the disk and process it through CPU to eliminate the unwanted columns. Some R-RDBMS products like Netezza may be able to eliminate the unwanted columns using specialized hardware during disk read but still need to deal with all the rows columns.  If the usage pattern involves aggregation on columns then C-RDBMS performs better than the R- RDBMS since they can act on individual columns efficiently compared to R-RDBMS.  C-RDBMS can implement optimization techniques like late materialization where conditions on columns can be applied separately, identify the rows which satisfies all the conditions before retrieving the columns to generate rows whereas R-RDBMS needs to retrieve rows much earlier to identify the satisfying rows and to return to the user.  Storage required for C-RDBMS will be less than the R-RDBMS since they don’t have the same page and row overheads as R-RDBMS. Also they don’t need additional structures like indexes since the columns themselves act as indexes.  Compression on data is efficient on C-RDBMS since data which are similar are stored together compared to R-RDBMS where mixed data in rows are stored together. This helps reduce space usage in C-RDBMS and also improves the disk IO since the data is much compressed. Can R-RDBMS implement C-RDBMS? One can try to mimic C-RDBMS storage in an R-RDBMS using any of the following techniques  Store columns as separate tables with a common identifier column to identify the row to which the columns value corresponds to.
  • 4. Row or Columnar Database 4 ©asquareb llc  Create indexes for each of the columns in a table so that queries can be satisfied by using the indexes only. Also there are commercial DBMS products which support both columnar and row based storage. Apart from the increased (more than double) storage requirement to implement these techniques, research from MIT database group shows that these techniques do not provide the same performance as the C-RDBMS for all the usage patterns for which C-RDBMS is best suited for. Summary C-RDBMS are more suited for data warehousing use cases and it is how they are utilized currently in the industry. Also C-RDBMS may perform much better when usage involves small set of column retrieval and column aggregations. R-RDBMS are good for use cases where data is dealt at the row level and where updates are often made on rows. C-RDBMS and R-RDBMS vendors may find ways to incorporate some of the advantages of the other in their product. The key is to understand the data usage pattern and choose the best product which matches the usage. Even though we have eliminated other complexities in a typical DBMS system and looked only at the fundamental difference between R and C RDBMS, hope it helps you choose the best option for your application. bnair@asquareb.com blog.asquareb.com https://github.com/bijugs @gsbiju http://www.slideshare.net/bijugs