NoSQL databases take different approaches to storing and querying data compared to relational databases. Key-value databases store data as unstructured blobs associated with keys, documents databases store hierarchical data as documents, columnar databases store data by column rather than by row for improved analytics performance, and graph databases natively represent relationships between nodes. Aggregate-oriented NoSQL databases group and store related data together for faster access compared to retrieving scattered relational data.
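The storage models described above can be sketched with plain Python structures (no database required); the shapes below are illustrative only, showing how the same user record might look under three of the models.

```python
# Key-value: the value is an opaque blob looked up only by its key.
kv_store = {"user:42": b'{"name": "Ada", "city": "London"}'}

# Document: the record is a hierarchical document the database can inspect.
doc_store = {"user:42": {"name": "Ada", "address": {"city": "London"}}}

# Column-oriented: values for one attribute are stored together across rows,
# which is what makes scans over a single column fast.
column_store = {
    "name": {"user:42": "Ada", "user:43": "Grace"},
    "city": {"user:42": "London", "user:43": "New York"},
}

print(column_store["city"]["user:42"])  # a single-column lookup: London
```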
Cassandra from the trenches: migrating Netflix (update), by Jason Brown
An updated talk on Cassandra at Netflix, presented at the Silicon Valley NoSQL meetup on 9 Feb 2012. It includes an introduction to Astyanax, an open-source Cassandra client written in Java.
Information technology has led us into an era in which producing, sharing, and using information are part of everyday life, often without our being fully aware of it: it is now almost impossible not to leave a digital trail of many of the actions we perform every day, for example through digital content such as photos, videos, and blog posts, and through everything that revolves around social networks (Facebook and Twitter in particular). Added to this, with the "Internet of Things" we see a growing number of devices such as watches, bracelets, thermostats, and many other items able to connect to the network and therefore generate large data streams. This explosion of data explains the emergence of the term Big Data: data produced in large quantities, at remarkable speed, and in different formats, which requires processing technologies and resources that go far beyond conventional data management and storage systems. It is immediately clear that, in these contexts, 1) storage models based on the relational model and 2) processing systems based on stored procedures and grid computation are no longer applicable. Regarding point 1, RDBMSs, widely used for a great variety of applications, run into problems when the amount of data grows beyond certain limits. Scalability and implementation cost are only part of the disadvantages: very often, when dealing with big data, variability, that is, the lack of a fixed structure, also represents a significant problem. This has given a boost to the development of NoSQL databases. The website NoSQL Databases defines NoSQL databases as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable."
These databases are distributed, open source, horizontally scalable, free of a predetermined schema (key-value, column-oriented, document-based, and graph-based), easily replicable, not bound to ACID guarantees, and able to handle large amounts of data. They are typically integrated with processing tools based on the MapReduce paradigm proposed by Google in 2004. MapReduce, together with the open-source Hadoop framework, represents the new model for distributed processing of large amounts of data, supplanting techniques based on stored procedures and computational grids (point 2). The relational model, as taught in basic database design courses, has many limitations compared to the demands posed by new applications, which use Big Data and NoSQL databases to store data and MapReduce to process it.
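The MapReduce paradigm mentioned above can be sketched in a few lines: a map function emits (key, value) pairs, a shuffle step groups values by key, and a reduce function aggregates each group. This is a single-process toy version of the classic word count; real frameworks such as Hadoop run the same three phases distributed across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit (word, 1) for every word in the input document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Aggregate all values emitted for one key."""
    return (word, sum(counts))

def map_reduce(documents):
    groups = defaultdict(list)          # the "shuffle" step: group by key
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = map_reduce(["big data", "big ideas"])
print(counts)  # {'big': 2, 'data': 1, 'ideas': 1}
```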
Course Website http://pbdmng.datatoknowledge.it/
Contact me for further information and to download the slides
MongoDB is a horizontally scalable, schema-free, document-oriented NoSQL database. It stores data in flexible, JSON-like documents, allowing for easy storage and retrieval of data without rigid schemas. MongoDB provides high performance, high availability, and easy scalability. Some key features include embedded documents and arrays to reduce joins, dynamic schemas, replication and failover for availability, and auto-sharding for horizontal scalability.
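The "embedded documents and arrays to reduce joins" feature can be illustrated with a plain Python dict standing in for MongoDB's BSON document (a sketch only; with the real pymongo driver the same structure would be passed to a collection's insert method): instead of a separate orders table joined by foreign key, the orders live inside the user document itself.

```python
user = {
    "_id": "u1",
    "name": "Ada",
    "orders": [                      # embedded array: no join needed
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}

# One read retrieves the user together with all of their orders.
total_items = sum(o["qty"] for o in user["orders"])
print(total_items)  # 3
```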
NoSQL stands for "not only SQL." NoSQL databases are databases that store data in a format other than relational tables. Most NoSQL (non-relational) databases, with the notable exception of graph databases, do not model relationships between records as efficiently as relational joins do, so highly relational data can be a poor fit for them.
This document provides an overview of SQL Server database development concepts including SQL Server objects, tables, data types, relationships, constraints, indexes, views, queries, joins, stored procedures and more. It begins with introductory content on SQL Server and databases and then covers these topics through detailed explanations and examples in a structured outline.
The document discusses NoSQL databases and MapReduce. It provides historical context on how databases were not adequate for the large amounts of data being accumulated from the web. It describes Brewer's Conjecture and CAP Theorem, which contributed to the rise of NoSQL databases. It then defines what NoSQL databases are, provides examples of different types, and discusses some large-scale implementations like Amazon SimpleDB, Google Datastore, and Hadoop MapReduce.
Graph databases store data in graph structures with nodes, edges, and properties. Neo4j is a popular open-source graph database that uses a property graph model. It has a core API for programmatic access, indexes for fast lookups, and Cypher for graph querying. Neo4j provides high availability through master-slave replication and scales horizontally by sharding graphs across instances through techniques like cache sharding and domain-specific sharding.
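The property graph model described above can be sketched as a toy in-memory structure (a hypothetical illustration, not Neo4j's actual storage format): nodes and edges each carry properties, and a traversal function plays the role that a declarative Cypher query such as `MATCH (a:Person)-[:KNOWS]->(b) RETURN b.name` would in Neo4j.

```python
nodes = {
    1: {"label": "Person", "name": "Alice"},
    2: {"label": "Person", "name": "Bob"},
    3: {"label": "Person", "name": "Carol"},
}
edges = [  # (source, relationship type, target, edge properties)
    (1, "KNOWS", 2, {"since": 2010}),
    (1, "KNOWS", 3, {"since": 2015}),
]

def known_by(node_id):
    """Names of nodes reachable from node_id over a KNOWS edge."""
    return [nodes[dst]["name"] for src, rel, dst, _ in edges
            if src == node_id and rel == "KNOWS"]

print(known_by(1))  # ['Bob', 'Carol']
```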
This document discusses different types of distributed databases. It covers data models like relational, aggregate-oriented, key-value, and document models. It also discusses different distribution models like sharding and replication. Consistency models for distributed databases are explained including eventual consistency and the CAP theorem. Key-value stores are described in more detail as a simple but widely used data model with features like consistency, scaling, and suitable use cases. Specific key-value databases like Redis, Riak, and DynamoDB are mentioned.
Object relational database management system, by Saibee Alam
This presentation provides a full explanation of object-relational database management systems. It is part of an advanced database management course and an important topic in computer science, whether you are a UG/PG student or preparing for a competitive exam.
This document provides an overview and introduction to NoSQL databases. It discusses key-value stores like Dynamo and BigTable, which are distributed, scalable databases that sacrifice complex queries for availability and performance. It also explains column-oriented databases like Cassandra that scale to massive workloads. The document compares the CAP theorem and consistency models of these databases and provides examples of their architectures, data models, and operations.
PostgreSQL is an open-source object-relational database management system descended from POSTGRES. It supports many SQL standards and features extensions like user-defined data types, functions, operators and index methods. Transactions in PostgreSQL provide ACID properties including atomicity, consistency, isolation and durability to maintain data integrity during concurrent operations.
This presentation about HBase will help you understand what HBase is, what the applications of HBase are, how HBase differs from an RDBMS, what HBase storage is, and what the architectural components of HBase are; at the end, we will also look at some HBase commands in a demo. HBase is an essential part of the Hadoop ecosystem. It is a column-oriented database management system derived from Google's NoSQL database Bigtable that runs on top of HDFS. After watching this video, you will know how to store and process large datasets using HBase. Now, let us get started and understand HBase and what it is used for.
Below topics are explained in this HBase presentation:
1. What is HBase?
2. HBase Use Case
3. Applications of HBase
4. HBase vs RDBMS
5. HBase Storage
6. HBase Architectural Components
What is this Big Data Hadoop training course about?
Simplilearn’s Big Data Hadoop training course lets you master the concepts of the Hadoop framework and prepares you for Cloudera’s CCA175 Big Data certification. The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distribution datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
This presentation covers NoSQL databases and the different types of NoSQL databases. It also discusses how MongoDB works, MongoDB security, and MongoDB sharding.
This document provides an overview of CouchDB, a NoSQL document database. It discusses key concepts like the CAP theorem and different categories of NoSQL databases. It then describes CouchDB in more detail, covering how to interact with data via REST APIs and CURL, use design documents to define views and validation, and handle data replication and conflicts. Map/reduce functions are used to query the data and build indexes.
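The map/reduce views mentioned above can be sketched as follows: a map function emits (key, value) rows for each document, and a reduce folds the rows per key into an index. (In CouchDB these functions live in a design document and are written in JavaScript; Python is used here purely for illustration, and the document fields are invented.)

```python
from collections import defaultdict

docs = [
    {"_id": "1", "type": "order", "customer": "ada",   "total": 10},
    {"_id": "2", "type": "order", "customer": "grace", "total": 25},
    {"_id": "3", "type": "order", "customer": "ada",   "total": 5},
]

def map_fn(doc):
    """Emit (customer, total) for each order document."""
    if doc.get("type") == "order":
        yield (doc["customer"], doc["total"])

def build_view(documents, reduce_fn=sum):
    index = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            index[key].append(value)
    return {k: reduce_fn(v) for k, v in index.items()}

print(build_view(docs))  # {'ada': 15, 'grace': 25}
```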
DynamoDB is a key-value database that achieves high availability and scalability through several techniques:
1. It uses consistent hashing to partition and replicate data across multiple storage nodes, allowing incremental scalability.
2. It employs vector clocks to maintain consistency among replicas during writes, decoupling version size from update rates.
3. For handling temporary failures, it uses sloppy quorum and hinted handoff to provide high availability and durability guarantees when some replicas are unavailable.
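The consistent hashing of point 1 can be sketched with a minimal hash ring (illustrative only; Dynamo's production scheme adds virtual nodes and preference lists). Each node owns the arc of the ring up to its position, so adding or removing one node moves only the keys on its arc, which is what enables incremental scalability.

```python
import bisect
import hashlib

def _hash(value):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self._ring = sorted((_hash(n), n) for n in nodes)
        self._positions = [h for h, _ in self._ring]

    def node_for(self, key):
        """First node clockwise from the key's position on the ring."""
        idx = bisect.bisect(self._positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")   # deterministic owner for this key
```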
In this lecture we analyze document-oriented databases. In particular, we consider why they were the first approach to NoSQL and what their main features are. Then we analyze MongoDB as an example, covering the data model, CRUD operations, write concerns, and scaling (replication and sharding).
Finally, we present other document-oriented databases and discuss when to use, and when not to use, a document-oriented database.
Structured Query Language (SQL) - Lecture 5 - Introduction to Databases (1007..., by Beat Signer
The document discusses Structured Query Language (SQL) and its history and components. It notes that SQL is a declarative query language used to define database schemas, manipulate data through queries, and control transactions. The document outlines SQL's data definition language for defining schemas and data manipulation language for querying and modifying data. It also provides examples of SQL statements for creating tables and defining constraints.
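The DDL/DML split described above can be demonstrated with a small runnable example using Python's built-in sqlite3 module (standard SQL, though SQLite's dialect differs from other engines in places; the table and columns are invented for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition language: create a table with constraints.
conn.execute("""
    CREATE TABLE employee (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        salary  REAL CHECK (salary >= 0)
    )
""")

# Data manipulation language: insert rows and query them back.
conn.execute("INSERT INTO employee (name, salary) VALUES (?, ?)",
             ("Ada", 5000.0))
row = conn.execute("SELECT name, salary FROM employee").fetchone()
print(row)  # ('Ada', 5000.0)
```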
The document provides an overview of column databases. It begins with a quick recap of different database types and then defines and discusses column databases and column-oriented databases. It explains that column databases store data by column rather than by row, allowing for faster access to specific columns of data. Examples of column databases discussed include Cassandra, HBase, and Vertica. The document then focuses on Cassandra, describing its data model using concepts like keyspaces and column families. It also explains Cassandra's database engine architecture featuring memtables, SSTables, and compaction. The document concludes by mentioning some large companies that use Cassandra in production systems.
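The memtable/SSTable write path described above can be sketched in a simplified form (real SSTables are immutable on-disk files with indexes and bloom filters, and compaction later merges them; here they are just sorted in-memory snapshots): writes go to a memtable, a full memtable is flushed as a sorted table, and reads consult the memtable first and then the flushed tables newest-first.

```python
MEMTABLE_LIMIT = 2

memtable = {}
sstables = []  # sorted key-value snapshots, oldest first

def write(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:          # flush threshold reached
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

def read(key):
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):             # newest flush wins
        if key in table:
            return table[key]
    return None

write("a", 1)
write("b", 2)   # triggers a flush to an SSTable
write("a", 9)   # newer value shadows the flushed one
print(read("a"), read("b"))  # 9 2
```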
Concepts of Apache Hive in Big Data.
Contains:
What is Hive?
Why Hive?
How Hive works
Hive architecture
Data models in Hive
Pros and cons of Hive
HiveQL
Pig vs Hive
Cassandra is an open-source, distributed database management system designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, linear scalability and performance, as well as schema flexibility. Cassandra is used at large companies like Facebook, Netflix, and eBay for its ability to scale and perform well under heavy loads. However, it may not be suited for applications requiring many joins, transactions, or strong consistency guarantees.
The presentation provides an overview of NoSQL databases, including a brief history of databases, the characteristics of NoSQL databases, different data models like key-value, document, column family and graph databases. It discusses why NoSQL databases were developed as relational databases do not scale well for distributed applications. The CAP theorem is also explained, which states that only two out of consistency, availability and partition tolerance can be achieved in a distributed system.
This document revisits SQL basics and advanced topics. It covers objectives, assumptions, and topics including staying clean with conventions, data types, revisiting basics, joins, subqueries, joins versus subqueries, GROUP BY, set operations, and CASE statements. The topics section provides details on each topic, with examples to deepen SQL knowledge and help write better queries.
In this introduction to Apache Hive the following topics are covered:
1. Hive Introduction
2. Hive origin
3. Where does Hive fall in Big Data stack
4. Hive architecture
5. Its job execution mechanisms
6. HiveQL and Hive Shell
7. Types of tables
8. Querying data
9. Partitioning
10. Bucketing
11. Pros
12. Limitations of Hive
This document provides an overview of different database types including relational, NoSQL, document, key-value, graph, and column family databases. It discusses the history and drivers behind the development of NoSQL databases, as well as concepts like horizontal scaling, the CAP theorem, and eventual consistency. Specific databases are also summarized, including MongoDB, Redis, Neo4j, and HBase.
The document discusses various techniques for optimizing SQL Server performance, including handling index fragmentation, optimizing files and partitioning tables, effective use of SQL Profiler and Performance Monitor, a methodology for performance troubleshooting, and a 10-step process for performance optimization. Some key points covered are determining and resolving index fragmentation, partitioning tables across multiple file groups, capturing traces with SQL Profiler and Performance Monitor counters to diagnose issues, and ensuring proper indexing through query execution plans and the SQL Server Database Engine Tuning Advisor.
The document discusses NoSQL databases and big data frameworks. It defines NoSQL databases as next generation databases that are non-relational, distributed, open-source and horizontally scalable. It describes four main categories of NoSQL databases - document databases, key-value stores, column-oriented databases and graph databases. It also discusses properties of NoSQL databases and provides examples of popular NoSQL databases. The document then discusses big data frameworks like Hadoop and its ecosystem including HDFS, MapReduce, YARN and Hadoop Common. It provides details on how these components work together to process large datasets in a distributed manner.
The document provides an overview of high performance scalable data stores, also known as NoSQL systems, that have been introduced to provide faster indexed data storage than relational databases. It discusses key-value stores, document stores, extensible record stores, and relational databases that provide horizontal scaling. The document contrasts several popular NoSQL systems, including Redis, Scalaris, Tokyo Tyrant, Voldemort, Riak, and SimpleDB, focusing on their data models, features, performance, and tradeoffs between consistency and scalability.
This document provides an overview of NoSQL databases and MongoDB. It states that NoSQL databases are more scalable and flexible than relational databases. MongoDB is described as a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability. MongoDB uses collections and documents to store data in a flexible, JSON-like format.
1) Organizations now deal with huge amounts of data both internally and externally generated to better understand their business and customers.
2) Relational databases cannot effectively handle this big data due to challenges in data structure, scaling, and speed.
3) NoSQL databases provide alternatives to store structured, semi-structured, and unstructured data across different data models like columnar, key-value, document, and graph. Each type has different properties suited for various use cases.
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICSRUHULAMINHAZARIKA
Apache Hive is a data warehousing tool built on top of Hadoop that allows users to query and manage large datasets using SQL. It is targeted towards users familiar with SQL and allows them to write queries in a language called HiveQL, which is similar to SQL. Hive allows SQL queries to be parallelized into map/reduce jobs that run on Hadoop clusters. Hive also supports partitioning of tables to improve query performance on large datasets.
The document discusses different NoSQL database models and when each may be appropriate to use. It notes that relational databases can scale with enough effort, but using the proper NoSQL model for one's data avoids unnecessary layers of abstraction. Key-value stores are best for simple dictionaries or session data, while document stores allow for querying inner document values and are well-suited for documents like blogs. Column-family databases are optimized for high write volumes with small chance of collisions. Graph databases are best for data inherently involving nodes and relationships like social networks. The best approach is "polyglot persistence", using the database model that best represents each slice of data rather than forcing all data into a single model.
The document provides an introduction to NOSQL databases. It begins with basic concepts of databases and DBMS. It then discusses SQL and relational databases. The main part of the document defines NOSQL and explains why NOSQL databases were developed as an alternative to relational databases for handling large datasets. It provides examples of popular NOSQL databases like MongoDB, Cassandra, HBase, and CouchDB and describes their key features and use cases.
This document provides an overview of NoSQL databases and the HBase framework. It discusses key aspects of NoSQL including advantages like high scalability and schema flexibility. It then describes the different categories of NoSQL databases including key-value, column-oriented, graph and document oriented. The document proceeds to explain aggregate data models and how key-value and document databases are aggregate-oriented. It provides details on HBase, describing it as a column-oriented database, and its architecture, data model involving tables, rows, column families and cells.
Big Data Frameworks: Introduction to NoSQL – Aggregate Data Models – Hbase: Data Model and Implementations – Hbase Clients – Examples – .Cassandra: Data Model – Examples – Cassandra Clients – Hadoop Integration. Pig – Grunt – Pig Data Model – Pig Latin – developing and testing Pig Latin scripts. Hive – Data Types and File Formats – HiveQL Data Definition – HiveQL Data Manipulation – HiveQL Queries
This document provides an introduction to NoSQL and MongoDB. It discusses that NoSQL is a non-relational database management system that avoids joins and is easy to scale. It then summarizes the different flavors of NoSQL including key-value stores, graphs, BigTable, and document stores. The remainder of the document focuses on MongoDB, describing its structure, how to perform inserts and searches, features like map-reduce and replication. It concludes by encouraging the reader to try MongoDB themselves.
This document provides an introduction to MongoDB, a non-relational NoSQL database. It discusses what NoSQL databases are and their benefits compared to SQL databases, such as being more scalable and able to handle large, changing datasets. It then describes key features of MongoDB like high performance, rich querying, and horizontal scalability. The document outlines concepts like document structure, collections, and CRUD operations in MongoDB. It also covers topics such as replication, sharding, and installing MongoDB.
NoSQL is a non-relational database designed for large-scale data storage needs. It has several key features: it is non-relational, schema-free, uses simple APIs, and is distributed. The four main types of NoSQL databases are key-value, column-oriented, document-oriented, and graph-based. Key advantages of NoSQL include scalability, flexibility in data structures, and ease of development. However, NoSQL sacrifices some consistency and lacks standardization compared to SQL databases.
This document provides an introduction and overview of MongoDB. It begins with definitions of NoSQL databases and describes the main types: key-value stores, wide column stores, document stores, and graph stores. It then discusses MongoDB specifically, describing it as a free, open-source, document-oriented database that uses JSON-like documents with dynamic schemas. The document outlines how to quickly install MongoDB using Docker, and how to perform basic CRUD operations like creating databases and collections, inserting, reading, updating, and deleting documents. It also discusses some key MongoDB concepts like its support for the CAP theorem prioritizing availability and partition tolerance over strong consistency.
The document discusses database hardware requirements like RAM, disk space, processors and networks and how they impact database performance. It also covers topics like transaction logging, how databases and their related files are structured, and the different SQL data types and statements used to work with databases. Various SQL objects like tables, views, indexes and their creation are explained along with examples.
This document provides an overview of NoSQL databases. It discusses that NoSQL databases offer more flexibility, higher performance, scalability, and choices compared to relational databases. The four main types of NoSQL databases are column family stores, key-value stores, document stores, and graph stores. Each has their own advantages and disadvantages for storing and querying data.
Domain Driven Design is a software development process that focuses on finding a common language for the involved parties. This language and the resulting models are taken from the domain rather than the technical details of the implementation. The goal is to improve the communication between customers, developers and all other involved groups. Even if Eric Evan's book about this topic was written almost ten years ago, this topic remains important because a lot of projects fail for communication reasons.
Relational databases have their own language and influence the design of software into a direction further away from the Domain: Entities have to be created for the sole purpose of adhering to best practices of relational database. Two kinds of NoSQL databases are changing that: Document stores and graph databases. In a document store you can model a "contains" relation in a more natural way and thereby express if this entity can exist outside of its surrounding entity. A graph database allows you to model relationships between entities in a straight forward way that can be expressed in the language of the domain.
In this talk I want to look at the way a multi model database that combines a document store and a graph database can help you to model your problems in a way that is understandable for all parties involved, and explain the benefits of this approach for the software development process.
MongoDB is a document-oriented NoSQL database that uses JSON-like documents with optional schemas. It provides high performance, high availability, and easy scalability. MongoDB is also called "humongous" because it is designed to store and handle large volumes of data. Some key advantages of MongoDB include its ability to handle large, unstructured data sets and provide agile development with quick code iterations.
This document provides an introduction and overview of NOSQL databases. It defines NOSQL as "not only SQL" databases that are an alternative to traditional relational databases. The key advantages of NOSQL databases are that they can handle huge datasets, scale easily, and provide fast and flexible querying. The main types of NOSQL databases are described as key-value stores, document databases, graph databases, and column-oriented databases. Examples of popular NOSQL databases and real-world uses by companies are also provided.
Columnar databases store data by columns rather than rows. This column-oriented approach keeps all attribute information together, improving query performance for analytics workloads that retrieve subsets of columns. However, it increases overhead for write operations like inserts due to needing to modify all columns for each row. Columnar databases are well-suited for analytical workloads with many reads and few writes, like data warehousing.
OAuth and OpenID Connect are authorization frameworks that enable third party applications (API clients) to obtain limited access to RESTful APIs on behalf of resource owners. OAuth allows API clients to obtain authorization grants, which can be exchanged for access tokens to make requests to the API. OpenID Connect is used by API clients to obtain information about the authentication of the resource owner performed by the authorization server in an ID token.
The document provides an overview of SAML (Security Assertion Markup Language), including its main components and use cases. It discusses SAML assertions, which contain statements to describe authentication, attributes, and authorization information. SAML defines request/response protocols, bindings to transport messages over protocols like HTTP, and profiles that combine assertions, protocols and bindings to provide interoperability for specific use cases. A key use case is web single sign-on, where the SAML web browser SSO profile defines how assertions, messages and bindings are used to enable SSO between an identity provider and service provider.
Kafka is an open-source distributed event streaming platform used for building real-time data pipelines and streaming apps. It allows applications to publish and subscribe to streams of records, and processes large amounts of continuous data easily and reliably. Producers write data to topics which are divided into partitions. Consumers can join a consumer group to read from topics and process the data in parallel. Records are stored on disk for a configurable period to allow consumption from past records.
The document provides an overview of containers and Kubernetes. It discusses the need for containers due to microservices and infrastructure as code. It then covers technical details of containers like Dockerfiles, images, and registries. It also discusses Kubernetes and its components like kube-apiserver, etcd, and kubelet. Finally, it covers Kubernetes concepts like pods, services, deployments, and how they are configured.
ZooKeeper is a distributed coordination service that allows distributed applications to synchronize data and configuration. It provides a simple API for applications to read, write, and watch a shared hierarchical data structure called a znode tree that is replicated across servers. ZooKeeper addresses the need for distributed applications like Hadoop and Kafka to coordinate tasks and share configuration through a common data store that remains available even if individual servers fail.
- Leo's notes summarize Oracle Database components including metadata, control files, user data, database, Oracle instance, background processes, online redo logs, archive logs, and data files.
- The notes also cover Oracle Database configuration including Oracle homes, Oracle base, data file locations, redo log groups, and archive log destinations.
- Key processes like the log writer process and database writer process are described as well as their roles in writing redo logs and data to disk.
Application Continuity with Oracle DB 12c Léopold Gault
Application Continuity is a feature of Oracle database 12c, when used through the JDBC replay driver (by java applications). You can benefit from this features when using a RAC or Data Guard.Those are my personal notes on the subject. Views expressed here are my own, and do not necessarily reflect the views of Oracle.
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
How to Get CNIC Information System with Paksim Ga.pptx
NoSQL - Leo's notes
1. NoSQL
Leo’s notes
These slides are Leopold Gault's notes, taken while reading:
• https://www.thoughtworks.com/insights/blog/nosql-databases-overview
• https://www.slideshare.net/arangodb/query-mechanisms-for-nosql-databases
• https://www.slideshare.net/arangodb/introduction-to-column-oriented-databases
• https://neo4j.com/developer/guide-data-modeling/
I am not a NoSQL expert; these notes are just my understanding of the sources above.
4. Relational data models (OLTP and OLAP) vs NoSQL data models
[Diagram comparing the relational data models (transactional/OLTP and analytical/OLAP) with the NoSQL data models.]
Note that they represent a document as a hierarchical tree of data (it makes sense).
I think they meant to represent a star schema.
5. Which data models do I think are meant to be normalized?
[Diagram: on the relational side, the transactional (OLTP) models are normalized and the analytical models are deliberately de-normalized; on the NoSQL side, the data models are not normalized (one is marked "Normalized?").]
6. Which do I think natively support ACID transactions?
[Diagram: the relational data models (transactional/OLTP and analytical) always support ACID transactions; among the NoSQL data models, graph databases do most of the time (e.g. Neo4j), while the others maybe sometimes do.]
8. Why aggregates
Let’s say that my application always uses a set of data like this one. In an RDBMS, such a set of data would have to be fetched from many tables (requiring plenty of JOINs).
9. Why aggregates
We can see that there is a big mismatch between the way the data is aggregated by this application and the way the data was scattered across the tables of the RDBMS.
10. Aggregate-oriented DBMS
NoSQL DBMS (except graph DBMS) are aggregate-oriented.
An aggregate is a set of data that forms the boundaries for ACID operations.
Hence, the “acidity scope” is not at the transaction level, but at the aggregate level. Note however that some aggregate-oriented DBMS also support ACID transactions.
An aggregate’s data have been grouped together only because it makes sense to do so from the application’s point of view. This grouping is masterminded by a human:
• the developer: when coding an app, the developer will try to identify which sets of data will be accessed together by the app, and will hence decide to write/read each such set of data as an aggregate;
• or the creator of materialized views, i.e. new aggregates emitted from disparate data.
11. Why aggregate-oriented DBMS
Working with aggregates is more performant. Indeed, an aggregate is stored together, instead of being scattered among many tables. The same applies when reading: it is quicker to retrieve a set of data that has been stored together than one that has been scattered throughout many tables.
In a cluster of an aggregate-oriented DBMS, an aggregate can live on a single node (or be replicated on the same few nodes). Thus the cluster can scale out without increasing the response time, as sets of data frequently accessed together (i.e. aggregates) are not cut into pieces scattered across many nodes. The same logic applies to sharding (an aggregate belongs to a single shard, instead of many) and replication.
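To make the mismatch concrete, here is a small Python sketch (the order/customer schema and field names are invented for illustration): the same set of data stored as one aggregate, next to a simplified relational layout that has to be re-joined at read time.

```python
# A hypothetical "order" aggregate, stored as one unit (document-store style).
order_aggregate = {
    "id": 99,
    "customer": {"id": 1, "name": "Ada"},
    "items": [
        {"product": "widget", "qty": 2, "price": 9.50},
        {"product": "gadget", "qty": 1, "price": 30.00},
    ],
    "shipping_address": {"city": "Paris", "zip": "75001"},
}

# The same data in (simplified) relational form: scattered across tables,
# each row keyed back to the order or customer it belongs to.
customers = {1: {"name": "Ada"}}
orders = {99: {"customer_id": 1, "city": "Paris", "zip": "75001"}}
order_items = [
    {"order_id": 99, "product": "widget", "qty": 2, "price": 9.50},
    {"order_id": 99, "product": "gadget", "qty": 1, "price": 30.00},
]

def fetch_order_relational(order_id):
    """Reassemble the aggregate from the relational tables (the 'JOINs')."""
    row = orders[order_id]
    cust_id = row["customer_id"]
    return {
        "id": order_id,
        "customer": {"id": cust_id, **customers[cust_id]},
        "items": [i for i in order_items if i["order_id"] == order_id],
        "shipping_address": {"city": row["city"], "zip": row["zip"]},
    }
```

The aggregate version is retrieved with a single lookup by key; the relational version needs one lookup per table plus a scan/join over the items.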
12. About aggregates
Here are 2 formal definitions:
• An aggregate is a collection of data that we interact with as a unit. These units of data (aggregates) form the boundaries for ACID operations (at the aggregate level) with the database. [source1]
• An aggregate defines a collection of related objects that we treat as a unit. This unit is taken as a whole for the context of data manipulation and management of consistency. We update aggregates via atomic operations and communicate with our data storage in terms of aggregates. NoSQL databases, apart from graph databases, have aggregate data models.
However, relational databases have no concept of aggregates within their data model. They are considered aggregate-ignorant.
An aggregate-ignorant model allows you to look at data in different ways, so it’s good when you don’t have a primary structure for manipulating data. Aggregate-ignorant databases, like relational and graph databases, in general support ACID transactions. [source2]
13. Who do I think is aggregate oriented?
Transactional (OLTP)
Yes (1 aggregate = 1 column /
segment of column)
Yes (1 aggregate can be a whole document
(identified by its key),
or a materialized view generated using map-
reduce)
Yes
(1 aggregate = 1 value,
i.e. a BLOB that bundles together
a bunch of data, this bunch is
meaningful only for the app)
No
No
No
Aggregate ignorant
Aggregate oriented data models
Maybe also a column family ?
But I don’t think so
I think the reason why Graph DBs are not “aggregate oriented” is that, despite storing
data as interconnected nodes, a node is probably not considered an aggregate; probably
because the boundaries of an ACID operation extend beyond one node.
15. Key-value DBMS
(diagram: a table of key -> value pairs)
Values are just BLOBs; they have no meaning for the DBMS. The K-V DBMS doesn’t care
what’s inside this BLOB value; it’s up to the app to figure that out.
16. Key-value DBMS
(diagram: key -> value pairs, accessed through an API)
How to query: with a very simple API:
• get the value for a key,
• put a value for a key,
• delete a key-value pair.
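A minimal in-memory sketch of that three-operation API (my own illustration; real stores such as Redis or Riak add persistence and distribution, but the query surface is essentially this small):

```javascript
// Minimal sketch of the key-value API: get / put / delete.
// The value is an opaque BLOB as far as the store is concerned.
class KVStore {
  constructor() { this.data = new Map(); }
  put(key, value) { this.data.set(key, value); }  // put a value for a key
  get(key) { return this.data.get(key); }         // get the value for a key
  del(key) { this.data.delete(key); }             // delete a key-value pair
}

const kv = new KVStore();
kv.put("user:1", "opaque blob the app will decode");
```

Note that there is no query on the value itself: the app must fetch the BLOB by key and decode it.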
18. Documents DBMS
(diagram: key = DocumentID -> value = Document)
“Key-value stores where the value is examinable”; indeed, this value is a document.
Depending on the DBMS, the document may be in JSON, XML, BSON, etc.
19. Documents DBMS
(diagram: key -> document pairs)
Example with a JSON document.
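A hypothetical key/JSON-document pair of my own (the field names are invented, but the "topics" attribute is chosen to match the CouchDB example a few slides later):

```javascript
// One key -> document entry of a document store. Unlike a key-value
// BLOB, the document's hierarchical structure is visible to the DBMS.
const key = "doc-1";
const doc = {
  name: "Alice",
  topics: ["music", "skating"],            // nested array
  address: { city: "Paris", zip: "75000" } // nested object
};
```

It is this visible structure that lets some document DBMS query by attributes rather than only by key.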
20. Documents DBMS
(diagram: key -> document pairs, queried through an API; MongoDB)
How to query: with the document key, or (for some DBMS, like MongoDB) with attributes within documents.
Actually, with MongoDB, it wouldn’t be a JSON doc, but a
BSON one. So it’d look like this:
\x31\x00\x00\x00
\x04BSON\x00
\x26\x00\x00\x00
\x02\x30\x00\x08\x00\x00\x00awesome\x00
\x01\x31\x00\x33\x33\x33\x33\x33\x33\x14\x40
\x10\x32\x00\xc2\x07\x00\x00
\x00
\x00
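To make "query by attribute" concrete, here is a sketch of my own that filters an in-memory array of documents (in MongoDB itself the equivalent is roughly `db.docs.find({ topics: "music" })`; the documents below are invented):

```javascript
// Querying documents by an attribute inside the document,
// not by the document's key.
const docs = [
  { _id: "d1", topics: ["music", "skating"] },
  { _id: "d2", topics: ["sleeping"] },
  { _id: "d3", topics: ["music"] }
];

function findByTopic(topic) {
  return docs.filter(d => d.topics.includes(topic)).map(d => d._id);
}
```

This is only possible because the store can examine the value; a pure key-value store would see the documents as opaque BLOBs.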
21. Documents DBMS
(diagram: key -> document pairs, queried through an API; CouchDB)
How to query: for some other DBMS (e.g. CouchDB), querying docs by anything other than their ID requires
creating a materialized view, populated with JavaScript map-reduce code (for instance).
This function will parse all the documents in the store, and emit the
docID of docs where there is a match (where one of the topics is “music”).
The load of running a map function can be distributed between nodes.
I think that this map function should be followed by a reduce
function that simply returns what it has been fed as parameters, e.g.:
nonReduce = function (keys, values, rereduce) {
  if (rereduce) {
    // never run for this pass-through reduce
  } else {
    // return the emitted data unchanged
    return values;
  }
};
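A sketch of the map function described above, assuming CouchDB's usual map-function shape (`emit()` is normally supplied by CouchDB's view engine, so it is stubbed here to run the function standalone; the sample documents are invented):

```javascript
// Stub of CouchDB's emit(): collects (key, value) pairs locally.
const emitted = [];
function emit(key, value) { emitted.push([key, value]); }

// Map function: emit the docID of every document whose topics
// contain "music".
function mapByMusicTopic(doc) {
  if (doc.topics && doc.topics.indexOf("music") !== -1) {
    emit(doc._id, 1);
  }
}

// The view engine would feed every document in the store through it:
[
  { _id: "d1", topics: ["music", "skating"] },
  { _id: "d2", topics: ["sleeping"] }
].forEach(mapByMusicTopic);
```

Each node can run the map function over its own documents, which is why the load distributes naturally.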
22. Documents DBMS
(diagram: key -> document pairs, queried through an API; CouchDB)
Example with map and reduce.
I think the grouped input is a mapping from each emitted key to an array of ‘1’s, e.g.:
values = { 'skating': [1, 1], 'music': [1], 'sleeping': [1, 1, 1, 1] };
Here ‘skating’, ‘music’ and ‘sleeping’ are keys, and each ‘1’ is a value.
The reduce function then returns the length() of each nested array.
A boolean parameter says whether or not a re-reduce is needed.
24. Columnar DBMS vs RDBMS
How you use them
Columnar DBMS
• Data is stored in columns.
• You specify column families (kinds of entities), composed of rows
that feature only some of the columns (among all the columns
declared in the column family).
RDBMS
• Data is stored in tables; each row contains data for all columns (although
a value can be NULL).
(diagram: “Column family A” with rows 1-4 holding only some of Col 1-3,
vs “Table A” where rows 1-4 each have all of Col 1-3)
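The sparseness of a column family can be sketched like this (my own illustration; the row and column names are invented):

```javascript
// A column family: each row carries only the columns it needs,
// unlike an RDBMS table where every row has a slot for every column.
const columnFamilyA = {
  row1: { col1: "a", col3: "c" },               // col2 absent, not NULL
  row2: { col2: "b" },
  row3: { col1: "x", col2: "y", col3: "z" }
};

function columnsOf(row) {
  return Object.keys(columnFamilyA[row]);
}
```

Absent columns simply do not exist for that row, which is cheaper than storing explicit NULLs.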
25. Why columnar DBMS?
The benefits of column-oriented DBMS reside only in the way they
store data on disk: they store data by column instead of by row.
This makes such DBMS more performant when you query only a few
columns, but read/write many values within those few columns.
It also makes it possible to store the columns in a compressed state;
only the columns being queried will be decompressed (on the fly).
Such DBMS are meant for analytics or batch-processing use cases (and
are not performant at all for OLTP).
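The trade-off can be sketched with the same records laid out both ways (my own illustration with invented data): an analytics query like "sum the price column" touches a single column datafile in the columnar layout, but must read every whole row in the row layout.

```javascript
// Row-oriented layout: one record per row, all columns together.
const rows = [
  { id: 1, price: 10, label: "a" },
  { id: 2, price: 20, label: "b" },
  { id: 3, price: 30, label: "c" }
];

// Column-oriented layout: one array ("datafile") per column.
const columns = {
  id:    [1, 2, 3],
  price: [10, 20, 30],
  label: ["a", "b", "c"]
};

// SUM(price):
// columnar: read only the price datafile
const sumColumnar = columns.price.reduce((s, v) => s + v, 0);
// row store: read each whole row just to extract price
const sumRowStore = rows.reduce((s, r) => s + r.price, 0);
```

Both sums are of course equal; the difference is in how much data had to be read to compute them.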
26. Column-oriented storage vs row-oriented storage
Column-oriented storage (columnar DBMS’ strategy)
• Each column is stored in its own datafile
source
(diagram: datafile0, datafile1)
a. Adding/deleting a column is relatively cheap in I/O: it only requires
working on a single small datafile.
b. Columns are stored compressed on the disk. Only the columns you
query will be decompressed (on the fly).
Row-oriented storage (RDBMS’ strategy)
a. Adding/deleting a column might require rewriting the whole table...
b. You can’t compress rows, because a whole row has to be decompressed
in order to be understandable (just like in column-oriented storage the whole column has to be
decompressed, or at least a whole subset of a column, i.e. a “segment”?). This means the whole
table would have to be decompressed in order to be queried (I don’t think you
could decompress only a subset of the table, because it is hard to think of a meaningful way the table could have
been chunked. Maybe you could compress all the values except the ID, and chunk the table based on the ID;
but that would only be useful for JOINs based on a foreign key.). A decompressed table is often too
big to fit in memory, so you’d have to swap part of it to disk (which is
slow) just to be able to query it.
27. Column-oriented storage vs row-oriented storage
When not to use them
Column-oriented storage (columnar DBMS’ strategy)
source
• If you only want to work on a few rows (as is often the case in
OLTP), it won’t be performant at all: you’ll have to read and
decompress all the columns (or at least their relevant subsets), and
then recompress and rewrite them.
Row-oriented storage (RDBMS’ strategy)
• If you only need to work on a few columns, but the table has many
columns, and you want to read/write many things in those few
columns, you’ll have to read each whole row just to get the few column
values that interest you.
(diagram: Col 1, Col 2, Col 3; “You just want to modify a row”)
FYI: memory page: the smallest unit of data for virtual-memory management. The OS moves blocks
of this size between the HD and the RAM through I/O channels, and vice-versa. As it is the smallest unit, a page is
read from disk as a whole, including unused space.
29. How to deal with many relationships
RDBMS
• You would use JOINs to compute relationships at query
time. On top of being less intuitive, the performance of
the JOINs degrades sharply as the size of the tables
being joined grows.
Graph DBMS
• The relationships are natively stored, so no relationship
has to be computed at query time.
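A sketch of what "natively stored relationships" means (my own illustration; the people and book are invented, the "HAS_READ" type is taken from the slide that follows): each node holds direct references to its relationships, so traversal is pointer-chasing rather than a join computed at query time.

```javascript
// Nodes hold their relationships directly.
const alice = { name: "Alice", hasRead: [] };
const bob   = { name: "Bob",   hasRead: [] };
const book  = { title: "NoSQL Distilled" };

// A relationship has a direction (from the owning node to `to`),
// a type, and its own properties (here, a rating).
alice.hasRead.push({ type: "HAS_READ", to: book, rating: 5 });
bob.hasRead.push({ type: "HAS_READ", to: book, rating: 4 });

// Traversal: follow the stored references, no join needed.
function booksReadBy(person) {
  return person.hasRead.map(rel => rel.to.title);
}
```

In an RDBMS the same query would join a people table, a books table and a people_books junction table.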
30. Labelled Property Graph Model (e.g. implemented by Neo4J)
A graph in such a model is composed of:
• Nodes
• Relationships (between two nodes)
31. Labelled Property Graph Model (e.g. implemented by Neo4J)
About Nodes
A node can contain:
• Properties: multiple key-value pairs
• Labels: tags representing the roles of the node in the data domain. They are used to group
nodes into sets. Labels may also serve to attach metadata (index or constraint information)
to certain nodes.
(diagram: nodes + a label = labelled nodes; “Person” and “Book” are labels,
and the names shown on the nodes are properties)
32. Labelled Property Graph Model (e.g. implemented by Neo4J)
About Relationships
A relationship always has:
• a direction: a start node, and an end node
• a type (i.e. a name)
• Properties: multiple key-value pairs
(diagram: a relationship of type “HAS_READ”, carrying its own properties,
between two nodes that each carry properties)
Editor's Notes
I think an aggregate is stored as:
• a BLOB value (associated with a key), in a key-value DBMS
• a document, in a documents DBMS
• a column, in a columnar DBMS