NoSQL databases emerged as a response to two perceived shortcomings of RDBMSs: the lack of agile, dynamic schemas and the lack of transparent, horizontal scaling. Relational databases promptly addressed the former by introducing unstructured data types, but scaling a relational database is still a very hard problem.
As a consequence, all NoSQL databases have been built from scratch: their storage engines, replication techniques, journaling, and ACID support (if any). They have not built on the existing state of the art in RDBMSs, effectively reinventing the wheel. Isn't this sub-optimal? Wouldn't it be possible to construct a NoSQL database by layering it on top of a relational database?
Enter ToroDB. ToroDB is an open source project that behaves as a NoSQL database but runs on top of PostgreSQL, one of the most respected and reliable relational databases. ToroDB offers a document (JSON-like) interface and implements the MongoDB wire protocol, so it is compatible with existing MongoDB drivers and applications. Rather than using PostgreSQL's jsonb data type, ToroDB takes a different approach: it automatically transforms JSON documents into a fully relational representation. This brings advantages such as a lower disk footprint and automatic data partitioning, leading to significantly faster queries.
As ToroDB speaks the MongoDB protocol, it also implements MongoDB's replication and sharding techniques, enabling it to scale and offer high availability (HA) as MongoDB does. Being based on PostgreSQL, ToroDB effectively scales PostgreSQL in much the same way MongoDB scales.
2. ToroDB @NoSQLonSQL
About *8Kdata*
● Research & Development in databases
● Consulting, Training and Support in PostgreSQL
● Founders of PostgreSQL España, 5th largest PUG in the world (>500 members as of today)
● About myself: CTO at 8Kdata:
@ahachete
http://linkd.in/1jhvzQ3
www.8kdata.com
4. ToroDB @NoSQLonSQL
ToroDB in one slide
● Document-oriented, JSON, NoSQL db
● Open source (AGPL)
● MongoDB compatibility (wire protocol level)
● Uses PostgreSQL as a storage backend
5. ToroDB @NoSQLonSQL
ToroDB storage
● Data is stored in tables. No blobs
● JSON documents are split by hierarchy levels into “subdocuments”, which contain no nested structures. Each subdocument level is stored separately
● Subdocuments are classified by “type”. Each “type” maps to a different table
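The splitting and classification described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not ToroDB's actual code: the names `split_document`, `subdocument_type` and `type_name`, the dotted path notation, and the type labels are all hypothetical, and arrays are left out for brevity.

```python
def type_name(value):
    """Map a scalar value to a coarse type label (labels are illustrative)."""
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if value is None:
        return "null"
    return "string"


def split_document(doc, path=""):
    """Return (path, flat_subdocument) pairs, one per nesting level."""
    flat = {}
    result = [(path, flat)]
    for key, value in doc.items():
        if isinstance(value, dict):
            # A nested object becomes its own subdocument at a deeper level.
            child_path = f"{path}.{key}" if path else key
            result.extend(split_document(value, child_path))
        else:
            flat[key] = value
    return result


def subdocument_type(subdoc):
    """A subdocument's "type": its sorted (key, scalar type) pairs."""
    return tuple(sorted((k, type_name(v)) for k, v in subdoc.items()))


doc = {
    "name": "Toro",
    "age": 3,
    "address": {"city": "Madrid", "geo": {"lat": 40.4, "lon": -3.7}},
}
parts = split_document(doc)
for path, subdoc in parts:
    print(repr(path), subdoc, subdocument_type(subdoc))
```

Two subdocuments with the same keys and scalar types get the same type regardless of key order, so they would land in the same table.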
6. ToroDB @NoSQLonSQL
ToroDB storage (II)
● A “structure” table keeps the subdocument “schema”
● Keys in JSON are mapped to attributes, which retain the original name
● Tables are created dynamically and transparently to match the exact types of the documents
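A rough sketch of dynamic table creation, using Python's built-in sqlite3 module as a stand-in for PostgreSQL. Everything here is an assumption made for illustration: the `TypedStore` class, the `t_N` table names, the `did` (document id) column, and the in-memory dictionary that plays the role of the “structure” table. It also assumes non-empty subdocuments whose keys are safe to use as column names.

```python
import sqlite3

# Python type name -> SQL column type (illustrative mapping).
SQL_TYPES = {"str": "TEXT", "int": "INTEGER", "float": "REAL", "bool": "INTEGER"}


class TypedStore:
    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.tables = {}  # subdocument type -> table name

    def table_for(self, subdoc):
        """Return the table for this subdocument's type, creating it on first use."""
        sig = tuple(sorted((k, type(v).__name__) for k, v in subdoc.items()))
        if sig not in self.tables:
            name = f"t_{len(self.tables) + 1}"
            cols = ", ".join(f'"{k}" {SQL_TYPES[t]}' for k, t in sig)
            self.conn.execute(f"CREATE TABLE {name} (did INTEGER, {cols})")
            self.tables[sig] = name
        return self.tables[sig]

    def insert(self, did, subdoc):
        """Store one flat subdocument; JSON keys keep their names as columns."""
        table = self.table_for(subdoc)
        keys = sorted(subdoc)
        cols = ", ".join(f'"{k}"' for k in keys)
        marks = ", ".join("?" for _ in keys)
        self.conn.execute(
            f"INSERT INTO {table} (did, {cols}) VALUES (?, {marks})",
            [did] + [subdoc[k] for k in keys],
        )
        return table


store = TypedStore()
t1 = store.insert(1, {"name": "Ana", "age": 31})
t2 = store.insert(2, {"age": 45, "name": "Luis"})  # same type, same table
t3 = store.insert(3, {"name": "Eva"})              # new type, new table
```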
12. ToroDB @NoSQLonSQL
ToroDB: query “by structure”
● ToroDB is effectively partitioning by type
● Structures (schemas, partitioning types) are cached in ToroDB memory
● Queries only scan a subset of the data
● Negative queries are served directly from memory
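The idea can be sketched with a small in-memory cache. This reads “negative queries” as queries on keys that no known structure contains, which is an interpretation rather than something the slide states. The `StructureCache` class, the table names, and the plain dictionary standing in for the database are hypothetical.

```python
class StructureCache:
    """In-memory map from attribute name to the tables whose type contains it."""

    def __init__(self):
        self.tables_by_key = {}
        self.scanned = []  # records which tables were actually read

    def register(self, table, keys):
        for key in keys:
            self.tables_by_key.setdefault(key, set()).add(table)

    def find(self, data, key, value):
        candidates = sorted(self.tables_by_key.get(key, ()))
        if not candidates:
            # No structure has this key: answer from memory, touch no table.
            return []
        rows = []
        for table in candidates:
            self.scanned.append(table)
            rows.extend(r for r in data[table] if r.get(key) == value)
        return rows


# "data" stands in for the database: table name -> rows.
data = {
    "t_1": [{"name": "Ana", "age": 31}, {"name": "Luis", "age": 45}],
    "t_2": [{"name": "Eva"}],
}
cache = StructureCache()
cache.register("t_1", ["name", "age"])
cache.register("t_2", ["name"])

by_age = cache.find(data, "age", 45)               # scans only t_1
missing = cache.find(data, "email", "x@example.com")  # scans nothing
```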
14. ToroDB @NoSQLonSQL
Big Data: NoSQL vs SQL
http://www.networkworld.com/article/2226514/tech-debates/what-s-better-for-your-big-data-application--sql-or-nosql-.html
17. ToroDB @NoSQLonSQL
Vertical scalability
Concurrency scalability
● SQL is usually better (e.g. PostgreSQL):
  ➔ Finer locking
  ➔ MVCC
  ➔ Better caching
● NoSQL often needs sharding within the same host to scale
18. ToroDB @NoSQLonSQL
Vertical scalability
Hardware scalability
● Scaling with the number of cores?
● Process/threading model?
Query scalability
● Use of indexes? Use of more than one?
● Table/collection partitioning?
● ToroDB “by-type” partitioning
19. ToroDB @NoSQLonSQL
Read scalability: replication
● Replicate data to slave nodes, available read-only: scale-out reads
● Both NoSQL and SQL support it
● Binary replication usually faster (e.g. PostgreSQL's Streaming Replication)
● Not free from undesirable phenomena
22. ToroDB @NoSQLonSQL
MongoDB's dirty and stale reads
Dirty reads
A primary in the minority partition accepts a write that other clients see, but it later steps down and the write is rolled back (fixed in 3.2?)
Stale reads
A primary in the minority partition serves a value that ought to be current, but a newer value was written to the other primary, in the majority partition
23. ToroDB @NoSQLonSQL
Write scalability (sharding)
● NoSQL better prepared than SQL
● But many compromises in data modeling (schema design): no FKs
● There are also solutions for SQL:
  ➔ Shared-disk, limited scalability (RAC)
  ➔ Sharding (like pg_shard)
  ➔ PostgreSQL's FDWs
25. ToroDB @NoSQLonSQL
Replication protocol choice
● ToroDB is based on PostgreSQL
● PostgreSQL has either binary streaming replication (async or sync) or logical replication
● MongoDB has logical replication
● ToroDB uses MongoDB's protocol
26. ToroDB @NoSQLonSQL
MongoDB's replication protocol
● Every change is recorded in JSON documents, idempotent format (collection: local.oplog.rs)
● Slaves pull these documents from master (or other slaves) asynchronously
● Changes are applied and feedback is sent upstream
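To illustrate why the idempotent format matters, here is a minimal sketch of applying oplog-style entries to an in-memory store. The entries are simplified: the field names `op`, `ns`, `o` and `o2` follow the oplog's general shape, but real entries carry more fields (timestamps among them) and the update format varies across MongoDB versions. The `apply_entry` function and the dictionary-based store are hypothetical.

```python
def apply_entry(store, entry):
    """Apply one simplified oplog entry to store[namespace][_id]."""
    coll = store.setdefault(entry["ns"], {})
    op = entry["op"]
    if op == "i":
        # Insert behaves as an upsert, so replaying it is harmless.
        doc = entry["o"]
        coll[doc["_id"]] = dict(doc)
    elif op == "u":
        # Updates carry resulting values ($set), not deltas such as $inc.
        doc = coll.get(entry["o2"]["_id"])
        if doc is not None:
            doc.update(entry["o"]["$set"])
    elif op == "d":
        # Deleting an already deleted document is a no-op.
        coll.pop(entry["o"]["_id"], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


oplog = [
    {"op": "i", "ns": "test.c", "o": {"_id": 1, "n": 1}},
    {"op": "u", "ns": "test.c", "o2": {"_id": 1}, "o": {"$set": {"n": 2}}},
    {"op": "i", "ns": "test.c", "o": {"_id": 2, "n": 5}},
    {"op": "d", "ns": "test.c", "o": {"_id": 2}},
]

once = {}
for entry in oplog:
    apply_entry(once, entry)

# A slave that re-applies entries it has already seen ends in the same state.
twice = {}
for entry in oplog + oplog:
    apply_entry(twice, entry)
```

Because each entry records the resulting state rather than a delta, a slave that is unsure where it stopped can safely re-apply a stretch of the log.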
29. ToroDB @NoSQLonSQL
ToroDB v0.4
● ToroDB works as a secondary (slave) of a MongoDB master (or slave)
● Implements the full replication protocol (not an oplog tailable query)
● Replicates from MongoDB to PostgreSQL
● Open source: github.com/torodb/torodb (repl branch, version 0.4-SNAPSHOT)
30. ToroDB @NoSQLonSQL
Advantages of ToroDB w/ replication
● Native SQL
● Query “by type”
● Better SQL scaling
● Less concurrency contention
● Better hardware utilization
● No need for ETL from Mongo to PG!
31. ToroDB @NoSQLonSQL
ToroDB: native SQL
● NoSQL is trying to get back to SQL
● ToroDB is SQL native!
● Insert with Mongo, query with SQL!
● Powerful PostgreSQL SQL: window functions, recursive queries, hypothetical aggregates, lateral joins, CTEs, etc.
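As a rough illustration of “insert with Mongo, query with SQL”, the sketch below runs a CTE with a join and an aggregate over two tables shaped like split subdocuments. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table names (`person`, `person_address`), the `did` column, and the sample rows are made up for this example and are not ToroDB's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two tables standing in for the root level and the "address" level of
# documents that would have been inserted through a MongoDB driver.
conn.executescript("""
CREATE TABLE person (did INTEGER, name TEXT, age INTEGER);
CREATE TABLE person_address (did INTEGER, city TEXT);
INSERT INTO person VALUES (1, 'Ana', 31), (2, 'Luis', 45), (3, 'Eva', 27);
INSERT INTO person_address VALUES (1, 'Madrid'), (2, 'Madrid'), (3, 'Sevilla');
""")

query = """
WITH located AS (
    SELECT p.name, p.age, a.city
    FROM person p
    JOIN person_address a ON a.did = p.did
)
SELECT city, COUNT(*) AS people, AVG(age) AS avg_age
FROM located
GROUP BY city
ORDER BY city
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```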
37. ToroDB @NoSQLonSQL
ToroDB on GP: SQL DW for MongoDB
● Greenplum was open sourced 6 days ago
● ToroDB already runs on GP!
● Goal: ToroDB v0.4 replicates from a MongoDB master onto Toro-Greenplum and runs the reporting using distributed SQL