Making sense in the brave new world of databases Benoit Grégoire Savoir-faire Linux [email_address]
Why am I here?
The last time SQL was declared dead wasn't fun.
There is this new hype called “NoSQL”
I don't want to deal with another database holy war for the next ten years of my career. Especially if it will prevent us from using the right tool for the job.
Lots of smart people worked on new tools, but fanboys are beginning to strike...
Typical decision criteria
Criterion 1: Is it popular/hyped?
There is no criterion 2
What did I do
Survey all databases that:
Are active projects
Why are you here
Nobody needs this theoretical mumbo-jumbo, right?
Rumor has it Google has a few PhDs on its payroll
The problem: The typical web developer's entire database training is:
I use MySQL
I overheard it has a manual, but I didn't actually check since I use an ORM for everything.
ORMs probably did more damage to database understanding than any other factor.
Not understanding the fundamental characteristics of the database system prevents you from making good decisions
Relational databases aren't the only option. If you go through the pain of using an ORM, you need to at least want SOMETHING from relational databases:
Relational integrity and schema enforcement
Fast joins WITHOUT instantiating objects
Replication (but then you probably chose the wrong solution)
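To make the first two items of that wish list concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables and columns (authors, posts) are invented for the illustration. The point is only that the database performs the join and the count itself and hands back plain tuples, and that it refuses a row that breaks referential integrity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES authors(id),
        title TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO authors (id, name) VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO posts (author_id, title) VALUES (?, ?)",
                 [(1, "CAP"), (1, "ACID"), (2, "MVCC")])
conn.commit()

# The join and the aggregation run inside the database engine;
# the application only sees small result tuples, not one object per row.
rows = conn.execute("""
    SELECT a.name, COUNT(p.id)
    FROM authors AS a
    JOIN posts AS p ON p.author_id = a.id
    GROUP BY a.name
    ORDER BY a.name
""").fetchall()

# Relational integrity: a post pointing at a missing author is rejected.
try:
    conn.execute("INSERT INTO posts (author_id, title) VALUES (?, ?)", (99, "orphan"))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

An ORM can issue the same query, but the benefit only shows up if you know the database can do this and ask for it.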
So if not NoSQL, then what
High Performance Scalable Data Stores (HPSDS)
Scalable Non Relational Database (SNRD?)
The movement formerly known as NoSQL?
It's great that it's possible
Most web applications are not
Those that are don't have all of their components at “Internet scale”
Plan for it, but you probably don't need it NOW.
Google didn't start with BigTable and MapReduce...
Major problem: Scaling
Not historically part of database feature lists
Major problem: Availability and replication
Major problem: Transactions
Traditional databases: ACID
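As a reminder of what the "A" in ACID buys you, here is a small sketch with Python's built-in sqlite3 module. The accounts table and the transfer helper are hypothetical. The transfer deliberately credits the destination first, so that when the debit violates the CHECK constraint you can see the credit being rolled back with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so nothing is applied
final = balances(conn)
```

After the failed transfer, bob's balance is unchanged even though his credit was executed before the error. Stores without transactions leave that cleanup to the application.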
Life is about compromises
So is computing
Good – Fast – Cheap
For distributed databases
Brewer's CAP Theorem (2000)
In a distributed system, choose any two of: consistency, availability, partition tolerance
Partition tolerance: no set of failures less than total network failure is allowed to cause the system to respond incorrectly (Gilbert & Lynch)
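The trade-off can be shown with a toy model. This is a sketch under heavy assumptions, not a real replication protocol: two in-memory replicas, a flag that simulates a network partition, and a hypothetical TwoNodeRegister class with a "CP" or "AP" mode. During a partition the CP variant refuses writes (it stays consistent but is unavailable), while the AP variant accepts them and lets the replicas diverge.

```python
class Replica:
    def __init__(self):
        self.value = None

class TwoNodeRegister:
    """Toy replicated register: two replicas and a switchable network partition."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False

    def write(self, node, value):
        if not self.partitioned:
            # Healthy network: every replica gets the write.
            for replica in self.replicas:
                replica.value = value
            return True
        if self.mode == "CP":
            # Cannot reach the other replica: refuse, giving up availability.
            return False
        # AP: accept the write locally, giving up consistency.
        self.replicas[node].value = value
        return True

    def read(self, node):
        return self.replicas[node].value

cp = TwoNodeRegister("CP")
cp.write(0, "a")
cp.partitioned = True
cp_accepted = cp.write(0, "b")
cp_reads = (cp.read(0), cp.read(1))

ap = TwoNodeRegister("AP")
ap.write(0, "a")
ap.partitioned = True
ap_accepted = ap.write(0, "b")
ap_reads = (ap.read(0), ap.read(1))
```

In the CP run both replicas still answer "a"; in the AP run node 0 answers "b" while node 1 answers "a", and something has to reconcile them once the partition heals.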
Latency is all important
And unavoidable (the speed of light is unlikely to change): a Montreal-Sydney round trip IS going to take 130 ms
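A back-of-the-envelope check of the speed-of-light floor, as a sketch. The city coordinates are approximate, the Earth is treated as a sphere, and the 2/3 factor for light in optical fiber is a common rule of thumb rather than a measured value. With these assumptions the lower bound comes out somewhat above 100 ms in vacuum and around 160 ms in fiber, so the figure on the slide falls between the two.

```python
import math

EARTH_RADIUS_KM = 6371.0
C_VACUUM_KM_S = 299_792.458
FIBER_FRACTION = 2 / 3  # assumption: light in glass travels at roughly 2/3 of c

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate coordinates (assumed for the example).
MONTREAL = (45.50, -73.57)
SYDNEY = (-33.87, 151.21)

distance_km = great_circle_km(*MONTREAL, *SYDNEY)
rtt_vacuum_ms = 2 * distance_km / C_VACUUM_KM_S * 1000
rtt_fiber_ms = rtt_vacuum_ms / FIBER_FRACTION

print(f"distance: {distance_km:.0f} km")
print(f"round trip, vacuum: {rtt_vacuum_ms:.0f} ms")
print(f"round trip, fiber:  {rtt_fiber_ms:.0f} ms")
```

Real routes are longer than the great circle and add switching delays, so measured latency can only be worse than these bounds.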
Classifying databases is difficult
There are projects trying to be more than one basic type (MonetDB, Virtuoso)
Some of them use another as backend or frontend
Extensible record stores
What about Map-Reduce?
Classical distributed computing:
Move program and data to processing nodes
Move program to the data node
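The "move the program to the data" idea can be sketched in a single process. This is a simplification: the shards list stands in for data living on separate nodes, map_shard is the piece of program that would be shipped to each node, and the shuffle-by-key step of a real MapReduce framework is left out. What it shows is that only small per-shard summaries travel, not the raw data.

```python
from collections import Counter
from functools import reduce

# Hypothetical shards: each inner list stands for the data stored on one node.
shards = [
    ["sql is dead", "long live sql"],
    ["nosql is hype"],
    ["sql is not dead"],
]

def map_shard(lines):
    """Runs where the data lives; only the small per-shard counts leave the node."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(left, right):
    """Merge two partial word counts into one."""
    return left + right

# In a real system there would be one map_shard call per data node, in parallel.
partials = [map_shard(shard) for shard in shards]
totals = reduce(reduce_counts, partials, Counter())
```

In the classical approach, all three shards would first be copied to a processing node and counted there.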
Comparing concurrency/locks/transactions

System         Concurrency control  Data storage  Replication   Txn
Redis          Locks                RAM           Async         N
Scalaris       Locks                RAM           Sync          L
Tokyo          Locks                RAM or disk   Async         L
Voldemort      MVCC                 RAM or BDB    Async         N
SimpleDB       None                 S3            Async         N
Riak           MVCC                 Pluggable     Async         N
MongoDB        Field-level          Disk          Async         N
CouchDB        MVCC                 Disk          Async         N
HBase          Locks                Hadoop        Async         L
HyperTable     Locks                Files         Sync          L
Cassandra      MVCC                 Disk          Async         L
BigTable       Locks + stamps       GFS           Sync + Async  L
ScaleDB        Locks                Disk          Sync          Y
MySQL Cluster  Locks                Disk or RAM   Sync          Y
MySQL MyISAM   Locks                Disk          Async         N
MySQL InnoDB   MVCC                 Disk          Async         Y
Drizzle        Locks                Disk          Sync          Y
PostgreSQL     MVCC                 Disk          Pluggable     Y
How early should you try to decouple your data?
Splitting data is more difficult than splitting applications
Are we headed to the one database to rule them all?
Profile, profile, profile
Don't neglect local caching
A database is not something that can be abstracted out.
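For the local-caching advice above, a minimal sketch using functools.lru_cache from the Python standard library. The query_database function is a stand-in for a real round trip to the database, and the call counter exists only to make the effect visible. Invalidation is ignored here, which is the hard part in practice.

```python
from functools import lru_cache

calls = {"db": 0}

def query_database(user_id):
    """Stand-in for a real (slow) database round trip."""
    calls["db"] += 1
    return (user_id, f"user-{user_id}")

@lru_cache(maxsize=1024)
def get_user(user_id):
    # Served from the in-process cache after the first call for a given id.
    return query_database(user_id)

for _ in range(3):
    get_user(42)
get_user(7)

cache_stats = get_user.cache_info()
```

Four lookups cost two database calls here. Profile first to confirm that the lookups are actually your bottleneck before adding a cache.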