I'm from California, where mountain biking and startups were invented. My friends work at Facebook, eBay and LinkedIn, and often I'm the only DBA they will talk to. That is how I hear about the decision model around NoSQL usage.
We are a managed service and a solution provider of elite database and system administration skills in Oracle, MySQL and SQL Server.
* MySQL for front-end and ad serving
* Oracle as a data warehouse
* Hadoop for analytics and ETL
* Hive as a more structured Hadoop frontend
* Cassandra for mailbox search
While an excellent RDBMS such as Oracle can solve 90% of the problems, we need multiple special-purpose databases for the other 10%.
Every developer knows more than one language, and most of them will happily learn more if the job requires. The good ones are “software engineers” and not “Java programmers”. We need to turn “Senior Oracle DBAs” into “Database Engineers”.
* Marketing term. These days everything is NoSQL (including Oracle!)
* Anything from file-system to cache can be called NoSQL
* Key value stores, document stores, column stores, OLTP or DW, RAM or Disk,
Some people say you shouldn't worry about scale before you have even 100 users. Not true. Some startups, like eBay or LinkedIn, have a scale-or-fail business model from the beginning. They know that if they don't get 250M users, they will fail, so they plan for 250M from the beginning.
Initially, most NoSQL databases are easier for developers, thanks to simpler data models and easier access methods than JDBC+SQL. Eventually, though, NoSQL databases lack many of the services an RDBMS provides, forcing your developers to do more work.
You can do PK access, range scans and group by, but not joins.
You may be able to update a single row (“document”, “column family”) as an atomic operation, but that is the absolute limit.
Note that these are not traditional RDBMS problems:
* Checkout requires access by key only.
* Monitoring is write a lot, query very little.
* Page-rank and “People you might know” require quick updates, and selects are done with batch offline jobs.
* Word completion is just set selection.
Hadoop – so big it deserves its own presentation
... or when node 3 crashes?
You need to remap every single datapoint to a new node, causing lots of data copying and scanning. Lots of extra work. Some of it may require locking.
Actually, with a smarter partitioning scheme, when you add node #5 you only need to move about 3000/5 datapoints, not 3000. Obviously, the more nodes you have, the more advantage there is to a smarter way of partitioning.
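The difference between the two remapping costs can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's implementation: it assumes plain `hash % node_count` as the naive scheme and a consistent-hash ring with virtual nodes as the "smarter way of partitioning". The node names, the `HashRing` class and the choice of 100 virtual nodes are all made up for the example.

```python
import hashlib
from bisect import bisect_right


def stable_hash(value: str) -> int:
    """Hash that is identical across runs (unlike Python's built-in hash())."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def modulo_node(key: str, nodes: list) -> str:
    """Naive partitioning: node index is hash modulo the node count."""
    return nodes[stable_hash(key) % len(nodes)]


class HashRing:
    """Consistent-hash ring; each node owns many points ("virtual nodes")."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved(keys, before, after) -> int:
    """Count the keys whose owning node differs between two placements."""
    return sum(1 for k in keys if before(k) != after(k))


keys = [f"datapoint-{i}" for i in range(3000)]
old_nodes = ["node1", "node2", "node3", "node4"]
new_nodes = old_nodes + ["node5"]

moved_modulo = moved(
    keys,
    lambda k: modulo_node(k, old_nodes),
    lambda k: modulo_node(k, new_nodes),
)

old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
moved_ring = moved(keys, old_ring.node_for, new_ring.node_for)

print(f"modulo: {moved_modulo} of {len(keys)} datapoints moved")
print(f"ring:   {moved_ring} of {len(keys)} datapoints moved")
```

With the modulo scheme most datapoints change owner when node #5 arrives. With the ring, roughly one fifth move, and every one of them moves to the new node.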
This includes subsequent access from independent processes
When you decide to go with a distributed and replicated model, there's an obvious question: what do I do when some of the nodes needed for the operation are not available (either due to network issues or to crashes)?
1. Fail the operation.
2. Wait for the node to come back.
3. Perform the operation on reachable nodes and update the extra node when it's back.
Writes don't get lost, because at least one node keeps them and attempts to communicate them to the other nodes in the system.
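One way to picture "a node keeps the write and passes it on later" is the toy model below. It is only a sketch under simple assumptions: nodes are in-memory objects, the first reachable replica holds the pending writes, and `Node`, `write` and `deliver_hints` are hypothetical names, not the API of any real database.

```python
class Node:
    """In-memory stand-in for a storage node (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target_node, key, value) writes held for peers that were down


def write(replicas, key, value):
    """Apply the write on every reachable replica; keep a hint for each missing one."""
    reachable = [node for node in replicas if node.up]
    if not reachable:
        raise RuntimeError("no replica reachable; the write cannot be accepted")
    for node in reachable:
        node.data[key] = value
    keeper = reachable[0]  # assumption: the first reachable node holds the hints
    for node in replicas:
        if not node.up:
            keeper.hints.append((node, key, value))
    return len(reachable)


def deliver_hints(node):
    """Retry the held writes; keep those whose target is still unreachable."""
    remaining = []
    for target, key, value in node.hints:
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    node.hints = remaining


a, b, c = Node("a"), Node("b"), Node("c")
c.up = False                                # node c is unreachable
acks = write([a, b, c], "cart:42", "book")  # accepted by a and b
c.up = True                                 # node c comes back
deliver_hints(a)                            # a forwards the write it was keeping
print(acks, c.data)
```

A real system also has to deal with a held write that arrives after a newer one, which is where conflict resolution comes in.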
Important: the application must know how to resolve conflicts. If you don't have a good method of resolving conflicts, don't do eventual consistency.
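As an illustration of what "the application must know how to resolve conflicts" can mean, here is a small sketch. It assumes the values are shopping-cart sets and that taking the union of divergent versions is acceptable for the business. Both are assumptions made for the example, and `merge_carts` and `read_with_resolution` are hypothetical names.

```python
from typing import Callable, List, Set


def merge_carts(versions: List[Set[str]]) -> Set[str]:
    """Application-specific rule: keep every item seen in any version."""
    merged: Set[str] = set()
    for version in versions:
        merged |= version
    return merged


def read_with_resolution(
    replica_values: List[Set[str]],
    resolve: Callable[[List[Set[str]]], Set[str]],
) -> Set[str]:
    """Return the value if all replicas agree, otherwise ask the application."""
    distinct: List[Set[str]] = []
    for value in replica_values:
        if value not in distinct:
            distinct.append(value)
    if len(distinct) == 1:
        return distinct[0]
    return resolve(distinct)


# Three replicas answered the same read with diverging carts.
replies = [{"book"}, {"book", "bike"}, {"helmet", "book"}]
cart = read_with_resolution(replies, merge_carts)
print(sorted(cart))
```

Note that a union rule like this can bring back an item the user had removed. If no rule of this kind is acceptable for your data, that is the case where eventual consistency is the wrong choice.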
Storage nodes are the physical servers.
They contain “partitions”. Keys are mapped to partitions. Partitions are grouped into “replication groups”, each containing a set number of partitions on separate servers and, if needed, separate data centers. All partitions in a replication group contain identical data.
One partition in a replication group is designated the “master”. Writes are done on the master only. If the master fails, a new master is elected in the group.
Client drivers keep track of the hash map: which key maps to which partition, who is the master of each replication group, and the load on each node in the group. This allows the client to send its work to the right node.
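The routing described above can be sketched as follows. This is a simplified model, not the actual driver: it hashes a key straight to a replication group (skipping the separate partition layer), "elects" the first surviving replica as the new master, and sends reads to the least-loaded replica. All three behaviors, and the names `ReplicationGroup`, `ClientDriver` and `sn1`..`sn6`, are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReplicationGroup:
    replicas: List[str]                  # storage nodes holding identical data
    master: str                          # the only node that accepts writes
    load: Dict[str, int] = field(default_factory=dict)

    def fail_node(self, node: str) -> None:
        """Drop a failed node; pick a new master if the master was lost."""
        self.replicas.remove(node)
        if node == self.master:
            self.master = self.replicas[0]  # stand-in for a real election


class ClientDriver:
    """Client-side map from key to replication group to node."""

    def __init__(self, groups: List[ReplicationGroup]):
        self.groups = groups

    def group_for(self, key: str) -> ReplicationGroup:
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.groups[int(digest, 16) % len(self.groups)]

    def write_target(self, key: str) -> str:
        # Writes always go to the master of the key's group.
        return self.group_for(key).master

    def read_target(self, key: str) -> str:
        # Assumption: reads may go to the least-loaded replica in the group.
        group = self.group_for(key)
        return min(group.replicas, key=lambda node: group.load.get(node, 0))


groups = [
    ReplicationGroup(["sn1", "sn2", "sn3"], master="sn1",
                     load={"sn1": 9, "sn2": 2, "sn3": 5}),
    ReplicationGroup(["sn4", "sn5", "sn6"], master="sn4",
                     load={"sn4": 1, "sn5": 7, "sn6": 3}),
]
driver = ClientDriver(groups)

group = driver.group_for("user42")
before = driver.write_target("user42")
group.fail_node(before)                 # the master of that group crashes
after = driver.write_target("user42")   # the driver now routes to the new master
print(before, "->", after)
```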
The major key controls the location of the key. This means that all keys with the same major key are kept on the same replica and can be updated in one transaction. It also means that many different major keys should be used to fully utilize all storage nodes.
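A rough sketch of the idea, assuming a key is a `(major, minor)` pair and that only the major part is hashed to choose a location. The partition count, the MD5 hash and the function name are assumptions for illustration, not the product's actual scheme.

```python
import hashlib
from typing import Tuple

NUM_PARTITIONS = 8  # assumed value for the example


def partition_for(key: Tuple[str, str]) -> int:
    """Only the major key decides placement; the minor key is ignored."""
    major, _minor = key
    digest = hashlib.md5(major.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


# Every record of one user shares a major key, so all of them land together.
user_keys = [("user42", "profile"), ("user42", "inbox"), ("user42", "settings")]
same_place = {partition_for(k) for k in user_keys}

# Many different major keys are needed to spread load over all partitions.
spread = {partition_for((f"user{i}", "profile")) for i in range(1000)}

print(len(same_place), len(spread))
```

The flip side is visible in the same sketch: a workload that uses only a handful of major keys can never touch more than a handful of partitions.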
* New products = lots of bugs, few features. Oracle is at 11gR2. MS SQL Server is the equivalent of Oracle 8i (maybe 9?), MySQL is somewhere between 6 and 7. NoSQL is between 2 and 3.
* Open source = no support
* Many companies decide to build their own: most of the algorithms are published, you can use existing code, there is no support anyway, and solving specific problems is easier