After spending so long extolling the benefits of the various NoSQL solutions, I would like to point out at least one scenario where I haven’t seen a good NoSQL replacement for the RDBMS: reporting. One of the great things about an RDBMS is that, given the information it already has, it is very easy to massage the data into a lot of interesting forms. That is especially important when you want to give users the ability to analyze the data on their own, such as by providing a reporting tool that allows them to query, aggregate and manipulate the data to their heart’s content.

While it is certainly possible to produce reports on top of a NoSQL store, you wouldn’t be able to come close to the level of flexibility that an RDBMS offers. That flexibility is one of the major benefits of the RDBMS. The NoSQL solutions will tend to outperform the RDBMS (as long as you stay in the appropriate niche for each NoSQL solution), and they certainly have a better scalability story, but for user-driven reports the RDBMS is still my tool of choice.
A column database is a DBMS that stores its content by column rather than by row. This has advantages for data warehouses: it is more efficient for aggregates and for data that is naturally column oriented. It is suited to OLAP and much less to OLTP. The idea dates from the 1970s. Example: Apache Cassandra.

Key-value databases allow the user to store key-value pairs, where the key usually consists of a string and the value is a simple primitive. They are suited to use cases where properties and values are enough: profiles, logs, etc. Variants include eventually consistent, hierarchical and multivalued stores. Example: Redis (redis.io).

A graph database is a database that uses a graph structure with nodes, edges, and properties. It is suited to associative datasets and maps well onto the structure of object-oriented applications. It avoids expensive joins.
There is a computer science theorem that quantifies the inevitable trade-offs. Eric Brewer’s CAP theorem says that if you want consistency, availability, and partition tolerance, you have to settle for two out of three. (For a distributed system, partition tolerance means the system will continue to work unless there is a total network failure. A few nodes can fail and the system keeps going.)
Horizontally Scalable

The problem is that SQL doesn't scale well. In particular, it doesn't scale horizontally. If your SQL performance is poor, you can't just add more SQL servers to make it faster. In general, you need rather large computers to handle large databases, which means some very expensive hardware. In addition, since you need large computers, this doesn't fit well with the cloud model.

Document-oriented databases (such as CouchDB and MongoDB) are designed for horizontal scalability. This means that as your database grows, you can simply add more commodity hardware, or more resources from the cloud. But how do they achieve this?

These types of databases operate on something similar to distributed hash tables (DHTs). DHTs store key/value pairs in hash buckets. These buckets hold a number of key/value pairs indexed by "hash value." This hash value is a number generated from the data in such a way that all key/value pairs are distributed evenly among the hash buckets. For example, if the DHT has 5 hash buckets and 50 key/value pairs are stored, each hash bucket should have about 10 key/value pairs.

One of the advantages here is that this is extremely easy to parallelize. Want more database servers? Just add more hash buckets. As your database grows, you just add more servers, and none of them need to be super-computers either. This is what it means to be "horizontally scalable."

Schema-less

Another defining feature of document-oriented databases is that they're schemaless. This is a hard pill to swallow if you've been using relational databases for a long period of time. Instead of each record existing in a row of carefully designed columns, each record exists in a document. Think of it as a file on the filesystem. This document can store any data it wants; it doesn't have to follow a schema.

However, while these documents are schemaless, they're not freeform. Many databases opt to use the JSON format, which helps you store key/value pairs in a formatted way.
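As a rough illustration of schemaless JSON documents, the sketch below stores two documents of the same type with different sets of keys. The field names and the `serialize` and `get_links` helpers are made up for this example and do not come from any particular database.

```python
import json

# Two hypothetical documents of the same type ("blog post").
# The second one has no links, so that key is omitted entirely.
post_a = {
    "type": "post",
    "title": "Why a document store?",
    "body": "Schema-free storage of whole object graphs.",
    "links": ["http://ravendb.net"],
}
post_b = {
    "type": "post",
    "title": "A short note",
    "body": "No links here.",
}


def serialize(document):
    """Encode a document as the JSON text a document store might hold."""
    return json.dumps(document, sort_keys=True)


def get_links(document):
    """Read an optional key, treating a missing key as 'no links'."""
    return document.get("links", [])


stored = [serialize(post_a), serialize(post_b)]
loaded = [json.loads(text) for text in stored]
```

The reading code, not the database, decides how to treat a missing key, which is why `get_links` supplies a default.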
A document can have any number of key/value pairs. Instead of using a schema, documents of the same type (for example, documents representing blog posts) all have a similar set of key/value pairs.

One benefit of this is compactness. Since all documents of the same type don't have to have the same set of key/value pairs, you can save space by leaving some of them off. So, if not all blog posts have associated links, you can simply leave that key/value pair out. You don't just leave it empty; you leave it out entirely.

But that has further implications. Not only does it save a bit of space, it makes adding features to a database relatively free. Running an ALTER statement on a large SQL database can take hours of crunching. If it goes wrong, you'll have to restore a backup of the database, figure out what went wrong and try again. With document-oriented databases, you simply start adding new key/value pairs to your documents. It's as easy as that.

Cloud Model

The trend in web applications (and many other fields) is toward cloud computing. If you're not familiar with cloud computing, imagine a huge server farm. Your web application is on one of these servers, shared with many other applications. Then someone posts a link to your application on a popular website and you're suddenly inundated with traffic. On a traditional hosting platform, you'll reach the limits of your virtual server and hit a brick wall. On a cloud, more servers can be dynamically allocated to deal with this traffic. Once the traffic subsides, the capacity on those servers returns to the cloud. Nothing broke, and your web application didn't even slow down.

One of the problems with SQL servers is that they don't work well in a cloud. As databases grow and as traffic increases, larger and faster computers are required. Load balancing can be achieved by mirroring the servers, but they still need to be large and fast. This just doesn't fit with the cloud model.
NoSQL servers, on the other hand, can simply add more nodes.
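The hash-bucket idea behind "just add more nodes" can be sketched in a few lines. This is a minimal single-process illustration, not how any particular database does it: the `bucket_for` and `distribute` helpers and the `user:N` keys are invented for the example, and the simple modulo placement is an assumption made for brevity.

```python
import hashlib


def bucket_for(key, bucket_count):
    """Map a key to a bucket index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % bucket_count


def distribute(keys, bucket_count):
    """Place every key into one of bucket_count buckets."""
    buckets = [[] for _ in range(bucket_count)]
    for key in keys:
        buckets[bucket_for(key, bucket_count)].append(key)
    return buckets


# 50 hypothetical keys spread over 5 buckets, then over 6 after "adding a server".
keys = ["user:%d" % i for i in range(50)]
five_buckets = distribute(keys, 5)
six_buckets = distribute(keys, 6)

for index, bucket in enumerate(five_buckets):
    print("bucket", index, "holds", len(bucket), "keys")
```

With modulo placement, adding a bucket moves many existing keys to a different bucket. Distributed stores commonly use consistent hashing or a similar scheme to keep that reshuffling small.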
Client API design guidelines

The Raven Client API design intentionally mimics the widely successful NHibernate API. The API is composed of the following main classes:

IDocumentStore - This is expensive to create, thread safe and should only be created once per application. The Document Store is used to create DocumentSessions, to hold the conventions related to saving/loading data and any other global configuration.

IDocumentSession - Instances of this interface are created by the DocumentStore; they are cheap to create and not thread safe. If an exception is thrown by an IDocumentSession method, the behavior of all of the methods (except Dispose) is undefined. The document session is used to interact with the Raven database: load data from the database, query the database, save and delete. Instances of this interface implement the Unit of Work pattern and change tracking.

IDocumentQuery - Allows querying the indexes on the Raven server.
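The store/session split and the Unit of Work pattern can be illustrated with a small in-memory sketch. This is not the Raven client API: the `DocumentStore` and `DocumentSession` classes, their method names and the `posts/1` id are hypothetical stand-ins written in Python, assuming a plain dictionary in place of the database.

```python
import copy


class DocumentStore:
    """Hypothetical long-lived object: created once per application."""

    def __init__(self):
        self._documents = {}  # stands in for the database

    def open_session(self):
        return DocumentSession(self._documents)


class DocumentSession:
    """Hypothetical short-lived unit of work with change tracking."""

    def __init__(self, documents):
        self._documents = documents
        self._tracked = {}    # id -> live object handed to the caller
        self._snapshots = {}  # id -> copy taken when the object was last in sync
        self._deleted = set()

    def store(self, doc_id, document):
        self._tracked[doc_id] = document
        self._snapshots[doc_id] = None  # None marks "new, must be written"

    def load(self, doc_id):
        if doc_id not in self._tracked:
            self._tracked[doc_id] = copy.deepcopy(self._documents[doc_id])
            self._snapshots[doc_id] = copy.deepcopy(self._tracked[doc_id])
        return self._tracked[doc_id]

    def delete(self, doc_id):
        self._deleted.add(doc_id)

    def save_changes(self):
        """Write only new or modified documents, then apply deletes."""
        written = []
        for doc_id, document in self._tracked.items():
            if doc_id in self._deleted:
                continue
            if document != self._snapshots[doc_id]:
                self._documents[doc_id] = copy.deepcopy(document)
                self._snapshots[doc_id] = copy.deepcopy(document)
                written.append(doc_id)
        for doc_id in self._deleted:
            self._documents.pop(doc_id, None)
            self._tracked.pop(doc_id, None)
            self._snapshots.pop(doc_id, None)
        self._deleted.clear()
        return written


store = DocumentStore()
session = store.open_session()
session.store("posts/1", {"title": "Hello"})
first_write = session.save_changes()    # the new document is written
second_write = session.save_changes()   # nothing changed, nothing written
session.load("posts/1")["title"] = "Hello, Raven"
third_write = session.save_changes()    # the modification is detected
```

The caller never says which documents changed. The session compares each tracked object with its snapshot when `save_changes` is called, which is the essence of change tracking.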
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
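The functional style referred to here has the user supply two functions, map and reduce. A minimal single-process sketch of a word count follows; it only illustrates the programming model, not the distributed runtime, and `run_mapreduce`, `map_words`, `reduce_counts` and the sample documents are names chosen for this example.

```python
from collections import defaultdict


def map_words(doc_name, text):
    """Map: emit an intermediate (word, 1) pair for every word in the text."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map over every input pair, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


documents = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog and the fox",
}
word_counts = run_mapreduce(documents, map_words, reduce_counts)
```

Because each call to `map_words` depends only on its own input pair, and each call to `reduce_counts` only on one key's values, a real runtime is free to spread those calls across many machines.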
Sharding is the partitioning of data at the resource level. The idea behind sharding is to logically split data across different resources based on load requirements.

A database shard is a horizontal partition in a database or search engine. Each individual partition is referred to as a shard or database shard.

Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than splitting by columns (as for normalization). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.

There are numerous advantages to this partitioning approach. The total number of rows in each table is reduced. This reduces index size, which generally improves search performance. A database shard can be placed on separate hardware, and multiple shards can be placed on multiple machines. This enables the database to be distributed over a large number of machines, so that the load is spread across them, which can greatly improve performance. In addition, if the database shard is based on some real-world segmentation of the data (e.g. European customers vs. American customers), then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.

In practice, sharding is far more difficult than this. Although it has been done for a long time by hand-coding (especially where rows have an obvious grouping, as in the example above), this is often inflexible. There is a desire to support sharding automatically, both in terms of adding code support for it and in terms of identifying candidates to be sharded separately.

Where distributed computing is used to separate load between multiple servers (either for performance or reliability reasons), a shard approach may also be useful.
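The European-versus-American customer example can be sketched as a hand-coded shard router. Everything here is an assumption made for illustration: the two in-memory dictionaries stand in for separate database servers, and the country list, field names and helper functions are hypothetical.

```python
# Each shard stands in for a separate database server.
SHARDS = {"eu": {}, "us": {}}

# Hypothetical, deliberately incomplete list used to infer shard membership.
EUROPEAN_COUNTRIES = {"DE", "FR", "GB", "UA"}


def shard_for(country):
    """Infer the shard from a real-world property of the row."""
    return "eu" if country in EUROPEAN_COUNTRIES else "us"


def save(customer):
    """Write a customer row to the one shard that owns it."""
    SHARDS[shard_for(customer["country"])][customer["id"]] = customer


def find_by_country(country):
    """Query only the relevant shard instead of every server."""
    shard = SHARDS[shard_for(country)]
    return [c for c in shard.values() if c["country"] == country]


save({"id": 1, "name": "Anna", "country": "DE"})
save({"id": 2, "name": "Bob", "country": "US"})
save({"id": 3, "name": "Oksana", "country": "UA"})

german_customers = find_by_country("DE")
```

The inflexibility mentioned above shows up quickly in a scheme like this: a query that is not keyed by country has to visit every shard, and changing the grouping means moving rows between servers.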
Transcript of "03 net saturday anton samarskyy ''document oriented databases for the .net platform''"
Document-Oriented Databases for the .NET platform<br />Anton Samarskyy<br />
Agenda<br />Challenges of Relational Databases<br />NoSQL: not only SQL<br />Document store concept<br />Document-oriented databases<br />Raven DB<br />Raven DB Demo<br />MapReduce (optional)<br />
CAP<br />Consistency<br />Each client has the same view<br />Availability<br />All clients can read and write<br />Partition tolerance<br />Works well across different network partitions<br />http://www.julianbrowne.com/article/viewer/brewers-cap-theorem<br />
Document-oriented databases are<br />Collection of independent documents: XML, JSON, YAML <br />Non relational, i.e. do not store data in tables with uniform sized fields for each record<br />Not limited in number of fields or length <br />Usually accessible via a RESTful HTTP/JSON API<br />Horizontally scalable<br />Can be distributed<br />Fault-tolerant<br />
Why a document store?<br />Schema free<br />User generated content<br />Storing full complex object graphs<br />Low overhead – usually operate on a single document:<br /> - One read, one write<br />Fast<br />Known format means the database can do interesting things with it…<br />
Indexing<br />Order in a schema-free world<br />Materialized views<br />Built in the background<br />Allow stale reads<br />Don’t slow down CRUD ops<br />
Document DB family<br />CouchDB: Apache project created by Damien Katz;<br />RavenDB: Oren Eini and Hibernating Rhinos project;<br />MongoDB: 10gen project;<br />SimpleDB: Amazon project. It is used as a web service in concert with Amazon Elastic Compute Cloud.<br />
Raven DB<br />Built on existing infrastructure (ESENT) that is known to scale to amazing sizes<br />Can be transactional, i.e. ACID: supports System.Transactions and can take part in distributed transactions<br />Indexes defined via LINQ queries; implements IQueryable, which maps to Lucene<br />Supports map/reduce operations<br />
Raven DB<br />Comes with a fully functional .NET client API, Unit of Work, change tracking<br />REST based, so you can access it directly from JavaScript<br />Supports optimistic concurrency<br />Can be extended with MEF<br />Has trigger support<br />Supports sharding and replication<br />http://ravendb.net<br />
MapReduce<br />MapReduce is a programming model and an associated implementation for processing and generating large data sets<br />Map function processes a key/value pair to generate a set of intermediate key/value pairs<br />Reduce function merges all intermediate values associated with the same intermediate key<br />
Sharding<br />Sharding refers to horizontal partitioning of data across multiple machines<br />The idea is to split the load across many commodity machines, instead of buying huge expensive servers<br />