White Paper: Windows Azure NoSQL
White Paper: Windows Azure NoSQL, Document Transcript

  • 1. © 2011 Microsoft Corporation. All rights reserved.

    NoSQL and the Windows Azure platform
    Investigation of an Unlikely Combination

    Author: Andrew J. Brust, Blue Badge Insights, Inc.
    Published: April 25, 2011
    Applies to: Windows Azure, SQL Azure and NoSQL

    Abstract
    An introduction to NoSQL database technology, and its major subcategories, for those new to the subject; an examination of NoSQL technologies available in the cloud using Windows Azure and SQL Azure; and a critical discussion of the NoSQL and relational database approaches, including the suitability of each to line-of-business application development.

    Disclaimer
    The research and opinions contained herein are the author’s own, and represent his perspective on the topics discussed. While the author received support and assistance from Microsoft in the creation of this paper, its thesis and conclusions do not constitute Microsoft’s official position, implied or explicit, on NoSQL technology.
  • 2. Contents

    Introduction (p. 4)
    What is NoSQL? (p. 5)
      Key-Value Stores (p. 6)
      Document Stores (p. 7)
      Wide Column Stores (p. 8)
      Graph Databases (p. 10)
        From Relational to Relationships (p. 10)
        Graphs and ORM (p. 10)
    NoSQL Database Common Traits (p. 11)
      Shared Legacy: MapReduce, Hadoop, BigTable and HBase (p. 11)
      NoSQL Database Consistency (p. 13)
      Logical Models, Physical Models, and the Ubiquity of Key-Value Pairs (p. 13)
      NoSQL Indexing (p. 14)
    NoSQL options on the Windows Azure Platform (p. 14)
      Azure Table Storage (p. 15)
      SQL Azure XML Columns (p. 15)
      SQL Azure Federation (p. 16)
      OData (p. 17)
        What the Support Means (p. 17)
      Running NoSQL Database Products using Azure Worker Roles, VM Roles and Azure Drive (p. 18)
      On-Premise Technologies (p. 18)
        SQL Server 2008/2008R2 “Beyond Relational” Features (p. 19)
        SQL Server Parallel Data Warehouse Edition (p. 19)
        Microsoft Research Dryad (p. 20)
    NoSQL Upsides, Downsides (p. 21)
      Upsides (p. 22)
        Lightweight, low-friction (p. 22)
        Minimalist tool requirements (p. 22)
        Sharding & Replication (p. 22)
        Web Developer-Friendliness (p. 22)
  • 3.     Cross-Platform, Cross-Device Operation (p. 23)
      Downsides (p. 23)
        Optimizations Have a Price (p. 23)
        Requirement to Query using a Procedural Language (p. 24)
        Necessity to Scale Manually (p. 24)
        Primitive Tooling (p. 25)
        Lack of ACID Transactional Capabilities in Some Products (p. 25)
    Conclusion: Relational’s Continued Indispensability in Line-of-Business (p. 26)
  • 4. Introduction

    Just at the time when the database market seemed to many to be almost completely mature, a group of non-relational data stores, collectively categorized as “NoSQL” databases, has attracted significant attention. These databases are often employed in public, massively scaled Web site scenarios, where traditional database features matter less, and fast fetching of relatively simple data sets matters most. Many of these databases employ parallelized query mechanisms and horizontal partitioning, and allow storage of heterogeneous, loosely-schematized data records.

    With so much developer mindshare being focused on the Web these days, and with the constant thirst for performance amongst technologists, especially for large Web applications, it’s no wonder that NoSQL databases are seen favorably and used by an enthusiastic population of developers. As Cloud computing grows, and given the proclivity of developers to conflate Web computing and scale with Cloud computing and elasticity, interest in NoSQL databases amongst cloud developers is equally unsurprising.

    Together, these streams of interest and visibility are significant; understandably, then, even users of traditional, relational databases are exploring the question of whether NoSQL technology is something they should use, too.

    There’s no free lunch though. Although NoSQL databases do facilitate the performance and availability that public Web properties sometimes require, the cost can be great. Things that users of a Relational Database Management System (RDBMS) would take for granted, including some or all of: transactional, atomic writes; indexing of non-key columns; query optimizers; and declarative, set-oriented query, are sacrificed in the NoSQL world. In certain scenarios, that sacrifice is justified and acceptable. But in many others, including line-of-business applications, that sacrifice is much less reasonable.

    As with anything in the software world, when technologies enter the realm of phenomena, the prudent thing to do is deconstruct and demystify them, understand and enumerate their various capabilities, then judge if those capabilities merit the enthusiasm and justify a disruption.

    Specifically, in the realm of cloud computing with the Microsoft stack, i.e. Windows Azure and SQL Azure, important questions arise with respect to NoSQL, and need to be answered. What exactly is NoSQL, and what characterizes its various subcategories? Are individual facets of NoSQL database architectures available to Azure developers? Are they sufficient or will only a full-blown NoSQL technology fulfill most requirements? Where in the Azure stack do these NoSQL technologies sit? For the types of applications that .NET and SQL Server practitioners build, is NoSQL better than relational? Is it even as good?

    These questions must be explored and answered before the larger question of NoSQL’s (or relational’s) overall efficacy can be judged.

    In this paper, we will define NoSQL, explore some of its history, review the various types of NoSQL databases, and understand their respective features. We will determine the commonalities between the various NoSQL subcategories and try to determine what basket of features seems to attract developers the most. We’ll examine the scenarios where use of NoSQL makes the most sense. We’ll distill the enumeration of NoSQL features down to the overall tradeoffs between NoSQL and relational databases.
  • 5. We will also review the various components of the Azure stack that offer NoSQL technology, or capabilities that are comparable to those found in NoSQL databases. We will look at Windows Azure Storage, new and imminent features in SQL Azure, and even ways to deploy non-Microsoft NoSQL databases to the Azure cloud, to make them usable from .NET code that is also deployed there.

    By the end of this paper, readers should have a good understanding of what NoSQL is all about and whether individual NoSQL features, full-fledged NoSQL databases or continued use of relational technology will work best for them.

    Let’s now define NoSQL, by examining the general use cases that it serves. We’ll also discuss the subcategories of NoSQL and take a more detailed look at each of them.

    What is NoSQL?

    There are scenarios in the software development world where data management is required, but what many of us might think of as a full-fledged database is not. Think of that application you wrote once that had a small amount of data to store, and did it using flat files, so you could avoid creating a database. Maybe you needed to store a few bits of information about the current user; maybe you needed to store application settings, or application state information, like window size and position; or perhaps you needed to store and retrieve actual content (be it raw text, images, or media) and the file system seemed to make more sense than a relational database as the repository.

    Now imagine an application like that one you wrote, but which ran on the Web and needed to serve a vast array of users distributed across the globe, many of them concurrently. You would find that your database needs, while still technically modest in terms of query complexity, would almost certainly outstrip what you could do comfortably using the file system. You’d need a server, or even a globally distributed cluster of servers. The server or cluster would need to be highly scalable to meet the demands of a popular Web-based application, and very fast at performing these relatively simple discrete store and fetch operations. You would need a database, but probably not the relational one you’re used to.

    The grouping of database engines collectively referred to as “NoSQL” is optimized for these workloads. Most of them sport distributed architectures as a core feature. Many of them are Apache or independent open source projects.

    NoSQL databases are good at what they do, primarily by dispensing with many of the tenets of relational database management. Many NoSQL databases trade off “ACID” (atomicity, consistency, isolation and durability) guarantees in favor of very high performance in the broad-scale, simple store-and-retrieve scenario. And as we mentioned already, NoSQL databases, to varying degrees, even allow for the schema of data to differ from record to record.

    The “CAP” theorem says that databases may only excel at two of the following three attributes: consistency, availability and partition tolerance. Relational databases favor the first and last of those three properties; NoSQL databases favor the last two. In other words, NoSQL intentionally de-emphasizes the rules and functionality of consistency that many database administrators and developers think of as the very prerequisites of database management.
  • 6. In his paper Amazon's Dynamo [1] (Dynamo is the online retailer’s foundational NoSQL database), Werner Vogels, Amazon.com’s Chief Technology Officer, describes why such an approach is appropriate: “Most of these services only store and retrieve data by primary key and do not require the complex querying and management functionality offered by an RDBMS.”

    In other words, various systems on the Web, many of which are consumer-facing, don’t have sophisticated database needs, but they nonetheless have a huge burden. They must carry out their simple needs very, very quickly. NoSQL databases handle these workloads well, but they make serious concessions on otherwise mainstream database needs in order to do it.

    That is well-justified, but not always well-understood; in fact there exist NoSQL practitioners who advocate the usage of NoSQL as a general database technology applicable to the mainstream of application database needs. Such advocacy has caused some relational database customers to have concerns that they should perhaps switch to NoSQL databases even for line-of-business (LOB) applications. Customers have these concerns despite the fact that most LOB apps require transactional guarantees, and are well-served by normalized design and formal schema.

    This can be a controversial state of affairs and we hope to sort out that controversy. For now though, let’s just say that NoSQL databases work well in certain scenarios, and that sketching out what those scenarios are, and what they are not, is an important goal of this paper.

    To help enumerate those scenarios, it’s best that we discuss four subcategories that NoSQL databases tend to break down into. Enumerations of such subcategories tend to vary, but they usually include Key-Value Stores, Document Stores, Wide Column Stores and Graph Databases.

    Each NoSQL subcategory serves certain scenarios best. To understand core NoSQL scenarios as best we can, let’s explore the various NoSQL subcategories and the specific types of applications and workloads they support most ably.

    Key-Value Stores

    The Key-Value Store subcategory (summarized graphically in Figure 1) is perhaps the mother of all NoSQL database types. Most NoSQL databases feature key-value mechanisms, even if only behind the scenes. NoSQL databases that belong to the explicit Key-Value Store category use their namesake construct as the basic unit of storage.

    A key-value pair might consist of a key like “Phone Number” that is associated with a value like “(212) 555-1212.” Key-Value Stores contain records whose entire content is made up of such pairs; the structure of one record can differ from the others in the same collection.

    [1] http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

    Figure 1: Key-Value Stores often use the nomenclature of tables and rows, but the latter simply contain collections of key-value pairs, which vary from row to row.
  • 7. If you do much programming, you’ll recognize this construct right away. That’s because collections, dictionaries and associative arrays in the programming world work on the same principle. Data caches work on the key-value principle as well. In fact, one prominent Key-Value Store, MemcacheDB, is API-compatible with the Memcached open source cache.

    The parallels between Key-Value Stores on the one hand, and collections, dictionaries, associative arrays and caches on the other, are more than academic; they are significant. They show that NoSQL databases work well in circumstances where data retrieval needs to be cache-like in speed and where the data which must be stored and retrieved consists of small, simple collections of attributes and values.

    Applications where Key-Value Stores would work well include anything where lists, like product categories, individual product attributes, shopping cart contents and top n best-selling products, or individual values like color schemes, a landing page URI, or a default account number, must be maintained.

    Values can consist of long text content, not just numeric and short string data. As such, content like comments, reviews, status messages or even private emails can be stored in a Key-Value Store. Most of this data is non-hierarchical, so the lack of relational logic or join constructs is acceptable.

    Some of this key-value-appropriate data (though probably not the long text content) is akin to lookup data, or configuration and preference data, in smaller applications. For a desktop app, we could imagine this data might be stored in a configuration file or a small, offline database. We could also imagine that much of it might do well to be loaded in memory upon application startup.

    For a consumer-facing Web app, the data is similarly straightforward, but the storage technology itself must be more capable. The data must live in a repository that is distributed, fault tolerant, fast and highly available.
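
    The key-value construct described above can be sketched in a few lines of Python. This is an illustrative in-memory model only, not the API of MemcacheDB, Dynamo or any other product; the class name `KeyValueTable`, its methods, and the sample row keys and values (other than the phone number example from the text) are assumptions made up for the example.

```python
class KeyValueTable:
    """Hypothetical in-memory stand-in for a Key-Value Store table."""

    def __init__(self):
        # row key -> dictionary of key-value pairs for that row
        self._rows = {}

    def put(self, row_key, pairs):
        """Store (or overwrite) key-value pairs in the given row."""
        self._rows.setdefault(row_key, {}).update(pairs)

    def get(self, row_key, key, default=None):
        """Fetch one value by row key and key, much like a cache lookup."""
        return self._rows.get(row_key, {}).get(key, default)

    def keys(self, row_key):
        """List the keys present in a row; rows need not share keys."""
        return sorted(self._rows.get(row_key, {}))


customers = KeyValueTable()

# Two rows in the same table with entirely different structures.
customers.put("customer-1", {"Phone Number": "(212) 555-1212",
                             "Default Account": "ACCT-001"})
customers.put("customer-2", {"Landing Page": "/home",
                             "Cart": ["book", "lamp"]})

phone = customers.get("customer-1", "Phone Number")
missing = customers.get("customer-2", "Phone Number")  # this row has no such key
```

    Note that every read names a row key and a key; there is no join and no query across rows, which is the trade-off the text describes.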
    Beyond MemcacheDB and Dynamo lie other Key-Value Stores. Project Voldemort is an open source Key-Value Store that originated at LinkedIn; and Dynomite, Kai and Riak are open source derivatives of Dynamo (which is neither open source nor publicly available, even though its architecture has been disclosed through published papers).

    Before we go on to describe other NoSQL database types, we must reiterate that almost all of them, whether physically or conceptually, build upon Key-Value Store principles. Therefore you should expect their applications to be more specialized than, but not wholly distinct from, those of Key-Value Stores themselves.

    Document Stores

    Document Stores are NoSQL databases which treat what might be otherwise called “records” or “rows” as “documents.” As with Key-Value Stores, each record can have a structure that differs widely from the others. Each document consists of a set of keys and values, which can be compared to a relational table’s field names and values. The Document Store data structure is summarized in Figure 2.

    Two leading Document Stores, CouchDB and MongoDB, each use JavaScript data types for the values stored in their documents. Because of this, their documents can be thought of as JavaScript objects and can, in fact, be written and read in JSON (JavaScript Object Notation) format. That doesn’t mean Document Stores equate to Object Databases, but it does mean that Document Stores have an affinity with JavaScript programming and programmers.

  • 8. In fact, the native stored procedure/scripting language for both CouchDB and MongoDB is JavaScript itself.

    Documents can also contain attachments, making document stores useful for content management. The fact that certain Document Stores feature versioning of their documents (i.e. old versions are retained and all versions are numbered) makes this all the more so. CouchDB and MongoDB have been used for an array of public-facing Web application types including blog engines, event logs, appointment calendars, media stores, chat applications, cloud bookmark storage and even Twitter clients.

    An important facet of Document Stores is that the documents themselves can be addressed by unique URLs. And given the HTTP and URL orientation, document databases are automatically REST-friendly, as their APIs bear out. In the case of CouchDB, the HTTP orientation is developed to the point where the database can function as its own Web application server.

    Here’s how: so-called Show Functions in CouchDB (JavaScript functions that render HTML with the return statement) can be stored in special documents called design documents, and each function within is accessible via URL. This means that entire Web applications can be implemented in a document database. Users visit a URL, code runs on the server and content is returned via the HTTP response stream, just as it would be with classic ASP, node.js, ASP.NET Web Pages or PHP.

    This HTTP and application orientation distinguishes Document Stores from Key-Value Stores, the latter of which are more general purpose in their implementation and application. That said, there are some NoSQL taxonomies which do not recognize the Document Store category and instead label its members as Key-Value Stores. As you will see, the remaining two NoSQL subcategories utilize key-value technology as well.
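
    To make the document model concrete, here is a minimal Python sketch of two differently shaped documents held in one collection, serialized to and from JSON. It does not use a CouchDB or MongoDB client; the sample documents, the host name `db.example.com`, the database name `blog` and the helper `document_url` are all assumptions for illustration, and the URL layout is only meant to suggest the REST-style addressing described above.

```python
import json

# Two hypothetical documents in the same collection; their structures differ.
post = {
    "_id": "post-001",
    "type": "blog-post",
    "title": "Hello, NoSQL",
    "tags": ["nosql", "azure"],
    "author": {"name": "Chris", "city": "Auckland"},  # a directly nested document
}
event = {
    "_id": "event-042",
    "type": "appointment",
    "when": "2011-04-25T09:00:00",
}

# The collection is itself just a mapping from document id to document.
collection = {doc["_id"]: doc for doc in (post, event)}


def document_url(base_url, database, doc_id):
    """Build an illustrative REST-style address for a document (assumed layout)."""
    return f"{base_url}/{database}/{doc_id}"


# Documents are written and read as JSON text.
wire_format = json.dumps(collection["post-001"])
round_tripped = json.loads(wire_format)

url = document_url("http://db.example.com", "blog", "post-001")
```

    The nested `author` value shows how a document can directly contain another document rather than pointing to it by key.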
    Figure 2: Document Stores contain JSON objects, referred to as documents, each of which has a schema-free set of properties and values. Values may contain attachments, point to other documents, or directly contain them.

    Wide Column Stores

    Wide Column Stores, also known as Column Family Stores, manage key-value pairs, but they organize their storage in a semi-schematized and hierarchical pattern. Perhaps fittingly then, some of their nomenclature correlates with that of RDBMS technology. For example, the keys in a Wide Column Store are referred to as columns, and are stored in structures that are sometimes referred to as tables.

  • 9. In between the table and column level lie various intermediate structures that vary by product. For example, Apache Cassandra (originated by Facebook) features Super Columns. Hypertable and Apache HBase feature Column Families, and Google’s BigTable features Tablets. The hierarchical structure and some of the varying nomenclature of Wide Column Stores are summarized in Figure 3.

    Although the schema within the intermediate structures can vary from row to row, tables and the intermediate structures themselves must be declared. Therefore, Wide Column Stores, while they tolerate schema variation at the “leaf” column level, are not completely schema-free. One could reasonably argue, in fact, that schema changes at the non-leaf level in Wide Column Stores are more disruptive than changes to table schemas in relational databases.

    Wide Column Stores work well for a subset of the requirements that Key-Value Stores accommodate, and many adopters of this category of NoSQL database cite the performance factors, over the structural ones, as reasons they chose it. But, clearly, Wide Column Stores are best for semi-structured data, rather than data whose structure is completely variable from row to row.

    As an example, in a product catalog, we may have a collection of items, each of which has a size and a rating associated with it, and we may want to store these items together in a table. But certain items’ sizes may be represented by height, width and depth, others by radius, and still others by weight. The rating may be a star rating on a 1-5 scale (e.g. for a book), or a collection of sub-ratings on various attributes (e.g. freshness, flavor, color, moistness). Accommodating a grouping of entities with high-level characteristics in common, but with differing context-specific attributes, is one area where Wide Column Stores do well.

    In the relational world, traditionally, such context-specific attributes would each need to be stored in separate tables, with a foreign key in the main table to link them [2]. Joins and application-level merging of the datasets might be necessary. But Wide Column Stores allow such differently nuanced data to commingle in the same tables and query result sets.

    [2] Recent versions of major RDBMS products offer new features to accommodate this requirement without resorting to separate attribute tables. Such features in SQL Server and SQL Azure will be discussed later in this paper.

    Figure 3: Wide Column Stores contain tables (indicated above as “T”); Cassandra calls them “super-column families” (shown as “SCF”). These contain a key and columns (“C”) which consist of name/value pairs. Columns are subdivided into column families (“CF”), which are known as “super columns” (“SC”) in Cassandra. Columns are schema-free, but higher-level objects must be declared.
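
    The product catalog example can be sketched as a small Python model of the table / column family / column hierarchy. This is a conceptual illustration, not the API of Cassandra, HBase or Hypertable; the class `WideColumnTable`, its methods and the sample row keys and values are assumptions. The one rule it enforces is the one described in the text: column families must be declared, while the columns inside them are free to vary from row to row.

```python
class WideColumnTable:
    """Hypothetical model: declared column families, schema-free columns."""

    def __init__(self, name, column_families):
        self.name = name
        self._families = set(column_families)  # must be declared up front
        self._rows = {}  # row key -> family -> column name -> value

    def put(self, row_key, family, column, value):
        """Write one column; the family must exist, the column need not."""
        if family not in self._families:
            raise KeyError(f"column family {family!r} was not declared")
        row = self._rows.setdefault(row_key, {})
        row.setdefault(family, {})[column] = value

    def get_family(self, row_key, family):
        """Return a copy of all columns stored for a row within one family."""
        return dict(self._rows.get(row_key, {}).get(family, {}))


catalog = WideColumnTable("catalog", ["size", "rating"])

# A book: size as height/width/depth, rating as a single star value.
catalog.put("book-1", "size", "height", 9.0)
catalog.put("book-1", "size", "width", 6.0)
catalog.put("book-1", "size", "depth", 1.5)
catalog.put("book-1", "rating", "stars", 4)

# A cake: size by weight, rating as a set of sub-ratings.
catalog.put("cake-1", "size", "weight", 2.0)
catalog.put("cake-1", "rating", "freshness", 4)
catalog.put("cake-1", "rating", "flavor", 5)

book_size = catalog.get_family("book-1", "size")
cake_rating = catalog.get_family("cake-1", "rating")
```

    Both rows live in the same table and share the `size` and `rating` families, yet no separate attribute tables or joins are needed to hold their differing columns.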
  • 10. Graph Databases

    Graph databases recognize entities in a business or other domain, and explicitly track the relationships between them. In the graph database world, these entities are called nodes and the relationships between them are called edges; both terms come from mathematical graph theory, as does this NoSQL database subcategory’s name.

    An example of a graph database assertion (the fundamental atomic unit of data expression) might be:

        Chris city Auckland

    where Chris and Auckland are nodes and city is an edge.

    From Relational to Relationships

    As we try to orient ourselves to graph databases from a relational frame of reference, we could think of an edge in a graph database (a predicate) as a join, and the subject and the object of that predicate (the Chris node and the Auckland node, respectively, in the above case) as rows in a table.

    Attributes of a node that have scalar values (for example the attribute Age might have a value of 45) can also be represented using edges and nodes, or as properties and values, depending on the specific graph database in use. In the former case, an edge might be thought of as a column, in a broad sense, rather than as a join.

    A collection of assertions is kept together in a graph. The structure of Graph Databases is illustrated in Figure 4. New edges can be added (or old ones removed) at any time, allowing one-to-many and many-to-many relationships to be expressed easily and avoiding anything like an intermediate relationship table that you might use in a relational database to accommodate many-to-many joins.

    Social graphs fit into the graph database rubric nicely (as does the name). Constructs like friends, followers, degrees of separation, lists, endorsements, status messages and responses to them are very naturally accommodated in graph databases. Semantic Web data also maps quite nicely onto the graph database structure.
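
    Assertions of the “Chris city Auckland” kind can be modeled in Python as (subject, edge, object) triples. The sketch below is not the query interface of any graph database product; the helper functions `objects` and `subjects`, the `follows` edge and the names Dana and Eli are assumptions introduced for illustration.

```python
# Each assertion is a (subject, edge, object) triple; the graph is a set of them.
graph = {
    ("Chris", "city", "Auckland"),
    ("Chris", "follows", "Dana"),
    ("Chris", "follows", "Eli"),
    ("Dana", "follows", "Eli"),
    ("Eli", "follows", "Chris"),
}


def objects(graph, subject, edge):
    """Nodes reached from `subject` by following `edge`."""
    return sorted(o for s, e, o in graph if s == subject and e == edge)


def subjects(graph, edge, obj):
    """Nodes that point at `obj` through `edge`."""
    return sorted(s for s, e, o in graph if e == edge and o == obj)


chris_follows = objects(graph, "Chris", "follows")  # ['Dana', 'Eli']
eli_followers = subjects(graph, "follows", "Eli")   # ['Chris', 'Dana']

# A new edge needs no schema change and no intermediate relationship table.
graph.add(("Dana", "city", "Auckland"))
auckland_residents = subjects(graph, "city", "Auckland")  # ['Chris', 'Dana']
```

    The many-to-many `follows` relationship is expressed simply by adding more triples, which is the property the text contrasts with relational join tables.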
    Graphs and ORM

    As we consider the concepts of properties, values and relationships, it starts to become clear that graph database theory has some alignment with object-relational modeling and ORM programming.

    Figure 4: Graph databases, like those in other NoSQL subcategories, may be key-value based, but they excel at tracking relationships (edges) between entities (nodes), in addition to the entities, keys and values themselves. Sometimes even the key-value pairs are represented as edges and nodes.

  • 11. This then raises the question of whether object databases belong in the NoSQL camp, or even whether they are in fact synonymous with graph databases. There really are no rules or strict definitions to provide authoritative answers to these questions, but there are differences in intent between graph and object databases.

    Object databases typically are schema based (even if the schema describes a class rather than a table) and are focused on entities and their properties. Graph databases are designed to accommodate slowly- or even rapidly-changing schemas and focus on relationships between entities more than the entities themselves. Popular graph databases include AllegroGraph, Neo4j and Twitter’s FlockDB.

    NoSQL Database Common Traits

    Having now covered the four main NoSQL subcategories, and what distinguishes them, let’s take a look at the qualities which each category’s products have in common. We’ll first look at a pair of technologies from Google (and their Apache project counterparts) whose design principles pervade all NoSQL subcategories. We’ll continue with a general look at the data consistency models employed in NoSQL databases and the split between NoSQL’s physical and logical implementations. We’ll finish with a look at NoSQL indexing and we’ll then be able to move to the next section and review the various features and products within Windows Azure and SQL Azure that provide NoSQL functionality.

    Shared Legacy: MapReduce, Hadoop, BigTable and HBase

    It’s a good idea for us to take a look at two technologies which underlie, or have provided inspiration for, many of the individual products in each NoSQL subcategory: specifically, Google’s MapReduce and BigTable and their open source counterparts, Apache Hadoop and Apache HBase.
Google MapReduce and the open source Hadoop project provide generalized parallel job processing engines; Google BigTable and the open source HBase are Wide Column Stores whose tables can serve as sources and destinations for the MapReduce and Hadoop jobs, respectively. Why are the job processing engines necessary? Because the less structured, less formal approaches employed by NoSQL databases make querying them less straightforward than in the relational world, and MapReduce/Hadoop help mitigate the burden. Think about it: although explicit joins are not necessary in the NoSQL world, the permissive environment and resulting inconsistency across records/entities/documents makes for quite a bit more hunting and gathering in order to satisfy a query. This is especially true for distributed NoSQL databases which store their data across various servers, typically using a partitioning pattern called sharding (more on that later). The lack of query optimizers, and corresponding query efficiencies, in NoSQL databases cries out for some help. NoSQL databases often require queries to be broken up and executed across multiple repositories on different servers. At some point, the resulting segmented result sets need to be collected and unified. An
  • 12. 12 approach called map-reduce acknowledges and addresses this conundrum. Specifically, the process of distributing the query across multiple agents is the Map step, and the process of coalescing the results into a single result set is the Reduce step. Map-reduce is a general algorithm, and is prevalent in functional programming languages – including F# – which support the notion of map and reduce functions. MapReduce (without the hyphen) is the patented software framework from Google that the company applies in the realm of managing large datasets over clusters or other distributed topologies. Hadoop is the top-level Apache project which implements map-reduce as a generalized highly parallel, divide-and-conquer batch job task manager. Google MapReduce/BigTable and Apache Hadoop/HBase have their fingerprints all over most NoSQL databases. For example, Apache CouchDB, one of the document store databases already discussed, is, according to its Web site on apache.org, “queried and indexed in a MapReduce fashion.” Some would argue that CouchDB’s map and reduce steps differ conceptually from those in MapReduce itself. Nonetheless, the overarching map-reduce approach is the inspiration for the design of many NoSQL products. As effective as these mechanisms can be, they also introduce extra work for the database developer. That’s because instead of providing a declarative language over distributed storage that could then be implemented using map-reduce functionality under the covers, the architecture’s designers focused primarily on the raw processing approach and never added a language abstraction. In the world of line-of-business applications, the declarative power of SQL provides productivity that most organizations count on. Map-reduce based systems, by and large, cannot provide that productivity. 
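To make the Map and Reduce steps described above more concrete, here is a minimal Python sketch. It is an illustration only and not how Hadoop or Google MapReduce is implemented: the shard contents, the function names (map_shard, reduce_counts, count_by_city) and the "count records per city" query are all hypothetical, and local threads stand in for the separate servers a real system would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical shards: each list stands in for the records held on one server.
SHARDS = [
    [{"city": "Paris"}, {"city": "Lyon"}],
    [{"city": "Paris"}, {"name": "no city here"}],  # schema-free: fields may be missing
    [{"city": "Lyon"}, {"city": "Paris"}],
]


def map_shard(records):
    """Map step: run the same query against one shard, producing a partial result."""
    return Counter(r["city"] for r in records if "city" in r)


def reduce_counts(left, right):
    """Reduce step: coalesce two partial results into one."""
    return left + right


def count_by_city(shards):
    """Fan the query out to every shard, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_shard, shards))
    return dict(reduce(reduce_counts, partials, Counter()))


RESULT = count_by_city(SHARDS)
print(RESULT)
```

Note that the query logic lives in hand-written functions rather than in a declarative statement, which is the extra developer work the text refers to.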
A summary of the various NoSQL database subcategories, and the suitability of each to different scenarios and requirements, including map-reduce, is presented in table form in Figure 5. Figure 5: This chart shows the applicability of different NoSQL database types to different needs or scenarios. Notice that wide column stores are more special-purposed than are the other NoSQL subcategories, which are applicable in a variety of scenarios.
  • 13. 13 NoSQL Database Consistency Many NoSQL databases use an “eventual consistency” model for database updates and schema changes. This means that changes made at one replica will be transmitted asynchronously to the others. Domain Name Servers on the Internet refresh themselves on this model, and that is exactly why DNS propagation delay can allow some Internet users to navigate successfully to a new or updated domain name, while for other users the name may not resolve correctly. Eventually, all users’ DNS servers are updated and the anomaly disappears. The sacrifice of propagation delay is acceptable when the alternative (a coordinated atomic update across all DNS servers globally) is considered. The eventual consistency model allows updates to occur and DNS server availability to be maintained, all for the price of a temporary, tolerable, well-understood anomaly in the data. Likewise, in the NoSQL context, eventual consistency makes possible discrepancies in data state between replicas, and thus between users and locations, for a temporary period. As with DNS servers, such concessions to consistency are made in the name of high availability and will eventually resolve. Not all NoSQL databases use eventual consistency. Some are fully transactional. Others use an optimistic concurrency model. Some databases, like Apache Cassandra and Apache HBase, not only replicate over time, but commit their initial writes to disk over a certain latency period as well. In other words, these databases perform buffered writes by writing to memory initially (and to a log), rather than to tables on disk. This is done in order to batch up the writes, rather than have them execute one at a time, since batching reduces the aggregate I/O time required. It is completely different from the update behavior of an RDBMS. The liberal consistency regimes of many NoSQL databases are quite appropriate in certain scenarios. 
It’s important to remember that the transactional model is still the correct one in many others, including most line-of-business applications. The supremacy of one model in certain circumstances does not render established models obsolete in a variety, or even a majority, of others. Consistency is not the only sacrifice made in the name of performance and high availability. For some NoSQL databases, declarative query power is sacrificed as well. For example, “views” in CouchDB, rather than being stored queries, are actually JavaScript programs that return data. They are somewhat akin to stored procedures in the relational world, but even that analogy falters, as CouchDB views must iterate through data imperatively rather than use the set-oriented constructs found in SQL. The result is that individual query patterns must be optimized through code that anticipates them, rather than through optimizing logic that encounters them. As with the consistency sacrifice, in some situations, this may be perfectly acceptable. As we have discussed, many public Web applications perform a variety of very simple queries and a small number of complex ones, all of which can be explicitly coded. But, again, that’s not usually the case with LOB apps. Logical Models, Physical Models, and the Ubiquity of Key-Value Pairs The subcategory distinctions we’ve covered here are not only soft, but are logical model distinctions that may or may not translate to the underlying physical models. For example, Cassandra, a Wide Column
  • 14. 14 Store, essentially imposes a logical “super column” hierarchy over key-value pairs. Key-Value Stores underlie most other subcategories, either in terms of technique (such as how CouchDB’s documents are actually key-value structures, in an overt fashion) or in implementation (such as how edges and nodes in a graph database can be stored as key-value pairs as well, but behind the scenes). Document Stores, Wide Column Stores, and Graph Databases are in some senses akin to domain specific languages (DSL) in the programming world. While most NoSQL databases utilize key-value constructs, distributed architectures and sharding, and allow for schema-free databases, the various NoSQL subcategories provide different data interfaces, each of which works best in a subset of scenarios. NoSQL Indexing Despite the DSL analogy above, the common key-value substrate of most NoSQL databases does not render the subcategory a mere trivial abstraction. The quite wide spectrum of indexing features in the various NoSQL databases makes this clear. Some NoSQL databases index on little else than the keys used for rows/entities/documents and/or partitions. Others go a bit beyond this. For example, CouchDB indexes documents only on their IDs and sequence (version) numbers, but it also creates indexes on views. The AllegroGraph Graph Database, meanwhile, indexes everything (id, subject, predicate, object and graph), automatically. Some Key-Value and Wide Column Stores support so-called “secondary” indexes – a generic term for an index built on the value of a property/column that is not the key. But secondary indexes are relatively new features in some databases and still a bit immature. For example, Cassandra added secondary indexes in version 0.7, which was just released on January 9, 2011. These secondary indexes are essentially hash indexes only; support for bitmapped indexes, with which range criteria could be satisfied, is in the works for a future release. 
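As a rough illustration of what a hash-based secondary index provides, and why it cannot satisfy range criteria, consider the following Python sketch. TinyStore and its methods are hypothetical names invented for this example; they do not correspond to the API of Cassandra or any other product.

```python
from collections import defaultdict


class TinyStore:
    """Hypothetical key-value table with one hand-maintained secondary hash index."""

    def __init__(self, indexed_property):
        self.rows = {}                     # primary access path: row key -> properties
        self.indexed_property = indexed_property
        self.index = defaultdict(set)      # secondary index: property value -> row keys

    def put(self, key, properties):
        old = self.rows.get(key)
        if old is not None and self.indexed_property in old:
            # Remove the stale index entry before the row is overwritten.
            self.index[old[self.indexed_property]].discard(key)
        self.rows[key] = properties
        if self.indexed_property in properties:
            self.index[properties[self.indexed_property]].add(key)

    def find_equal(self, value):
        """Equality lookup: a single hash probe into the index."""
        return sorted(self.index.get(value, ()))

    def find_range(self, low, high):
        """Range lookup: a hash index has no ordering, so every row is scanned."""
        return sorted(
            key
            for key, props in self.rows.items()
            if self.indexed_property in props
            and low <= props[self.indexed_property] <= high
        )


store = TinyStore("age")
store.put("u1", {"age": 30})
store.put("u2", {"age": 41, "city": "Oslo"})
store.put("u3", {"name": "no age recorded"})   # schema-free: property may be absent
store.put("u1", {"age": 41})                   # update moves u1 within the index

print(store.find_equal(41))      # ['u1', 'u2']
print(store.find_range(40, 50))  # ['u1', 'u2']
```

The equality lookup touches only the index, while the range lookup falls back to a full scan; an ordered index structure would be needed to avoid that.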
In the absence of secondary index support, some developers implement them on their own. The common approach is to create a second table containing the values of the “indexed” column and their corresponding row keys from the main table. This requirement is somewhat emblematic of NoSQL databases in general: developers may need to implement on their own what could long be taken for granted in an RDBMS. Again, in some situations, the tradeoff is deemed reasonable given the performance and availability requirements, but the price should not be understated. NoSQL options on the Windows Azure Platform As we discussed in the paper’s introduction, a proper evaluation of NoSQL involves deconstructing and deciding which features or characteristics are compelling. Next, you need to decide if those same features or characteristics are available from technologies you already use. With that in mind, what follows is an overview of certain Windows Azure and SQL Azure technologies (plus a few Microsoft on-premise
  • 15. 15 products and features) and which aspect of NoSQL technology each one implements. As you will see, elements of NoSQL computing can pop up in some unexpected places. Azure Table Storage Azure Storage is probably the most compelling place to start on our tour of NoSQL in Azure. That’s because Azure Table Storage is in fact a NoSQL database. Of the various categories of NoSQL database discussed in the last section, Azure Table Storage fits most snugly with Key-Value Stores. Azure Storage key-value pairs are called Properties; they belong to Entities which, in turn, are organized into so-called Tables. Azure Table Storage features optimistic concurrency and, as with other NoSQL databases, is schema-free, so the properties of each entity in a table may differ. Azure Table Storage does not support secondary indexes, and it’s not intended for use as a mainstream database, especially since SQL Azure is available to handle relational database duties. But Azure Table Storage is inexpensive (15c/GB/month and $0.01/10,000 transactions), easily programmed (via a .NET client library, a LINQ client and a RESTful API), and scales over multiple servers, as needed, automatically. Since Azure Table Storage is a bona fide NoSQL database, we could stop there. But it’s important to realize that other Azure technologies allow for the implementation of NoSQL approaches. These options are less about full-on NoSQL and more about cherry picking various NoSQL features when that is all that is actually desired. Let’s continue by looking at those options. SQL Azure XML Columns We’ll declare here and now: using XML columns in SQL Azure data storage constitutes NoSQL database storage. There are a number of reasons why this is the case. First, consider that an XML payload bears much resemblance to a Document Store NoSQL database. Not only are XML documents just that (i.e. 
documents) but they store a collection of elements and values, with those XML elements equivalent to key-value pairs in Document Stores 3 . The schema of an XML document can be changed at will (provided there’s no XSD schema in place – and the Schema Collections feature of SQL Server that supports XSD is not even implemented in SQL Azure at this time) and a collection of XML documents may or may not follow a given schema consistently. Again, each of these qualities is common to SQL Azure XML columns and Document Stores. If that weren’t enough to convince you, then consider that the developer version of Azure Storage (i.e. the emulator that runs on the local PC to use during development) is actually implemented using XML columns in SQL Server Express Edition. That means all Azure developers have a full XML-data-as-NoSQL proof-of-concept running on their development PCs. This is more than coincidence; it’s about motivation: XML columns were added to SQL Server (and other major relational database products) to accommodate databases with dynamic schema needs for certain 3 This analogy works best if we think of XML documents as a non-hierarchical storage mechanism. If we think of them as hierarchical (i.e. through the use of XML attributes or child elements) then an analogy with Wide Column Stores becomes more appropriate.
  • 16. 16 tables. Prior to XML in the database, the only way to accommodate changing schemas was to build out “vertical” tables, whose column values were stored as rows in attribute value tables (as key-value pairs, in fact). So if we consider one of the major value propositions of NoSQL, namely flexibility around changing schemas, we see that very scenario is the inspiration for the XML column feature in SQL Server (and now in SQL Azure). Using XML for NoSQL computing needs is not a kluge, but rather a sensible alignment of interests. It is important to note, however, that unlike on-premise editions of SQL Server, SQL Azure does not support indexes on XML columns. As long as your tables contain a scalar primary key column, then you’ll have the option of a key-based index, though you will lack the equivalent of a secondary index. SQL Azure Federation NoSQL focuses quite heavily on the notion of horizontal scaling and “sharding.” Sharding (i.e. horizontal partitioning) of databases accommodates the vast demand that many public Web products may experience. Using map-reduce-style technology is a common NoSQL product solution for managing the shards. SQL Azure Federation, announced at the 2010 Professional Developers Conference (PDC), is a forthcoming feature of SQL Azure which will allow individual SQL Azure databases to function as individual “shards” in a larger virtual database. This feature provides a supportable approach to dealing with SQL Azure’s current 50GB size limit on individual databases and enhances query performance while at the same time retaining the RDBMS features that most LOB developers need. SQL Azure Federation “Members” are the counterparts to NoSQL Shards. Shards are “federated” (hence the name of the feature) and this is achieved through the creation of a so-called Federation Key. 
The key is present in any table that will be distributed and each shard is defined in such a way that it is responsible for storing rows whose federation keys are in a specific range of values 4. If the distribution of values changes over time, individual shards which become too large can be split into multiple ones. A significant advantage of this splitting feature is that it takes place online, under load, without affecting database availability or consistency. Once again, Azure lets us cherry-pick a NoSQL feature, without forcing us to forfeit RDBMS underpinnings. This first version of SQL Azure Federation will not have support for so-called fan-out queries. So it will not have a map-reduce-style facility for taking a query that spans multiple members, splitting it automatically into separate queries and merging the results of each into a single result set. But SQL Azure Federation will have mapping functions, whereby a needed shard can be located by a specific Federation Key value and need not be addressed by its physical database name. This makes programming the query 4 In this way, a Federation Key is similar to an Azure Table Storage Partition Key.
  • 17. 17 distribution simpler and it also provides the foundation for a full map-reduce-style fan-out query capability that could appear in a future release. 5 OData OData is Microsoft’s generalized XML data serialization format, based on the ATOM feed standard, and RESTful API used to query, create and update data in the repositories it wraps. OData debuted as the transmission format and API for data exposed by what is now called WCF Data Services (originally known as project “Astoria,” then as ADO.NET Data Services). Typically, Astoria services act as RESTful wrappers around Entity Framework data models. But with the generalization of the data format and REST implementation, OData is now used by Microsoft and others to expose a variety of data sources. On-premise Microsoft products and technologies that support OData interfaces include SQL Server Reporting Services in SQL Server 2008 R2, SharePoint 2010 lists and Dynamics CRM 2011. In the Azure world, both Azure Table Storage and SQL Azure support OData interfaces to their respective tables. Azure Storage does so natively, while SQL Azure exposes its OData interface via a pre-release tool (SQL Azure OData Service) which, at the time of this writing, is available from SQL Azure Labs. By logging into the tool and enabling OData access with a single checkbox (either for anonymous access or access by specific named users), the OData interface is made available immediately; there is no coding required to enable it. What’s more, SQL Azure provides this RESTful interface while maintaining its conventional Tabular Data Stream (TDS) interface. As such, SQL Azure provides developer simplicity while retaining its native interface, and the performance necessary for heavy LOB workloads. Windows Azure Marketplace DataMarket leverages OData as its native format for publishing the free and subscription-based data feeds that comprise the service. 
This makes the OData format itself especially valuable, and arguably more so than more generic XML data serialization formats, as it is at once an API tool and a channel to commercial or public distribution of data. What the Support Means In practical terms, this broad support for OData on Azure means that most of its data-focused services can be programmed via REST from most any development platform. The commands use intuitive URL patterns and open HTTP verb conventions to provide a full data platform for key-value structured storage (Azure Table Storage), relational data (SQL Azure) and de-normalized, processed data (DataMarket). OData can return results not only in ATOM/XML format, but in JSON format too. This makes it conform extremely well to numerous NoSQL database APIs. Many NoSQL databases tout their support for REST, and the corresponding ease of use and low barrier to entry this provides. Arguably many NoSQL proponents are drawn to these platforms because of their simple RESTful interfaces. Given that Azure provides this same ease of use throughout the platform, we can see once again that Azure addresses specific needs catered to by NoSQL platforms. In fact, Azure provides for this need, and then goes beyond it: given Microsoft’s PowerPivot self-service BI tool, and its 5 Even in advance of such support, considering that map-reduce jobs must themselves be explicitly coded or scripted in many NoSQL databases, the notion of writing an Azure Federation fan-out query through code seems a reasonable task by comparison.
  • 18. 18 ability to consume and analyze OData-formatted feeds using Azure’s RESTful services, Azure provides self-service BI to customers and not just APIs to developers. This presents a very clear business case that various NoSQL databases may be hard-pressed to counter. Running NoSQL Database Products using Azure Worker Roles, VM Roles and Azure Drive If the desire or specific need is present to run a particular NoSQL database product, Worker and Virtual Machine Roles make it possible to accommodate this setup on Azure, provided the NoSQL product has a Windows Server-compatible version (and most do). The VM role allows customers to build their own machine image, upload it as a virtual hard drive (VHD) file to their Azure accounts, and then spin up instances of that image. Any properly licensed software can be installed in that machine image, including various free NoSQL products. Likewise, a Worker Role can accommodate such customization, but any products added to the baseline image must be xcopy-deployable or silently installed during the Worker Role's startup task or its code's RoleEntryPoint.OnStart method. There is one complication though: since Worker and even VM role instances may be recycled at any point, local hard drive storage within the instance may at any time revert back to its baseline image state. So unless the data in the instance is static and can itself be included with a VM Role image or placed on a Worker Role image in a scripted manner at startup, data storage becomes an issue. Luckily, the Windows Azure Drive offering provides a solution. Azure Drive allows a separate VHD file, hosted in Azure Blob Storage, to be mounted as a mapped drive, within the Worker/VM Role instance, through a simple .NET API. 
This means that a Worker/VM Role instance could have a NoSQL database product installed on it, configured to read and write data to a mapped drive, and as long as the drive were mounted before the NoSQL product initialized, all would be well. Scaling this to multiple Role instances gets tricky, since a given VHD can be used as a read/write volume by only one instance at a time, but there are ways to do it. Is this solution optimal? Probably not. But it is workable and still runs within the context of the Azure managed platform from which you can avail yourself of the elasticity and other traits and features of the Azure fabric’s management. For Microsoft customers who already have a substantial investment in SQL Server and/or .NET, this is no mere trivial benefit. And readers who find compelling the argument that NoSQL features and benefits can be had from existing Azure data products like Azure Storage, SQL Azure and their OData interfaces, will likely find the need to run dedicated NoSQL products an edge case. With that in mind, the Azure Worker Role/VM Role/Azure Drive option appears quite feasible. On-Premise Technologies Before we move on, three non-cloud technologies from Microsoft bear special mention, as they provide their own implementations of the non-tabular data, fan-out query and map-reduce job execution technology discussed in this paper.
  • 19. 19 SQL Server 2008/2008R2 “Beyond Relational” Features With the release of SQL Server 2008, a number of features were added to the product under the moniker “beyond relational.” There is an array of features in this category. The two features most often identified there are the so-called spatial features that allow for efficient storage and processing of geo-spatial information, such as latitude/longitude coordinates, polygons, points and lines. But “Beyond Relational” goes beyond geospatial, and includes a set of features that one could classify as NoSQL-like in nature. For example, the Sparse Columns feature effectively allows for loosely-schematized tables. Although all possible columns do in fact need to be declared as part of a table’s definition, the values for columns declared as sparse can be null, without introducing any storage overhead on a per-row basis (there is some overhead at the table-level, however). So while the full schema of sparse columns is stored, the physical content of each row in the table may differ, and drastically so, if necessary. Special filtered indexes and filtered statistics can be used to maintain good performance in tables that use sparse columns. Filestream columns allow Binary Large Object (BLOB) data to be stored in the server’s file system rather than in the database per se. Hierarchies and the HierarchyID column type allow for the representation of hierarchical data and provide explicit support for referencing and testing data in terms of ancestors and descendants. The XML data type is a beyond-relational feature as well and, as we have discussed, it is supported by SQL Azure; spatial features and the HierarchyID column type are supported by SQL Azure as well. However, Sparse Columns and Filestream features are not supported by SQL Azure at present. 
My take on this is that the symmetry between SQL Server and SQL Azure will continue to increase and, as such, the remaining Beyond Relational features will eventually be available in the cloud. When that happens, developers who are attracted to specific facets of NoSQL databases will find SQL Azure even more accommodating of their needs. SQL Server Parallel Data Warehouse Edition SQL Server Parallel Data Warehouse Edition (SQL PDW), which was born of the acquisition of DATAllegro by Microsoft in 2008, is Microsoft’s maiden offering in the Massively Parallel Processing (MPP) database space. The product allows horizontal scaling of SQL Server by providing an interface over a number of instances of the product, each of which participates in a striped distribution of large data warehouse databases. To the database client, the entire array of SQL Server instances appears as a unified whole, and the queries sent to that single entity are appropriately split and dispatched by PDW to the appropriate individual agents, with each constituent query being executed in parallel (hence the term MPP). MPP shares qualities with both the sharding and map-reduce approaches to database management. PDW provides more value than a raw MPP or map-reduce software implementation though. It is sold as an appliance, such that compute, network and storage hardware are purchased together with the software. PDW provides more evidence that if you seek specific capabilities of NoSQL, you may find that the relational products you use today, or products from the same family, deliver those capabilities to you, without the disruption that would come from migration to a new database platform.
  • 20. 20 Microsoft Research Dryad Dryad is a Microsoft Research (MSR) project that implements a map-reduce style execution engine. Dryad jobs consist of a series of programs that are connected by channels. The programs represent vertices, and the channels represent edges. Together, these vertices and edges form a graph, and any such graph 6, as long as it is acyclic, can be executed by Dryad. Like MapReduce or Hadoop, Dryad is an execution engine that manages jobs, processes input files and produces output files. Dryad manages the execution of a graph’s vertices/programs across various nodes in a compute cluster. Nodes may be physical machines, or cores within a machine. MSR explains that Dryad subsumes map-reduce and also provides such infrastructural services as fault tolerance, re-execution, scheduling, and accounting. Dryad is not a database, but it can coordinate the operations of multiple database servers. In fact, Microsoft AdCenter uses Dryad to run multiple instances of SQL Server Integration Services (and SQL Server RDBMS instances) for log processing. Dryad is now available as a technology preview within the Windows HPC Server 2008 R2 high-performance computing line. Furthermore, according to Microsoft Research, Dryad eventually will be integrated with Microsoft SQL Server and Windows Azure. Dryad implements an execution model with great affinity to the map-reduce approach so closely associated with NoSQL databases. It is therefore crucial to the discussion of NoSQL computing in the Microsoft technology universe. An enumeration of all the cloud and on-premise products and technologies discussed in this section is presented in Figure 6. 6 Do not confuse Dryad’s graphs with those of Graph Databases. Though the vocabulary is quite similar, the contexts are rather different.
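The idea of a job as an acyclic graph of programs (vertices) connected by channels (edges) can be sketched in a few lines of Python. This is not Dryad's API: run_job, the vertex names and the toy programs are hypothetical, the vertices run sequentially on one machine rather than across a cluster, and the sketch shows only the dependency ordering that an acyclic graph makes possible. It assumes Python 3.9 or later for the standard graphlib module.

```python
from graphlib import CycleError, TopologicalSorter


def run_job(vertices, edges, source_input):
    """Run each vertex (program) after every vertex that feeds it has finished.

    vertices: name -> function taking a list of upstream outputs
    edges:    (upstream, downstream) pairs standing in for channels
    """
    upstream = {name: [] for name in vertices}
    for src, dst in edges:
        upstream[dst].append(src)

    outputs = {}
    # static_order() raises CycleError if the graph is not acyclic.
    for name in TopologicalSorter(upstream).static_order():
        # Vertices with no upstream channel read the job's input directly.
        inputs = [outputs[u] for u in upstream[name]] or [source_input]
        outputs[name] = vertices[name](inputs)
    return outputs


# A toy four-vertex job: read -> (upper, count) -> report
VERTICES = {
    "read": lambda ins: ins[0].split(),
    "upper": lambda ins: [word.upper() for word in ins[0]],
    "count": lambda ins: len(ins[0]),
    "report": lambda ins: (ins[0], ins[1]),
}
EDGES = [("read", "upper"), ("read", "count"), ("upper", "report"), ("count", "report")]

OUTPUTS = run_job(VERTICES, EDGES, "a b c")
print(OUTPUTS["report"])  # (['A', 'B', 'C'], 3)
```

A real engine would additionally schedule independent vertices (here, upper and count) in parallel and re-execute failed ones, which is the infrastructure the text attributes to Dryad.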
  • 21. 21 Figure 6: These lists summarize the cloud and on-premise technologies from Microsoft which deliver genuine NoSQL technology (e.g. Azure Table Storage) and/or features that NoSQL databases offer and which resonate with NoSQL developers (like OData’s HTTP/REST APIs). We also enumerate the option of running open source NoSQL database products in Azure compute instances, using Worker and VM Roles. NoSQL Upsides, Downsides We’ve already alluded to many of the relative pros and cons of dedicated NoSQL products and various Azure technologies which, at the very least, nip away at the NoSQL feature list and deliver certain of their advantages on an a la carte basis. Allusions are one thing, but it’s probably best that we work to enumerate NoSQL’s upsides and downsides in a formal manner. By doing so, readers will be able to evaluate their NoSQL needs in a no-nonsense fashion and then determine, given the Azure platform capabilities, whether those needs necessitate use of dedicated NoSQL products.
  • 22. 22 Upsides Lightweight, low-friction Probably the most touted attribute of NoSQL database systems is their ease of provisioning, deployment and integration into application code. Download, install, run a browser-based UI, create a new database, and away you go. Since the products are open source, the licensing worries are reduced. Since there are no schemas to declare with many NoSQL products, the database is ready as soon as you create it. And since many NoSQL APIs are HTTP- and REST-based, and, for a number of NoSQL databases, a multitude of client libraries for various programming environments are available, you can start coding quickly too. Minimalist tool requirements A number of NoSQL databases have browser-based UIs. After the product is installed, simply point your browser at the server’s host name (or localhost, if you’re browsing on the server), a specific port and a given virtual directory, and you may get a fully-functional UI in the browser for managing your databases, and querying them too. Sharding & Replication Most NoSQL databases support the notion of sharding, which we have already discussed in the section on SQL Azure Federation, above. Unlike SQL Azure though, the sharding facilities in most NoSQL databases do support fan-out queries transparently. It seems reasonable that fan-out query capabilities will come to SQL Azure in the future, but they’re not there now. Many NoSQL databases also have simple replication facilities built in. In the relational world, replication can be useful in branch office scenarios, but for the Web-centric focus of most NoSQL databases, it is likely that geographic content distribution is more important. 
In other words, NoSQL database instances can be created in various geographic regions, and then be configured for continuous replication such that users can work against a database to which minimal network hops are required, with replication assuring that each regional server gets data changes from the others. Replication is also a disaster recovery tool, as the failure of a single replica can be addressed by the swapping in of another. This is very important in both sharded and single-server implementations: in the latter, the unitary server becomes a single point of failure; in the former, every single shard becomes a point of failure as well. For this reason, sharding and replication are often used together. Web Developer-Friendliness Many Document Store databases use JavaScript Object Notation (JSON) as the internal storage format and JavaScript as an internal scripting language. Therefore, writing an AJAX application against a database in one of these products becomes much easier, as the objects in the application’s JavaScript code can be directly written to, or read from, the database. This makes client-side (browser script-based) data access code quite feasible and simple.
  • 23. 23 Add to this the REST APIs used by most Document Store products, and the jQuery REST libraries available to Web developers, and it becomes clear that the suitability of NoSQL products to JavaScript/jQuery-based applications is high, with a reasonably low learning curve for many Web developers. For certain NoSQL products, especially Document Stores, it seems almost a core design principle that the databases function as an extension of JavaScript’s implementation of object orientation. While it would probably be a stretch to call these NoSQL products object databases 7, that is a useful way to consider the intent with which they are built, with respect to JavaScript developers and their code. Cross-Platform, Cross-Device Operation Most NoSQL database products run on multiple OSes and thus on multiple devices. Specifically, most of them run on Windows clients and servers, as well as on Linux. Running on Linux allows certain of these products to run on Apple Mac OS, iOS and the Android operating system on phones and tablets 8. For cloud computing though, the cloud servers are the host, and the only device compatibility that becomes important is on the client side. And given the number of OData interfaces supported by Azure, client compatibility with Microsoft’s cloud platform is quite high indeed. Downsides Having enumerated several facets of NoSQL databases that work out elegantly and advantageously, it’s important to point out some of NoSQL products’ liabilities as well, especially with regard to productivity and suitability to line-of-business application development. Optimizations Have a Price Usually in computing, an advantageous optimization for certain activities and patterns leads to less functionality or flexibility in others. And with certain NoSQL databases, that is definitely the case. 
Consider CouchDB, and its ability to read and write data very quickly, which in turn helps it facilitate the Web scale capability which draws so many of its users to it. On the write side, CouchDB can process things so quickly because the operation of writing to disk is in fact deferred. Writes are buffered, which makes for better responsiveness, but leads to inconsistency in the physical database in the short term and risk of data loss in the event of a crash or other outage before the cache is committed to disk 9. On the read side, CouchDB cannot be queried in an ad hoc fashion at all. Instead, the database designer must author a “view” containing JavaScript code that traverses CouchDB databases and returns a specific result set. This requirement, of course, makes CouchDB less than suitable for ad hoc query activities, or even for applications where the standard querying needs are in flux. The good news is that for applications where the querying needs are well-known and limited, CouchDB can work well, and the 7 Recall that we had already drawn a parallel between Graph Databases and Object Databases. Here we do so for Document Stores. As before, the distinctions between NoSQL categories are not cut and dried. 8 At time of writing, CouchDB for Android is available as a developer alpha release. 9 The lost data is recoverable from database log files. But the restore operation can prove inefficient.
  • 24. 24 overhead of a query optimizer need not impose itself. But for applications where requirements may shift over time, capabilities are much more limited than with relational databases. This has some irony to it, given the importance of schema flexibility (and thus accommodation of changing requirements) in NoSQL databases overall. Requirement to Query using a Procedural Language A corollary to the above point on development of static views for querying is the procedural method by which the code itself must traverse the database in order to produce its results. Instead of using the set-based paradigm in SQL, NoSQL databases often must be traversed on a row-by-row (or document-by-document, or entity-by-entity) basis. Each row/document/entity must be evaluated individually, and declarative SQL operations like joins, which filter data more implicitly, are not available. What this does is force a client-like data access model to be employed at the server which could, in turn, impair scalability more than facilitate it. 10 Of course, that statement really commingles two separate senses of the word “scalability.” For many Web applications, scalability involves the elimination of latency in rather simple operations, such as pulling up an individual note, writing out a status message, bringing up account settings for a specific customer, and so forth. Another kind of scaling involves things such as efficient keyword searches over gigantic bodies of data, limiting the value of specific fields to a certain range or aggregating numeric field values over a large subset of data; this sense of scale is very important as well and procedural traversals do not often enhance it. So perhaps it is unfair to say that, generally, procedural, row-wise data evaluation impairs scalability, since notions of scale differ between classes of applications. 
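The contrast between procedural, document-by-document traversal and set-based declarative querying can be illustrated with a minimal sketch. It is not drawn from any particular NoSQL product: Python's built-in sqlite3 module stands in for a relational engine, a plain list of dictionaries stands in for a collection of stored documents, and the sample data and the function names totals_procedural and totals_declarative are hypothetical.

```python
import sqlite3

# Hypothetical sample data: each dict stands in for one stored document/entity.
orders = [
    {"customer": "alice", "amount": 120.0},
    {"customer": "bob", "amount": 35.5},
    {"customer": "alice", "amount": 80.0},
    {"customer": "carol", "amount": 15.0},
]


def totals_procedural(documents, minimum):
    """Visit every document individually, filter it, and aggregate by hand."""
    totals = {}
    for doc in documents:
        if doc["amount"] >= minimum:
            customer = doc["customer"]
            totals[customer] = totals.get(customer, 0.0) + doc["amount"]
    return totals


def totals_declarative(documents, minimum):
    """State the desired result in SQL and let the engine decide how to get it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (:customer, :amount)", documents
    )
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "WHERE amount >= ? GROUP BY customer",
        (minimum,),
    ).fetchall()
    conn.close()
    return dict(rows)
```

Both functions return the same per-customer totals for the same input. The difference is where the filtering and aggregation logic lives: in hand-written traversal code in the first case, and in a query that an optimizer is free to plan in the second.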
But the same caution must hold true in the converse as well, making it inaccurate to say, in a sweeping fashion, that NoSQL databases are more “scalable” or “Web scale” than relational databases. The reality is that different applications have different needs, different burdens and different points of stress (or failure). Scalability really is measured by the degree to which these needs are met, burdens lifted and stresses reduced as the volume of data and/or user activity grows, whether linearly or exponentially. The best database for the job is just that: the best database for the job at hand. For some applications, relational databases are not the optimal vehicle for storage and retrieval. For many others, NoSQL databases would be quite inappropriate. So the most important question in evaluating options amongst NoSQL databases, as well as evaluating the option of using them at all, hinges on the type of application being written, the type of queries that must be expected and handled with relative ease, and the regularity vs. variability of the data’s structure. That a certain type of database appears clumsy in certain situations does not by itself render that type of database inappropriate if that situation is merely an edge case. Necessity to Scale Manually For various Web applications that are public facing, and whose data may be document-, user- or message-oriented, NoSQL databases can work quite well. Their ability to stripe, replicate, cluster and 10 SQL Server and SQL Azure provide this same data access option through cursors, but SQL Server developers use cursors very sparingly to avoid the downside.
  • 25. 25 provide geographically distributed points of presence may form the perfect approach for the problem space of these applications. The ad hoc, semi-federated nature of NoSQL clusters and replicas makes for low-friction provisioning and helps assure that growth spurts in services usage and membership are non-disruptive. That said, there is still work involved, both in terms of resource monitoring and provisioning, that must be done in order to meet these very demands. Meanwhile, a Platform as a Service cloud like Windows Azure, with a data platform like SQL Azure to match, facilitates a more automated approach to both the monitoring and provisioning which must be performed to make certain a site or application grows non-disruptively. New Windows Azure Web and Worker roles can be spun up through clicks in the Azure portal’s management interface, and they can be deactivated just as easily. As a result, elasticity is achieved more laboriously with hosted NoSQL database applications. Replicas for SQL Azure databases are created implicitly and the “cutover” from one replica to another is implicit as well. The ramifications of this for NoSQL include extra effort and greater opportunity for error, which may have a very real and measurable economic impact in labor costs and/or opportunity costs, as well as greater risk exposure, to the companies building sites or providing services that use NoSQL databases 11. Primitive Tooling NoSQL databases are, in many cases, easier to get up and running than are relational databases. There’s less up-front formality involved in terms of planning and design and, as a result, there’s a shorter distance between concept and implementation. That’s exactly the kind of agility that growing companies and their sites may need. 
There’s also far less complexity in tooling around these databases: simple, self-explanatory browser-based management interfaces, straightforward REST programming interfaces and conceptually simple key-value paradigms abound. But tooling has its value, and that value tends to increase over time, when the imperative of raw implementation has passed and need for smooth maintenance and troubleshooting becomes more pronounced (and economically impactful). The design, diagnostic and operational monitoring capabilities of SQL Server’s tools are significant, and have evolved over the roughly 20-year existence of the product. These tools, including SQL Server Management Studio and its execution plan window, aid greatly in preventing problems, and in solving them quickly when they do arise. NoSQL databases’ more minimalist tooling approach leads to more manual and time-consuming management and troubleshooting than is the case with SQL Azure (which is compatible with SQL Server’s tools), and may also make the process more error prone. The cost impact of this can be significant. Lack of ACID Transactional Capabilities in Some Products Many NoSQL databases do not provide ACID guarantees or support for large-scoped transactions. As discussed previously, some products provide “eventual consistency” while others treat each database operation as its own isolated transaction. This may be appropriate if the application need only provide that level of reliability. For example, if social media status messages occasionally fail to post, users may 11 Some Web enterprises have large, dedicated technology staffs in place, who can handle this burden well. But many corporate business units, and even IT departments, are not in that position.
  • 26. 26 find it perfectly acceptable to discover the failure (by noticing the message never appears in a feed or stream) and re-post the message. Furthermore, the occurrence of transactions that span more than a single database operation may not be significant in certain apps. Note-taking applications must update notes one at a time; blog posting is a simple operation; social networks may need to register a new follower for a given user, and that’s a discrete operation. Unlike a financial system which may need to execute a debit and credit as an atomic operation, many Web applications interact with data in a more granular, minimalist way. But for most corporate business applications, ACID guarantees are imperative. Debits and credits must execute in an all-or-nothing fashion; ecommerce orders cannot be lost as customers will not be content to recreate them from scratch. So, once again, the context of an application/service/site in large part determines what defines standards of reliability and what determines whether certain advanced features of a database are overkill or absolute necessities. Conclusion: Relational’s Continued Indispensability in Line-of-Business In this paper, we’ve investigated NoSQL’s general tenets. We have discussed each of its four major subcategories: Key-Value Stores, Document Stores, Wide Column Stores and Graph Databases. We’ve also reviewed the distributed nature of NoSQL databases, including the partitioning and replication schemes many of them use. We have looked at NoSQL’s concurrency models, its programming models and have explored the concepts around loosely schematized data. We reviewed MapReduce and BigTable, and saw that they established a legacy that has influenced most, if not all, NoSQL products. We also looked at Microsoft’s Azure cloud stack, including Windows Azure Table Storage, which is itself a bona fide NoSQL database; various facets of SQL Azure; and support for OData in both Windows Azure and SQL Azure. 
In doing so, we have seen how the Azure platform supports a full-on NoSQL approach as well as the ability to implement various NoSQL features on an “a la carte” basis. Furthermore, we looked at how Windows Azure Worker Roles and VM Roles support the installation and use of non-Microsoft NoSQL databases, when and if nothing else will do. We digressed, slightly, to review the NoSQL qualities of SQL Server’s “Beyond Relational” features and SQL Server Parallel Data Warehouse Edition; we briefly discussed Dryad, Microsoft Research’s project providing map-reduce capabilities, and more. We saw how NoSQL databases are suitable for data management that is light-duty but large-scale, and how they work well for content management requirements of many stripes. We also saw, again and again, that relational databases are best for line-of-business applications. The database consistency, query optimization and set-based declarative query capability that relational databases have provided for decades are still required by most LOB applications; this has not changed. In business, data in a specific domain tends to be very regular and consistent in structure. For example, most equities trades have the same fields, as do the counterparties involved in the trades. Most sales invoices and line items in those invoices have consistent structure as well. When such regularity exists – which is in fact quite often – relational databases work perfectly. Granted, they may need to be
  • 27. 27 appropriately scaled and tuned, but the overarching point is that the relational scheme is best in these scenarios. To understand the line-of-business versus structured data distinction, it may be helpful to consider a hypothetical large, online bookseller. This bookseller likely keeps its catalog data in a NoSQL database. It may do likewise with its Web content, reviews and perhaps even its reading lists. But in all likelihood, its customer billing system, its inventory and supply chain systems, its publisher online inquiry systems and its shipping application all use relational databases. We don’t know this for a fact about any one bookseller, but the assumptions are nonetheless based on good rules of thumb for when and where each type of database is best utilized. The regular, consistent data scenario is the most common one in most corporate settings. Granted, for any number of outward-, consumer-facing Web applications, which are essentially content- and relationship-driven, NoSQL structured stores have a welcoming home. So you must ask yourself: do I have irregularly schematized data, such that I need to use a NoSQL, structured storage approach to storing and retrieving it? Try not to be led to a conclusion by fear (or even guilt) over the issue of inflexibility. Just because schema-less databases let you store irregular data doesn’t mean you’ll need that, and just because relational databases require you to go through steps that can be disruptive in order to modify a table’s schema, doesn’t mean you’re somehow foolhardy for going that route. Consider a household analogy: if, as you build a house, you run wiring in conduit, external to your walls, and surface-mount your fixtures, you’ll always be able to upgrade your wiring, or repair a wiring segment gone bad. 
But if you know that the electrical, and maybe cable TV and computer network wiring to be installed will suit your purposes for the long term, then it makes perfect sense to run your wiring in-wall. You can always open the walls again if need be, and if you’re reasonably certain that you won’t need to, then running the wiring internally is the right decision. It will look better to most people, make it easier to push furniture against the wall and will, arguably, be somewhat safer. In general, your home will have a more finished look to it. If one day your needs change and you need to open the walls again, that will not necessarily mean you made a bad decision. People should not let a relatively insignificant chance of disruption thwart them from enjoying the advantages of something that is otherwise advantageous. By the same token, customers should not let the notion that their database schema may someday change force them into a decision of going with a non-relational, loosely-schematized database. As we have said, some applications by their nature manage data that is variable in structure, and NoSQL databases may work very well for those applications. But if your app uses highly structured data – and most line-of-business apps do – then why forego the compatibility, data consistency, query optimization, maturity, broad support and professional talent pool that a major relational database offers? You should give that up only if the benefits of doing so outweigh the costs, and each such benefit should be evaluated on a sober survey of likelihoods and risks.
  • 28. 28 But what if the “wires” in your “house” are changing a lot? What if you’ve got an app that manages a lot of data that is ever-changing in structure and much of it functions as content on your Web site? Do you need Cassandra or MongoDB or Neo4j on a hosted Linux server? Probably not. Azure tools like Azure Table Storage, SQL Azure XML columns and OData may be viable options for your structured storage or key-value retrieval needs. And if not, then running xcopy-deployable or silently-installable NoSQL databases in Azure Worker Roles and Azure Drive, or running full blown NoSQL installations using Azure VM Roles, may well work for you. Hopefully this paper has made the choices clearer and your evaluation a more straightforward and less “loaded” prospect. The Azure cloud provides for a spectrum of choice, rather than a single, compulsory methodology. This provides flexibility and protection in a cost-effective, elastic computing environment. And that’s really what “Web scale” should be all about.