The NoSQL Movement

Raluca Gheorghita
Faculty of Computer Science, Alexandru Ioan Cuza University, Iasi

Abstract. As the amount and pace of data generation keeps growing, businesses are stepping away from traditional RDBMS solutions and turning to highly scalable data stores. Numerous papers are being published on this topic, for instance by well-known companies such as Google, Facebook and Amazon, and open-source projects are coming into existence. Next-generation databases mostly address some of the following points: being non-relational, distributed, open source and horizontally scalable. The movement, named with the misleading term NoSQL (which the community now reads as "not only SQL"), began in early 2009 and is growing rapidly. Often further characteristics apply, such as: schema-free design, replication support, easy APIs, eventual consistency, and more.

1 Introduction

There is an interesting transition taking place in the world of Web-scale data stores, as an entirely new type of scalable data store is gaining popularity very quickly. The traditional LAMP stack (Linux, Apache HTTP Server, MySQL, and PHP, Python or Perl) is starting to look like a thing of the past. For a few years now, Memcached (a free and open source, high-performance, distributed memory object caching system, generic in nature but intended to speed up dynamic web applications by alleviating database load) has often appeared right next to MySQL, and now the whole data tier is being shaken up. While some might see it as a move away from MySQL and PostgreSQL, the traditional open source relational data stores, it is actually a higher-level change. Much of this change is the result of a few revelations [1]:

– a relational database isn't always the right model or system for every piece of data
– relational databases are tricky to scale; normalization often hurts performance
– in many applications, primary key lookups are all you need

The new data stores vary quite a bit in their specific features, but in general they derive from a similar set of high-level characteristics. Not all of them meet all of these, of course, but just looking at the list gives you a sense of what they are trying to accomplish (a minimal sketch of the first two points follows the list):

– de-normalized, often schema-free, document storage
– key/value based, supporting lookups by key
– horizontal scaling (the ability of a software or hardware system or a network to grow without breaking down or requiring an expensive redesign)
– built-in replication
– HTTP/REST or otherwise easy-to-program APIs
– support for Map/Reduce style programming (a programming model and an associated implementation for processing and generating large data sets)
– eventual consistency [2] (when no updates occur for a long period of time, eventually all updates propagate through the system and all replicas become consistent)
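To make the first two characteristics concrete, the sketch below models a store as a plain dictionary addressed by a developer-chosen string key, with the value being a de-normalized, schema-free document. It is purely illustrative and implies no particular NoSQL product's API; the post structure and key format are invented for the example.

  # Illustrative only: a de-normalized, schema-free "document" addressed by key.
  store = {}  # stand-in for a distributed key/value store

  def put(key, document):
      store[key] = document

  def get(key):
      return store.get(key)

  # The whole entity lives under one key: no joins, no fixed schema.
  put("post:42", {
      "title": "Why NoSQL?",
      "author": {"name": "Ana", "id": 7},
      "tags": ["databases", "scalability"],
      "comments": [
          {"user": "Dan", "text": "Nice overview"},
      ],
  })

  post = get("post:42")   # a single primary-key lookup returns everything
  print(post["comments"][0]["text"])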
The movement to these distributed, schema-free data stores has begun to use the name NoSQL.

2 What is NoSQL?

First of all, the current disadvantages of relational databases need to be addressed. A relational database such as Microsoft SQL Server can be most easily described as a table-based data system with minimal data duplication, in which sets of data can be accessed through a series of relational operators like joins and unions. The problem with such relations is that complex operations over large data sets quickly become big consumers of resources, although the benefits are generally collected at the application level.

Adam Wiggins (Heroku) points out ways of getting around these limitations in his article SQL Databases Don't Scale [3]. He presents the well-known tactics for scaling relational databases in huge applications (vertical scaling, sharding¹ or partitioning, and read slaves), also enumerating their drawbacks. So why are relational databases just now becoming a problem? Eric Florenzano puts it best: "As the web has grown more social, however, more and more it's the people themselves who have become the publishers. And with that fundamental shift away from read-heavy architectures to read/write and write-heavy architectures, a lot of the way that we think about storing and retrieving data needed to change." [4]

The solution to this problem seems to be NoSQL: non-relational data stores that "provide for web-scale data storage and retrieval especially in web based applications because it views the data more closely to how web apps view data - a key/value hash in the sky." NoSQL is used to describe the currently growing class of web applications that need to scale effectively. Applications can scale horizontally on clusters of commodity hardware without being subject to complicated sharding techniques.

Many Web and Java developers built their own data storage solutions, following the example of those built by Google Inc. and Amazon.com Inc., so that they could manage without Oracle in the first place; they released them as open source afterwards. Now that their open source data stores manage hundreds of terabytes or even petabytes of data for thriving Web 2.0 and cloud computing vendors, switching back is neither technically, economically nor ideologically feasible.

¹ Database sharding can be simply defined as a "shared-nothing" partitioning scheme for large databases across a number of servers, enabling levels of database performance and scalability that are not otherwise achievable.
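As a rough illustration of the sharding tactic described in footnote 1, the sketch below routes each record to one of several database servers by hashing its key. The shard count, host names and routing scheme are hypothetical; real sharding layers also have to handle rebalancing, cross-shard queries and hot keys, which is exactly the operational burden the NoSQL stores try to take off the application.

  import hashlib

  # Hypothetical shard layout: four independent ("shared-nothing") servers.
  SHARDS = ["db0.example.com", "db1.example.com", "db2.example.com", "db3.example.com"]

  def shard_for(key: str) -> str:
      """Pick a shard deterministically from the record key."""
      digest = hashlib.md5(key.encode("utf-8")).hexdigest()
      return SHARDS[int(digest, 16) % len(SHARDS)]

  # Every read or write for user 1723 must be sent to the same server.
  print(shard_for("user:1723"))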
Johan Oskarsson, a Web developer at Last.fm and the organizer of the NoSQL meeting that took place in San Francisco in June 2009, stresses that "Web 2.0 companies can take chances and they need scalability". He points out that having these two combined is what makes NoSQL so compelling. He also says that many developers had even stopped using the open source MySQL database, a long-time Web 2.0 favorite, in favor of a NoSQL alternative, because the advantages were too compelling to ignore. What are these advantages that are so compelling they cannot be ignored?

3 What are the benefits NoSQL provides?

First of all, they aren't simply databases. Amazon.com's CTO, Werner Vogels, refers to the company's Dynamo system as a "highly available key-value store" [5]. Google calls its BigTable, the other role model for many NoSQL enthusiasts, a "distributed storage system for managing structured data" [6].

Second of all, they can easily handle considerable amounts of data. Hypertable, an open source column-based database modeled upon BigTable, is used by the local search engine Zvents Inc. to write 1 billion cells of data per day, according to a presentation by Doug Judd [7], a Zvents engineer. Meanwhile BigTable, in conjunction with its sister technology, Map/Reduce, processes as much as 20 petabytes of data per day [8]. Map/Reduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
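The sketch below is a minimal, single-process rendering of this programming model, using the canonical word-count example; it is not Google's implementation. The parallelization, partitioning and failure handling described above are precisely the parts the sketch leaves to an imagined runtime, so only the user-supplied map and reduce functions carry application logic.

  from collections import defaultdict

  def map_fn(doc_id, text):
      """Map: emit an intermediate (word, 1) pair for every word in the document."""
      for word in text.split():
          yield word, 1

  def reduce_fn(word, counts):
      """Reduce: merge all intermediate values that share the same key."""
      return word, sum(counts)

  def run_mapreduce(documents):
      intermediate = defaultdict(list)
      for doc_id, text in documents.items():           # map phase
          for key, value in map_fn(doc_id, text):
              intermediate[key].append(value)           # shuffle: group by key
      return dict(reduce_fn(k, v) for k, v in intermediate.items())  # reduce phase

  docs = {"d1": "to be or not to be", "d2": "to scale or not to scale"}
  print(run_mapreduce(docs))   # {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'scale': 2}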
"Definitely, the volume of data is getting so huge that people are looking at other technologies," said Jon Travis of SpringSource, whose VPork (Voldemort Performance Testing Framework) technology helps NoSQL users benchmark the performance of their database alternative. Travis, who is Principal Engineer at Hyperic, which was acquired by SpringSource, put together a basic performance testing framework to prove out Voldemort for use in his company.

Another benefit worth mentioning is that these NoSQL databases run on clusters of cheap PC servers. PC clusters can be easily and cheaply expanded without the complexity and cost of sharding, which involves cutting up databases into multiple tables to run on large clusters or grids. Google has said that one of BigTable's bigger clusters manages as much as 6 petabytes of data [9] across thousands of servers. "Oracle would tell you that with the right degree of hardware and the right configuration of Oracle RAC (Real Application Clusters) and other associated magic software, you can achieve the same scalability. But at what cost?" asks Javier Soltero, CTO of SpringSource.

The NoSQL systems also solve performance issues. NoSQL architectures perform much faster by avoiding the time-consuming task of translating data from Web or Java applications into a SQL-friendly format. "SQL is an awkward fit for procedural code, and almost all code is procedural," said Curt Monash, a leading analyst of and strategic advisor to the software industry. For data upon which users expect to do heavy, repeated manipulations, the cost of mapping data into SQL is "well worth paying. But when your database structure is very, very simple, SQL may not seem that beneficial."

Raffaele Sena, Senior Computer Scientist in Adobe's Business Productivity Unit, asked about Adobe ConnectNow (a Web collaboration service), its Terracotta integration and how it addressed their web site scalability requirements, said that Adobe decided against using a relational database for just the reason raised by Monash. Adobe uses Java clustering software from Terracotta Inc. to manage data in Java formats, which Sena says is key to boosting ConnectNow's performance two to three times over the prior version. "The system would have been more complex and harder to develop using a relational database," he said. Another project, MongoDB, calls itself a "document-oriented" database because of its native storage of object-style data.

But it is important to note that NoSQL alternatives lack vendors offering formal support because they are open source. This fact isn't seen as a problem by most supporters of the movement, as they are closely in touch with the community. But some admitted that working without a formal "throat to choke" [10] when things go wrong was scary, at least for their managers. "We did have to do some selling," admitted Adobe's Sena. "But basically after they saw our first prototype was working, we were able to convince the higher-ups that this was the right way to go."

Despite their huge promise, most enterprises needn't worry that they are missing out just yet, said Monash. "Most large enterprises have an established way of doing OLTP [online transaction processing], probably via relational database management systems. Why change?" he said. Map/Reduce and similar BI-oriented projects "may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS [database management system]." Even the NoSQL meeting's organizer, Oskarsson, admits that his company, Last.fm, has yet to move to a NoSQL alternative in production, instead relying on open-source databases. He agrees that a revolution, for now, remains on hold. "It's true that [NoSQL databases] aren't relevant right now to mainstream enterprises," Oskarsson said, "but that might change one to two years down the line."

One thing that has to be underlined, though, is that NoSQL is not, and never was, intended to be a replacement for the more mainstream SQL databases. There is no war between relational and non-relational databases. There is nothing stopping people from splitting up the data in their web application and using both types of data stores where each makes sense. As Brad Anderson of Cloudant says, NoSQL is about the "right tools for the job", as opposed to being anti-relational or replacing traditional solutions.
4 NoSQL Databases

The need to look at non-SQL systems arises out of scalability issues with relational databases. These issues are a function of the fact that relational databases were not designed to be distributed (which is key to write scalability), and could therefore afford to provide abstractions like ACID transactions and a rich high-level query model. All NoSQL databases try to address the scalability issue in various ways: by being distributed, by providing a simpler data and query model, by relaxing consistency requirements, and so on.

4.1 Project Voldemort

Voldemort is a distributed key-value storage system in which automatically partitioned data is replicated over multiple servers. It is used at LinkedIn for certain high-scalability storage problems where simple functional partitioning is not sufficient [11].

Many of LinkedIn's products, such as the People You May Know and Viewers of This Profile Also Viewed modules, and much of the job matching functionality that LinkedIn offers to people who post jobs on the site, rely heavily on computationally intensive data mining algorithms. The difficulty in these systems comes from the fact that large amounts of data need to be moved around every day. Although hundreds of gigabytes or terabytes of data are not too difficult to handle when sitting still in a storage system, the problem becomes much harder when that data must be transformed to support quick lookups and moved between systems on a daily basis. To solve this problem, LinkedIn spent some time thinking about how to build support for large daily data cycles.

Voldemort was designed to support fast, scalable read/write loads, and is already used in a number of systems at LinkedIn. It was not designed specifically with batch computation in mind, but it has a pluggable architecture which allows multiple storage engines to be supported within the same framework. This allows a fast, failure-resistant online storage system to be integrated with the heavy offline data crunching running on Hadoop (a rough sketch of such a pluggable design follows).
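The sketch below is not Voldemort's actual API; it only illustrates the general idea of a pluggable storage layer, assuming a minimal StorageEngine interface with two invented implementations: a read/write in-memory engine for online traffic and a read-only engine whose contents could have been prepared by an offline batch job.

  from abc import ABC, abstractmethod

  class StorageEngine(ABC):
      """Minimal engine contract: the store above it only needs get/put."""
      @abstractmethod
      def get(self, key): ...
      @abstractmethod
      def put(self, key, value): ...

  class InMemoryEngine(StorageEngine):
      """Read/write engine for online traffic."""
      def __init__(self):
          self._data = {}
      def get(self, key):
          return self._data.get(key)
      def put(self, key, value):
          self._data[key] = value

  class ReadOnlyEngine(StorageEngine):
      """Engine whose data was built offline (e.g. by a batch job)."""
      def __init__(self, prebuilt):
          self._data = dict(prebuilt)
      def get(self, key):
          return self._data.get(key)
      def put(self, key, value):
          raise NotImplementedError("read-only store; rebuild it offline instead")

  # The same client code works against either engine.
  def lookup(engine: StorageEngine, key):
      return engine.get(key)

  online = InMemoryEngine()
  online.put("member:42", {"name": "Ana"})
  batch = ReadOnlyEngine({"member:42:people-you-may-know": ["member:7", "member:99"]})
  print(lookup(online, "member:42"), lookup(batch, "member:42:people-you-may-know"))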
4.2 CouchDB

CouchDB is one of the most popular and mature document-oriented databases, written in Erlang. Its primary focus is robustness, high concurrency, and fault tolerance. One key distinction compared to other systems is its bi-directional incremental replication. CouchDB now has over 100 production users, three books are being written about it, and the community is vibrant.

CouchDB's documents are JSON based and can have binary attachments. Each document has a revision which is deterministically generated from the document content. CouchDB is very robust since it never overwrites previously written data: there is therefore no repair step after a server crash, and backups can be taken simply by copying the database files (with cp). Concurrency is another one of the benefits of CouchDB's design. It uses the Erlang approach of lightweight processes, which means one process per TCP connection, and the architecture is lock free. The API is REST based, using the standard verbs GET, PUT, POST, and DELETE (a small sketch of this API is given below). Map/Reduce views are used for generating persistent representations of document data; these are generally written in JavaScript. A really interesting feature of the views is that they are generated incrementally: the views are stored in a B-tree and kept up-to-date as new data is added. The bi-directional replication is peer based (two nodes), and one can replicate only the subset of documents meeting certain criteria. The replication happens over HTTP, which makes replication across datacenters easy and secure. In a multi-master replication setup, CouchDB can deterministically choose which revision is the winner (with the losing revision saved as well).
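As a minimal sketch of the REST API mentioned above, the snippet below uses Python's standard urllib to create a database, PUT a JSON document under a chosen id, and GET it back from a local CouchDB instance. The database name, document id and fields are invented for the example, and a production client would also handle the revision (_rev) returned by the server and error responses.

  import json
  import urllib.request

  COUCH = "http://localhost:5984"   # assumed local CouchDB instance

  def couch_request(method, path, body=None):
      data = json.dumps(body).encode("utf-8") if body is not None else None
      req = urllib.request.Request(COUCH + path, data=data, method=method,
                                   headers={"Content-Type": "application/json"})
      with urllib.request.urlopen(req) as resp:
          return json.loads(resp.read().decode("utf-8"))

  # Create a database, then PUT a JSON document under a chosen id.
  couch_request("PUT", "/articles")
  couch_request("PUT", "/articles/nosql-movement",
                {"title": "The NoSQL movement", "tags": ["nosql", "scalability"]})

  # GET it back; the response carries the server-assigned revision (_rev).
  doc = couch_request("GET", "/articles/nosql-movement")
  print(doc["title"], doc["_rev"])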
One of the first adopters of CouchDB at scale was the BBC, which needed schema flexibility and robustness. The BBC uses CouchDB as a simple key/value store for its existing application infrastructure; it has proven robust in production for several years and continues to scale to their demands of data and concurrency. Scoopler [12] is a real-time aggregation service with a large and rapidly growing data volume; schema flexibility was crucial when they selected CouchDB. An unnamed real-time analytics service migrated from a 40+ table PostgreSQL setup to a single CouchDB document type with only two views. Ubuntu 9.10 includes the Ubuntu One system, which stores users' address books in CouchDB; replication is the killer feature in this scenario.

4.3 Cassandra

Digg is probably the only large site which has Cassandra in production (Facebook runs a forked version). Digg has been researching ways to scale its database infrastructure for some time now. Up until now Digg has used a normal LAMP stack. Step one was to adopt a traditional vertically partitioned master-slave configuration with MySQL, and they also investigated sharding MySQL with IDDB (a way to partition both indexes - integer sequences and unique character indexes - and actual tables across multiple storage servers).

They then went out and looked for alternatives. After considering HBase, Hypertable, Cassandra, Tokyo Cabinet, Voldemort and Dynomite, they settled on Cassandra, because it offers column-oriented data storage in a highly available, peer-to-peer cluster. Even if it is currently lacking some core features, it was the optimal solution for Digg. They wanted something open source, scalable, efficient, and easy to administer. They picked Cassandra because of the promise of easier administration, no single point of failure, more flexibility than a simple key/value store, and very fast writes; the community was growing, and it was Java based (three out of four people on the team were comfortable with Java) [13].

Digg implemented the green flag feature in Cassandra as a proof of concept. These flags appear on the Digg icon for a story when one of your friends has Dugg it. They did a dark launch with MySQL running alongside: first they just wrote data to Cassandra, then they enabled reading from Cassandra. Based on the results of the proof of concept, Digg is going to port the entire application to Cassandra, while continuing to use MySQL in some places, according to the saying "use the right tool for the job". Ian Eure, Senior Core Infrastructure Software Engineer at Digg, declares their interest in NoSQL in general and Cassandra specifically. He states that they believe in this technology, and that they are contributing to its ongoing development, both by submitting patches and by funding development of features necessary to support wide-scale deployment.

4.4 MongoDB

MongoDB is an open source, non-relational database that combines three key qualities: it is scalable, schema-less, and queryable. It has native drivers for pretty much every major language, and a small but growing community. Mongo's design trades off a few traditional database features (notably joins and transactions) in order to achieve much better performance. It is perhaps most comparable to CouchDB for its JSON document-oriented approach, but it has much better querying capabilities: you can run dynamic queries without pre-generating expensive views. So Mongo occupies a sweet spot for powering web apps.

BusinessInsider.com, a business news site launched in February 2009, runs on a LAMP platform: Linux, Apache, Mongo, PHP - the M comes from Mongo, not from MySQL, as it usually does. They use MongoDB for several reasons. First of all, it is scalable. Next, it is document-oriented, not relational. RDBMSs were invented in the 1970s, long before object-oriented programming and dynamic scripting languages became popular. By now, we are all accustomed to the process of translating our code's data structures back and forth to the tables in our database, but it doesn't have to be that way. Rather than rows in a table, Mongo stores documents in collections. Documents are slightly enhanced JSON objects, so you can stash much more complex structured data in a single document than you can store in a table row, and data modeling becomes a much more natural process. The data modeling approach is different: instead of using multiple tables joined together with foreign keys, objects can be embedded within a single document.

For example, each post on their site is a document (in a MySQL-based system, a post would be a row in a table). But comments are handled differently: comments are embedded directly within the post document as an array of objects. All of the comment data, including the text of each comment, information on who posted it, and the thumbs up/thumbs down voting, is stored directly within the post document. When the code pulls up a post, the database doesn't have to query a separate comments table; the comments are right there as part of the post object, ready to be displayed. This is faster, and makes intuitive sense [13] (a minimal sketch of this embedded-document approach follows).
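The sketch below illustrates the embedded-comments pattern described above using the Python driver pymongo. The database name, collection name and field names are invented for the example, and a MongoDB server running on the default local port is assumed.

  from pymongo import MongoClient

  client = MongoClient("mongodb://localhost:27017")   # assumes a local mongod
  posts = client["newsroom"]["posts"]                 # hypothetical db/collection

  # The post and its comments are one document; no separate comments table.
  posts.insert_one({
      "_id": "why-nosql",
      "title": "Why NoSQL?",
      "votes": {"up": 12, "down": 1},
      "comments": [
          {"user": "ana", "text": "Nice overview", "votes": {"up": 3, "down": 0}},
          {"user": "dan", "text": "What about joins?", "votes": {"up": 1, "down": 0}},
      ],
  })

  # A dynamic query, no pre-generated view needed: one lookup returns the
  # post together with all of its comments, ready to display.
  post = posts.find_one({"_id": "why-nosql"})
  print(post["title"], len(post["comments"]))

  # Schema-less: adding a new field later needs no migration.
  posts.update_one({"_id": "why-nosql"}, {"$set": {"thumbnail": "why-nosql.png"}})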
Another benefit of using MongoDB is that there is no database-enforced schema, so when a notable change is made (like adding thumbs-ups to the comments), it can easily be done in a backwards-compatible way. Regarding caching, BusinessInsider does a lot less caching than they would on a MySQL database. Mongo is very fast at retrieving individual objects, so there is no need to cache individual posts; it is usually going to be as fast as Memcached at retrieving individual documents. Mongo itself can even be used as an effective caching layer: if your collection is small, Mongo will keep it entirely in memory and performance will be comparable to a cache. Another plus for Mongo is that it can store binary data in the database, so that they don't have to deal with the common hassle of keeping files in the file system and metadata in the database. Using its GridFS API, all the images on the site can easily be stashed in Mongo.

SourceForge.net had a large redesign last summer in which they moved to MongoDB. Their goal was to store the front pages, project pages, and download pages each in a single document. It is deployed with one master and 5-6 read-only slaves (obviously scaled for reads and reliability).

4.5 Amazon S3 - Simple Storage Service

Amazon has announced Amazon S3 - Simple Storage Service. It is not intended for the general public, but rather for software developers who want to work with the Amazon Web Services system. The Amazon Web Services Newsletter describes some specific details: "Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers. Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers."

Amazon S3 Functionality. Amazon S3 is intentionally built with a minimal feature set (a small usage sketch follows the list):

– "Write, read, and delete objects containing from 1 byte to 5 gigabytes of data each. The number of objects you can store is unlimited."
– "Each object is stored and retrieved via a unique, developer-assigned key."
– "Authentication mechanisms are provided to ensure that data is kept secure from unauthorized access. Objects can be made private or public, and rights can be granted to specific users."
– "Uses standards-based REST and SOAP interfaces designed to work with any Internet-development toolkit."
– "Built to be flexible so that protocol or functional layers can easily be added. The default download protocol is HTTP. A BitTorrent (TM) protocol interface is provided to lower costs for high-scale distribution. Additional interfaces will be added in the future."
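The sketch below shows the write/read/delete-by-key model from the list above using boto3, the modern Python SDK for AWS (which postdates this paper). The bucket name and object key are invented, and credentials are assumed to be configured in the environment.

  import boto3

  s3 = boto3.client("s3")            # credentials assumed to be configured already
  bucket = "example-articles"        # hypothetical bucket name

  # Write an object under a developer-assigned key ...
  s3.put_object(Bucket=bucket, Key="papers/nosql-movement.txt",
                Body=b"The NoSQL movement")

  # ... read it back via the same key ...
  body = s3.get_object(Bucket=bucket, Key="papers/nosql-movement.txt")["Body"].read()
  print(body.decode("utf-8"))

  # ... and delete it when it is no longer needed.
  s3.delete_object(Bucket=bucket, Key="papers/nosql-movement.txt")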
5 Conclusions

It is obvious that relational database systems are no longer the main keepers of the data, and that is especially true for some of the large companies that have risen during the Internet era: Amazon, Google, Facebook, LinkedIn, and others. But it is also true that many companies have invested heavily in Oracle, DB2 or MS SQL Server, and the truth is that those databases are still serving their needs. It is very unlikely that relational databases will disappear any time soon, but it is possible to see a gradual move towards open source non-SQL data stores for reasons of cost, simplicity and scalability.

References

1. Jeremy Zawodny: NoSQL: Distributed and Scalable Non-Relational Database Systems. Linux Magazine, October 2009
2. Werner Vogels: Eventually Consistent. ACM Queue Magazine, December 4, 2008
3. Adam Wiggins: SQL Databases Don't Scale. July 6, 2009
4. Eric Florenzano: My Thoughts on NoSQL. July 21, 2009
5. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels: Dynamo: Amazon's Highly Available Key-value Store
6. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber: Bigtable: A Distributed Storage System for Structured Data. OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November 2006
7. Doug Judd: Hypertable. June 2009
8. Eric Lai: Researchers: Databases still beat Google's MapReduce. April 2009
9. Stephen Shankland: Google spotlights data center inner workings. May 2008
10. Eric Lai: Red Hat Puts the Heat on Oracle. Computerworld, May 2007
11. Jay Kreps: Project Voldemort: Scaling Simple Storage at LinkedIn. LinkedIn blog, March 2009
12. http://www.scoopler.com/
13. Ian Eure: Looking to the future with Cassandra. Digg blog, September 2009