Distributed Databases Overview



Achieving High Availability, Scalable Storage and Performance at Portal do Aluno - Distributed Databases Overview Study

Luis Carlos Dill Junges¹, Ivan Linhares Martins¹

¹Certi Foundation – Federal University of Santa Catarina (UFSC)
Postal Box 5053 – 88.040-970 – Florianópolis – SC – Brazil
luis.junges@gmail.com, ilm@certi.org.br

Abstract. This document consolidates a study carried out at Certi Foundation for the federal project called Portal do Aluno. This project will be an internet portal whose main objective is to spread knowledge among students between 12 and 18 years old from Brazilian elementary schools. Considering that around 5 million students will be using it monthly, some problems are inevitable in the storage system and in the availability of the portal. With this problem in mind, a comprehensive study has been made of the new flavor of distributed databases available on the market. The results of this study are published in this document, with some considerations on each system.

Resumo. This document consolidates a study carried out at the Certi Foundation for the Federal government project called Portal do Aluno. The project targets students between 12 and 18 years old from Brazil's elementary school network, aiming to become a portal for publishing and producing content related to the students' education. The project will have around 5 million users, which will inevitably bring some problems to the system's backend regarding data scalability and high availability of the portal. With this in mind, an elaborate study of the current solutions among the new distributed database systems was carried out, and its results are presented in this document.

1. Introduction

This study was born from the problem being faced by the Portal do Aluno project.
The project consists of a social network focused on spreading knowledge among students between 12 and 18 years old from the elementary schools of Brazil. The problems are related to the availability, the storage capacity and also the performance of the overall system. Although smaller projects were developed using standard relational databases, which inevitably became the SPOF¹ of the system, this project required a better solution in order to meet new requirements. This study shows a way to overcome such problems by using a new kind of Open Source tools available in the developing community. This new set of tools has been driven by the NoSQL movement, which began around 2009 to address the limitations found in handling big data volumes and workloads.

¹ Single Point of Failure
This movement aims to redirect database development toward horizontal scalability by relaxing some aspects. One of those aspects is the fact that such systems often provide only eventual consistency and are, therefore, not fully compliant with the ACID² properties.

This article is organized as follows: Section 2 describes the project. Section 3 presents the problems and the motivation to study a new approach. Section 4 describes the general characteristics of those distributed systems and Section 5 gives a brief overview of the major Open Source players. Section 6 shows a comparative table of the properties of each system. Section 7 presents the most prominent solution that best meets Portal do Aluno's requirements. Finally, Section 8 gives the conclusion.

2. Portal do Aluno

Portal do Aluno is a social learning environment project from the Ministry of Education of Brazil (known as MEC). It has the characteristics of a social network and aims to provide an educational portal with collaborative tools for school tasks. It will be an extension of elementary schools on the internet, trying to promote integration among schools, students and teachers around Brazil through the possibility of having groups for research, discussions and other common tasks.

This portal is subdivided into modules with specific content. Some of them allow uploading files such as images and any other type of document, including video. As the number of users of this portal is potentially high from the beginning, scalability and availability are essential, which leads to the problem described in the next section.

3. Problem

Relational databases are powerful and robust, in such a way that a widespread set of applications and systems use them. However, they show limitations when large sets of data need to be stored and when high availability of the system is mandatory.
Regarding the first issue, it is provably impossible³ to keep the ACID properties while scaling across multiple machines. Until now this has typically been solved by high-end RDBMS⁴ through the use of a replication system with a master-slave architecture, as shown in Figure 1. Even though it is a working approach, this model has a prominent SPOF⁵ in the master: if it fails, the system goes down. This approach achieves scalability by forwarding the reads to free slaves (load balancing) and all writes to the master, which again becomes the bottleneck of the data flow.

The second issue is usually solved by hardware solutions based on RAID⁶. At a glance, the goal of this model is achieved by replicating the data among several hard drives and swapping them accordingly on a failure. RAID systems, however, are not a complete safety solution, because they cannot survive without a backup if the server holding them is lost to fire, flooding or any other reason.

² Atomicity, Consistency, Isolation, Durability
³ See Section 4.1 - CAP Theorem
⁴ Relational Database Management System
⁵ Single Point of Failure
⁶ Redundant Array of Independent Drives
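The master-slave arrangement described above can be sketched in a few lines of Python. This is a simplified illustration of the read/write routing idea, not any particular RDBMS's replication protocol; all class and method names are hypothetical:

```python
import random

class Node:
    """A single database server holding a full copy of the data."""
    def __init__(self):
        self.data = {}

class MasterSlaveCluster:
    """Routes all writes to the master and spreads reads over the slaves.

    The master replicates each write to every slave, so it is both the
    write bottleneck and the single point of failure discussed above.
    """
    def __init__(self, n_slaves):
        self.master = Node()
        self.slaves = [Node() for _ in range(n_slaves)]

    def write(self, key, value):
        # Every write goes through the master...
        self.master.data[key] = value
        # ...which then pushes it to all slaves (replication).
        for slave in self.slaves:
            slave.data[key] = value

    def read(self, key):
        # Reads are load-balanced across the slaves.
        return random.choice(self.slaves).data[key]

cluster = MasterSlaveCluster(n_slaves=3)
cluster.write("student:42", {"name": "Ana"})
print(cluster.read("student:42"))  # {'name': 'Ana'}
```

Reads scale by adding slaves, but every write still touches the master, which is exactly the limitation the distributed systems in the next sections try to remove.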
Figure 1. Traditional Relational Database Scaling

Those relational issues have been claimed to be solved (or at least to be on the road to a solution) by a new flavor of distributed databases that is relatively new in the developing community. Those systems promise to efficiently overcome the shortcomings found in relational systems by relaxing some characteristics, such as consistency, and by giving strong consideration to node failures. The next section introduces such systems.

4. Distributed Databases

One of the major advantages of using a distributed database over a traditional relational database is the possibility to scale reads and writes easily by just adding new nodes to the cluster. Relational databases can solve this issue for reads, but scaling the writes is virtually impossible and, in the end, becomes too expensive.

A brief comparison between relational databases and those new systems is given in Table 1.

Relational Databases:
• Tables, columns, rows
• ACID properties fully satisfied
• Normalized to avoid data duplication
• Strong storage schema
• Queries fully supported

Distributed Databases:
• Table-like domain
• Data identified only by a key
• Schema-less
• Data integrity in the application's code
• Eventual consistency
• Support for queries is limited

Table 1. Relational vs Distributed Databases

Those systems also adopt either a key-value model or a document-oriented approach:
Key-value: the data is associated with a key, like a map. It is only possible to retrieve the data by knowing its key. These systems are usually able to retrieve the data in constant time, independent of how many entries have been stored.

Document-oriented: the data is stored in a format which represents a document. There is no schema, and some fields present in one document may not exist in other documents. Some implementations use JSON or XML as the protocol layer for the data.

4.1. CAP Theorem

The CAP theorem [Gilbert and Lynch 2002] was born as a set of properties among which a shared system must choose. The properties are as follows:
• Strong Consistency: all clients see the same view, even in the presence of updates.
• High Availability: all clients can find some replica of the data, even in the presence of failures.
• Partition-Tolerance: the system properties hold even when the system is partitioned by node failures, network problems or any other reason.

The theorem states that a distributed system can only have two of the three CAP properties at the same time. Distributed databases usually pick Availability and Partition-Tolerance. In order to handle consistency, some of them use versioning systems [Manassiev and Amza 2005] [Amza et al. 2003] for resolving update conflicts.

5. Available Solutions

This section presents some approaches to distributed databases, explaining the singular characteristics of each one and giving a practical example of where they are being used at the moment. Those new systems are based on a large set of Open Source technologies [Bortnikov 2009], which makes them really attractive, and although there is not yet a consolidated benchmark accepted by the community [Binnig et al. 2009] [Cryans et al. 2008], some points can still be made about each solution.

5.1. Voldemort

Voldemort is a relatively new Open Source project in the community, as it was released at the beginning of this year.
It has been entirely written in Java and is based on the key-value model, having just two functions to interact with (set and get). As its own developers say, Voldemort is basically just a big, distributed, persistent, fault-tolerant hash table. For data persistence it uses MySQL or BDB as the backend on each node. As it embraces the concept of eventual consistency [Vogels 2008], it uses a simple incremental versioning system for each update on the data. The application is responsible for fixing integrity problems and other issues that may happen with the stored data.

This project is currently in production use at Linkedin.com in some parts which require high availability. The access speeds observed in the production environment are on the order of 19384 requests per second (req/sec) for reading and 16559 req/sec for writing.

Among the good points of this project is its well-written documentation. There is also a good data replication scheme that can be manually configured in terms of how many writes and reads have to be made in order to validate a store or a read operation,
respectively. As an example, an exception will be thrown if it is set to write to at least 3 nodes and just 2 nodes are up. This project's design has also taken into consideration working properly with load balancers in clustering setups, as described in Figure 2. As one major advantage, this project does not have a SPOF, delivering, therefore, a highly available system for critical applications. The drawbacks are the impossibility of adding a new node to a live cluster, which means the entire system has to be shut down in order to configure a new node. Another point is that all code processes one value at a time in memory (no cursor or streaming), meaning that the values need to fit comfortably in memory.

Figure 2. Voldemort's clustering architecture

5.2. HBase

HBase is the official Hadoop project database. It is an Open Source, distributed, column-oriented store modeled after Google's BigTable [Chang et al. 2008]. Just as BigTable leverages the distributed data storage provided by the Google File System [Ghemawat et al. 2003], HBase provides BigTable-like capabilities on top of Hadoop using HDFS⁷. HBase is a very good and powerful project which gives users the opportunity to run parallel processing on the cluster through the use of MapReduce jobs. The current release (0.20) has removed the major drawbacks of having a SPOF and high read latency. Its architecture works through a distribution of masters and region servers along the cluster's machines, as described in Figure 3.

⁷ Hadoop Distributed File System
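The MapReduce model that HBase exposes for parallel processing can be illustrated with a minimal, single-process sketch. This shows only the programming model (map, shuffle, reduce), not the Hadoop API; the function and variable names are hypothetical:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal in-memory illustration of the MapReduce model:
    map every record to (key, value) pairs, group the values by key
    (the shuffle), then reduce each group to a final value."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):      # map phase
            groups[key].append(value)          # shuffle / group phase
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count, the canonical MapReduce example.
rows = ["hbase stores big tables", "big tables need big clusters"]
counts = map_reduce(
    rows,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts["big"])  # 3
```

In a real cluster, the map and reduce calls run in parallel on the machines holding the data; the sketch above keeps only the data flow.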
HBase is currently in use at several places, including a Yahoo cluster with 10000 PCs. Some companies are also running tests with HBase on the Amazon Elastic Compute Cloud (known as Amazon EC2).

Figure 3. HBase's architecture

5.3. Redis

Redis is a key-value distributed solution with the advantage of having more operations than just the traditional set and get API. Those operations include handling multiple sets and some simple queries on the stored dataset, with the guarantee of being atomic (for some operations only). It also supports storing more data types than just strings or binaries, including lists, sets and ordered sets. Another major point is that it is incredibly fast, able to perform around 110000 SETs/second and around 81000 GETs/second according to the developers' test case. It works by doing asynchronous calls, which means data can be lost between the time a write was requested and the time it actually happened (non-atomic operations). There is also the constraint that the whole dataset needs to fit on a single device.

5.4. Cassandra

Cassandra was born to solve Facebook's problems. It is a more complete key-value database based on Dynamo's fully distributed database design [DeCandia et al. 2007] and BigTable's column-family-based data model [Chang et al. 2008]. This project offers high availability without a SPOF, with incremental scalability through the option of adding new nodes to a live environment without disturbing the
applications currently running on the database. It also guarantees atomicity for operations on a single Column Family. Drawbacks include the poor, almost nonexistent documentation, with a very obscure and difficult API that will go through a heavy remodelling in the next releases. Cassandra is currently in use at Facebook for the inbox search, where 40 TB of data are claimed to be distributed along 120 machines in separate data centers. It is also in use at Rackspace and Digg.com.

5.5. MongoDB

MongoDB is a document-oriented approach to scalable distributed databases. It is an Open Source implementation entirely written in C++ with commercial support. Its major advantage is its query support, which makes it unique in this respect. It works through a BSON (binary JSON) format for big data handling (photos and videos), with support for MapReduce jobs.

Figure 4. MongoDB's Architecture Design

As a drawback, it has an intricate cluster schema, as shown in Figure 4, which has several SPOFs. It is subdivided into config servers (storing metadata on which mongo shard holds the data) and mongo shards that store the data. There are also the mongo instances, which are entry points for clients. Right now it is a relatively new implementation without full support for sharding, and data replication has a constraint on the number of nodes (2 nodes only) that can be used.

5.6. Tokyo Cabinet/Tyrant

Tokyo Cabinet/Tyrant is an Open Source project claimed to be in use at mixi.jp, a Japanese social network, with 10000 updates/second through MemCache. The use of this tool seems to apply to the handling of 20 million entries of data (20 bytes each). Although no test has been made, this solution claims to be really fast at writing and reading operations, able to perform around 58000 req/second. It also seems to support
ACID properties, with several different storage approaches (Hash, B-Tree) for each type of data being stored. As drawbacks, it does not have good documentation and few projects are using it.

5.7. CouchDB

CouchDB is a very easy-to-run project with a document-oriented approach. It has a totally unstructured, schema-less storage backend through the use of the JSON format for data handling. It is very similar to Amazon's SimpleDB solution, with asynchronous replication of data. It also has a browser administration console where it is possible to create MapReduce jobs, backup operations and view statements like those found in relational systems. It uses HTTP requests to manage the dataset, which makes it connectable to any software able to perform HTTP requests. CouchDB has a major advantage because it provides a query-like engine which enables users to build their own queries for the application being developed.

As a big drawback, it does not satisfy the concept of scalability, because all the data being stored needs to fit on a single device. The availability of the system is achieved by a client router which forwards the queries to the desired backend service. So, CouchDB is not a distributed database at the current moment, but it has some interesting features that make it eligible to be on this list. One of those features is the MapReduce support. Another one is the approach of having the entire dataset, or part of it, stored directly on the client's computer with asynchronous replication. By doing this, the workload at the backend can be reduced, because the replication will happen at an appropriate moment. This feature could also be used for mobile devices that get synchronized at a base station (bluetooth, wireless, cable) and can access a website after that without having to connect to the internet.
This lets the user avoid spending money on a data carrier, or even witnessing low connection speeds, which invariably improves the website's user experience. There are some issues regarding the type of data that can be handled with such an approach, and whether modifications made by the user can, at a later moment, be synchronized with the main data server without consistency problems. Despite those issues, this approach seems interesting for delivering fast content to mobile devices. At Portal do Aluno, this feature could be used to connect users to the dashboard, with the option of editing comments on mobile devices that can later be synchronized with the main server.

CouchDB is in use at several projects and websites because of the easiness it provides through HTTP requests. At the moment it is a very young project, with strong security problems and in Alpha development. Even with such issues, it is a project to keep an eye on.

5.8. MemCache

According to the developers, MemCache is an Open Source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. It is a really simple and robust solution to improve the performance of web applications by cutting the read time through access to the cache layer instead of the database
itself. As a real example, it has reached the speed of 38000 req/sec at Flickr.com. It does not have a persistence layer, but it is able to do load balancing properly just by adding new nodes to the cluster. It is in use at several big projects on the internet and can be used directly through the API of some high-end RDBMSs as a cache layer (MySQL and PostgreSQL).

Figure 5. Solution using intermediate storage layers

Although MemCache does not provide persistence, it can be very useful together with solutions that do provide storage, combined with commercial services available on the market such as the Amazon Simple Storage Service (Amazon S3). By using MemCache, there is a considerable drop in the number of read operations that hit Amazon S3 and, consequently, in the monthly payment. Figure 5 shows this approach, using MemCache as an intermediate layer for applications that do require high availability but do not have the capacity to set up a private cluster for it. Instead, they use commercial storage solutions for the cluster [Brantner et al. 2008] [Palankar et al. 2008] and reduce the monthly payment by adding additional storage layers (disk and MemCache).

5.9. Others

There are also many more distributed database projects with different approaches. Most of them seem to be at the beginning of development, without enough documentation and robustness, or without a persistence layer (in-memory only). Some of them include:
• ThruDB
• LightCloud
• Kay
• MemcachedDB
• Scalaris
• NMDB
• Disco
• Riak
• Hazelcast
• KeySpace
• Dynomite
• MNesia
• Ringo
• Hypertable

6. Solutions' Benchmark

Until now there is no accepted benchmark for those new systems [Binnig et al. 2009] [Cryans et al. 2008], and the decision to use or not use a system is based on their properties. Table 2 shows a comparative listing of some properties of each system. This table is a snapshot of the systems made in December 2009.
Name | Language | Fault-Tolerance | Persistence | Client | Data Model | Documentation | Production
Voldemort | Java | Partitioned, Replicated | Berkeley DB, MySQL | Java API | Structured, Blob, Text | Good | LinkedIn
HBase | Java | Replication, Partitioning | Custom on-disk | Custom API, Thrift | BigTable | Good | Yahoo
Cassandra | Java | Replication, Partitioning | Custom on-disk | Thrift | BigTable, Dynamo | Poor | Facebook
CouchDB | Erlang | Replication | Custom on-disk | HTTP, JSON | Document-Oriented | Good | UbuntuOne
MongoDB | C++ | Replication | Custom on-disk, GridFS | Java, C++ Drivers | Document-Oriented | Good | SourceForge
Hypertable | C++ | Replication, Partitioning | Custom on-disk | Java, Thrift | BigTable | Good | Baidu
ThruDB | C++ | Replication | Custom on-disk | Thrift | Document-Oriented | Medium | —
Ringo | Erlang | Replication, Partitioning | Custom on-disk | HTTP | Blob | Medium | Nokia
Tokyo Tyrant | C | — | B-Tree, Hash | ANSI C | Document-Oriented | Poor | Mixi.jp
Scalaris | Erlang | Replication, Partitioning | In-Memory | Java, Erlang, HTTP | Blob | Medium | OnScale
MemCache | C | Partitioning | In-Memory | Python, Java, Ruby | — | Good | Several Projects
Dynomite | Erlang | Replication, Partitioning | — | Custom, Thrift | Blob | Poor | PowerSet
Kai | Erlang | Partitioning | — | — | Blob | Poor | —

Table 2. Comparative List of Distributed Databases Properties

7. Adopted Solution

For Portal do Aluno there are some requirements that need to be met: easily scalable, highly available and fast content-retrieval storage. Among the presented solutions, HBase (Hadoop project) and Voldemort have shown themselves to be the most robust solutions available at the moment which completely meet the easy-scalability goal proposed by the NoSQL movement. HBase had the problems of high latency and a SPOF in the master, which made it inappropriate for serving web pages in real time; in the current version (≥ 0.20), those problems seem to be solved. Voldemort is a really robust approach, but it is not optimized for the large data objects Portal do Aluno needs to store (video, photos). This limitation exists because it uses MySQL or BDB as its storage backend.
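Voldemort's configurable replication, mentioned in Section 5.1, validates a write only after a minimum number of nodes have accepted it. The rule can be sketched as follows; this is a simplified illustration of the required-writes idea with hypothetical names, not Voldemort's actual API:

```python
class QuorumStore:
    """A write succeeds only if at least `w` of the replicas accept it;
    otherwise an exception is raised, mirroring Voldemort's required-writes
    setting. Replica failure is simulated with a `down` flag."""
    def __init__(self, n_replicas, w):
        self.replicas = [{"down": False, "data": {}} for _ in range(n_replicas)]
        self.w = w

    def put(self, key, value):
        acks = 0
        for replica in self.replicas:
            if not replica["down"]:
                replica["data"][key] = value
                acks += 1
        if acks < self.w:
            raise RuntimeError(f"only {acks} of required {self.w} writes succeeded")

store = QuorumStore(n_replicas=3, w=3)
store.replicas[0]["down"] = True  # one node fails
try:
    store.put("k", "v")
except RuntimeError as e:
    print(e)  # only 2 of required 3 writes succeeded
```

Setting w below the replica count trades some durability guarantees for availability, which is exactly the kind of tuning these systems expose in place of full ACID semantics.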
MongoDB has some problems with the scalability it can provide and also with the SPOFs it has. It does have, however, strong support for queries, which makes it eligible to be tested as a backend for Portal do Aluno. It also has its own binary format (BSON), which makes it relatively fast.

As a result, HBase could be used for future tests based on its good documentation, ease of use and the robustness it provides. Considering that some intricate and complex joins have to be made at Portal do Aluno, MongoDB could also be used for tests due to its query support. Together with one of those solutions, MemCache could be used to speed up the performance of the portal.

8. Conclusion

It is not possible to deny that a new set of databases is being developed from now on. They started as competitive commercial advantages of private companies trying to solve internet-related problems. Assuming that they will replace the old-fashioned relational database model is naive thinking, because only a few special applications require their power. RDBMSs also have more features that are well known to implement and deal with, not to mention the fact that they can organize the data as it is in the real world, with strong integrity, which makes them independent of the application. The new flavor, however, has shown itself to be a good promise in terms of scalability and availability, using ordinary hardware as a cheap solution. Businesses which rely completely on a single access point to the client will see those new tools as mandatory in order to have more availability in their applications.

The requisite of having a highly available system is mandatory for some applications. Until now this property was satisfied with huge investments in backup systems with RAID and other devices and software, which just increased the number of SPOFs. Of course, those new systems are not completely trustworthy at the moment, because they are relatively new and may end up failing, but they can be considered an option to fulfill this requirement.

The use of those systems should be analyzed for each application, as it is known that the entire storage logic will have to be glued to the application's code. The possibility of dealing with complicated queries (search, insert, update, delete) is, at the current moment, very limited or nonexistent. Also, the normalization theory usually found in relational databases to avoid data replication does not apply to them at all. This new approach keeps several replicas of the data in several places, which need to be synchronized and kept up to date entirely by the application's code.

As a rule of thumb, those new tools are encouraged when the requisites of the application match at least one of the following statements: there is a huge amount of data that needs to be stored; the data set has an easy representation that does not require complex joins or queries and naturally fits the key-value model; the future of the application will have high-demand access which will lead to performance problems without clustering.

References

Amza, C., Cox, A. L., and Zwaenepoel, W. (2003). Distributed versioning: consistent replication for scaling back-end databases of dynamic content web sites. In Middleware '03: Proceedings of the ACM/IFIP/USENIX 2003 International Conference on Middleware, pages 282–304, New York, NY, USA. Springer-Verlag New York, Inc.

Binnig, C., Kossmann, D., Kraska, T., and Loesing, S. (2009). How is the weather tomorrow?: towards a benchmark for the cloud.
In DBTest '09: Proceedings of the Second International Workshop on Testing Database Systems, pages 1–6, New York, NY, USA. ACM.

Bortnikov, E. (2009). Open-source grid technologies for web-scale computing. SIGACT News, 40(2):87–93.

Brantner, M., Florescu, D., Graf, D., Kossmann, D., and Kraska, T. (2008). Building a database on S3. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 251–264, New York, NY, USA. ACM.

Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2008). Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):1–26.

Cryans, J.-D., April, A., and Abran, A. (2008). Criteria to compare cloud computing with current database technology. In IWSM/Metrikon/Mensura '08: Proceedings of the International Conferences on Software Process and Product Measurement, pages 114–126, Berlin, Heidelberg. Springer-Verlag.

DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. (2007). Dynamo: amazon's highly
available key-value store. In SOSP '07: Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, pages 205–220, New York, NY, USA. ACM.

Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. In SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 29–43, New York, NY, USA. ACM.

Gilbert, S. and Lynch, N. (2002). Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33(2):51–59.

Manassiev, K. and Amza, C. (2005). Scalable database replication through dynamic multiversioning. In CASCON '05: Proceedings of the 2005 Conference of the Centre for Advanced Studies on Collaborative Research, pages 141–154. IBM Press.

Palankar, M. R., Iamnitchi, A., Ripeanu, M., and Garfinkel, S. (2008). Amazon S3 for science grids: a viable solution? In DADC '08: Proceedings of the 2008 International Workshop on Data-aware Distributed Computing, pages 55–64, New York, NY, USA. ACM.

Vogels, W. (2008). Eventually consistent - revisited. http://www.allthingsdistributed.com/2008/12/eventually_consistent.html. Visited in December 2009.