Altoros: Using NoSQL Databases for Interactive Applications


Published in: Technology

  1. 1. Using NoSQL Databases for Interactive Applications
        By Alexey Diomin and Kirill Grigorchuk
  2. 2. Using NoSQL Databases for Interactive Applications © Altoros Systems

        Contents
        Introduction 3
        Cassandra, MongoDB, and Couchbase 3
        Key Considerations for Interactive Applications 3
        Performance Benchmarking 5
        Results 7
        Analysis 10
        Conclusion 10
        About the authors 11
        Additional Links 11
  3. 3. Introduction

        Interactive web applications need high performance and scalability, which calls for a different kind of database. If your website is not fast enough, users may quickly abandon it and look for alternatives. For example, in paid online social games, players are extremely demanding and will drop out even if there is a slight delay. To deliver the best user experience, you must pick the right database.

        Traditional RDBMS are the wrong tool for the job because they do not provide the necessary scalability and performance for working with large amounts of data and application requests. In contrast, NoSQL databases have become a viable alternative to RDBMS, particularly for applications that need to change rapidly. They provide high throughput, low latency, and horizontal scaling. But with so many different options around, choosing the right NoSQL database for your specific application needs can be tricky.

        Recently we took the time to review and benchmark several NoSQL databases. This whitepaper provides an overview of three popular NoSQL solutions: Cassandra, MongoDB, and Couchbase. In addition, it presents a vendor-independent performance comparison of these products and can be used as a guide when choosing a NoSQL database for an interactive application.

        Cassandra, MongoDB, and Couchbase

        Since we had to pick some NoSQL databases to start with, we looked around for commonly used open source NoSQL solutions. Cassandra, Couchbase, and MongoDB seemed to be the most mature open source products in their class. If you are already familiar with these NoSQL databases, you might want to skip the rest of this section and go directly to the performance evaluation.

        Cassandra is a distributed columnar key-value database with eventual consistency. It is optimized for write operations and has no central master: data can be written to or read from any node in the cluster. Cassandra provides seamless horizontal scaling and has no single point of failure: if a node in the cluster fails, another node steps up to replace it. At the moment, Cassandra is an Apache 2.0 licensed project supported by the Apache community.

        MongoDB is a schema-free, document-oriented NoSQL database. In MongoDB, data is stored in the BSON format. A BSON document is essentially a JSON document represented in a binary format, which allows for easier and faster integration of data in certain types of applications. This database also provides horizontal scalability and has no single point of failure. However, a MongoDB cluster is different from a Cassandra or Couchbase Server cluster: it includes an arbiter, a master, and multiple slaves. Since 2009, MongoDB has been an open source project under the AGPL license, supported by 10gen.

        Couchbase is a NoSQL document database. Documents in Couchbase Server are stored as JSON. With built-in caching, Couchbase provides low-latency read and write operations with linearly scalable throughput. The architecture has no single point of failure. It is easy to scale out the cluster, and live cluster topology changes are supported. This means there is no application downtime when you upgrade your database, software, or hardware using rolling upgrades. Couchbase, Inc. develops and provides commercial support for the Apache 2.0 licensed Couchbase project.

        Key Considerations for Interactive Applications

        Your database is the workhorse for your web application. When choosing a database, keep the following factors in mind:

        1. Scalability: It is hard to predict when your application will need to scale, but when your website traffic suddenly spikes and your database does not have enough capacity, you need to scale the database quickly, on demand, and without any application changes. Similarly, when your system is idle, you should be able to reduce the amount of resources used.
  4. 4. Scaling your database must be a simple operation: you should not need to deal with complicated procedures or make any changes to your application.

        In this paper, we only discuss horizontal scalability, which involves dividing a system into small structural components hosted on different physical machines (or groups of machines) and/or increasing the number of servers that perform the same function in parallel.

           a. Cassandra meets the requirements of an ideal horizontally scalable system. Nodes can be added seamlessly as you need more capacity, and the cluster automatically utilizes the new resources. A node can be decommissioned in automatic or semi-automatic mode.
           b. Couchbase scales horizontally. All nodes are identical and easy to set up. Nodes can be added to or removed from the cluster with a single button click and no changes to the application. Auto-sharding evenly distributes data across all nodes in the cluster without any hotspots. Cross-datacenter replication makes it possible to scale a cluster across datacenters for better data locality and faster data access.
           c. MongoDB has a number of functions related to scalability. These include automatic sharding (auto-partitioning of data across servers), reads and writes distributed over shards, and eventually consistent reads that can be distributed over replicated servers. When the system is idle, cluster size can only be decreased manually: the administrator uses the management console to change the system's configuration, after which the MongoDB server process can be safely stopped on the vacant machines.

        2. Performance: Interactive applications require very low read and write latencies. The database must deliver consistently low latencies for read and write operations regardless of load or the size of the data being accessed. In general, the read and write latency of NoSQL databases is very low because data is spread across all the nodes in a cluster while the application's working set is kept in memory.

        Interactive applications need to support millions of users and have different workloads: read, write, or mixed. In the next section, we share performance test results for different NoSQL databases, measuring latency at varying levels of throughput.

        3. Availability: Interactive web applications need a database that is highly available. If your application is down, you simply are not making any money. To ensure high availability, your solution should be able to perform online upgrades to the latest version, easily remove a node for maintenance without affecting the availability of the cluster, handle online operations such as backups, and provide disaster recovery if an entire datacenter goes down.

        Below are examples of how availability is achieved in different NoSQL databases:

           a. Cassandra: Every node in a Cassandra cluster, or "ring", is given a range of data for which it is responsible. When Cassandra receives a write operation destined for a node that has failed, it automatically routes the write request to a node that is alive. The node that receives the write request saves the write operation with a hint. The hint is a message that contains information about the failed node that should have handled the write request. The node that holds the hint monitors the ring for the recovery of the failed node. If the failed node comes back online, the node that holds the hint hands off the hint message to the recovered node, so that the write request can be persisted in its proper location. When a new node is added to the cluster, the workload is distributed to this new node as well.
           b. Couchbase: Couchbase Server maintains multiple copies (up to 3 replicas) of each document in a cluster. Each server is identical and serves active and replica documents. Data is uniformly distributed across all the nodes, and the clients are aware of the topology. If a node in the cluster fails, Couchbase Server detects the failure and promotes replica documents on other live nodes to active. The client cluster map is updated to reflect the new topology, so the application continues to work without downtime.
  5. 5. When capacity is added, data is rebalanced automatically, also without any downtime.

           c. MongoDB: Data in MongoDB is spread across several shards. Typically, each shard (replica set) consists of multiple mongo daemon instances, including an arbiter node, a master node, and multiple slaves. If a slave node fails, the master node automatically redistributes the workload to the rest of the slave nodes. If the master node crashes, the remaining members of the replica set, including the arbiter, vote to elect a new master. If the arbiter node fails and there are no instances left in the shard, the shard is dead. In MongoDB, a replica set can span multiple datacenters, but writes can only go to one primary instance in one datacenter (master-slave replication).

        4. Ease of development: Relational databases require a rigid schema to model an application. If your application changes, your database schema needs to change as well. In this regard, NoSQL databases have the following advantages:

           a. Flexible schema: You do not have to modify the existing structural elements when new fields are added to a document. New documents can coexist with existing documents without any additional changes.
           b. Simple query language: Because data in a NoSQL document is stored in a denormalized state, you can retrieve and update a document with simple get and put operations.

        Performance Benchmarking

        Our test infrastructure consisted of 4 extra-large instances on Amazon EC2 for the NoSQL databases and 1 instance for the client. Each instance had 4 virtual CPU cores with 2 Amazon compute units per core, 15 GB of RAM, and 4 EBS volumes of 50 GB each with RAID 0 striping. We used 64-bit Amazon Linux as the OS. Networking was all 10GigE.

        The client used the Yahoo! Cloud Serving Benchmark (YCSB), which we modified to suit our needs: we added a warm-up phase and adjusted working-set load generation to simulate different users accessing different data objects with meaningful data amounts and runtime. As shown in Figure 1, the YCSB client consists of two main parts: the workload generator and the workload scenarios.

        The benchmark used 30 parallel client threads to drive the test, generating a mixed read-write workload of 5% creates, 33% updates, 2% deletes, and 60% reads. For all the tests, we used 1.5 KB documents (15 fields of 100 bytes each), a typical document size across several NoSQL database use cases. The total number of documents in the cluster was 30 million for each database: 15 million active and 15 million replica documents.
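To make the workload description concrete, here is a minimal Python sketch of how such an operation mix and document shape could be generated. It is not the YCSB code (YCSB is a Java tool) and not the authors' modified client. The percentages, field count, and field size come from the text above, while the function names and the use of random ASCII letters as field values are assumptions made for illustration.

```python
import random
import string

# Operation mix reported in the paper, in percent (sums to 100).
OPERATION_MIX = {"create": 5, "update": 33, "delete": 2, "read": 60}

FIELD_COUNT = 15    # fields per document, from the paper
FIELD_LENGTH = 100  # bytes per field value, from the paper


def make_document(rng):
    """Build one document with FIELD_COUNT fields of FIELD_LENGTH ASCII letters each.

    Random letters are an assumption; the paper does not say what the values contained.
    """
    return {
        f"field{i}": "".join(rng.choices(string.ascii_letters, k=FIELD_LENGTH))
        for i in range(FIELD_COUNT)
    }


def next_operation(rng):
    """Pick the next operation type with probability proportional to OPERATION_MIX."""
    names = list(OPERATION_MIX)
    weights = list(OPERATION_MIX.values())
    return rng.choices(names, weights=weights, k=1)[0]


def payload_size(document):
    """Total size of the field values in bytes (field names are not counted)."""
    return sum(len(value.encode("ascii")) for value in document.values())
```

Under these assumptions each document carries 15 × 100 = 1,500 bytes of field values, so the 30 million active and replica documents add up to roughly 45 GB of field data per database.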
  6. 6. Figure 1: YCSB client and NoSQL database server architecture

        We ran each test five times for every NoSQL database and compared the average data access latency against different throughput levels. The NoSQL databases were set up using the following configuration:

        Cassandra 1.1.2

        Cassandra JVM settings:
           1. MAX_HEAP_SIZE, the total amount of memory dedicated to the Java heap: 6 GB
           2. HEAP_NEWSIZE, the total amount of memory for the new generation of objects: 400 MB

        Cassandra settings:
           1. RandomPartitioner, which uses MD5 hashing to evenly distribute rows across the cluster
           2. Memtable of 4 GB in size
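The idea behind hash-based row distribution can be illustrated with a short Python sketch. This is a simplified model and not Cassandra's implementation: the evenly spaced tokens, the 128-bit token space, and the names `token_for`, `build_ring`, and `node_for` are assumptions chosen for illustration. It only shows why hashing row keys with MD5 tends to spread rows evenly across nodes.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 128  # number of possible MD5 digests (assumed token space)


def token_for(key):
    """Map a row key to an integer token by hashing it with MD5."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def build_ring(node_names):
    """Give each node an evenly spaced token; returns a sorted list of (token, node)."""
    step = TOKEN_SPACE // len(node_names)
    return [(index * step, name) for index, name in enumerate(node_names)]


def node_for(key, ring):
    """Return the node that owns the key.

    The owner is the first node whose token is greater than or equal to the
    key's token; keys past the last token wrap around to the first node.
    """
    tokens = [token for token, _ in ring]
    index = bisect.bisect_left(tokens, token_for(key))
    return ring[index % len(ring)][1]
```

With a four-node ring such as `build_ring(["node1", "node2", "node3", "node4"])`, each node owns about a quarter of the token space, so a large set of distinct keys lands on the nodes in roughly equal shares regardless of how the keys themselves are distributed.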
  7. 7. Couchbase 2.0 Beta build 1723
           1. 1 replica setting
           2. 12 GB per-node RAM quota, using the Couchbase bucket type

        MongoDB
           1. 4 shards, each with 1 replica; each shard is a set of 2 nodes, a primary and a secondary
           2. Journaling disabled
           3. Each node ran 2 mongo daemon processes and 4 mongo router processes

        Results

        Figure 2 shows the average latency of read, insert, and update operations, measured from the client to the server and back, at varying levels of throughput for each NoSQL database. The lower the latency, the better.
  8. 8. Figure 2: Average latency vs. throughput

        We also calculated the 95th percentile of the time taken to execute read, insert, and update operations, measured from the client to the server and back, at varying levels of throughput for each NoSQL database.
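The difference between the two metrics is easy to see in a small Python sketch. The paper does not state which percentile definition was used, so the nearest-rank method below is an assumption, and the sample latencies in the note that follows are made up for illustration rather than taken from the benchmark.

```python
def average(samples):
    """Arithmetic mean of a non-empty list of latency samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(samples) / len(samples)


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100.

    Returns the smallest sample such that at least pct percent of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

For a hypothetical run where 90 requests take 1 ms and 10 requests take 50 ms, the average is 5.9 ms while the 95th percentile is 50 ms. This is why the percentile view exposes slow outliers that an average can hide.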
  9. 9. Figure 3: 95th percentile latency vs. throughput
  10. 10. Typically, you want to see flat latency curves irrespective of the throughput to ensure a consistent user experience. Couchbase had faster read and write times than MongoDB and Cassandra.

        Analysis

        While not an exhaustive list, these are the most relevant pros and cons identified after reviewing these databases:

        MongoDB demonstrated the lowest throughput among all the databases compared in our test. We saw high latencies for write operations at average throughput because the coarser locking in MongoDB limits the write throughput of the server. Read requests were faster than in Cassandra but slower than in Couchbase.

        Increasing the size of the cluster in MongoDB was rather complicated. Many MongoDB operations need to be done manually through the command line, so a highly skilled system administrator is a must. The advantages include built-in MapReduce and support for CAS transactions.

        Cassandra showed better results than MongoDB because it uses an eventually consistent architecture in which a write is confirmed as soon as one node replies. In addition, unlike MongoDB, Cassandra is rather flexible when the cluster needs to be resized. Unfortunately, its extreme flexibility, designed to sustain performance in highly distributed environments, resulted in additional limitations. The database supports no transactions and cannot lock individual records.

        Couchbase showed the lowest latencies and highest throughput among all the databases compared. The built-in object-managed cache is responsible for the low latency. With fine-grained locking at the document level, Couchbase Server was capable of providing high throughput for both reads and writes. The admin console in Couchbase has flexible settings for changing cluster size. Each document in the cluster has an active copy and multiple replicas. Access requests for a particular document are processed by the server holding the active document, which makes it possible to add extended transaction processing systems, locking, and CAS. This also eliminates the problem with eventual consistency, where read replicas hold obsolete values. As a bonus for database administrators, Couchbase also comes with advanced tools for monitoring the status of the whole cluster and its separate nodes.

        Conclusion

        Choosing the right NoSQL database for your application is a very complicated process because every NoSQL solution is optimized for a particular type of load. This is why you should properly evaluate all available options before picking a suitable data store for your application.
  11. 11. About the authors

        Kirill Grigorchuk is the head of the R&D department at Altoros Systems Inc. Mr. Grigorchuk has 15+ years of experience in IT and profound skills in R&D process engineering, product and project management, web development, and big data. At the moment, he leads and coordinates research into a wide range of cutting-edge technologies, including distributed computing and NoSQL solutions.

        Alexey Diomin is a senior Java developer at Altoros Systems Inc. with vast experience in distributed computing, NoSQL databases, and Linux. With excellent skills in building, administering, and supporting large-scale distributed computing systems, Mr. Diomin has done extensive research in the field of big data.

        Additional Links

        Cassandra website: http://cassandra.apache.org
        Couchbase Server website: http://www.couchbase.com
        MongoDB website: http://www.mongodb.org
        YCSB GitHub: