This presentation contains information on the test environment, settings, major criteria for evaluation, and component diagrams that can help you to test a NoSQL data store for your project. It also provides a matrix that compares a number of NoSQL products based on our test results. We also list the issues we encountered and some approaches we used to overcome them.
For more independent research into Hadoop, NoSQL, and other big data technologies, please visit www.altoros.com/research-papers or follow @altoros.
Often referred to as NoSQL, non-relational databases feature elasticity and scalability. In addition, they can store big data and work with cloud computing systems. All of these factors make them extremely popular.
For the purpose of the test, we divided the databases into three categories. Each database was evaluated based on 22 criteria.
NoSQL solutions address many of these problems.
POINT 1: In 2013, the number of NoSQL products reached 150+, and the figure is still growing. That variety makes it difficult to select the best tool for a particular case.
POINT 2: They come in many types: key-value, columnar, document-oriented, and graph.
POINT 3: All NoSQL databases have one thing in common: they do not use the relational data model. This means they do not rely on the standard SQL query language.
POINT 4: NoSQL data management systems are typically schema-free (a flexible data model without excessive complexity) and eventually consistent (complying with BASE rather than ACID).
POINT 5: They provide APIs to perform various operations. Some NoSQL data stores, for example Cassandra and HBase, support query language operations. However, there is no standard. This is another difference between NoSQL databases and traditional RDBMS.
POINT 6: RDBMS usually provide strong data consistency. In contrast, NoSQL data stores operate with eventual consistency: when you add data to the system, it becomes consistent across nodes after some time.
POINT 7: NoSQL architectures are designed to run in clusters that consist of several nodes. This makes it possible to scale them horizontally by increasing the number of nodes. In addition, NoSQL data stores serve huge amounts of data and provide high throughput.
POINT 1: NoSQL databases differ from RDBMS in their data models. These systems can be divided into four groups:
A. Key-Value Stores
Key-value stores are similar to maps or dictionaries where data is addressed by a unique key.
B. Document Stores
Document stores encapsulate key-value pairs in JSON or JSON-like documents. Within documents, keys have to be unique. In contrast to key-value stores, values are not opaque to the system and can be queried as well.
C. Column Family Stores
Column family stores are also known as column-oriented stores, extensible record stores, and wide columnar stores.
D. Graph Databases
Key-value stores, document stores, and column family stores have a common feature: they store denormalized data in order to gain advantages in distribution. In contrast to relational databases and the key-oriented NoSQL databases already introduced, graph databases specialize in the efficient management of heavily linked data.
POINT 2: All NoSQL data stores have an API to work with data. Some DBs use certain SQL operations. Others support MapReduce aggregation.
POINT 3: Multiversion concurrency control (MVCC) relaxes strict consistency in favor of performance. To support transactions without reserving multiple datasets for exclusive access, many stores provide optimistic locking: before changed data is committed, each transaction checks whether other transactions have made any conflicting modifications to the same datasets.
POINT 4: NoSQL databases differ in the way they distribute data across multiple machines. Since the data models of key-value stores, document stores, and column family stores are key-oriented, the two common partitioning strategies are based on keys, too.
The first strategy distributes datasets by the range of their keys. A routing server splits the whole keyset into blocks and allocates these blocks to different nodes. Afterwards, each node is responsible for storage and request handling of its specific key ranges.
To find a certain key, clients have to contact the routing server to get the partition table.
Higher availability and a much simpler cluster architecture can be achieved with the second distribution strategy, called consistent hashing. In contrast to range-based partitioning, keys are distributed by using hash functions. Since every server is responsible for a certain hash region, the addresses of keys within the cluster can be calculated very quickly.
In addition to better read performance through load balancing, replication also brings better availability and durability, because failing nodes can be replaced by other servers. If all replicas of a master server were updated synchronously, the system would not be available until all slaves had committed a write operation. If messages got lost due to network problems, the system would be unavailable for a longer period of time. This solution is not suitable for platforms that rely on high availability, because even a few milliseconds of latency can have a big influence on user behavior.
POINT 5: (PERFORMANCE: TYPICAL WORKLOADS) Performance is a very important factor. Performance of data storage solutions can be evaluated using typical scenarios. These scenarios, also known as typical workloads, simulate the most common operations performed by applications that use the data store. The tests that we performed to compare the performance of several NoSQL data stores also used typical workloads.
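The consistent hashing strategy described under POINT 4 can be illustrated with a minimal sketch. It is not taken from any of the tested products: the `HashRing` class, the node names, and the choice of MD5 as the hash function are assumptions made purely for illustration, and real stores usually add virtual nodes and replication on top of this idea.

```python
import bisect
import hashlib


def hash_position(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: a key belongs to the first node clockwise from it."""

    def __init__(self, nodes):
        # Sort nodes by their ring position so lookups can use binary search.
        self._ring = sorted((hash_position(node), node) for node in nodes)
        self._positions = [position for position, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Find the first node position greater than the key's position,
        # wrapping around to the start of the ring when the end is reached.
        index = bisect.bisect_right(self._positions, hash_position(key))
        return self._ring[index % len(self._ring)][1]


# Hypothetical cluster of three nodes and a few sample keys.
ring = HashRing(["node-a", "node-b", "node-c"])
owners = {key: ring.node_for(key) for key in ("user:1", "user:2", "sku:42")}
```

Any client can compute the owner of a key locally, without asking a routing server for a partition table. Removing a node only reassigns the keys that node owned; all other keys stay where they were.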
1. Device sync
Upload of raw payload from devices.
  Config
    Table: SYNC
    Initial records: imported on initial workload
    Operations: 100% custom inserts to SYNC
  Custom operation
    Parameters: PAYLOAD
    Steps:
      Generate SYNC_ID
      Read PAYLOAD value from pregenerated file
      Insert new record (SYNC_ID, PAYLOAD) into SYNC

2. Add to Shopping Cart
Consumer would add to the cart and may not buy: reading from INVENTORY, inserting to ORDERS.
  Config
    Table (read): INVENTORY
    Table (insert): ORDERS
    Initial records: imported on initial workload
    Operations: 100% adding to Shopping Cart: read of INVENTORY table – 50%, insert to ORDERS – 50%
  Custom operation
    Parameters: USER_ID, SKU_ID, QTY := 1, STATE := ‘INCOMPLETE’
    Steps:
      Retrieve row from INVENTORY with given SKU_ID
      Generate ORDER_ID
      Insert new Order record (USERID, ORDER_ID, SKU_ID, QTY, CR_DATE := SYSDATE, STATE := ‘INCOMPLETE’) to ORDERS

3. Profile registrations
Registration of new users in a system.
  Config
    Table: USER
    Initial records: imported on initial workload
    Operations: 100% adding of new profile: read (checking if user with such an email exists) – 50%, insert of new profile – 50%
  Custom operation
    Parameters:
      Password := ‘1234’
      Shipping addr := ‘1 Bowerman Drive, Beaverton, OR, 97005’
      Billing addr := ‘1 Bowerman Drive, Beaverton, OR, 97005’
    Steps:
      Generate a unique USER_ID [based on Java UUID]
      Generate FIRST_NAME, LAST_NAME: a random generator can be used, but the names must be generated so that the EMAIL constructed from them is unique (see the next step)
      Generate an EMAIL: FIRST_NAME.LAST_NAME@test.com – must be unique
      LAST_ACTIVITY_DATE will be SYSDATE in TIMESTAMP format
      Retrieve row from USER based on EMAIL (simulating a check of uniqueness)
      INSERT to USER(USER_ID, EMAIL, FIRST_NAME, LAST_NAME, Password, Shipping addr, Billing addr)

4. Login
Login of user and updating of the user's last activity date.
  Config
    Table: USER
    Initial records: imported on initial workload
    Operations: 100% login of user: read from USER – 50%, update USER – 50%
  Custom operation
    Parameters: User EMAIL
    Steps:
      Retrieve row from USER based on EMAIL
      Set Last Activity Date to SYSDATE in TIMESTAMP format and update USER record

5. Order Create
Customer would add to shopping cart and check out.
  Config
    Table (read): INVENTORY
    Table (update): INVENTORY
    Table (insert): ORDERS
    Initial records: imported on initial workload
    Operations: 100% of create order operation: read from INVENTORY – 33.3%, insert to ORDERS – 33.3%, update of INVENTORY – 33.3%
  Custom operation
    Parameters:
      USER_ID from USER
      Valid SKU_ID from INVENTORY
      QTY := 2
      STATE := ‘COMPLETE’
    Steps:
      Retrieve row from INVENTORY based on SKU_ID
      Check if LOCK is ‘N’
      Check QTY < AVAIL_QTY
      Subtract QTY from AVAIL_QTY in INVENTORY row for SKU_ID
      Set LOCK to ‘Y’ for SKU_ID [No other user should be able to place an order for this SKU as long as LOCK is ‘Y’] and update this INVENTORY row
      CR_DATE will be SYSDATE in TIMESTAMP format
      INSERT to ORDERS(USER_ID, SKU_ID, QTY, CR_DATE, STATE)
      Set LOCK to ‘N’ in INVENTORY row and update it again

6. Last 30 user activities
Checking for the last 30 activities on mobile app.
  Config
    Table: ACTIVITY
    Initial records: imported on initial workload
    Operations: custom query read – 100%
  Custom operation
    Parameters: USER_ID from USER
    Steps:
      Retrieve last 30 rows from ACTIVITY based on USER_ID and sort in descending order by the DATE column

7. Single activity detail based on activity id
Checking data of a single activity.
  Config
    Table: ACTIVITY
    Initial records: imported on initial workload
    Operations: custom read based on ACTIVITY_ID – 100%
  Custom operation
    Parameters: ACTIVITY_ID
    Steps:
      Retrieve data from ACTIVITY based on ACTIVITY_ID

8. Total fuel for the last 30 activities
Aggregate query of total fuel earned by user for the last 30 activities.
  Config
    Table: ACTIVITY
    Initial records: imported on initial workload
    Operations: custom aggregated read – 100%
  Custom operation
    Parameters: USER_ID from USER
    Steps:
      Aggregate FUEL for the last 30 activities from ACTIVITY

9. Delete an activity based on activity id
A user deletes a single activity from their mobile app.
  Config
    Table: ACTIVITY
    Initial records: imported on initial workload
    Operations: 100% custom delete based on ACTIVITY_ID
  Custom operation
    Parameters: ACTIVITY_ID
    Steps:
      Delete from ACTIVITY based on ACTIVITY_ID

10. Profile search based on the First Name and Last Name
User record lookup.
  Config
    Table: USER
    Initial records: imported on initial workload
    Operations: custom search read – 100%
  Custom operation
    Parameters:
      FirstName from USER
      LastName from USER
    Steps:
      Retrieve data based on First name and Last name [case insensitive]
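To make the step lists more concrete, here is a minimal sketch of the Order Create workload (item 5), run against in-memory dictionaries instead of a real data store. The `inventory` and `orders` dictionaries, the `create_order` function, the sample SKU, and the generated ORDER_ID key are all assumptions for illustration; they are not part of the original test harness.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional

# Hypothetical in-memory stand-ins for the INVENTORY and ORDERS tables.
inventory = {"SKU-1": {"AVAIL_QTY": 5, "LOCK": "N"}}
orders = {}


def create_order(user_id: str, sku_id: str, qty: int = 2) -> Optional[str]:
    """Follow the Order Create steps; return the new order key, or None if rejected."""
    row = inventory[sku_id]                 # read from INVENTORY
    if row["LOCK"] != "N":                  # another order holds the SKU
        return None
    if not qty < row["AVAIL_QTY"]:          # check QTY < AVAIL_QTY
        return None
    row["AVAIL_QTY"] -= qty
    row["LOCK"] = "Y"                       # first INVENTORY update
    order_id = str(uuid.uuid4())            # assumed key; the spec lists no ORDER_ID here
    orders[order_id] = {                    # insert to ORDERS
        "USER_ID": user_id,
        "SKU_ID": sku_id,
        "QTY": qty,
        "CR_DATE": datetime.now(timezone.utc),
        "STATE": "COMPLETE",
    }
    row["LOCK"] = "N"                       # second INVENTORY update releases the SKU
    return order_id


first_order = create_order("user-1", "SKU-1")
```

In this single-threaded sketch the LOCK flag never blocks anything. Against a real distributed store, the read, check, and update of LOCK are separate operations, so the flag only protects the SKU if the store offers some form of atomic or conditional update.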
Database vendors usually measure the performance of their products with custom hardware and software settings designed to demonstrate the advantages of their solutions. In our tests, we tried to see how NoSQL data stores perform under the same conditions.
POINT 1: For benchmarking, we used the Yahoo Cloud Serving Benchmark (YCSB). The kernel of YCSB is a framework with a workload generator that creates the test workload, plus a set of workload scenarios.
POINT 2: Developers need to describe the scenario of the workload by operation type: what operations are performed on what types of records.
POINT 3: Supported operations include insert, update (change one of the fields), read (one random field or all the fields of one record), and scan (read the records in key order, starting from a randomly selected record). We can define the workload by the data that will be loaded into the database during the loading phase and the operations that will be executed against the data set during the transaction phase.
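The idea of describing a workload as a mix of operation types can be sketched in a few lines. This is not YCSB code: the `OPERATION_MIX` proportions and the `generate_operations` helper are hypothetical and only show how a generator might draw operations for the transaction phase according to configured proportions.

```python
import random
from collections import Counter

# Hypothetical operation mix for the transaction phase; proportions sum to 1.
OPERATION_MIX = {"read": 0.5, "update": 0.3, "insert": 0.15, "scan": 0.05}


def generate_operations(mix, operation_count, seed=0):
    """Yield operation names drawn at random according to the given proportions."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    names = list(mix)
    weights = [mix[name] for name in names]
    for _ in range(operation_count):
        yield rng.choices(names, weights=weights)[0]


# Count how often each operation type appears in a 10,000-operation run.
counts = Counter(generate_operations(OPERATION_MIX, 10_000))
```

With enough operations, the observed counts approach the configured proportions, which is what lets the same scenario be replayed against different data stores.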
This is a component diagram of the YCSB framework. It consists of several modules. The workload executor applies the workload to the data store. For each session in which the client accesses the DB, a client thread is started. Each thread performs a set of operations from the workload. The results, in the form of statistics, are then sent to the statistics module, which prints the output of the test to the console where the benchmark was started. These tests are then repeated for all the selected solutions. The YCSB framework has connectors for a wide range of DBs. For each database tested with YCSB, a developer needs to specify the type of database, the target throughput, the number of concurrent threads on the client side, and the number of operations to perform. This is necessary to create and start a test.
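The interplay of client threads, target throughput, and the statistics module can be approximated with a small sketch. It does not reproduce YCSB internals: `run_benchmark`, its parameters, and the dictionary used as a fake data store are assumptions chosen for illustration.

```python
import threading
import time


def run_benchmark(db_call, operation_count, thread_count, target_ops_per_sec):
    """Run db_call across client threads, throttled to roughly the target rate."""
    latencies = []
    lock = threading.Lock()
    # Any remainder of operation_count / thread_count is dropped in this sketch.
    per_thread = operation_count // thread_count
    # Each thread gets an equal share of the overall target throughput.
    interval = thread_count / target_ops_per_sec

    def client_thread():
        for _ in range(per_thread):
            started = time.perf_counter()
            db_call()
            elapsed = time.perf_counter() - started
            with lock:
                latencies.append(elapsed)
            # Sleep off the rest of the time slot to respect the target rate.
            time.sleep(max(0.0, interval - elapsed))

    threads = [threading.Thread(target=client_thread) for _ in range(thread_count)]
    wall_start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    wall_time = time.perf_counter() - wall_start

    # Minimal "statistics module": summarize what the threads measured.
    return {
        "operations": len(latencies),
        "throughput_ops_per_sec": len(latencies) / wall_time,
        "avg_latency_ms": 1000 * sum(latencies) / len(latencies),
    }


fake_store = {}  # stands in for a real database connector
stats = run_benchmark(lambda: fake_store.update(key="value"), 200, 4, 2000)
```

Printing `stats` would give a rough equivalent of the console summary. The same call could then be repeated with a different `db_call` for each data store under test.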