Comparing noSQL databases : benchmark

FOSDEM 2011 data analytics presentation about noSQL performance.

See http://www.nosqlbenchmarking.com

Comparing noSQL databases : benchmark Presentation Transcript

  • 1. Comparing Scalable NOSQL Databases: Functionalities and Measurements
    Dory Thibault, UCL
    Contact: thibault.dory@student.uclouvain.be
    Sponsor: Euranova
    February 4, 2011
  • 2. Motivation
    Outline: Motivation, Overview of the databases, Methodology, Results, Summary and conclusion
    YCSB
    The Yahoo! Cloud Serving Benchmark (YCSB) is the best-known noSQL benchmarking application, so why make another one?
      • YCSB uses data generated from statistical distributions instead of real data
      • YCSB only focuses on read/write/update/scan performance
      • YCSB results for elasticity are not conclusive
    Idea
      • Data and use case inspired by a concrete case: Wikipedia
      • Test read/update performance
      • Test MapReduce performance by computing an inverted search index
  • 3. Cassandra 0.6.10
    Databases compared: Cassandra 0.6.10, HBase 0.20.6, mongoDB 1.6.5, Riak 0.14
    Overview
    Cassandra is a fully distributed, column-oriented data store that provides a MapReduce implementation using Hadoop.
      • All the nodes in the cluster play the same role
      • The data (existing and new) are sharded automatically among the nodes
      • The developer can choose the consistency level for each request
  • 4. HBase 0.20.6
    Overview
    HBase is a column-oriented database that aims to provide low-latency requests on top of Hadoop HDFS.
      • An HBase cluster uses several kinds of servers:
          • HDFS needs at least one namenode and several datanodes
          • HBase needs a ZooKeeper cluster, a master and several regionservers
      • The requests must be made to the master(s)
      • On the HDFS level, existing data are not sharded automatically but new data are
      • On the HBase level, the data are divided into regions that are sharded automatically across regionservers
  • 5. mongoDB 1.6.5
    Overview
    mongoDB is a document-oriented database that stores JSON dictionaries. It provides auto-sharding and a MapReduce implementation.
      • A mongoDB cluster is made of several kinds of servers:
          • The shard servers that store data
          • The configuration servers that store the configuration
          • The router servers that receive and route the requests
      • Existing and new data are sharded automatically
      • MapReduce can only use one thread per server
  • 6. Riak 0.14
    Overview
    Riak is a fully distributed key/bucket store with an implementation of MapReduce.
      • Buckets can store the data directly or be a link to another bucket
      • All the nodes in the cluster play the same role
      • The data (existing and new) are sharded automatically among the nodes
      • The developer can choose the consistency level for each request
  • 7. The data
    Methodology sections: The data used, The client, The methodology
    Wikipedia export
      • 20,000 pages downloaded from Wikipedia
      • Every document is in XML format
      • All documents sum up to 620 MB
      • Each document is associated with a single integer ID
    Insertions
      • Each document is inserted only once during the whole benchmark
  • 8. The client
    Overview
      • Fully random requests
      • Acts as a perfect load balancer
      • The proportion of updates can be specified
      • Specific parts: read/write/update and MapReduce
    Updates
      • The updates simply concatenate the string "1" at the end of the article.
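The client behaviour listed on this slide can be sketched roughly as follows. This is a minimal Python illustration, not the benchmark's actual client: the in-memory `store` dictionary stands in for a database, and the names `run_client`, `nodes` and `update_ratio` are hypothetical. Picking a node uniformly at random for each request is one simple way to approximate a perfect load balancer.

```python
import random


def run_client(store, nodes, total_requests, update_ratio, seed=0):
    """Issue fully random read/update requests against `store`.

    store: dict mapping integer document ID -> article text
           (a stand-in for a real database).
    nodes: list of node addresses; one is picked at random per request.
    update_ratio: fraction of requests that are updates (0.0 to 1.0).
    Returns a dict counting how many requests each node received.
    """
    rng = random.Random(seed)
    doc_ids = list(store)
    per_node = {node: 0 for node in nodes}
    for _ in range(total_requests):
        node = rng.choice(nodes)      # uniform choice spreads load evenly on average
        doc_id = rng.choice(doc_ids)  # fully random document
        per_node[node] += 1
        if rng.random() < update_ratio:
            store[doc_id] = store[doc_id] + "1"  # update: append the string "1"
        else:
            _ = store[doc_id]                    # read: fetch the article
    return per_node
```

With `update_ratio=0.0` the store is left untouched, and with `update_ratio=1.0` every request appends one character, which makes the read/update mix easy to check.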
  • 9. MapReduce
    Overview
    MapReduce is used to build a reverse index for a given keyword. The reverse index is a list of pairs made of:
      • ID: the ID of the article, if Count ≠ 0
      • Count: the number of occurrences of the keyword in this article
    Justification
    This kind of computation implies that all the documents are crawled and takes advantage of the specifications of MapReduce.
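The reverse index computation can be illustrated with a small plain-Python sketch. It only mimics the map and reduce steps in a single process; the benchmark itself runs on each database's own MapReduce engine. The function names are hypothetical, and counting occurrences with plain substring matching is an assumption, since the slides do not say how a keyword match is defined.

```python
def map_article(doc_id, text, keyword):
    """Map step: emit (ID, Count) for one article, only if Count != 0."""
    count = text.count(keyword)  # assumption: plain substring matching
    if count != 0:
        yield (doc_id, count)


def reduce_index(pairs):
    """Reduce step: gather the emitted pairs into one list sorted by ID."""
    return sorted(pairs)


def build_reverse_index(documents, keyword):
    """Run the map step over every document, then the reduce step."""
    emitted = []
    for doc_id, text in documents.items():
        emitted.extend(map_article(doc_id, text, keyword))
    return reduce_index(emitted)
```

Because the map step has to visit every article, the cost of the job grows with the whole data set rather than with the number of matches, which is what makes it a useful stress test.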
  • 10. The methodology
      1. Start up a clean cluster of size 3 and insert all the documents
      2. Choose a total number of requests and a read percentage, then start the benchmark
      3. Wait one minute and start the benchmark again
      4. Wait five minutes and start the benchmark again
      5. Start the MapReduce benchmark
      6. Add a new node to the cluster and wait for it to be ready, then immediately restart the benchmark with the new node's IP in the list
      7. Jump to 3 until there are no more computers to add to the cluster
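The procedure above can be written down as a schedule of steps. The sketch below is only one reading of the slide: the action names are hypothetical labels, and `max_nodes` (the number of computers available) is a parameter the slide leaves open.

```python
def benchmark_schedule(max_nodes, initial_nodes=3):
    """Yield the benchmark steps in order as (action, cluster_size) tuples."""
    size = initial_nodes
    yield ("insert_all_documents", size)               # step 1
    yield ("run_read_update_benchmark", size)          # step 2
    while True:
        yield ("wait_1_minute", size)                  # step 3
        yield ("run_read_update_benchmark", size)
        yield ("wait_5_minutes", size)                 # step 4
        yield ("run_read_update_benchmark", size)
        yield ("run_mapreduce_benchmark", size)        # step 5
        if size >= max_nodes:                          # step 7: nothing left to add
            break
        size += 1                                      # step 6: grow the cluster
        yield ("add_node_and_wait_until_ready", size)
        yield ("run_read_update_benchmark", size)      # restart immediately
```

For example, `list(benchmark_schedule(8))` enumerates every run from 3 to 8 nodes, including the run that starts right after each node joins.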
  • 11. Read/update results
  • 12. Read/update results without HBase
  • 13. MapReduce performance
  • 14. The HBase case
    Verifications made:
      • Checked the logs: nothing seemed problematic
      • HDFS level: running the balancer with a very low threshold distributed the blocks evenly but without any impact on performance
      • HBase level: the regions were always nearly evenly distributed across the regionservers
      • The number of rows did not change and the content of each row was correct
  • 15. Summary of raw performance
    DB          Read/update performance   MapReduce performance
    Cassandra   Good                      Very good
    HBase       Bad / N.A.                Average / N.A.
    mongoDB     Good                      Poor but scalable
    Riak        Poor / unstable           Average but scalable
  • 16. Summary of scalability
    Going from 3 to 8 servers scales capacity to about 266% of its initial value (8/3 ≈ 2.67). Observed increases in performance:
    DB                   Read/update   MapReduce
    Cassandra            153%          112%
    HBase                11%           43%
    mongoDB              145%          211%
    Riak                 74%           189%
    Riak (7 nodes max)   155%          168%
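One way to read these figures is to compare each observed value with the linear ideal for the same change in cluster size. The sketch below assumes the observed percentages are expressed on the same basis as the slide's 266% capacity figure, which the slide does not state explicitly; the function names are hypothetical. The "Riak 7 nodes max" row is left out because its reference cluster sizes are not spelled out.

```python
def ideal_percent(start_nodes, end_nodes):
    """Capacity after scaling, as a percentage of the starting capacity."""
    return 100.0 * end_nodes / start_nodes


def fraction_of_ideal(observed_percent, start_nodes, end_nodes):
    """Share of the ideal (linear) figure that was actually observed."""
    return observed_percent / ideal_percent(start_nodes, end_nodes)


# Figures from the slide: (read/update %, MapReduce %), all for 3 -> 8 servers.
observed = {
    "Cassandra": (153, 112),
    "HBase": (11, 43),
    "mongoDB": (145, 211),
    "Riak": (74, 189),
}

for db, (read_update, mapreduce) in observed.items():
    print(db,
          round(fraction_of_ideal(read_update, 3, 8), 2),
          round(fraction_of_ideal(mapreduce, 3, 8), 2))
```

Under that assumption, a value close to 1.0 would mean near-linear scaling, and every database in the table falls short of it for both workloads.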
  • 17. Conclusion and future work
    Conclusion
      • The elastic gain seems more apparent than with YCSB, but it is not linear either
      • It is worth testing MapReduce performance, as the results vary a lot between databases for both raw and scalability performance
    Future work
    This is still a work in progress:
      • Applying this benchmark to other databases (Terrastore, Voldemort, Scalaris ...)
      • Trying with a growing/bigger data set
  • 18. Questions and remarks
    Any questions or remarks?