A new methodology for large scale nosql benchmarking

A presentation of the new methodology I plan to use for large scale benchmarking of various NoSQL databases. Preceded by a short comparison with the current Wikipedia infrastructure.

Transcript

  • 1. A new methodology for large scale benchmarking: a step by step methodology. Dory Thibault, UCL. Contact: thibault.dory@student.uclouvain.be. Sponsor: Euranova. Website: nosqlbenchmarking.com. March 1, 2011.
  • 2. Outline: Wikipedia infrastructure; The benchmark VS the real Wikipedia load; The updated methodology. Current section: Existing Wikipedia infrastructure.
  • 3. Existing Wikipedia infrastructure. The structured data (revision history, article relations, user accounts...) are stored in MySQL. Each wiki has its own database, but not necessarily its own cluster. Each cluster is made of several MySQL servers using replication, with only one master per cluster: all writes are handled by the master, while the multiple slaves serve the reads. There are currently 37 servers running MySQL according to ganglia.wikimedia.org, each with between 8 and 12 CPUs running at 2.2 GHz and between 32 and 64 GB of RAM.
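The read/write split described on this slide can be sketched as a simple query router. The class, host names, and write-detection heuristic below are my own illustration, not Wikipedia's actual code:

```python
import random

class ReplicatedCluster:
    """Route writes to the single master and reads to one of the slaves,
    mirroring the MySQL replication setup described above."""

    def __init__(self, master, slaves):
        self.master = master        # hostname of the master server
        self.slaves = list(slaves)  # hostnames of the read-only replicas

    def host_for(self, query):
        # Writes must go to the master; reads are spread over the slaves.
        is_write = query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
        return self.master if is_write else random.choice(self.slaves)

cluster = ReplicatedCluster("db-master", ["db-slave1", "db-slave2", "db-slave3"])
print(cluster.host_for("UPDATE page SET title = 'x'"))  # db-master
print(cluster.host_for("SELECT * FROM page"))           # one of the slaves
```

This routing is why adding slaves scales reads but not writes: the single master remains the bottleneck for all mutations.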
  • 4. Existing Wikipedia infrastructure. The content of the last version of an article is stored as a blob on external storage servers, a replicated cluster of 3 MySQL hosts. These data are stored apart from the main core databases because this content needs a lot of storage space and is largely unused thanks to the cache servers.
  • 5. The benchmark VS the real Wikipedia load: a very simplified model.
  • 6. The benchmark does not try to reproduce the real load on the MySQL clusters: there is no computational work on the structured data, there is no other cache than the one provided by the database itself, and the MySQL clusters run on a few powerful servers while the NoSQL clusters will run on many small servers. So why Wikipedia? The main point in using Wikipedia's data is to use real data: each entry has a different size, and the MapReduce computation on the content makes sense.
  • 7. The new data set: all the articles from Wikipedia in English. The new data set is made of all the 10+ million articles from the English version of Wikipedia, summing up to 28 GB uncompressed. Each article is considered as an XML blob with all its metadata and is identified by a unique integer ID.
  • 8. Is that enough data? Not really for a very big cluster. The solution is simply to insert the same data set several times, while still using a unique ID for each insert.
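Keeping IDs unique while re-inserting the same data set can be done by offsetting each article's integer ID by the insertion pass. The function name and offset scheme below are my own, not from the slides; the 10-million default matches the article count mentioned above:

```python
def unique_id(article_id, pass_index, articles_per_pass=10_000_000):
    """Map (article, pass) to a globally unique integer ID so the same
    data set can be inserted several times without key collisions."""
    return pass_index * articles_per_pass + article_id

# Article 42 keeps distinct IDs across three insertion passes:
ids = [unique_id(42, p) for p in range(3)]
print(ids)  # [42, 10000042, 20000042]
```

The mapping is also reversible (`divmod(new_id, articles_per_pass)`), so a benchmark client can still recover which original article a stored blob corresponds to.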
  • 9. The old benchmark architecture. Scaling problem: this architecture does not scale, mainly for bandwidth reasons. The computational power needed is small, but the whole article is transmitted for each request.
  • 10. The distributed benchmark architecture.
  • 11. The new infrastructure: Amazon EC2. I plan to use mainly small standard instances (1 CPU, 1.7 GB of RAM) on the Amazon EC2 infrastructure. The biggest cluster should be made of hundreds of small EC2 instances, plus a few bigger servers for systems that use a master or a load balancer, like HBase.
  • 12. The measured properties. 1. Raw performance: how fast can all the requests be made? 2. Scalability: what is the impact on performance of changing the cluster size (number of nodes and data set)? 3. Elasticity: how long does it take to reach a stable state with increased performance when nodes are added to the cluster?
  • 13. Measuring the elasticity: the most complex of the three measures. The time needed for the system to stabilize should be different for each system and for each cluster size. I have chosen to characterize the elasticity by computing the standard deviation of smaller benchmark runs: 1. Use a stable cluster to determine the usual standard deviation of the DB. 2. Add the new nodes to the cluster but do not increase the data set. 3. Repeat: start a benchmark run, compute the standard deviation, and wait X seconds. 4. Until the standard deviation for the last Y runs does not diverge more than Z percent from the usual standard deviation.
  • 14. The step by step methodology. 1. Start up a clean cluster of size 50 and insert all the articles. 2. Measure the standard deviation for this cluster once it has stabilized. 3. Choose a total number of requests and a read-only percentage. 4. Start the benchmark with the chosen number of requests and read-only percentage. 5. Start the MapReduce benchmark. 6. Double the number of nodes in the cluster. 7. Start the elasticity test. 8. Double the size of the data set inserted. 9. Jump to step 4 with a doubled number of requests, until there are no more servers to add to the cluster.
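The doubling loop above can be summarized as a driver sketch. Every callback here (`insert_articles`, `measure_stdev`, `run_requests`, `run_mapreduce`, `elasticity_test`) is a hypothetical placeholder for the corresponding step, and the initial request count and read-only percentage are illustrative, not values from the slides:

```python
def run_methodology(max_nodes, insert_articles, measure_stdev,
                    run_requests, run_mapreduce, elasticity_test):
    """Drive the benchmark: double the node count, data set, and request
    count each round until no more servers can be added."""
    nodes, dataset_copies, requests = 50, 1, 1_000_000
    read_only_pct = 95  # illustrative choice for step 3
    insert_articles(copies=dataset_copies)     # steps 1-2: load data,
    measure_stdev()                            # baseline on stable cluster
    while True:
        run_requests(requests, read_only_pct)  # step 4
        run_mapreduce()                        # step 5
        if nodes * 2 > max_nodes:
            break                              # no more servers to add
        nodes *= 2                             # step 6
        elasticity_test()                      # step 7
        dataset_copies *= 2                    # step 8
        insert_articles(copies=dataset_copies)
        requests *= 2                          # step 9: back to step 4
    return nodes
```

Because nodes, data, and requests all double together, the load per node stays roughly constant across rounds, which is what makes the scalability comparison between cluster sizes meaningful.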
  • 15. Bibliography: www.nedworks.org/mark/presentations/san/Wikimedia%20architecture.pdf ; http://meta.wikimedia.org/wiki/Wikimedia_servers ; http://ganglia.wikimedia.org/