Mongouk talk june_18
Presentation Transcript

    • Table of Contents

1. Structure
   1. Markus
   2. Flavio
2. Who are we?
   1. Markus Gattol
   2. Flavio Percoco Premoli
3. Introduction Part 1
   1. What I am going to tell you
4. Integration with other Technologies
5. Frequently Asked Questions
   1. Basics
      1. Are there any Reasons not to use MongoDB?
      2. What are the supported Programming Languages?
      3. What is the Status of Python 3 Support?
      4. What is the difference in the main Building-blocks to RDBMSs?
   2. Administration
      1. Is there a Web GUI? What about a REST Interface/API?
      2. Can I rename a Database?
      3. How do I physically migrate a Database?
         1. Secure Copy ... as in scp
         2. Minimum Downtime
      4. How do I update to a new MongoDB version?
      5. What is the default listening Port and IP?
      6. Is there a Way to do automatic Backups?
      7. What is getSisterDB() good for?
      8. How can I make MongoDB automatically start/restart on Server boot/reboot?
   3. Resource Usage
      1. Why is my Database growing so fast?
      2. What Caching Algorithm does MongoDB use?
      3. Why does MongoDB use so much RAM?
      4. What is the so-called Working Set Size?
      5. How much RAM does MongoDB need?
         1. Speed Impact of not having enough RAM
      6. Can I limit MongoDB's RAM Usage?
      7. What can I do about Out Of Memory Errors?
         1. OpenVZ
      8. Does MongoDB use more than one CPU Core?
      9. How can I tell how many clients are connected?
      10. How many parallel Client Connections to MongoDB can there be?
      11. Does MongoDB do Connection Pooling?
      12. Is there a Size limit of how much Data can be stored inside MongoDB?
      13. Do embedded Documents count toward the 4 MiB BSON Document Size Limit?
      14. Does Document Size impact read/write Performance?
      15. Is there a Way to tell the Size of a specific Document?
      16. How can I tell the Size of a Collection and its Indexes?
   4. Collections / Namespaces
      1. What is a Capped Collection? Why use it?
      2. Can I rename a Collection?
      3. What is a Virtual Collection? Why use it?
      4. Can I use a larger Number of Collections/Namespaces?
      5. How about cloning a Collection?
      6. Can I merge two or more Collections into one?
      7. How can I get a list of Collections in my Database?
      8. How do I delete a Collection?
      9. What is a Namespace with regards to MongoDB?
      10. How can I get a list of Namespaces in a Database?
   5. Statistics / Monitoring
      1. The Server Status, what does it tell?
   6. Schema / Configuration
   7. Indexes / Search / Metadata
   8. Map / Reduce
   9. GridFS / Data Size
      1. What is GridFS?
         1. What can we do with GridFS
      2. Why use GridFS over ordinary Filesystem Storage?
   10. Scalability / Fault Tolerance / Load Balancing
   11. Miscellaneous
6. Use Case
7. Summary Part 1
8. Introduction Part 2
9. Existing Technologies
10. SQL to MongoDB Query Translation
11. Keeping things lazy ...
12. Keeping Relations or Embedding?
   1. Using References
   2. Without References
   3. Light and fast (for registered users)
   4. Heavy and slow (for any user)
   5. Lazy relations or MongoDB-like ones
13. Taking Advantage from schema-less Databases for Web Development
14. Summary Part 2

Structure

Markus
• 2min: tell the audience what I am going to tell them (a summary) and why I think it's worth mentioning
• 3min: I'll start with a big-picture view (how MongoDB just integrates nicely with existing setups, e.g. folks can continue using dm-crypt/LUKS) and basic principles
• 5min: pick a few FAQ items and elaborate on them, e.g. "Why is MongoDB using so much RAM?"
    • 5min: I will then go on taking a use case as an example (a web application built with Django and MongoDB) from the financial domain, where we need transactions/locking/ACID, and talk about the differences to e.g. MySQL/PostgreSQL
• 5min: also, with this use case, other things like storing various-precision numbers
• 5min: summarize what I've told them

You start after me and drill down on details (the stuff you mentioned in your email ~9 days ago) or whatever you/we see fit.

Flavio
• 2min: I'll tell the audience the topics I'll talk about and how they help us with MongoDB and Django integration
• 5min: Mappers & Stack; I'll list some of the current ODMs used to integrate MongoDB and Django, and how django-mongodb-engine integrates with Django and MongoDB
• 5min: I'll talk about queries: what we have in SQL that we don't have in MongoDB, and how we can obtain the same results using it
  ◦ perfect, nothing to add/change here
• 3min: I'll talk about embedding and referencing, when it is worth doing each, and why
• 5min: I'll talk about how it is possible to take advantage of schema-less databases in web programming (Django oriented)
  ◦ ok, sounds good; not sure I understand exactly, approach me today on #sunoano and give me an example
• 5min: summarize, and maybe some benchmarks!
    • Who are we?

Still, with all the technology we have these days, at the end of the day it is all about the people ... /me definitely not a
    • Markus Gattol

• grown up in Carinthia (southernmost Austrian state, bordering Italy), lives in the UK now
  ◦ http://sunoano.name/albums/places/austria/index.html
• technical background, MSc (Computer Science, Electrical Engineering)
• with Linux (Debian) since 1995, Contributor
• RDBMSs, the usual ...
• Open Source Developer/Contributor in general
• website http://sunoano.name
  ◦ http://sunoano.name/ws/mongodb.html
• works for Heart Internet Ltd., NSN before that
  ◦ http://www.heartinternet.co.uk
    • Flavio Percoco Premoli

• GNOME a11y Contributor (MouseTrap [http://live.gnome.org/MouseTrap])
• Open Source Developer/Contributor (Web and Desktop)
• R&D Developer at The Net Planet Europe
  ◦ NoSQL Technologies
  ◦ Cloud Computing
  ◦ Knowledge Management Systems
• Linux Lover/User and Mac user too
• website: http://www.flaper87.org
• Twitter: FlaPer87
• Github: FlaPer87
• Bitbucket: FlaPer87
• Everywhere else: FlaPer87
    • Introduction Part 1

The why ...
1. why are you here today?
2. why does some business want to know about new technology?
3. why are we looking to move away from RDBMSs to NoSQL DBMSs?
4. (translated from German) Hardware and software are good when they can be understood while being used - and not when they might let you fly to Mars.

Part 1 is mainly about MongoDB itself and not about Django/Python ... Part 2? ... Django!

What I am going to tell you

Best listener experience possible ...
• Introduction Part 1 ... tell the audience what you're going to tell them
• Tell them:
  ◦ Integration with other Technologies
  ◦ Frequently Asked Questions
  ◦ Use Case
• Summary Part 1 ... tell the audience what you told them
    • Integration with other Technologies

• How can I get MongoDB?
• Ok, have it! Now what?
  1. full-disk encryption / filesystem-level encryption
  2. backup technologies: Rsync/Unison, Bacula, Amanda
  3. LVM
  4. VPN, SSH
  5. Virtualization, OpenVZ
    • Frequently Asked Questions Well, just because ...
    • Basics Before we start running we need to be able to walk ...
    • Are there any Reasons not to use MongoDB?

1. We need transactions (ACID (Atomicity, Consistency, Isolation, Durability)).
2. Our data is very relational.
3. Related to 2, we want to be able to do joins on the server (rather than working with embedded objects/arrays).
4. We need triggers on our tables. There might be triggers available soon, however.
5. We rely on triggers (or similar functionality) for cascading updates or deletes.
6. We need the database to enforce referential integrity (MongoDB has no notion of this at all).
7. We need 100% per-node durability.
8. We need a write-ahead log. MongoDB does not have one, simply because it does not need one.
9. Dynamic aggregation with ad-hoc queries: Crystal Reports, reporting, business logic, ... RDBMSs' heartland ...
    • What are the supported Programming Languages?

Right now (June 2010) we can use MongoDB from at least C, C++, C#, .NET, ColdFusion, Erlang, Factor, Java, JavaScript, PHP, Python, Ruby and Perl. Of course, there might be more languages available in the future.
    • What is the Status of Python 3 Support? The current thought is to use Django as more or less a signal for when adding full support for Python 3 makes sense. MongoDB can probably support it a bit earlier than Django does, but that is certainly not something the MongoDB community wants to rush and then have to support two totally different code bases.
    • What is the difference in the main Building-blocks to RDBMSs?

We have RDBMSs like, for example, MySQL, Oracle and PostgreSQL, and then there are NoSQL DBMSs like, for example, MongoDB. Below is a breakout of how the main building blocks of each party correspond:

MySQL, PostgreSQL, Oracle
--------------------------------------------
Server:Port - Database - Table - Row

MongoDB
--------------------------------------------
Server:Port - Database - Collection - Document
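To make the correspondence concrete, here is a small illustrative sketch (plain Python, all names invented for this example): a row is a flat, fixed-width tuple of columns, while a document is a free-form, possibly nested structure.

```python
# A row in an RDBMS table is a flat, fixed set of columns ...
row = ("Markus", "Gattol", "UK")          # (first_name, last_name, country)

# ... while a MongoDB document is a free-form, possibly nested structure.
document = {
    "name": {"first": "Markus", "last": "Gattol"},
    "country": "UK",
    "languages": ["German", "English"],   # arrays instead of join tables
}

# The same information, but the document can nest and grow per record.
assert document["name"]["first"] == row[0]
```

The point is not the syntax but the model: where an RDBMS pushes one-to-many data into separate tables joined at query time, a document can simply embed it.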
    • Administration The usual handicraft work ... get and keep it running ... if in doubt, automate!
    • Is there a Web GUI? What about a REST Interface/API?

• Assuming a mongod process is running on localhost, we can access some statistics at http://localhost:28017/ and http://localhost:28017/_status
• In order to have a REST interface to MongoDB, same as CouchDB has, we have to start mongod with the --rest switch.
  ◦ Note however that this is just a read-only REST interface.
• For a read and/or write REST interface:
  ◦ http://www.mongodb.org/display/DOCS/Http+Interface
  ◦ http://github.com/kchodorow/sleepy.mongoose
  ◦ http://github.com/tdegrunt/mongodb-rest
• If we wanted real-time updates from the CLI, we could also use mongostat.
    • Can I rename a Database? Yes, but it is not as easy as renaming a collection. As of now, the recommended way to rename a database is to clone it and thereby rename it. This will require enough additional free disk space to fit the current/old database at least twice.
    • How do I physically migrate a Database?

There is even a clone command for that. Note however that neither copyDatabase() nor cloneDatabase() actually performs a point-in-time snapshot of the entire database -- what they basically do is query the source database and then replicate to the target database, i.e. if we use copyDatabase() or cloneDatabase() on a source database which is online and has operations performed on it, then the target database cannot be a point-in-time snapshot of the exact time when either command was issued. Rather, at some point in time, they will/might have the same data/state as their source database.

Secure Copy ... as in scp

A bit of downtime, but the chance to resume a canceled transfer ...
• shut down mongod on the old machine
• copy/sync the database directory to the new machine
• start mongod on the new machine with dbpath set appropriately
  ◦ http://sunoano.name/ws/debian_notes_cheat_sheets.html#resume_an_scp_transfer

Minimum Downtime

Below is what we could do in order to have as little downtime as possible:
• stop and re-start the existing mongod as master (if it is not already running as master, that is)
• install mongod on the new machine and configure it as slave using --slave and --source
• wait while the slave copies the database, re-indexes and then catches up with its master (this happens automatically when we point a slave to its master). Once the slave has caught up, we
• disable writes to the master (clients can still read/query)
• once all outstanding writes have been committed on the master and the slave has caught up, we shut down the master and restart the slave as the new master. The old master can now be removed entirely.
• now we point all traffic at the new master
    • finally, we enable writes on the new master again, ... Et voilà!

Of course, we might also use OpenVZ and its live-migration feature ...
    • How do I update to a new MongoDB version? If it is a drop-in replacement we just need to shutdown the older version and start the new one with the appropriate dbpath. Otherwise, i.e. if it is not a drop-in replacement, we would use mongoexport followed by mongoimport.
    • What is the default listening Port and IP?

We can use netstat to find out:

wks:/home/sa# netstat -tulpena | grep mongo
tcp   0   0 0.0.0.0:27017   0.0.0.0:*   LISTEN   124   1474236   8822/mongod
tcp   0   0 0.0.0.0:28017   0.0.0.0:*   LISTEN   124   1474237   8822/mongod
wks:/home/sa#

The default listening port for mongod is 27017. 28017 is where we can point our web browser in order to get some statistics. By default mongod binds to 0.0.0.0 i.e. to all local IP addresses. And yes, this includes the loopback device/address/network 127.0.0.0/8, the private class A network 10.0.0.0/8, the private class B network 172.16.0.0/12 and of course also the private class C network 192.168.0.0/16, amongst others. Both listening port and IP address can be changed, either with the CLI switches --port and --bind_ip or via the configuration file, which we can figure out by looking at the runtime configuration.
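The listening port and bind address can also be pinned down in mongod's configuration file. As an illustrative sketch (key = value option names as used by the configuration file at the time; the file path is an assumption):

```
# /etc/mongodb.conf -- illustrative fragment, not a complete configuration
port = 27017          # the default; change this to move mongod elsewhere
bind_ip = 127.0.0.1   # listen on loopback only instead of all local IPs
```

Binding to 127.0.0.1 is a common hardening step when application and database share a machine.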
    • Is there a Way to do automatic Backups? Yes, http://github.com/micahwedemeyer/automongobackup
    • What is getSisterDB() good for?

We can use it to get ourselves references to databases, which not only saves a lot of typing but is, once we get used to using it, a lot more intuitive:

 1 sa@wks:~/mm/new$ mongo
 2 MongoDB shell version: 1.5.2-pre-
 3 url: test
 4 connecting to: test
 5 type "help" for help
 6 > db.getCollectionNames();
 7 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
 8 > reference_to_test_db = db.getSisterDB('test');
 9 test
10 > reference_to_test_db.getCollectionNames();
11 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
12 > use admin
13 switched to db admin
14 > reference_to_test_db.getCollectionNames();
15 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
16 > bye
17 sa@wks:~/mm/new$

Note how we get a reference to our test database in line 8 and how it is used in line 10 and even line 14, after switching from our test database to the admin database. getCollectionNames() has just been chosen as an example; it could, of course, have been any other command.
    • How can I make MongoDB automatically start/restart on Server boot/reboot?

One way would be to use the @reboot directive with Cron. However, the .deb and .rpm packages already install init scripts (sysv or upstart style, as appropriate) on Debian, Ubuntu, Fedora, and CentOS, so MongoDB will restart there without us having to do anything special.
• For other constellations, http://gist.github.com/409301 is an init.d script for Unix-like systems based on http://bitbucket.org/bwmcadams/toybox/src/3e84be941408/mongodb.init.rhel.
• For Mac OS X, people have reported that launchctl configurations like http://github.com/AndreiRailean/MongoDB-OSX-Launchctl/blob/master/org.mongo.mongod.plist work.
• For Windows, we have the http://www.mongodb.org/display/DOCS/Windows+Service documentation.
    • Resource Usage

Lots of confusion amongst beginners ...
    • Why is my Database growing so fast?

The first file for a database is dbname.0, then dbname.1, etc. dbname.0 will be 64 MiB, dbname.1 128 MiB, ... up to 2 GiB. Once the files reach 2 GiB in size, each successive file is also 2 GiB. So, if we have, say, database files up to dbname.n, then dbname.n-1 might be 90% unused, but dbname.n has already been allocated as soon as we start using dbname.n-1. The reasoning here is simple: we do not want to wait for new database files when we need them, so we always allocate the next one in the background as soon as we start to use an empty one. Note that deleting data and/or dropping a collection or index will not release already allocated disk space, since it is allocated per database. Disk space will only be released if a database is repaired or the database is dropped altogether. Go to http://www.mongodb.org/display/DOCS/Developer+FAQ#DeveloperFAQ-Whyaremydatafilessolarge%3F for more information.
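The allocation scheme described above is easy to model; the following sketch (plain Python, nothing MongoDB-specific) computes the sizes of the preallocated datafiles:

```python
def datafile_sizes(n):
    """Sizes (in MiB) of dbname.0 .. dbname.(n-1): 64 MiB, then doubling
    with each file, capped at 2 GiB (2048 MiB) per file."""
    sizes = []
    size = 64
    for _ in range(n):
        sizes.append(size)
        size = min(size * 2, 2048)  # double, but never beyond 2 GiB
    return sizes

print(datafile_sizes(7))   # [64, 128, 256, 512, 1024, 2048, 2048]
```

This also shows why a small database can look big on disk: with only three files in use, 448 MiB is already allocated even if most of it holds no data yet.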
    • What Caching Algorithm does MongoDB use? Actually, that is done by the OS using the LRU (Least Recently Used) caching pattern.
    • Why does MongoDB use so much RAM?

Well, it does not actually; it is just that most folks do not really understand memory management -- there is more to it than just "is in RAM" or "is not in RAM". The current default storage engine for MongoDB is called MongoMemMapped_RecStore. It uses memory-mapped files for all disk I/O operations. Using this strategy, the operating system's virtual memory manager is in charge of caching. This has several implications:
• There is no redundancy between file system cache and database cache; actually, they are one and the same.
• MongoDB can use all free memory on the server for cache space automatically, without any configuration of a cache size.
• Virtual memory size and RSS (Resident Set Size) will appear to be very large for the mongod process. This is benign however -- virtual memory space will be just larger than the size of the datafiles open and mapped, i.e. resident size will vary depending on the amount of memory not used by other processes on the machine.
• Caching behavior (such as LRU'ing out of pages, and laziness of page writes) is controlled by the operating system. The quality of the VMM (Virtual Memory Manager) implementation will vary by OS.

As of now, an alternative storage engine (CachedBasicRecStore), which does not use memory-mapped files, is under development. This engine is more traditional in design, with its own page cache. With this store the database has more control over the exact timing of reads and writes, and over the cache LRU strategy. Generally, the memory-mapped store (MongoMemMapped_RecStore) works quite well. The alternative store will be useful in cases where an operating system's VMM is behaving suboptimally.
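Memory-mapped file I/O itself is easy to demonstrate; here is a small self-contained sketch using Python's mmap module (it only illustrates the general mechanism, it is not MongoDB code): the file's contents appear as ordinary memory, and the OS decides when dirty pages are written back.

```python
import mmap
import os
import tempfile

# Create a small data file, then map it into our address space.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello mongodb")
os.close(fd)

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)   # map the whole file
    assert mm[:5] == b"hello"       # reads go through the OS page cache
    mm[:5] = b"HELLO"               # so do writes; the OS flushes lazily ...
    mm.flush()                      # ... unless we ask for it explicitly
    mm.close()

with open(path, "rb") as f:
    result = f.read()
os.remove(path)
print(result)   # b'HELLO mongodb'
```

The mapped region counts toward the process's virtual memory size, which is exactly why a mongod with large datafiles "looks" huge in top without actually hogging that much physical RAM.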
    • What is the so-called Working Set Size? Working set size can roughly be thought of as how much data we will need MongoDB (or any other DBMS, relational or non-relational) to access in a period of time. For example, YouTube has ridiculous amounts of data, but only 1% may be accessed at any given time. If, however, we are in the rare case where all the data we store is accessed at the same rate at all times (LRU), then our working set size can be defined as our entire data set stored in MongoDB.
    • How much RAM does MongoDB need?

We now know MongoDB's caching pattern, and we also know what a working set size is. Therefore we can use the following rule of thumb for how much RAM a machine needs in order to work properly: the working set size plus MongoDB's indexes should reside in RAM at all times, i.e. the amount of available RAM should be at least the working set size plus the size of the indexes plus what the rest of the OS and other software running on the same machine needs.

Speed Impact of not having enough RAM

Generally, when databases are too big to fit into RAM entirely, and if we are doing random access, we are in trouble, as HDDs are slow at that (roughly 100 operations per second per drive). One solution is to have lots of HDDs (10, 100, ...). Another one is to use SSDs (Solid State Drives) or, even better, add more RAM. That being said, the key factor here is random access. If we do sequential access to data bigger than RAM, then that is fine. So, it is ok if the database is huge (more than RAM size), but if we do a lot of random access to data, it is best if the working set fits in RAM entirely. However, there are some nuances around having indexes bigger than RAM with MongoDB. For example, we can speed up inserts if the index keys have certain properties -- if inserts are an issue, then that would help.
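The rule of thumb above is just addition; a tiny sketch (plain Python, all numbers invented for illustration):

```python
def min_ram_gib(working_set_gib, index_gib, os_and_other_gib):
    """Rule of thumb: the working set plus all indexes should fit in RAM,
    on top of whatever the OS and other processes on the box need."""
    return working_set_gib + index_gib + os_and_other_gib

# e.g. a 12 GiB working set, 3 GiB of indexes, 1 GiB for everything else:
print(min_ram_gib(12, 3, 1))   # 16
```

The hard part in practice is not the sum but estimating the working set, which depends on the access pattern rather than on raw data size.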
    • Can I limit MongoDB's RAM Usage? No, it is not designed to do that, it is designed for speed and scalability. If we wanted to run MongoDB on the same physical machine alongside some web server and for example some application server like Django, then we could ensure memory limits on each one by simply using virtualization and putting each one in its own VE (Virtual Environment). In the end we would thus have a web application made of MongoDB, Django and for example Cherokee, all running on the same physical machine but being limited to whatever limits we set on each VE they run in.
    • What can I do about Out Of Memory Errors?

If we are getting something like

Fri May 21 08:29:52 JS Error: out of memory

(or akin) in our logs, then we hit a memory limit. As we already know, MongoDB takes all the RAM it can get, i.e. RAM, or more precisely RSS (Resident Set Size), itself part of virtual memory, will appear to be very large for the mongod process. The important point here is how this is handled by the OS. If the OS blocks any attempt to get more virtual memory or, even worse, kills the process (e.g. mongod) which tries to get more virtual memory, then we have got a problem. What can be done is to elevate/alter a few settings:

 1 sa@wks:~$ ulimit -a | egrep 'virtual|open'
 2 open files                      (-n) 1024
 3 virtual memory          (kbytes, -v) unlimited
 4 sa@wks:~$ lsb_release -irc
 5 Distributor ID: Debian
 6 Release:        unstable
 7 Codename:       sid
 8 sa@wks:~$ uname -a
 9 Linux wks 2.6.32-trunk-amd64 #1 SMP Sun Jan 10 22:40:40 UTC 2010 x86_64 GNU/Linux
10 sa@wks:~$

As we can see from lines 5 to 9, I am on Debian sid (still in development) running the 2.6.32 Linux kernel. The settings we are interested in are on lines 2 and 3. Virtual memory is unlimited by default, so that is fine already -- this is actually what causes the most problems, so we need to make sure virtual memory is either reasonably high or, even better, set to unlimited as shown above. With regards to allowed open file descriptors -- by default we are limited to 1024 open files which, in some cases, might pose a problem -- simply raising this limit might already be enough to make memory errors go away. Note that we need to run these commands (e.g. ulimit -v unlimited) in the same user context as mongod, i.e. we basically want to script them as part of our mongod startup process.

OpenVZ

If we are running MongoDB with OpenVZ then there are some more settings we might want to tune in order to keep the OOM (Out of Memory) killer from kicking in, or to avoid simply hitting the virtual memory ceiling if it is not set to unlimited. Special attention should be paid to the OpenVZ memory settings, i.e. they should be set to reflect MongoDB's memory usage.
    • Does MongoDB use more than one CPU Core? For write operations MongoDB makes use of one CPU core. For read operations however, which tend to be the majority of operations, MongoDB uses all CPU cores available to it. In short: one will notice a speed increase going from a single-core CPU to dual-core or even higher e.g. quad-core or maybe even octo-core since the speed increase is roughly proportional to the available CPU cores.
    • How can I tell how many clients are connected? We can look at the connections field (current) in the server status:

sa@wks:~$ mongo --quiet
type "help" for help
> db.serverStatus();
{
  [skipping a lot of lines ...]
  "connections" : {
    "current" : 2,
    "available" : 19998
  },
  [skipping a lot of lines ...]
}
> bye
sa@wks:~$
    • How many parallel Client Connections to MongoDB can there be? Have a look at the connections field (available) with the server status.
    • Does MongoDB do Connection Pooling? Yes, we can do connection pooling for performance reasons and overall resource usage optimization -- without it, things would be a lot slower and more resource intensive. As of now (June 2010) most of the client drivers do connection pooling; how exactly it is done varies from driver to driver, e.g. PyMongo has its own approach.
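      To illustrate the idea, here is a minimal, driver-agnostic sketch of what a connection pool does. This is not PyMongo's actual implementation -- just the concept, with a plain object standing in for a real socket connection:

```python
import queue

class SimplePool:
    """Toy connection pool: create a fixed set of connections up front,
    then hand them out and take them back instead of reconnecting."""
    def __init__(self, factory, size=3):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self):
        return self._pool.get()      # blocks if all connections are in use

    def release(self, conn):
        self._pool.put(conn)

# Stand-in for an expensive socket setup; counts how often it runs
made = []
pool = SimplePool(lambda: made.append(1) or object(), size=3)

for _ in range(10):                  # ten "requests" ...
    conn = pool.acquire()
    pool.release(conn)

print(len(made))                     # 3 -- connections are reused, not recreated
```

      The point is exactly the one made above: the expensive part (establishing the connection) happens a fixed number of times, no matter how many operations the application performs.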
    • Is there a Size Limit on how much Data can be stored inside MongoDB? 4 MiB is the limit on individual documents, but GridFS spreads a file across many documents, so in practice there is no limit. While the above is true for x86-64, it is not entirely true for x86 (32 bit) -- because of how memory mapped files work, there is a limit of 2 GiB per database.
    • Do embedded Documents count toward the 4 MiB BSON Document Size Limit? Yes, the entire BSON (Binary JSON) document (including all embedded documents, etc.) cannot be more than 4 MiB in size.
    • Does Document Size impact read/write Performance? Yes, but this is mostly due to network limitations e.g. one will max out a GigE link with inserts before document size starts to slow down MongoDB itself.
    • Is there a Way to tell the Size of a specific Document? Yes, one can use Object.bsonsize(db.whatever.findOne()) in the shell like this:

sa@wks:~$ mongo
MongoDB shell version: 1.5.1-pre-
url: test
connecting to: test
type "help" for help
> db.test.save({ name : "katze" });
> Object.bsonsize(db.test.findOne({ name : "katze" }))
38
> bye
sa@wks:~$
    • How can I tell the Size of a Collection and its Indexes?

sa@wks:~$ mongo --quiet
type "help" for help
> db.getCollectionNames();
[ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
> db.test.dataSize();
160
> db.test.storageSize();
2304
> db.test.totalIndexSize();
8192
> db.test.totalSize();
10496

      We are using the test collection here. dataSize() is self-explanatory. storageSize() includes our data plus all the disk space already allocated to this collection but still free. totalIndexSize() is the size in bytes of all the indexes in this collection, and totalSize() is all the storage allocated for all data and indexes in this collection. If we need/want a more detailed view we could also have a look at

> db.test.validate();
{
  "ns" : "test.test",
  "result" : "
  validate
    firstExtent:2:2b00 ns:test.test
    lastExtent:2:2b00 ns:test.test
    # extents:1
    datasize?:160 nrecords?:4 lastExtentSize:2304
    padding:1
    first extent:
      loc:2:2b00 xnext:null xprev:null
      nsdiag:test.test
      size:2304 firstRecord:2:2be8 lastRecord:2:2c58
    4 objects found, nobj:4
    224 bytes data w/headers
    160 bytes data wout/headers
    deletedList: 0000001000000000000
    deleted: n: 1 size: 1904
    nIndexes:1
      test.test.$_id_ keys:4
  ",
  "ok" : 1,
  "valid" : true,
  "lastExtentSize" : 2304
}
> bye
sa@wks:~$

      Note that while MongoDB generally does a lot of pre-allocation, we can reduce this by starting mongod with --noprealloc and --smallfiles.
    • Collections / Namespaces Need to be known, plain and simple ...
    • What is a Capped Collection? Why use it? A capped collection is a fixed-size collection that preserves insertion order and, once full, overwrites its oldest documents -- which makes it a good fit for things like logs and caches. • Size: http://www.mongodb.org/display/DOCS/Capped+Collections • Time (TTL Collections): http://jira.mongodb.org/browse/SERVER-211
    • Can I rename a Collection? Yes. Using help(); from MongoDB's interactive shell we get, amongst others, db.test.renameCollection( newName , <dropTarget> ) which renames the collection. So yes, we could do db.foo.renameCollection('bar'); and have the collection foo renamed to bar. Renaming a collection is an atomic operation by the way.
    • What is a Virtual Collection? Why use it? It refers to the ability to reference embedded documents as if they were a first-class collection of top level documents, querying on them and returning them as stand-alone entities, etc.
    • Can I use a larger Number of Collections/Namespaces? There is a limit to how many collections/namespaces we can have within a single MongoDB database: ~24000 namespaces per database. This is essentially the number of collections plus the number of indexes.
    • How about cloning a Collection? Yes, that is possible. Have a look at mongoexport and mongoimport.
    • Can I merge two or more Collections into one? Yes, we read from all the collections we want to merge and use insert() to write the documents into our single target collection. This can be done on the server (using MongoDB's interactive shell) or from a client.
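      The read-then-insert loop described above can be sketched like this. The two source "collections" are simulated here as plain lists of documents; against a live server, the inner loop would iterate over pymongo's find() and the append would be an insert() on the target collection:

```python
# Simulated collections -- stand-ins for real pymongo collections
people = [{"_id": 1, "name": "markus"}, {"_id": 2, "name": "flavio"}]
staff  = [{"_id": 3, "name": "katze"}]

merged = []                      # stands in for the target collection
for source in (people, staff):
    for doc in source:           # source.find() against a real database
        merged.append(doc)       # target.insert(doc) against a real database

print(len(merged))  # 3
```

      The same pattern works unchanged from MongoDB's interactive shell with a couple of forEach() calls.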
    • How can I get a list of Collections in my Database? We can use getCollectionNames() as shown below in lines 8 and 9. Yet another possibility is shown in lines 23 to 28. Of course, since every collection is also a namespace, we can find them alongside indexes in lines 11 to 21:

 1 sa@wks:~$ mongo
 2 MongoDB shell version: 1.2.4
 3 url: test
 4 connecting to: test
 5 type "help" for help
 6 > db
 7 test
 8 > db.getCollectionNames();
 9 [ "fs.chunks", "fs.files", "mycollection", "system.indexes", "things" ]
10 > db.system.namespaces.find();
11 { "name" : "test.system.indexes" }
12 { "name" : "test.fs.files" }
13 { "name" : "test.fs.files.$_id_" }
14 { "name" : "test.fs.files.$filename_1" }
15 { "name" : "test.fs.chunks" }
16 { "name" : "test.fs.chunks.$_id_" }
17 { "name" : "test.fs.chunks.$files_id_1_n_1" }
18 { "name" : "test.things" }
19 { "name" : "test.things.$_id_" }
20 { "name" : "test.mycollection" }
21 { "name" : "test.mycollection.$_id_" }
23 > show collections
24 fs.chunks
25 fs.files
26 mycollection
27 system.indexes
28 things
29 > bye
30 sa@wks:~$
    • How do I delete a Collection? db.collection.drop() but there is no undo so beware.
    • What is a Namespace with regards to MongoDB? Collections can be organized in namespaces: named groups of collections defined using a dot notation. For example, we could define the collections blog.posts and blog.authors; both reside under the namespace blog but are two separate collections. The dot notation is then used to access them, e.g. db.blog.posts.find(); will return all documents from the collection blog.posts but nothing from the collection blog.authors. Namespaces simply provide an organizational mechanism for the user -- the collection namespace is flat from the database's point of view, which means that blog.authors really is just a collection of its own and not some collection authors grouped under some namespace blog. Technically speaking, blog.authors is no different from foo or foo.bar.baz; the grouping just helps us humans keep track.
    • How can I get a list of Namespaces in a Database? One way to list all namespaces for a particular database is to enter MongoDB's interactive shell:

sa@wks:~$ mongo
MongoDB shell version: 1.2.4
url: test
connecting to: test
type "help" for help
> db.system.namespaces.find();
{ "name" : "test.system.indexes" }
{ "name" : "test.fs.files" }
{ "name" : "test.fs.files.$_id_" }
{ "name" : "test.fs.files.$filename_1" }
{ "name" : "test.fs.chunks" }
{ "name" : "test.fs.chunks.$_id_" }
{ "name" : "test.fs.chunks.$files_id_1_n_1" }
{ "name" : "test.things" }
{ "name" : "test.things.$_id_" }
{ "name" : "test.mycollection" }
{ "name" : "test.mycollection.$_id_" }
> db.system.namespaces.count();
11
> bye
sa@wks:~$

      The system namespace in MongoDB is special since it contains database system information (read: metadata). There are several such collections, for example system.namespaces, which can be used to get information about all the namespaces within a database.
    • Statistics / Monitoring Because pilots need to know ...
    • The Server Status, what does it tell?

sa@wks:~$ mongo --quiet
type "help" for help
> db.serverStatus();
{
  "uptime" : 6695,
  "localTime" : "Sun Apr 11 2010 11:22:19 GMT+0200 (CEST)",
  "globalLock" : {
    "totalTime" : 6694193239,
    "lockTime" : 45048,
    "ratio" : 0.000006729414343397326
  },
  "mem" : {
    "resident" : 3,
    "virtual" : 138,
    "supported" : true,
    "mapped" : 0
  },

      Most of it is obvious, for example uptime. The globalLock part is interesting: totalTime is the same as uptime but in microseconds. lockTime is the amount of time the global lock has been held i.e. the total time spent waiting for write queries until a lock has been assigned and thus a write could be made. One may ask what the point is of having both uptime and totalTime. Well, totalTime will roll over faster since it is in microseconds, so at some point they diverge; the rollover is coordinated between totalTime and lockTime. The mem units are all in MiB: resident is what is in physical memory (also known as RAM), virtual is the virtual address space, mapped is the memory mapped space, and supported tells us whether memory info is supported on our platform.
    • "connections" : {
    "current" : 2,
    "available" : 19998
  },
  "extra_info" : {
    "note" : "fields vary by platform",
    "heap_usage_bytes" : 146048,
    "page_faults" : 57
  },
  "indexCounters" : {
    "btree" : {
      "accesses" : 0,
      "hits" : 0,
      "misses" : 0,
      "resets" : 0,
      "missRatio" : 0
    }
  },
  "backgroundFlushing" : {
    "flushes" : 111,
    "total_ms" : 2,
    "average_ms" : 0.018018018018018018,
    "last_ms" : 0,
    "last_finished" : "Sun Apr 11 2010 11:21:45 GMT+0200 (CEST)"
  },

      connections tells us how many client connections we can open against mongod; more precisely, current tells us how many client connections to mongod exist right now, and available shows us how many we have left. Within the extra_info part we have heap_usage_bytes, which is the main memory needed by the database.
    • "opcounters" : {
    "insert" : 16513,
    "query" : 1482263,
    "update" : 141594,
    "delete" : 38,
    "getmore" : 246889,
    "command" : 1247316
  },
  "asserts" : {
    "regular" : 0,
    "warning" : 0,
    "msg" : 0,
    "user" : 0,
    "rollovers" : 0
  },
  "ok" : 1
}
> bye
sa@wks:~$

      The opcounters part is also pretty interesting. insert, query, update, and delete are self-explanatory, but getmore and command are probably not. When we do a query, we get results in batches: the first batch is counted in query, all subsequent ones in getmore. commands are things like count, group, distinct, etc. And yes, taking those numbers and dividing them by time (delta or total) will give us operations/time, e.g. operations per second or operations since mongod got started. In fact, there is a Munin plugin (http://github.com/erh/mongo-munin) which does exactly that.
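      The "divide counters by time" idea above can be sketched in a few lines. The two snapshot dicts are hypothetical values of serverStatus()["opcounters"], taken ten seconds apart:

```python
def ops_per_second(before, after, seconds):
    """Turn two opcounters snapshots (taken `seconds` apart) into rates."""
    return {op: (after[op] - before[op]) / seconds for op in before}

# Two hypothetical opcounters snapshots, 10 seconds apart
snap1 = {"insert": 16513, "query": 1482263, "update": 141594}
snap2 = {"insert": 16613, "query": 1482763, "update": 141794}

rates = ops_per_second(snap1, snap2, 10)
print(rates["insert"], rates["query"], rates["update"])  # 10.0 50.0 20.0
```

      This is essentially what monitoring plugins like mongo-munin compute for graphing.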
    • Schema / Configuration Sorry folks, no can do, lack of time ... go to http://sunoano.name/ws/mongodb.html#faqs_schema_configuration
    • Indexes / Search / Metadata Sorry folks, no can do, lack of time ... go to http://sunoano.name/ws/mongodb.html#faqs_indexes_search_metadata
    • Map / Reduce Sorry folks, no can do, lack of time ... go to http://sunoano.name/ws/mongodb.html#faqs_map_reduce
    • GridFS / Data Size Store tons of data, reliably and smartly ...
    • What is GridFS? Basically a collection of normal documents. We have two collections, one for metadata (fs.files) and one consisting of chunks of data (fs.chunks). The GridFS spec provides a mechanism for transparently dividing a large file among multiple documents. This allows us to efficiently store large objects and, in the case of especially large files such as videos, permits range operations (e.g. fetching only the first n bytes of a file). What can we do with GridFS? Store ridiculous amounts of data in a smart way.
    • Why use GridFS over ordinary Filesystem Storage? If we used the filesystem we would have to handle backup/replication/scaling ourselves. We would also have to come up with some sort of hashing scheme ourselves, plus we would need to take care of cleanup/sorting/moving, because filesystems do not love lots of small files. With GridFS, we can use MongoDB's built-in replication/backup/scaling, e.g. scale reads by adding more read-only slaves and writes by using sharding. We also get out-of-the-box hashing (read: a UUID (Universally Unique Identifier)) for stored content, plus we do not suffer from filesystem performance degradation caused by a myriad of small files. Also, we can easily access information from random sections of large files, another thing traditional tools working with data right off the filesystem are not good at. Last but not least, we can keep information associated with the file (who has edited it, download count, description, etc.) right with the file itself.
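      The chunking that makes range operations cheap can be sketched as follows. This is a conceptual illustration of how GridFS fills fs.chunks, not the gridfs driver module itself; the 256 KiB chunk size is the driver default of this era:

```python
CHUNK_SIZE = 256 * 1024   # GridFS default chunk size (256 KiB)

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Divide a file into numbered chunks, the way GridFS fills fs.chunks."""
    return [{"n": i, "data": data[off:off + chunk_size]}
            for i, off in enumerate(range(0, len(data), chunk_size))]

payload = b"x" * (2 * CHUNK_SIZE + 100)   # a file just over two chunks long
chunks = split_into_chunks(payload)
print(len(chunks))                        # 3

# A range read for bytes [CHUNK_SIZE, CHUNK_SIZE + 10) touches only chunk 1
wanted = chunks[1]["data"][:10]
print(len(wanted))                        # 10
```

      Because each chunk is an ordinary document keyed by its sequence number n, fetching an arbitrary byte range means fetching only the chunks that overlap it, never the whole file.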
    • Scalability / Fault Tolerance / Load Balancing Sorry folks, no can do, lack of time ... go to http://sunoano.name/ws/mongodb.html#faqs_scalability_fault_tolerance_load_balancing
    • Miscellaneous Sorry folks, no can do, lack of time ... go to http://sunoano.name/ws/mongodb.html#faqs_miscellaneous
    • Use Case This should have been my major part ◦ locking (read transactions) ◦ asynchronous as opposed to synchronous operations ◦ numbers (double precision) Again, lack of time ... go to http://sunoano.name/ws/mongodb.html
    • Summary Part 1 Tell them what you told them ... simple as that ...
    • Introduction Part 2 Before starting with MongoDB-specific topics, it's important to know that we don't dislike relational databases. We know they are good for many things, but we also know that a web application's success is largely based on its performance and speed -- that's what we're running after, and that's why we're all here.
    • Existing Technologies • MongoKit (Nicolas Clairon): ◦ Great for completely unstructured model programming. It has structure validation, but I've never used it; I prefer to use MongoKit on models that may be constantly changing their structure. • mongoengine (Harry Marr): ◦ It allows you to define schemas for documents and query collections using Django-like syntax. • django-mongodb-engine (Alberto Paro and myself): ◦ This is a real Django backend based on django-mongodb and mongoengine, adapted to work with django-nonrel and MongoDB without changing anything in the code.
    • SQL to MongoDB Query Translation.... "What matters is who adapts faster to the changing conditions" - Charles Darwin The first thing we should remember when moving from SQL databases to NoSQL ones is that models were made to model data, but models can be modeled too. What I mean is that people tend to adapt database features to their models instead of adapting their models to the database. I'll try to cover some of the common questions found on the mailing list: • Let's start with JOINs. Why JOINs? Because we don't have those in MongoDB and we might need them, so we have to figure out the best workaround. The best thing you can do here is forget about JOINs; you won't have them. We are not talking about highly relational databases, we are talking about non-relational ones, so there can't be joins between two collections if there's no relation between them. One of the things we did was remodel the way we stored data: we embedded what could be embedded and did two or more queries where embedding was not possible. • What about ForeignKeys, do we have those? Yes, or kind of. We have DBRef, which is a kind of ForeignKey, but I personally wouldn't use refs in MongoDB. As I said, MongoDB is not about references and collection relations, it is about performance based on dynamism. • Since MongoDB barely has references, you can guess that many-to-many is insignificant; instead I would start thinking of dictionaries/maps and lists/arrays. • And last but not least, if you really need a query that joins two collections based on a field reference handling a many-to-many relation, then you have map/reduce.
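      The "two or more queries instead of a JOIN" pattern can be sketched like this. Both "collections" are plain in-memory structures here (hypothetical data); against a real server the dict lookup would be a second find() on the users collection:

```python
# Query 1 would fetch the posts; users simulates the other collection
posts = [{"_id": 1, "title": "Hello MongoDB", "author_id": 10},
         {"_id": 2, "title": "Schema-less FTW", "author_id": 10}]
users = {10: {"_id": 10, "nickname": "FlaPer87"}}

# "Query 2" resolves the referenced authors on the client side
for post in posts:
    post["author"] = users[post["author_id"]]

print(posts[0]["author"]["nickname"])  # FlaPer87
```

      The join moves from the database to the application, which is exactly the trade-off the bullet points above describe: either embed, or issue a second query.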
    • Keeping things lazy... Yes, because we're lazy people, so we do lazy things ... When getting ORMs to work with MongoDB, it is important that we keep things lazy to avoid bottlenecks in our web applications. MongoDB doesn't have many-to-many relations, but it can store lists and dictionaries. For example:

class User(models.Model):
    nickname = models.CharField(max_length=255)
    full_name = models.CharField(max_length=255)
    friends = ListField()
    groups = ListField()

      In the User model we have two ListFields that may cause some slowdowns in our web application: the first one is a list containing the ids/names of the user's friends, and the second one contains the groups the user is related to. Now think of a user that has many friends and is related to many groups (a popular one): that's a lot of data transfer and many instantiations for our code, because each object/id in the ListField has to be instantiated. This might sound obvious, but trust me, nothing is obvious when doing web programming.
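      One way to keep such ListFields lazy is to hold only the ids and dereference on access. This is a hypothetical sketch of the idea, not django-mongodb-engine's actual field implementation; fetch_user stands in for a per-id query against the users collection:

```python
class LazyList:
    """Hold only ids; fetch (instantiate) the referenced object on access."""
    def __init__(self, ids, fetch):
        self._ids = ids
        self._fetch = fetch           # stands in for a query per referenced id

    def __getitem__(self, index):
        return self._fetch(self._ids[index])

    def __len__(self):
        return len(self._ids)

fetched = []
def fetch_user(uid):                  # pretend this hits the users collection
    fetched.append(uid)
    return {"_id": uid}

friends = LazyList([1, 2, 3], fetch_user)
first = friends[0]                    # only this friend is actually fetched
print(len(friends), len(fetched))     # 3 1
```

      The popular user with hundreds of friends then costs nothing until some view actually iterates over them.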
    • Keeping Relations or Embedding? This is a common question when moving from relational databases to non-rel ones. Should we keep our models related, or embed the smaller ones into the bigger ones? The answer is NO, you shouldn't keep them related. For example, a common situation (or one commonly used to show how MongoDB works) is a blog engine with posts and comments. Let's see how we could handle comments (not threaded) in our blog engine:
    • Using References:

class Comment(models.Model):
    post = models.ForeignKey(Post)
    user = models.ForeignKey(User)
    text = models.CharField(max_length=255)

my_comment, created = Comment.objects.get_or_create(post=my_post, user=my_user,
                                                    text=my_text, defaults={})
    • Without references:

class Post(models.Model):
    ....
    comments = ListField()

post.comments.append({'user': user, 'text': text})
post.save()

      The first example is the most used because it is the way we're used to thinking when we write our models, but the second one is the right one when talking about NoSQL databases, because references make things slower. The bad thing about embedding our comments like that is that we have to worry about our 4 MiB document limit; if we are really popular on the net and many people come to our blog and comment on our posts, that might become a problem. Even so, this is great: we have removed a model from our app, so it should be easier to maintain, shouldn't it? But what is user supposed to be? Is it an embedded user object? Is it a ForeignKey? What is it? How should we handle users there? It again depends on how you'd like to do things. For example, it is possible to save the username as it should be shown and then, when the comments are loaded, just show the username; for those wanting to know more about a user, clicking on the username can load the user's personal info. Here are some examples:
    • Light and fast (for registered users):

post.comments.append({'user': 'FlaPer87', 'text': 'My Comment'})
post.save()

      Heavy and slow (for any user):

post.comments.append({'user': {'username': 'FlaPer87',
                               'email': 'flaper87@flaper87.org',
                               'url': 'http://blog.flaper87.org'},
                      'text': 'My Comment'})
post.save()
    • Lazy relations or MongoDB-like ones:

# Automatic serialization done in django-mongodb-engine
post.comments.append({'user': {'_app': model._meta.app_label,
                               '_model': model._meta.module_name,
                               'pk': model.pk,
                               '_type': "django"},
                      'text': 'My Comment'})
post.save()
    • Taking Advantage of schema-less Databases for Web Development One of the things I like most about MongoDB is that it is schema-less. People tend to think of schema-less DBs as a mess, which they are not. Schema-less databases do have a structure; the difference is that schema-less structures are dynamic, meaning they can be modified at any time and they are not typed. You can think of schema-less DBs (just like MongoDB does) as JSON-based maps. This kind of structure can be really helpful when doing web programming; in our case it lets us save any kind of data in our collections and have generic structures that change over time. For example, let's try to improve our Comment model (in case we decided to keep some relations).
    • class Comment(models.Model):
    post = models.ForeignKey(Post)
    user = GenericField()
    text = models.CharField(max_length=255)

# Known user
my_user = "FlaPer87"
my_comment, created = Comment.objects.get_or_create(post=my_post, user=my_user,
                                                    text=my_text, defaults={})

# Anonymous user
my_user = {'nickname': 'FlaPer87', 'full_name': 'Flavio Percoco Premoli',
           'email': 'flaper87@flaper87.org', 'url': 'http://blog.flaper87.org'}
my_comment2, created = Comment.objects.get_or_create(post=my_post, user=my_user,
                                                     text=my_text, defaults={})

      Using a GenericField we are able to save anything into that attribute, and we have to do our checks and controls on the code side. In this case the schema-less collection helped us get/save the anonymous user's information without having to create a record in our Users table and without forcing the user to register.
    • Summary Part 2 • Re-model your models • Be Lazy to be faster • Forget about relations, they will slow you down • Remember that dynamism is better than restrictions