1. MapReduce allows aggregation queries across collections by mapping documents to key-value pairs, reducing by key, and finalizing results. It can calculate averages, count tag frequencies, and scale across servers.
2. Geospatial indexing supports location-based queries by distance and area. Documents have location fields indexed for $near and $within operators to find closest or contained documents.
3. findAndModify atomically reads, modifies, and writes a single document, such as dequeuing from a shared queue or incrementing a counter.
2. “Cool” features not covered
• Scaling
– Auto sharding
– Replication with auto failover
• Administration
– Monitoring, diagnostics, profiler
– Backup, import/export
• General querying & indexing
4. MapReduce
• General aggregation query
• Examples
– What is the average weight of a collection of
bicycles?
– Given a collection of docs, each with an array of
tags, how many times does each tag appear in the
collection?
• What are the 10 most popular tags?
5. MapReduce
• map :: Document -> Array (key, value)
• reduce :: (key, Array value) -> value
• finalize :: (key, value) -> result
• runCommand ({mapreduce: collection, map: map,
reduce: reduce [, finalize: finalize]}) :: Array (key, result)
1. Apply map to each document
2. Collect all (key, value) pairs and group by key
3. Apply reduce to each key and its group of values
4. Apply finalize to each key and reduced value
• Array of results is stored as a result collection
– Which can be queried and/or mapReduced (again!)
6. MapReduce example
• Given a collection of docs, each with an array of
tags, how many times does each tag appear in the
collection (ignoring case)?
map = function() {
for (var i in this.tags) emit(lowercase(this.tags[i]),1) } *
reduce = function(tag,counts) {return sum(counts)}
finalize = function(tag,count) {return count}
• * In MongoDB, map input is ‘this’ and output is
emitted instead of returned
7. MapReduce scaling
• Output of reduce step can be input to another
reduce step
– reduce (k, [reduce (k,vs)]) == reduce (k,vs)
– This allows easy divide-and-conquer parallelism
finalize
reduce
reduce reduce
reduce
map
reduce
map
reduce
map
reduce
map
8. MapReduce example 2
• What is the average weight of a collection of
bicycles?
map = function() {emit(0, {sum: this.weight, count: 1})}
reduce = fuction(k,vs) {
var v = {sum: 0, count: 0}
for (var i in vs) {
v.sum += vs[i].sum
v.count += vs[i].count }
return v }
finalize = function(k,v) {return v.sum/v.count}
9. MapReduce optional parameters
• Apply mapReduce to subset of documents
– query: Eg. {price: {$lt: 1000}}
– limit: Eg. 100
– sort: only meaningful with limit. Eg. {price: 1}
• Result collection
– out: name of permanent result collection
• Temporary collection used if missing
– keeptemp: make temporary collection permanent
• Otherwise temporary collection is dropped when connection closes
• Javascript scope
– scope: scope contains user-defined functions and values for
map, reduce, and finalize to use
10. Geospatial indexing
• Query by geographic location
– Eg. Find the 10 closest museums to my location
• Location field must be an array or document of
two numbers
– Eg. [-87.7, 41.8], {long: -87.7, lat: 41.8}
– Any units can be used, but normally (long, lat) degrees
• ensureIndex (loc: “2d” [, {min: -180, max: 180}])
– Index is required for geospatial queries
• Error otherwise
11. Geospatial querying
• N closest in closeness order
– find (loc: {$near: [-87.7, 41.8]}).limit(N)
– Default limit is 100 if elided
• Closest within given distance
– find (loc: {$near: [-87.7, 41.8], $maxDistance: 5})
• Locations within circle or box
– find (loc: {$within: {$center: [[-87.7, 41.8], 5]}})
– find (loc: {$within: {$box: [[-86,40],[-89,44]]}})
• Box params are [lower-left, upper-right]
12. Geospatial, compound index
• Query by location and something else
– ensureIndex ({loc: “2d”, tag: 1})
• Compound index not required, just loc index
– find ({loc: {$near: [-87.7, 41.8]}, tag: “museum”})
13. Geospatial, current limitations
• Flat Euclidian geometry
– Spherical geometry in version 1.7
• To use, append “Sphere” to $near or $center
• Coordinate units must be (long, lat) degrees
• Distance units must be radians
• No wrapping at min/max (180th meridian)
• Only one geo-index per collection allowed
• Currently doesn’t work on sharded collections
– Tentatively planned for version 1.8
14. Geospatial, other
• runCommand ({geoNear: collection, near: [-
87.7, 41.8], num: 10})
– Similar to find $near except includes distances
with results
15. Confirm write, getLastError
• Check success of previous write (on same
connection)
– runCommand ({getlasterror: 1})
> {err: E [, code: C], n: N [, updatedExisting: U]}
• E = null. Success
• E = “error message”. Failure
• C only present on failure and is the error code number
• N = num docs updated (or upserted)
• U only present when write was an save/update
• U = true. Updated existing object
• U = false. Inserted new object (upsert)
16. Confirm write, safe mode
• Some language drivers have a safe write that
calls getlasterror automatically and raises
exception on error
– Eg. in Python: save({..}, safe=True)
17. Confirm write, getLastError variations
• runCommand ({getlasterror: 1, fsync: true})
– fsync data to disk. Blocks until finished
• In replicated enviroment
– runCommand ({getlasterror: 1, w: 2, wtimeout:
3000})
• Block until write reaches at least two servers or timeout
(in ms) reached
• If timeout reached, result will contain {wtimeout: true}
– Only need to do this check on last write because
all writes are replicated in order
18. findAndModify: atomic read & write
• Read and write a single document atomically
• Examples
– Taking from a shared queue
– Getting and incrementing a counter
19. findAndModify
• runCommand ({findAndModify: collection,
– query: - select document, first one if many, error if none
– sort: - pick first of sorted selection
– Either
• remove: true - deleted selected document
• update: modifier - modify selected document
– new:
• false (default) – return document before remove/update
• true – return document after update
– fields: - project given fields only
– upsert:
• false (default)
• True – create object if it does not exist })
• All fields above are optional except remove or update
22. Capped collection
• Very fast fixed-size circular buffer
– Oldest documents removed when new documents
inserted
– No indexes
– Supports queries, scans in (reverse) insert order
– Good for logging
• Used internally for oplog
• createCollection (“foo”, {capped: true, size: bytes
[, max: numDocs]})
– Create capped collection before first use
23. Capped collection, tailable cursor
• Similar to Unix “tail –f” command
• Cursor never finishes, next() just waits for next
inserted document
• Cannot sort results, insert order only
24. GridFS
• Specification for storing large files
(bytestrings) in MongoDB
• Interface
– store (filename, bytes)
– fetch (filename)
• Implementation
– Break file into chunks of 256KB (default)
– Store chunks in “fs.chunks” collection
– Store filename, file size, etc., in “fs.files” collection
25. GridFS implementation
• “files” schema
– _id: ObjectId
– filename: String
– length: Int – size of file in bytes
– chunkSize: Int – size of each chunk (default 256KB)
– uploadDate: Date – date when object stored
– md5: String – result of filemd5 on file
• Additional fields may be added by user
• Unique index on {filename: 1}
26. GridFS implementation
• “chunks” schema
– _id: ObjectId
– Files_id: ObjectId – id of file in “files” collection
– N: Int – chunk number
– Data: Binary – chunk bytes
• Unique index on {files_id: 1, n: 1}