MongoDB Chicago - MapReduce, Geospatial, & Other Cool Features

MapReduce,
Geospatial Indexing,
& Other Cool Features
Tony Hannan
Software Engineer, 10gen
tony@10gen.com
MongoDB Chicago, Oct 2010

“Cool” features not covered
• Scaling
– Auto sharding
– Replication with auto failover
• Administration
– Monitoring, diagnostics, profiler
– Backup, import/export
• General querying & indexing

“Cool” features covered
• MapReduce aggregation
• Geospatial indexing
• Confirm write (getLastError)
• Atomic read and write (findAndModify)
• Capped collection
• GridFS

MapReduce
• General aggregation query
• Examples
– What is the average weight of a collection of
bicycles?
– Given a collection of docs, each with an array of
tags, how many times does each tag appear in the
collection?
• What are the 10 most popular tags?

MapReduce
• map :: Document -> Array (key, value)
• reduce :: (key, Array value) -> value
• finalize :: (key, value) -> result
• runCommand ({mapreduce: collection, map: map,
reduce: reduce [, finalize: finalize]}) :: Array (key, result)
1. Apply map to each document
2. Collect all (key, value) pairs and group by key
3. Apply reduce to each key and its group of values
4. Apply finalize to each key and reduced value
• Array of results is stored as a result collection
– Which can be queried and/or mapReduced (again!)
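
The four steps above can be sketched in plain JavaScript, outside MongoDB. This is an in-memory illustration only: mapReduceSketch is a hypothetical helper, not a MongoDB API, and it returns an array where the server would write a result collection. The global emit is set up so that map functions read the same way as in MongoDB.

```javascript
var emit; // assigned by mapReduceSketch so map can call emit() as it would in MongoDB

// Hypothetical helper: map -> group by key -> reduce -> finalize, all in memory.
function mapReduceSketch(docs, map, reduce, finalize) {
  var groups = new Map(); // key -> array of emitted values
  emit = function (key, value) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // Steps 1 and 2: apply map to each document (as `this`) and group emitted pairs.
  docs.forEach(function (doc) { map.call(doc); });
  var results = [];
  groups.forEach(function (values, key) {
    var reduced = reduce(key, values);                        // step 3
    var result = finalize ? finalize(key, reduced) : reduced; // step 4 (optional)
    results.push({ _id: key, value: result });
  });
  return results;
}

// Sample data (made up): count tags, ignoring case.
var docs = [{ tags: ["Red", "fast"] }, { tags: ["red"] }, { tags: [] }];
var tagCounts = mapReduceSketch(
  docs,
  function () {
    for (var i = 0; i < this.tags.length; i++) emit(this.tags[i].toLowerCase(), 1);
  },
  function (tag, counts) {
    return counts.reduce(function (a, b) { return a + b; }, 0);
  }
);
console.log(tagCounts);
```

Sorting tagCounts by value in descending order and keeping the first 10 entries would answer the "10 most popular tags" question.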

MapReduce example
• Given a collection of docs, each with an array of
tags, how many times does each tag appear in the
collection (ignoring case)?
map = function() {
  for (var i in this.tags) emit(this.tags[i].toLowerCase(), 1) } // *
reduce = function(tag, counts) {
  var total = 0
  for (var i in counts) total += counts[i]
  return total }
finalize = function(tag, count) {return count}
• * In MongoDB, map input is ‘this’ and output is
emitted instead of returned

MapReduce scaling
• Output of reduce step can be input to another
reduce step
– reduce (k, [reduce (k,vs)]) == reduce (k,vs)
– This allows easy divide-and-conquer parallelism
[Diagram: a tree with four map steps at the leaves, each feeding a reduce; the reduce outputs are combined by further reduce steps, with a single finalize at the root]
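
The identity reduce (k, [reduce (k,vs)]) == reduce (k,vs) can be checked with a small plain-JavaScript sketch. The reducer and sample values here are illustrative assumptions; the point is only that reducing partitions separately and then reducing the partial results gives the same answer as one pass.

```javascript
// Reducer that sums counts; its output has the same shape as its input values.
function sumCounts(key, counts) {
  return counts.reduce(function (a, b) { return a + b; }, 0);
}

var values = [1, 1, 1, 1, 1, 1, 1];       // values emitted for one key
var allAtOnce = sumCounts("red", values); // a single reduce over everything

// Divide and conquer: reduce two partitions, then reduce the partial results.
var partA = sumCounts("red", values.slice(0, 3));
var partB = sumCounts("red", values.slice(3));
var combined = sumCounts("red", [partA, partB]);

console.log(allAtOnce, combined); // both are 7
```

A reducer that returned counts.length instead would fail this check, because the second pass would count the partial results rather than add them.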

MapReduce example 2
• What is the average weight of a collection of
bicycles?
map = function() {emit(0, {sum: this.weight, count: 1})}
reduce = function(k,vs) {
var v = {sum: 0, count: 0}
for (var i in vs) {
v.sum += vs[i].sum
v.count += vs[i].count }
return v }
finalize = function(k,v) {return v.sum/v.count}
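
Why emit {sum, count} and divide only in finalize? A plain-JavaScript sketch with made-up weights shows that this form survives being reduced in stages, while a reducer that averages directly does not. The naiveAvg function is a hypothetical counter-example, not something from the slides.

```javascript
var weights = [10, 20, 60]; // sample bicycle weights (made up)

// Reducer from the slide: accumulate sum and count.
function reduceSumCount(k, vs) {
  var v = { sum: 0, count: 0 };
  for (var i = 0; i < vs.length; i++) {
    v.sum += vs[i].sum;
    v.count += vs[i].count;
  }
  return v;
}

var emitted = weights.map(function (w) { return { sum: w, count: 1 }; });

// Reduce in two stages, as a divide-and-conquer run might.
var partial = reduceSumCount(0, emitted.slice(0, 2));   // {sum: 30, count: 2}
var total = reduceSumCount(0, [partial, emitted[2]]);   // {sum: 90, count: 3}
var average = total.sum / total.count;                  // finalize step: 30

// Hypothetical naive reducer: averaging inside reduce breaks when re-reduced.
function naiveAvg(k, vs) {
  return vs.reduce(function (a, b) { return a + b; }, 0) / vs.length;
}
var naive = naiveAvg(0, [naiveAvg(0, [10, 20]), 60]);   // (15 + 60) / 2 = 37.5

console.log(average, naive);
```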

MapReduce optional parameters
• Apply mapReduce to subset of documents
– query: Eg. {price: {$lt: 1000}}
– limit: Eg. 100
– sort: only meaningful with limit. Eg. {price: 1}
• Result collection
– out: name of permanent result collection
• Temporary collection used if missing
– keeptemp: make temporary collection permanent
• Otherwise temporary collection is dropped when connection closes
• Javascript scope
– scope: scope contains user-defined functions and values for
map, reduce, and finalize to use

Geospatial indexing
• Query by geographic location
– Eg. Find the 10 closest museums to my location
• Location field must be an array or document of
two numbers
– Eg. [-87.7, 41.8], {long: -87.7, lat: 41.8}
– Any units can be used, but normally (long, lat) degrees
• ensureIndex ({loc: "2d"} [, {min: -180, max: 180}])
– Index is required for geospatial queries
• Error otherwise

Geospatial querying
• N closest in closeness order
– find ({loc: {$near: [-87.7, 41.8]}}).limit(N)
– Default limit is 100 if elided
• Closest within given distance
– find ({loc: {$near: [-87.7, 41.8], $maxDistance: 5}})
• Locations within circle or box
– find ({loc: {$within: {$center: [[-87.7, 41.8], 5]}}})
– find ({loc: {$within: {$box: [[-89, 40], [-86, 44]]}}})
• Box params are [lower-left, upper-right]
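
The semantics of $near, $maxDistance, and $within $box under flat Euclidean geometry can be illustrated with a plain-JavaScript sketch. The function names and sample places are assumptions for illustration; the server answers these queries from the 2d index rather than by scanning and sorting as this sketch does.

```javascript
// Sample documents (made up), each with a [long, lat] location.
var places = [
  { name: "a", loc: [-87.7, 41.8] },
  { name: "b", loc: [-87.6, 41.9] },
  { name: "c", loc: [-80.0, 40.0] }
];

// Flat Euclidean distance between two [x, y] points.
function dist(p, q) {
  return Math.sqrt(Math.pow(p[0] - q[0], 2) + Math.pow(p[1] - q[1], 2));
}

// Like $near: closest first, optionally bounded by maxDistance, limited to n.
function near(docs, point, n, maxDistance) {
  return docs
    .filter(function (d) {
      return maxDistance === undefined || dist(d.loc, point) <= maxDistance;
    })
    .sort(function (x, y) { return dist(x.loc, point) - dist(y.loc, point); })
    .slice(0, n);
}

// Like $within $box: box is [lowerLeft, upperRight].
function withinBox(docs, box) {
  var ll = box[0], ur = box[1];
  return docs.filter(function (d) {
    return d.loc[0] >= ll[0] && d.loc[0] <= ur[0] &&
           d.loc[1] >= ll[1] && d.loc[1] <= ur[1];
  });
}

var closestTwo = near(places, [-87.7, 41.8], 2);
var withinFive = near(places, [-87.7, 41.8], 100, 5);
var inBox = withinBox(places, [[-89, 40], [-86, 44]]);
console.log(closestTwo, withinFive, inBox);
```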

Geospatial, compound index
• Query by location and something else
– ensureIndex ({loc: "2d", tag: 1})
• Compound index not required, just loc index
– find ({loc: {$near: [-87.7, 41.8]}, tag: "museum"})

Geospatial, current limitations
• Flat Euclidean geometry
– Spherical geometry in version 1.7
• To use, append “Sphere” to $near or $center
• Coordinate units must be (long, lat) degrees
• Distance units must be radians
• No wrapping at min/max (180th meridian)
• Only one geo-index per collection allowed
• Currently doesn’t work on sharded collections
– Tentatively planned for version 1.8
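
Because the spherical operators take distances in radians, a real-world distance has to be divided by the Earth's radius first. A minimal sketch, assuming a mean radius of about 6371 km and hypothetical helper names; the query document is only built here, not sent anywhere.

```javascript
var EARTH_RADIUS_KM = 6371; // approximate mean Earth radius

// Hypothetical helpers for converting between kilometres and radians of arc.
function kmToRadians(km) { return km / EARTH_RADIUS_KM; }
function radiansToKm(rad) { return rad * EARTH_RADIUS_KM; }

// Shape of a spherical "within 5 km" query, using the "Sphere" suffix on $near.
// Coordinates are (long, lat) degrees; $maxDistance is in radians.
var query = {
  loc: { $nearSphere: [-87.7, 41.8], $maxDistance: kmToRadians(5) }
};

console.log(query.loc.$maxDistance); // a small fraction of a radian
```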

Geospatial, other
• runCommand ({geoNear: collection,
near: [-87.7, 41.8], num: 10})
– Similar to find $near except includes distances
with results

Confirm write, getLastError
• Check success of previous write (on same
connection)
– runCommand ({getlasterror: 1})
> {err: E [, code: C], n: N [, updatedExisting: U]}
• E = null. Success
• E = “error message”. Failure
• C only present on failure and is the error code number
• N = num docs updated (or upserted)
• U only present when write was a save/update
• U = true. Updated existing object
• U = false. Inserted new object (upsert)
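
The result document described above could be interpreted along these lines. checkLastError is a hypothetical helper, and the sample result documents are made up to match the documented shape; no server is contacted.

```javascript
// Hypothetical helper: turn a getlasterror result into an outcome or an exception.
function checkLastError(result) {
  if (result.err !== null && result.err !== undefined) {
    var e = new Error(result.err); // E = "error message" means failure
    e.code = result.code;          // C is only present on failure
    throw e;
  }
  return {
    n: result.n,                                   // docs updated (or upserted)
    updated: result.updatedExisting === true,      // U = true
    inserted: result.updatedExisting === false     // U = false (upsert)
  };
}

// Sample result of a successful update of one existing document.
var outcome = checkLastError({ err: null, n: 1, updatedExisting: true });
console.log(outcome);
```

This is roughly what a driver's safe mode does on the caller's behalf: run the check after the write and raise on failure.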

Confirm write, safe mode
• Some language drivers have a safe write that
calls getlasterror automatically and raises
exception on error
– Eg. in Python: save({..}, safe=True)

Confirm write, getLastError variations
• runCommand ({getlasterror: 1, fsync: true})
– fsync data to disk. Blocks until finished
• In replicated environment
– runCommand ({getlasterror: 1, w: 2, wtimeout:
3000})
• Block until write reaches at least two servers or timeout
(in ms) reached
• If timeout reached, result will contain {wtimeout: true}
– Only need to do this check on last write because
all writes are replicated in order

findAndModify: atomic read & write
• Read and write a single document atomically
• Examples
– Taking from a shared queue
– Getting and incrementing a counter

findAndModify
• runCommand ({findAndModify: collection,
– query: - select document, first one if many, error if none
– sort: - pick first of sorted selection
– Either
• remove: true - delete selected document
• update: modifier - modify selected document
– new:
• false (default) – return document before remove/update
• true – return document after update
– fields: - project given fields only
– upsert:
• false (default)
• true – create object if it does not exist })
• All fields above are optional except remove or update

findAndModify example
• Shared queue
– runCommand ({findAndModify: "queue",
query: {},
sort: {priority: -1},
remove: true })
> {ok: 1, value: {_id: .., priority: 9, …}}
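
The effect of this command can be mimicked in memory: pick the highest-priority item and remove it in the same step, so no second consumer can take it in between. takeNext and the sample queue are illustrative assumptions; atomicity on the server comes from the command itself, not from anything in this sketch.

```javascript
// In-memory stand-in for findAndModify with sort {priority: -1} and remove: true.
function takeNext(queue) {
  if (queue.length === 0) throw new Error("no matching object"); // error if none
  var best = 0;
  for (var i = 1; i < queue.length; i++) {
    if (queue[i].priority > queue[best].priority) best = i;
  }
  return queue.splice(best, 1)[0]; // remove and return the selected item
}

// Sample queue (made up).
var queue = [
  { _id: 1, priority: 3 },
  { _id: 2, priority: 9 },
  { _id: 3, priority: 5 }
];
var first = takeNext(queue);
var second = takeNext(queue);
console.log(first, second, queue.length);
```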

findAndModify example 2
• Auto-increment
function getNextValue(counterName) {
  var r = runCommand ({findAndModify: "counters",
    query: {_id: counterName},
    update: {$inc: {value: 1}},
    upsert: true,
    new: true })
  // r.value is the updated counter document; its value field holds the count
  return r.value.value }
> getNextValue("x") -> 1
> getNextValue("y") -> 1
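
The counter behaviour can be traced with an in-memory stand-in. This sketch reuses the name getNextValue but replaces the server call with a Map, so it shows only the upsert, $inc, and new: true semantics, not the atomicity.

```javascript
var counters = new Map(); // stand-in for the "counters" collection, keyed by _id

// Mimics findAndModify with update {$inc: {value: 1}}, upsert: true, new: true.
function getNextValue(counterName) {
  if (!counters.has(counterName)) {
    counters.set(counterName, { _id: counterName, value: 0 }); // upsert
  }
  var doc = counters.get(counterName);
  doc.value += 1;   // $inc
  return doc.value; // new: true returns the state after the update
}

var x1 = getNextValue("x");
var x2 = getNextValue("x");
var y1 = getNextValue("y");
console.log(x1, x2, y1); // 1 2 1
```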

Capped collection
• Very fast fixed-size circular buffer
– Oldest documents removed when new documents
inserted
– No indexes
– Supports queries, scans in (reverse) insert order
– Good for logging
• Used internally for oplog
• createCollection ("foo", {capped: true, size: bytes
[, max: numDocs]})
– Create capped collection before first use
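
The circular-buffer behaviour can be sketched in a few lines. CappedSketch is a hypothetical class that bounds only the number of documents (the max option); a real capped collection is bounded primarily by size in bytes.

```javascript
// Hypothetical in-memory model of a capped collection limited to maxDocs documents.
function CappedSketch(maxDocs) {
  this.maxDocs = maxDocs;
  this.docs = [];
}

// Insert a document, dropping the oldest one when the buffer is full.
CappedSketch.prototype.insert = function (doc) {
  this.docs.push(doc);
  if (this.docs.length > this.maxDocs) this.docs.shift();
};

// Return documents in insert order, or in reverse insert order.
CappedSketch.prototype.find = function (reverse) {
  var copy = this.docs.slice();
  return reverse ? copy.reverse() : copy;
};

var log = new CappedSketch(3);
for (var i = 1; i <= 5; i++) log.insert({ seq: i });
var forward = log.find().map(function (d) { return d.seq; });
var backward = log.find(true).map(function (d) { return d.seq; });
console.log(forward, backward); // [3, 4, 5] [5, 4, 3]
```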

Capped collection, tailable cursor
• Similar to Unix "tail -f" command
• Cursor never finishes, next() just waits for next
inserted document
• Cannot sort results, insert order only

GridFS
• Specification for storing large files
(bytestrings) in MongoDB
• Interface
– store (filename, bytes)
– fetch (filename)
• Implementation
– Break file into chunks of 256KB (default)
– Store chunks in “fs.chunks” collection
– Store filename, file size, etc., in “fs.files” collection

GridFS implementation
• “files” schema
– _id: ObjectId
– filename: String
– length: Int – size of file in bytes
– chunkSize: Int – size of each chunk (default 256KB)
– uploadDate: Date – date when object stored
– md5: String – result of filemd5 on file
• Additional fields may be added by user
• Unique index on {filename: 1}

GridFS implementation
• “chunks” schema
– _id: ObjectId
– files_id: ObjectId – id of file in “files” collection
– n: Int – chunk number
– data: Binary – chunk bytes
• Unique index on {files_id: 1, n: 1}
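
The two schemas can be tied together with a sketch of how a file might be split into documents and put back together. The function names are hypothetical, a plain number stands in for the ObjectId, and the md5 field and chunk _id are omitted; this is not a driver's GridFS implementation.

```javascript
var DEFAULT_CHUNK_SIZE = 256 * 1024; // 256KB default chunk size

// Hypothetical helper: build one "files" document and its "chunks" documents.
function toGridFsDocs(fileId, filename, bytes, chunkSize) {
  chunkSize = chunkSize || DEFAULT_CHUNK_SIZE;
  var chunks = [];
  for (var n = 0; n * chunkSize < bytes.length; n++) {
    chunks.push({
      files_id: fileId, // id of the file in the "files" collection
      n: n,             // chunk number
      data: bytes.subarray(n * chunkSize, (n + 1) * chunkSize)
    });
  }
  var file = {
    _id: fileId,
    filename: filename,
    length: bytes.length,
    chunkSize: chunkSize,
    uploadDate: new Date()
  };
  return { file: file, chunks: chunks };
}

// Hypothetical helper: rebuild the original bytes from the documents.
function reassemble(file, chunks) {
  var out = new Uint8Array(file.length);
  chunks.forEach(function (c) { out.set(c.data, c.n * file.chunkSize); });
  return out;
}

// A 600KB sample file (made-up contents) splits into three chunks.
var bytes = new Uint8Array(600 * 1024);
for (var i = 0; i < bytes.length; i++) bytes[i] = i % 251;
var stored = toGridFsDocs(1, "big.bin", bytes);
var restored = reassemble(stored.file, stored.chunks);
console.log(stored.chunks.length); // 3
```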
