MapReduce,
Geospatial Indexing,
& Other Cool Features
Tony Hannan
Software Engineer, 10gen
tony@10gen.com
MongoDB Chicago, Oct 2010
“Cool” features not covered
• Scaling
– Auto sharding
– Replication with auto failover
• Administration
– Monitoring, diagnostics, profiler
– Backup, import/export
• General querying & indexing
“Cool” features covered
• MapReduce aggregation
• Geospatial indexing
• Confirm write (getLastError)
• Atomic read and write (findAndModify)
• Capped collection
• GridFS
MapReduce
• General aggregation query
• Examples
– What is the average weight of a collection of
bicycles?
– Given a collection of docs, each with an array of
tags, how many times does each tag appear in the
collection?
• What are the 10 most popular tags?
MapReduce
• map :: Document -> Array (key, value)
• reduce :: (key, Array value) -> value
• finalize :: (key, value) -> result
• runCommand ({mapreduce: collection, map: map,
reduce: reduce [, finalize: finalize]}) :: Array (key, result)
1. Apply map to each document
2. Collect all (key, value) pairs and group by key
3. Apply reduce to each key and its group of values
4. Apply finalize to each key and reduced value
• Array of results is stored as a result collection
– Which can be queried and/or mapReduced (again!)
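The four steps above can be imitated in plain JavaScript without a server. This is only an in-memory sketch: `runMapReduce` is a hypothetical helper, not a MongoDB API, and it hands `emit` to map as an argument, whereas MongoDB makes `emit` available inside map directly. The sample documents are made up.

```javascript
// In-memory sketch of the four mapReduce steps (not MongoDB's implementation).
// runMapReduce is a hypothetical helper; emit is passed to map as an argument.
function runMapReduce(docs, map, reduce, finalize) {
  var groups = new Map();                        // key -> array of emitted values
  function emit(key, value) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  // Steps 1 and 2: apply map to each document, grouping emitted pairs by key.
  docs.forEach(function (doc) { map.call(doc, emit); });
  var results = [];
  groups.forEach(function (values, key) {
    var reduced = reduce(key, values);                        // step 3
    var value = finalize ? finalize(key, reduced) : reduced;  // step 4
    results.push({ _id: key, value: value });                 // one result per key
  });
  return results;
}

// Made-up documents: count tags, ignoring case.
var docs = [{ tags: ["Bike", "road"] }, { tags: ["bike"] }];
var tagCounts = runMapReduce(
  docs,
  function (emit) {
    this.tags.forEach(function (t) { emit(t.toLowerCase(), 1); });
  },
  function (tag, counts) {
    return counts.reduce(function (a, b) { return a + b; }, 0);
  }
);
console.log(tagCounts);
```

Each element of the returned array plays the role of one document in the result collection.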
MapReduce example
• Given a collection of docs, each with an array of
tags, how many times does each tag appear in the
collection (ignoring case)?
map = function() {
for (var i in this.tags) emit(this.tags[i].toLowerCase(), 1) } *
reduce = function(tag,counts) {return Array.sum(counts)}
finalize = function(tag,count) {return count}
• * In MongoDB, map input is ‘this’ and output is
emitted instead of returned
MapReduce scaling
• Output of reduce step can be input to another
reduce step
– reduce (k, [reduce (k,vs)]) == reduce (k,vs)
– This allows easy divide-and-conquer parallelism
[Diagram: four map steps each feed a reduce step; the reduce outputs are combined by further reduce steps into a single reduce, followed by finalize]
MapReduce example 2
• What is the average weight of a collection of
bicycles?
map = function() {emit(0, {sum: this.weight, count: 1})}
reduce = function(k,vs) {
var v = {sum: 0, count: 0}
for (var i in vs) {
v.sum += vs[i].sum
v.count += vs[i].count }
return v }
finalize = function(k,v) {return v.sum/v.count}
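The example emits {sum, count} pairs rather than averages so that the output of reduce can be fed back into reduce. A small check of that property, as a sketch with made-up weights (the variable names are illustrative only):

```javascript
// Sketch: verify that re-reducing partial results gives the same answer as
// reducing everything at once. The weights are made up for illustration.
function reduce(k, vs) {
  var v = { sum: 0, count: 0 };
  for (var i = 0; i < vs.length; i++) {
    v.sum += vs[i].sum;
    v.count += vs[i].count;
  }
  return v;
}
function finalize(k, v) { return v.sum / v.count; }

// What map would emit for four bicycles weighing 10, 12, 14 and 20.
var emitted = [10, 12, 14, 20].map(function (w) { return { sum: w, count: 1 }; });

var allAtOnce = reduce(0, emitted);
// Uneven split: one value in the first part, three in the second.
var inTwoParts = reduce(0, [
  reduce(0, emitted.slice(0, 1)),
  reduce(0, emitted.slice(1))
]);
var average = finalize(0, inTwoParts);

// Averaging the partial averages instead would give a different answer.
var naive = (10 + (12 + 14 + 20) / 3) / 2;
console.log(allAtOnce, inTwoParts, average, naive);
```

Because the division happens only in finalize, the result does not depend on how the values were split across reduce calls.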
MapReduce optional parameters
• Apply mapReduce to subset of documents
– query: Eg. {price: {$lt: 1000}}
– limit: Eg. 100
– sort: only meaningful with limit. Eg. {price: 1}
• Result collection
– out: name of permanent result collection
• Temporary collection used if missing
– keeptemp: make temporary collection permanent
• Otherwise temporary collection is dropped when connection closes
• Javascript scope
– scope: scope contains user-defined functions and values for
map, reduce, and finalize to use
Geospatial indexing
• Query by geographic location
– Eg. Find the 10 closest museums to my location
• Location field must be an array or document of
two numbers
– Eg. [-87.7, 41.8], {long: -87.7, lat: 41.8}
– Any units can be used, but normally (long, lat) degrees
• ensureIndex ({loc: "2d"} [, {min: -180, max: 180}])
– Index is required for geospatial queries
• Error otherwise
Geospatial querying
• N closest in closeness order
– find ({loc: {$near: [-87.7, 41.8]}}).limit(N)
– Default limit is 100 if omitted
• Closest within given distance
– find ({loc: {$near: [-87.7, 41.8], $maxDistance: 5}})
• Locations within circle or box
– find ({loc: {$within: {$center: [[-87.7, 41.8], 5]}}})
– find ({loc: {$within: {$box: [[-89, 40], [-86, 44]]}}})
• Box params are [lower-left, upper-right]
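The semantics of these queries under flat geometry can be sketched over a plain array. The helpers `near`, `withinCenter`, and `withinBox` are hypothetical stand-ins, not MongoDB functions; the places are made up, and treating boundaries as inclusive is an assumption.

```javascript
// Sketch of flat (Euclidean) $near, $center and $box semantics over an array.
// All helper names are hypothetical; inclusive boundaries are an assumption.
function dist(a, b) { return Math.hypot(a[0] - b[0], a[1] - b[1]); }

// Closest first, optionally within maxDistance, default limit of 100.
function near(docs, point, limit, maxDistance) {
  return docs
    .filter(function (d) {
      return maxDistance === undefined || dist(d.loc, point) <= maxDistance;
    })
    .sort(function (a, b) { return dist(a.loc, point) - dist(b.loc, point); })
    .slice(0, limit === undefined ? 100 : limit);
}

function withinCenter(docs, center, radius) {
  return docs.filter(function (d) { return dist(d.loc, center) <= radius; });
}

// Box is given as [lower-left, upper-right].
function withinBox(docs, lowerLeft, upperRight) {
  return docs.filter(function (d) {
    return d.loc[0] >= lowerLeft[0] && d.loc[0] <= upperRight[0] &&
           d.loc[1] >= lowerLeft[1] && d.loc[1] <= upperRight[1];
  });
}

var places = [
  { name: "a", loc: [-87.6, 41.9] },
  { name: "b", loc: [-87.7, 41.8] },
  { name: "c", loc: [-80.0, 40.0] }
];
function names(ds) { return ds.map(function (d) { return d.name; }); }

var closest = names(near(places, [-87.7, 41.8], 2));
var inCircle = names(withinCenter(places, [-87.7, 41.8], 5));
var inBox = names(withinBox(places, [-89, 40], [-86, 44]));
console.log(closest, inCircle, inBox);
```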
Geospatial, compound index
• Query by location and something else
– ensureIndex ({loc: "2d", tag: 1})
• Compound index not required, just loc index
– find ({loc: {$near: [-87.7, 41.8]}, tag: "museum"})
Geospatial, current limitations
• Flat Euclidean geometry
– Spherical geometry in version 1.7
• To use, append “Sphere” to $near or $center
• Coordinate units must be (long, lat) degrees
• Distance units must be radians
• No wrapping at min/max (180th meridian)
• Only one geo-index per collection allowed
• Currently doesn’t work on sharded collections
– Tentatively planned for version 1.8
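Since the spherical operators take distances in radians, a surface distance has to be divided by the sphere's radius first. A minimal sketch, assuming a spherical Earth with a mean radius of about 6371 km; the helper names are hypothetical and the query document is only constructed, not sent anywhere.

```javascript
// Sketch: convert surface distances to and from radians for spherical queries.
// Assumes a spherical Earth with mean radius of about 6371 km (an approximation).
var EARTH_RADIUS_KM = 6371;

function kmToRadians(km) { return km / EARTH_RADIUS_KM; }
function radiansToKm(rad) { return rad * EARTH_RADIUS_KM; }

// "Within 5 km of this point", with coordinates as (long, lat) degrees.
var query = {
  loc: { $nearSphere: [-87.7, 41.8], $maxDistance: kmToRadians(5) }
};
console.log(query.loc.$maxDistance);
```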
Geospatial, other
• runCommand ({geoNear: collection, near: [-87.7, 41.8], num: 10})
– Similar to find $near except includes distances
with results
Confirm write, getLastError
• Check success of previous write (on same
connection)
– runCommand ({getlasterror: 1})
> {err: E [, code: C], n: N [, updatedExisting: U]}
• E = null. Success
• E = “error message”. Failure
• C only present on failure and is the error code number
• N = num docs updated (or upserted)
• U only present when write was a save/update
• U = true. Updated existing object
• U = false. Inserted new object (upsert)
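The field meanings above can be turned into a small interpreter for the result document. This is a sketch that follows the slide's reading of each field; `describeWrite` is a hypothetical helper and the sample documents in the calls are made up.

```javascript
// Sketch: interpret a getlasterror result using the fields described above.
// describeWrite is a hypothetical helper, not part of any driver.
function describeWrite(result) {
  if (result.err !== null) {
    // code is only present on failure
    return "failed (" + result.code + "): " + result.err;
  }
  if (result.updatedExisting === true) {
    return "updated " + result.n + " existing document(s)";
  }
  if (result.updatedExisting === false) {
    return "inserted new document (upsert)";
  }
  return "ok";   // e.g. a plain insert, where updatedExisting is absent
}

var failure = describeWrite({ err: "duplicate key", code: 11000, n: 0 });
var updated = describeWrite({ err: null, n: 1, updatedExisting: true });
var upserted = describeWrite({ err: null, n: 1, updatedExisting: false });
console.log(failure, updated, upserted);
```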
Confirm write, safe mode
• Some language drivers have a safe write that
calls getlasterror automatically and raises
exception on error
– Eg. in Python: save({..}, safe=True)
Confirm write, getLastError variations
• runCommand ({getlasterror: 1, fsync: true})
– fsync data to disk. Blocks until finished
• In replicated environment
– runCommand ({getlasterror: 1, w: 2, wtimeout:
3000})
• Block until write reaches at least two servers or timeout
(in ms) reached
• If timeout reached, result will contain {wtimeout: true}
– Only need to do this check on last write because
all writes are replicated in order
findAndModify: atomic read & write
• Read and write a single document atomically
• Examples
– Taking from a shared queue
– Getting and incrementing a counter
findAndModify
• runCommand ({findAndModify: collection,
– query: - select document, first one if many, error if none
– sort: - pick first of sorted selection
– Either
• remove: true - delete selected document
• update: modifier - modify selected document
– new:
• false (default) – return document before remove/update
• true – return document after update
– fields: - project given fields only
– upsert:
• false (default)
• true – create object if it does not exist })
• All fields above are optional except remove or update
findAndModify example
• Shared queue
– runCommand ({findAndModify: "queue",
query: {},
sort: {priority: -1},
remove: true })
> {ok: 1, value: {_id: .., priority: 9, …}}
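What this command does to the queue can be sketched in memory: pick the highest-priority document, remove it, and return it in one step. The array stands in for the "queue" collection, `takeNext` is a hypothetical helper, and the sample jobs are made up. Single-threaded JavaScript makes the step trivially atomic here; the point of findAndModify is that the server gives the same guarantee across concurrent clients.

```javascript
// In-memory sketch of the shared-queue pattern (not driver code).
var queue = [
  { _id: 1, priority: 3 },
  { _id: 2, priority: 9 },
  { _id: 3, priority: 5 }
];

function takeNext() {
  if (queue.length === 0) throw new Error("no matching document");
  // sort: {priority: -1} means the highest priority comes first.
  var best = 0;
  for (var i = 1; i < queue.length; i++) {
    if (queue[i].priority > queue[best].priority) best = i;
  }
  // remove: true deletes the selected document and returns it.
  return { ok: 1, value: queue.splice(best, 1)[0] };
}

var first = takeNext();
var second = takeNext();
console.log(first, second, queue);
```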
findAndModify example 2
• Auto-increment
function getNextValue(counterName) {
var r = runCommand ({findAndModify: "counters",
query: {_id: counterName},
update: {$inc: {value: 1}},
upsert: true,
new: true })
return r.value.value }
> getNextValue("x") -> 1
> getNextValue("x") -> 2
> getNextValue("y") -> 1
> getNextValue("x") -> 3
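The counter pattern can be traced in memory to see where each value comes from. In this sketch a Map stands in for the "counters" collection and `findAndIncrement` is a hypothetical stand-in for the command. As in the queue example's output, the command result wraps the document in a `value` field, so the counter itself is read from that document's own `value` field.

```javascript
// In-memory sketch of the auto-increment pattern (not driver code).
var counters = new Map();

function findAndIncrement(id) {
  var doc = counters.get(id);
  if (doc === undefined) {              // upsert: true creates a missing counter
    doc = { _id: id, value: 0 };
    counters.set(id, doc);
  }
  doc.value += 1;                       // update: {$inc: {value: 1}}
  // new: true returns the document as it is after the update.
  return { ok: 1, value: Object.assign({}, doc) };
}

function getNextValue(counterName) {
  var r = findAndIncrement(counterName);
  return r.value.value;                 // result.value is the document
}

var seen = [
  getNextValue("x"),
  getNextValue("x"),
  getNextValue("y"),
  getNextValue("x")
];
console.log(seen);
```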
Capped collection
• Very fast fixed-size circular buffer
– Oldest documents removed when new documents
inserted
– No indexes
– Supports queries, scans in (reverse) insert order
– Good for logging
• Used internally for oplog
• createCollection ("foo", {capped: true, size: bytes
[, max: numDocs]})
– Create capped collection before first use
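The eviction behaviour can be sketched with a small buffer that is limited by document count only. A real capped collection is limited by size in bytes, with max as an optional extra. `CappedBuffer` is a hypothetical class, and it uses a plain array rather than a true circular layout.

```javascript
// Sketch of capped-collection behaviour, limited by document count only.
// CappedBuffer is a hypothetical class; a plain array replaces the circular layout.
function CappedBuffer(max) {
  this.max = max;
  this.docs = [];
}

// Inserting past the limit drops the oldest document.
CappedBuffer.prototype.insert = function (doc) {
  this.docs.push(doc);
  if (this.docs.length > this.max) this.docs.shift();
};

// Scans return documents in insert order, or reverse insert order.
CappedBuffer.prototype.find = function (reverse) {
  var copy = this.docs.slice();
  return reverse ? copy.reverse() : copy;
};

var log = new CappedBuffer(3);
[1, 2, 3, 4, 5].forEach(function (n) { log.insert({ n: n }); });

var forward = log.find().map(function (d) { return d.n; });
var backward = log.find(true).map(function (d) { return d.n; });
console.log(forward, backward);
```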
Capped collection, tailable cursor
• Similar to Unix "tail -f" command
• Cursor never finishes, next() just waits for next
inserted document
• Cannot sort results, insert order only
GridFS
• Specification for storing large files
(bytestrings) in MongoDB
• Interface
– store (filename, bytes)
– fetch (filename)
• Implementation
– Break file into chunks of 256KB (default)
– Store chunks in “fs.chunks” collection
– Store filename, file size, etc., in “fs.files” collection
GridFS implementation
• “files” schema
– _id: ObjectId
– filename: String
– length: Int – size of file in bytes
– chunkSize: Int – size of each chunk (default 256KB)
– uploadDate: Date – date when object stored
– md5: String – result of filemd5 on file
• Additional fields may be added by user
• Unique index on {filename: 1}
GridFS implementation
• “chunks” schema
– _id: ObjectId
– files_id: ObjectId – id of file in "files" collection
– n: Int – chunk number
– data: Binary – chunk bytes
• Unique index on {files_id: 1, n: 1}
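The chunking step can be sketched as a function that splits a byte array into documents shaped like the "chunks" schema. `toChunks` is a hypothetical helper, and the files_id here is a placeholder string rather than a real ObjectId.

```javascript
// Sketch: split a byte array into GridFS-style chunk documents.
// toChunks is a hypothetical helper; "file-1" is a placeholder, not an ObjectId.
var DEFAULT_CHUNK_SIZE = 256 * 1024;   // 256KB default from the slides

function toChunks(filesId, bytes, chunkSize) {
  chunkSize = chunkSize || DEFAULT_CHUNK_SIZE;
  var chunks = [];
  for (var offset = 0, n = 0; offset < bytes.length; offset += chunkSize, n++) {
    chunks.push({
      files_id: filesId,                                   // id of the file document
      n: n,                                                // chunk number, from 0
      data: bytes.subarray(offset, offset + chunkSize)     // chunk bytes
    });
  }
  return chunks;
}

// A 600KB file yields two full chunks and one partial chunk.
var file = new Uint8Array(600 * 1024);
var chunks = toChunks("file-1", file);
console.log(chunks.map(function (c) { return [c.n, c.data.length]; }));
```

Reading the file back amounts to fetching the chunks for one files_id in order of n and concatenating their data, which is what the {files_id: 1, n: 1} index supports.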