MapReduce,
Geospatial Indexing,
& Other Cool Features
Tony Hannan
Software Engineer, 10gen
tony@10gen.com
MongoDB Chicago, Oct 2010
“Cool” features not covered
• Scaling
– Auto sharding
– Replication with auto failover
• Administration
– Monitoring, diagnostics, profiler
– Backup, import/export
• General querying & indexing
“Cool” features covered
• MapReduce aggregation
• Geospatial indexing
• Confirm write (getLastError)
• Atomic read and write (findAndModify)
• Capped collection
• GridFS
MapReduce
• General aggregation query
• Examples
– What is the average weight of a collection of
bicycles?
– Given a collection of docs, each with an array of
tags, how many times does each tag appear in the
collection?
• What are the 10 most popular tags?
MapReduce
• map :: Document -> Array (key, value)
• reduce :: (key, Array value) -> value
• finalize :: (key, value) -> result
• runCommand ({mapreduce: collection, map: map,
reduce: reduce [, finalize: finalize]}) :: Array (key, result)
1. Apply map to each document
2. Collect all (key, value) pairs and group by key
3. Apply reduce to each key and its group of values
4. Apply finalize to each key and reduced value
• Array of results is stored as a result collection
– Which can be queried and/or mapReduced (again!)
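The four steps above can be sketched in plain JavaScript, with no server involved. This is only an illustration and not MongoDB's implementation: the `mapReduce` helper, the sample documents, and passing `emit` to `map` as an argument (in MongoDB it is available inside `map` without being passed) are all assumptions made for this sketch.

```javascript
// In-memory sketch of the four steps; not MongoDB's implementation.
// map is called with the document as `this` and receives emit as an argument.
function mapReduce(docs, map, reduce, finalize) {
  const groups = new Map();               // key -> array of emitted values
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // Steps 1 and 2: apply map to each document and group the pairs by key.
  for (const doc of docs) map.call(doc, emit);
  // Steps 3 and 4: reduce each group, then finalize each reduced value.
  const results = [];
  for (const [key, values] of groups) {
    const reduced = reduce(key, values);
    results.push({ _id: key, value: finalize ? finalize(key, reduced) : reduced });
  }
  return results;
}

// Hypothetical sample data: count tags, ignoring case.
const docs = [
  { tags: ["Bike", "red"] },
  { tags: ["bike", "Fast"] },
  { tags: ["RED"] },
];
const tagCounts = mapReduce(
  docs,
  function (emit) { for (const t of this.tags) emit(t.toLowerCase(), 1); },
  (tag, counts) => counts.reduce((a, b) => a + b, 0),
  (tag, count) => count
);
console.log(tagCounts); // bike: 2, red: 2, fast: 1
```

Because the result is just an array of `{_id, value}` records, it can be sorted or passed through `mapReduce` again, which is the in-memory analogue of querying the result collection.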
MapReduce example
• Given a collection of docs, each with an array of
tags, how many times does each tag appear in the
collection (ignoring case)?
map = function() {
for (var i in this.tags) emit(this.tags[i].toLowerCase(), 1) } // *
reduce = function(tag,counts) {
var total = 0
for (var i in counts) total += counts[i]
return total }
finalize = function(tag,count) {return count}
• * In MongoDB, map input is ‘this’ and output is
emitted instead of returned
MapReduce scaling
• Output of reduce step can be input to another
reduce step
– reduce (k, [reduce (k,vs)]) == reduce (k,vs)
– This allows easy divide-and-conquer parallelism
[Diagram: four map steps each feed a reduce; those outputs are merged pairwise by further reduce steps into a single reduce, whose output goes to finalize]
MapReduce example 2
• What is the average weight of a collection of
bicycles?
map = function() {emit(0, {sum: this.weight, count: 1})}
reduce = function(k,vs) {
var v = {sum: 0, count: 0}
for (var i in vs) {
v.sum += vs[i].sum
v.count += vs[i].count }
return v }
finalize = function(k,v) {return v.sum/v.count}
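A quick check, in plain JavaScript, of why this example emits `{sum, count}` pairs rather than averages: the pairs can be reduced in pieces and then reduced again with the same result, which the scaling property requires. The sample weights below are made up for illustration.

```javascript
// reduce and finalize as in the example above, written as plain functions.
function reduce(k, vs) {
  const v = { sum: 0, count: 0 };
  for (const x of vs) { v.sum += x.sum; v.count += x.count; }
  return v;
}
function finalize(k, v) { return v.sum / v.count; }

// Hypothetical sample weights; each becomes the value that map would emit.
const weights = [10, 12, 14, 20];
const emitted = weights.map(w => ({ sum: w, count: 1 }));

// Reduce everything at once...
const direct = reduce(0, emitted);
// ...or reduce two halves separately and then reduce the partial results.
const partial = reduce(0, [
  reduce(0, emitted.slice(0, 2)),
  reduce(0, emitted.slice(2)),
]);

console.log(direct, partial);      // both are { sum: 56, count: 4 }
console.log(finalize(0, direct));  // 14
```

Had map emitted the weight itself and reduce returned an average, re-reducing partial averages would give a wrong answer whenever the groups differ in size. Deferring the division to finalize avoids that.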
MapReduce optional parameters
• Apply mapReduce to subset of documents
– query: Eg. {price: {$lt: 1000}}
– limit: Eg. 100
– sort: only meaningful with limit. Eg. {price: 1}
• Result collection
– out: name of permanent result collection
• Temporary collection used if missing
– keeptemp: make temporary collection permanent
• Otherwise temporary collection is dropped when connection closes
• Javascript scope
– scope: scope contains user-defined functions and values for
map, reduce, and finalize to use
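As an illustration, the optional parameters might be combined into one command document like the one below. The collection name, output name, and the `gramsPerKg` scope value are hypothetical. The block only builds the object, so `emit` and `gramsPerKg` are never evaluated here; in a mongo shell the object would be handed to `db.runCommand`.

```javascript
// Hypothetical command document; collection and output names are made up.
const command = {
  mapreduce: "bicycles",
  map: function () { emit(0, { sum: this.weight * gramsPerKg, count: 1 }); },
  reduce: function (k, vs) {
    var v = { sum: 0, count: 0 };
    for (var i in vs) { v.sum += vs[i].sum; v.count += vs[i].count; }
    return v;
  },
  finalize: function (k, v) { return v.sum / v.count; },
  query: { price: { $lt: 1000 } },  // only bicycles priced under 1000
  sort: { price: 1 },               // cheapest first...
  limit: 100,                       // ...and only the first 100 of them
  out: "cheap_bicycle_weights",     // permanent result collection
  scope: { gramsPerKg: 1000 }       // made visible to map, reduce, finalize
};

console.log(Object.keys(command));
```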
Geospatial indexing
• Query by geographic location
– Eg. Find the 10 closest museums to my location
• Location field must be an array or document of
two numbers
– Eg. [-87.7, 41.8], {long: -87.7, lat: 41.8}
– Any units can be used, but normally (long, lat) degrees
• ensureIndex ({loc: "2d"} [, {min: -180, max: 180}])
– Index is required for geospatial queries
• Error otherwise
Geospatial querying
• N closest in closeness order
– find ({loc: {$near: [-87.7, 41.8]}}).limit(N)
– Default limit is 100 if omitted
• Closest within given distance
– find ({loc: {$near: [-87.7, 41.8], $maxDistance: 5}})
• Locations within circle or box
– find ({loc: {$within: {$center: [[-87.7, 41.8], 5]}}})
– find ({loc: {$within: {$box: [[-89,40],[-86,44]]}}})
• Box params are [lower-left, upper-right]
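To make the query semantics concrete, here is a brute-force sketch in plain JavaScript of what `$near`, `$maxDistance`, and `$within $box` select, using flat Euclidean distance. It scans every document rather than using an index, handles only the array form of the location field, and the `places` data is invented for the example.

```javascript
// Flat (Euclidean) distance between two [x, y] points.
function distance(a, b) { return Math.hypot(a[0] - b[0], a[1] - b[1]); }

// $near: sort by distance, drop points beyond maxDistance, keep `limit` docs.
function near(docs, point, { maxDistance = Infinity, limit = 100 } = {}) {
  return docs
    .map(d => ({ doc: d, dist: distance(d.loc, point) }))
    .filter(x => x.dist <= maxDistance)
    .sort((x, y) => x.dist - y.dist)
    .slice(0, limit)
    .map(x => x.doc);
}

// $within $box: keep points inside [lowerLeft, upperRight].
function withinBox(docs, [ll, ur]) {
  return docs.filter(d =>
    d.loc[0] >= ll[0] && d.loc[0] <= ur[0] &&
    d.loc[1] >= ll[1] && d.loc[1] <= ur[1]);
}

// Hypothetical sample data.
const places = [
  { name: "C", loc: [-80.0, 35.0] },
  { name: "B", loc: [-87.5, 41.8] },
  { name: "A", loc: [-87.7, 41.9] },
];
const here = [-87.7, 41.8];

console.log(near(places, here, { limit: 2 }).map(p => p.name));          // [ 'A', 'B' ]
console.log(near(places, here, { maxDistance: 0.15 }).map(p => p.name)); // [ 'A' ]
console.log(withinBox(places, [[-89, 40], [-86, 44]]).map(p => p.name)); // [ 'B', 'A' ]
```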
Geospatial, compound index
• Query by location and something else
– ensureIndex ({loc: "2d", tag: 1})
• Compound index not required, just loc index
– find ({loc: {$near: [-87.7, 41.8]}, tag: "museum"})
Geospatial, current limitations
• Flat Euclidean geometry
– Spherical geometry in version 1.7
• To use, append “Sphere” to $near or $center
• Coordinate units must be (long, lat) degrees
• Distance units must be radians
• No wrapping at min/max (180th meridian)
• Only one geo-index per collection allowed
• Currently doesn’t work on sharded collections
– Tentatively planned for version 1.8
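Since the spherical operators take distances in radians, a real-world distance has to be divided by the sphere's radius first. A small sketch, assuming a mean Earth radius of about 6371 km; the helper names are made up, and `$nearSphere` is the operator name obtained by appending "Sphere" to `$near` as described above.

```javascript
// Mean Earth radius in kilometres (an approximation; Earth is not a sphere).
const EARTH_RADIUS_KM = 6371;

// An arc length divided by the radius gives the angle in radians.
function kmToRadians(km) { return km / EARTH_RADIUS_KM; }
function radiansToKm(rad) { return rad * EARTH_RADIUS_KM; }

// Query document for "within 5 km of this point" using spherical geometry.
const query = {
  loc: { $nearSphere: [-87.7, 41.8], $maxDistance: kmToRadians(5) }
};

console.log(query.loc.$maxDistance);
```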
Geospatial, other
• runCommand ({geoNear: collection, near: [-87.7, 41.8], num: 10})
– Similar to find $near except includes distances
with results
Confirm write, getLastError
• Check success of previous write (on same
connection)
– runCommand ({getlasterror: 1})
> {err: E [, code: C], n: N [, updatedExisting: U]}
• E = null. Success
• E = “error message”. Failure
• C only present on failure and is the error code number
• N = num docs updated (or upserted)
• U only present when write was a save/update
• U = true. Updated existing object
• U = false. Inserted new object (upsert)
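A sketch of how client code might interpret a getlasterror result document, following the field meanings listed above. The `checkWrite` helper and both sample result documents are hypothetical.

```javascript
// Turn a getlasterror result into either a thrown Error or a small summary.
function checkWrite(result) {
  if (result.err !== null) {
    const e = new Error(result.err);   // E = "error message" means failure
    e.code = result.code;              // C is only present on failure
    throw e;
  }
  return {
    n: result.n,                                  // docs updated (or upserted)
    updated: result.updatedExisting === true,     // existing object updated
    inserted: result.updatedExisting === false    // new object inserted (upsert)
  };
}

// Hypothetical result documents, shaped as described above.
const ok = checkWrite({ err: null, n: 1, updatedExisting: true });
console.log(ok); // { n: 1, updated: true, inserted: false }

try {
  checkWrite({ err: "duplicate key", code: 11000, n: 0 });
} catch (e) {
  console.log("write failed:", e.message, e.code);
}
```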
Confirm write, safe mode
• Some language drivers have a safe write that
calls getlasterror automatically and raises
exception on error
– Eg. in Python: save({..}, safe=True)
Confirm write, getLastError variations
• runCommand ({getlasterror: 1, fsync: true})
– fsync data to disk. Blocks until finished
• In replicated environment
– runCommand ({getlasterror: 1, w: 2, wtimeout: 3000})
• Block until write reaches at least two servers or timeout
(in ms) reached
• If timeout reached, result will contain {wtimeout: true}
– Only need to do this check on last write because
all writes are replicated in order
findAndModify: atomic read & write
• Read and write a single document atomically
• Examples
– Taking from a shared queue
– Getting and incrementing a counter
findAndModify
• runCommand ({findAndModify: collection,
– query: - select document, first one if many, error if none
– sort: - pick first of sorted selection
– Either
• remove: true - delete selected document
• update: modifier - modify selected document
– new:
• false (default) – return document before remove/update
• true – return document after update
– fields: - project given fields only
– upsert:
• false (default)
• true – create object if it does not exist })
• All fields above are optional except remove or update
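The option semantics can be sketched on an in-memory array. This is a simplified stand-in, not the server's behavior: it supports only equality queries, a one-field sort, and `$inc` updates, ignores `fields`, and returns `null` for an upsert with `new: false` (a choice made for this sketch). JavaScript runs the function without interleaving, which stands in for the atomicity the server provides.

```javascript
// Apply the $inc part of an update document to doc, in place.
function applyInc(doc, update) {
  for (const [field, by] of Object.entries(update.$inc || {})) {
    doc[field] = (doc[field] || 0) + by;
  }
}

// In-memory sketch of findAndModify on an array of documents.
function findAndModify(coll, { query = {}, sort, remove, update,
                               new: returnNew = false, upsert = false }) {
  const matches = coll.filter(d =>
    Object.keys(query).every(k => d[k] === query[k]));
  if (sort) {                               // pick first of sorted selection
    const [field, dir] = Object.entries(sort)[0];
    matches.sort((a, b) => dir * (a[field] - b[field]));
  }
  let doc = matches[0];
  if (!doc) {
    if (!(upsert && update)) throw new Error("no matching document");
    doc = { ...query };                     // upsert: start from query fields
    coll.push(doc);
    applyInc(doc, update);
    return { ok: 1, value: returnNew ? { ...doc } : null };
  }
  if (remove) {
    coll.splice(coll.indexOf(doc), 1);      // delete the selected document
    return { ok: 1, value: doc };
  }
  const before = { ...doc };
  applyInc(doc, update);
  return { ok: 1, value: returnNew ? { ...doc } : before };
}

// Taking from a shared queue: highest priority first.
const queue = [{ _id: 1, priority: 3 }, { _id: 2, priority: 9 }, { _id: 3, priority: 5 }];
const taken = findAndModify(queue, { query: {}, sort: { priority: -1 }, remove: true });
console.log(taken.value.priority, queue.length); // 9 2

// Getting and incrementing a counter.
const counters = [];
function nextValue(name) {
  return findAndModify(counters, {
    query: { _id: name }, update: { $inc: { value: 1 } }, upsert: true, new: true
  }).value.value;
}
console.log(nextValue("x"), nextValue("x"), nextValue("y"), nextValue("x")); // 1 2 1 3
```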
findAndModify example
• Shared queue
– runCommand ({findAndModify: "queue",
query: {},
sort: {priority: -1},
remove: true })
> {ok: 1, value: {_id: .., priority: 9, …}}
findAndModify example 2
• Auto-increment
function getNextValue(counterName) {
var r = runCommand ({findAndModify: "counters",
query: {_id: counterName},
update: {$inc: {value: 1}},
upsert: true,
new: true })
return r.value.value } // r.value is the counter document
> getNextValue("x") -> 1
> getNextValue("x") -> 2
> getNextValue("y") -> 1
> getNextValue("x") -> 3
Capped collection
• Very fast fixed-size circular buffer
– Oldest documents removed when new documents
inserted
– No indexes
– Supports queries, scans in (reverse) insert order
– Good for logging
• Used internally for oplog
• createCollection ("foo", {capped: true, size: bytes
[, max: numDocs]})
– Create capped collection before first use
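The circular-buffer behavior can be illustrated with a small in-memory class. This is a simplification: a real capped collection is bounded by size in bytes with an optional document cap, whereas this hypothetical `CappedBuffer` is bounded only by a document count.

```javascript
// Fixed-capacity buffer that drops the oldest document on overflow.
class CappedBuffer {
  constructor(max) {
    this.max = max;    // maximum number of documents kept
    this.docs = [];    // oldest first
  }
  insert(doc) {
    this.docs.push(doc);
    if (this.docs.length > this.max) this.docs.shift(); // remove the oldest
  }
  find() { return this.docs.slice(); }                  // insert order
  findReverse() { return this.docs.slice().reverse(); } // reverse insert order
}

// Logging example: keep only the three most recent entries.
const log = new CappedBuffer(3);
for (let i = 1; i <= 5; i++) log.insert({ msg: "event " + i });

console.log(log.find().map(d => d.msg));        // [ 'event 3', 'event 4', 'event 5' ]
console.log(log.findReverse().map(d => d.msg)); // [ 'event 5', 'event 4', 'event 3' ]
```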
Capped collection, tailable cursor
• Similar to Unix “tail -f” command
• Cursor never finishes, next() just waits for next
inserted document
• Cannot sort results, insert order only
GridFS
• Specification for storing large files
(bytestrings) in MongoDB
• Interface
– store (filename, bytes)
– fetch (filename)
• Implementation
– Break file into chunks of 256KB (default)
– Store chunks in “fs.chunks” collection
– Store filename, file size, etc., in “fs.files” collection
GridFS implementation
• “files” schema
– _id: ObjectId
– filename: String
– length: Int – size of file in bytes
– chunkSize: Int – size of each chunk (default 256KB)
– uploadDate: Date – date when object stored
– md5: String – result of filemd5 on file
• Additional fields may be added by user
• Unique index on {filename: 1}
GridFS implementation
• “chunks” schema
– _id: ObjectId
– files_id: ObjectId – id of file in “files” collection
– n: Int – chunk number
– data: Binary – chunk bytes
• Unique index on {files_id: 1, n: 1}
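A minimal sketch of the chunking scheme in plain JavaScript, producing documents shaped like the two schemas above. It is not a driver's GridFS implementation: `storeFile` and `fetchFile` are hypothetical helpers, a plain number stands in for the ObjectId, chunk `_id` values are left out, and the `md5` field is omitted.

```javascript
const DEFAULT_CHUNK_SIZE = 256 * 1024; // 256KB default chunk size

// Split bytes into one "files" document and its "chunks" documents.
function storeFile(fileId, filename, bytes, chunkSize = DEFAULT_CHUNK_SIZE) {
  const file = {
    _id: fileId, filename, length: bytes.length, chunkSize, uploadDate: new Date()
  };
  const chunks = [];
  for (let n = 0; n * chunkSize < bytes.length; n++) {
    chunks.push({
      files_id: fileId,
      n,                                                       // chunk number
      data: bytes.subarray(n * chunkSize, (n + 1) * chunkSize) // chunk bytes
    });
  }
  return { file, chunks };
}

// Reassemble a file by ordering its chunks on n and concatenating the data.
function fetchFile(file, allChunks) {
  const out = new Uint8Array(file.length);
  let offset = 0;
  const mine = allChunks
    .filter(c => c.files_id === file._id)
    .sort((a, b) => a.n - b.n);
  for (const c of mine) {
    out.set(c.data, offset);
    offset += c.data.length;
  }
  return out;
}

// A 600KB payload splits into chunks of 256KB, 256KB, and 88KB.
const payload = new Uint8Array(600 * 1024).map((_, i) => i % 251);
const stored = storeFile(1, "example.bin", payload);
console.log(stored.chunks.map(c => c.data.length)); // [ 262144, 262144, 90112 ]
```

The unique index on `{files_id: 1, n: 1}` corresponds to the filter and sort in `fetchFile`: chunks are looked up by file and read back in chunk-number order.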
  • 5. MapReduce • map :: Document -> Array (key, value) • reduce :: (key, Array value) -> value • finalize :: (key, value) -> result • runCommand ({mapreduce: collection, map: map, reduce: reduce [, finalize: finalize]}) :: Array (key, result) 1. Apply map to each document 2. Collect all (key, value) pairs and group by key 3. Apply reduce to each key and its group of values 4. Apply finalize to each key and reduced value • Array of results is stored as a result collection – Which can be queried and/or mapReduced (again!)
  • 6. MapReduce example • Given a collection of docs, each with an array of tags, how many times does each tag appear in the collection (ignoring case)? map = function() { for (var i in this.tags) emit(lowercase(this.tags[i]),1) } * reduce = function(tag,counts) {return sum(counts)} finalize = function(tag,count) {return count} • * In MongoDB, map input is ‘this’ and output is emitted instead of returned
  • 7. MapReduce scaling • Output of reduce step can be input to another reduce step – reduce (k, [reduce (k,vs)]) == reduce (k,vs) – This allows easy divide-and-conquer parallelism [Diagram: tree of map outputs feeding successive reduce steps, with finalize applied at the top]
  • 8. MapReduce example 2 • What is the average weight of a collection of bicycles? map = function() {emit(0, {sum: this.weight, count: 1})} reduce = function(k,vs) { var v = {sum: 0, count: 0}; for (var i in vs) { v.sum += vs[i].sum; v.count += vs[i].count } return v } finalize = function(k,v) {return v.sum/v.count}
  • 9. MapReduce optional parameters • Apply mapReduce to subset of documents – query: Eg. {price: {$lt: 1000}} – limit: Eg. 100 – sort: only meaningful with limit. Eg. {price: 1} • Result collection – out: name of permanent result collection • Temporary collection used if missing – keeptemp: make temporary collection permanent • Otherwise temporary collection is dropped when connection closes • JavaScript scope – scope: scope contains user-defined functions and values for map, reduce, and finalize to use
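The four steps on slide 5 can be mimicked in plain JavaScript without a server. The sketch below is an in-memory stand-in, not MongoDB's implementation: `mapReduce` is a hypothetical helper, `emit` is passed to `map` as an argument instead of being a global, and the sample documents are made up. It runs the tag count from slide 6 and the average weight from slide 8, then checks the re-reduce property from slide 7.

```javascript
// In-memory stand-in for the four mapReduce steps (hypothetical helper, not MongoDB code).
function mapReduce(docs, map, reduce, finalize) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // Steps 1 and 2: apply map to each document (bound to `this`) and group emitted values by key.
  for (const doc of docs) map.call(doc, emit);
  // Steps 3 and 4: reduce each group, then finalize the reduced value if finalize is given.
  return [...groups].map(([key, values]) => {
    const reduced = reduce(key, values);
    return { _id: key, value: finalize ? finalize(key, reduced) : reduced };
  });
}

// Slide 6: count tags, ignoring case. Sample documents are made up.
const posts = [{ tags: ["MongoDB", "geo"] }, { tags: ["mongodb", "MapReduce"] }, { tags: ["GEO"] }];
const sumCounts = (tag, counts) => counts.reduce((a, b) => a + b, 0);
const tagCounts = mapReduce(
  posts,
  function (emit) { for (const tag of this.tags) emit(tag.toLowerCase(), 1); },
  sumCounts
);

// Slide 8: average weight, carrying {sum, count} so partial results can be re-reduced.
const bikes = [{ weight: 10 }, { weight: 14 }, { weight: 12 }];
const addPairs = (k, vs) =>
  vs.reduce((v, x) => ({ sum: v.sum + x.sum, count: v.count + x.count }), { sum: 0, count: 0 });
const average = mapReduce(
  bikes,
  function (emit) { emit(0, { sum: this.weight, count: 1 }); },
  addPairs,
  (k, v) => v.sum / v.count
);

// Slide 7: reducing partial results gives the same answer as reducing everything at once.
const all = bikes.map((b) => ({ sum: b.weight, count: 1 }));
const inOneStep = addPairs(0, all);
const inTwoSteps = addPairs(0, [addPairs(0, all.slice(0, 2)), addPairs(0, all.slice(2))]);

console.log(tagCounts, average, inOneStep, inTwoSteps);
```

Emitting `{sum, count}` instead of a running average is what makes the reduce output a valid reduce input; a reduce that returned the average directly would give wrong results when partial results are combined.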
  • 10. Geospatial indexing • Query by geographic location – Eg. Find the 10 closest museums to my location • Location field must be an array or document of two numbers – Eg. [-87.7, 41.8], {long: -87.7, lat: 41.8} – Any units can be used, but normally (long, lat) degrees • ensureIndex ({loc: “2d”} [, {min: -180, max: 180}]) – Index is required for geospatial queries • Error otherwise
  • 11. Geospatial querying • N closest in closeness order – find ({loc: {$near: [-87.7, 41.8]}}).limit(N) – Default limit is 100 if elided • Closest within given distance – find ({loc: {$near: [-87.7, 41.8], $maxDistance: 5}}) • Locations within circle or box – find ({loc: {$within: {$center: [[-87.7, 41.8], 5]}}}) – find ({loc: {$within: {$box: [[-89,40],[-86,44]]}}}) • Box params are [lower-left, upper-right]
  • 12. Geospatial, compound index • Query by location and something else – ensureIndex ({loc: “2d”, tag: 1}) • Compound index not required, just loc index – find ({loc: {$near: [-87.7, 41.8]}, tag: “museum”})
  • 13. Geospatial, current limitations • Flat Euclidean geometry – Spherical geometry in version 1.7 • To use, append “Sphere” to $near or $center • Coordinate units must be (long, lat) degrees • Distance units must be radians • No wrapping at min/max (180th meridian) • Only one geo-index per collection allowed • Currently doesn’t work on sharded collections – Tentatively planned for version 1.8
  • 14. Geospatial, other • runCommand ({geoNear: collection, near: [-87.7, 41.8], num: 10}) – Similar to find $near except includes distances with results
  • 15. Confirm write, getLastError • Check success of previous write (on same connection) – runCommand ({getlasterror: 1}) > {err: E [, code: C], n: N [, updatedExisting: U]} • E = null. Success • E = “error message”. Failure • C only present on failure and is the error code number • N = num docs updated (or upserted) • U only present when write was a save/update • U = true. Updated existing object • U = false. Inserted new object (upsert)
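The query shapes on slides 11 to 13 assume flat Euclidean distance. The sketch below shows those semantics as a brute-force scan over an array; it is not how a 2d index works internally. The helper names (`near`, `withinCenter`, `withinBox`) and the sample places are assumptions for illustration, and whether the real `$maxDistance` bound is inclusive is not stated in the slides.

```javascript
// Brute-force sketch of flat (Euclidean) geo queries; a real 2d index avoids scanning every document.
const dist = ([x1, y1], [x2, y2]) => Math.hypot(x1 - x2, y1 - y2);

// Like $near: closest first, optionally bounded by maxDistance, limited to `limit` results.
function near(docs, point, { maxDistance = Infinity, limit = 100 } = {}) {
  return docs
    .map((doc) => ({ doc, d: dist(doc.loc, point) }))
    .filter(({ d }) => d <= maxDistance)
    .sort((a, b) => a.d - b.d)
    .slice(0, limit)
    .map(({ doc }) => doc);
}

// Like $within/$center: everything inside a circle given as [center, radius].
const withinCenter = (docs, [center, radius]) =>
  docs.filter((doc) => dist(doc.loc, center) <= radius);

// Like $within/$box: the box is [lower-left, upper-right].
const withinBox = (docs, [[minX, minY], [maxX, maxY]]) =>
  docs.filter(({ loc: [x, y] }) => x >= minX && x <= maxX && y >= minY && y <= maxY);

// Made-up sample data in (long, lat) degrees.
const places = [
  { name: "a", loc: [-87.6, 41.9] },
  { name: "b", loc: [-87.7, 41.8] },
  { name: "c", loc: [-80.0, 35.0] },
];
const here = [-87.7, 41.8];
const closestTwo = near(places, here, { limit: 2 }).map((p) => p.name);
const nearby = near(places, here, { maxDistance: 5 }).map((p) => p.name);
const inCircle = withinCenter(places, [here, 5]).map((p) => p.name);
const inBox = withinBox(places, [[-89, 40], [-86, 44]]).map((p) => p.name);
console.log(closestTwo, nearby, inCircle, inBox);
```

Only `near` orders its results by distance; the two `within` helpers return matches in their stored order.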
  • 16. Confirm write, safe mode • Some language drivers have a safe write that calls getlasterror automatically and raises exception on error – Eg. in Python: save({..}, safe=True)
  • 17. Confirm write, getLastError variations • runCommand ({getlasterror: 1, fsync: true}) – fsync data to disk. Blocks until finished • In replicated environment – runCommand ({getlasterror: 1, w: 2, wtimeout: 3000}) • Block until write reaches at least two servers or timeout (in ms) reached • If timeout reached, result will contain {wtimeout: true} – Only need to do this check on last write because all writes are replicated in order
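A driver's safe mode (slide 16) amounts to inspecting the getlasterror reply and raising on failure. The sketch below shows one way that check could look, using only the reply fields listed on slides 15 and 17. `checkLastError` is a hypothetical helper, and the replies are hand-written samples, not output captured from a server.

```javascript
// Interpret a getlasterror reply shaped like {err, code, n, updatedExisting, wtimeout}.
// Hypothetical helper; the sample replies below are hand-written, not server output.
function checkLastError(reply) {
  if (reply.wtimeout) throw new Error("write did not replicate before wtimeout");
  if (reply.err !== null) throw new Error(`write failed (code ${reply.code}): ${reply.err}`);
  if (reply.updatedExisting === true) return "updated";
  if (reply.updatedExisting === false) return "upserted";
  return "ok"; // updatedExisting is absent when the write was not a save/update
}

const outcomes = [
  { err: null, n: 1, updatedExisting: true },
  { err: null, n: 1, updatedExisting: false },
  { err: null, n: 0 },
].map(checkLastError);

let failure = null;
try {
  checkLastError({ err: "sample error message", code: 123, n: 0 }); // placeholder error
} catch (e) {
  failure = e.message;
}
console.log(outcomes, failure);
```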
  • 18. findAndModify: atomic read & write • Read and write a single document atomically • Examples – Taking from a shared queue – Getting and incrementing a counter
  • 19. findAndModify • runCommand ({findAndModify: collection, – query: - select document, first one if many, error if none – sort: - pick first of sorted selection – Either • remove: true - delete selected document • update: modifier - modify selected document – new: • false (default) – return document before remove/update • true – return document after update – fields: - project given fields only – upsert: • false (default) • true – create object if it does not exist }) • All fields above are optional except remove or update
  • 20. findAndModify example • Shared queue – runCommand ({findAndModify: “queue”, query: {}, sort: {priority: -1}, remove: true }) > {ok: 1, value: {_id: .., priority: 9, …}}
  • 21. findAndModify example 2 • Auto-increment function getNextValue(counterName) { var r = runCommand ({findAndModify: “counters”, query: {_id: counterName}, update: {$inc: {value: 1}}, upsert: true, new: true }) return r.value.value } > getNextValue(“x”) -> 1 > getNextValue(“x”) -> 2 > getNextValue(“y”) -> 1 > getNextValue(“x”) -> 3
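The counter on slide 21 relies on three options working together: `$inc`, `upsert: true`, and `new: true`. The sketch below reproduces that behaviour in memory so the sequence can be traced without a server. `findAndIncrement` and the `counters` map are hypothetical stand-ins; on a real server the find and the modify happen as one atomic step, which single-threaded JavaScript gets here for free.

```javascript
// In-memory sketch of findAndModify with {update: {$inc: ...}, upsert: true, new: true}.
// Hypothetical stand-in, not the server command.
const counters = new Map(); // _id -> document

function findAndIncrement(id, field, amount) {
  let doc = counters.get(id);
  if (!doc) {
    // upsert: true creates the document when the query matches nothing
    doc = { _id: id, [field]: 0 };
    counters.set(id, doc);
  }
  doc[field] += amount; // $inc
  return { ok: 1, value: { ...doc } }; // new: true returns the document after the update
}

function getNextValue(counterName) {
  // The reply wraps the document in `value`; the counter is the document's own `value` field.
  return findAndIncrement(counterName, "value", 1).value.value;
}

const sequence = [getNextValue("x"), getNextValue("x"), getNextValue("y"), getNextValue("x")];
console.log(sequence);
```

With `new: false` the same call would return the document as it was before the increment, so the first call on a fresh counter would have no previous document to return.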
  • 22. Capped collection • Very fast fixed-size circular buffer – Oldest documents removed when new documents inserted – No indexes – Supports queries, scans in (reverse) insert order – Good for logging • Used internally for oplog • createCollection (“foo”, {capped: true, size: bytes [, max: numDocs]}) – Create capped collection before first use
  • 23. Capped collection, tailable cursor • Similar to Unix “tail -f” command • Cursor never finishes, next() just waits for next inserted document • Cannot sort results, insert order only
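The circular-buffer behaviour on slide 22 can be sketched with a fixed array and a write position. This is an illustration only: `CappedBuffer` is a hypothetical class capped by document count, whereas a real capped collection is sized in bytes with `max` as an optional extra limit.

```javascript
// Fixed-size buffer sketch of capped-collection behaviour, capped by document count only.
// Hypothetical class, not MongoDB's storage layout.
class CappedBuffer {
  constructor(max) {
    this.max = max;
    this.slots = new Array(max);
    this.next = 0; // slot the next insert will write to
    this.count = 0;
  }

  insert(doc) {
    this.slots[this.next] = doc; // overwrites the oldest document once the buffer is full
    this.next = (this.next + 1) % this.max;
    this.count = Math.min(this.count + 1, this.max);
  }

  // Documents in insert order, oldest first; pass reverse=true for newest first.
  find(reverse = false) {
    const start = (this.next - this.count + this.max) % this.max;
    const docs = [];
    for (let i = 0; i < this.count; i++) docs.push(this.slots[(start + i) % this.max]);
    return reverse ? docs.reverse() : docs;
  }
}

const log = new CappedBuffer(3);
for (let i = 1; i <= 5; i++) log.insert({ msg: `event ${i}` });
const inOrder = log.find().map((d) => d.msg);
const newestFirst = log.find(true).map((d) => d.msg);
console.log(inOrder, newestFirst);
```

Inserts never shift or reallocate anything, which is why this layout suits logging.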
  • 24. GridFS • Specification for storing large files (bytestrings) in MongoDB • Interface – store (filename, bytes) – fetch (filename) • Implementation – Break file into chunks of 256KB (default) – Store chunks in “fs.chunks” collection – Store filename, file size, etc., in “fs.files” collection
  • 25. GridFS implementation • “files” schema – _id: ObjectId – filename: String – length: Int – size of file in bytes – chunkSize: Int – size of each chunk (default 256KB) – uploadDate: Date – date when object stored – md5: String – result of filemd5 on file • Additional fields may be added by user • Unique index on {filename: 1}
  • 26. GridFS implementation • “chunks” schema – _id: ObjectId – files_id: ObjectId – id of file in “files” collection – n: Int – chunk number – data: Binary – chunk bytes • Unique index on {files_id: 1, n: 1}
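The store/fetch interface and the two schemas on slides 24 to 26 can be sketched in memory as follows. This is a minimal illustration under stated assumptions: `storeFile` and `fetchFile` are hypothetical names for the slide's store and fetch, the two arrays stand in for the “fs.files” and “fs.chunks” collections, ids are plain counters instead of ObjectIds, and the md5 field is omitted.

```javascript
// Sketch of the GridFS layout: one "files" document plus numbered "chunks" documents.
// Hypothetical in-memory helpers, not a driver API.
const fsFiles = [];
const fsChunks = [];
let nextId = 1;

function storeFile(filename, bytes, chunkSize = 256 * 1024) {
  const file = { _id: nextId++, filename, length: bytes.length, chunkSize, uploadDate: new Date() };
  fsFiles.push(file);
  // Chunk n holds bytes [n * chunkSize, (n + 1) * chunkSize); the last chunk may be shorter.
  for (let n = 0; n * chunkSize < bytes.length; n++) {
    const data = bytes.slice(n * chunkSize, (n + 1) * chunkSize);
    fsChunks.push({ _id: nextId++, files_id: file._id, n, data });
  }
  return file;
}

function fetchFile(filename) {
  const file = fsFiles.find((f) => f.filename === filename);
  if (!file) return null;
  // Reassemble in chunk-number order, mirroring a lookup on {files_id: 1, n: 1}.
  const chunks = fsChunks.filter((c) => c.files_id === file._id).sort((a, b) => a.n - b.n);
  const bytes = new Uint8Array(file.length);
  chunks.forEach((c) => bytes.set(c.data, c.n * file.chunkSize));
  return bytes;
}

const original = Uint8Array.from({ length: 10 }, (_, i) => i);
const file = storeFile("demo.bin", original, 4); // tiny chunk size so the split is visible
const chunkSizes = fsChunks.map((c) => c.data.length);
const roundTrip = fetchFile("demo.bin");
console.log(file.length, chunkSizes, Array.from(roundTrip));
```

Storing `chunkSize` on the file document lets a reader compute each chunk's byte offset from `n` alone.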