MongoDB San Francisco 2013: Advanced Analytics on MongoDB presented by John A. De Goes, CTO, Precog

Scientific data sets are messy (loose data structures, evolving schemas) and large. MongoDB is becoming increasingly popular in the scientific computing space for precisely these reasons. We discuss the advantages of using MongoDB in scientific computing, and describe how we've built the Scientific Computing infrastructure for The Materials Project using MongoDB. We also discuss "warts" in the MongoDB implementation that affect our choices of how and when to use it.

Published in: Technology

  1. Advanced Analytics on MongoDB
     Wifi: PalaceMeetingRooms/mongodb
     MongoDB Day SF, May 10, 2013
     www.precog.com
     @precog
     Nov - Dec 2012
  2. Native MongoDB Analytics
  3. mongo query - basic
     ■ Mongo has support for a small set of simple aggregation primitives
       ○ count - returns the count of a given collection's documents, with optional filtering
       ○ distinct - returns the distinct values for given selector criteria
       ○ group - returns groups of documents based on given key criteria. Group cannot be used in sharded configurations
  4. mongo query - group
       > db.london_medals.group({
           key : {"Country":1},
           reduce : function(curr, result) { result.total += 1 },
           initial: { total : 0, fullTotal: db.london_medals.count() },
           finalize: function(result){ result.percent = result.total * 100 / result.fullTotal }
         })
       [
         {"Country" : "Great Britain", "total" : 88, "fullTotal" : 1019, "percent" : 8.635917566241414},
         {"Country" : "Dominican Republic", "total" : 2, "fullTotal" : 1019, "percent" : 0.19627085377821393},
         {"Country" : "Denmark", "total" : 16, "fullTotal" : 1019, "percent" : 1.5701668302257115},
         ...
     ■ More sophisticated queries are possible, but require a lot of JS and you'll hit the limits pretty quickly
     ■ Group cannot be used in sharded configurations. For that you need...
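To make the roles of `key`, `reduce`, `initial`, and `finalize` concrete, here is a minimal plain-JavaScript sketch that imitates the group command on an in-memory array. It is not MongoDB's implementation; the `group` helper and the `sampleMedals` data are hypothetical and exist only for illustration.

```javascript
// Made-up sample data standing in for the london_medals collection.
const sampleMedals = [
  { Country: "Great Britain" },
  { Country: "Denmark" },
  { Country: "Great Britain" },
  { Country: "Great Britain" },
];

// Hypothetical in-memory imitation of group(); not the MongoDB API.
function group(docs, { key, reduce, initial, finalize }) {
  const groups = new Map();
  for (const doc of docs) {
    // Build the group key from the fields named in `key`.
    const keyDoc = {};
    for (const field of Object.keys(key)) keyDoc[field] = doc[field];
    const id = JSON.stringify(keyDoc);
    // Each new group starts from a copy of `initial`.
    if (!groups.has(id)) groups.set(id, { ...keyDoc, ...initial });
    // `reduce` mutates the running result for the document's group.
    reduce(doc, groups.get(id));
  }
  const results = [...groups.values()];
  // `finalize` runs once per group after all documents are reduced.
  if (finalize) results.forEach((r) => finalize(r));
  return results;
}

const groupResult = group(sampleMedals, {
  key: { Country: 1 },
  reduce: function (curr, result) { result.total += 1; },
  initial: { total: 0, fullTotal: sampleMedals.length },
  finalize: function (result) {
    result.percent = (result.total * 100) / result.fullTotal;
  },
});
console.log(groupResult);
```

The sketch keeps the same argument names as the slide so the two can be read side by side.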
  5. mongo map/reduce
     ■ Map/Reduce: Exactly what its name says.
     ■ You utilize JavaScript functions to map your documents' data, then reduce that data into a form of your choosing.
     [Diagram: Input Collection → Mapping Function → Reducing Function → Result Document / Output Collection]
  6. mongo map/reduce
     ■ The mapping function redefines this to be the current document
     ■ Output mapped keys and values are generated via the emit function
     ■ Emit can be called zero or more times for a single document

       function () { emit(this.Countryname, { count : 1 }); }

       function () {
         for (var i = 0; i < this.Pupils.length; i++) {
           emit(this.Pupils[i].name, { count : 1 });
         }
       }

       function () {
         if ((this.parents.age - this.age) < 25) { emit(this.age, { income : this.income }); }
       }
  7. mongo map/reduce
     ■ The reduction function is used to aggregate the outputs from the mapping function
     ■ The function receives two inputs: the key for the elements being reduced, and the values being reduced
     ■ The result of the reduction must be in the same format as the input elements, and must be idempotent

       function (key, values) {
         var count = 0;
         for (var i = 0; i < values.length; i++) {
           count += values[i].count;
         }
         return { "count" : count };
       }
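The requirement that the reducer's output have the same shape as its inputs is easier to see when the reducer is fed its own partial results. The sketch below is a toy in-memory map/reduce, not MongoDB's engine: the `mapReduce` helper and the `athletes` data are hypothetical, and `emit` is passed to the mapper as an argument here, whereas in the Mongo shell it is a global function.

```javascript
// Toy in-memory map/reduce, written only to illustrate the contract.
function mapReduce(docs, map, reduce) {
  const emitted = new Map();
  // Collect every emitted value under its key.
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  // `this` is bound to the current document, as in the slides.
  for (const doc of docs) map.call(doc, emit);

  const out = {};
  for (const [key, values] of emitted) {
    // Reduce the values in two batches, then reduce the partial results.
    // This only works if reduce() returns the same shape it receives.
    const half = Math.ceil(values.length / 2);
    const partials = [reduce(key, values.slice(0, half))];
    if (values.length > half) partials.push(reduce(key, values.slice(half)));
    out[key] = reduce(key, partials);
  }
  return out;
}

// Made-up sample documents.
const athletes = [
  { Countryname: "Denmark" },
  { Countryname: "Great Britain" },
  { Countryname: "Denmark" },
  { Countryname: "Denmark" },
];

const countsByCountry = mapReduce(
  athletes,
  function (emit) { emit(this.Countryname, { count: 1 }); },
  function (key, values) {
    var count = 0;
    for (var i = 0; i < values.length; i++) count += values[i].count;
    return { count: count };
  }
);
console.log(countsByCountry);
```

A reducer that returned a bare number instead of `{ count: ... }` would break on the second pass, because `values[i].count` would be undefined.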
  8. mongo map/reduce
     ■ Map/Reduce utilizes JavaScript to do all of its work
       ○ JavaScript in MongoDB is currently single-threaded (performance bottleneck)
       ○ Using external JS libraries is cumbersome and doesn't play well with sharding
       ○ No matter what language you're actually using, you'll be writing/maintaining JavaScript
     ■ Troubleshooting the Map/Reduce functions is primitive.
       ○ 10Gen's advice: "write your own emit function"
     ■ Output options are flexible, but have some caveats
       ○ Output to a result document must fit in a BSON doc (16MB limit)
       ○ For an output collection: if you want indices on the result set, you need to pre-create the collection, then use the merge output option
  9. mongo aggregation framework
     ■ The Aggregation Framework is designed to alleviate some of the issues with Map/Reduce for common analytical queries
     ■ New in 2.2
     ■ Works by constructing a pipeline of operations on data. Similar to M/R, but implemented in native code (higher performance, not single-threaded)
     [Diagram: Input Collection → Match → Project → Group]
  10. mongo aggregation framework
      ■ Filtering/paging ops
        ○ $match - utilize Mongo selection syntax to choose input docs
        ○ $limit
        ○ $skip
      ■ Field manipulation ops
        ○ $project - select which fields are processed. Can add new fields
        ○ $unwind - flattens a doc with an array field into multiple events, one per array value
      ■ Output ops
        ○ $group
        ○ $sort
      ■ Most common pipelines will be of the form $match ⇒ $project ⇒ $group
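Of the operators listed above, $unwind is the least obvious, so here is a rough plain-JavaScript sketch of the flattening it performs. The `unwind` helper and the `classes` data are hypothetical; skipping documents whose field is not an array is an assumption of this sketch, not a statement about MongoDB's behavior.

```javascript
// Hypothetical in-memory imitation of $unwind on a single array field.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    const values = doc[field];
    // Assumption: documents without an array in `field` produce no output.
    if (!Array.isArray(values)) continue;
    // Emit one copy of the document per array element.
    for (const value of values) out.push({ ...doc, [field]: value });
  }
  return out;
}

// Made-up sample documents.
const classes = [
  { teacher: "Ann", Pupils: ["Bo", "Cy"] },
  { teacher: "Dee", Pupils: ["Eve"] },
];

const unwound = unwind(classes, "Pupils");
console.log(unwound);
```

In an actual pipeline the stage would be written with a field path, for example `{ $unwind : "$Pupils" }`.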
  11. mongo aggregation framework
      ■ $match is very important to getting good performance
      ■ Needs to be the first op in the pipeline, otherwise indices can't be used
      ■ Uses normal MongoDB query syntax, with two exceptions
        ○ Can't use a $where clause (this requires JavaScript)
        ○ Can't use Geospatial queries (just because)

        { $match : { "Name" : "Fred" } }
        { $match : { "Countryname" : { $ne : "Great Britain" } } }
        { $match : { "Income" : { $exists : 1 } } }
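As a rough illustration of how the three example selectors behave, the sketch below evaluates a tiny subset of the query syntax (plain equality, `$ne`, and `$exists`) against in-memory documents. The `matches` helper and the `people` data are hypothetical, and real MongoDB matching has many more operators and edge cases than this.

```javascript
// Hypothetical matcher covering only equality, $ne, and $exists.
function matches(doc, query) {
  return Object.entries(query).every(([field, cond]) => {
    if (cond !== null && typeof cond === "object") {
      // Operator document, e.g. { $ne : "Great Britain" }.
      return Object.entries(cond).every(([op, arg]) => {
        if (op === "$ne") return doc[field] !== arg;
        if (op === "$exists") return (field in doc) === Boolean(arg);
        throw new Error("unsupported operator: " + op);
      });
    }
    // Bare value means equality.
    return doc[field] === cond;
  });
}

// Made-up sample documents.
const people = [
  { Name: "Fred", Countryname: "Great Britain", Income: 10 },
  { Name: "Ann", Countryname: "Denmark" },
];

const freds = people.filter((d) => matches(d, { Name: "Fred" }));
const notBritish = people.filter((d) => matches(d, { Countryname: { $ne: "Great Britain" } }));
const withIncome = people.filter((d) => matches(d, { Income: { $exists: 1 } }));
console.log(freds, notBritish, withIncome);
```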
  12. mongo aggregation framework
      ■ $project is used to select/compute/augment the fields you want in the output documents

        { $project : { "Countryname" : 1, "Sportname" : 1 } }

      ■ Can reference input document fields in computations via "$"

        { $project : { "country_name" : "$Countryname" } } /* renames field */

      ■ Computation of field values is possible, but it's limited and can be quite painful

        { $project: {
            "_id":0, "height":1, "weight":1,
            "bmi": { $divide : ["$weight", { $multiply : [ "$height", "$height" ] } ] } }
        } /* omit "_id" field, inflict pain and suffering on future maintainers... */
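The nested `$divide` / `$multiply` document is an expression tree, and reading it gets easier once you see how such a tree might be evaluated. The following is a toy evaluator in plain JavaScript, not MongoDB's implementation; `evalExpr` is a hypothetical helper that handles only the four arithmetic operators and "$field" references.

```javascript
// Toy evaluator for arithmetic $project expressions.
function evalExpr(expr, doc) {
  // "$field" reads a field from the input document.
  if (typeof expr === "string" && expr.startsWith("$")) return doc[expr.slice(1)];
  // Numeric literals evaluate to themselves.
  if (typeof expr === "number") return expr;
  // Otherwise expect { $op : [args...] } and evaluate the args recursively.
  const [op] = Object.keys(expr);
  const args = expr[op].map((e) => evalExpr(e, doc));
  switch (op) {
    case "$add": return args.reduce((a, b) => a + b, 0);
    case "$subtract": return args[0] - args[1];
    case "$multiply": return args.reduce((a, b) => a * b, 1);
    case "$divide": return args[0] / args[1];
    default: throw new Error("unsupported operator: " + op);
  }
}

// The bmi expression from the slide.
const bmiExpr = { $divide: ["$weight", { $multiply: ["$height", "$height"] }] };

// Made-up document: height in metres, weight in kilograms.
const bmi = evalExpr(bmiExpr, { height: 2, weight: 80 });
console.log(bmi);
```

Written as ordinary code, the same computation is just `weight / (height * height)`, which is the contrast the slide is drawing.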
  13. mongo aggregation framework
      ■ $group, like the group command, collates and computes sets of values based on the identity field ("_id"), and whatever other fields you want

        { $group : { "_id" : "$Countryname" } } /* distinct list of countries */

      ■ Aggregation operators can be used to perform computation ($max, $min, $avg, $sum)

        { $group : { "_id" : "$Countryname", "count" : { $sum : 1 } } } /* histogram by country */
        { $group : { "_id" : "$Countryname", "weight" : { $avg : "$weight" } } }
        { $group : { "_id" : "$Countryname", "weight" : { $sum : "$weight" } } }

      ■ Set-based operations ($addToSet, $push)

        { $group : { "_id" : "$Countryname", "sport" : { $addToSet : "$sport" } } }
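The sketch below imitates what these $group stages compute, using an in-memory array and three of the accumulators from the slide. It is an illustration only: `groupBy` is a hypothetical helper, the `medalists` data is made up, and the helper takes a plain field name for the identity rather than a "$field" path.

```javascript
// Hypothetical in-memory imitation of $group with $sum, $avg, $addToSet.
function groupBy(docs, idField, accumulators) {
  // Bucket the documents by the identity field.
  const buckets = new Map();
  for (const doc of docs) {
    const id = doc[idField];
    if (!buckets.has(id)) buckets.set(id, []);
    buckets.get(id).push(doc);
  }
  const out = [];
  for (const [id, members] of buckets) {
    const row = { _id: id };
    for (const [name, spec] of Object.entries(accumulators)) {
      const [op] = Object.keys(spec);
      const arg = spec[op];
      // A "$field" argument reads that field; anything else is a constant.
      const values = members.map((d) =>
        typeof arg === "string" && arg.startsWith("$") ? d[arg.slice(1)] : arg
      );
      if (op === "$sum") row[name] = values.reduce((a, b) => a + b, 0);
      else if (op === "$avg") row[name] = values.reduce((a, b) => a + b, 0) / values.length;
      else if (op === "$addToSet") row[name] = [...new Set(values)];
      else throw new Error("unsupported accumulator: " + op);
    }
    out.push(row);
  }
  return out;
}

// Made-up sample documents.
const medalists = [
  { Countryname: "Denmark", sport: "Sailing", weight: 80 },
  { Countryname: "Denmark", sport: "Rowing", weight: 90 },
  { Countryname: "Denmark", sport: "Sailing", weight: 70 },
  { Countryname: "Great Britain", sport: "Cycling", weight: 75 },
];

const grouped = groupBy(medalists, "Countryname", {
  count: { $sum: 1 },
  weight: { $avg: "$weight" },
  sport: { $addToSet: "$sport" },
});
console.log(grouped);
```

Note how `{ $sum : 1 }` counts documents: the constant 1 is summed once per group member.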
  14. mongo aggregation framework
      ■ Aggregation framework has a limited set of operators
        ○ $project limited to $add/$subtract/$multiply/$divide, as well as some boolean, string, and date/time operations
        ○ $group limited to $min/$max/$avg/$sum
      ■ Some operators, notably $group and $sort, are required to operate entirely in memory
        ○ This may prevent aggregation on large data sets
        ○ Can't work around this using subsetting as you can with M/R, because output is strictly a document (no collection option yet)
  15. mongo aggregation framework
      ■ Even with these tools, there are still limitations
        ○ MongoDB is not relational. This means a lot of work on your part if you have datasets representing different things that you'd like to correlate. Clicks vs views, for example
        ○ While the Aggregation Framework alleviates some of the performance issues of Map/Reduce, it does so by throwing away flexibility
        ○ The best approach for parallelization (sharding) is fraught with operational challenges
  16. Precog for MongoDB
  17. overview of precog for mongodb
      ■ Precog for MongoDB allows you to perform sophisticated analytics utilizing existing mongo instances
      ■ 100% free for non-profit or commercial use
      ■ Self-contained JAR bundling
        ○ Precog analytics server
        ○ Labcoat, a visual query builder
      ■ Does not include the full Precog stack
        ○ Minimal authentication handling (single api key in config)
        ○ No ingest service (just add data directly to mongo)
        ○ Does not store data, only streams it from MongoDB
  18. installation & setup
      ■ Download file
        ○ http://precog.com/for-developers/mongodb/
      ■ Setup

        $ unzip precog.zip
        $ cd precog
        $ emacs -nw config.cfg    (adjust ports, etc)
        $ ./precog.sh

  19. Analyzing JSON Data with Quirrel
  20. overview
      Quirrel is a statistically-oriented query language designed for the analysis of large-scale, heterogeneous data sets.
  21. quirrel
      ● Simple
      ● Statistically-oriented
      ● Purely declarative
      ● Implicitly parallel
  22. sneak peek

        pageViews := //pageViews
        avg := mean(pageViews.duration)
        bound := 1.5 * stdDev(pageViews.duration)
        pageViews.userId where pageViews.duration > avg + bound
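For readers who do not know Quirrel yet, the sneak-peek query can be read as "user IDs of page views whose duration is more than 1.5 standard deviations above the mean". A rough plain-JavaScript equivalent is sketched below. The sample data is made up, and using the population standard deviation is an assumption of this sketch; the slides do not say which definition Quirrel's `stdDev` uses.

```javascript
// Arithmetic mean of a non-empty array of numbers.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Population standard deviation (an assumption; see the note above).
function stdDev(xs) {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) * (x - m))));
}

// Made-up page views standing in for //pageViews.
const pageViews = [
  { userId: "a", duration: 100 },
  { userId: "b", duration: 100 },
  { userId: "c", duration: 100 },
  { userId: "d", duration: 100 },
  { userId: "e", duration: 600 },
];

const durations = pageViews.map((p) => p.duration);
const avg = mean(durations);
const bound = 1.5 * stdDev(durations);

// pageViews.userId where pageViews.duration > avg + bound
const outlierUsers = pageViews
  .filter((p) => p.duration > avg + bound)
  .map((p) => p.userId);
console.log(outlierUsers);
```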
  23. quirrel speaks json

        1
        true
        [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
        "All work and no play makes jack a dull boy"
        {"age": 23, "gender": "female", "interests": ["sports", "tennis"]}

  24. comments

        -- Ignore me.
        (- Ignore
           me,
           too -)

  25. basic expressions

        2 * 4
        (1 + 2) * 3 / 9 > 23
        3 > 2 & (1 != 2)
        false & true | !false

  26. named expressions

        x := 2
        square := x * x

  27. loading data

        //pageViews
        load("/pageViews")
        //campaigns/summer/2012

  28. drilldown

        pageViews := load("/pageViews")
        pageViews.userId
        pageViews.keywords[2]

  29. reductions

        count(//pageViews)
        sum((//purchases).total)
        stdDev((//purchases).total)

  30. filtering

        pageViews := //pageViews
        pageViews.userId where pageViews.duration > 1000

  31. augmentation

        clicks with {dow: dayOfWeek(clicks.time)}

  32. standard library

        import std::stats::rank
        rank((//pageViews).duration)
  33. user-defined functions

        ctr(day) :=
          count(clicks where clicks.day = day) /
          count(impressions where impressions.day = day)
        ctrOnMonday := ctr(1)
        ctrOnMonday
  34. grouping - implicit constraints

        solve 'day
          {day: 'day,
           ctr: count(clicks where clicks.day = 'day) /
                count(impressions where impressions.day = 'day)}
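The user-defined function and the solve expression compute the same click-through rate, once for a single day and once for every day. The sketch below shows one way to read that in plain JavaScript. The `clicks` and `impressions` arrays are made up, and restricting the result to days that appear in both sets is this sketch's reading of the implicit constraint, not a definitive statement of Quirrel's semantics.

```javascript
// Made-up sample data standing in for the clicks and impressions sets.
const clicks = [{ day: 1 }, { day: 1 }, { day: 2 }];
const impressions = [
  { day: 1 }, { day: 1 }, { day: 1 }, { day: 1 },
  { day: 2 }, { day: 2 }, { day: 2 }, { day: 2 },
];

// ctr(day) := count(clicks where ...) / count(impressions where ...)
function ctr(day) {
  const clickCount = clicks.filter((c) => c.day === day).length;
  const impressionCount = impressions.filter((i) => i.day === day).length;
  return clickCount / impressionCount;
}

const ctrOnMonday = ctr(1);

// Solve: evaluate the same expression for each day seen in both sets.
const days = [...new Set(clicks.map((c) => c.day))].filter((d) =>
  impressions.some((i) => i.day === d)
);
const ctrByDay = days.map((day) => ({ day: day, ctr: ctr(day) }));
console.log(ctrOnMonday, ctrByDay);
```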
  35. grouping - explicit constraints

        solve 'day = purchases.day
          {day: 'day,
           cummTotal: sum(purchases.total where purchases.day < 'day)}
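This query produces, for each purchase day, the total of all purchases made on strictly earlier days. A plain-JavaScript reading is sketched below with made-up data. Reporting 0 for the earliest day is an assumption of this sketch; the slides do not say how Quirrel treats a sum over an empty set.

```javascript
// Made-up sample data standing in for the purchases set.
const purchases = [
  { day: 1, total: 10 },
  { day: 2, total: 5 },
  { day: 2, total: 7 },
  { day: 3, total: 1 },
];

// The explicit constraint binds day to each distinct purchases.day value.
const purchaseDays = [...new Set(purchases.map((p) => p.day))].sort((a, b) => a - b);

// For each day, sum the totals of purchases made on earlier days only.
const cumulative = purchaseDays.map((day) => ({
  day: day,
  cummTotal: purchases
    .filter((p) => p.day < day)
    .reduce((sum, p) => sum + p.total, 0), // assumption: empty sum is 0
}));
console.log(cumulative);
```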
  36. questions?
      http://quirrel-lang.org
  37. Thank you!
      Follow us on Twitter: @precog, @jdegoes
      Download Precog for MongoDB for FREE: precog.com/for-developers/mongodb
      Try Precog for free and get a free account: precog.com/for-developers/
      Subscribe to our monthly newsletter: precog.com/about/contact-us/
      Nov - Dec 2012