MongoDB San Francisco 2013: Advanced Analytics on MongoDB, presented by John A. De Goes, CTO, Precog
 


Scientific data sets are messy (loose data structures, evolving schemas) and large. MongoDB is becoming increasingly popular in the scientific computing space for precisely these reasons. We discuss the advantages of using MongoDB in scientific computing, and describe how we've built the Scientific Computing infrastructure for The Materials Project using MongoDB. We also discuss "warts" in the MongoDB implementation that affect our choices of how and when to use it.

    Presentation Transcript

    • Advanced Analytics on MongoDB
      MongoDB Day SF, May 10, 2013
      Wifi: PalaceMeetingRooms/mongodb
      www.precog.com
      @precog
      Nov - Dec 2012
    • Native MongoDB Analytics
    • mongo query - basic
      ■ Mongo has support for a small set of simple aggregation primitives
        ○ count - returns the count of a given collection's documents, with optional filtering
        ○ distinct - returns the distinct values for given selector criteria
        ○ group - returns groups of documents based on given key criteria. Group cannot be used in sharded configurations
    • mongo query - group
      > db.london_medals.group({
          key : { "Country" : 1 },
          reduce : function(curr, result) { result.total += 1 },
          initial : { total : 0, fullTotal : db.london_medals.count() },
          finalize : function(result) { result.percent = result.total * 100 / result.fullTotal }
        })
      [
        {"Country" : "Great Britain", "total" : 88, "fullTotal" : 1019, "percent" : 8.635917566241414},
        {"Country" : "Dominican Republic", "total" : 2, "fullTotal" : 1019, "percent" : 0.19627085377821393},
        {"Country" : "Denmark", "total" : 16, "fullTotal" : 1019, "percent" : 1.5701668302257115},
        ...
      ■ More sophisticated queries are possible, but require a lot of JS and you'll hit the limits pretty quickly
      ■ Group cannot be used in sharded configurations. For that you need...
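The group call can be pictured as a fold over the collection. The following is a minimal plain-JavaScript sketch that runs without a MongoDB server. The `groupBy` helper and the sample documents are hypothetical; they only mimic the key/reduce/initial/finalize contract shown on the slide and are not MongoDB's implementation.

```javascript
// Hypothetical in-memory stand-in for db.collection.group(); an illustration only.
function groupBy(docs, spec) {
  var groups = {}; // serialized key -> accumulated result
  docs.forEach(function (doc) {
    // Build the group key from the fields named in spec.key.
    var key = {};
    Object.keys(spec.key).forEach(function (f) { key[f] = doc[f]; });
    var id = JSON.stringify(key);
    if (!(id in groups)) {
      // Each group starts from a fresh copy of `initial`, merged with its key.
      groups[id] = Object.assign({}, key, spec.initial);
    }
    spec.reduce(doc, groups[id]);
  });
  return Object.keys(groups).map(function (id) {
    if (spec.finalize) spec.finalize(groups[id]);
    return groups[id];
  });
}

// Made-up sample data, not the london_medals collection from the talk.
var medals = [
  { Country: "Great Britain", Medal: "Gold" },
  { Country: "Great Britain", Medal: "Silver" },
  { Country: "Denmark", Medal: "Bronze" },
  { Country: "Great Britain", Medal: "Bronze" }
];

var medalShare = groupBy(medals, {
  key: { Country: 1 },
  reduce: function (curr, result) { result.total += 1; },
  initial: { total: 0, fullTotal: medals.length },
  finalize: function (result) { result.percent = result.total * 100 / result.fullTotal; }
});

console.log(medalShare);
```

Because every document passes through a single JavaScript `reduce` callback, the sketch also hints at why this style of query becomes limiting as the logic grows.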
    • mongo map/reduce
      ■ Map/Reduce: Exactly what its name says.
      ■ You utilize JavaScript functions to map your documents' data, then reduce that data into a form of your choosing.
      [Diagram: Input Collection ⇒ Mapping Function ⇒ Reducing Function ⇒ Result Document or Output Collection]
    • mongo map/reduce
      ■ The mapping function redefines this to be the current document
      ■ Output mapped keys and values are generated via the emit function
      ■ Emit can be called zero or more times for a single document
      function () { emit(this.Countryname, { count : 1 }); }

      function () {
        for (var i = 0; i < this.Pupils.length; i++) {
          emit(this.Pupils[i].name, { count : 1 });
        }
      }

      function () {
        if ((this.parents.age - this.age) < 25) { emit(this.age, { income : this.income }); }
      }
    • mongo map/reduce
      ■ The reduction function is used to aggregate the outputs from the mapping function
      ■ The function receives two inputs: the key for the elements being reduced, and the values being reduced
      ■ The result of the reduction must have the same format as the input elements, and must be idempotent
      function (key, values) {
        var count = 0;
        // values is an array: iterate over its elements, not its indexes
        for (var i = 0; i < values.length; i++) {
          count += values[i].count;
        }
        return { "count" : count };
      }
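To see why the reducer's output must have the same shape as its input, it can help to run the map and reduce functions outside MongoDB. The sketch below is a hypothetical in-memory runner: `mapReduce` is an invented helper, `emit` is passed as an argument rather than being a global as it is in MongoDB, and the sample documents are made up. It deliberately reduces each key in two stages so that partial results are fed back into the reducer.

```javascript
// Hypothetical in-memory map/reduce runner; it mimics the contract, not MongoDB's engine.
function mapReduce(docs, map, reduce) {
  var emitted = {}; // key -> list of emitted values
  function emit(key, value) {
    (emitted[key] = emitted[key] || []).push(value);
  }
  docs.forEach(function (doc) {
    // As in MongoDB, `this` is bound to the current document.
    // Passing emit as a parameter is an assumption of this sketch.
    map.call(doc, emit);
  });
  var out = {};
  Object.keys(emitted).forEach(function (key) {
    var values = emitted[key];
    // Reduce the values in two chunks, then reduce the partial results again.
    var half = Math.ceil(values.length / 2);
    var partials = [reduce(key, values.slice(0, half))];
    if (values.length > half) partials.push(reduce(key, values.slice(half)));
    out[key] = reduce(key, partials);
  });
  return out;
}

// Made-up sample documents.
var athletes = [
  { Countryname: "Denmark" }, { Countryname: "Kenya" },
  { Countryname: "Denmark" }, { Countryname: "Denmark" }
];

var counts = mapReduce(
  athletes,
  function (emit) { emit(this.Countryname, { count: 1 }); },
  function (key, values) {
    var count = 0;
    values.forEach(function (v) { count += v.count; });
    return { count: count };
  }
);

console.log(counts);
```

If the reducer returned a bare number instead of `{ count: ... }`, the second pass would read `v.count` from a number and the totals would break, which is the failure the same-format rule guards against.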
    • mongo map/reduce
      ■ Map/Reduce utilizes JavaScript to do all of its work
        ○ JavaScript in MongoDB is currently single-threaded (performance bottleneck)
        ○ Using external JS libraries is cumbersome and doesn't play well with sharding
        ○ No matter what language you're actually using, you'll be writing/maintaining JavaScript
      ■ Troubleshooting the Map/Reduce functions is primitive.
        ○ 10gen's advice: "write your own emit function"
      ■ Output options are flexible, but have some caveats
        ○ Output to a result document must fit in a BSON doc (16MB limit)
        ○ For an output collection: if you want indices on the result set, you need to pre-create the collection, then use the merge output option
    • mongo aggregation framework
      ■ The Aggregation Framework is designed to alleviate some of the issues with Map/Reduce for common analytical queries
      ■ New in 2.2
      ■ Works by constructing a pipeline of operations on data. Similar to M/R, but implemented in native code (higher performance, not single-threaded)
      [Diagram: Input Collection ⇒ Match ⇒ Project ⇒ Group]
    • mongo aggregation framework
      ■ Filtering/paging ops
        ○ $match - utilize Mongo selection syntax to choose input docs
        ○ $limit
        ○ $skip
      ■ Field manipulation ops
        ○ $project - select which fields are processed. Can add new fields
        ○ $unwind - flattens a doc with an array field into multiple documents, one per array value
      ■ Output ops
        ○ $group
        ○ $sort
      ■ Most common pipelines will be of the form $match ⇒ $project ⇒ $group
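Of the operators listed, $unwind is probably the least self-explanatory. A rough plain-JavaScript sketch of the idea follows; the `unwind` helper and the sample documents are hypothetical, and in this sketch a document whose array is empty or missing simply produces no output.

```javascript
// Hypothetical helper that mimics the effect of $unwind on one array field.
function unwind(docs, field) {
  var out = [];
  docs.forEach(function (doc) {
    (doc[field] || []).forEach(function (item) {
      // Copy the document and replace the array with a single element.
      var copy = Object.assign({}, doc);
      copy[field] = item;
      out.push(copy);
    });
  });
  return out;
}

// Made-up sample documents.
var classes = [
  { room: "A", Pupils: ["Ann", "Bo"] },
  { room: "B", Pupils: [] }
];

var perPupil = unwind(classes, "Pupils");
console.log(perPupil);
```

After unwinding, each output document carries exactly one array value, so a later grouping step can count or aggregate per value.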
    • mongo aggregation framework
      ■ $match is very important to getting good performance
      ■ Needs to be the first op in the pipeline, otherwise indices can't be used
      ■ Uses normal MongoDB query syntax, with two exceptions
        ○ Can't use a $where clause (this requires JavaScript)
        ○ Can't use Geospatial queries (just because)
      { $match : { "Name" : "Fred" } }
      { $match : { "Countryname" : { $ne : "Great Britain" } } }
      { $match : { "Income" : { $exists : 1 } } }
    • mongo aggregation framework
      ■ $project is used to select/compute/augment the fields you want in the output documents
      { $project : { "Countryname" : 1, "Sportname" : 1 } }
      ■ Can reference input document fields in computations via "$"
      { $project : { "country_name" : "$Countryname" } } /* renames field */
      ■ Computation of field values is possible, but it's limited and can be quite painful
      { $project : {
          "_id" : 0, "height" : 1, "weight" : 1,
          "bmi" : { $divide : ["$weight", { $multiply : [ "$height", "$height" ] } ] }
      } } /* omit "_id" field, inflict pain and suffering on future maintainers... */
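The nested `bmi` expression is easier to read once you treat it as a small expression tree. The sketch below is a hypothetical evaluator for the four arithmetic operators the talk mentions; `evaluate` and `ops` are invented names, the sample document is made up, and this is not how MongoDB evaluates expressions internally.

```javascript
// Hypothetical evaluator for the arithmetic subset of $project expressions.
var ops = {
  $add: function (a) { return a.reduce(function (x, y) { return x + y; }, 0); },
  $multiply: function (a) { return a.reduce(function (x, y) { return x * y; }, 1); },
  $subtract: function (a) { return a[0] - a[1]; },
  $divide: function (a) { return a[0] / a[1]; }
};

function evaluate(expr, doc) {
  if (typeof expr === "string" && expr.charAt(0) === "$") {
    return doc[expr.slice(1)]; // "$weight" refers to doc.weight
  }
  if (expr !== null && typeof expr === "object") {
    var op = Object.keys(expr)[0]; // e.g. "$divide"
    if (!ops[op]) throw new Error("unsupported operator: " + op);
    var args = expr[op].map(function (e) { return evaluate(e, doc); });
    return ops[op](args);
  }
  return expr; // a literal such as a number
}

var bmiExpr = { $divide: ["$weight", { $multiply: ["$height", "$height"] }] };
var bmi = evaluate(bmiExpr, { height: 2, weight: 80 });
console.log(bmi); // 20
```

Written in ordinary code the same computation is just `weight / (height * height)`, which illustrates the slide's point about how verbose computed fields become in pipeline syntax.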
    • mongo aggregation framework
      ■ $group, like the group command, collates and computes sets of values based on the identity field ("_id"), and whatever other fields you want
      { $group : { "_id" : "$Countryname" } } /* distinct list of countries */
      ■ Aggregation operators can be used to perform computation ($max, $min, $avg, $sum)
      { $group : { "_id" : "$Countryname", "count" : { $sum : 1 } } } /* histogram by country */
      { $group : { "_id" : "$Countryname", "weight" : { $avg : "$weight" } } }
      { $group : { "_id" : "$Countryname", "weight" : { $sum : "$weight" } } }
      ■ Set-based operations ($addToSet, $push)
      { $group : { "_id" : "$Countryname", "sport" : { $addToSet : "$sport" } } }
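As a rough mental model of the $sum and $avg accumulators, the following plain-JavaScript sketch buckets documents by the `_id` field reference and then folds each bucket. The `group` helper and the sample documents are hypothetical, only `$sum` and `$avg` are handled, and group keys are coerced to strings here, which real MongoDB does not do.

```javascript
// Hypothetical in-memory version of a $group stage supporting $sum and $avg.
function group(docs, spec) {
  var buckets = {}; // group key -> documents in that group
  docs.forEach(function (doc) {
    var id = doc[spec._id.slice(1)]; // "$Countryname" refers to doc.Countryname
    (buckets[id] = buckets[id] || []).push(doc);
  });
  return Object.keys(buckets).map(function (id) {
    var out = { _id: id };
    Object.keys(spec).forEach(function (field) {
      if (field === "_id") return;
      var acc = Object.keys(spec[field])[0]; // "$sum" or "$avg"
      var arg = spec[field][acc];            // a "$field" reference or a constant
      var values = buckets[id].map(function (doc) {
        return typeof arg === "string" ? doc[arg.slice(1)] : arg;
      });
      var total = values.reduce(function (a, b) { return a + b; }, 0);
      if (acc === "$sum") out[field] = total;
      else if (acc === "$avg") out[field] = total / values.length;
      else throw new Error("unsupported accumulator: " + acc);
    });
    return out;
  });
}

// Made-up sample documents.
var competitors = [
  { Countryname: "Kenya", weight: 60 },
  { Countryname: "Kenya", weight: 70 },
  { Countryname: "Denmark", weight: 80 }
];

var byCountry = group(competitors, {
  _id: "$Countryname",
  count: { $sum: 1 },
  weight: { $avg: "$weight" }
});
console.log(byCountry);
```

Note how `{ $sum : 1 }` adds the constant 1 once per document, which is why it yields a per-country count.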
    • mongo aggregation framework
      ■ Aggregation framework has a limited set of operators
        ○ $project limited to $add/$subtract/$multiply/$divide, as well as some boolean, string, and date/time operations
        ○ $group limited to $min/$max/$avg/$sum
      ■ Some operators, notably $group and $sort, are required to operate entirely in memory
        ○ This may prevent aggregation on large data sets
        ○ Can't work around using subsetting like you can with M/R, because output is strictly a document (no collection option yet)
    • mongo aggregation framework
      ■ Even with these tools, there are still limitations
        ○ MongoDB is not relational. This means a lot of work on your part if you have datasets representing different things that you'd like to correlate. Clicks vs views, for example
        ○ While the Aggregation Framework alleviates some of the performance issues of Map/Reduce, it does so by throwing away flexibility
        ○ The best approach for parallelization (sharding) is fraught with operational challenges
    • Precog for MongoDB
    • overview of precog for mongodb
      ■ Precog for MongoDB allows you to perform sophisticated analytics utilizing existing mongo instances
      ■ 100% free for non-profit or commercial use
      ■ Self-contained JAR bundling
        ○ Precog analytics server
        ○ Labcoat, a visual query builder
      ■ Does not include the full Precog stack
        ○ Minimal authentication handling (single api key in config)
        ○ No ingest service (just add data directly to mongo)
        ○ Does not store data, only streams it from MongoDB
    • installation & setup
      ■ Download file
        ○ http://precog.com/for-developers/mongodb/
      ■ Setup
        $ unzip precog.zip
        $ cd precog
        $ emacs -nw config.cfg (adjust ports, etc)
        $ ./precog.sh
    • Analyzing JSON Data with Quirrel
    • overview
      Quirrel is a statistically-oriented query language designed for the analysis of large-scale, heterogeneous data sets.
    • quirrel
      ● Simple
      ● Statistically-oriented
      ● Purely declarative
      ● Implicitly parallel
    • sneak peek
      pageViews := //pageViews
      avg := mean(pageViews.duration)
      bound := 1.5 * stdDev(pageViews.duration)
      pageViews.userId where
        pageViews.duration > avg + bound
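For readers more familiar with JavaScript than Quirrel, the query above selects the users whose page-view duration lies more than 1.5 standard deviations above the mean. The following is a hedged plain-JavaScript sketch of the same idea. The sample data is made up, and the use of the population standard deviation is an assumption, since the talk does not say which formula Quirrel's `stdDev` uses.

```javascript
// Made-up sample data; the //pageViews data set from the talk is not available here.
var pageViews = [
  { userId: "u1", duration: 10 },
  { userId: "u2", duration: 12 },
  { userId: "u3", duration: 11 },
  { userId: "u4", duration: 9 },
  { userId: "u5", duration: 100 }
];

function mean(xs) {
  return xs.reduce(function (a, b) { return a + b; }, 0) / xs.length;
}

// Population standard deviation (an assumption of this sketch).
function stdDev(xs) {
  var m = mean(xs);
  return Math.sqrt(mean(xs.map(function (x) { return (x - m) * (x - m); })));
}

var durations = pageViews.map(function (p) { return p.duration; });
var avg = mean(durations);
var bound = 1.5 * stdDev(durations);

// Equivalent of: pageViews.userId where pageViews.duration > avg + bound
var outliers = pageViews
  .filter(function (p) { return p.duration > avg + bound; })
  .map(function (p) { return p.userId; });

console.log(outliers);
```

The Quirrel version expresses the same filter in four lines without explicit loops or helper functions, which is the contrast the slide is drawing.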
    • quirrel speaks json
      1
      true
      [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
      "All work and no play makes jack a dull boy"
      {"age": 23, "gender": "female", "interests": ["sports", "tennis"]}
    • comments
      -- Ignore me.
      (- Ignore
         me,
         too -)
    • basic expressions
      2 * 4
      (1 + 2) * 3 / 9 > 23
      3 > 2 & (1 != 2)
      false & true | !false
    • named expressions
      x := 2
      square := x * x
    • loading data
      //pageViews
      load("/pageViews")
      //campaigns/summer/2012
    • drilldown
      pageViews := load("/pageViews")
      pageViews.userId
      pageViews.keywords[2]
    • reductions
      count(//pageViews)
      sum((//purchases).total)
      stdDev((//purchases).total)
    • filtering
      pageViews := //pageViews
      pageViews.userId where
        pageViews.duration > 1000
    • augmentation
      clicks with
        {dow: dayOfWeek(clicks.time)}
    • standard library
      import std::stats::rank
      rank((//pageViews).duration)
    • user-defined functions
      ctr(day) :=
        count(clicks where clicks.day = day) /
        count(impressions where impressions.day = day)
      ctrOnMonday := ctr(1)
      ctrOnMonday
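The `ctr` function divides the number of clicks on a given day by the number of impressions on that day. A plain-JavaScript sketch of the same computation is shown below; the sample events and the `countWhere` helper are hypothetical, and treating day `1` as Monday simply follows the naming on the slide.

```javascript
// Made-up sample events; each has a numeric day field.
var clicks = [{ day: 1 }, { day: 1 }, { day: 2 }];
var impressions = [{ day: 1 }, { day: 1 }, { day: 1 }, { day: 1 }, { day: 2 }];

// Hypothetical helper: number of events that fall on the given day.
function countWhere(events, day) {
  return events.filter(function (e) { return e.day === day; }).length;
}

// Click-through rate for one day: clicks divided by impressions.
function ctr(day) {
  return countWhere(clicks, day) / countWhere(impressions, day);
}

var ctrOnMonday = ctr(1);
console.log(ctrOnMonday); // 0.5
```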
    • grouping - implicit constraints
      solve day
        {day: day,
         ctr: count(clicks where clicks.day = day) /
              count(impressions where impressions.day = day)}
    • grouping - explicit constraints
      solve day = purchases.day
        {day: day,
         cummTotal: sum(purchases.total where purchases.day < day)}
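One way to read the explicit-constraint query is: for every distinct value of `purchases.day`, emit an object holding that day and the sum of all purchase totals from earlier days. The plain-JavaScript sketch below follows that reading with made-up data. It returns 0 for the earliest day, which may differ from how Quirrel treats a sum over an empty set.

```javascript
// Made-up sample purchases.
var purchases = [
  { day: 1, total: 10 },
  { day: 2, total: 5 },
  { day: 2, total: 7 },
  { day: 3, total: 1 }
];

// Collect the distinct days, in order of first appearance.
var days = [];
purchases.forEach(function (p) {
  if (days.indexOf(p.day) === -1) days.push(p.day);
});

// For each day, sum the totals of all purchases made strictly before it.
var cumulative = days.map(function (day) {
  var earlier = purchases.filter(function (p) { return p.day < day; });
  return {
    day: day,
    cummTotal: earlier.reduce(function (sum, p) { return sum + p.total; }, 0)
  };
});

console.log(cumulative);
```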
    • questions?
      http://quirrel-lang.org
    • Thank you!
      Follow us on Twitter: @precog, @jdegoes
      Download Precog for MongoDB for FREE: precog.com/for-developers/mongodb
      Try Precog for free and get a free account: precog.com/for-developers/
      Subscribe to our monthly newsletter: precog.com/about/contact-us/