Precog & MongoDB User Group: Skyrocket Your Analytics

Learn how to do advanced analytics with the Precog data science platform on your MongoDB database. Precog is free to download, and once it is installed you can analyze all the data in your MongoDB database without exporting it into another tool or writing any custom code. Learn more here: www.precog.com/mongodb


  1. 1. Skyrocket your Analytics
     MongoDB Meetup on December 10, 2012
     www.precog.com
     @precogio
     Nov - Dec 2012
  2. 2. welcome & agenda
     ■ Welcome to the Precog & MongoDB Meetup!
       7:00 - 7:30 Overview of Precog for MongoDB by Derek Chen-Becker
       7:30 - 7:45 Break (grab a beer, drink and snacks)
       7:45 - 8:15 Analyzing Big Data with Quirrel by John A. De Goes
       8:15 - 8:30 Precog Challenge Problems! Win some prizes!
     ■ Questions? Please ask away!
  3. 3. who we are
     ■ Precog Team
       Derek Chen-Becker, Lead Infrastructure Engineer
       John A. De Goes, CEO/Founder
       Kris Nuttycombe, Dir of Engineering
       Nathan Lubchenco, Developer Evangelist
     ■ MongoDB Host
       Clay Mcllrath
     ■ Thank you to Google for hosting us!
  4. 4. Current MongoDB Support for Analytics
     Derek Chen-Becker, Precog Lead Infrastructure Engineer
     @dchenbecker
  5. 5. current mongodb support for analytics
     ■ Mongo has support for a small set of simple aggregation primitives
       ○ count - returns the count of a given collection's documents with optional filtering
       ○ distinct - returns the distinct values for given selector criteria
       ○ group - returns groups of documents based on given key criteria. Group cannot be used in sharded configurations
  6. 6. current mongodb support for analytics
     > db.london_medals.group({
         key : {"Country":1},
         reduce : function(curr, result) { result.total += 1 },
         initial: { total : 0, fullTotal: db.london_medals.count() },
         finalize: function(result){ result.percent = result.total * 100 / result.fullTotal }
       })
     [
       {"Country" : "Great Britain", "total" : 88, "fullTotal" : 1019, "percent" : 8.635917566241414},
       {"Country" : "Dominican Republic", "total" : 2, "fullTotal" : 1019, "percent" : 0.19627085377821393},
       {"Country" : "Denmark", "total" : 16, "fullTotal" : 1019, "percent" : 1.5701668302257115},
       ...
     ■ More sophisticated queries are possible, but require a lot of JS and you'll hit the limits pretty quickly
     ■ Group cannot be used in sharded configurations. For that you need...
  7. 7. current mongodb support for analytics
     ■ Map/Reduce: Exactly what its name says.
     ■ You utilize JavaScript functions to map your documents' data, then reduce that data into a form of your choosing.
     [Diagram: Input Collection → Mapping Function → Reducing Function → Output Collection or Result Document]
  8. 8. current mongodb support for analytics
     ■ The mapping function redefines `this` to be the current document
     ■ Output mapped keys and values are generated via the emit function
     ■ Emit can be called zero or more times for a single document
       function () { emit(this.Countryname, { count : 1 }); }
       function () {
         for (var i = 0; i < this.Pupils.length; i++) {
           emit(this.Pupils[i].name, { count : 1 });
         }
       }
       function () {
         if ((this.parents.age - this.age) < 25) {
           emit(this.age, { income : this.income });
         }
       }
  9. 9. current mongodb support for analytics
     ■ The reduction function is used to aggregate the outputs from the mapping function
     ■ The function receives two inputs: the key for the elements being reduced, and the values being reduced
     ■ The result of the reduction must be in the same format as the input elements, and must be idempotent
       function (key, values) {
         var count = 0;
         // values is an array, so iterate by index (for-in would yield the indices, not the items)
         for (var i = 0; i < values.length; i++) { count += values[i].count; }
         return { "count" : count };
       }
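The mapping and reduction steps on the preceding slides can be tried outside MongoDB in plain JavaScript. What follows is a minimal sketch under stated assumptions, not MongoDB's implementation: `runMapReduce` and the sample documents are hypothetical, and `emit` is passed to the map function as an argument here, whereas in MongoDB it is provided by the server. The last lines illustrate why the reducer's output must have the same shape as its input: a reducer's result may be fed back into the reducer.

```javascript
// Hypothetical in-memory stand-in for a map/reduce run; not a MongoDB API.
function runMapReduce(docs, map, reduce) {
  const groups = new Map();
  // emit collects mapped values per key.
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // The map function is invoked with `this` bound to the current document.
  for (const doc of docs) map.call(doc, emit);
  const out = {};
  for (const [key, values] of groups) out[key] = reduce(key, values);
  return out;
}

// Sample documents, made up for illustration.
const medals = [
  { Countryname: "Great Britain" },
  { Countryname: "Denmark" },
  { Countryname: "Great Britain" },
];

function mapCountry(emit) {
  emit(this.Countryname, { count: 1 });
}

function reduceCount(key, values) {
  let count = 0;
  for (const v of values) count += v.count;
  return { count: count }; // same shape as the emitted values
}

const totals = runMapReduce(medals, mapCountry, reduceCount);

// Reducing a partial result together with a fresh value gives the same
// answer as reducing all values at once.
const partial = reduceCount("Great Britain", [{ count: 1 }]);
const rereduced = reduceCount("Great Britain", [partial, { count: 1 }]);

console.log(totals, rereduced);
```

If the reducer returned a bare number instead of `{ count: ... }`, the second call above would add `undefined` and produce `NaN`, which is the failure the "same format" rule guards against.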
  10. 10. current mongodb support for analytics
      ■ Map/Reduce utilizes JavaScript to do all of its work
        ○ JavaScript in MongoDB is currently single-threaded (performance bottleneck)
        ○ Using external JS libraries is cumbersome and doesn't play well with sharding
        ○ No matter what language you're actually using, you'll be writing/maintaining JavaScript
      ■ Troubleshooting the Map/Reduce functions is primitive. 10gen's advice: "write your own emit function" (!)
      ■ Output options are flexible, but have some caveats
        ○ Output to a result document must fit in a BSON doc (16MB limit)
        ○ For an output collection: if you want indices on the result set, you need to pre-create the collection then use the merge output option
  11. 11. current mongodb support for analytics
      ■ The Aggregation Framework is designed to alleviate some of the issues with Map/Reduce for common analytical queries
      ■ New in 2.2
      ■ Works by constructing a pipeline of operations on data. Similar to M/R, but implemented in native code (higher performance, not single-threaded)
      [Diagram: Input Collection → Match → Project → Group]
  12. 12. current mongodb support for analytics
      ■ Filtering/paging ops
        ○ $match - utilize Mongo selection syntax to choose input docs
        ○ $limit
        ○ $skip
      ■ Field manipulation ops
        ○ $project - select which fields are processed. Can add new fields
        ○ $unwind - flattens a doc with an array field into multiple events, one per array value
      ■ Output ops
        ○ $group
        ○ $sort
      ■ Most common pipelines will be of the form $match ⇒ $project ⇒ $group
  13. 13. current mongodb support for analytics
      ■ $match is very important to getting good performance
      ■ Needs to be the first op in the pipeline, otherwise indices can't be used
      ■ Uses normal MongoDB query syntax, with two exceptions
        ○ Can't use a $where clause (this requires JavaScript)
        ○ Can't use Geospatial queries (just because)
        { $match : { "Name" : "Fred" } }
        { $match : { "Countryname" : { $ne : "Great Britain" } } }
        { $match : { "Income" : { $exists : 1 } } }
  14. 14. current mongodb support for analytics
      ■ $project is used to select/compute/augment the fields you want in the output documents
        { $project : { "Countryname" : 1, "Sportname" : 1 } }
      ■ Can reference input document fields in computations via "$"
        { $project : { "country_name" : "$Countryname" } } /* renames field */
      ■ Computation of field values is possible, but it's limited and can be quite painful
        { $project: {
            "_id":0, "height":1, "weight":1,
            "bmi": { $divide : ["$weight", { $multiply : [ "$height", "$height" ] } ] }
        } } /* omit "_id" field, inflict pain and suffering on future maintainers... */
  15. 15. current mongodb support for analytics
      ■ $group, like the group command, collates and computes sets of values based on the identity field ("_id"), and whatever other fields you want
        { $group : { "_id" : "$Countryname" } } /* distinct list of countries */
      ■ Aggregation operators can be used to perform computation ($max, $min, $avg, $sum)
        { $group : { "_id" : "$Countryname", "count" : { $sum : 1 } } } /* histogram by country */
        { $group : { "_id" : "$Countryname", "weight" : { $avg : "$weight" } } }
        { $group : { "_id" : "$Countryname", "weight" : { $sum : "$weight" } } }
      ■ Set-based operations ($addToSet, $push)
        { $group : { "_id" : "$Countryname", "sport" : { $addToSet : "$sport" } } }
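To make the $match ⇒ $project ⇒ $group flow concrete, here is a plain-JavaScript sketch of what such a pipeline computes, with array operations standing in for the server-side stages. The sample documents are made up; the `Countryname` and `weight` field names follow the slides. It illustrates the semantics only and says nothing about how MongoDB executes a pipeline.

```javascript
// Sample input collection, made up for illustration.
const athletes = [
  { Name: "A", Countryname: "Great Britain", Sportname: "Cycling", weight: 70 },
  { Name: "B", Countryname: "Denmark", Sportname: "Rowing", weight: 80 },
  { Name: "C", Countryname: "Great Britain", Sportname: "Rowing", weight: 90 },
  { Name: "D", Countryname: "Denmark", Sportname: "Sailing" },
];

// $match analogue: keep documents where "weight" exists.
const matched = athletes.filter((doc) => "weight" in doc);

// $project analogue: keep (and rename) only the fields needed downstream.
const projected = matched.map((doc) => ({
  country: doc.Countryname,
  weight: doc.weight,
}));

// $group analogue: _id is the country; count plays the role of
// { $sum : 1 } and avgWeight the role of { $avg : "$weight" }.
const acc = new Map();
for (const doc of projected) {
  const g = acc.get(doc.country) || { count: 0, total: 0 };
  g.count += 1;
  g.total += doc.weight;
  acc.set(doc.country, g);
}
const grouped = [...acc].map(([country, g]) => ({
  _id: country,
  count: g.count,
  avgWeight: g.total / g.count,
}));

console.log(grouped);
```

Filtering first keeps the later stages small, which mirrors the slides' advice to put $match at the front of the pipeline.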
  16. 16. current mongodb support for analytics
      ■ Aggregation framework has a limited set of operators
        ○ $project limited to $add/$subtract/$multiply/$divide, as well as some boolean, string, and date/time operations
        ○ $group limited to $min/$max/$avg/$sum
      ■ Some operators, notably $group and $sort, are required to operate entirely in memory
        ○ This may prevent aggregation on large data sets
        ○ Can't work around using subsetting like you can with M/R, because output is strictly a document (no collection option yet)
  17. 17. current mongodb support for analytics
      ■ Even with these tools, there are still limitations
        ○ MongoDB is not relational. This means a lot of work on your part if you have datasets representing different things that you'd like to correlate. Clicks vs views, for example
        ○ While the Aggregation Framework alleviates some of the performance issues of Map/Reduce, it does so by throwing away flexibility
        ○ The best approach for parallelization (sharding) is fraught with operational challenges (come see me for horror stories)
  18. 18. Overview of Precog for MongoDB
      Derek Chen-Becker, Precog Lead Infrastructure Engineer
      @dchenbecker
  19. 19. overview of precog for mongodb
      ■ Download file: http://www.precog.com/mongodb
      ■ Setup:
        $ unzip precog.zip
        $ cd precog
        $ emacs -nw config.cfg (adjust ports, etc)
        $ ./precog.sh
  20. 20. overview of precog for mongodb
      ■ Precog for MongoDB allows you to perform sophisticated analytics utilizing existing mongo instances
      ■ Self-contained JAR bundling:
        ○ The Precog Analytics service
        ○ Labcoat IDE for Quirrel
      ■ Does not include the full Precog stack
        ○ Minimal authentication handling (single api key in config)
        ○ No ingest service (just add data directly to mongo)
  21. 21. overview of precog for mongodb
      ■ Some sample queries
        -- histogram by country
        data := //summer_games/athletes
        solve 'country
          { country: 'country, count: count(data where data.Countryname = 'country) }
  22. 22. Analyzing Big Data with Quirrel
      John A. De Goes, Precog CEO/Founder
      @jdegoes
  23. 23. overview
      Quirrel is a statistically-oriented query language designed for the analysis of large-scale, potentially heterogeneous data sets.
  24. 24. quirrel
      ● Simple
      ● Set-oriented
      ● Statistically-oriented
      ● Purely declarative
      ● Implicitly parallel
  25. 25. sneak peek
      pageViews := //pageViews
      avg := mean(pageViews.duration)
      bound := 1.5 * stdDev(pageViews.duration)
      pageViews.userId where pageViews.duration > avg + bound
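For readers who do not have Quirrel at hand, the sneak-peek query can be mirrored in plain JavaScript. This is a rough sketch under stated assumptions: the sample page views are made up, and the population standard deviation is used here, which may not match how Quirrel's `stdDev` is defined.

```javascript
// Sample page views, made up for illustration.
const pageViews = [
  { userId: "u1", duration: 10 },
  { userId: "u2", duration: 10 },
  { userId: "u3", duration: 10 },
  { userId: "u4", duration: 10 },
  { userId: "u5", duration: 60 },
];

const durations = pageViews.map((v) => v.duration);

// avg := mean(pageViews.duration)
const avg = durations.reduce((s, d) => s + d, 0) / durations.length;

// Population standard deviation (an assumption about stdDev).
const variance =
  durations.reduce((s, d) => s + (d - avg) * (d - avg), 0) / durations.length;

// bound := 1.5 * stdDev(pageViews.duration)
const bound = 1.5 * Math.sqrt(variance);

// pageViews.userId where pageViews.duration > avg + bound
const outlierUsers = pageViews
  .filter((v) => v.duration > avg + bound)
  .map((v) => v.userId);

console.log(outlierUsers);
```

The four-line Quirrel query expresses the same filter without the explicit loops, which is the contrast the slide is drawing.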
  26. 26. quirrel speaks json
      1
      true
      [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
      "All work and no play makes jack a dull boy"
      {"age": 23, "gender": "female", "interests": ["sports", "tennis"]}
  27. 27. comments
      -- Ignore me.
      (- Ignore me, too -)
  28. 28. basic expressions
      2 * 4
      (1 + 2) * 3 / 9 > 23
      3 > 2 & (1 != 2)
      false & true | !false
  29. 29. named expressions
      x := 2
      square := x * x
  30. 30. loading data
      //pageViews
      load("/pageViews")
      //campaigns/summer/2012
  31. 31. drilldown
      pageViews := load("/pageViews")
      pageViews.userId
      pageViews.keywords[2]
  32. 32. reductions
      count(//pageViews)
      sum(//purchases.total)
      stdDev(//purchases.total)
  33. 33. filtering
      pageViews := //pageViews
      pageViews.userId where pageViews.duration > 1000
  34. 34. augmentation
      clicks with {dow: dayOfWeek(clicks.time)}
  35. 35. standard library
      import std::stats::rank
      rank(//pageViews.duration)
  36. 36. user-defined functions
      ctr(day) := count(clicks where clicks.day = day) / count(impressions where impressions.day = day)
      ctrOnMonday := ctr(1)
      ctrOnMonday
  37. 37. grouping - implicit constraints
      solve 'day
        {day: 'day, ctr: count(clicks where clicks.day = 'day) / count(impressions where impressions.day = 'day)}
  38. 38. grouping - explicit constraints
      solve 'day = purchases.day
        {day: 'day, cummTotal: sum(purchases.total where purchases.day < 'day)}
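The explicit-constraint query above produces, for each distinct purchase day, the sum of all purchases made strictly before that day. A plain-JavaScript sketch of the same computation follows. The sample purchases are made up, and the sketch reports 0 for the earliest day; how Quirrel treats a sum over an empty set is not stated in the slides, so that row may differ.

```javascript
// Sample purchases, made up for illustration.
const purchases = [
  { day: 1, total: 10 },
  { day: 2, total: 5 },
  { day: 2, total: 7 },
  { day: 3, total: 1 },
];

// The constraint in the solve clause: the variable ranges over the
// distinct values of purchases.day.
const days = [...new Set(purchases.map((p) => p.day))].sort((a, b) => a - b);

// For each day: sum(purchases.total where purchases.day < day)
const cumulative = days.map((day) => ({
  day: day,
  cummTotal: purchases
    .filter((p) => p.day < day)
    .reduce((s, p) => s + p.total, 0),
}));

console.log(cumulative);
```

The sketch rescans all purchases for every day, which is quadratic; it is written for clarity, and a running total over sorted days would do the same work in one pass.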
  39. 39. questions?
      http://quirrel-lang.org
  40. 40. Now, it's your turn! Win some cool prizes!
      Precog Challenge Problems
  41. 41. precog challenge #1
      ■ Using the conversions data, find the state with the highest average income.
      ■ Variable names: conversions.customers.state and conversions.customers.income
  42. 42. precog challenge #2
      ■ Use Labcoat to display a bar chart of the clicks per month.
      ■ Variable names: clicks.timestamp
  43. 43. precog challenge #3
      ■ What product has the worst overall sales to women? To men?
      ■ Variable names: billing.product.ID, billing.product.price, billing.customer.gender
  44. 44. precog challenge #1 possible solution
      conversions := //conversions
      results := solve 'state
        {state: 'state, aveIncome: mean(conversions.customer.income where conversions.customer.state = 'state)}
      results where results.aveIncome = max(results.aveIncome)
  45. 45. precog challenge #2 possible solution
      clicks := //clicks
      clicks := clicks with {month: std::time::monthOfYear(clicks.timeStamp)}
      solve 'month
        {month: 'month, clicks: count(clicks.product.price where clicks.month = 'month)}
  46. 46. precog challenge #3 possible solution
      billing := //billing
      results := solve 'product, 'gender
        {product: 'product, gender: 'gender, sales: sum(billing.product.price where billing.product.ID = 'product & billing.customer.gender = 'gender)}
      worstSalesToWomen := results where results.gender = "female" & results.sales = min(results.sales where results.gender = "female")
      worstSalesToMen := results where results.gender = "male" & results.sales = min(results.sales where results.gender = "male")
      worstSalesToWomen union worstSalesToMen
  47. 47. Thank you!
      Follow us on Twitter: @precogio, @jdegoes, @dchenbecker
      Download Precog for MongoDB for FREE: www.precog.com/mongodb
      Try Precog for free and get a free account: www.precog.com
      Subscribe to our monthly newsletter: www.precog.com/about/newsletter
