Thoughts on MongoDB Analytics

Thoughts on MongoDB Analytics using the Aggregation Framework


    1. ANALYTICS WITH MONGODB
       ROGER BODAMER
    2. YOU WANT TO ANALYZE THIS
    3. LIKE THIS
    4. BUT HOW?
       • These graphs are the end result of a process
       • To get here, there are a few things you need to do and explore
    5. A WORD ON NON-NATIVE APPROACHES
       • Yes, you can
         • map your document schema to a relational schema
         • then export your data from MongoDB to a relational DB
         • and set up a cron job to do this every day
         • then use your BI tool to map relational to “objects”
         • and then report and do analytics
    6. BUT THAT WOULD BE NO FUN
       • Analytics using native queries
       • A simple process
    7. PROCESS: NAIVE
       • Take a sample document
       • Develop query
       • Put on chart
       • Done!
         • and a gold star from your boss!
    8. PROCESS: REALITY
       • Understand your schema
         • multiple schemas in a single collection
         • multiple collections / multiple data sources
       • Iterate:
         • define metric
         • develop query and report on metrics
         • understand and drill down or discard
         • repeat
       • Operationalize metrics: dashboard
         • Dimensions
         • Plotting
    9. WHY ITERATE?
    10. UNDERSTAND YOUR SCHEMA
        {
          "name" : "Mario",
          "games" : [
            { "game" : "WoW", "duration" : 130 },
            { "game" : "Tetris", "duration" : 130 }
          ]
        }
    11. BUT ALSO:
        • Schemas can be polymorphic
        {
          "name" : "Bob",
          "location" : "us",
          "games" : [
            { "game" : "WoW", "duration" : 2910 },
            { "game" : "Tetris", "duration" : 593 }
          ]
        }
    12. SO NOW WHAT?
        • Only report on common attributes
          • probably missing the most recent / interesting data
    13. SO NOW WHAT?
        • Write 2 programs, one for each schema
          • 2 graphs / reports
          • 2 programs writing to 1 graph (basically merging instance data in 2 places)
    14. SO NOW WHAT?
        • Unify schema
          • deal with absent, null values
          • translate(NULL, “EU”);
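One way to read translate(NULL, “EU”) is as a default applied while projecting. The sketch below is only an illustration under stated assumptions: the lowercase default "eu" (chosen to match the "us" value in the sample documents) and the helper name unifyLocation are not from the slides. The $ifNull operator is standard MongoDB; the plain JavaScript function mirrors it so the behaviour can be checked without a database.

```javascript
// Pipeline stage (MongoDB shell syntax): replace a missing or null location
// with a default. The default "eu" is an assumption made for this sketch.
const unifyStage = {
  $project: {
    name: 1,
    games: 1,
    location: { $ifNull: ["$location", "eu"] }
  }
};

// In-memory equivalent of that stage. unifyLocation is a hypothetical helper.
function unifyLocation(doc, fallback = "eu") {
  const missing = doc.location === undefined || doc.location === null;
  return {
    name: doc.name,
    location: missing ? fallback : doc.location,
    games: doc.games
  };
}

// The two document shapes shown on the schema slides.
const gamers = [
  { name: "Mario",
    games: [{ game: "WoW", duration: 130 }, { game: "Tetris", duration: 130 }] },
  { name: "Bob", location: "us",
    games: [{ game: "WoW", duration: 2910 }, { game: "Tetris", duration: 593 }] }
];

const unified = gamers.map((doc) => unifyLocation(doc));
console.log(unified.map((d) => `${d.name}: ${d.location}`));
```

With the default in place, both document shapes carry a location, so a single query can group on it.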
    15. ITERATE
        • What is the total time, and how many games do people play, in the US vs. the EU?
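    16. QUERY
        db.runCommand({
          aggregate : "gamers",
          pipeline : [
            // keep only the fields the aggregation needs
            { $project : { location : 1, games : 1 }},
            // one document per entry in the games array
            { $unwind : "$games" },
            // group on location; count games and sum their durations
            { $group : {
                _id : { location : "$location" },
                number_games : { $sum : 1 },
                total_duration : { $sum : "$games.duration" }
            }},
            // expose the key as a plain location field and hide _id
            { $project : {
                _id : 0,
                location : "$_id.location",
                number_games : 1,
                total_duration : 1
            }}
          ]
        })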
    17. SIDEBAR: WRITING AGGREGATION QUERIES
        • Prepare data
          • Extract relevant properties from collection documents
          • Unwind a sub-collection if its documents contribute to the aggregation
        • Aggregate data
          • determine the key (_id) on which the aggregates should be computed
          • name the aggregates
        • Project data
          • For final results
    18. EXAMPLE
        {
          "name" : "Alice",
          "location" : "us",
          "games" : [
            { "game" : "WoW", "duration" : 200 },
            { "game" : "Tetris", "duration" : 100 }
          ]
        }
    19. PREPARE
        • Only use location and games:
          { $project : { location : 1, games : 1 }}
        • Unwind games, as properties of its documents are aggregated over:
          { $unwind : "$games" }
    20. AGGREGATE DATA
        • Aggregate on number of games (add 1 per game) and total duration (add duration per game) using location as key
          { $group : {
              _id : { location : "$location" },
              number_games : { $sum : 1 },
              total_duration : { $sum : "$games.duration" }
          }}
    21. PROJECT
        • Only show location and aggregates, do not show _id
          { $project : {
              _id : 0,
              location : "$_id.location",
              number_games : 1,
              total_duration : 1
          }}
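To make the prepare / aggregate / project steps concrete, here is a small in-memory walk-through in plain JavaScript using the Alice document from the EXAMPLE slide. It is a sketch of what each stage does to the data, not how MongoDB implements it; the variable names are chosen for this illustration.

```javascript
// Example document from the EXAMPLE slide.
const docs = [
  { name: "Alice", location: "us",
    games: [{ game: "WoW", duration: 200 }, { game: "Tetris", duration: 100 }] }
];

// Prepare: keep location and games, then emit one row per game ($unwind).
const unwound = docs.flatMap((doc) =>
  doc.games.map((g) => ({ location: doc.location, games: g }))
);

// Aggregate: group on location, counting games and summing durations.
const groups = new Map();
for (const row of unwound) {
  const acc = groups.get(row.location) || { number_games: 0, total_duration: 0 };
  acc.number_games += 1;
  acc.total_duration += row.games.duration;
  groups.set(row.location, acc);
}

// Project: flatten the key back into a location field and drop _id.
const result = [...groups].map(([location, acc]) => ({ location, ...acc }));
console.log(result);
```

For Alice alone this yields one row for "us" with 2 games and a total duration of 300.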
    22. RESULT 1
        • People spend a little more time playing in the US
        • More games played in the EU
    23. RING....
    24. CHALLENGE 2
        • Since we found that the EU and US play a similar amount and the same number of games, the new challenge is:
        • Let’s see what the distribution of different games is in the 2 locations
    25. QUERY 2
        db.runCommand({
          aggregate : "gamers",
          pipeline : [
            { $project : { location : 1, games : 1 }},
            { $unwind : "$games" },
            { $project : {
                location : 1,
                game : "$games.game",
                duration : "$games.duration"
            }},
            { $group : {
                _id : { location : "$location", game : "$game" },
                number_games : { $sum : 1 },
                total_duration : { $sum : "$duration" }
            }},
            { $project : {
                _id : 0,
                location : "$_id.location",
                game : "$_id.game",
                number_games : 1,
                total_duration : 1
            }}
          ]
        })
    26-30. QUERY 2 (same query, annotated)
        db.runCommand({
          aggregate : "gamers",
          pipeline : [
            // location, games
            { $project : { location : 1, games : 1 }},
            { $unwind : "$games" },
            // location, game, duration
            { $project : {
                location : 1,
                game : "$games.game",
                duration : "$games.duration"
            }},
            // key: aggregate on location and game
            { $group : {
                _id : { location : "$location", game : "$game" },
                number_games : { $sum : 1 },
                total_duration : { $sum : "$duration" }
            }},
            // project: location, game, total(#games), sum(duration)
            { $project : {
                _id : 0,
                location : "$_id.location",
                game : "$_id.game",
                number_games : 1,
                total_duration : 1
            }}
          ]
        })
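The only structural change from the first query is the compound key. The sketch below shows that idea in plain JavaScript: the same rows grouped once on location and game, and once on location alone. The helper groupByKeys is hypothetical, and the rows are taken from the Bob and Alice sample documents rather than from a real collection.

```javascript
// Rows as they look after $unwind and the second $project in QUERY 2.
const rows = [
  { location: "us", game: "WoW", duration: 2910 },
  { location: "us", game: "Tetris", duration: 593 },
  { location: "us", game: "WoW", duration: 200 },
  { location: "us", game: "Tetris", duration: 100 }
];

// groupByKeys is a hypothetical helper: it groups on any list of fields,
// counting rows and summing duration per distinct key combination.
function groupByKeys(input, keys) {
  const groups = new Map();
  for (const row of input) {
    const id = JSON.stringify(keys.map((k) => row[k]));
    if (!groups.has(id)) {
      const entry = { number_games: 0, total_duration: 0 };
      for (const k of keys) entry[k] = row[k];
      groups.set(id, entry);
    }
    const entry = groups.get(id);
    entry.number_games += 1;
    entry.total_duration += row.duration;
  }
  return [...groups.values()];
}

const byLocationAndGame = groupByKeys(rows, ["location", "game"]);
const byLocation = groupByKeys(rows, ["location"]);
console.log(byLocationAndGame, byLocation);
```

Adding a field to the key splits each location group into one group per game; the aggregates themselves do not change.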
    31. RESULT 2
        Count: EU - WoW, US - Tetris
        EU spends more time on WoW; in the US it’s more evenly spread
    32. RING....
    33. CHALLENGE 3:
        • How do I compare Bob to everyone else in the EU?
    34. QUERY
        • 2 aggregations happening at the same time:
          • 1 by user
          • 1 by location
        • This query needs to be broken up into several queries
        • Fairly complex
        • Currently easiest to process in Ruby/Java/Python/...
    35. // aggregation 1: total duration by location
        db.runCommand({
          aggregate : "gamers",
          pipeline : [
            { $project : { location : 1, games : 1 }},
            { $unwind : "$games" },
            { $project : {
                location : 1,
                duration : "$games.duration"
            }},
            { $group : {
                _id : { location : "$location" },
                total_duration : { $sum : "$duration" }
            }},
            { $project : {
                name : "$_id.location",
                _id : 0,
                total_duration : 1
            }}
          ]
        })

        // aggregation 2: total duration by location, user, and game
        db.runCommand({
          aggregate : "gamers",
          pipeline : [
            { $project : { name : 1, location : 1, games : 1 }},
            { $unwind : "$games" },
            { $project : {
                name : 1,
                location : 1,
                game : "$games.game",
                duration : "$games.duration"
            }},
            { $group : {
                _id : { location : "$location", name : "$name", game : "$game" },
                total_duration : { $sum : "$duration" }
            }},
            { $project : {
                name : "$_id.name",
                _id : 0,
                location : "$_id.location",
                game : "$_id.game",
                total_duration : 1
            }}
          ]
        })
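The client-side step that combines such results can be sketched as follows. This is an assumption about how the post-processing might look, not the author's code: compareToPeers is a hypothetical helper, and the rows are made-up data in the shape returned by the per-user query (name, location, game, total_duration). It compares one user's time per game with the average of everyone else in the same location.

```javascript
// Hypothetical per-user results, shaped like the output of the
// "by location, user, and game" aggregation.
const perUser = [
  { name: "Bob",  location: "eu", game: "WoW",    total_duration: 80 },
  { name: "Bob",  location: "eu", game: "Tetris", total_duration: 300 },
  { name: "Ann",  location: "eu", game: "WoW",    total_duration: 100 },
  { name: "Ann",  location: "eu", game: "Tetris", total_duration: 100 },
  { name: "Carl", location: "eu", game: "WoW",    total_duration: 100 },
  { name: "Carl", location: "eu", game: "Tetris", total_duration: 100 },
  { name: "Dee",  location: "us", game: "WoW",    total_duration: 999 }
];

// compareToPeers is a hypothetical helper. For each game the user played,
// it returns user time divided by the average time of the other players
// in the same location (null when nobody else played that game).
function compareToPeers(rows, userName, location) {
  const peers = new Map(); // game -> { sum, players }
  const own = new Map();   // game -> the user's total_duration
  for (const row of rows) {
    if (row.location !== location) continue;
    if (row.name === userName) {
      own.set(row.game, row.total_duration);
    } else {
      const acc = peers.get(row.game) || { sum: 0, players: 0 };
      acc.sum += row.total_duration;
      acc.players += 1;
      peers.set(row.game, acc);
    }
  }
  const ratios = {};
  for (const [game, duration] of own) {
    const acc = peers.get(game);
    ratios[game] = acc ? duration / (acc.sum / acc.players) : null;
  }
  return ratios;
}

const bobVsEu = compareToPeers(perUser, "Bob", "eu");
console.log(bobVsEu);
```

A ratio below 1 means the user plays less than the local average for that game, and above 1 means more.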
    36. RESULT 3
        • Bob plays >20% WoW in comparison to the Europeans, but plays 200% more Tetris
    37. A NOTE ON QUERIES
        • There’s no notion of a declared schema
        • The augmented schema is coded in queries
        • Reuse is very hard; it happens at the query-language level
    38. DIMENSIONS
        • Most questions / graphs have a dimension
          • Time, Geo
          • Categories
          • Relative: what’s X’s contribution of revenue to total
        • You will need to be able to pass in dimensions as a predicate for your queries
          • or cache the result and post-process client-side
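Passing a dimension in as a predicate can be as simple as prepending a $match stage to an existing pipeline. The sketch below assumes that approach; withDimensions is a hypothetical helper and basePipeline is an illustrative pipeline in the style of the earlier queries.

```javascript
// withDimensions is a hypothetical helper: it builds a $match predicate from
// the requested dimension values and puts it in front of the pipeline.
function withDimensions(pipeline, dimensions) {
  const predicate = {};
  for (const [field, value] of Object.entries(dimensions)) {
    if (value !== undefined) predicate[field] = value;
  }
  return Object.keys(predicate).length === 0
    ? pipeline.slice()
    : [{ $match: predicate }, ...pipeline];
}

// Illustrative base pipeline: total duration per game.
const basePipeline = [
  { $unwind: "$games" },
  { $group: {
      _id: { game: "$games.game" },
      total_duration: { $sum: "$games.duration" }
  }}
];

// Restrict the same report to one location.
const usOnly = withDimensions(basePipeline, { location: "us" });
console.log(JSON.stringify(usOnly[0]));

// In the mongo shell the result would be passed along as, for example:
// db.runCommand({ aggregate : "gamers", pipeline : usOnly })
```

Filtering first keeps the report query unchanged while the dimension varies, which is the alternative to caching the full result and filtering client-side.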
    39. A WORD ON RENDERING GRAPHS / REPORTS
        • Several libraries available for Ruby / Python / Java
          • Gruff, Scruffy, StockCharts, D3, JRafael, jQuery Visualize, MooCharts, etc.
        • Also some services: John Nunemaker’s work (http://get.gaug.es/)
        • But basically:
          • you know how to program, right!
    40. REVIEW
        • Understand your schema
          • multiple schemas in a single collection
          • multiple collections / multiple data sources
        • Iterate:
          • define metric
          • develop query and report on metrics
          • understand and drill down or discard
          • repeat
        • Operationalize metrics: dashboard
          • Dimensions
          • Plotting
    41. PUNCHLINES
        • We have described a software engineering process
          • but requirements will be very fluid
        • When you know how to write Ruby / Java / Python etc., life is good
        • If you’re a business analyst you have a problem
          • better be BFF with some engineer :)
    42. PLUG
        • We’ve been working on a declarative analytics product
        • (initially) uses Excel as its presentation layer
        • Reach out to me if you’re interested: @rogerb roger@norellan.com
    43. THANK YOU / QUESTIONS
