Thoughts on MongoDB Analytics

Thoughts on MongoDB Analytics using the Aggregation Framework

  • Transcript of "Thoughts on MongoDB Analytics"

    1. ANALYTICS WITH MONGODB
       ROGER BODAMER
    2. YOU WANT TO ANALYZE THIS
    3. LIKE THIS
    4. BUT HOW ?
       • These graphs are the end result of a process
       • In order to get here there are a few things you need to do and explore
    5. A WORD ON NON-NATIVE APPROACHES
       • Yes, you can
         • map your document schema to a relational schema
         • then export your data from MongoDB to a relational db
         • and set up a cron job to do this every day
         • then use your BI tool to map relational to “objects”
         • and then Report and do Analytics
    6. BUT THAT WOULD BE NO FUN
       • Analytics using Native Queries
       • A simple process
    7. PROCESS: NAIVE
       • Take a sample document
       • Develop query
       • Put on chart
       • Done !
         • and a gold star from your boss !
    8. PROCESS: REALITY
       • Understand your schema
         • multiple schemas in single collection
         • multiple collections / multiple data sources
       • Iterate:
         • define metric
         • develop query and report on metrics
         • understand and drill down or discard
         • repeat
       • Operationalize metrics: dashboard
         • Dimensions
         • Plotting
    9. WHY ITERATE ?
    10. UNDERSTAND YOUR SCHEMA
        {
            "name" : "Mario",
            "games" : [{"game" : "WoW", "duration" : 130},
                       {"game" : "Tetris", "duration" : 130}]
        }
    11. BUT ALSO:
        • Schemas can be Polymorphic
        {
            "name" : "Bob",
            "location" : "us",
            "games" : [{"game" : "WoW", "duration" : 2910},
                       {"game" : "Tetris", "duration" : 593}]
        }
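One way to find out how many schema variants a collection holds is to count documents per set of field names. The sketch below does this in plain JavaScript over an in-memory array; `sampleGamers` and `countSchemaShapes` are hypothetical names, and in practice the documents would come from a query against the collection.

```javascript
// Hypothetical in-memory sample; the two shapes mirror the Mario and Bob documents.
const sampleGamers = [
  { name: "Mario", games: [{ game: "WoW", duration: 130 }, { game: "Tetris", duration: 130 }] },
  { name: "Bob", location: "us", games: [{ game: "WoW", duration: 2910 }, { game: "Tetris", duration: 593 }] },
];

// Count how many documents share each set of top-level field names.
function countSchemaShapes(docs) {
  const counts = {};
  for (const doc of docs) {
    const shape = Object.keys(doc).sort().join(",");
    counts[shape] = (counts[shape] || 0) + 1;
  }
  return counts;
}

console.log(countSchemaShapes(sampleGamers));
```

More than one key in the result means the collection is polymorphic and the queries have to cope with it.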
    12. SO NOW WHAT ?
        • Only report on common attributes
          • probably missing the most recent / interesting data
    13. SO NOW WHAT ?
        • Write 2 programs, one for each schema
          • 2 graphs / reports
          • 2 programs writing to 1 graph (basically merging instance data in 2 places)
    14. SO NOW WHAT ?
        • Unify Schema
          • deal with absent, null values
          • translate(NULL, “EU”);
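A minimal sketch of the unify step, written as plain JavaScript applied to documents after they have been fetched. `unifyLocation` is a hypothetical helper, and defaulting to "EU" simply follows the translate(NULL, “EU”) idea on the slide. Inside a pipeline, the `$ifNull` expression in a `$project` stage can play the same role, for example `location : { $ifNull : ["$location", "EU"] }`.

```javascript
// Return a copy of the document with `location` filled in when it is absent or null.
// The default "EU" is an assumption taken from the slide; pick whatever fits your data.
function unifyLocation(doc, defaultLocation = "EU") {
  const location = doc.location == null ? defaultLocation : doc.location;
  return { ...doc, location };
}

console.log(unifyLocation({ name: "Mario", games: [] }));
console.log(unifyLocation({ name: "Bob", location: "us", games: [] }));
```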
    15. ITERATE
        • What is the total time, and how many games do people play, in the us vs eu ?
    16. QUERY
        db.runCommand({ aggregate : "gamers", pipeline : [
            { $project : {
                location : 1,
                games : 1
            }},
            { $unwind : "$games" },
            { $group : {
                _id : { location : "$location" },
                number_games : { $sum : 1 },
                total_duration : { $sum : "$games.duration" }
            }},
            { $project : {
                _id : 0,
                location : "$_id.location",
                number_games : 1,
                total_duration : 1
            }}
        ]})
    17. SIDEBAR: WRITING AGGREGATION QUERIES
        • Prepare Data
          • Extract relevant properties from collection documents
          • Unwind a sub collection if its documents contribute to the aggregation
        • Aggregate data
          • determine the key (_id) on which the aggregates should be done
          • name aggregates
        • Project Data
          • For final results
    18. EXAMPLE
        {
            "name" : "Alice",
            "location" : "us",
            "games" : [{ "game" : "WoW", "duration" : 200 },
                       { "game" : "Tetris", "duration" : 100 }]
        }
    19. PREPARE
        • Only use location and games:
          { $project : { location : 1, games : 1 }}
        • Unwind games, as properties of its documents are aggregated over:
          { $unwind : "$games" }
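To see what the prepare step produces, here is a plain JavaScript sketch that mimics the two stages on the Alice document. It is an in-memory imitation for illustration, not how MongoDB implements `$project` or `$unwind`, and the function names are hypothetical.

```javascript
const alice = {
  name: "Alice",
  location: "us",
  games: [{ game: "WoW", duration: 200 }, { game: "Tetris", duration: 100 }],
};

// Like { $project : { location : 1, games : 1 } }: keep only the listed fields.
// (MongoDB would also keep _id by default; the sample has none.)
function projectLocationAndGames(doc) {
  return { location: doc.location, games: doc.games };
}

// Like { $unwind : "$games" }: emit one document per element of the games array.
function unwindGames(doc) {
  return doc.games.map((game) => ({ ...doc, games: game }));
}

const unwoundAlice = unwindGames(projectLocationAndGames(alice));
console.log(unwoundAlice);
```

Each output document carries the location plus exactly one game, which is why later stages can refer to `$games.duration` as a single value.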
    20. AGGREGATE DATA
        • Aggregate on number of games (add 1 per game) and total duration (add duration per game) using location as key
          { $group : {
              _id : { location : "$location" },
              number_games : { $sum : 1 },
              total_duration : { $sum : "$games.duration" }
          }}
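The grouping logic can be sketched in plain JavaScript as an accumulator per key. The rows below are hypothetical unwound documents (the "eu" row is made up for illustration), and `groupByLocation` is an assumed name, not part of MongoDB.

```javascript
// Hypothetical unwound rows, shaped like the output of the PREPARE step.
const unwoundRows = [
  { location: "us", games: { game: "WoW", duration: 200 } },
  { location: "us", games: { game: "Tetris", duration: 100 } },
  { location: "eu", games: { game: "WoW", duration: 130 } },
];

// Like the $group stage: one accumulator per location,
// adding 1 per game and the game's duration per game.
function groupByLocation(rows) {
  const groups = new Map();
  for (const row of rows) {
    if (!groups.has(row.location)) {
      groups.set(row.location, { _id: { location: row.location }, number_games: 0, total_duration: 0 });
    }
    const group = groups.get(row.location);
    group.number_games += 1;
    group.total_duration += row.games.duration;
  }
  return [...groups.values()];
}

console.log(groupByLocation(unwoundRows));
```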
    21. PROJECT
        • Only show location and aggregates, do not show _id
          { $project : {
              _id : 0,
              location : "$_id.location",
              number_games : 1,
              total_duration : 1
          }}
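The final projection only reshapes each group: it lifts the key out of `_id` and drops `_id` itself. A plain JavaScript equivalent, with hypothetical names and made-up sample values, could look like this:

```javascript
// Hypothetical grouped rows, shaped like the output of the $group stage.
const groupedRows = [
  { _id: { location: "us" }, number_games: 2, total_duration: 300 },
  { _id: { location: "eu" }, number_games: 1, total_duration: 130 },
];

// Like the final $project: expose the key as `location` and leave out `_id`.
function projectGroupResult(group) {
  return {
    location: group._id.location,
    number_games: group.number_games,
    total_duration: group.total_duration,
  };
}

console.log(groupedRows.map(projectGroupResult));
```

The flat rows are what a charting library would typically consume.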
    22. RESULT 1
        • People spend a little more time playing in the US
        • More games played in the EU
    23. RING....
    24. CHALLENGE 2
        • Since we found EU and US play a similar amount and the same number of games, the new challenge is:
        • Let’s see what the distribution of different games is in the 2 locations
    25-30. QUERY 2
        db.runCommand({ aggregate : "gamers", pipeline : [
            // location, games
            { $project : {
                location : 1,
                games : 1
            }},
            { $unwind : "$games" },
            // location, game, duration
            { $project : {
                location : 1,
                game : "$games.game",
                duration : "$games.duration"
            }},
            // key: aggregate on location and game
            { $group : {
                _id : { location : "$location", game : "$game" },
                number_games : { $sum : 1 },
                total_duration : { $sum : "$duration" }
            }},
            // project: location, game, total(#games), sum(duration)
            { $project : {
                _id : 0,
                location : "$_id.location",
                game : "$_id.game",
                number_games : 1,
                total_duration : 1
            }}
        ]})
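The only real change from the first query is the compound key. A plain JavaScript sketch of grouping on (location, game) follows; the rows are hypothetical and already flattened to location, game and duration, as after the second `$project`.

```javascript
// Hypothetical rows after unwinding and flattening; values are made up.
const gameRows = [
  { location: "us", game: "WoW", duration: 200 },
  { location: "us", game: "Tetris", duration: 100 },
  { location: "us", game: "WoW", duration: 2910 },
  { location: "eu", game: "WoW", duration: 130 },
];

// Group on the compound key (location, game) and return flat result rows.
function groupByLocationAndGame(rows) {
  const groups = new Map();
  for (const { location, game, duration } of rows) {
    const key = JSON.stringify([location, game]);
    if (!groups.has(key)) {
      groups.set(key, { location, game, number_games: 0, total_duration: 0 });
    }
    const group = groups.get(key);
    group.number_games += 1;
    group.total_duration += duration;
  }
  return [...groups.values()];
}

console.log(groupByLocationAndGame(gameRows));
```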
    31. RESULT 2
        Count: EU - WoW, US - Tetris
        EU spends more time on WoW, in the US it’s more evenly spread
    32. RING....
    33. CHALLENGE 3:
        • How do I compare Bob to everyone else in the EU ?
    34. QUERY
        • 2 aggregations happening at same time:
          • 1 by user
          • 1 by location
        • This query needs to be broken up into several queries
        • Fairly complex
        • Currently easiest to process in Ruby/Java/Python/...
    35. // aggregation by location
        db.runCommand({ aggregate : "gamers", pipeline : [
            { $project : {
                location : 1,
                games : 1
            }},
            { $unwind : "$games" },
            { $project : {
                location : 1,
                duration : "$games.duration"
            }},
            { $group : {
                _id : { location : "$location" },
                total_duration : { $sum : "$duration" }
            }},
            { $project : {
                name : "$_id.location",
                _id : 0,
                total_duration : 1
            }}
        ]})
        // aggregation by user
        db.runCommand({ aggregate : "gamers", pipeline : [
            { $project : {
                name : 1,
                location : 1,
                games : 1
            }},
            { $unwind : "$games" },
            { $project : {
                name : 1,
                location : 1,
                game : "$games.game",
                duration : "$games.duration"
            }},
            { $group : {
                _id : { location : "$location", name : "$name", game : "$game" },
                total_duration : { $sum : "$duration" }
            }},
            { $project : {
                name : "$_id.name",
                _id : 0,
                location : "$_id.location",
                game : "$_id.game",
                total_duration : 1
            }}
        ]})
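Combining the two result sets is the part done client-side. The sketch below is one possible way to do it in JavaScript, under stated assumptions: the per-location result has been read into rows of the form `{ location, total_duration }`, the per-user result into rows of the form `{ name, location, game, total_duration }`, and "compare" is taken to mean the user's share of the location total. All names and numbers are hypothetical.

```javascript
// For one user, express each per-game total as a share of that user's location total.
function compareUserToLocation(userRows, locationRows, userName) {
  const totalByLocation = new Map(locationRows.map((row) => [row.location, row.total_duration]));
  return userRows
    .filter((row) => row.name === userName)
    .map((row) => ({
      game: row.game,
      location: row.location,
      share: row.total_duration / totalByLocation.get(row.location),
    }));
}

// Hypothetical results of the two aggregations.
const perUserRows = [
  { name: "Bob", location: "eu", game: "WoW", total_duration: 250 },
  { name: "Bob", location: "eu", game: "Tetris", total_duration: 500 },
  { name: "Ann", location: "eu", game: "WoW", total_duration: 250 },
];
const perLocationRows = [{ location: "eu", total_duration: 1000 }];

console.log(compareUserToLocation(perUserRows, perLocationRows, "Bob"));
```

A different definition of the comparison (for example per game, or against an average per player) would need the per-location query to group on more than the location.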
    36. RESULT 3
        • Bob plays >20% WoW in comparison to the Europeans, but plays 200% more Tetris
    37. A NOTE ON QUERIES
        • There’s no notion of a declared schema
        • The augmented schema is coded in queries
        • Reuse is very hard, it happens at the query language level
    38. DIMENSIONS
        • Most questions / graphs have a dimension
          • Time, Geo
          • Categories
          • Relative: what’s X’s contribution of revenue to total
        • You will need to be able to pass in dimensions as a predicate for your queries
          • or cache result and post process client-side
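Passing a dimension in as a predicate can be as simple as building the pipeline from a function argument and putting a `$match` stage first. The sketch below only constructs the pipeline array; `buildDurationPipeline` is a hypothetical helper, and the resulting array is what would go into the `pipeline` field of the aggregate command.

```javascript
// Build a per-location, per-game pipeline restricted by a dimension filter.
// `dimensionFilter` is an ordinary query document, e.g. { location: "eu" }.
function buildDurationPipeline(dimensionFilter) {
  return [
    { $match: dimensionFilter },
    { $unwind: "$games" },
    { $group: {
        _id: { location: "$location", game: "$games.game" },
        number_games: { $sum: 1 },
        total_duration: { $sum: "$games.duration" },
    }},
  ];
}

const euPipeline = buildDurationPipeline({ location: "eu" });
console.log(JSON.stringify(euPipeline, null, 2));
```

Filtering first keeps the later stages small. The alternative named on the slide, caching an unfiltered result and slicing it client-side, trades that for fewer round trips.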
    39. A WORD ON RENDERING GRAPHS / REPORTS
        • Several libraries available for ruby / python / java
          • Gruff, Scruffy, StockCharts, D3, JRafael, jQuery Visualize, MooCharts, etc, etc.
        • Also some services: John Nunemaker’s work (http://get.gaug.es/)
        • But Basically:
          • you know how to program, right !
    40. REVIEW
        • Understand your schema
          • multiple schemas in single collection
          • multiple collections / multiple data sources
        • Iterate:
          • define metric
          • develop query and report on metrics
          • understand and drill down or discard
          • repeat
        • Operationalize metrics: dashboard
          • Dimensions
          • Plotting
    41. PUNCHLINES
        • We have described a software engineering process
          • but requirements will be very fluid
        • When you know how to write ruby / java / python etc. - life is good
        • If you’re a business analyst you have a problem
          • better be BFF with some engineer :)
    42. PLUG
        • We’ve been working on a declarative analytics product
        • (initially) uses Excel as its presentation layer
        • Reach out to me if you’re interested: @rogerb, roger@norellan.com
    43. THANK YOU / QUESTIONS
