Aggregation Framework


  1. MapReduce, The Aggregation Framework & The Hadoop Connector. Bryan Reinero, Engineer, 10gen
  2. Data Warehousing
  3. Data Warehousing • We're storing our data in MongoDB
  4. Data Warehousing • We're storing our data in MongoDB • We need reporting, grouping, common aggregations, etc.
  5. Data Warehousing • We're storing our data in MongoDB • We need reporting, grouping, common aggregations, etc. • What are we using for this?
  6. Data Warehousing • SQL for reporting and analytics • Infrastructure complications • Additional maintenance • Data duplication • ETL processes • Real time?
  7. MapReduce
  8. MapReduce (diagram): a worker thread calls the mapper over the data set
  9. MapReduce (diagram): a worker thread calls the mapper; workers then call reduce() to produce the output data set
  10. Our Example Data: { _id: 375, title: "The Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }
  11. MapReduce: db.books.mapReduce( map, reduce, { finalize: finalize, out: { inline: 1 } } ); function map() { var key = this.language; emit( key, { totalPages: this.pages, numBooks: 1 } ); }
  12. MapReduce: function reduce(key, values) { var result = { numBooks: 0, totalPages: 0 }; values.forEach(function (value) { result.numBooks += value.numBooks; result.totalPages += value.totalPages; }); return result; }
  13. MapReduce: function finalize(key, value) { if ( value.numBooks != 0 ) return value.totalPages / value.numBooks; }
  14. MapReduce: "results" : [ { "_id" : "English", "value" : 653 }, { "_id" : "Russian", "value" : 1440 } ]
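The map, reduce, and finalize functions on the slides above are plain JavaScript, so their logic can be checked outside MongoDB. Below is a minimal Node.js sketch, using the sample books from slide 10 and a stand-in for the server-side emit(); it mimics what the server does, not how it does it.

```javascript
// Sample documents, as on slide 10 (fields not used by the job omitted).
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Stand-in for the server-side emit(): collect values per key.
const emitted = {};
function emit(key, value) {
  (emitted[key] = emitted[key] || []).push(value);
}

// The three functions from the slides, verbatim in spirit.
function map() {
  var key = this.language;
  emit(key, { totalPages: this.pages, numBooks: 1 });
}

function reduce(key, values) {
  var result = { numBooks: 0, totalPages: 0 };
  values.forEach(function (value) {
    result.numBooks += value.numBooks;
    result.totalPages += value.totalPages;
  });
  return result;
}

function finalize(key, value) {
  if (value.numBooks != 0) return value.totalPages / value.numBooks;
}

// Map phase: `this` is bound to each document, as in MongoDB.
books.forEach(doc => map.call(doc));

// Reduce + finalize phase, one result per key.
const results = Object.keys(emitted).map(key => ({
  _id: key,
  value: finalize(key, reduce(key, emitted[key])),
}));

console.log(results);
// English: (218 + 1088) / 2 = 653; Russian: 1440 / 1 = 1440
```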
  15. MapReduce on a Three Node Replica Set (diagram): Primary, Secondary, Secondary; MapReduce jobs run against one member
  16. MapReduce on a Three Node Replica Set (diagram): the Primary and one Secondary are tagged { workload: "prod" } and serve CRUD operations; the other Secondary is tagged { workload: "analysis" } with priority: 0 and runs the MapReduce jobs
  17. MapReduce in MongoDB • Implemented with JavaScript: single-threaded, difficult to debug • Concurrency: appearance of parallelism, write locks
  18. MapReduce • Versatile, powerful
  19. MapReduce • Versatile, powerful • Intended for complex data analysis
  20. MapReduce • Versatile, powerful • Intended for complex data analysis • Overkill for simple aggregations
  21. Aggregation Framework
  22. Aggregation Framework • Declared in JSON, executes in C++
  23. Aggregation Framework • Declared in JSON, executes in C++ • Flexible, functional, and simple
  24. Aggregation Framework • Declared in JSON, executes in C++ • Flexible, functional, and simple • Plays nice with sharding
  25. Pipeline
  26. Pipeline: piping command line operations: ps ax | grep mongod | head -1
  27. Pipeline: piping aggregation operations: $match | $group | $sort; a stream of documents flows in, a result document comes out
  28. Pipeline Operators • $match • $project • $group • $unwind • $sort • $limit • $skip
  29. $match • Filter documents • Uses existing query syntax • No geospatial operations or $where
  30. { $match: { language: "Russian" } } applied to { title: "The Great Gatsby", pages: 218, language: "English" }, { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" }
  31. { $match: { pages: { $gt: 1000 } } } applied to { title: "The Great Gatsby", pages: 218, language: "English" }, { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" }
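In plain JavaScript terms, $match behaves like a filter over the document stream. The sketch below mirrors the two $match stages above against the sample books; it is an illustration of the semantics, not how the server executes the stage.

```javascript
// The three sample books from the slides.
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// { $match: { language: "Russian" } } — exact-match condition.
const russian = books.filter(b => b.language === "Russian");

// { $match: { pages: { $gt: 1000 } } } — comparison operator.
const longBooks = books.filter(b => b.pages > 1000);

console.log(russian.map(b => b.title));   // [ 'War and Peace' ]
console.log(longBooks.map(b => b.title)); // [ 'War and Peace', 'Atlas Shrugged' ]
```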
  32. $project • Reshape documents • Include, exclude or rename fields • Inject computed fields • Create sub-document fields
  33. Selecting and Excluding Fields: $project: { _id: 0, title: 1, language: 1 } turns { _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English" } into { title: "Great Gatsby", language: "English" }
  34. Renaming and Computing Fields: { $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language" } }
  35. Renaming and Computing Fields (annotated): avgChapterLength is the new field
  36. Renaming and Computing Fields (annotated): $divide is the operation; "$pages" is the dividend and "$chapters" the divisor
  37. Renaming and Computing Fields: the projection turns { _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English" } into { _id: 375, avgChapterLength: 24.2222, lang: "English" }
  38. Creating Sub-Document Fields: $project: { title: 1, stats: { pages: "$pages", language: "$language" } } applied to { _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }
  39. Creating Sub-Document Fields: the projection yields { _id: 375, title: "Great Gatsby", stats: { pages: 218, language: "English" } }
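The three $project shapes above amount to constructing a new object per document. The sketch below mimics the field selection, the $divide computation, and the sub-document nesting in plain JavaScript; it reproduces the results, not the server's implementation.

```javascript
// The sample document from slide 10.
const doc = {
  _id: 375, title: "Great Gatsby", ISBN: "9781857150193",
  available: true, pages: 218, chapters: 9, language: "English",
};

// { $project: { _id: 0, title: 1, language: 1 } } — include/exclude fields.
const selected = { title: doc.title, language: doc.language };

// { $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] },
//               lang: "$language" } } — compute and rename (_id kept by default).
const computed = {
  _id: doc._id,
  avgChapterLength: doc.pages / doc.chapters, // 218 / 9 ≈ 24.222
  lang: doc.language,
};

// { $project: { title: 1, stats: { pages: "$pages", language: "$language" } } }
// — create a sub-document field.
const nested = {
  _id: doc._id,
  title: doc.title,
  stats: { pages: doc.pages, language: doc.language },
};

console.log(selected); // { title: 'Great Gatsby', language: 'English' }
```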
  40. $group • Group documents by an ID: field reference, object, constant • Other output fields are computed: $max, $min, $avg, $sum, $addToSet, $push, $first, $last • Processes all data in memory
  41. Calculating an Average: $group: { _id: "$language", avgPages: { $avg: "$pages" } } over the three sample books yields { _id: "Russian", avgPages: 1440 } and { _id: "English", avgPages: 653 }
  42. Collecting Distinct Values: $group: { _id: "$language", titles: { $addToSet: "$title" } } yields { _id: "Russian", titles: [ "War and Peace" ] } and { _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ] }
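A $group stage keeps one accumulator document per distinct _id value. The sketch below is a rough in-memory equivalent of the two groupings above ($avg and $addToSet) over the sample books; it illustrates the accumulator idea, nothing more.

```javascript
// The three sample books from the slides.
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// One accumulator per _id (here, per language).
const groups = {};
for (const b of books) {
  const g = groups[b.language] =
    groups[b.language] || { totalPages: 0, count: 0, titles: new Set() };
  g.totalPages += b.pages; // feeds $avg
  g.count += 1;
  g.titles.add(b.title);   // feeds $addToSet (a set, so no duplicates)
}

// $group: { _id: "$language", avgPages: { $avg: "$pages" } }
const avgPages = Object.entries(groups).map(([lang, g]) =>
  ({ _id: lang, avgPages: g.totalPages / g.count }));

// $group: { _id: "$language", titles: { $addToSet: "$title" } }
const titles = Object.entries(groups).map(([lang, g]) =>
  ({ _id: lang, titles: [...g.titles] }));

console.log(avgPages); // English: 653, Russian: 1440
```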
  43. $unwind • Operate on an array field • Yield new documents for each array element: array replaced by element value; missing/empty fields → no output; non-array fields → error • Pipe to $group to aggregate array values
  44. $unwind: { $unwind: "$subjects" } turns { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] } into three documents whose subjects field is "Long Island", "New York", and "1920s" respectively
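The $unwind behavior on the slide can be pictured as mapping each array element to a copy of the parent document; a small sketch of that semantics:

```javascript
// The document from the $unwind slide.
const doc = {
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: ["Long Island", "New York", "1920s"],
};

// One output document per array element; the array is replaced
// by the element value, all other fields are copied unchanged.
const unwound = doc.subjects.map(s => ({ ...doc, subjects: s }));

console.log(unwound.length);      // 3
console.log(unwound[0].subjects); // "Long Island"
```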
  45. $sort, $limit, $skip • Sort documents by one or more fields: same order syntax as cursors; waits for earlier pipeline operators to return; in-memory unless early and indexed • Limit and skip follow cursor behavior
  46. $sort, $limit, $skip: { $sort: { title: 1 } } orders the seven sample titles alphabetically: Animal Farm, Brave New World, Fathers and Sons, Grapes of Wrath, Invisible Man, Lord of the Flies, The Great Gatsby
  47. Sort All the Documents in the Pipeline: { $sort: { title: 1 } } followed by { $skip: 2 } drops the first two sorted titles
  48. Sort All the Documents in the Pipeline: { $skip: 4 } followed by { $limit: 2 } keeps Invisible Man and Lord of the Flies
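The $sort / $skip / $limit combination above maps naturally onto array sort and slice. A sketch over the seven sample titles follows; the unsorted input order is illustrative.

```javascript
// Seven sample titles, in an arbitrary unsorted order.
const docs = [
  { title: "The Great Gatsby" },
  { title: "Animal Farm" },
  { title: "Brave New World" },
  { title: "Grapes of Wrath" },
  { title: "Fathers and Sons" },
  { title: "Lord of the Flies" },
  { title: "Invisible Man" },
];

// { $sort: { title: 1 } } — ascending sort on title.
const sorted = [...docs].sort((a, b) => a.title.localeCompare(b.title));

// { $skip: 4 } then { $limit: 2 } — cursor-style pagination.
const page = sorted.slice(4, 4 + 2);

console.log(page.map(d => d.title)); // [ 'Invisible Man', 'Lord of the Flies' ]
```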
  49. Usage and Limitations
  50. Usage • collection.aggregate() method: Mongo shell, most drivers • aggregate database command
  51. Collection Method and Database Command: db.books.aggregate([ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ]) is equivalent to db.runCommand({ aggregate: "books", pipeline: [ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ] }); both return { result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1 }
  52. Limitations • Result limited by BSON document size: final command result, intermediate shard results • Pipeline operator memory limits • Some BSON types unsupported: Binary, Code, deprecated types
  53. Sharding
  54. Sharding (diagram): Shards A, B, and C each run $match, $project, and $group locally; a $match: { /* filter by shard key */ } lets mongos target only the relevant shards before returning results to the client
  55. Sharding (diagram): Shards A and B run $match, $project, and $group on their own data; mongos performs a final $group to merge the partial results for the client
  56. Nice, but… • Limited parallelism • No access to analytics libraries • Separation of concerns • Need to integrate with existing tool chains
  57. Hadoop Connector
  58. Scaling MongoDB (diagram): a client application moves from a MongoDB single instance or replica set to a sharded cluster
  59. The Mechanism of Sharding (diagram): define a shard key on title over the complete data set: Animal Farm, Brave New World, Fathers & Sons, Invisible Man, Lord of the Flies
  60. The Mechanism of Sharding (diagram): the data set is divided into chunks by the title shard key
  61. The Mechanism of Sharding (diagram): the data set is divided into further chunks
  62. The Mechanism of Sharding (diagram): chunks are distributed across Shard 1, Shard 2, Shard 3, and Shard 4
  63. Data Growth (diagram): Shard 1, Shard 2, Shard 3, Shard 4
  64. Load Balancing (diagram): Shard 1, Shard 2, Shard 3, Shard 4
  65. Processing Big Data (diagram): Map tasks feed Reduce tasks
  66. Processing Big Data • Need to break data into smaller pieces • Process data across multiple nodes (diagram: many Hadoop nodes)
  67. Input Splits on Unsharded Systems (diagram): the total dataset on a single instance or replica set is divided into input splits and processed across Hadoop map and reduce tasks
  68. Hadoop Connector in Java: final Configuration conf = new Configuration(); MongoConfigUtil.setInputURI( conf, "mongodb://localhost/test.in" ); MongoConfigUtil.setOutputURI( conf, "mongodb://localhost/test.out" ); final Job job = new Job( conf, "word count" ); job.setMapperClass( TokenizerMapper.class ); job.setReducerClass( IntSumReducer.class ); …
  69. MongoDB-Hadoop Quickstart: https://github.com/mongodb/mongo-hadoop
  70. MongoDB-Hadoop Quickstart: $ ./sbt package
  71. MongoDB-Hadoop Quickstart: $ ./sbt package; $ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar ../hadoop/1.0.1/libexec/lib/
  72. MongoDB-Hadoop Quickstart: $ ./sbt package; $ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar ../hadoop/1.0.1/libexec/lib/; $ cp wordcount.jar ../hadoop/1.0.1/libexec/lib/
  73. MongoDB-Hadoop Quickstart: (same commands as the previous slide)
  74. MongoDB-Hadoop Quickstart: ROCK AND ROLL! $ bin/hadoop com.xgen.WordCount
  75. Thank You. Bryan Reinero, Engineer, 10gen
