
Aggregation Framework


Transcript

  • 1. MapReduce, The Aggregation Framework & The Hadoop Connector. Bryan Reinero, Engineer, 10gen
  • 2. Data Warehousing
  • 3. Data Warehousing • We're storing our data in MongoDB
  • 4. Data Warehousing • We're storing our data in MongoDB • We need reporting, grouping, common aggregations, etc.
  • 5. Data Warehousing • We're storing our data in MongoDB • We need reporting, grouping, common aggregations, etc. • What are we using for this?
  • 6. Data Warehousing • SQL for reporting and analytics • Infrastructure complications • Additional maintenance • Data duplication • ETL processes • Real time?
  • 7. MapReduce
  • 8. MapReduce [diagram: a worker thread calls map() over the data set]
  • 9. MapReduce [diagram: worker threads call map() over the data set; workers call reduce() to produce the output]
  • 10. Our Example Data { _id: 375, title: "The Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }
  • 11. MapReduce db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } ) function map() { var key = this.language; emit ( key, { totalPages : this.pages, numBooks : 1 }) }
  • 12. MapReduce db.books.mapReduce( map, reduce, { finalize: finalize, out: { inline: 1 } } ) function reduce( key, values ) { var result = { numBooks: 0, totalPages: 0 }; values.forEach( function (value) { result.numBooks += value.numBooks; result.totalPages += value.totalPages; }); return result; }
  • 13. MapReduce db.books.mapReduce( map, reduce, { finalize: finalize, out: { inline: 1 } } ) function finalize( key, value ) { if ( value.numBooks != 0 ) return value.totalPages / value.numBooks; }
  • 14. MapReduce db.books.mapReduce( map, reduce, { finalize: finalize, out: { inline: 1 } } ) "results" : [ { "_id" : "English", "value" : 653 }, { "_id" : "Russian", "value" : 1440 } ]
  • 15. MapReduce [diagram: three-node replica set (Primary, Secondary, Secondary); MapReduce jobs run against a secondary]
  • 16. MapReduce [diagram: three-node replica set; the Primary and one Secondary are tagged { workload: "prod" }, the other Secondary is tagged { workload: "analysis" } with priority: 0; CRUD operations go to the prod members, MapReduce jobs to the analysis member]
  • 17. MapReduce in MongoDB• Implemented with JavaScript – Single-threaded – Difficult to debug• Concurrency – Appearance of parallelism – Write locks
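The map/reduce/finalize flow from the preceding slides can be sketched in plain Node.js. The in-memory runMapReduce runner below is an illustrative assumption, not the server's implementation; the map, reduce, and finalize functions and the sample books mirror the slides.

```javascript
// Sample documents matching the slides (trimmed to the relevant fields).
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

function map() {
  // `this` is the current document, as in the mongo shell
  emit(this.language, { totalPages: this.pages, numBooks: 1 });
}

function reduce(key, values) {
  const result = { numBooks: 0, totalPages: 0 };
  values.forEach(v => {
    result.numBooks += v.numBooks;
    result.totalPages += v.totalPages;
  });
  return result;
}

function finalize(key, value) {
  if (value.numBooks !== 0) return value.totalPages / value.numBooks;
}

// Hypothetical in-memory runner: collect emitted values per key,
// reduce each group, then finalize.
function runMapReduce(docs, map, reduce, finalize) {
  const emitted = {};
  global.emit = (key, value) => {
    (emitted[key] = emitted[key] || []).push(value);
  };
  docs.forEach(doc => map.call(doc));
  return Object.keys(emitted).map(key => ({
    _id: key,
    value: finalize(key, reduce(key, emitted[key])),
  }));
}

const results = runMapReduce(books, map, reduce, finalize);
console.log(results);
// [ { _id: 'English', value: 653 }, { _id: 'Russian', value: 1440 } ]
```

This reproduces the average pages per language shown on slide 14.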
  • 18. MapReduce • Versatile, powerful
  • 19. MapReduce • Versatile, powerful • Intended for complex data analysis
  • 20. MapReduce • Versatile, powerful • Intended for complex data analysis • Overkill for simple aggregations
  • 21. Aggregation Framework
  • 22. Aggregation Framework • Declared in JSON, executes in C++
  • 23. Aggregation Framework • Declared in JSON, executes in C++ • Flexible, functional, and simple
  • 24. Aggregation Framework • Declared in JSON, executes in C++ • Flexible, functional, and simple • Plays nice with sharding
  • 25. Pipeline
  • 26. Pipeline Piping command line operations ps ax | grep mongod | head -1
  • 27. Pipeline Piping aggregation operations: stream of documents → $match | $group | $sort → result document
  • 28. Pipeline Operators• $match• $project• $group• $unwind• $sort• $limit• $skip
  • 29. $match• Filter documents• Uses existing query syntax• No geospatial operations or $where
  • 30. { $match: { language: "Russian" } } { title: "The Great Gatsby", pages: 218, language: "English" } { title: "War and Peace", pages: 1440, language: "Russian" } { title: "Atlas Shrugged", pages: 1088, language: "English" }
  • 31. { $match: { pages: { $gt: 1000 } } } { title: "The Great Gatsby", pages: 218, language: "English" } { title: "War and Peace", pages: 1440, language: "Russian" } { title: "Atlas Shrugged", pages: 1088, language: "English" }
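As a rough plain-JavaScript analogue, $match behaves like an array filter over the document stream. The matchStage helper below is a hypothetical sketch covering only the two operators the slides use (equality and $gt), not MongoDB's full query syntax.

```javascript
// Sample books from the slides.
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Hypothetical $match analogue: every field condition must hold.
function matchStage(docs, query) {
  return docs.filter(doc =>
    Object.entries(query).every(([field, cond]) =>
      typeof cond === "object" && cond !== null && "$gt" in cond
        ? doc[field] > cond.$gt   // comparison form: { pages: { $gt: 1000 } }
        : doc[field] === cond     // equality form: { language: "Russian" }
    )
  );
}

console.log(matchStage(books, { language: "Russian" }).map(d => d.title));
// [ 'War and Peace' ]
console.log(matchStage(books, { pages: { $gt: 1000 } }).map(d => d.title));
// [ 'War and Peace', 'Atlas Shrugged' ]
```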
  • 32. $project• Reshape documents• Include, exclude or rename fields• Inject computed fields• Create sub-document fields
  • 33. Selecting and Excluding Fields $project: { _id: 0, title: 1, language: 1 } Input: { _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English" } Output: { title: "Great Gatsby", language: "English" }
  • 34. Renaming and Computing Fields { $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language" } }
  • 35. Renaming and Computing Fields { $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language" } } (avgChapterLength is a new, computed field)
  • 36. Renaming and Computing Fields { $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language" } } ($divide is the operation; "$pages" is the dividend, "$chapters" the divisor)
  • 37. Renaming and Computing Fields Input: { _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English" } Output: { _id: 375, avgChapterLength: 24.2222, lang: "English" }
  • 38. Creating Sub-Document Fields $project: { title: 1, stats: { pages: "$pages", language: "$language" } } { _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }
  • 39. Creating Sub-Document Fields $project: { title: 1, stats: { pages: "$pages", language: "$language" } } Input: { _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English" } Output: { _id: 375, title: "Great Gatsby", stats: { pages: 218, language: "English" } }
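The two $project reshapes above can be written out by hand in plain JavaScript: one computes a new field with a division, the other nests fields into a sub-document. This is a hand-rolled sketch of the two specific projections, not a general projection engine.

```javascript
// The example book from the slides.
const book = {
  _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true,
  pages: 218, chapters: 9,
  subjects: ["Long Island", "New York", "1920s"], language: "English",
};

// { $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language" } }
const computed = {
  _id: book._id,
  avgChapterLength: book.pages / book.chapters, // 218 / 9 ≈ 24.2222
  lang: book.language,
};

// { $project: { title: 1, stats: { pages: "$pages", language: "$language" } } }
const nested = {
  _id: book._id,
  title: book.title,
  stats: { pages: book.pages, language: book.language },
};

console.log(computed.avgChapterLength.toFixed(4)); // 24.2222
console.log(nested.stats); // { pages: 218, language: 'English' }
```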
  • 40. $group• Group documents by an ID – Field reference, object, constant• Other output fields are computed – $max, $min, $avg, $sum – $addToSet, $push – $first, $last• Processes all data in memory
  • 41. Calculating an Average $group: { _id: "$language", avgPages: { $avg: "$pages" } } Input: { title: "The Great Gatsby", pages: 218, language: "English" } { title: "War and Peace", pages: 1440, language: "Russian" } { title: "Atlas Shrugged", pages: 1088, language: "English" } Output: { _id: "Russian", avgPages: 1440 } { _id: "English", avgPages: 653 }
  • 42. Collecting Distinct Values $group: { _id: "$language", titles: { $addToSet: "$title" } } Input: { title: "The Great Gatsby", pages: 218, language: "English" } { title: "War and Peace", pages: 1440, language: "Russian" } { title: "Atlas Shrugged", pages: 1088, language: "English" } Output: { _id: "Russian", titles: [ "War and Peace" ] } { _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ] }
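The two $group stages above (an $avg accumulator and an $addToSet accumulator, both keyed by language) can be sketched with ordinary in-memory grouping. The groupBy helper is an assumption for illustration, consistent with the slide's note that $group processes all data in memory.

```javascript
// Sample books from the slides.
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Hypothetical helper: bucket documents by a key field.
function groupBy(docs, keyField) {
  const groups = {};
  docs.forEach(doc => {
    (groups[doc[keyField]] = groups[doc[keyField]] || []).push(doc);
  });
  return groups;
}

const byLanguage = groupBy(books, "language");

// { $group: { _id: "$language", avgPages: { $avg: "$pages" } } }
const avgPages = Object.entries(byLanguage).map(([lang, docs]) => ({
  _id: lang,
  avgPages: docs.reduce((sum, d) => sum + d.pages, 0) / docs.length,
}));

// { $group: { _id: "$language", titles: { $addToSet: "$title" } } }
const titles = Object.entries(byLanguage).map(([lang, docs]) => ({
  _id: lang,
  titles: [...new Set(docs.map(d => d.title))],
}));

console.log(avgPages); // English: 653, Russian: 1440
console.log(titles);
```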
  • 43. $unwind• Operate on an array field• Yield new documents for each array element – Array replaced by element value – Missing/empty fields → no output – Non-array fields → error• Pipe to $group to aggregate array values
  • 44. $unwind { $unwind: "$subjects" } Input: { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] } Output: { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" }
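In plain JavaScript, $unwind's behavior maps naturally onto flatMap: one output document per array element, with the array field replaced by that element. The unwind helper here is a sketch; note that in real $unwind a non-array value is an error, which this sketch simply skips.

```javascript
// The $unwind example document from the slides.
const doc = {
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: ["Long Island", "New York", "1920s"],
};

// Hypothetical $unwind analogue: emit one document per array element.
function unwind(docs, field) {
  return docs.flatMap(d =>
    Array.isArray(d[field])
      ? d[field].map(el => ({ ...d, [field]: el }))
      : [] // missing/empty field: no output (real $unwind errors on non-arrays)
  );
}

const out = unwind([doc], "subjects");
console.log(out.map(d => d.subjects));
// [ 'Long Island', 'New York', '1920s' ]
```

Piping the result into the $group sketch would then aggregate over individual subjects, as the slide suggests.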
  • 45. $sort, $limit, $skip• Sort documents by one or more fields – Same order syntax as cursors – Waits for earlier pipeline operator to return – In-memory unless early and indexed• Limit and skip follow cursor behavior
  • 46. $sort, $limit, $skip { $sort: { title: 1 } } Input: { title: "The Great Gatsby" } { title: "Brave New World" } { title: "Grapes of Wrath" } { title: "Animal Farm" } { title: "Lord of the Flies" } { title: "Fathers and Sons" } { title: "Invisible Man" } Output: { title: "Animal Farm" } { title: "Brave New World" } { title: "Fathers and Sons" } { title: "Grapes of Wrath" } { title: "Invisible Man" } { title: "Lord of the Flies" } { title: "The Great Gatsby" }
  • 47. Sort All the Documents in the Pipeline { $sort: { title: 1 } } { title: "Animal Farm" } { title: "Brave New World" } { title: "Fathers and Sons" } { title: "Grapes of Wrath" } { title: "Invisible Man" } { title: "Lord of the Flies" } { title: "The Great Gatsby" }
  • 48. Sort All the Documents in the Pipeline { $skip: 4 } { $limit: 2 } applied to the sorted list { title: "Animal Farm" } { title: "Brave New World" } { title: "Fathers and Sons" } { title: "Grapes of Wrath" } { title: "Invisible Man" } { title: "Lord of the Flies" } { title: "The Great Gatsby" }
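The $sort, $skip, and $limit stages on these slides correspond directly to ordinary array operations: a comparator sort, then a slice. A minimal sketch over the slide's titles:

```javascript
// Titles from the slides, wrapped as documents.
const titles = [
  "The Great Gatsby", "Brave New World", "Grapes of Wrath", "Animal Farm",
  "Lord of the Flies", "Fathers and Sons", "Invisible Man",
].map(title => ({ title }));

// { $sort: { title: 1 } } - ascending sort on title
const sorted = [...titles].sort((a, b) => a.title.localeCompare(b.title));

// { $skip: 4 } then { $limit: 2 } - slice 2 documents starting at index 4
const page = sorted.slice(4, 4 + 2);

console.log(sorted.map(d => d.title));
console.log(page.map(d => d.title)); // [ 'Invisible Man', 'Lord of the Flies' ]
```

As the slide notes, the real $sort must wait for its input and sorts in memory unless it can use an index early in the pipeline; skip and limit then behave like cursor skip/limit.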
  • 49. Usage and Limitations
  • 50. Usage• collection.aggregate() method – Mongo shell – Most drivers• aggregate database command
  • 51. Collection Method db.books.aggregate([ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ]) Database Command db.runCommand({ aggregate: "books", pipeline: [ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ] }) Result: { result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1 }
  • 52. Limitations• Result limited by BSON document size – Final command result – Intermediate shard results• Pipeline operator memory limits• Some BSON types unsupported – Binary, Code, deprecated types
  • 53. Sharding
  • 54. Sharding [diagram: the client sends the pipeline to mongos; with $match: { /* filter by shard key */ }, only the matching shards (A and B) run $match, $project, $group, and Shard C is skipped]
  • 55. Sharding [diagram: Shards A and B each run $match, $project, $group; mongos runs a final $group to merge the per-shard results for the client]
  • 56. Nice, but…• Limited parallelism• No access to analytics libraries• Separation of concerns• Need to integrate with existing tool chains
  • 57. Hadoop Connector
  • 58. Scaling MongoDB [diagram: a client application talks to a MongoDB single instance or replica set, scaling out to a sharded cluster]
  • 59. The Mechanism of Sharding Complete Data Set. Define shard key on title: Animal Farm, Brave New World, Fathers & Sons, Invisible Man, Lord of the Flies
  • 60. The Mechanism of Sharding [diagram: the data set is split into chunks by shard-key range]
  • 61. The Mechanism of Sharding [diagram: further splits produce more chunks]
  • 62. [diagram: chunks are distributed across Shard 1, Shard 2, Shard 3, Shard 4]
  • 63. Data Growth [diagram: Shard 1, Shard 2, Shard 3, Shard 4]
  • 64. Load Balancing [diagram: Shard 1, Shard 2, Shard 3, Shard 4]
  • 65. Processing Big Data [diagram: MapReduce: multiple Map tasks feed multiple Reduce tasks]
  • 66. Processing Big Data • Need to break data into smaller pieces • Process data across multiple nodes [diagram: many Hadoop nodes]
  • 67. Input Splits on Unsharded Systems [diagram: the total dataset on a single instance or replica set is divided into input splits; each split feeds a Hadoop map task, whose output is reduced]
  • 68. Hadoop Connector in Java final Configuration conf = new Configuration(); MongoConfigUtil.setInputURI( conf, "mongodb://localhost/test.in" ); MongoConfigUtil.setOutputURI( conf, "mongodb://localhost/test.out" ); final Job job = new Job( conf, "word count" ); job.setMapperClass( TokenizerMapper.class ); job.setReducerClass( IntSumReducer.class ); …
  • 69. MongoDB-Hadoop Quickstart https://github.com/mongodb/mongo-hadoop
  • 70. MongoDB-Hadoop Quickstart https://github.com/mongodb/mongo-hadoop $ ./sbt package
  • 71. MongoDB-Hadoop Quickstart https://github.com/mongodb/mongo-hadoop $ ./sbt package $ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar ../hadoop/1.0.1/libexec/lib/
  • 72. MongoDB-Hadoop Quickstart https://github.com/mongodb/mongo-hadoop $ ./sbt package $ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar ../hadoop/1.0.1/libexec/lib/ $ cp wordcount.jar ../hadoop/1.0.1/libexec/lib/
  • 74. MongoDB-Hadoop Quickstart ROCK AND ROLL! $ bin/hadoop com.xgen.WordCount
  • 75. Thank You. Bryan Reinero, Engineer, 10gen
