Principal Solutions Architect, MongoDB, Inc.
Asya Kamsky
Data Processing and
Aggregation Options
#BigDataCamp @MongoDB @as...
Applications and data
Store
Process
Data Processing and Aggregation Options in MongoDB / Asya Kamsky
Big Data
Big Data in MongoDB
• An ideal operational database
• High performance for storage and
retrieval at large scale
• Robust q...
MongoDB data processing
options
Big Data in MongoDB
Pre-aggregate in MongoDB for real-time queries
Process in MongoDB using Aggregation
Framework
Process ...
Aggregation Framework
• Declared in JSON, executes in C++
• Flexible, functional, and simple
• Plays nice with sharding
Pipeline
ps ax | grep mongod | head -n 1
Piping command line operations
Pipeline
$match | $group | $sort
Piping aggregation operations
Stream of documents Result document
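The shell-pipe analogy can be sketched in plain JavaScript. This is only an illustration of the idea, not how the server implements it: each stage is a function from an array of documents to a new array, and the `orders` sample data and the helper names `match`, `group`, `sort` and `pipeline` are assumptions made up for this example.

```javascript
// Each "stage" takes an array of documents and returns a new array,
// much like a shell command reading stdin and writing stdout.
const match = (predicate) => (docs) => docs.filter(predicate);

const group = (keyOf, field) => (docs) => {
  const totals = new Map();
  for (const doc of docs) {
    const key = keyOf(doc);
    totals.set(key, (totals.get(key) || 0) + doc[field]);
  }
  return [...totals].map(([_id, total]) => ({ _id, total }));
};

const sort = (field) => (docs) =>
  [...docs].sort((a, b) => a[field] - b[field]);

// Run the stages left to right, feeding each output to the next stage.
const pipeline = (...stages) => (docs) =>
  stages.reduce((stream, stage) => stage(stream), docs);

// Hypothetical sample data, not taken from the slides.
const orders = [
  { status: "F", region: "west", price: 30 },
  { status: "F", region: "east", price: 10 },
  { status: "O", region: "west", price: 99 },
  { status: "F", region: "west", price: 5 },
];

const pipelineResult = pipeline(
  match((d) => d.status === "F"),
  group((d) => d.region, "price"),
  sort("total")
)(orders);

console.log(pipelineResult);
```

The point of the sketch is the composition: a stage never needs to know which stage produced its input.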
Pipeline Operators
• $match
• $project
• $group
• $unwind
• $sort/$skip/$limit
• $redact
• $geoNear
• $out
$match
• Filter documents
• Uses existing query syntax
• 2.4 added support for geospatial operations
• 2.6 added support f...
{ $match : { state : "NY" } }
{
city: "SAN FRANCISCO",
loc: [-122.4614, 37.781],
state: "CA"
}
{
city: "NEW YORK",
loc: [ ...
{ $match : { loc : { $geoWithin:
{ $centerSphere : [ [ -122.4, 37.79 ], 20/3959 ] } } } }
{
city: "SAN FRANCISCO",
loc: [-122.461...
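The radius passed to $centerSphere is expressed in radians, which is why the slide divides 20 miles by 3959, the approximate radius of the Earth in miles. A rough way to see which of the three sample cities fall inside that circle is to compute the great-circle angle by hand. The haversine helper below (`centralAngle`) is an assumption introduced for illustration; it is not the server's own geometry code.

```javascript
// Great-circle angle in radians between two [longitude, latitude] points,
// computed with the haversine formula.
function centralAngle([lon1, lat1], [lon2, lat2]) {
  const rad = (deg) => (deg * Math.PI) / 180;
  const dLat = rad(lat2 - lat1);
  const dLon = rad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(lat1)) * Math.cos(rad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * Math.asin(Math.sqrt(a));
}

const EARTH_RADIUS_MILES = 3959;         // approximate, as used on the slide
const center = [-122.4, 37.79];          // [longitude, latitude]
const radius = 20 / EARTH_RADIUS_MILES;  // 20 miles expressed in radians

// The three sample documents from the $match slides.
const cities = [
  { city: "SAN FRANCISCO", loc: [-122.4614, 37.781], state: "CA" },
  { city: "NEW YORK", loc: [-73.989, 40.731], state: "NY" },
  { city: "PALO ALTO", loc: [-122.127, 37.418], state: "CA" },
];

const nearby = cities.filter((c) => centralAngle(center, c.loc) <= radius);
console.log(nearby.map((c) => c.city));
```

With these coordinates only the San Francisco document lies inside the circle; the Palo Alto point is roughly thirty miles from the center.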
$project
• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields
{
loc: [-122.3892, 37.7864],
state: "CA"
}
{
_id: "94105",
city: "SAN FRANCISCO",
loc: [-122.3892, 37.7864],
state: "CA"
}...
{
zip: "94105",
cityState: "SAN FRANCISCO, CA"
}
{
_id: "94105",
city: "SAN FRANCISCO",
loc: [-122.3892, 37.7864],
state: ...
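The rename-and-compute projection from this slide can be mirrored in plain JavaScript to make the field mapping explicit. The `projectStage` object repeats the stage as it appears in the slide transcript; `projectCityState` is a hypothetical stand-in written for this sketch and only imitates what the server would return.

```javascript
// The stage as given on the slide (a plain object; nothing is sent to a server here).
const projectStage = {
  $project: {
    zip: "$_id",
    cityState: { $concat: ["$city", ", ", "$state"] },
    _id: 0,
  },
};

// Hand-written equivalent of that stage for a single document.
const projectCityState = (doc) => ({
  zip: doc._id,                            // rename _id to zip
  cityState: doc.city + ", " + doc.state,  // what $concat computes
  // _id: 0 means the original _id field is dropped from the output
});

const zipDoc = {
  _id: "94105",
  city: "SAN FRANCISCO",
  loc: [-122.3892, 37.7864],
  state: "CA",
};

const projected = projectCityState(zipDoc);
console.log(projected);
```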
{
dt : {
y : 2012,
m : 9,
d : 1
},
totalprice: 123350.97,
status: "F"
}
{
_id : 6694,
cname : "Cust#000060209",
status" : ...
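The date-part projection on this slide ($year, $month and $dayOfMonth applied to orderdate) can be checked with the UTC accessors of a JavaScript Date. This is a sketch under the assumption that the date operators work in UTC; `reshapeOrder` is a hypothetical helper, and the `cname` and `lineitems` fields of the input are left out for brevity.

```javascript
// Input document, using the orderdate value from the slide.
const order = {
  _id: 6694,
  status: "F",
  totalprice: 123350.97,
  orderdate: new Date("2012-09-01T13:11:31Z"),
};

// Hand-written equivalent of the slide's $project stage.
const reshapeOrder = (doc) => ({
  dt: {
    y: doc.orderdate.getUTCFullYear(),  // $year
    m: doc.orderdate.getUTCMonth() + 1, // $month is 1-based; getUTCMonth is 0-based
    d: doc.orderdate.getUTCDate(),      // $dayOfMonth
  },
  totalprice: doc.totalprice,
  status: doc.status,
});

const reshaped = reshapeOrder(order);
console.log(reshaped);
```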
$group
• Group documents by an ID
– Field reference, object, constant
• Other output fields are computed
– $max, $min, $av...
Find the smallest cities
within twenty miles of San
Francisco
{ _id: "94306",
city: "PALO ALTO",
loc: [ -122.127, 37.418],
...
{
_id: "WOODACRE",
pop: 1524
}
{
_id: "STINSON BEACH",
pop: 630
}
{ _id: "94306",
city: "PALO ALTO",
loc: [ -122.127, 37.4...
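The $group, $sort and $limit steps of this example can be imitated on an in-memory array. The sketch assumes the documents have already passed the $match stage, drops the zip-code ids, and adds one made-up second San Francisco entry (marked in the code) so that the summing is visible; the remaining populations are the ones shown on the slides.

```javascript
// Documents assumed to have survived the geospatial $match.
const matched = [
  { city: "SAN FRANCISCO", pop: 27239 },
  { city: "WOODACRE", pop: 1524 },
  { city: "SAN FRANCISCO", pop: 1000 }, // hypothetical second zip code
  { city: "BOLINAS", pop: 1555 },
  { city: "STINSON BEACH", pop: 630 },
];

// { $group : { _id : "$city", pop : { $sum : "$pop" } } }
const popByCity = new Map();
for (const { city, pop } of matched) {
  popByCity.set(city, (popByCity.get(city) || 0) + pop);
}

// { $sort : { pop : 1 } }, { $limit : 3 }
const smallest = [...popByCity]
  .map(([_id, pop]) => ({ _id, pop }))
  .sort((a, b) => a.pop - b.pop)
  .slice(0, 3);

console.log(smallest);
```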
$unwind
• Operate on an array field
• Yield new documents for each array element
– Array replaced by element value
– Missi...
$unwind
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "Long Island"
}
{ $unwind: "$subjects" }
{
title: "T...
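A small JavaScript function can imitate the $unwind rules listed on the previous slide: one output document per array element, no output for a missing or empty field, and an error for a non-array value. The function name `unwind` and its signature are assumptions for this sketch; it takes a plain field name rather than the "$subjects" path syntax.

```javascript
// Imitates $unwind on an in-memory array of documents.
function unwind(docs, field) {
  return docs.flatMap((doc) => {
    const value = doc[field];
    if (value === undefined || value === null) {
      return []; // missing field: no output
    }
    if (!Array.isArray(value)) {
      // non-array field: error, as the slide describes
      throw new TypeError("field '" + field + "' is not an array");
    }
    // An empty array maps to zero documents, so it also yields no output.
    return value.map((element) => ({ ...doc, [field]: element }));
  });
}

const gatsby = {
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: ["Long Island", "New York", "1920s"],
};

const unwound = unwind([gatsby], "subjects");
console.log(unwound);
```

Each of the three output documents keeps title and ISBN and carries a single subject, which is the shape a following $group stage would count or collect.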
2.6 Improvements
• Returns a cursor (not a document)
– just like a regular find
• New stages
– $redact
– $out
• New operat...
Advantages
• Runs on the server
– Uses indexes
– Uses shards
• Simple to build complex pipelines
• Easy to use from any dr...
Limitations
• Pipeline operator memory limits
– 10% of total system RAM in 2.4 and earlier
– 100MB in 2.6 but can use disk...
MapReduce
• Versatile, powerful
• Intended for complex data
analysis
• Overkill for simple aggregations
MapReduce
Worker thread
calls mapper
Data Set
MapReduce
Workers call Reduce()
Data Set
Output
Worker thread
calls mapper
{
_id: 375,
title: "The Great Gatsby",
ISBN: "9781857150193",
available: true,
pages: 218,
chapters: 9,
subjects: [
"Long ...
MapReduce
db.books.mapReduce(
map, reduce, {finalize: finalize, out: { inline : 1} } )
MapReduce
db.books.mapReduce(
map, reduce, {finalize: finalize, out: { inline : 1} } )
function finalize( key, value ) {
i...
MapReduce
db.books.mapReduce(
map, reduce, {finalize: finalize, out: { inline : 1} } )
"results" : [
{
"_id" : "English",
...
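The map, reduce and finalize functions from these slides can be exercised outside the server with a tiny in-memory driver. Everything around the three functions is an assumption for this sketch: the `emit` stand-in, the `emitted` map, and the `books` sample (only the Gatsby page count comes from the slides; the other two books are made up). The real server may also call reduce more than once per key, or skip it for a key with a single value, which this driver does not imitate.

```javascript
// Stand-in for the server-provided emit(): collect values per key.
const emitted = new Map();
function emit(key, value) {
  if (!emitted.has(key)) emitted.set(key, []);
  emitted.get(key).push(value);
}

// The three functions as shown on the slides.
function map() {
  var key = this.language;
  emit(key, { totalPages: this.pages, numBooks: 1 });
}

function reduce(key, values) {
  var result = { numBooks: 0, totalPages: 0 };
  values.forEach(function (value) {
    result.numBooks += value.numBooks;
    result.totalPages += value.totalPages;
  });
  return result;
}

function finalize(key, value) {
  if (value.numBooks != 0) return value.totalPages / value.numBooks;
}

// Sample collection; the second and third books are hypothetical.
const books = [
  { title: "The Great Gatsby", language: "English", pages: 218 },
  { title: "Hypothetical English Book", language: "English", pages: 300 },
  { title: "Hypothetical Russian Book", language: "Russian", pages: 1440 },
];

// Map phase: call map with each document bound to `this`.
for (const book of books) map.call(book);

// Reduce and finalize phase: one result per emitted key.
const results = [...emitted].map(([key, values]) => ({
  _id: key,
  value: finalize(key, reduce(key, values)),
}));

console.log(results);
```

The output has the same shape as the "results" array on the slide: one document per language, with the average page count as its value.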
Advantages
• Map and reduce code can be arbitrarily complex
– JavaScript, helper functions
• Results can be saved into a n...
Limitations
• Implemented with JavaScript
– Single-threaded
• Slower than Aggregation Framework
– Batch, not real time
• H...
Analyzing MongoDB Data in
External Systems
Hadoop
Framework that allows for the distributed processing
of large data sets across clusters of computers
Hadoop MongoDB Connector
• MongoDB or BSON files as input/output
• Source data can be filtered with queries
• Hadoop Strea...
Processing Big Data
• Data broken up into smaller pieces
• Process data across multiple nodes
[Diagram: data split across several Hadoop nodes]
...
Input splits on Non-sharded
Systems
Single Map
Reduce
[Diagram: row of Hadoop nodes]
Advantages
• Processing decoupled
from data store
• Parallel processing
• Leverage existing
infrastructure
• Java has rich...
Storm
Storm MongoDB connector
• Spout for MongoDB oplog or capped collections
– Filtering capabilities
– Threaded and non-blocki...
Aggregating MongoDB’s
Data Processing Options
Internal Tools
• Storing pre-aggregated data
– An exercise in schema design
• Aggregation Framework
• MapReduce
External Tools
Questions?
Principal Solutions Architect, MongoDB Inc.
Asya Kamsky
Thank You
#BigDataCamp @MongoDB @asya999
Big Data Camp LA 2014, Data processing and Aggregation Options by Asya Kamsky of MongoDB

  • "h" : {
    "$hour" : "$time"
    },
    "m" : {
    "$minute" : "$time"
    },
    "s" : {
    "$second" : "$time"
    },
  • { $match : { loc : { $geoWithin: { $centerSphere : [ [ -122.4, 37.79 ], 20/3959 ] } } } }
    {
    city: "PALO ALTO",
    loc: [ -122.127, 37.418],
    state: "CA"
    }
  • 2.4 will improve somewhat
  • Distributed, real-time computation system.

    1. Principal Solutions Architect, MongoDB, Inc. Asya Kamsky Data Processing and Aggregation Options #BigDataCamp @MongoDB @asya999
    2. Applications and data Store Process
    3. Big Data
    4. Big Data in MongoDB
    5. Big Data in MongoDB • An ideal operational database • High performance for storage and retrieval at large scale • Robust query interface for intelligent operations
    6. MongoDB data processing options
    7. Big Data in MongoDB • Pre-aggregate in MongoDB for real-time queries • Process in MongoDB using Aggregation Framework • Process in MongoDB using Map/Reduce • Process outside MongoDB using Hadoop and other external tools
    8. Aggregation Framework
    9. Aggregation Framework • Declared in JSON, executes in C++
    10. Aggregation Framework • Declared in JSON, executes in C++ • Flexible, functional, and simple
    11. Aggregation Framework • Declared in JSON, executes in C++ • Flexible, functional, and simple • Plays nice with sharding
    12. Pipeline ps ax | grep mongod | head -n 1 Piping command line operations
    13. Pipeline $match | $group | $sort Piping aggregation operations Stream of documents Result document
    14. Pipeline Operators • $match • $project • $group • $unwind • $sort/$skip/$limit • $redact • $geoNear • $out
    15. $match • Filter documents • Uses existing query syntax • 2.4 added support for geospatial operations • 2.6 added support for full text search indexes
    16. { $match : { state : "NY" } } { city: "SAN FRANCISCO", loc: [-122.4614, 37.781], state: "CA" } { city: "NEW YORK", loc: [ -73.989, 40.731], state: "NY" } { city: "PALO ALTO", loc: [ -122.127, 37.418], state: "CA" }
    17. { $match : { loc : { $geoWithin: { $centerSphere : [ [ -122.4, 37.79 ], 20/3959 ] } } } } { city: "SAN FRANCISCO", loc: [-122.4614, 37.781], state: "CA" } { city: "NEW YORK", loc: [ -73.989, 40.731], state: "NY" } { city: "PALO ALTO", loc: [ -122.127, 37.418], state: "CA" }
    18. $project • Reshape documents • Include, exclude or rename fields • Inject computed fields • Create sub-document fields
    19. { loc: [-122.3892, 37.7864], state: "CA" } { _id: "94105", city: "SAN FRANCISCO", loc: [-122.3892, 37.7864], state: "CA" } Selecting and Excluding Fields $project: { _id: 0, loc: 1, state: 1 }
    20. { zip: "94105", cityState: "SAN FRANCISCO, CA" } { _id: "94105", city: "SAN FRANCISCO", loc: [-122.3892, 37.7864], state: "CA" } $project: { zip: "$_id", cityState: { $concat: ["$city", ", ", "$state"] }, _id: 0 } Renaming and Computing Fields
    21. { zip: "94105", cityState: "SAN FRANCISCO, CA" } { _id: "94105", city: "SAN FRANCISCO", loc: [-122.3892, 37.7864], state: "CA" } $project: { zip: "$_id", cityState: { $concat: ["$city", ", ", "$state"] }, _id: 0 } Renaming and Computing Fields New Field Operation
    22. { dt : { y : 2012, m : 9, d : 1 }, totalprice: 123350.97, status: "F" } { _id : 6694, cname : "Cust#000060209", status : "F", totalprice : 123350.97, orderdate : ISODate("2012-09-01T13:11:31Z"), lineitems: [ { ... }, { ... }, { ... } ] } Renaming and Computing Fields $project : { dt: { y : { "$year" : "$orderdate" }, m : { "$month" : "$orderdate" }, d : { "$dayOfMonth" : "$orderdate" } }, totalprice : 1, status : 1, _id : 0 }
    23. $group • Group documents by an ID – Field reference, object, constant • Other output fields are computed – $max, $min, $avg, $sum – $addToSet, $push – $first, $last • Processes all data in memory – can utilize external disk-based sort in 2.6
    24. Find the smallest cities within twenty miles of San Francisco { _id: "94306", city: "PALO ALTO", loc: [ -122.127, 37.418], pop: 24309 } { _id: "10280", city: "NEW YORK", loc: [ -74.016, 40.710], pop: 5574 } { _id: "94124", city: "SAN FRANCISCO", loc: [-122.388, 37.73], pop: 27239 }
    25. { _id: "WOODACRE", pop: 1524 } { _id: "STINSON BEACH", pop: 630 } { _id: "94306", city: "PALO ALTO", loc: [ -122.127, 37.418], pop: 24309 } { _id: "10280", city: "NEW YORK", loc: [ -74.016, 40.710], pop: 5574 } { _id: "94124", city: "SAN FRANCISCO", loc: [-122.388, 37.73], pop: 27239 } { _id: "BOLINAS", pop: 1555 } { $match : { loc : { $geoWithin: { $centerSphere : [ [ -122.4, 37.79 ], 20/3959 ] } } } }, { $group : { _id : "$city", pop : { $sum: "$pop" } } }, { $sort : { "pop" : 1 } }, { $limit : 3 } Find the smallest cities within twenty miles of San Francisco
    26. $unwind • Operate on an array field • Yield new documents for each array element – Array replaced by element value – Missing/empty fields → no output – Non-array fields → error • Pipe to $group to aggregate array values
    27. $unwind { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" } { $unwind: "$subjects" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] }
    28. 2.6 Improvements • Returns a cursor (not a document) – just like a regular find • New stages – $redact – $out • New operators: – set expression operators – $let and $map operators to allow for the use of variables – $literal operator and $size operator – $cond expression object • Integrated $text search • Performance improvements, "explain" and more
    29. Advantages • Runs on the server – Uses indexes – Uses shards • Simple to build complex pipelines • Easy to use from any driver • Faster than other options
    30. Limitations • Pipeline operator memory limits – 10% of total system RAM in 2.4 and earlier – 100MB in 2.6 but can use disk for external sort • Some data types not allowed – Code, CodeWithScope, etc. • Result size limited (in 2.4 and earlier) – 2.6 returns a cursor or direct output to a new collection: no result size limit!
    31. MapReduce
    32. MapReduce • Versatile, powerful
    33. MapReduce • Versatile, powerful • Intended for complex data analysis
    34. MapReduce • Versatile, powerful • Intended for complex data analysis • Overkill for simple aggregations
    35. MapReduce Worker thread calls mapper Data Set
    36. MapReduce Workers call Reduce() Data Set Output Worker thread calls mapper
    37. { _id: 375, title: "The Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English" } Our Example Data
    38. MapReduce db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } ) function map() { var key = this.language; emit ( key, { totalPages : this.pages, numBooks : 1 } ) }
    39. MapReduce db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } ) function reduce(key, values) { var result = { numBooks : 0, totalPages : 0}; values.forEach(function (value) { result.numBooks += value.numBooks; result.totalPages += value.totalPages; }); return result; }
    40. MapReduce db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } ) function finalize( key, value ) { if ( value.numBooks != 0 ) return value.totalPages / value.numBooks; }
    41. MapReduce db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } ) function finalize( key, value ) { if ( value.numBooks != 0 ) return value.totalPages / value.numBooks; }
    42. MapReduce db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } ) "results" : [ { "_id" : "English", "value" : 653 }, { "_id" : "Russian", "value" : 1440 } ]
    43. Advantages • Map and reduce code can be arbitrarily complex – JavaScript, helper functions • Results can be saved into a new collection – replace, merge or re-reduce • Incremental MapReduce
    44. Limitations • Implemented with JavaScript – Single-threaded • Slower than Aggregation Framework – Batch, not real time • Harder to understand, implement, debug...
    45. Analyzing MongoDB Data in External Systems
    46. Hadoop Framework that allows for the distributed processing of large data sets across clusters of computers
    47. Hadoop MongoDB Connector • MongoDB or BSON files as input/output • Source data can be filtered with queries • Hadoop Streaming support – For jobs written in Python, Ruby, Node.js • Supports Hadoop tools such as Pig and Hive
    48. Processing Big Data • Data broken up into smaller pieces • Process data across multiple nodes [Diagram: ten Hadoop nodes]
    49. Input splits on Non-sharded Systems Single Map Reduce [Diagram: ten Hadoop nodes] Total Dataset
    50. Advantages • Processing decoupled from data store • Parallel processing • Leverage existing infrastructure • Java has rich set of data processing libraries – And other languages if using Hadoop Streaming Disadvantages • Batch processing • Requires synchronization between data store and processor • Adds complexity to infrastructure
    51. Storm
    52. Storm
    53. Storm MongoDB connector • Spout for MongoDB oplog or capped collections – Filtering capabilities – Threaded and non-blocking • Output to new or existing documents – Insert/update bolt
    54. Aggregating MongoDB’s Data Processing Options
    55. Internal Tools • Storing pre-aggregated data – An exercise in schema design • Aggregation Framework • MapReduce
    56. External Tools
    57. Questions?
    58. Principal Solutions Architect, MongoDB Inc. Asya Kamsky Thank You #BigDataCamp @MongoDB @asya999
