Data Processing and Aggregation Options
Asya Kamsky
Principal Solutions Architect, MongoDB, Inc.
#BigDataCamp @MongoDB @asya999
Applications and data
Store
Process
Data Processing and Aggregation Options in MongoDB / Asya Kamsky
Big Data
Big Data in MongoDB
• An ideal operational database
• High performance for storage and retrieval at large scale
• Robust query interface for intelligent operations
MongoDB data processing options
Big Data in MongoDB
• Pre-aggregate in MongoDB for real-time queries
• Process in MongoDB using Aggregation Framework
• Process in MongoDB using Map/Reduce
• Process outside MongoDB using Hadoop and other external tools
Aggregation Framework
• Declared in JSON, executes in C++
• Flexible, functional, and simple
• Plays nice with sharding
Pipeline
Piping command line operations
ps ax | grep mongod | head -n 1
Pipeline
Piping aggregation operations
$match | $group | $sort
Stream of documents → Result document
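As a rough illustration of the piping idea, the sketch below runs a list of stage functions over an array of documents in plain JavaScript, feeding the output of each stage into the next. The helper names (runPipeline, match, sortBy, limit) and the sample documents are made up for this example. It is not how MongoDB implements the pipeline, only the same shape of data flow.

```javascript
// Each stage is a function from an array of documents to an array of documents.
// All helper names here are hypothetical, for illustration only.
function match(predicate) {
  return function (docs) { return docs.filter(predicate); };
}

function sortBy(field) {
  return function (docs) {
    // Copy first so the input array is not mutated.
    return docs.slice().sort(function (a, b) { return a[field] - b[field]; });
  };
}

function limit(n) {
  return function (docs) { return docs.slice(0, n); };
}

// Feed the output of each stage into the next, like a shell pipe.
function runPipeline(docs, stages) {
  return stages.reduce(function (current, stage) { return stage(current); }, docs);
}

// Invented sample documents.
var cities = [
  { city: "A", state: "CA", pop: 300 },
  { city: "B", state: "NY", pop: 100 },
  { city: "C", state: "CA", pop: 200 },
  { city: "D", state: "CA", pop: 50 }
];

var result = runPipeline(cities, [
  match(function (d) { return d.state === "CA"; }),
  sortBy("pop"),
  limit(2)
]);
console.log(result);
```

Because every stage has the same input and output type, stages can be reordered or dropped without touching the others, which is the property the shell pipe analogy is pointing at.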
Pipeline Operators
• $match
• $project
• $group
• $unwind
• $sort/$skip/$limit
• $redact
• $geoNear
• $out
$match
• Filter documents
• Uses existing query syntax
• 2.4 added support for geospatial operations
• 2.6 added support for full text search indexes
{ $match : { state : "NY" } }
{
  city: "SAN FRANCISCO",
  loc: [ -122.4614, 37.781 ],
  state: "CA"
}
{
  city: "NEW YORK",
  loc: [ -73.989, 40.731 ],
  state: "NY"
}
{
  city: "PALO ALTO",
  loc: [ -122.127, 37.418 ],
  state: "CA"
}
{ $match : { loc : { $geoWithin :
    { $centerSphere : [ [ -122.4, 37.79 ], 20/3959 ] } } } }
{
  city: "SAN FRANCISCO",
  loc: [ -122.4614, 37.781 ],
  state: "CA"
}
{
  city: "NEW YORK",
  loc: [ -73.989, 40.731 ],
  state: "NY"
}
{
  city: "PALO ALTO",
  loc: [ -122.127, 37.418 ],
  state: "CA"
}
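$centerSphere takes its radius in radians, which is why the stage above divides 20 miles by 3959, an approximate radius of the Earth in miles. The sketch below checks the three sample documents against that radius with a haversine great-circle calculation in plain JavaScript. The function names are made up for illustration, and the server's own geometry code may differ in detail.

```javascript
var EARTH_RADIUS_MILES = 3959; // same approximation as in the $match stage

function toRadians(deg) { return deg * Math.PI / 180; }

// Central angle in radians between two [longitude, latitude] points (haversine formula).
function centralAngle(a, b) {
  var dLat = toRadians(b[1] - a[1]);
  var dLon = toRadians(b[0] - a[0]);
  var h = Math.pow(Math.sin(dLat / 2), 2) +
          Math.cos(toRadians(a[1])) * Math.cos(toRadians(b[1])) *
          Math.pow(Math.sin(dLon / 2), 2);
  return 2 * Math.asin(Math.sqrt(h));
}

// Hypothetical stand-in for the $geoWithin / $centerSphere test.
function withinCenterSphere(loc, center, radiusInRadians) {
  return centralAngle(loc, center) <= radiusInRadians;
}

var center = [ -122.4, 37.79 ];
var radius = 20 / EARTH_RADIUS_MILES; // 20 miles expressed in radians

var docs = [
  { city: "SAN FRANCISCO", loc: [ -122.4614, 37.781 ], state: "CA" },
  { city: "NEW YORK",      loc: [ -73.989, 40.731 ],   state: "NY" },
  { city: "PALO ALTO",     loc: [ -122.127, 37.418 ],  state: "CA" }
];

var matched = docs
  .filter(function (d) { return withinCenterSphere(d.loc, center, radius); })
  .map(function (d) { return d.city; });
console.log(matched);
```

With these coordinates only the San Francisco document falls inside the circle; the Palo Alto point is roughly 30 miles from the center.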
$project
• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields
Selecting and Excluding Fields
$project: { _id: 0, loc: 1, state: 1 }
Input document:
{
  _id: "94105",
  city: "SAN FRANCISCO",
  loc: [ -122.3892, 37.7864 ],
  state: "CA"
}
Output document:
{
  loc: [ -122.3892, 37.7864 ],
  state: "CA"
}
Renaming and Computing Fields
$project: { zip: "$_id", cityState: { $concat: [ "$city", ", ", "$state" ] }, _id: 0 }
Input document:
{
  _id: "94105",
  city: "SAN FRANCISCO",
  loc: [ -122.3892, 37.7864 ],
  state: "CA"
}
Output document:
{
  zip: "94105",
  cityState: "SAN FRANCISCO, CA"
}
Renaming and Computing Fields
$project : { dt: { y : { "$year" : "$orderdate" },
                   m : { "$month" : "$orderdate" },
                   d : { "$dayOfMonth" : "$orderdate" } },
             totalprice : 1, status : 1, _id : 0 }
Input document:
{
  _id : 6694,
  cname : "Cust#000060209",
  status : "F",
  totalprice : 123350.97,
  orderdate : ISODate("2012-09-01T13:11:31Z"),
  lineitems: [
    { ... },
    { ... },
    { ... }
  ]
}
Output document:
{
  dt : {
    y : 2012,
    m : 9,
    d : 1
  },
  totalprice: 123350.97,
  status: "F"
}
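The date operators in the $project above read the parts of the stored date in UTC. A plain JavaScript equivalent of the dt sub-document, useful for checking expected values by hand, might look like the sketch below. The dateParts helper is hypothetical and only mirrors what $year, $month and $dayOfMonth return.

```javascript
// Hypothetical helper mirroring $year, $month and $dayOfMonth on a single date.
function dateParts(date) {
  return {
    y: date.getUTCFullYear(),
    m: date.getUTCMonth() + 1, // JavaScript months are 0-based; $month returns 1-12
    d: date.getUTCDate()
  };
}

// The order document from the slide, without the elided line items.
var order = {
  _id: 6694,
  cname: "Cust#000060209",
  status: "F",
  totalprice: 123350.97,
  orderdate: new Date("2012-09-01T13:11:31Z")
};

// Same shape as the $project output: computed sub-document plus two passed-through fields.
var projected = {
  dt: dateParts(order.orderdate),
  totalprice: order.totalprice,
  status: order.status
};
console.log(JSON.stringify(projected));
```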
$group
• Group documents by an ID
– Field reference, object, constant
• Other output fields are computed
– $max, $min, $avg, $sum
– $addToSet, $push
– $first, $last
• Processes all data in memory
– can utilize external disk-based sort in 2.6
Find the smallest cities within twenty miles of San Francisco
Input documents:
{ _id: "94306", city: "PALO ALTO", loc: [ -122.127, 37.418 ], pop: 24309 }
{ _id: "10280", city: "NEW YORK", loc: [ -74.016, 40.710 ], pop: 5574 }
{ _id: "94124", city: "SAN FRANCISCO", loc: [ -122.388, 37.73 ], pop: 27239 }
Pipeline:
{ $match : { loc : { $geoWithin : { $centerSphere : [ [ -122.4, 37.79 ], 20/3959 ] } } } },
{ $group : { _id : "$city", pop : { $sum : "$pop" } } },
{ $sort : { "pop" : 1 } },
{ $limit : 3 }
Result documents:
{ _id: "STINSON BEACH", pop: 630 }
{ _id: "WOODACRE", pop: 1524 }
{ _id: "BOLINAS", pop: 1555 }
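The $group, $sort and $limit steps of this example can be imitated in plain JavaScript to see what each one contributes. The sketch below assumes the $match stage has already run, and it uses invented zip-code documents and a made-up helper name (sumBy); it illustrates the logic only, not the server's implementation.

```javascript
// Sum `valueField` per distinct `keyField`,
// like { $group : { _id : "$city", pop : { $sum : "$pop" } } }.
function sumBy(docs, keyField, valueField) {
  var totals = {};
  docs.forEach(function (doc) {
    var key = doc[keyField];
    totals[key] = (totals[key] || 0) + doc[valueField];
  });
  return Object.keys(totals).map(function (key) {
    return { _id: key, pop: totals[key] };
  });
}

// Invented zip-code documents; one city can span several zip codes.
var zips = [
  { _id: "00001", city: "ALPHA", pop: 700 },
  { _id: "00002", city: "BETA",  pop: 400 },
  { _id: "00003", city: "ALPHA", pop: 500 },
  { _id: "00004", city: "GAMMA", pop: 900 },
  { _id: "00005", city: "DELTA", pop: 100 }
];

var smallest = sumBy(zips, "city", "pop")
  .sort(function (a, b) { return a.pop - b.pop; }) // { $sort : { pop : 1 } }
  .slice(0, 3);                                     // { $limit : 3 }
console.log(smallest);
```

Grouping before sorting matters here: sorting the raw zip-code documents would rank ALPHA's two zip codes separately instead of ranking the city by its total.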
$unwind
• Operate on an array field
• Yield new documents for each array element
– Array replaced by element value
– Missing/empty fields → no output
– Non-array fields → error
• Pipe to $group to aggregate array values
$unwind
{ $unwind: "$subjects" }
Input document:
{
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ]
}
Output documents:
{
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: "Long Island"
}
{
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: "New York"
}
{
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: "1920s"
}
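A minimal plain JavaScript sketch of the $unwind rules listed earlier (array replaced by each element, missing or empty fields produce no output, non-array fields raise an error). The unwind function here is a hypothetical stand-in written for this example, following the behavior described on the slide rather than any particular server version.

```javascript
// Emit one copy of `doc` per element of doc[field], with the array replaced by that element.
function unwind(doc, field) {
  var value = doc[field];
  if (value === undefined || value === null) {
    return []; // missing field: no output
  }
  if (!Array.isArray(value)) {
    throw new Error(field + " is not an array"); // non-array field: error
  }
  // An empty array maps to an empty result, so it also yields no output.
  return value.map(function (element) {
    var copy = {};
    Object.keys(doc).forEach(function (k) { copy[k] = doc[k]; });
    copy[field] = element;
    return copy;
  });
}

var book = {
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: [ "Long Island", "New York", "1920s" ]
};

var unwound = unwind(book, "subjects");
console.log(unwound.map(function (d) { return d.subjects; }));
```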
2.6 Improvements
• Returns a cursor (not a document)
– just like a regular find
• New stages
– $redact
– $out
• New operators:
– set expression operators.
– $let and $map operators to allow for the use of variables.
– $literal operator and $size operator
– $cond expression object
• Integrated $text search
• Performance improvements, "explain" and more
Advantages
• Runs on the server
– Uses indexes
– Uses shards
• Simple to build complex pipelines
• Easy to use from any driver
• Faster than other options
Limitations
• Pipeline operator memory limits
– 10% of total system RAM in 2.4 and earlier
– 100MB in 2.6 but can use disk for external sort
• Some data types not allowed
– Code, CodeWithScope, etc.
• Result size limited (in 2.4 and earlier)
– 2.6 returns a cursor or sends output directly to a new collection: no result size limit!
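In the 2.6 shell these limits are handled through an option and a stage on the aggregation itself. The sketch below only builds the arguments; the collection names zips and cityPopulations are placeholders, and the commented-out call assumes a connected mongo shell.

```javascript
var pipeline = [
  { $group: { _id: "$city", pop: { $sum: "$pop" } } },
  { $sort: { pop: 1 } },
  // $out must be the last stage; it writes the results to a collection
  // instead of returning them, so the result size limit does not apply.
  { $out: "cityPopulations" } // placeholder collection name
];

// allowDiskUse lets stages write to temporary files once they exceed the memory limit.
var options = { allowDiskUse: true };

// In the mongo shell (2.6 or later), assuming a `zips` collection exists:
// db.zips.aggregate(pipeline, options);

var lastStage = Object.keys(pipeline[pipeline.length - 1])[0];
console.log(lastStage);
```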
MapReduce
• Versatile, powerful
• Intended for complex data analysis
• Overkill for simple aggregations
MapReduce
[Diagram: Data Set → worker threads call the mapper → workers call Reduce() → Output]
Our Example Data
{
  _id: 375,
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  chapters: 9,
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ],
  language: "English"
}
MapReduce
db.books.mapReduce(
    map, reduce, { finalize: finalize, out: { inline : 1 } } )
function map() {
    var key = this.language;
    emit( key, { totalPages : this.pages, numBooks : 1 } );
}
function reduce(key, values) {
    var result = { numBooks : 0, totalPages : 0 };
    values.forEach(function (value) {
        result.numBooks += value.numBooks;
        result.totalPages += value.totalPages;
    });
    return result;
}
function finalize( key, value ) {
    if ( value.numBooks != 0 )
        return value.totalPages / value.numBooks;
}
MapReduce
db.books.mapReduce(
    map, reduce, { finalize: finalize, out: { inline : 1 } } )
"results" : [
    {
        "_id" : "English",
        "value" : 653
    },
    {
        "_id" : "Russian",
        "value" : 1440
    }
]
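To see how the three functions fit together without a server, the sketch below imitates the mapReduce flow in plain JavaScript: map emits key/value pairs, the values are collected per key, reduce folds each list, and finalize turns the folded value into an average. The runMapReduce harness, the global emit function and the book documents are invented for this example. The real server may call reduce several times per key, and skips it when a key has only one value.

```javascript
var emitted = {}; // key -> list of emitted values (stand-in for the server's bookkeeping)

function emit(key, value) {
  (emitted[key] = emitted[key] || []).push(value);
}

function map() {
  emit(this.language, { totalPages: this.pages, numBooks: 1 });
}

function reduce(key, values) {
  var result = { numBooks: 0, totalPages: 0 };
  values.forEach(function (value) {
    result.numBooks += value.numBooks;
    result.totalPages += value.totalPages;
  });
  return result;
}

function finalize(key, value) {
  if (value.numBooks != 0)
    return value.totalPages / value.numBooks;
}

// Hypothetical harness: runs map on every document, then reduce and finalize per key.
function runMapReduce(docs) {
  emitted = {};
  docs.forEach(function (doc) { map.call(doc); }); // `this` is the current document
  return Object.keys(emitted).map(function (key) {
    var values = emitted[key];
    // reduce is skipped when only one value was emitted for a key
    var reduced = values.length > 1 ? reduce(key, values) : values[0];
    return { _id: key, value: finalize(key, reduced) };
  });
}

// Invented documents, shaped like the example data.
var books = [
  { title: "Book A", language: "English", pages: 218 },
  { title: "Book B", language: "English", pages: 100 },
  { title: "Book C", language: "Russian", pages: 1440 }
];

var results = runMapReduce(books);
console.log(results);
```

Keeping the emitted value and the reduce output in the same shape ({ totalPages, numBooks }) is what lets finalize handle both the reduced and the single-value case with one code path.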
Advantages
• Map and reduce code can be arbitrarily complex
– JavaScript, helper functions
• Results can be saved into a new collection
– replace, merge or re-reduce
• Incremental MapReduce
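Re-reducing new results into previously saved output only works if reduce can accept its own output as one of its input values. The sketch below checks that property in plain JavaScript for a reduce function of the kind used in the books example, with invented page counts.

```javascript
function reduce(key, values) {
  var result = { numBooks: 0, totalPages: 0 };
  values.forEach(function (value) {
    result.numBooks += value.numBooks;
    result.totalPages += value.totalPages;
  });
  return result;
}

// Invented emitted values: an earlier batch and a later one for the same key.
var oldBatch = [
  { numBooks: 1, totalPages: 218 },
  { numBooks: 1, totalPages: 100 }
];
var newBatch = [
  { numBooks: 1, totalPages: 300 }
];

// Everything reduced in one pass.
var allAtOnce = reduce("English", oldBatch.concat(newBatch));

// Incremental: reduce the old batch, save it, then fold the new values into the saved result.
var saved = reduce("English", oldBatch);
var incremental = reduce("English", [ saved ].concat(newBatch));

var sameResult = JSON.stringify(allAtOnce) === JSON.stringify(incremental);
console.log(sameResult);
```

A reduce that returned the average directly would fail this check, which is why the example keeps running totals and leaves the division to finalize.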
Limitations
• Implemented with JavaScript
– Single-threaded
• Slower than Aggregation Framework
– Batch, not real time
• Harder to understand, implement, debug...
Analyzing MongoDB Data in External Systems
Hadoop
Framework that allows for the distributed processing of large data sets across clusters of computers
Hadoop MongoDB Connector
• MongoDB or BSON files as input/output
• Source data can be filtered with queries
• Hadoop Streaming support
– For jobs written in Python, Ruby, Node.js
• Supports Hadoop tools such as Pig and Hive
Processing Big Data
• Data broken up into smaller pieces
• Process data across multiple nodes
[Diagram: data distributed across multiple Hadoop nodes]
Input splits on Non-sharded Systems
[Diagram: Total Dataset → input splits across Hadoop nodes → Single Map Reduce]
Advantages
• Processing decoupled from data store
• Parallel processing
• Leverage existing infrastructure
• Java has rich set of data processing libraries
– And other languages if using Hadoop Streaming
Disadvantages
• Batch processing
• Requires synchronization between data store and processor
• Adds complexity to infrastructure
Storm
Storm MongoDB connector
• Spout for MongoDB oplog or capped collections
– Filtering capabilities
– Threaded and non-blocking
• Output to new or existing documents
– Insert/update bolt
Aggregating MongoDB’s Data Processing Options
Internal Tools
• Storing pre-aggregated data
– An exercise in schema design
• Aggregation Framework
• MapReduce
External Tools
Questions?
Principal Solutions Architect, MongoDB Inc.
Asya Kamsky
Thank You
#BigDataCamp @MongoDB @asya999

Editor's Notes

  • #23 "h" : { "$hour" : "$time" }, "m" : { "$minute" : "$time" }, "s" : { "$second" : "$time" },
  • #25, #26 { $match : { loc : { $geoWithin : { $centerSphere : [ [ -122.4, 37.79 ], 20/3959 ] } } } } { city: "PALO ALTO", loc: [ -122.127, 37.418 ], state: "CA" }
  • #44, #45 2.4 will improve somewhat
  • #52, #53 Distributed, real-time computation system.