2014 bigdatacamp asya_kamsky

Data Processing and Aggregation Options
Asya Kamsky
Principal Solutions Architect, MongoDB, Inc.
#BigDataCamp @MongoDB @asya999

Applications and data
Store
Process
Data Processing and Aggregation Options in MongoDB / Asya Kamsky

Big Data

Big Data in MongoDB
• An ideal operational database
• High performance for storage and retrieval at large scale
• Robust query interface for intelligent operations

MongoDB data processing options

Big Data in MongoDB
• Pre-aggregate in MongoDB for real-time queries
• Process in MongoDB using Aggregation Framework
• Process in MongoDB using Map/Reduce
• Process outside MongoDB using Hadoop and other external tools

Aggregation Framework
• Declared in JSON, executes in C++
• Flexible, functional, and simple
• Plays nice with sharding

Pipeline
ps ax | grep mongod | head -n 1
Piping command line operations

Pipeline
$match | $group | $sort
Piping aggregation operations
Stream of documents → Result document
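The pipe analogy can be sketched in plain JavaScript, without a database. This is only an in-memory illustration: `match`, `sortBy`, `limit`, and `runPipeline` are hypothetical helpers, not MongoDB APIs, and the sample documents are made up. Each stage takes a stream of documents and hands its output to the next stage.

```javascript
// Each stage is a function from an array of documents to a new array.
// These helpers are illustrative stand-ins, not MongoDB operators.
const match = (predicate) => (docs) => docs.filter(predicate);
const sortBy = (field) => (docs) => [...docs].sort((a, b) => a[field] - b[field]);
const limit = (n) => (docs) => docs.slice(0, n);

// Feed the output of each stage into the next one, like a shell pipe.
function runPipeline(docs, stages) {
  return stages.reduce((stream, stage) => stage(stream), docs);
}

// Hypothetical sample documents.
const zips = [
  { city: "SAN FRANCISCO", state: "CA", pop: 27239 },
  { city: "NEW YORK", state: "NY", pop: 5574 },
  { city: "PALO ALTO", state: "CA", pop: 24309 },
];

const smallestInCA = runPipeline(zips, [
  match((doc) => doc.state === "CA"),
  sortBy("pop"),
  limit(1),
]);
console.log(smallestInCA);
```

The real framework runs the stages on the server; the sketch only shows the data flow.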

Pipeline Operators
• $match
• $project
• $group
• $unwind
• $sort/$skip/$limit
• $redact
• $geoNear
• $out

$match
• Filter documents
• Uses existing query syntax
• 2.4 added support for geospatial operations
• 2.6 added support for full text search indexes

{ $match : { state : "NY" } }
{
  city: "SAN FRANCISCO",
  loc: [-122.4614, 37.781],
  state: "CA"
}
{
  city: "NEW YORK",
  loc: [ -73.989, 40.731],
  state: "NY"
}
{
  city: "PALO ALTO",
  loc: [ -122.127, 37.418],
  state: "CA"
}

{ $match : { loc : { $geoWithin :
    { $centerSphere : [ [ -122.4, 37.79 ], 20/3959 ] } } } }
{
  city: "SAN FRANCISCO",
  loc: [-122.4614, 37.781],
  state: "CA"
}
{
  city: "NEW YORK",
  loc: [ -73.989, 40.731],
  state: "NY"
}
{
  city: "PALO ALTO",
  loc: [ -122.127, 37.418],
  state: "CA"
}
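The radius 20/3959 is twenty miles expressed in radians, since 3959 is approximately the Earth's radius in miles. The sketch below shows that conversion with a haversine great-circle distance. The haversine formula and the helper names (`distanceMiles`, `withinCenterSphere`) are assumptions made for illustration and are not necessarily how the server evaluates $centerSphere.

```javascript
const EARTH_RADIUS_MILES = 3959; // same constant as in the slide's 20/3959

function toRadians(degrees) {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance in miles between two [longitude, latitude] pairs,
// computed with the haversine formula (an assumption for this sketch).
function distanceMiles([lon1, lat1], [lon2, lat2]) {
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
}

// Hypothetical helper: true when the point lies within radiusRadians of center.
function withinCenterSphere(point, center, radiusRadians) {
  return distanceMiles(point, center) / EARTH_RADIUS_MILES <= radiusRadians;
}

const center = [-122.4, 37.79];
const radius = 20 / 3959; // twenty miles in radians

const sanFranciscoInside = withinCenterSphere([-122.4614, 37.781], center, radius);
const newYorkInside = withinCenterSphere([-73.989, 40.731], center, radius);
console.log(sanFranciscoInside, newYorkInside);
```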

$project
• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields

Selecting and Excluding Fields
Input:
{
  _id: "94105",
  city: "SAN FRANCISCO",
  loc: [-122.3892, 37.7864],
  state: "CA"
}
$project: { _id: 0, loc: 1, state: 1 }
Output:
{
  loc: [-122.3892, 37.7864],
  state: "CA"
}

Renaming and Computing Fields
Input:
{
  _id: "94105",
  city: "SAN FRANCISCO",
  loc: [-122.3892, 37.7864],
  state: "CA"
}
$project: { zip: "$_id", cityState: { $concat: [ "$city", ", ", "$state" ] }, _id: 0 }
Output:
{
  zip: "94105",
  cityState: "SAN FRANCISCO, CA"
}
Here zip is a new field (a rename of _id) and $concat is the operation that computes cityState.

Input:
{
  _id : 6694,
  cname : "Cust#000060209",
  status : "F",
  totalprice : 123350.97,
  orderdate : ISODate("2012-09-01T13:11:31Z"),
  lineitems: [
    { ... },
    { ... },
    { ... }
  ]
}
$project : { dt: { y : { "$year" : "$orderdate" },
                   m : { "$month" : "$orderdate" },
                   d : { "$dayOfMonth" : "$orderdate" } },
             totalprice : 1, status : 1, _id : 0 }
Output:
{
  dt : {
    y : 2012,
    m : 9,
    d : 1
  },
  totalprice: 123350.97,
  status: "F"
}
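As a rough equivalent outside the database, the same reshaping can be written as a plain JavaScript function. `projectOrder` is a hypothetical name, and the sketch assumes the date parts are taken in UTC, which matches the sample values on the slide.

```javascript
// Hypothetical in-memory counterpart of the $project stage above.
function projectOrder(order) {
  const date = order.orderdate;
  return {
    dt: {
      y: date.getUTCFullYear(),
      m: date.getUTCMonth() + 1, // JavaScript months are 0-based, $month is 1-based
      d: date.getUTCDate(),
    },
    totalprice: order.totalprice,
    status: order.status,
    // _id, cname and lineitems are left out, like the fields not listed in $project
  };
}

const order = {
  _id: 6694,
  cname: "Cust#000060209",
  status: "F",
  totalprice: 123350.97,
  orderdate: new Date("2012-09-01T13:11:31Z"),
};

const projected = projectOrder(order);
console.log(projected);
```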

$group
• Group documents by an ID
– Field reference, object, constant
• Other output fields are computed
– $max, $min, $avg, $sum
– $addToSet, $push
– $first, $last
• Processes all data in memory
– can utilize external disk-based sort in 2.6
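A small in-memory sketch may help show how accumulators build one output document per group. `groupBy`, `sum`, `max`, and `push` below are hypothetical helpers that imitate $group, $sum, $max, and $push, and the population for zip 94105 is made up.

```javascript
// Group documents by a key and fold each document into per-field accumulators.
// Illustrative only; this is not how the server implements $group.
function groupBy(docs, keyOf, accumulators) {
  const groups = new Map();
  for (const doc of docs) {
    const key = keyOf(doc);
    if (!groups.has(key)) groups.set(key, { _id: key });
    const out = groups.get(key);
    for (const [field, accumulate] of Object.entries(accumulators)) {
      // out[field] is undefined on first use, which triggers the default state
      out[field] = accumulate(out[field], doc);
    }
  }
  return [...groups.values()];
}

// Accumulator factories; the default parameter supplies the initial state.
const sum = (field) => (state = 0, doc) => state + doc[field];
const max = (field) => (state = -Infinity, doc) => Math.max(state, doc[field]);
const push = (field) => (state = [], doc) => [...state, doc[field]];

// Sample documents; the pop value for 94105 is hypothetical.
const zipDocs = [
  { _id: "94105", city: "SAN FRANCISCO", pop: 2000 },
  { _id: "94124", city: "SAN FRANCISCO", pop: 27239 },
  { _id: "94306", city: "PALO ALTO", pop: 24309 },
];

const grouped = groupBy(zipDocs, (doc) => doc.city, {
  totalPop: sum("pop"),
  largestZipPop: max("pop"),
  zips: push("_id"),
});
console.log(grouped);
```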

Find the smallest cities within twenty miles of San Francisco
Input:
{ _id: "94306",
  loc: [ -122.127, 37.418],
  pop: 24309 }
{ _id: "10280",
  city: "NEW YORK",
  loc: [ -74.016, 40.710],
  pop: 5574 }
{ _id: "94124",
  loc: [-122.388, 37.73],
  pop: 27239 }
Pipeline:
{ $match : { loc :
    { $geoWithin :
      { $centerSphere : [
          [ -122.4, 37.79 ],
          20/3959
      ] } } } },
{ $group : {
    _id : "$city",
    pop : { $sum : "$pop" }
} },
{ $sort : { "pop" : 1 } },
{ $limit : 3 }
Output:
{
  _id: "STINSON BEACH",
  pop: 630
}
{
  _id: "WOODACRE",
  pop: 1524
}
{
  _id: "BOLINAS",
  pop: 1555
}

$unwind
• Operate on an array field
• Yield new documents for each array element
– Array replaced by element value
– Missing/empty fields → no output
– Non-array fields → error
• Pipe to $group to aggregate array values

$unwind
Input:
{
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ]
}
{ $unwind: "$subjects" }
Output:
{
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: "Long Island"
}
{
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: "New York"
}
{
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: "1920s"
}
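The rules listed for $unwind (one document per element, no output for missing or empty fields, an error for non-array fields) can be mimicked in a few lines of JavaScript. `unwind` here is a hypothetical in-memory helper that follows those rules as stated on the slide. It is not the server implementation.

```javascript
// In-memory imitation of { $unwind: "$<field>" } following the slide's rules.
function unwind(docs, field) {
  return docs.flatMap((doc) => {
    const value = doc[field];
    // Missing field: the document produces no output.
    if (value === undefined || value === null) return [];
    // Non-array field: an error.
    if (!Array.isArray(value)) {
      throw new TypeError(`cannot unwind non-array field "${field}"`);
    }
    // One copy of the document per element; an empty array yields nothing.
    return value.map((element) => ({ ...doc, [field]: element }));
  });
}

const book = {
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: ["Long Island", "New York", "1920s"],
};

const unwound = unwind([book], "subjects");
console.log(unwound);
```

Piping the unwound documents into a grouping step keyed on `subjects` would then count books per subject.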

2.6 Improvements
• Returns a cursor (not a document)
– just like a regular find
• New stages
– $redact
– $out
• New operators:
– set expression operators.
– $let and $map operators to allow for the use of variables.
– $literal operator and $size operator
– $cond expression object
• Integrated $text search
• Performance improvements, "explain" and more

Advantages
• Runs on the server
– Uses indexes
– Uses shards
• Simple to build complex pipelines
• Easy to use from any driver
• Faster than other options

Limitations
• Pipeline operator memory limits
– 10% of total system RAM in 2.4 and earlier
– 100MB in 2.6 but can use disk for external sort
• Some data types not allowed
– Code, CodeWithScope, etc.
• Result size limited (in 2.4 and earlier)
– 2.6 returns a cursor or directs output to a new collection
– No result size limit!

MapReduce
• Versatile, powerful
• Intended for complex data analysis
• Overkill for simple aggregations

MapReduce
Data Set → Worker thread calls mapper → Workers call Reduce() → Output

Our Example Data
{
  _id: 375,
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  chapters: 9,
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ],
  language: "English"
}

MapReduce
db.books.mapReduce(
  map, reduce, { finalize: finalize, out: { inline : 1 } } )

// map: emit one partial result per book, keyed by language
function map() {
  var key = this.language;
  emit( key, { totalPages : this.pages, numBooks : 1 } );
}

MapReduce
// reduce: add up the partial results emitted for one language
function reduce(key, values) {
  var result = { numBooks : 0, totalPages : 0 };
  values.forEach(function (value) {
    result.numBooks += value.numBooks;
    result.totalPages += value.totalPages;
  });
  return result;
}

MapReduce
// finalize: turn the totals into an average page count per language
function finalize( key, value ) {
  if ( value.numBooks != 0 )
    return value.totalPages / value.numBooks;
}

MapReduce
"results" : [
  {
    "_id" : "English",
    "value" : 653
  },
  {
    "_id" : "Russian",
    "value" : 1440
  }
]
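To see how map, reduce, and finalize fit together without a database, here is a minimal in-memory sketch. The `mapReduce` function is a hypothetical stand-in for db.books.mapReduce(), `emit` is passed to `map` as an argument instead of being a global, and the sample books are made up, so the averages differ from the results on the slide.

```javascript
// Illustrative in-memory stand-in for db.books.mapReduce(); not the MongoDB API.
function mapReduce(docs, map, reduce, finalize) {
  const emitted = new Map(); // key -> list of emitted values
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  // Like the slide's map(), the document is available as `this`.
  for (const doc of docs) map.call(doc, emit);

  const results = [];
  for (const [key, values] of emitted) {
    // reduce is only needed when a key has more than one value
    const reduced = values.length > 1 ? reduce(key, values) : values[0];
    results.push({ _id: key, value: finalize(key, reduced) });
  }
  return results;
}

function map(emit) {
  emit(this.language, { totalPages: this.pages, numBooks: 1 });
}

function reduce(key, values) {
  const result = { numBooks: 0, totalPages: 0 };
  for (const value of values) {
    result.numBooks += value.numBooks;
    result.totalPages += value.totalPages;
  }
  return result;
}

function finalize(key, value) {
  if (value.numBooks !== 0) return value.totalPages / value.numBooks;
}

// Hypothetical sample data.
const books = [
  { language: "English", pages: 218 },
  { language: "English", pages: 400 },
  { language: "Russian", pages: 1440 },
];

const averages = mapReduce(books, map, reduce, finalize);
console.log(averages);
```

Because reduce returns the same shape that map emits, its output can be fed back into reduce together with new values, which is the property that re-reduce and incremental MapReduce depend on.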

Advantages
• Map and reduce code can be arbitrarily complex
– JavaScript, helper functions
• Results can be saved into a new collection
– replace, merge or re-reduce
• Incremental MapReduce

Limitations
• Implemented with JavaScript
– Single-threaded
• Slower than Aggregation Framework
– Batch, not real time
• Harder to understand, implement, debug...

Analyzing MongoDB Data in External Systems

Hadoop
Framework that allows for the distributed processing of large data sets across clusters of computers

Hadoop MongoDB Connector
• MongoDB or BSON files as input/output
• Source data can be filtered with queries
• Hadoop Streaming support
– For jobs written in Python, Ruby, Node.js
• Supports Hadoop tools such as Pig and Hive

Processing Big Data
• Data broken up into smaller pieces
• Process data across multiple nodes

Input splits on Non-sharded Systems
Total Dataset → Single Map Reduce

Advantages
• Processing decoupled from data store
• Parallel processing
• Leverage existing infrastructure
• Java has rich set of data processing libraries
– And other languages if using Hadoop Streaming
Disadvantages
• Batch processing
• Requires synchronization between data store and processor
• Adds complexity to infrastructure

Storm

Storm MongoDB connector
• Spout for MongoDB oplog or capped collections
– Filtering capabilities
– Threaded and non-blocking
• Output to new or existing documents
– Insert/update bolt

Aggregating MongoDB’s Data Processing Options

Internal Tools
• Storing pre-aggregated data
– An exercise in schema design
• Aggregation Framework
• MapReduce

External Tools

Principal Solutions Architect, MongoDB Inc.
Asya Kamsky
Thank You
#BigDataCamp @MongoDB @asya999
