Codemotion Milano 2013

Data Processing and
Aggregation
Massimo Brignoli
Solutions Architect, MongoDB Inc.

MongoDB scales easily to store mass volumes of data. However, when it comes to making sense of it all what options do you have? In this talk, we’ll take a look at 3 different ways of aggregating your data with MongoDB, and determine the reasons why you might choose one way over another. No matter what your big data needs are, you will find out how MongoDB the big data store is evolving to help make sense of your data.

  • IBM designed IMS with Rockwell and Caterpillar starting in 1966 for the Apollo program. IMS's challenge was to inventory the very large bill of materials (BOM) for the Saturn V moon rocket and Apollo space vehicle.
  • This is helpful because as much as 95% of enterprise information is unstructured, and doesn’t fit neatly into tidy rows and columns. NoSQL and Hadoop allow for dynamic schema.
  • The industry is talking about Hadoop and MongoDB for Big Data. So should you.
  • This is where MongoDB fits into the existing enterprise IT stack. MongoDB is an operational data store used for online data, in the same way that Oracle is an operational data store. It supports applications that ingest, store, manage and even analyze data in real-time. (Compared to Hadoop and data warehouses, which are used for offline, batch analytical workloads.)
  • Another common use case we see is warehousing of data. Again, the connector allows you to utilize existing libraries via Hadoop.
  • The third most common use case is an ETL (extract, transform, load) function, then putting the aggregated data into MongoDB for further analysis.

    1. Codemotion Milano 2013: Data Processing and Aggregation. Massimo Brignoli, Solutions Architect, MongoDB Inc. Except where otherwise noted, this work is licensed under http://createivecommons.org/licenses/by-nc-sa/3.0/
    2. Who Am I? • Solutions Architect/Evangelist at MongoDB Inc. • 20 years of experience in databases • Former MySQL employee • Previous life: web, web, web
    3. Big Data
    4. What is Big Data? • Big Data is like teenage sex: • everyone talks about it • nobody really knows how to do it • everyone thinks everyone else is doing it • so everyone claims they are doing it…
    5. Understanding Big Data – It’s Not Very “Big” • 64% - Ingest diverse, new data in real-time • 15% - More than 100TB of data • 20% - Less than 100TB (average of all? <20TB) (from Big Data Executive Summary – 50+ top executives from Government and F500 firms)
    6. For over a decade Big Data == Custom Software
    7. Lots of Great Innovations Since 1970
    8. Including the Relational Database
    9. RDBMS Makes Development Hard (diagram labels: Code, XML Config, DB Schema, Application, Object Relational Mapping, Relational Database)
    10. And Even Harder To Iterate (diagram labels: New Table, New Column, New Table, Name, Pet, Phone, New Column, 3 months later…, Email)
    11. From Complexity to Simplicity (RDBMS vs. MongoDB)
        {
          _id : ObjectId("4c4ba5e5e8aabf3"),
          employee_name: "Dunham, Justin",
          department : "Marketing",
          title : "Product Manager, Web",
          report_up: "Neray, Graham",
          pay_band: "C",
          benefits : [
            { type : "Health", plan : "PPO Plus" },
            { type : "Dental", plan : "Standard" }
          ]
        }
    12. In the past few years Open source software has emerged enabling the rest of us to handle Big Data
    13. Use Popular, Well-Known Technologies (Source: Silicon Angle, 2012)
    14. Enterprise Big Data Stack • Applications: CRM, ERP, Collaboration, Mobile, BI • Data Management: Online Data (RDBMS, RDBMS), Offline Data (Hadoop, EDW) • Infrastructure: OS & Virtualization, Compute, Storage, Network • Security & Auditing • Management & Monitoring
    15. Consideration – Online vs. Offline. Online: • Real-time • Low-latency • High availability. Offline: • Long-running • High-latency • Availability is lower priority
    16. How MongoDB Meets Our Requirements • MongoDB is an operational database • MongoDB provides high performance for storage and retrieval at large scale • MongoDB has a robust query interface permitting intelligent operations • MongoDB is not a data processing engine, but provides processing functionality
    17. MongoDB data processing options (image: http://www.flickr.com/photos/torek/4444673930/)
    18. Getting Example Data
    19. The “hello world” of MapReduce is counting words in a paragraph of text. Let’s try something a little more interesting…
    20. What is the most popular pub name?
    21. Open Street Map Data
        #!/usr/bin/env python
        # Data Source
        # http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59]
        import re
        import sys
        from imposm.parser import OSMParser
        import pymongo
        # node_points and collection are defined elsewhere in the script (not shown on the slide)
        class Handler(object):
            def nodes(self, nodes):
                if not nodes:
                    return
                docs = []
                for node in nodes:
                    osm_id, doc, (lon, lat) = node
                    if "name" not in doc:
                        # Unnamed node: remember its coordinates and skip it
                        node_points[osm_id] = (lon, lat)
                        continue
                    # Normalize the name. Note: lstrip("The ") strips any leading
                    # 'T', 'h', 'e' or ' ' characters, not the literal prefix "The "
                    doc["name"] = doc["name"].title().lstrip("The ").replace("And", "&")
                    doc["_id"] = osm_id
                    doc["location"] = {"type": "Point", "coordinates": [lon, lat]}
                    docs.append(doc)
                collection.insert(docs)
    22. Example Pub Data
        {
          "_id" : 451152,
          "amenity" : "pub",
          "name" : "The Dignity",
          "addr:housenumber" : "363",
          "addr:street" : "Regents Park Road",
          "addr:city" : "London",
          "addr:postcode" : "N3 1DH",
          "toilets" : "yes",
          "toilets:access" : "customers",
          "location" : { "type" : "Point", "coordinates" : [-0.1945732, 51.6008172] }
        }
    23. MapReduce
    24. MongoDB MapReduce (diagram labels: MongoDB, map, reduce, finalize)
    25. Map Function
        > var map = function() {
            emit(this.name, 1);
          }
    26. Reduce Function
        > var reduce = function (key, values) {
            var sum = 0;
            values.forEach( function (val) {sum += val;} );
            return sum;
          }
    27. Results
        > db.pub_names.find().sort({value: -1}).limit(10)
        { "_id" : "The Red Lion", "value" : 407 }
        { "_id" : "The Royal Oak", "value" : 328 }
        { "_id" : "The Crown", "value" : 242 }
        { "_id" : "The White Hart", "value" : 214 }
        { "_id" : "The White Horse", "value" : 200 }
        { "_id" : "The New Inn", "value" : 187 }
        { "_id" : "The Plough", "value" : 185 }
        { "_id" : "The Rose & Crown", "value" : 164 }
        { "_id" : "The Wheatsheaf", "value" : 147 }
        { "_id" : "The Swan", "value" : 140 }
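The map and reduce functions on slides 25 and 26 can be mimicked outside the database to see what each phase contributes. The following is a minimal pure-Python sketch, not MongoDB's implementation: `map_pub`, `reduce_counts`, `map_reduce`, and the sample documents are all hypothetical names and data introduced here for illustration.

```python
from collections import defaultdict


def map_pub(doc):
    """Counterpart of the slide's map(): emit (name, 1) for one document."""
    yield doc["name"], 1


def reduce_counts(key, values):
    """Counterpart of the slide's reduce(): sum the values emitted for one key."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    """Group every emitted value by key, then reduce each group to one value."""
    emitted = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            emitted[key].append(value)
    return {key: reducer(key, values) for key, values in emitted.items()}


# Hypothetical sample documents, not taken from the OpenStreetMap extract.
sample_pubs = [
    {"name": "The Red Lion"},
    {"name": "The Crown"},
    {"name": "The Red Lion"},
]

pub_names = map_reduce(sample_pubs, map_pub, reduce_counts)

# Equivalent of find().sort({value: -1}): highest count first.
top = sorted(pub_names.items(), key=lambda kv: kv[1], reverse=True)
```

One difference worth keeping in mind: this sketch calls the reducer for every key, whereas MongoDB does not call reduce for a key that has only a single emitted value. The result is the same here because summing a single `1` returns `1`.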
    28. (no text on this slide)
    29. Pub Names in the Center of London
        > db.pubs.mapReduce(map, reduce, {
            out: "pub_names",
            query: {
              location: {
                $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] }
              }
            }
          })
        {
          "result" : "pub_names",
          "timeMillis" : 116,
          "counts" : { "input" : 643, "emit" : 643, "reduce" : 54, "output" : 537 },
          "ok" : 1
        }
    30. Results
        > db.pub_names.find().sort({value: -1}).limit(10)
        { "_id" : "All Bar One", "value" : 11 }
        { "_id" : "The Slug & Lettuce", "value" : 7 }
        { "_id" : "The Coach & Horses", "value" : 6 }
        { "_id" : "The Green Man", "value" : 5 }
        { "_id" : "The Kings Arms", "value" : 5 }
        { "_id" : "The Red Lion", "value" : 5 }
        { "_id" : "Corney & Barrow", "value" : 4 }
        { "_id" : "O'Neills", "value" : 4 }
        { "_id" : "Pitcher & Piano", "value" : 4 }
        { "_id" : "The Crown", "value" : 4 }
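The `2 / 3959` in the `$centerSphere` query on slide 29 is a radius in radians: 2 miles divided by the Earth's radius of roughly 3959 miles. The sketch below works through that conversion and applies an equivalent "within radius" test using the haversine formula. It is an approximation for illustration, not MongoDB's geometry code; the function names and the `near` point are hypothetical, while `far` uses the coordinates of "The Dignity" from slide 22.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # same constant as on the slide


def miles_to_radians(miles):
    """Convert a surface distance in miles to a central angle in radians."""
    return miles / EARTH_RADIUS_MILES


def central_angle(lon1, lat1, lon2, lat2):
    """Haversine central angle (radians) between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


def within_center_sphere(point, center, radius_radians):
    """True if point ([lon, lat]) lies within radius_radians of center."""
    angle = central_angle(point[0], point[1], center[0], center[1])
    return angle <= radius_radians


center = [-0.12, 51.516]           # query center from the slide
radius = miles_to_radians(2)       # the slide's 2 / 3959

near = [-0.12, 51.52]              # hypothetical point a fraction of a mile north
far = [-0.1945732, 51.6008172]     # "The Dignity" from the example pub data

near_inside = within_center_sphere(near, center, radius)
far_inside = within_center_sphere(far, center, radius)
```

Coordinates are ordered longitude first, then latitude, matching the `coordinates` arrays in the example documents.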
    31. MongoDB MapReduce • Real-time • Output directly to document or collection • Runs inside MongoDB on local data − Adds load to your DB − In JavaScript – debugging can be a challenge − Translating in and out of C++
    32. Aggregation Framework
    33. Aggregation Framework (diagram labels: MongoDB, op1, op2, opN)
    34. Aggregation Framework in 60 Seconds
    35. Aggregation Framework Operators • $project • $match • $limit • $skip • $sort • $unwind • $group
    36. $match • Filter documents • Uses existing query syntax • If using $geoNear it has to be first in pipeline • $where is not supported
    37. Matching Field Values
        Input documents:
        { "_id" : 271421, "amenity" : "pub", "name" : "Sir Walter Tyrrell", "location" : { "type" : "Point", "coordinates" : [ -1.6192422, 50.9131996 ] } }
        { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
        Stage:
        { "$match": { "name": "The Red Lion" }}
        Output:
        { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
    38. $project • Reshape documents • Include, exclude or rename fields • Inject computed fields • Create sub-document fields
    39. Including and Excluding Fields
        Input document:
        { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
        Stage:
        { "$project": { "_id": 0, "amenity": 1, "name": 1 }}
        Output:
        { "amenity" : "pub", "name" : "The Red Lion" }
    40. Reformatting Documents
        Input document:
        { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
        Stage:
        { "$project": { "_id": 0, "name": 1, "meta": { "type": "$amenity" } }}
        Output:
        { "name" : "The Red Lion", "meta" : { "type" : "pub" } }
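The two `$project` examples on slides 39 and 40 (including fields, and building a sub-document from a `"$field"` reference) can be approximated in a few lines of Python. This is a simplified sketch under stated assumptions, not the aggregation framework's behavior in full: the hypothetical `project` helper handles only `1`, `0`, `"$field"` references, and nested specs, and unlike MongoDB it drops `_id` unless it is explicitly included.

```python
def project(doc, spec):
    """Apply a simplified $project spec to one document.

    Supported rules (a subset chosen for this sketch):
      1          -> copy the field through
      0          -> leave the field out
      "$other"   -> set the field to the value of doc["other"]
      {...}      -> build a sub-document from a nested spec
    """
    out = {}
    for field, rule in spec.items():
        if isinstance(rule, dict):
            out[field] = project(doc, rule)
        elif isinstance(rule, str) and rule.startswith("$"):
            source = rule[1:]
            if source in doc:
                out[field] = doc[source]
        elif rule == 1 and field in doc:
            out[field] = doc[field]
    return out


# The Red Lion document from the slides.
red_lion = {
    "_id": 271466,
    "amenity": "pub",
    "name": "The Red Lion",
    "location": {"type": "Point", "coordinates": [-1.5494749, 50.7837119]},
}

included = project(red_lion, {"_id": 0, "amenity": 1, "name": 1})
reshaped = project(red_lion, {"_id": 0, "name": 1, "meta": {"type": "$amenity"}})
```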
    41. Dealing with Arrays
        Input document:
        { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "facilities" : [ "toilets", "food" ] }
        Stages:
        { "$project": { "_id": 0, "name": 1, "facility": "$facilities" }},
        { "$unwind": "$facility" }
        Output:
        { "name" : "The Red Lion", "facility" : "toilets" },
        { "name" : "The Red Lion", "facility" : "food" }
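The `$unwind` stage on slide 41 turns one document holding an array into one document per array element. A minimal Python sketch of that behavior follows; `unwind` is a hypothetical helper, it assumes the field holds a list, and it drops documents whose field is missing or empty.

```python
def unwind(docs, field):
    """Yield one copy of each document per element of doc[field].

    Assumes doc[field] is a list. Documents where the field is missing
    or the list is empty produce no output.
    """
    for doc in docs:
        values = doc.get(field)
        if not values:
            continue
        for value in values:
            out = dict(doc)      # shallow copy so the input is not modified
            out[field] = value   # replace the array with a single element
            yield out


projected = [{"name": "The Red Lion", "facility": ["toilets", "food"]}]
unwound = list(unwind(projected, "facility"))
```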
    42. $group • Group documents by an ID • Field reference, object, constant • Other output fields are computed: $max, $min, $avg, $sum, $addToSet, $push, $first, $last • Processes all data in memory
    43. Back to the pub! (image: http://www.offwestend.com/index.php/theatres/pastshows/71)
    44. Popular Pub Names
        > var popular_pub_names = [
            { $match : { location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] } } } },
            { $group : { _id: "$name", value: { $sum: 1 } } },
            { $sort : { value: -1 } },
            { $limit : 10 }
          ]
    45. Results
        > db.pubs.aggregate(popular_pub_names)
        {
          "result" : [
            { "_id" : "All Bar One", "value" : 11 },
            { "_id" : "The Slug & Lettuce", "value" : 7 },
            { "_id" : "The Coach & Horses", "value" : 6 },
            { "_id" : "The Green Man", "value" : 5 },
            { "_id" : "The Kings Arms", "value" : 5 },
            { "_id" : "The Red Lion", "value" : 5 },
            { "_id" : "Corney & Barrow", "value" : 4 },
            { "_id" : "O'Neills", "value" : 4 },
            { "_id" : "Pitcher & Piano", "value" : 4 },
            { "_id" : "The Crown", "value" : 4 }
          ],
          "ok" : 1
        }
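The pipeline on slide 44 is a chain of stages in which each stage consumes the previous stage's output. The sketch below models the `$group`, `$sort`, and `$limit` steps as plain Python functions applied in order. It is an illustration only: all function names and the sample data are hypothetical, and the geospatial `$match` stage is left out to keep the example short.

```python
from collections import Counter


def group_count(docs, key):
    """Like {$group: {_id: "$key", value: {$sum: 1}}}: count documents per key."""
    counts = Counter(doc[key] for doc in docs)
    return [{"_id": name, "value": total} for name, total in counts.items()]


def sort_desc(docs, field):
    """Like {$sort: {field: -1}}: highest value first."""
    return sorted(docs, key=lambda d: d[field], reverse=True)


def limit(docs, n):
    """Like {$limit: n}: keep only the first n documents."""
    return docs[:n]


def run_pipeline(docs, stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        docs = stage(docs)
    return docs


# Hypothetical sample documents, not the real OpenStreetMap data.
sample_pubs = [
    {"name": "All Bar One"},
    {"name": "The Crown"},
    {"name": "All Bar One"},
    {"name": "O'Neills"},
    {"name": "The Crown"},
    {"name": "All Bar One"},
]

stages = [
    lambda docs: group_count(docs, "name"),
    lambda docs: sort_desc(docs, "value"),
    lambda docs: limit(docs, 2),
]

result = run_pipeline(sample_pubs, stages)
```

Because every stage has the same shape (documents in, documents out), stages can be reordered or added without changing the runner, which is the property the operator list on slide 35 relies on.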
    46. Aggregation Framework Benefits • Real-time • Simple yet powerful interface • Declared in JSON, executes in C++ • Runs inside MongoDB on local data − Adds load to your DB − Limited Operators − Data output is limited
    47. Analyzing MongoDB Data in External Systems
    48. MongoDB with Hadoop (diagram label: MongoDB)
    49. MongoDB with Hadoop (diagram labels: MongoDB, warehouse)
    50. MongoDB with Hadoop (diagram labels: ETL, MongoDB)
    51. Map Pub Names in Python
        #!/usr/bin/env python
        import sys
        from pymongo_hadoop import BSONMapper
        # get_bounds() and get_geo() are helper functions not shown on the slide
        def mapper(documents):
            bounds = get_bounds()  # ~2 mile polygon
            for doc in documents:
                geo = get_geo(doc["location"])  # Convert the geo type
                if not geo:
                    continue
                if bounds.intersects(geo):
                    # Emit one count per pub inside the bounds
                    yield {'_id': doc['name'], 'count': 1}
        BSONMapper(mapper)
        sys.stderr.write("Done Mapping.\n")
    52. Reduce Pub Names in Python
        #!/usr/bin/env python
        from pymongo_hadoop import BSONReducer
        def reducer(key, values):
            # Sum the counts emitted by the mapper for this pub name
            _count = 0
            for v in values:
                _count += v['count']
            return {'_id': key, 'value': _count}
        BSONReducer(reducer)
    53. Execute MapReduce
        hadoop jar target/mongo-hadoop-streaming-assembly-1.1.0-rc0.jar \
            -mapper examples/pub/map.py \
            -reducer examples/pub/reduce.py \
            -mongo mongodb://127.0.0.1/demo.pubs \
            -outputURI mongodb://127.0.0.1/demo.pub_names
    54. Popular Pub Names Nearby
        > db.pub_names.find().sort({value: -1}).limit(10)
        { "_id" : "All Bar One", "value" : 11 }
        { "_id" : "The Slug & Lettuce", "value" : 7 }
        { "_id" : "The Coach & Horses", "value" : 6 }
        { "_id" : "The Kings Arms", "value" : 5 }
        { "_id" : "Corney & Barrow", "value" : 4 }
        { "_id" : "O'Neills", "value" : 4 }
        { "_id" : "Pitcher & Piano", "value" : 4 }
        { "_id" : "The Crown", "value" : 4 }
        { "_id" : "The George", "value" : 4 }
        { "_id" : "The Green Man", "value" : 4 }
    55. MongoDB and Hadoop • Away from data store • Can leverage existing data processing infrastructure • Can horizontally scale your data processing - Offline batch processing - Requires synchronisation between store & processor - Infrastructure is much more complex
    56. The Future of Big Data and MongoDB
    57. What is Big Data? Big Data today will be normal tomorrow
    58. Exponential Data Growth (chart: billions of URLs indexed by Google, 2000 to 2012; y-axis from 0 to 10000)
    59. MongoDB enables you to scale big
    60. MongoDB is evolving so you can process the big
    61. Data Processing with MongoDB • Process in MongoDB using Map/Reduce • Process in MongoDB using Aggregation Framework • Process outside MongoDB using Hadoop and other external tools
    62. Questions?
    63. Codemotion Milano. Thanks! Massimo Brignoli, Solutions Architect, MongoDB Inc.