Analyze Yourselves
Agenda
Rationale
Tools
Schema Design
Instruction Set
Ola!
Norberto Leite
Technical Evangelist
Madrid, Spain
http://www.mongodb.com/norberto
@nleite
norberto@mongodb.com
Rationale
With all this big data stuff out there …
Things to consider
•  Data Handling
–  Processing
–  Storage
•  Which schema?
•  Data types to use?
•  Visualization
–  Access to data
–  Use Data
•  Usage
–  Enrichment
–  Actualization / Updates
–  Format Changes
How can we use our day-to-day data to
experiment with different "big data" options?
And all for fun … if you're that kind of person
Feeds
•  Machine Data
–  Implementation: scapy sniffer
–  All out/inbound traffic for the last hours
•  Twitter Feed
–  Implementation: TwitterAPI
–  All tweets that match a set of terms
•  Facebook Posts
–  Implementation: facebook-sdk
–  All my personal posts
Tools
Tools
•  MongoDB
–  Standard query language
–  Aggregation Framework
•  Python
–  2.7.10 (yes I'm lagging behind!)
–  scapy
–  pymongo
–  TwitterAPI
–  facebook-sdk
–  Matplotlib
–  IPython Notebook
Schema Design
Different Approaches
•  Raw Data Collection
–  Individual Feed Collections
–  Global Feed Collections
•  Base Structured Documents
•  Time Series Model
•  Purpose Modeling
–  Read Oriented
–  Write Oriented
Raw Collections
db.network.findOne()
{
  "_id": ObjectId("55fc4faf4cc75f4fa21b2f64"),
  "src": "00:11:32:34:9a:b7",
  "ip": {
    "frag": NumberLong("0"),
    "src": "192.168.1.45",
    "proto": 6,
    "tos": 0,
    "dst": "192.168.1.39",
    "chksum": 47515,
    ...
}
db.fb.findOne()
{
  "_id": ObjectId("55fc4fa44cc75f4fa21b2de0"),
  "picture": "https://fbcdn-photos-b-a.akamaihd.net/hphotos-ak-xpf1/v/t1.0-0/s130x130/11938079_10153567958826624_1515311618300487358_n.jpg?oh=0a59f8eebaea7536939c04e178fe8f29&oe=56A52C83&__gda__=1453828245_72225acf102eeeb4f4f02cb09d668ab9",
  "story": "Norberto Leite updated his cover photo.",
  "likes": {
    "paging": {
      "cursors": {
        ...
}
db.twitter.findOne()
{
  "_id": ObjectId("55fe4d194cc75f0157a8c8b4"),
  "contributors": null,
  "truncated": false,
  "text": "We compared #python vs #nodejs see results: http://t.co/WVeOGWMR5V",
  "in_reply_to_status_id": null,
  "id": NumberLong("64547933684644659
  "favorite_count": 0,
Raw Collections
Positive                   Not So Much
Simple Approach            Hard to Maintain
Fast to Develop            More logic on the App Layer
Direct Model to Service    Dependency on 3rd Party Model
Simple direct queries      More complicated to Merge Results
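The "more logic on the app layer" and "more complicated to merge results" costs can be sketched in plain Python. This is only an illustration: the lists stand in for query results from the three raw collections, and the "ts" timestamp field and the NORMALIZERS table are hypothetical names that do not appear in the slides.

```python
from datetime import datetime
from operator import itemgetter

# Stand-ins for results from db.network, db.fb and db.twitter.
# "ts" is a hypothetical timestamp each collector would have to add.
network_docs = [{"ts": datetime(2015, 9, 20, 14, 0), "ip": {"src": "192.168.1.45"}}]
fb_docs = [{"ts": datetime(2015, 9, 20, 13, 0),
            "story": "Norberto Leite updated his cover photo."}]
twitter_docs = [{"ts": datetime(2015, 9, 20, 15, 0),
                 "text": "We compared #python vs #nodejs"}]

# One normalizer per feed: every third-party shape needs its own code path.
NORMALIZERS = {
    "network": lambda d: d["ip"]["src"],
    "fb": lambda d: d["story"],
    "twitter": lambda d: d["text"],
}

def merge_feeds(feeds):
    """Merge {feed_name: [docs]} into one list sorted by timestamp."""
    merged = []
    for name, docs in feeds.items():
        summarize = NORMALIZERS[name]
        for doc in docs:
            merged.append({"feed": name, "ts": doc["ts"], "summary": summarize(doc)})
    return sorted(merged, key=itemgetter("ts"))

timeline = merge_feeds({"network": network_docs, "fb": fb_docs, "twitter": twitter_docs})
```

Each new feed, or each change in a third-party format, means another entry in the normalizer table, which is the maintenance cost listed in the right-hand column.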
  
Single Raw Collection
db.raw.find()
{
  "_id": ObjectId("55fe4d194cc75f0157a8c8b4"),
  "contributors": null,
  "truncated": false,
  "text": "We compared #python vs #nodejs - see results: http://t.co/WVeOGWMR5V",
  ...
}
{
  "_id": ObjectId("55fc4fa44cc75f4fa21b2de0"),
  "picture": "https://fbcdn-photos-b-a.akamaihd.net/hphotos..."
  ...
{
  "_id": ObjectId("55fc4faf4cc75f4fa21b2f64"),
  "src": "00:11:32:34:9a:b7",
  "ip": {
Single Raw
Positive                       Not So Much
Single Access Point            Even Harder to Maintain
Same development speed         Loading data requires Codecs to be done well
Faster Access to Result Set    More complicated to Filter Results
  
Semi-structured Collection
{
  "_id": ObjectId("55fea46a4cc75f1848559476"),
  "feed": "network",
  …
    ]
  },
  "process_date": ISODate("2015-09-20T14:19:54.945Z"),
  "type": 2048
}
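One way to arrive at such a document is to wrap every raw feed item in a common envelope before inserting it. The sketch below is an assumption about how that could look, not the presenter's code: "feed", "process_date" and "type" come from the sample, while the "payload" field name and the to_semi_structured helper are hypothetical.

```python
from datetime import datetime, timezone

def to_semi_structured(feed, raw, type_code=None, now=None):
    """Wrap a raw feed document in an envelope shared by all feeds."""
    doc = {
        "feed": feed,                                      # which source produced it
        "process_date": now or datetime.now(timezone.utc), # when it was processed
        "payload": raw,                                    # hypothetical field name
    }
    if type_code is not None:
        doc["type"] = type_code                            # optional, feed-specific
    return doc

doc = to_semi_structured(
    "network",
    {"src": "00:11:32:34:9a:b7"},
    type_code=2048,
    now=datetime(2015, 9, 20, 14, 19, 54, tzinfo=timezone.utc),
)
```

With a common envelope, a single filter on "feed" or "process_date" works across all sources, at the price of the modeling effort needed up front.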
Semi-structured Single Collection
Positive                         Not So Much
Single Access Point              Needs modeling
Common Structure to all data
Faster Access to Result Set
Single "Shardable" collection
  
Time Series
Time Series
Positive                              Not So Much
Size Deterministic                    Discards Data
In-place Updates
Fast Operations – reads and writes
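A common way to get these properties is a pre-allocated bucket document per feed and hour, updated with "$inc". The sketch below assumes that layout; the field names ("hour", "total", "minutes") and the helper functions are hypothetical, and apply_inc is only a tiny in-memory stand-in for what the server would do with the update document.

```python
from datetime import datetime

def new_hour_bucket(feed, ts):
    """Pre-allocate one document per feed and hour, one counter per minute."""
    return {
        "feed": feed,
        "hour": ts.replace(minute=0, second=0, microsecond=0),
        "total": 0,
        "minutes": {str(m): 0 for m in range(60)},  # fixed size from the start
    }

def increment_spec(ts, count=1):
    """Build the update document that bumps the hour total and one minute slot."""
    return {"$inc": {"total": count, "minutes.%d" % ts.minute: count}}

def apply_inc(doc, spec):
    """In-memory imitation of "$inc" on dotted paths, for illustration only."""
    for path, delta in spec["$inc"].items():
        *parents, leaf = path.split(".")
        target = doc
        for key in parents:
            target = target[key]
        target[leaf] += delta
    return doc

bucket = new_hour_bucket("network", datetime(2015, 9, 20, 14, 19, 54))
apply_inc(bucket, increment_spec(datetime(2015, 9, 20, 14, 19, 54)))
apply_inc(bucket, increment_spec(datetime(2015, 9, 20, 14, 19, 59), count=3))
```

The document never grows after creation, so its size is known in advance and updates happen in place. The individual events are gone, though: only the per-minute counts survive, which is the "discards data" trade-off.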
Purpose Model
Purpose Model – Fan On Write
Purpose Model – Fan On Read
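The slides give only the names of the two models, so the following is a sketch of the usual meaning of the terms rather than the presenter's design. It uses in-memory dictionaries instead of collections, and the class names, the subscribers mapping and the reader names are all hypothetical.

```python
from collections import defaultdict

class FanOnWrite:
    """Write oriented cost: copy each item to every reader when it arrives."""
    def __init__(self, subscribers):
        self.subscribers = subscribers        # feed -> list of readers
        self.timelines = defaultdict(list)    # reader -> ready-made timeline

    def write(self, feed, item):
        for reader in self.subscribers.get(feed, []):
            self.timelines[reader].append(item)

    def read(self, reader):
        return list(self.timelines[reader])   # a single cheap lookup


class FanOnRead:
    """Read oriented cost: store each item once, assemble on every read."""
    def __init__(self, subscribers):
        self.subscribers = subscribers
        self.items = defaultdict(list)        # feed -> items, stored once

    def write(self, feed, item):
        self.items[feed].append(item)         # a single cheap insert

    def read(self, reader):
        timeline = []
        for feed, readers in self.subscribers.items():
            if reader in readers:
                timeline.extend(self.items[feed])
        return timeline


subs = {"twitter": ["alice", "bob"], "fb": ["alice"]}
on_write, on_read = FanOnWrite(subs), FanOnRead(subs)
for model in (on_write, on_read):
    model.write("twitter", "t1")
    model.write("fb", "f1")
```

Both models return the same timelines. They differ in where the work happens: fan on write pays with extra copies at insert time, fan on read pays with extra lookups at query time.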
Instruction Set
Instruction Set Available
•  Standard CRUD Operations
–  Queries
–  Updates – "$set", "$inc", "$setOnInsert", plus the "upsert" option
•  Aggregation Framework
–  Worst name ever for a framework!
•  Grouping
•  Project
•  Unwind
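The operators listed above can be sketched as plain Python data structures of the kind pymongo accepts. This is an illustration under assumptions: the "counters" collection, the "hashtags" array field and both helper functions are hypothetical, and nothing here connects to a server.

```python
def daily_counter_upsert(feed, day):
    """Return (filter, update) for a per-feed daily counter."""
    query = {"feed": feed, "day": day}
    update = {
        "$inc": {"count": 1},                    # bump the counter
        "$set": {"last_source": feed},           # overwritten on every update
        "$setOnInsert": {"origin": "example"},   # applied only when inserting
    }
    return query, update

def hashtag_pipeline():
    """Count documents per hashtag with unwind, group and project stages."""
    return [
        {"$unwind": "$hashtags"},                                 # one doc per tag
        {"$group": {"_id": "$hashtags", "count": {"$sum": 1}}},   # count per tag
        {"$project": {"_id": 0, "hashtag": "$_id", "count": 1}},  # rename output
    ]

query, update = daily_counter_upsert("twitter", "2015-09-20")
pipeline = hashtag_pipeline()

# With a running server and pymongo 3.x this could be used roughly as:
#   db.counters.update_one(query, update, upsert=True)
#   for row in db.twitter.aggregate(pipeline): ...
```

Note that upsert is passed as an option to the update call, while "$setOnInsert" is the operator that only takes effect when the upsert ends up inserting a new document.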
Takeaways
Takeaways
•  A good schema is crucial to the performance of your
system
–  Functional
–  Logical
•  Different usage of data will shape your Schema
•  Storage Engines will also be important
–  Different storage engines perform differently depending
on the workload
MongoDB Days 2015
5 November 2015, London
https://www.mongodb.com/events/mongodb-days-uk
Obrigado!
Norberto Leite
Technical Evangelist
norberto@mongodb.com
@nleite