Building a Scalable Inbox System with MongoDB and Java

Antoine Girbal (@antoinegirbal)
Technical Account Manager Lead, MongoDB Inc
JavaOne 2013

Agenda
• Problem Overview
• Schema and queries
• Java Development
• Design Options
– Fan out on Read
– Fan out on Write
– Bucketed Fan out on Write
– Cached Inbox
• Discussion

Basic CRUD
• Save your first document:
> db.test.insert({firstName: "Antoine", lastName: "Girbal" } )
• Find the document:
> db.test.find({firstName: "Antoine" } )
{ _id: ObjectId("524495105889411fab0cdfa3"), firstName: "Antoine", lastName: "Girbal" }
• Update the document:
> db.test.update({_id: ObjectId("524495105889411fab0cdfa3")}, { x: 1, y: 2 } ) // replaces the whole document, keeping only _id
• Remove the document:
> db.test.remove({_id: ObjectId("524495105889411fab0cdfa3")})
• No schema definition or other declaration, it's easy!

The User Document
{ "_id": ObjectId("519c12d53004030e5a6316d2"),
"address": {
"streetAddress": "2600 Rafe Lane",
"city": "Jackson",
"state": "MS",
"zip": 39201,
"country": "US" },
"birthday": ISODate("1980-12-26T00:00:00.000Z"),
"company": "Parade of Shoes",
"domain": "SanFranciscoAgency.com",
"email": "AnthonyJDacosta@pookmail.com",
"firstName": "Anthony",
"gender": "male",
"lastName": "Dacosta",
"location": [ -90.183518, 32.368619 ],
…
}

The User Collection
The collection statistics:
> db.users.stats()
{
"ns": "edges.users",
"count": 1000000, // number of documents
"size": 637864480, // size of all documents
"avgObjSize": 637.86448,
"storageSize": 845197312,
"numExtents": 16,
"nindexes": 2,
"lastExtentSize": 227786752,
"paddingFactor": 1.0000000000260925, // padding after documents
"systemFlags": 1,
"userFlags": 0,
"totalIndexSize": 66070256,
"indexSizes": { "_id_": 29212848, "uid_1": 36857408 },
"ok": 1
}

Queries on Users
Finding a user by email address…
> db.users.find({ "email": "AnthonyJDacosta@pookmail.com" }).pretty()
{ "_id": ObjectId("519c12d53004030e5a6316d2"),
…
By default (no index on "email") the query uses a slow full collection scan…
> db.users.find({ "email": "AnthonyJDacosta@pookmail.com" } ).explain()
{ "cursor": "BasicCursor",
"nscannedObjects": 1000000, // 1m objects scanned
"nscanned": 1000000,
…
Use an index for fast performance…
> db.users.ensureIndex({ "email": 1 } ) // does not do anything if index is there
> db.users.find({ "email": "AnthonyJDacosta@pookmail.com" }).explain()
{ "cursor": "BtreeCursor email_1", // Btree, sweet!
"nscannedObjects": 1, // document is found almost right away
"nscanned": 1,
…

Users Relationships
• Follower / followee relationships are many-to-many.
They can be stored as:
1. a list of followers in user
2. a list of followees in user
3. a relationship collection: "followees"
4. two relationship collections: "followees" and "followers".
• Ideal solutions:
– a few million users and a 1000 followee limit: Solution #2
– no boundaries and relative scaling: Solution #3
– no boundaries and max scaling: Solution #4

Relationship Data
Let's look at a sample document:
> use edges
switched to db edges
> db.followees.findOne()
{ "_id": ObjectId(),
"user": "17052001",
"followee": "31554261"
}
And the statistics:
> db.followees.stats()
{
"ns": "edges.followees",
"count": 1000000,
"size": 64000048,
"avgObjSize": 64.000048,
"numExtents": 10,
"nindexes": 2,
"paddingFactor": 1,
"systemFlags": 1,
"userFlags": 0,
"indexSizes": {
"_id_": 32458720,
"user_1_followee_1": 53103120 },
"ok": 1
}

Relationship Queries
To find all the users that a user follows:
> db.followees.ensureIndex({ user: 1, followee: 1 }) // why not just index on user? We shall see
> db.followees.find({user: "11622712"})
{ "_id" : ObjectId("51641c02e4b0ef6827a34569"), "user" : "11622712", "followee" : "30432718" }
…
> db.followees.find({user: "11622712"}).explain()
{
"cursor" : "BtreeCursor user_1_followee_1",
"n" : 66,
"indexOnly" : false,
"millis" : 0, // this is fast
Even faster if using a “covered” index:
> db.followees.find({user: "11622712"}, {followee: 1, _id: 0}).explain()
{
"cursor" : "BtreeCursor user_1_followee_1",
"n" : 66,
"nscannedObjects" : 0,
"nscanned" : 66,
"indexOnly" : true, // this means covered
To find all the followers of a user, we just need the opposite index:
> db.followees.ensureIndex({followee: 1, user: 1})
> db.followees.find({followee: "30313973"}, {user: 1, _id: 0})
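
The two compound indexes above can be pictured as sorted lists of key pairs. The following is a minimal in-memory Java sketch of that idea, not MongoDB's actual B-tree and not driver code; the class and method names (CompoundIndexSketch, secondKeysFor) are hypothetical. It shows why a query on the first key that only returns the second key can be answered from the index entries alone, which is what "covered" refers to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// In-memory stand-in for a compound index such as { user: 1, followee: 1 }.
// Entries are kept sorted by the first key, then by the second key.
final class CompoundIndexSketch {
    private final TreeSet<String[]> entries = new TreeSet<>((a, b) -> {
        int c = a[0].compareTo(b[0]);
        return c != 0 ? c : a[1].compareTo(b[1]);
    });

    void add(String first, String second) {
        entries.add(new String[] { first, second });
    }

    // Range scan over all entries whose first key equals `first`.
    // The second key is read from the index entry itself, so no document
    // lookup is needed.
    List<String> secondKeysFor(String first) {
        List<String> result = new ArrayList<>();
        String[] from = { first, "" };            // smallest possible entry for this key
        for (String[] e : entries.tailSet(from, true)) {
            if (!e[0].equals(first)) break;       // left the contiguous range
            result.add(e[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        CompoundIndexSketch byUser = new CompoundIndexSketch();     // (user, followee)
        CompoundIndexSketch byFollowee = new CompoundIndexSketch(); // (followee, user)
        String[][] edges = { { "u1", "u2" }, { "u1", "u3" }, { "u2", "u3" } };
        for (String[] e : edges) {
            byUser.add(e[0], e[1]);
            byFollowee.add(e[1], e[0]);
        }
        System.out.println(byUser.secondKeysFor("u1"));     // users that u1 follows
        System.out.println(byFollowee.secondKeysFor("u3")); // users that follow u3
    }
}
```

An index on "user" alone would locate the same entries, but each followee id would then have to be fetched from the document.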

Message Document
The message document:
> db.messages.findOne()
{
"_id": ObjectId("519d4858e4b079162fe7eb12"),
"uid": "48268973", // the author id
"username": "Abiall", // why store the username?
"text": "Lorem ipsum dolor sit amet, consectetur ...",
"created": ISODate("2013-05-22T22:36:08.663Z"),
"location": [ -95.470188, 37.366044 ],
"tags": [ "gadgets" ]
}
Collection statistics:
> db.messages.stats()
{
"ns": "msg.messages",
"count": 21440518,
"size": 14184598000,
"avgObjSize": 661.5790719235422,
"numExtents": 27,
"nindexes": 2,
"paddingFactor": 1,
"systemFlags": 1,
"userFlags": 0,
"indexSizes": {
"_id_": 695646784,
"uid_1_created_1": 758642864 },
"ok": 1
}

Implementing the Outbox
The query is on "uid" and needs to be sorted by descending "created" time:
> db.messages.ensureIndex({ "uid": 1, "created": 1 } ) // use a compound index
> db.messages.find({ "uid": "31837072" } ).sort({ "created": -1 } ).limit(100)
{ "_id": ObjectId("519d626ae4b07916312e15b1"), "uid": "31837072", "username": "Roya…
"text": "Lorem ipsum dolor sit amet, consectetur adipisicing elit , sed do eiusmod tempor …",
"created": ISODate("2013-05-23T00:27:22.369Z"),
"location": [ "-118.296138", "33.772832" ],
"tags": [ "Art" ] }
…
> db.messages.find({ "uid": "31837072" }).sort({ "created": -1 }).limit(100).explain()
{
"cursor": "BtreeCursor uid_1_created_1 reverse",
"n": 18,
"nscannedObjects": 18,
"nscanned": 18,
"scanAndOrder": false,
"millis": 0
…
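
The explain() output above reports "reverse" and "scanAndOrder": false: the index is already ordered by "uid" and then "created", so the newest messages of one author are found by walking that author's key range backwards. Below is a small in-memory Java sketch of that behaviour under simplifying assumptions (a TreeMap instead of a B-tree, timestamps as long values, and at most one message per author and timestamp). OutboxIndexSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// In-memory stand-in for the { uid: 1, created: 1 } index on messages.
final class OutboxIndexSketch {
    // Index key: author id first, then creation time in milliseconds.
    static final class Key {
        final String uid;
        final long created;
        Key(String uid, long created) {
            this.uid = uid;
            this.created = created;
        }
    }

    private final TreeMap<Key, String> index = new TreeMap<>(
        Comparator.comparing((Key k) -> k.uid).thenComparingLong(k -> k.created));

    void insert(String uid, long created, String text) {
        index.put(new Key(uid, created), text);
    }

    // Newest-first messages of one author: take the contiguous key range for
    // `uid` and walk it in reverse, so no separate sort step is needed.
    List<String> outbox(String uid, int limit) {
        List<String> result = new ArrayList<>();
        Key low = new Key(uid, Long.MIN_VALUE);
        Key high = new Key(uid, Long.MAX_VALUE);
        for (String text : index.subMap(low, true, high, true).descendingMap().values()) {
            if (result.size() == limit) break;
            result.add(text);
        }
        return result;
    }

    public static void main(String[] args) {
        OutboxIndexSketch messages = new OutboxIndexSketch();
        messages.insert("alice", 100L, "first");
        messages.insert("bob", 150L, "hello");
        messages.insert("alice", 300L, "third");
        messages.insert("alice", 200L, "second");
        System.out.println(messages.outbox("alice", 2)); // the two newest messages by alice
    }
}
```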

Java support
• Java driver is open source, available on github
and Maven.
• mongo.jar is the driver, bson.jar is a subset with
BSON library only.
• Java driver is probably the most used MongoDB
driver
• It receives active development by MongoDB Inc
and the community

Driver Features
• CRUD
• Support for replica sets
• Connection pooling
• Distributed reads to slave servers
• BSON serializer/deserializer (lazy option)
• JSON serializer/deserializer
• GridFS

Message Store
public class MessageStoreDAO implements MessageStore {
    private Morphia morphia;
    private Datastore ds;
    public MessageStoreDAO( MongoClient mongo ) {
        this.morphia = new Morphia();
        this.morphia.map(DBMessage.class);
        this.ds = morphia.createDatastore(mongo, "messages");
        this.ds.getCollection(DBMessage.class).
            ensureIndex(new BasicDBObject("sender",1).append("sentAt",1) );
    }
    // get a message
    public Message get(String user_id, String msg_id) {
        return (Message) this.ds.find(DBMessage.class)
            .filter("sender", user_id)
            .filter("_id", new ObjectId(msg_id))
            .get();
    }

Message Store
    // save a message
    public Message save(String user_id, String message, Date date) {
        Message msg = new DBMessage( user_id, message, date );
        ds.save( msg );
        return msg;
    }
    // find messages by author, sorted by descending time
    public List<Message> sentBy(String user_id) {
        return (List) this.ds.find(DBMessage.class)
            .filter("sender",user_id).order("-sentAt").limit(50).asList();
    }
    // find messages by several authors, sorted by descending time
    public List<Message> sentBy(List<String> user_ids) {
        return (List) this.ds.find(DBMessage.class)
            .field("sender").in(user_ids).order("-sentAt").limit(50).asList();
    }
}

Graph Store
The code below uses Solution #4: two collections, one for followees (here called "friends") and one for followers.
public class GraphStoreDAO implements GraphStore {
    private DBCollection friends;
    private DBCollection followers;
    public GraphStoreDAO(MongoClient mongo) {
        this.followers = mongo.getDB("edges").getCollection("followers");
        this.friends = mongo.getDB("edges").getCollection("friends");
        followers.ensureIndex( new BasicDBObject("u",1).append("o",1), new BasicDBObject("unique", true));
        friends.ensureIndex( new BasicDBObject("u",1).append("o",1), new BasicDBObject("unique",true));
    }
    // find users that are followed
    public List<String> friendsOf(String user_id) {
        List<String> theFriends = new ArrayList<String>();
        DBCursor cursor = friends.find( new BasicDBObject("u",user_id),
            new BasicDBObject("_id",0).append("o",1));
        while(cursor.hasNext())
            theFriends.add( (String) cursor.next().get("o"));
        return theFriends;
    }

Graph Store
    // find followers of a user
    public List<String> followersOf(String user_id) {
        List<String> theFollowers = new ArrayList<String>();
        DBCursor cursor = followers.find( new BasicDBObject("u",user_id),
            new BasicDBObject("_id",0).append("o",1));
        while(cursor.hasNext())
            theFollowers.add( (String) cursor.next().get("o"));
        return theFollowers;
    }
    // write both directions of the relationship
    public void follow(String user_id, String toFollow) {
        friends.save( new BasicDBObject("u",user_id).append("o",toFollow));
        followers.save( new BasicDBObject("u",toFollow).append("o",user_id));
    }
    public void unfollow(String user_id, String toUnFollow) {
        friends.remove(new BasicDBObject("u", user_id).append("o", toUnFollow));
        followers.remove(new BasicDBObject("u", toUnFollow).append("o", user_id));
    }
}

4 Approaches (there are more)
• Fan out on Read
• Fan out on Write
• Bucketed Fan out on Write
• Inbox Caches

Fan out on read
• Generally, not the right approach
• 1 document per message sent
• Reading an inbox means finding all messages sent by
the people the user follows
• Requires scatter-gather on sharded cluster
• Then a lot of random IO on a shard to find
everything
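
The bullets above can be sketched as a toy in-memory model in Java. This is an illustration under stated assumptions, not driver code: plain maps stand in for the "messages" and "followees" collections, timestamps are long values, and FanOutOnReadSketch and its members are hypothetical names. Sending is a single write; reading gathers from every followee, then sorts and truncates.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of "fan out on read".
final class FanOutOnReadSketch {
    static final class Message {
        final String author;
        final long created;
        final String text;
        Message(String author, long created, String text) {
            this.author = author;
            this.created = created;
            this.text = text;
        }
    }

    private final Map<String, List<String>> followees = new HashMap<>();
    private final Map<String, List<Message>> outboxes = new HashMap<>();

    void follow(String user, String followee) {
        followees.computeIfAbsent(user, k -> new ArrayList<>()).add(followee);
    }

    // Send: one stored copy per message, filed under its author.
    void send(String author, long created, String text) {
        outboxes.computeIfAbsent(author, k -> new ArrayList<>())
                .add(new Message(author, created, text));
    }

    // Read: collect the messages of every followee, sort newest first, cut to `limit`.
    List<Message> inbox(String user, int limit) {
        List<Message> gathered = new ArrayList<>();
        for (String followee : followees.getOrDefault(user, Collections.emptyList())) {
            gathered.addAll(outboxes.getOrDefault(followee, Collections.emptyList()));
        }
        gathered.sort(Comparator.comparingLong((Message m) -> m.created).reversed());
        return new ArrayList<>(gathered.subList(0, Math.min(limit, gathered.size())));
    }

    public static void main(String[] args) {
        FanOutOnReadSketch app = new FanOutOnReadSketch();
        app.follow("joe", "ann");
        app.follow("joe", "bob");
        app.send("ann", 1L, "hello from ann");
        app.send("bob", 2L, "hello from bob");
        app.send("ann", 3L, "ann again");
        for (Message m : app.inbox("joe", 2)) {
            System.out.println(m.author + ": " + m.text);
        }
    }
}
```

The read cost grows with the number of followees, which is the scatter-gather and random IO concern listed above.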

Fan out on Read
Put the followee ids in a list:
> var fees = []
> db.followees.find({user: "11622712"})
.forEach( function(doc) { fees.push( doc.followee ) } )
Use $in and sort() and limit() to gather the inbox:
> db.messages.find({ uid: { $in: fees } }).sort({ created: -1 }).limit(100)
{ "_id": ObjectId("519d627ce4b07916312f0a09"), "uid": "34660390", "username": "Dingdowas"
{ "_id": ObjectId("519d627ce4b07916312f0a10"), "uid": "34661390", "username": "John" } …
{ "_id": ObjectId("519d627ce4b07916312f0a11"), "uid": "34662390", "username": "Brenda" } …
…

Fan out on read – Send Message
[Diagram: Send Message; Shard 1, Shard 2, Shard 3]

Fan out on read – Inbox Read
[Diagram: Read Inbox]

Fan out on read
> db.messages.find({ uid: { $in: fees } } ).sort({ created: -1 } ).limit(100).explain()
{
"cursor": "BtreeCursor uid_1_created_1 multi",
"isMultiKey": false,
"n": 100,
"nscannedObjects": 1319,
"nscanned": 1384,
"nscannedObjectsAllPlans": 1425,
"nscannedAllPlans": 1490,
"scanAndOrder": true, // it is sorting in RAM??
"indexOnly": false,
"nYields": 0,
"nChunkSkips": 0,
"millis": 31 // takes about 30ms
}

Fan out on write
• Tends to scale better than fan out on read
• 1 document per recipient
• Reading my inbox is just finding all of the
messages with me as the recipient
• Can shard on recipient, so inbox reads hit one
shard
• But still lots of random IO on the shard
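
As a rough illustration of the bullets above, here is a toy in-memory Java model of fan out on write. It is a sketch under assumptions, not driver code: a map keyed by recipient stands in for an inbox collection sharded on recipient, and FanOutOnWriteSketch, InboxEntry and the writes counter are hypothetical names. One send produces one write per follower, and a read touches only the recipient's own entries.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of "fan out on write".
final class FanOutOnWriteSketch {
    static final class InboxEntry {
        final String recipient;
        final String from;
        final long sent;
        final String message;
        InboxEntry(String recipient, String from, long sent, String message) {
            this.recipient = recipient;
            this.from = from;
            this.sent = sent;
            this.message = message;
        }
    }

    private final Map<String, List<String>> followers = new HashMap<>();
    private final Map<String, List<InboxEntry>> inboxes = new HashMap<>();
    int writes = 0; // counts inbox writes, to make the fan out visible

    void follow(String user, String toFollow) {
        followers.computeIfAbsent(toFollow, k -> new ArrayList<>()).add(user);
    }

    // Send: write one copy of the message per follower of the author.
    void send(String from, long sent, String message) {
        for (String recipient : followers.getOrDefault(from, Collections.emptyList())) {
            inboxes.computeIfAbsent(recipient, k -> new ArrayList<>())
                   .add(new InboxEntry(recipient, from, sent, message));
            writes++;
        }
    }

    // Read: only the recipient's own entries are touched, newest first.
    List<InboxEntry> inbox(String recipient, int limit) {
        List<InboxEntry> mine =
            new ArrayList<>(inboxes.getOrDefault(recipient, Collections.emptyList()));
        mine.sort(Comparator.comparingLong((InboxEntry e) -> e.sent).reversed());
        return new ArrayList<>(mine.subList(0, Math.min(limit, mine.size())));
    }

    public static void main(String[] args) {
        FanOutOnWriteSketch app = new FanOutOnWriteSketch();
        app.follow("ann", "joe");
        app.follow("bob", "joe");
        app.send("joe", 1L, "Hi!");
        System.out.println("writes: " + app.writes);
        System.out.println("ann's inbox size: " + app.inbox("ann", 10).size());
    }
}
```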

Fan out on Write
// Shard on "recipient" and "sent"
sh.shardCollection("myapp.inbox", { "recipient": 1, "sent": 1 })
msg = { from: "Joe", sent: new Date(), message: "Hi!" }
// Send a message: write one copy per follower
// (followersOf() is an application helper returning an array of user ids)
followersOf(msg.from).forEach(function(follower) {
    db.inbox.insert({ from: msg.from, sent: msg.sent, message: msg.message, recipient: follower });
});
// Read my inbox, super easy
db.inbox.find({ recipient: "Joe" }).sort({ sent: -1 })

Fan out on write – Send Message
[Diagram: Send Message]

Fan out on write – Read Inbox
[Diagram: Read Inbox]

Bucketed Fan out on write
• Each "inbox" document is an array of messages
• Append a message onto the "inbox" of the recipient
• Bucket inbox documents so there are not too many
messages per document
• Can shard on recipient, so inbox reads hit one
shard
• 1 or 2 documents to read the whole inbox
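
The bucketing idea above can be sketched in memory as follows. This is a toy Java model under assumptions, not driver code: a per-user counter stands in for an incremented message count, the bucket number is the counter divided by the bucket size using integer division, and BucketedInboxSketch, BUCKET_SIZE and the method names are hypothetical. The bucket size of 50 is taken from the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy in-memory model of "bucketed fan out on write".
final class BucketedInboxSketch {
    static final int BUCKET_SIZE = 50; // messages per bucket document

    // Per-user running message counter.
    private final Map<String, Integer> msgCount = new HashMap<>();
    // owner -> (sequence -> messages in that bucket)
    private final Map<String, TreeMap<Integer, List<String>>> buckets = new HashMap<>();

    // Deliver one message: bump the counter, derive the bucket, append to it.
    void deliver(String recipient, String message) {
        int count = msgCount.merge(recipient, 1, Integer::sum); // incremented value
        int sequence = count / BUCKET_SIZE;                     // integer division
        buckets.computeIfAbsent(recipient, k -> new TreeMap<>())
               .computeIfAbsent(sequence, k -> new ArrayList<>())
               .add(message);
    }

    // Read the newest `maxBuckets` buckets, highest sequence first.
    List<List<String>> newestBuckets(String owner, int maxBuckets) {
        List<List<String>> result = new ArrayList<>();
        TreeMap<Integer, List<String>> mine = buckets.get(owner);
        if (mine == null) return result;
        for (List<String> bucket : mine.descendingMap().values()) {
            if (result.size() == maxBuckets) break;
            result.add(bucket);
        }
        return result;
    }

    public static void main(String[] args) {
        BucketedInboxSketch app = new BucketedInboxSketch();
        for (int i = 1; i <= 120; i++) {
            app.deliver("joe", "message " + i);
        }
        for (List<String> bucket : app.newestBuckets("joe", 2)) {
            System.out.println(bucket.size() + " messages, newest: " + bucket.get(bucket.size() - 1));
        }
    }
}
```

Because the counter is incremented before the division, the first bucket ends up with 49 messages in this sketch and later full buckets with 50. Reading two buckets returns between roughly 50 and 100 recent messages.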

Bucketed Fan out on Write
// Shard on "owner / sequence"
sh.shardCollection("myapp.buckets", { "owner": 1, "sequence": 1 })
sh.shardCollection("myapp.users", { "user_name": 1 })
msg = { from: "Joe", sent: new Date(), message: "Hi!" }
// Send a message: for each recipient, find the right sequence (bucket) document
// (followersOf() is an application helper returning an array of user ids)
followersOf(msg.from).forEach(function(recipient) {
    var sequence = Math.floor(db.users.findAndModify({
        query: { user_name: recipient },
        update: { '$inc': { 'msg_count': 1 } },
        upsert: true,
        new: true }).msg_count / 50); // 50 messages per bucket
    db.buckets.update({ owner: recipient, sequence: sequence },
        { $push: { 'messages': msg } },
        { upsert: true });
});
// Read my inbox
db.buckets.find({ owner: "Joe" }).sort({ sequence: -1 }).limit(2)

Bucketed fan out on write – Send
[Diagram: Send Message]

Bucketed fan out on write – Read
[Diagram: Read Inbox]

Cached inbox
• Recent messages are fast, but older messages
are slower
• Store a cache of last N messages per user
• Use a capped array to age out older messages
• Create cache lazily when user accesses inbox
• Only write the message if cache exists.
• Use TTL collection to time out caches for inactive
users
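
The cache rules above can be sketched in memory like this. It is a toy Java model under assumptions, not driver code: a bounded deque stands in for the capped array, the fallback that rebuilds a cache is passed in as a supplier returning messages oldest first, and TTL expiry of idle caches is left out. InboxCacheSketch, CACHE_SIZE and the method names are hypothetical; the slides use a cache of 50 messages, and 3 is used here only to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Toy in-memory model of the "cached inbox" approach.
final class InboxCacheSketch {
    static final int CACHE_SIZE = 3; // last N messages kept per user

    private final Map<String, Deque<String>> caches = new HashMap<>();

    // Send: only write if the recipient already has a cache.
    // Returns true when the message was written.
    boolean deliver(String recipient, String message) {
        Deque<String> cache = caches.get(recipient);
        if (cache == null) return false;       // inactive user, skip the write
        cache.addLast(message);
        while (cache.size() > CACHE_SIZE) {
            cache.removeFirst();               // age out the oldest message
        }
        return true;
    }

    // Read: on a hit return the cache; on a miss build it lazily from the
    // fallback (for example a fan out on read), keeping the newest N messages.
    List<String> read(String owner, Supplier<List<String>> fallbackOldestFirst) {
        Deque<String> cache = caches.get(owner);
        if (cache == null) {
            List<String> recent = fallbackOldestFirst.get();
            int start = Math.max(0, recent.size() - CACHE_SIZE);
            cache = new ArrayDeque<>(recent.subList(start, recent.size()));
            caches.put(owner, cache);
        }
        return new ArrayList<>(cache);
    }

    public static void main(String[] args) {
        InboxCacheSketch app = new InboxCacheSketch();
        Supplier<List<String>> fallback = () -> Arrays.asList("m1", "m2", "m3", "m4");
        System.out.println(app.deliver("joe", "m0")); // no cache yet
        System.out.println(app.read("joe", fallback)); // cache miss, cache is built
        System.out.println(app.deliver("joe", "m5")); // cache exists now
        System.out.println(app.read("joe", fallback)); // cache hit
    }
}
```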

Cached Inbox
// Shard on "owner"
sh.shardCollection("myapp.caches", { "owner": 1 })
// Send a message: add it to the existing cache of each follower (no upsert)
db.caches.update({ owner: recipient }, { $push: { messages: {
    $each: [ msg ],
    $sort: { 'sent': 1 },
    $slice: -50 } } });
// Read my inbox
var cache = db.caches.findOne({ owner: "Joe" });
if (cache) {
    // cache document exists
    msgs = cache.messages;
} else {
    // fall back to "fan out on read" and cache the result
    // (followeesOf() is a hypothetical application helper returning the ids Joe follows)
    msgs = db.outbox.find({ sender: { $in: followeesOf("Joe") } }).sort({ sent: -1 }).limit(50).toArray();
    db.caches.save({ owner: "Joe", messages: msgs });
}

Cached Inbox – Send
[Diagram: Send Message]

Cached Inbox – Read
[Diagram: Read Inbox; steps 1 and 2; Cache Hit, Cache Miss]

Tradeoffs
Send Message Performance
• Fan out on Read: Best (single shard, single write)
• Fan out on Write: Good (shard per recipient, multiple writes)
• Bucketed Fan out on Write: Worst (shard per recipient, appends, documents grow)
• Inbox Cache: Mixed (depends on how many users are in cache)
Read Inbox Performance
• Fan out on Read: Worst (broadcast to all shards, random reads)
• Fan out on Write: Good (single shard, random reads)
• Bucketed Fan out on Write: Best (single shard, single read)
• Inbox Cache: Mixed (recent messages are fast, older messages are slow)
Data Size
• Fan out on Read: Best (message stored once)
• Fan out on Write: Worst (copy per recipient)
• Bucketed Fan out on Write: Worst (copy per recipient)
• Inbox Cache: Good (same as Fan out on Read + size of cache)

Things to consider
• Lots of recipients
– Fan out on write might become prohibitive
– Consider introducing a “Group”
– Make fan out asynchronous
• Very large message size
– Multiple copies of messages can be a burden
– Consider single copy of message with a “pointer” per inbox
• More writes than reads
– Fan out on read might be okay

Summary
• Multiple ways to model status updates
• Think about characteristics of your network
– Number of users
– Number of edges
– Publish frequency
– Access patterns
• Try to minimize random IO

Thank You
