#MongoDBdays #askAsya @asya999 
Building a Social Platform 
with MongoDB 
Asya Kamsky 
MongoDB Inc
Solutions Engineering 
• Identify Popular Use Cases 
– Directly from MongoDB Users 
– Addressing "limitations" 
• Go beyond documentation and blogs 
• Create open source project 
• Run it!
Social Status Feed
Socialite 
https://github.com/10gen-labs/socialite 
• Open Source 
• Reference Implementation 
– Various Fanout Feed Models 
– User Graph Implementation 
– Content storage 
• Configurable models and options 
• REST API in Dropwizard (Yammer) 
– https://dropwizard.github.io/dropwizard/ 
• Built-in benchmarking
Architecture 
[Diagram: REST API layer in front of the pluggable Content, Graph, and Feed services, each accessed through a service proxy]
Pluggable Services 
• Major components each have an interface 
– see com.mongodb.socialite.services 
• Configuration selects implementation to use 
• ServiceManager organizes : 
– Default implementations 
– Lifecycle 
– Binding configuration 
– Wiring dependencies 
– see com.mongodb.socialite.ServiceManager
Simple Interface 
https://github.com/10gen-labs/socialite 
GET /users/{user_id} Get a User by their ID 
DELETE /users/{user_id} Remove a user by their ID 
POST /users/{user_id}/posts Send a message from this user 
GET /users/{user_id}/followers Get a list of followers of a user 
GET /users/{user_id}/followers_count Get the number of followers of a user 
GET /users/{user_id}/following Get the list of users this user is following 
GET /users/{user_id}/following_count Get the number of users this user follows 
GET /users/{user_id}/posts Get the messages sent by a user 
GET /users/{user_id}/timeline Get the timeline for this user 
PUT /users/{user_id} Create a new user 
PUT /users/{user_id}/following/{target} Follow a user 
DELETE /users/{user_id}/following/{target} Unfollow a user
Technical Decisions 
• Schema 
• Indexing 
• Horizontal Scaling 
• User timeline cache
Operational Testing 
Real-life validation of our choices. 
Most important criteria? 
User facing latency 
Linear scaling of resources
Scaling Goals 
• Realistic real-life-scale workload 
– compared to Twitter, etc. 
• Understanding of HW required 
– containing costs 
• Confirm architecture scales linearly 
– without loss of responsiveness
Architecture 
[Architecture diagram, as above]
Operational Testing 
• All hosts in AWS 
• Each service used its own DB, cluster or shards 
• All benchmarks through `mongos` (sharded config) 
• Used MMS monitoring for measuring throughput 
• Used internal benchmarks for measuring latency 
• Test volumes based on real-life social metrics
Scaling for Infinite Content
Architecture 
[Architecture diagram, as above]
Socialite Content Service 
• System of record for all user content 
• Initially very simple (no search) 
• Mainly designed to support feed 
– Lookup/indexed by _id and userid 
– Time based anchors/pagination
Social Data Ages Fast 
• Half-life of most content is 1 day! 
• Popular content usually < 1 month 
• Access to old data is rare
Content Service 
• Index by userId, _id 
• Shard by userId (or userId, _id) 
• Supports “user data” as pass-through 
{ 
"_id" : ObjectId("52aaaa14a0ee0d44323e623a"), 
"_a" : "user1", 
"_m" : "this is a post”, 
"_d" : { 
"geohash" : "6gkzwgjzn820" 
} 
}
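A minimal sketch of that access pattern in the shell (the collection name "content", page size, and index are assumptions; "_a" is the author field from the sample document above):

> // supports per-author, newest-first reads
> db.content.createIndex({"_a": 1, "_id": -1})
> // first page of a user's posts
> db.content.find({"_a": "user1"}).sort({"_id": -1}).limit(50)
> // next page, anchored on lastSeenId, the _id of the last post already returned
> // (ObjectIds are roughly time-ordered, so this gives time-based pagination)
> db.content.find({"_a": "user1", "_id": {"$lt": lastSeenId}}).sort({"_id": -1}).limit(50)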
Benchmarks
Architecture 
[Architecture diagram, as above]
Graph Data - Social 
[Diagram: users John, Kate, Bob, and Pete connected by "follows" edges; which user should be recommended?]
Graph Data - Promotional 
[Diagram: the same "follows" graph with "mention" edges pointing at a brand, Acme Soda; which promotion should be recommended?]
Graph Data - Everywhere 
• Retail 
  – Complex product catalogues 
  – Product recommendation engines 
• Manufacturing and Logistics 
  – Tracing failures to faulty component batches 
  – Determining fallout from supply interruption 
• Healthcare 
  – Patient/Physician interactions
Design Considerations
The Tale of Two Biebers 
VS
Follower Churn 
• Tempting to focus on scaling content 
• Follow requests rival message send rates 
• Twitter enforces per-day follow limits
Edge Metadata 
• Models – friends/followers 
• Requirements typically start simple 
• Add Groups, Favorites, Relationships
Storing Graphs in MongoDB
Option One – Embedding Edges
Embedded Edge Arrays 
• Storing connections with the user (popular choice) 
  – Most compact form 
  – Efficient for reads 
• However…. 
  – User documents grow 
  – Upper limit on degree (document size) 
  – Difficult to annotate (and index) an edge 
{ 
"_id" : "djw", 
"fullname" : "Darren Wood", 
"country" : "Australia", 
"followers" : [ "jsr", "ian"], 
"following" : [ "jsr", "pete"] 
}
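For illustration only (the collection name "users" is an assumption), adding and reading an edge in this model is a single-document operation, but every new follower grows the user document in place:

> // record that "bob" now follows "djw"
> db.users.update({"_id": "djw"}, {"$addToSet": {"followers": "bob"}})
> db.users.findOne({"_id": "djw"}, {"followers": 1, "following": 1})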
Embedded Edge Arrays 
• Creating Rich Graph Information 
– Can become cumbersome 
{ 
"_id" : "djw", 
"fullname" : "Darren Wood", 
"country" : "Australia", 
"friends" : [ 
{"uid" : "jsr", "grp" : "school"}, 
{"uid" : "ian", "grp" : "work"} ] 
} 
{ 
"_id" : "djw", 
"fullname" : "Darren Wood", 
"country" : "Australia", 
"friends" : [ "jsr", "ian"], 
"group" : [ ”school", ”work"] 
}
Option Two – Edge Collection
Edge Collections 
• Document per edge 
> db.followers.findOne() 
{ 
"_id" : ObjectId(…), 
"from" : "djw", 
"to" : "jsr" 
} 
• Very flexible for adding edge data 
> db.friends.findOne() 
{ 
"_id" : ObjectId(…), 
"from" : "djw", 
"to" : "jsr", 
"grp" : "work", 
"ts" : Date("2013-07-10") 
}
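A sketch of the corresponding write path, using the edge direction from the "Finding Followers" slide below, where {"from": X, "to": Y} records that Y follows X (the unique compound index shown later prevents duplicate edges):

> db.followers.createIndex({"from": 1, "to": 1}, {"unique": true})
> // "jsr" starts following "djw": a single fixed-size insert, no document growth
> db.followers.insert({"from": "djw", "to": "jsr"})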
Operational comparison 
• Updates of embedded arrays 
  – grow non-linearly with the number of indexed array elements 
• Updates to an edge collection are inserts 
  – scale close to linearly with the existing number of edges per user
Edge Insert Rate
Edge Collection 
Indexing Strategies
Finding Followers 
Consider our single follower collection : 
> db.followers.find({from : "djw"}, {_id:0, to:1}) 
{ 
"to" : "jsr" 
} 
Using index : 
{ 
"v" : 1, 
"key" : { "from" : 1, "to" : 1 }, 
"unique" : true, 
"ns" : "socialite.followers", 
"name" : "from_1_to_1" 
} 
Covered index when searching on "from" for all of a user's followers. 
Specify "unique" only if duplicate edges cannot exist.
Finding Following 
What about who a user is following? 
Can use a reverse covered index : 
{ 
"v" : 1, 
"key" : { "from" : 1, "to" : 1 }, 
"unique" : true, 
"ns" : "socialite.followers", 
"name" : "from_1_to_1" 
} 
{ 
"v" : 1, 
"key" : { "to" : 1, "from" : 1 }, 
"unique" : true, 
"ns" : "socialite.followers", 
"name" : "to_1_from_1" 
} 
Notice the flipped field order here
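In shell syntax, the reverse index and the query it covers look like this (a sketch; "jsr" is illustrative):

> db.followers.createIndex({"to": 1, "from": 1}, {"unique": true})
> // who is "jsr" following? covered by the (to, from) index
> db.followers.find({"to": "jsr"}, {"_id": 0, "from": 1})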
Finding Following 
Wait! There is an issue with the reverse index… 
SHARDING! 
{ 
"v" : 1, 
"key" : { "from" : 1, "to" : 1 }, 
"unique" : true, 
"ns" : "socialite.followers", 
"name" : "from_1_to_1" 
} 
{ 
"v" : 1, 
"key" : { "to" : 1, "from" : 1 }, 
"unique" : true, 
"ns" : "socialite.followers", 
"name" : "to_1_from_1" 
} 
If we shard this collection by "from", looking up the followers of a specific user is "targeted" to a single shard. 
To find who the user is following, however, the query must scatter-gather across all shards.
Dual Edge Collections
Dual Edge Collections 
When "following" queries are common 
– Not always the case 
– Consider overhead carefully 
Can use dual collections storing 
– One for each direction 
– Edges are duplicated reversed 
– Can be sharded independently
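A sketch of the write path when "jsr" follows "djw" (collection and field names for the reverse collection are assumptions; in Socialite this sits behind the user graph service interface):

> // keyed by the followed user; shard on "from" so follower lookups stay targeted
> db.followers.insert({"from": "djw", "to": "jsr"})
> // the same edge reversed, keyed by the follower; shard on "from" so "following" lookups stay targeted
> db.following.insert({"from": "jsr", "to": "djw"})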
Edge Query Rate Comparison 
Number of shards vs number of queries: 

Number of   Followers collection with        Two collections (followers, following), 
shards      forward and reverse indexes      one index each 
   1                  10,000                             10,000 
   3                  90,000                             30,000 
   6                 360,000                             60,000 
  12               1,440,000                            120,000 

The single-collection numbers grow quadratically because every "following" query must scatter-gather across all shards, while the dual-collection numbers grow linearly because each query is targeted to a single shard.
Architecture 
[Architecture diagram, as above]
Feed Service 
• Two main functions : 
– Aggregating “followed” content for a user 
– Forwarding user’s content to “followers” 
• Common implementation models : 
– Fanout on read 
• Query content of all followed users on the fly 
– Fanout on write 
• Add to “cache” of each user’s timeline for every post 
• Various storage models for the timeline
Fanout On Read
Fanout On Read 
Pros 
  – Simple implementation 
  – No extra storage for timelines 
Cons 
  – Timeline reads (typically) hit all shards 
  – Often involves reading more data than required 
  – May require additional indexing on Content
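A minimal fanout-on-read sketch using the collections shown earlier (the limit and variable names are illustrative; in the single edge collection, "from" is the followed user):

> // who does "jsr" follow?
> var followed = db.followers.find({"to": "jsr"}, {"_id": 0, "from": 1}).map(function(e) { return e.from; })
> // pull their newest posts at read time
> db.content.find({"_a": {"$in": followed}}).sort({"_id": -1}).limit(50)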
Fanout On Write
Fanout On Write 
Pros 
  – Timeline can be a single document read 
  – Dormant users easily excluded 
  – Working set minimized 
Cons 
  – Fanout for large follower lists can be expensive 
  – Additional storage for materialized timelines
Fanout On Write 
• Three different approaches 
– Time buckets 
– Size buckets 
– Cache 
• Each has different pros & cons
Timeline Buckets - Time 
Upsert to time range buckets for each user 
> db.timed_buckets.find().pretty() 
{ 
"_id" : {"_u" : "jsr", "_t" : 516935}, 
"_c" : [ 
{"_id" : ObjectId("...dc1"), "_a" : "djw", "_m" : "message from daz"}, 
{"_id" : ObjectId("...dd2"), "_a" : "ian", "_m" : "message from ian"} 
] 
} 
{ 
"_id" : {"_u" : "ian", "_t" : 516935}, 
"_c" : [ 
{"_id" : ObjectId("...dc1"), "_a" : "djw", "_m" : "message from daz"} 
] 
} 
{ 
"_id" : {"_u" : "jsr", "_t" : 516934 }, 
"_c" : [ 
{"_id" : ObjectId("...da7"), "_a" : "ian", "_m" : "earlier from ian"} 
] 
}
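The upsert behind this layout might look like the following sketch (newPost is the post summary being fanned out; bucket is a coarse time-bucket value derived from the post timestamp):

> db.timed_buckets.update(
      {"_id": {"_u": "jsr", "_t": bucket}},   // compound _id: user + time bucket
      {"$push": {"_c": newPost}},             // append to that bucket's content array
      {"upsert": true})                       // create the bucket if it does not exist yet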
Timeline Buckets - Size 
More complex, but more consistently sized 
> db.sized_buckets.find().pretty() 
{ 
"_id" : ObjectId("...122"), 
"_c" : [ 
{"_id" : ObjectId("...dc1"), "_a" : "djw", "_m" : "message from daz"}, 
{"_id" : ObjectId("...dd2"), "_a" : "ian", "_m" : "message from ian"}, 
{"_id" : ObjectId("...da7"), "_a" : "ian", "_m" : "earlier from ian"} 
], 
"_s" : 3, 
"_u" : "jsr" 
} 
{ 
"_id" : ObjectId("...011"), 
"_c" : [ 
{"_id" : ObjectId("...dc1"), "_a" : "djw", "_m" : "message from daz"} 
], 
"_s" : 1, 
"_u" : "ian" 
}
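One way to express size-bucketing as a single upsert (the bucket capacity of 50 and the collection name are assumptions; a real implementation also has to handle races between concurrent writers):

> db.sized_buckets.update(
      {"_u": "jsr", "_s": {"$lt": 50}},               // find a bucket for this user with room left
      {"$push": {"_c": newPost}, "$inc": {"_s": 1}},  // append the post and bump the size counter
      {"upsert": true})                               // no matching bucket: start a new one with _s = 1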
Timeline - Cache 
Store a limited cache, fall back to "fanout on read" 
– Create single cache doc on demand with upsert 
– Limit size of cache with $slice 
– Timeout docs with TTL for inactive users 
> db.timeline_cache.find().pretty() 
{ 
"_c" : [ 
{"_id" : ObjectId("...dc1"), "_a" : "djw", "_m" : "message from daz"}, 
{"_id" : ObjectId("...dd2"), "_a" : "ian", "_m" : "message from ian"}, 
{"_id" : ObjectId("...da7"), "_a" : "ian", "_m" : "earlier from ian"} 
], 
"_u" : "jsr" 
} 
{ 
"_c" : [ 
{"_id" : ObjectId("...dc1"), "_a" : "djw", "_m" : "message from daz"} 
], 
"_u" : "ian" 
}
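A sketch of the three points above (the cache size of 50, the 30-day TTL, and the last-activity field "_d" are assumptions, not Socialite's actual field names):

> // append the new post, keep only the newest 50 entries, create the cache doc if missing
> db.timeline_cache.update(
      {"_u": "jsr"},
      {"$push": {"_c": {"$each": [newPost], "$slice": -50}}, "$set": {"_d": new Date()}},
      {"upsert": true})
> // expire caches of users inactive for ~30 days; they are rebuilt on demand via fanout on read
> db.timeline_cache.createIndex({"_d": 1}, {"expireAfterSeconds": 2592000})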
Embedding vs Linking Content 
Embedded content for direct access 
– Great when it is small, predictable in size 
Link to content, store only metadata 
– Read only desired content on demand 
– Further stabilizes cache document sizes 
> db.timeline_cache.findOne({"_id" : "jsr"}) 
{ 
 "_c" : [ 
 {"_id" : ObjectId("...dc1")}, 
 {"_id" : ObjectId("...dd2")}, 
 {"_id" : ObjectId("...da7")} 
 ], 
 "_id" : "jsr" 
}
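With references only, reading the timeline becomes two round trips, the trade-off for smaller, stable cache documents (a sketch; order can be restored client-side, since $in does not preserve it):

> var cached = db.timeline_cache.findOne({"_id": "jsr"})
> var ids = cached._c.map(function(c) { return c._id; })
> // hydrate the referenced posts from the content collection
> db.content.find({"_id": {"$in": ids}})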
Socialite Feed Service 
• Implemented four models as plugins 
– FanoutOnRead 
– FanoutOnWrite – Buckets (size) 
– FanoutOnWrite – Buckets (time) 
– FanoutOnWrite - Cache 
• Switchable by config 
• Store content by reference or value 
• Benchmark-able back to back
Benchmark by feed type
Benchmarking the Feed 
• Biggest challenge: scaling the feed 
• High cost of "fanout on write" 
• A popular user's post => # of operations: 
  – Content collection insert: 1 
  – Timeline cache: on average, 130+ cache document updates 
• SCATTER-GATHER (the slowest shard determines latency)
Benchmarking the Feed 
• Timeline is different from content! 
– "It's a Cache" 
IT CAN BE REBUILT!
Benchmarking the Feed 
IT CAN BE REBUILT!
Benchmarking the Feed 
• Results 
  – over two weeks 
  – ran load with one million users, then with ten million users 
  – average send rates of 1K/s and 2K/s; read rates of 10K-20K/s 
  – 22 AWS c3.2xlarge servers (7.5GB RAM) 
    – 18 across six shards (3 content, 3 user graph) 
    – 4 mongos and app machines 
  – 2 c2x4xlarge servers (30GB RAM) 
    – timeline feed cache (six shards)
Summary
Socialite 
https://github.com/10gen-labs/socialite 
• Real Working Implementation 
– Implements All Components 
– Configurable models and options 
• Built-in benchmarking 
• Questions? 
– I will be at "Ask The Experts" this afternoon! 
https://github.com/10gen-labs/socialite
Thank You! 
https://github.com/10gen-labs/socialite

