#MongoDBdays #askAsya @asya999 
Building a Social Platform 
with MongoDB 
Asya Kamsky 
MongoDB Inc
Solutions Engineering 
• Identify Popular Use Cases 
– Directly from MongoDB Users 
– Addressing "limitations" 
• Go beyond documentation and blogs 
• Create open source project 
• Run it!
Social Status Feed
Socialite 
https://github.com/10gen-labs/socialite 
• Open Source 
• Reference Implementation 
– Various Fanout Feed Models 
– User Graph Implementation 
– Content storage 
• Configurable models and options 
• REST API in Dropwizard (Yammer) 
– https://dropwizard.github.io/dropwizard/ 
• Built-in benchmarking
Architecture 
[Diagram: REST API layer in front of the pluggable Content, Graph, and Feed services, each accessed through a service proxy]
Pluggable Services 
• Major components each have an interface 
– see com.mongodb.socialite.services 
• Configuration selects implementation to use 
• ServiceManager organizes : 
– Default implementations 
– Lifecycle 
– Binding configuration 
– Wiring dependencies 
– see com.mongodb.socialite.ServiceManager
Simple Interface 
https://github.com/10gen-labs/socialite 
GET /users/{user_id} Get a User by their ID 
DELETE /users/{user_id} Remove a user by their ID 
POST /users/{user_id}/posts Send a message from this user 
GET /users/{user_id}/followers Get a list of followers of a user 
GET /users/{user_id}/followers_count Get the number of followers of a user 
GET /users/{user_id}/following Get the list of users this user is following 
GET /users/{user_id}/following_count Get the number of users this user follows 
GET /users/{user_id}/posts Get the messages sent by a user 
GET /users/{user_id}/timeline Get the timeline for this user 
PUT /users/{user_id} Create a new user 
PUT /users/{user_id}/following/{target} Follow a user 
DELETE /users/{user_id}/following/{target} Unfollow a user
Technical Decisions 
• Schema 
• Indexing 
• Horizontal Scaling 
• User timeline cache
Operational Testing 
Real-life validation of our choices. 
Most important criteria? 
User facing latency 
Linear scaling of resources
Scaling Goals 
• Realistic real-life-scale workload 
– compared to Twitter, etc. 
• Understanding of HW required 
– containing costs 
• Confirm architecture scales linearly 
– without loss of responsiveness
Architecture 
[Architecture diagram, as above]
Operational Testing 
• All hosts in AWS 
• Each service used its own DB, cluster or shards 
• All benchmarks through `mongos` (sharded config) 
• Used MMS monitoring for measuring throughput 
• Used internal benchmarks for measuring latency 
• Test volumes based on real-life social metrics
Scaling for Infinite Content
Architecture 
[Architecture diagram, as above]
Socialite Content Service 
• System of record for all user content 
• Initially very simple (no search) 
• Mainly designed to support feed 
– Lookup/indexed by _id and userid 
– Time based anchors/pagination
Social Data Ages Fast 
• Half-life of most content is 1 day! 
• Popular content usually < 1 month 
• Access to old data is rare
Content Service 
• Index by userId, _id 
• Shard by userId (or userId, _id) 
• Supports “user data” as pass-through 
{ 
"_id" : ObjectId("52aaaa14a0ee0d44323e623a"), 
"_a" : "user1", 
"_m" : "this is a post”, 
"_d" : { 
"geohash" : "6gkzwgjzn820" 
} 
}
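A minimal sketch of that access pattern in the shell (the collection name "content", page size, and index are assumptions; "_a" is the author field from the sample document above):

> // supports per-author, newest-first reads
> db.content.createIndex({"_a": 1, "_id": -1})
> // first page of a user's posts
> db.content.find({"_a": "user1"}).sort({"_id": -1}).limit(50)
> // next page, anchored on lastSeenId, the _id of the last post already returned
> // (ObjectIds are roughly time-ordered, so this gives time-based pagination)
> db.content.find({"_a": "user1", "_id": {"$lt": lastSeenId}}).sort({"_id": -1}).limit(50)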
Benchmarks
Architecture 
[Architecture diagram, as above]
Graph Data - Social 
[Diagram: users John, Kate, Bob, and Pete connected by "follows" edges; which user should be recommended?]
Graph Data - Promotional 
[Diagram: the same "follows" graph with "mention" edges pointing at a brand, Acme Soda; which promotion should be recommended?]
Graph Data - Everywhere 
• Retail 
  – Complex product catalogues 
  – Product recommendation engines 
• Manufacturing and Logistics 
  – Tracing failures to faulty component batches 
  – Determining fallout from supply interruption 
• Healthcare 
  – Patient/Physician interactions
Design Considerations
The Tale of Two Biebers 
VS
Follower Churn 
• Tempting to focus on scaling content 
• Follow requests rival message send rates 
• Twitter enforces per-day follow limits
Edge Metadata 
• Models – friends/followers 
• Requirements typically start simple 
• Add Groups, Favorites, Relationships
Storing Graphs in MongoDB
Option One – Embedding Edges
Embedded Edge Arrays 
• Storing connections with the user (popular choice) 
  – Most compact form 
  – Efficient for reads 
• However…. 
  – User documents grow 
  – Upper limit on degree (document size) 
  – Difficult to annotate (and index) an edge 
{ 
"_id" : "djw", 
"fullname" : "Darren Wood", 
"country" : "Australia", 
"followers" : [ "jsr", "ian"], 
"following" : [ "jsr", "pete"] 
}
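For illustration only (the collection name "users" is an assumption), adding and reading an edge in this model is a single-document operation, but every new follower grows the user document in place:

> // record that "bob" now follows "djw"
> db.users.update({"_id": "djw"}, {"$addToSet": {"followers": "bob"}})
> db.users.findOne({"_id": "djw"}, {"followers": 1, "following": 1})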
Embedded Edge Arrays 
• Creating Rich Graph Information 
– Can become cumbersome 
{ 
"_id" : "djw", 
"fullname" : "Darren Wood", 
"country" : "Australia", 
"friends" : [ 
{"uid" : "jsr", "grp" : "school"}, 
{"uid" : "ian", "grp" : "work"} ] 
} 
{ 
"_id" : "djw", 
"fullname" : "Darren Wood", 
"country" : "Australia", 
"friends" : [ "jsr", "ian"], 
"group" : [ ”school", ”work"] 
}
Option Two – Edge Collection
Edge Collections 
• Document per edge 
> db.followers.findOne() 
{ 
"_id" : ObjectId(…), 
"from" : "djw", 
"to" : "jsr" 
} 
• Very flexible for adding edge data 
> db.friends.findOne() 
{ 
"_id" : ObjectId(…), 
"from" : "djw", 
"to" : "jsr", 
"grp" : "work", 
"ts" : Date("2013-07-10") 
}
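A sketch of the corresponding write path, using the edge direction from the "Finding Followers" slide below, where {"from": X, "to": Y} records that Y follows X (the unique compound index shown later prevents duplicate edges):

> db.followers.createIndex({"from": 1, "to": 1}, {"unique": true})
> // "jsr" starts following "djw": a single fixed-size insert, no document growth
> db.followers.insert({"from": "djw", "to": "jsr"})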
Operational comparison 
• Updates of embedded arrays 
  – grow non-linearly with the number of indexed array elements 
• Updates to an edge collection are inserts 
  – scale close to linearly with the existing number of edges per user
Edge Insert Rate
Edge Collection 
Indexing Strategies
Finding Followers 
Consider our single follower collection : 
> db.followers.find({from : "djw"}, {_id:0, to:1}) 
{ 
"to" : "jsr" 
} 
Using index : 
{ 
"v" : 1, 
"key" : { "from" : 1, "to" : 1 }, 
"unique" : true, 
"ns" : "socialite.followers", 
"name" : "from_1_to_1" 
} 
Covered index when searching on "from" for all of a user's followers. 
Specify "unique" only if duplicate edges cannot exist.
Finding Following 
What about who a user is following? 
Can use a reverse covered index : 
{ 
"v" : 1, 
"key" : { "from" : 1, "to" : 1 }, 
"unique" : true, 
"ns" : "socialite.followers", 
"name" : "from_1_to_1" 
} 
{ 
"v" : 1, 
"key" : { "to" : 1, "from" : 1 }, 
"unique" : true, 
"ns" : "socialite.followers", 
"name" : "to_1_from_1" 
} 
Notice the flipped field order here
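In shell syntax, the reverse index and the query it covers look like this (a sketch; "jsr" is illustrative):

> db.followers.createIndex({"to": 1, "from": 1}, {"unique": true})
> // who is "jsr" following? covered by the (to, from) index
> db.followers.find({"to": "jsr"}, {"_id": 0, "from": 1})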
Finding Following 
Wait! There is an issue with the reverse index… 
SHARDING! 
{ 
"v" : 1, 
"key" : { "from" : 1, "to" : 1 }, 
"unique" : true, 
"ns" : "socialite.followers", 
"name" : "from_1_to_1" 
} 
{ 
"v" : 1, 
"key" : { "to" : 1, "from" : 1 }, 
"unique" : true, 
"ns" : "socialite.followers", 
"name" : "to_1_from_1" 
} 
If we shard this collection by "from", looking up the followers of a specific user is "targeted" to a single shard. 
To find who the user is following, however, the query must scatter-gather across all shards.
Dual Edge Collections
Dual Edge Collections 
When "following" queries are common 
– Not always the case 
– Consider overhead carefully 
Can use dual collections storing 
– One for each direction 
– Edges are duplicated reversed 
– Can be sharded independently
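A sketch of the write path when "jsr" follows "djw" (collection and field names for the reverse collection are assumptions; in Socialite this sits behind the user graph service interface):

> // keyed by the followed user; shard on "from" so follower lookups stay targeted
> db.followers.insert({"from": "djw", "to": "jsr"})
> // the same edge reversed, keyed by the follower; shard on "from" so "following" lookups stay targeted
> db.following.insert({"from": "jsr", "to": "djw"})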
Edge Query Rate Comparison 
Number of shards vs number of queries: 

Number of   Followers collection with        Two collections (followers, following), 
shards      forward and reverse indexes      one index each 
   1                  10,000                             10,000 
   3                  90,000                             30,000 
   6                 360,000                             60,000 
  12               1,440,000                            120,000 

The single-collection numbers grow quadratically because every "following" query must scatter-gather across all shards, while the dual-collection numbers grow linearly because each query is targeted to a single shard.
Architecture 
[Architecture diagram, as above]
Feed Service 
• Two main functions : 
– Aggregating “followed” content for a user 
– Forwarding user’s content to “followers” 
• Common implementation models : 
– Fanout on read 
• Query content of all followed users on the fly 
– Fanout on write 
• Add to “cache” of each user’s timeline for every post 
• Various storage models for the timeline
Fanout On Read
Fanout On Read 
Pros 
  – Simple implementation 
  – No extra storage for timelines 
Cons 
  – Timeline reads (typically) hit all shards 
  – Often involves reading more data than required 
  – May require additional indexing on Content
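A minimal fanout-on-read sketch using the collections shown earlier (the limit and variable names are illustrative; in the single edge collection, "from" is the followed user):

> // who does "jsr" follow?
> var followed = db.followers.find({"to": "jsr"}, {"_id": 0, "from": 1}).map(function(e) { return e.from; })
> // pull their newest posts at read time
> db.content.find({"_a": {"$in": followed}}).sort({"_id": -1}).limit(50)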
Fanout On Write
Fanout On Write 
Pros 
  – Timeline can be a single document read 
  – Dormant users easily excluded 
  – Working set minimized 
Cons 
  – Fanout for large follower lists can be expensive 
  – Additional storage for materialized timelines
Fanout On Write 
• Three different approaches 
– Time buckets 
– Size buckets 
– Cache 
• Each has different pros & cons
Timeline Buckets - Time 
Upsert to time range buckets for each user 
> db.timed_buckets.find().pretty() 
{ 
"_id" : {"_u" : "jsr", "_t" : 516935}, 
"_c" : [ 
{"_id" : ObjectId("...dc1"), "_a" : "djw", "_m" : "message from daz"}, 
{"_id" : ObjectId("...dd2"), "_a" : "ian", "_m" : "message from ian"} 
] 
} 
{ 
"_id" : {"_u" : "ian", "_t" : 516935}, 
"_c" : [ 
{"_id" : ObjectId("...dc1"), "_a" : "djw", "_m" : "message from daz"} 
] 
} 
{ 
"_id" : {"_u" : "jsr", "_t" : 516934 }, 
"_c" : [ 
{"_id" : ObjectId("...da7"), "_a" : "ian", "_m" : "earlier from ian"} 
] 
}
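The upsert behind this layout might look like the following sketch (newPost is the post summary being fanned out; bucket is a coarse time-bucket value derived from the post timestamp):

> db.timed_buckets.update(
      {"_id": {"_u": "jsr", "_t": bucket}},   // compound _id: user + time bucket
      {"$push": {"_c": newPost}},             // append to that bucket's content array
      {"upsert": true})                       // create the bucket if it does not exist yet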
Timeline Buckets - Size 
More complex, but more consistently sized 
> db.sized_buckets.find().pretty() 
{ 
"_id" : ObjectId("...122"), 
"_c" : [ 
{"_id" : ObjectId("...dc1"), "_a" : "djw", "_m" : "message from daz"}, 
{"_id" : ObjectId("...dd2"), "_a" : "ian", "_m" : "message from ian"}, 
{"_id" : ObjectId("...da7"), "_a" : "ian", "_m" : "earlier from ian"} 
], 
"_s" : 3, 
"_u" : "jsr" 
} 
{ 
"_id" : ObjectId("...011"), 
"_c" : [ 
{"_id" : ObjectId("...dc1"), "_a" : "djw", "_m" : "message from daz"} 
], 
"_s" : 1, 
"_u" : "ian" 
}
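One way to express size-bucketing as a single upsert (the bucket capacity of 50 and the collection name are assumptions; a real implementation also has to handle races between concurrent writers):

> db.sized_buckets.update(
      {"_u": "jsr", "_s": {"$lt": 50}},               // find a bucket for this user with room left
      {"$push": {"_c": newPost}, "$inc": {"_s": 1}},  // append the post and bump the size counter
      {"upsert": true})                               // no matching bucket: start a new one with _s = 1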
Timeline - Cache 
Store a limited cache, fall back to "fanout on read" 
– Create single cache doc on demand with upsert 
– Limit size of cache with $slice 
– Timeout docs with TTL for inactive users 
> db.timeline_cache.find().pretty() 
{ 
"_c" : [ 
{"_id" : ObjectId("...dc1"), "_a" : "djw", "_m" : "message from daz"}, 
{"_id" : ObjectId("...dd2"), "_a" : "ian", "_m" : "message from ian"}, 
{"_id" : ObjectId("...da7"), "_a" : "ian", "_m" : "earlier from ian"} 
], 
"_u" : "jsr" 
} 
{ 
"_c" : [ 
{"_id" : ObjectId("...dc1"), "_a" : "djw", "_m" : "message from daz"} 
], 
"_u" : "ian" 
}
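A sketch of the three points above (the cache size of 50, the 30-day TTL, and the last-activity field "_d" are assumptions, not Socialite's actual field names):

> // append the new post, keep only the newest 50 entries, create the cache doc if missing
> db.timeline_cache.update(
      {"_u": "jsr"},
      {"$push": {"_c": {"$each": [newPost], "$slice": -50}}, "$set": {"_d": new Date()}},
      {"upsert": true})
> // expire caches of users inactive for ~30 days; they are rebuilt on demand via fanout on read
> db.timeline_cache.createIndex({"_d": 1}, {"expireAfterSeconds": 2592000})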
Embedding vs Linking Content 
Embedded content for direct access 
– Great when it is small, predictable in size 
Link to content, store only metadata 
– Read only desired content on demand 
– Further stabilizes cache document sizes 
> db.timeline_cache.findOne({"_id" : "jsr"}) 
{ 
 "_c" : [ 
 {"_id" : ObjectId("...dc1")}, 
 {"_id" : ObjectId("...dd2")}, 
 {"_id" : ObjectId("...da7")} 
 ], 
 "_id" : "jsr" 
}
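With references only, reading the timeline becomes two round trips, the trade-off for smaller, stable cache documents (a sketch; order can be restored client-side, since $in does not preserve it):

> var cached = db.timeline_cache.findOne({"_id": "jsr"})
> var ids = cached._c.map(function(c) { return c._id; })
> // hydrate the referenced posts from the content collection
> db.content.find({"_id": {"$in": ids}})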
Socialite Feed Service 
• Implemented four models as plugins 
– FanoutOnRead 
– FanoutOnWrite – Buckets (size) 
– FanoutOnWrite – Buckets (time) 
– FanoutOnWrite - Cache 
• Switchable by config 
• Store content by reference or value 
• Benchmark-able back to back
Benchmark by feed type
Benchmarking the Feed 
• Biggest challenge: scaling the feed 
• High cost of "fanout on write" 
• A popular user's post => # of operations: 
  – Content collection insert: 1 
  – Timeline cache: on average, 130+ cache document updates 
• SCATTER-GATHER (the slowest shard determines latency)
Benchmarking the Feed 
• Timeline is different from content! 
– "It's a Cache" 
IT CAN BE REBUILT!
Benchmarking the Feed 
IT CAN BE REBUILT!
Benchmarking the Feed 
• Results 
  – over two weeks 
  – ran load with one million users, then with ten million users 
  – average send rates of 1K/s and 2K/s; read rates of 10K-20K/s 
  – 22 AWS c3.2xlarge servers (7.5GB RAM) 
    – 18 across six shards (3 content, 3 user graph) 
    – 4 mongos and app machines 
  – 2 c2x4xlarge servers (30GB RAM) 
    – timeline feed cache (six shards)
Summary
Socialite 
https://github.com/10gen-labs/socialite 
• Real Working Implementation 
– Implements All Components 
– Configurable models and options 
• Built-in benchmarking 
• Questions? 
– I will be at "Ask The Experts" this afternoon! 
https://github.com/10gen-labs/socialite
Thank You! 
https://github.com/10gen-labs/socialite

