MongoDB: How We Did It – Reanimating Identity at AOL
Topics 
• Motivation 
• Challenges 
• Approach 
• MongoDB Testing 
• Deployment 
• Collections 
• Problem/Solution 
• Lessons Learned 
• Going Forward
Motivation
Motivation 
• Cluttered data 
• Ambiguous data 
• Functionally shared data 
• Immutable data model
Challenges
Challenges 
• Leaving behind a fault-tolerant (NonStop) platform and its transactional integrity 
• Merging/extricating Identity data 
• Scaling to handle consolidated traffic 
• Continuing to support legacy systems
Approach
Approach 
• Document-based data model – use MongoDB 
• Migrate data 
• Build adapter/interceptor layer 
• Production testing with no impacts
Approach 
• Audit of our setup with MongoDB 
• Tune MongoDB settings, including the driver, to optimize performance 
• Leverage eventual consistency to overcome the loss of transactional integrity 
• Switch Identity to the new data model on MongoDB
Migration
Migration 
• Adapters support four stages (see the sketch after this list): 
1. Read/write legacy 
2. Read/write legacy, write MongoDB (shadow read MongoDB) 
3. Read/write MongoDB, write legacy 
4. Read/write MongoDB
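A minimal sketch of what a stage 2 adapter might look like, assuming a hypothetical legacy DAO and the 2.x Java driver; names are illustrative, not AOL's actual adapter code.

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

// Hypothetical stage 2 adapter: the legacy store stays authoritative while
// MongoDB is shadow-written and shadow-read for comparison.
public class Stage2IdentityAdapter {
    private final LegacyIdentityStore legacy;   // assumed legacy DAO
    private final DBCollection userIdentity;    // MongoDB collection

    public Stage2IdentityAdapter(LegacyIdentityStore legacy, DBCollection userIdentity) {
        this.legacy = legacy;
        this.userIdentity = userIdentity;
    }

    public void write(String user, DBObject doc) {
        legacy.write(user, doc);                // authoritative write
        try {
            userIdentity.save(doc);             // shadow write, best effort
        } catch (RuntimeException e) {
            // stage 2: MongoDB failures must not affect the legacy path
        }
    }

    public DBObject read(String user) {
        DBObject result = legacy.read(user);    // serve from legacy
        try {
            DBObject shadow = userIdentity.findOne(new BasicDBObject("user", user));
            // 'shadow' could be compared with 'result' out of band to validate migration
        } catch (RuntimeException e) {
            // ignore: shadow read only
        }
        return result;
    }
}

// Minimal stand-in for the legacy data access layer.
interface LegacyIdentityStore {
    void write(String user, DBObject doc);
    DBObject read(String user);
}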
Stage 1 through Stage 4 (diagrams of the adapter read/write paths for each migration stage)
MongoDB Testing
Production Testing 
• “Chaos Monkey”-style testing of MongoDB 
• 4 million requests/minute (production load, roughly 99% reads) 
• Test primary failover (graceful step-down) 
• Kill the primary
Production Testing 
• Test secondary failure 
• Shut down all secondaries 
• Manually shut down the network interface on the primary 
• Performance benchmarking
Production Testing 
• Performance very good: shard-key reads ~2-3 ms 
• Scatter-gather reads ~12 ms 
• Writes good as well, ~3-20 ms 
• Failovers 4-5 minutes
MongoDB Healthcheck 
• Use dedicated machines for config servers 
• Place config servers in different data centers 
• Handle failover in the application: on a network exception, fall back to a secondary (sketch below) 
• Set lower TCP keepalive values (5 minutes)
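The keepalive value is an operating-system setting (for example net.ipv4.tcp_keepalive_time on Linux), not a driver option. A sketch of the application-side fallback, assuming the 2.x Java driver: on a network-related error the read is retried against a secondary.

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoException;
import com.mongodb.ReadPreference;

public class FailoverAwareReader {
    private final DBCollection collection;

    public FailoverAwareReader(DBCollection collection) {
        this.collection = collection;
    }

    // Read from the primary first; on an exception (typically a network error
    // during failover) retry against a secondary so reads keep being served.
    public DBObject findByGuid(String guid) {
        DBObject query = new BasicDBObject("_id", guid);
        try {
            return collection.findOne(query, null, ReadPreference.primary());
        } catch (MongoException e) {
            return collection.findOne(query, null, ReadPreference.secondaryPreferred());
        }
    }
}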
Deployment
Deployment 
• Version 2.4.9 
• All 75 mongods on separate switches 
• 2 x 12-core CPUs, 192 GB of RAM, and internal controller-based RAID 10 with ext4 file systems 
• Using the default chunk size (64 MB)
Deployment 
• Dedicated slaves for backup (configured as hidden members with priority 0); backups run during a 6-8am window 
• Enable powerOf2Sizes on collections to reduce fragmentation 
• Balancer restricted to a 4-6am daily window (sketch below)
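The powerOf2Sizes and balancer settings above can be applied as admin operations; a sketch using the 2.x Java driver, where the host, database, and collection names are assumptions (the hidden backup members are set separately via a replica set reconfiguration, not shown here).

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.MongoClient;

public class DeploymentSettings {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("mongos-host"); // illustrative host

        // Enable usePowerOf2Sizes on a collection to reduce fragmentation (MongoDB 2.x).
        DB identity = client.getDB("identity");              // illustrative database name
        identity.command(new BasicDBObject("collMod", "userIdentity")
                .append("usePowerOf2Sizes", true));

        // Restrict the balancer to a 4-6am window via the config database.
        DB config = client.getDB("config");
        config.getCollection("settings").update(
                new BasicDBObject("_id", "balancer"),
                new BasicDBObject("$set",
                        new BasicDBObject("activeWindow",
                                new BasicDBObject("start", "4:00").append("stop", "6:00"))),
                true,   // upsert
                false); // multi

        client.close();
    }
}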
Collections
Document Model 
• The entire data set must fit in memory to meet performance demands 
• Document field names are abbreviated, but descriptive 
• Don’t store default values; the legacy document is 80% defaults (sketch below) 
• Working hard to keep legacy artifacts out, but it’s always about trade-offs
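A small sketch of the "don't store defaults" rule, with illustrative field names and defaults: values equal to the default are simply never written into the document.

import com.mongodb.BasicDBObject;

public class ProfileDocumentBuilder {
    private static final String DEFAULT_LANG = "en_US"; // illustrative default

    // Abbreviated-but-descriptive field names; defaults are not written, which
    // keeps documents small enough for the working set to stay in memory.
    public static BasicDBObject build(String country, String lang) {
        BasicDBObject profile = new BasicDBObject();
        if (country != null) {
            profile.append("cc", country);
        }
        if (lang != null && !lang.equals(DEFAULT_LANG)) {
            profile.append("lang", lang);
        }
        return profile;
    }
}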
UserIdentity Collection 
• Core data model for Identity 
• Heterogeneous collection (some documents are “aliases”, pointers to the primary document) 
• Index on user+namespace 
• Shard key is guid (UUID Type 1, flipped: node then time); see the setup sketch after the sample document
UserIdentity 
{ 
  "_id": "baebc8bcc8e14f6e9bf70221d81711e2", 
  "user": "jdoe", 
  "ns": "aol", 
  ... 
  "profile": { 
    "cc": "US", 
    "firstNm": "John", 
    "lang": "en_US", 
    "lastNm": "Doe" 
  }, 
  "sysTime": ISODate("2014-05-03T04:43:49.899Z") 
}
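A sketch of how the user+namespace index and the guid shard key (held in _id in the sample above) might be declared with the 2.x Java driver; the host, database, and collection names are assumptions, and sharding is assumed to be already enabled for the database.

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class UserIdentitySetup {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("mongos-host"); // illustrative host

        // Secondary index supporting lookups by user within a namespace.
        DBCollection userIdentity = client.getDB("identity").getCollection("userIdentity");
        userIdentity.ensureIndex(new BasicDBObject("user", 1).append("ns", 1));

        // Shard on _id, which holds the flipped Type 1 UUID ("guid"), so point
        // reads by guid are routed to a single shard. Assumes sharding has
        // already been enabled for the 'identity' database.
        client.getDB("admin").command(
                new BasicDBObject("shardCollection", "identity.userIdentity")
                        .append("key", new BasicDBObject("_id", 1)));

        client.close();
    }
}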
Relationship Collection 
• Supports all cardinalities 
• Equivalent to an RDBMS intersection table (a guid on each end of the relationship) 
• Uses the eventually consistent framework for non-atomic writes 
• Shard key is parent+child+type (parent lookup is the primary use case)
Relationship Collection 
{ 
  "_id": "baa000163e5ff405b8083d5f164c11e3", 
  "child": "8a9e00237d617f08df7f1685527711e2", 
  "createTime": ISODate("2013-09-05T17:00:51.209Z"), 
  "modTime": ISODate("2013-09-05T17:00:51.209Z"), 
  "attributes": null, 
  "parent": "baebc8bcc8e14f6e9bf70221d81711e2", 
  "type": "CLASSROOM" 
}
Legacy Collection 
• Bridge collection to ease migration from the old data model to the new one 
• Near-image of the old data model, with some refactoring (3 tables folded into 1 document) 
• Once migration is complete, the plan is to drop this collection 
• Defaults not stored; 1-2 character field names
Legacy Collection 
{ 
  "_id": "jdoe", 
  "subData": { 
    "f": NumberLong(1018628731), 
    "g": "jdoe", 
    "d": false, 
    "e": NumberLong(1018628731), 
    "b": NumberLong(434077116), 
    "a": "JDoe", 
    "l": NumberLong("212200907100000000"), 
    "i": NumberLong(659952670) 
  }, 
  "guid": "baebc8bcc8e14f6e9bf70221d81711e2", 
  "st": ISODate("2013-06-24T20:13:16.627Z") 
}
Reservation Collection 
• Namespace protection 
• Uniqueness of user/namespace is enforced from the application side, because the UserIdentity shard key is guid (see the sketch after the sample document) 
• Shard key is username+namespace
Reservation Collection 
{ 
  "_id": "b13a00163e062d8ee9dc9eaf3e2411e1", 
  "createTime": ISODate("2012-01-13T20:26:46.111Z"), 
  "user": "jdoe", 
  "expires": ISODate("2012-01-13T21:26:46.111Z"), 
  "rsvId": "e9bddfe1-1c84-42c9-8f4c-1a7a96920ff4", 
  "data": { "k1": "v1", "k2": "v2" }, 
  "ns": "aol", 
  "type": "R" 
}
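Because UserIdentity is sharded on guid, a global unique index on user+namespace cannot live there; the Reservation collection, sharded on username+namespace, is what the application writes to first in order to claim a name. A minimal sketch, with illustrative names and a generic MongoException catch standing in for duplicate-key handling.

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoException;

import java.util.Date;
import java.util.UUID;

public class ReservationDao {
    private final DBCollection reservations;

    public ReservationDao(DBCollection reservations) {
        this.reservations = reservations;
        // Unique index on the shard key (user+ns) enforces one reservation per
        // user and namespace across the cluster.
        reservations.ensureIndex(
                new BasicDBObject("user", 1).append("ns", 1),
                new BasicDBObject("unique", true));
    }

    // Try to claim a username in a namespace; returns false if the write fails
    // (typically a duplicate key, meaning the name is already reserved).
    public boolean reserve(String user, String ns, Date expires) {
        BasicDBObject doc = new BasicDBObject("_id",
                UUID.randomUUID().toString().replace("-", "")) // illustrative id
                .append("user", user)
                .append("ns", ns)
                .append("createTime", new Date())
                .append("expires", expires)
                .append("type", "R");
        try {
            reservations.insert(doc);
            return true;
        } catch (MongoException e) {
            return false;
        }
    }
}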
Problems/Solutions
Problem 
Writes spanning multiple documents sometimes 
fail part way
Solution 
• Developed an eventually consistent framework, the “synchronizer” (sketch below) 
• Events are sent to the framework to validate, repair, or finish the write 
• Events are retried until they succeed or their TTL expires
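A heavily simplified sketch of the synchronizer idea: each multi-document write emits an event that is retried until the related documents are validated, repaired, or finished, or until its TTL expires. All names are illustrative, not AOL's framework.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical repair event for a write that spans multiple documents.
class SyncEvent {
    final String guid;          // identity the partial write belongs to
    final long expiresAtMillis; // give up after this TTL

    SyncEvent(String guid, long ttlMillis) {
        this.guid = guid;
        this.expiresAtMillis = System.currentTimeMillis() + ttlMillis;
    }
}

public class Synchronizer implements Runnable {
    private final BlockingQueue<SyncEvent> events = new LinkedBlockingQueue<SyncEvent>();

    public void submit(SyncEvent event) {
        events.offer(event);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                SyncEvent event = events.poll(1, TimeUnit.SECONDS);
                if (event == null) {
                    continue;
                }
                if (System.currentTimeMillis() > event.expiresAtMillis) {
                    continue; // TTL expired: stop retrying this event
                }
                if (!validateAndRepair(event)) {
                    events.offer(event); // not consistent yet: retry later
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Check the related documents and finish or repair the write; returns true
    // once they are consistent. Left abstract in this sketch.
    private boolean validateAndRepair(SyncEvent event) {
        return true;
    }
}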
Problem 
Scatter-gather queries are slower and take a 100% performance hit during failover
Solution 
• Use Memcached to map a non-shard-key value to the shard key (99% hit ratio for one mapping, 55% for the other; sketch below) 
• Use Memcached to cache potentially expensive intermediary results (88% hit ratio)
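A sketch of the cache-assisted routing: a non-shard-key lookup (user+ns here) is mapped to the guid shard key through a cache so most reads become targeted queries. The KeyValueCache interface is a stand-in for a Memcached client, not a specific library API.

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

// Stand-in for a Memcached client; a real deployment would put a Memcached
// library behind this interface.
interface KeyValueCache {
    String get(String key);
    void set(String key, String value, int ttlSeconds);
}

public class GuidResolver {
    private final DBCollection userIdentity;
    private final KeyValueCache cache;

    public GuidResolver(DBCollection userIdentity, KeyValueCache cache) {
        this.userIdentity = userIdentity;
        this.cache = cache;
    }

    // Resolve user+ns to the guid shard key, using the cache to avoid a
    // scatter-gather query on most requests.
    public String resolveGuid(String user, String ns) {
        String cacheKey = "guid:" + ns + ":" + user;
        String guid = cache.get(cacheKey);
        if (guid != null) {
            return guid; // cache hit: the follow-up read is a targeted query
        }
        // Cache miss: this query is not on the shard key, so it fans out.
        DBObject doc = userIdentity.findOne(
                new BasicDBObject("user", user).append("ns", ns));
        if (doc == null) {
            return null;
        }
        guid = (String) doc.get("_id");
        cache.set(cacheKey, guid, 3600); // illustrative one-hour TTL
        return guid;
    }
}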
Problem 
Querying lists of users required parallel processing for performance, increasing connection requirements
Solution 
Use the $in operator to query lists of users rather than looping through individual queries
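A sketch of the batched lookup with the 2.x Java driver: one $in query replaces a loop of per-user queries.

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;

import java.util.ArrayList;
import java.util.List;

public class BatchedUserLookup {
    // One query with $in instead of N parallel single-user queries, which keeps
    // the number of in-flight operations (and connections) down.
    public static List<DBObject> findUsers(DBCollection userIdentity, List<String> users) {
        BasicDBObject query = new BasicDBObject("user", new BasicDBObject("$in", users));
        List<DBObject> results = new ArrayList<DBObject>();
        DBCursor cursor = userIdentity.find(query);
        try {
            while (cursor.hasNext()) {
                results.add(cursor.next());
            }
        } finally {
            cursor.close();
        }
        return results;
    }
}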
Problem 
At application startup, a large number of requests failed because of the overhead of creating mongos connections
Solution 
Build a “warm-up” stage into the application that executes stock queries before the instance goes online and takes traffic
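A sketch of the warm-up idea, with illustrative names: a handful of stock queries are run before the instance starts taking traffic so the mongos connections already exist.

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

import java.util.List;

public class WarmUp {
    // Execute stock queries before the application registers as online, so
    // connections to mongos are created ahead of real traffic.
    public static void run(DBCollection userIdentity, List<String> sampleGuids) {
        for (String guid : sampleGuids) {
            try {
                userIdentity.findOne(new BasicDBObject("_id", guid));
            } catch (RuntimeException e) {
                // warm-up is best effort; failures here should not block startup
            }
        }
    }
}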
Problem 
During failovers or other slow periods, application request queues back up and recovery takes too long
Solution 
Determine the request's time in queue; if it exceeds the client's timeout, don't process the request, drop it
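A sketch of the fail-fast check, with illustrative names: the time a request has spent in the queue is compared against the client's timeout before any work is done.

public class QueuedRequest {
    private final long enqueuedAtMillis = System.currentTimeMillis();
    private final long clientTimeoutMillis;

    public QueuedRequest(long clientTimeoutMillis) {
        this.clientTimeoutMillis = clientTimeoutMillis;
    }

    // True if the client has already given up on this request; processing it
    // would only slow recovery after a failover or other stall.
    public boolean isExpired() {
        return System.currentTimeMillis() - enqueuedAtMillis > clientTimeoutMillis;
    }
}

A worker thread would call isExpired() when it dequeues the request and simply drop the request if the client has already given up.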
Problem 
An application-applied optimistic lock encounters lock errors during concurrent writes (the entire document is updated on each write)
Solution 
Use the $set operator to target writes to just the impacted fields, and let MongoDB enforce atomicity
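A sketch of the targeted update: only the changed field is written with $set, letting MongoDB's single-document atomicity resolve concurrent writers instead of an application-level optimistic lock. Field names follow the UserIdentity sample above.

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

public class ProfileUpdater {
    // Update only the impacted field; concurrent writers touching different
    // fields of the same document no longer conflict.
    public static void updateLanguage(DBCollection userIdentity, String guid, String lang) {
        userIdentity.update(
                new BasicDBObject("_id", guid),
                new BasicDBObject("$set", new BasicDBObject("profile.lang", lang)));
    }
}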
Problem 
Reads go to the primary, but when the secondaries are lost (and the primary steps down), reads fail
Solution 
Use primaryPreferred for reads: we want the freshest data (a password, for example), but still want reads to work when no primary exists
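A sketch of setting primaryPreferred with the 2.x Java driver, either client-wide or per collection; the host and names are illustrative.

import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ReadPreference;
import com.mongodb.ServerAddress;

import java.util.Arrays;

public class ReadPreferenceSetup {
    public static void main(String[] args) throws Exception {
        // Client-wide default: read from the primary when one exists, otherwise
        // fall back to a secondary so reads keep working during failover.
        MongoClientOptions options = MongoClientOptions.builder()
                .readPreference(ReadPreference.primaryPreferred())
                .build();
        MongoClient client = new MongoClient(
                Arrays.asList(new ServerAddress("mongos-host", 27017)), options);

        // The same preference can also be set per collection.
        DBCollection userIdentity = client.getDB("identity").getCollection("userIdentity");
        userIdentity.setReadPreference(ReadPreference.primaryPreferred());

        client.close();
    }
}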
Problem 
A large number of connections to mongos/mongod extends failover times and approaches connection limits
Solution 
• Application DAOs share connections to the same MongoDB cluster 
• Connection parameters were initially set too high 
• Set connectionsPerHost and the connection multiplier, plus a buffer, to cover the fixed number of worker threads per application (15/5 for 32 worker threads; sketch below) 
• Went from 15K connections to 2K connections
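A sketch of the shared, right-sized pool, assuming the 2.x Java driver; the "connection multiplier" is taken to mean the driver's threadsAllowedToBlockForConnectionMultiplier option, and the 15/5 values are those quoted above for 32 worker threads.

import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;

import java.util.Arrays;

public class PooledClientFactory {
    // Shared across all DAOs talking to the same cluster, so each application
    // instance holds one modest pool instead of many large ones.
    public static MongoClient create() throws Exception {
        MongoClientOptions options = MongoClientOptions.builder()
                .connectionsPerHost(15)                          // pool size per mongos
                .threadsAllowedToBlockForConnectionMultiplier(5) // waiters allowed per connection
                .build();
        return new MongoClient(
                Arrays.asList(new ServerAddress("mongos-host", 27017)), options);
    }
}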
Benefits
Benefits 
• An unanticipated benefit was the ability for all eligible users to use the AOL client 
• Easily added Identity extensions leveraging the new data model 
• Support for multiple namespaces made building APIs for multi-tenancy straightforward 
• The model is positioned in such a way as to make the vision for AOL Identity feasible
Lessons Learned
Lessons Learned 
• Keep connections as low as possible 
– Higher connection numbers increase failover 
times 
• Avoid scatter-gather reads (use cache if 
possible to get to shard key) 
• Keep data set in memory 
• Fail fast on application side to lower recovery 
time
Going Forward
Going forward 
• Implement tagging to target secondaries 
• Further reduction in scatter-gather reads 
• Reduce failover window to as short as possible 
• Contact: doug.haydon@teamaol.com
