2. What this Talk is About
• 5 key reasons why Wordnik migrated to
a non-relational database
• Process for selection, migration
• Optimizations and tips from
survivors of the battlefield
3. Why Should You Care?
• MongoDB user for 2 years
• Lessons learned, analysis, benefits from
process
• We migrated from MySQL to MongoDB
with no downtime
• We have interesting/challenging data
needs, likely relevant to you
4. More on Wordnik
• World’s fastest updating English dictionary
• Based on input of text up to 8k words/second
• Word Graph as basis to our analysis
• Synchronous & asynchronous processing
• Tens of billions of documents in NR
storage
• 20M daily REST API calls, billions served
• Powered by Swagger OSS API framework
swagger.wordnik.com
5. Architectural History
• 2008: Wordnik was born as a LAMP AWS
EC2 stack
• 2009: Introduced public REST
API, powered wordnik.com, partner APIs
• 2009: Drank the NoSQL Kool-Aid
• 2010: Scala
• 2011: Micro SOA
6. Non-relational by Necessity
• Moved to NR because of the "4S"
• Speed
• Stability
• Scaling
• Simplicity
• But…
• MySQL can go a LONG way
• Takes right team, right reasons (+ patience)
• NR offerings simply too compelling to focus on
scaling MySQL
8. Why #1: Speed bumps with MySQL
• Inserting data fast (50k recs/second)
caused MySQL mayhem
• Maintaining indexes largely to blame
• Operations for consistency unnecessary but
"cannot be turned off"
• Devised twisted schemes to avoid client
blocking
• Aka the "master/slave tango"
9. Why #2: Retrieval Complexity
• Objects typically mapped to tables
• Object Hierarchy always => inner + outer joins
• Lots of static data, so why join?
• “Noun” is not getting renamed in my code’s
lifetime!
• Logic like this is probably in application logic
• Since storage is cheap
• I’ll choose speed
10. Why #2: Retrieval Complexity
One definition = 10+ joins
50 requests per second!
11. Why #2: Retrieval Complexity
• Embed objects in rows "sort of works"
• Filtering gets really nasty
• Native XML in MySQL?
• If a full table-scan is OK…
• OK, then cache it!
• Layers of caching introduced layers of complexity
• Stale data/corruption
• Object versionitis
• Cache stampedes
12. Why #3: Object Modeling
• Object models being compromised for
sake of persistence
• This is backwards!
• Extra abstraction for the wrong reason
• OK, then performance suffers
• In-application joins across objects
• "Who ran the fetch all query against production?!"
–any sysadmin
• "My zillionth ORM layer that only I
understand" (and can maintain)
13. Why #4: Scaling
• Needed "cloud friendly storage"
• Easy up, easy down!
• Startup: Sync your data, and announce to
clients when ready for business
• Shutdown: Announce your departure and leave
• Adding MySQL instances was a dance
• Snapshot + bin files
mysql> change master to
MASTER_HOST='db1', MASTER_USER='xxx',
MASTER_PASSWORD='xxx', MASTER_LOG_FILE='master-relay.000431',
MASTER_LOG_POS=1035435402;
14. Why #4: Scaling
• What about those VMs?
• So convenient! But… they kind of suck
• Can the database succeed on a VM?
• VM Performance:
• Memory, CPU or I/O—Pick only one
• Can your database really reduce CPU or disk I/O
with lots of RAM?
15. Why #5: Big Picture
• BI tools use relational constraints for discovery
• Is this the right reason for them?
• Can we work around this?
• Let’s have a BI tool revolution, too!
• True service architecture makes relational
constraints impractical/impossible
• Distributed sharding makes relational
constraints impractical/impossible
16. Why #5: Big Picture
• Is your app smarter than your database?
• The logic line is probably blurry!
• What does count(*) really mean when you
add 5k records/sec?
• Maybe eventual consistency is not so bad…
• 2PC? Do some reading and decide!
http://eaipatterns.com/docs/IEEE_Software_Design_2PC.pdf
17. Ok, I’m in!
• I thought deciding was easy!?
• Many quickly maturing products
• Divergent features tackle different needs
• Wordnik spent 8 weeks researching and
testing NoSQL solutions
• This is a long time! (for a startup)
• Wrote ODM classes and migrated our data
• Surprise! There were surprises
• Be prepared to compromise
18. Choice Made, Now What?
• We went with MongoDB ***
• Fastest to implement
• Most reliable
• Best community
• Why?
• Why #1: Fast loading/retrieval
• Why #2: Fast ODM (50 tps => 1000 tps!)
• Why #3: Document Models === Object models
• Why #4: Memory-mapped files (MMF) => kernel-managed memory, plus replica sets (RS)
• Why #5: It’s 2011, is there no progress?
19. More on Why MongoDB
• Testing, testing, testing
• Used our migration tools to load test
• Read from MySQL, write to MongoDB
• We loaded 5+ billion documents, many times over
• In the end, one server could…
• Insert 100k records/sec sustained
• Read 250k records/sec sustained
• Support concurrent loading/reading
20. Migration & Testing
• Iterated ODM mapping multiple times
• Some issues
• Type Safety
cur.next.get("iWasAnIntOnce").asInstanceOf[Long]
• Dates as Strings
obj.put("a_date", "2011-12-31") !=
obj.put("a_date", new Date("2011-12-31"))
• Storage Size
obj.put("very_long_field_name", true) >>
obj.put("vsfn", true)
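The type and date issues above can be sketched as a pair of defensive readers. This is a minimal illustration, not Wordnik's mapper: `SafeFields`, `asLong` and `asDate` are hypothetical names, and the `yyyy-MM-dd` format and UTC time zone are assumptions. A bare `asInstanceOf[Long]` is likely to throw a `ClassCastException` when the stored value is actually an `Integer`, so the sketch widens through `java.lang.Number` instead.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// Hypothetical helpers (not part of any MongoDB driver).
object SafeFields {
  // A numeric field may come back as Integer, Long or Double depending on
  // how it was written, so widen via java.lang.Number instead of casting.
  def asLong(value: Any): Option[Long] = value match {
    case n: java.lang.Number => Some(n.longValue)
    case s: String           => scala.util.Try(s.trim.toLong).toOption
    case _                   => None
  }

  // Turn "yyyy-MM-dd" strings into real Date values before storing them,
  // so that comparisons work on dates rather than on text.
  // Assumption: dates are interpreted as midnight UTC.
  def asDate(value: Any): Option[Date] = value match {
    case d: Date => Some(d)
    case s: String =>
      val fmt = new SimpleDateFormat("yyyy-MM-dd")
      fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
      fmt.setLenient(false)
      scala.util.Try(fmt.parse(s)).toOption
    case _ => None
  }
}
```

Returning `Option` pushes the "what if it is not a number" decision to the caller instead of failing deep inside a cursor loop.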
21. Migration & Testing
• Expect data model iterations
• Wordnik migrated each table to a Mongo collection "as-is"
• Easier to migrate, test
• _id field used same MySQL PK
• Auto Increment?
• Used MySQL to "check-out" sequences
• One row per mongo collection
• Run out of sequences => get more
• Need exclusive locks here!
22. Migration & Testing
• Sequence generator in-process
SequenceGenerator.checkout("doc_metadata,100")
• Sequence generator as web service
• Centralized UID management
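The check-out idea can be sketched as follows: reserve a block of ids under an exclusive lock, hand them out locally, and go back for more only when the block runs out. This is a sketch under assumptions, not the `SequenceGenerator` from the slides: `SequenceStore` and `BlockSequence` are hypothetical names, and an in-memory map stands in for the one-row-per-collection MySQL table.

```scala
import scala.collection.mutable

// Hypothetical backing store. In the slides this role is played by a MySQL
// row per collection; here an in-memory map guarded by a lock stands in.
class SequenceStore {
  private val next = mutable.Map.empty[String, Long]

  // Reserve `count` ids atomically and return the first one.
  // The synchronized block plays the part of the exclusive lock.
  def reserve(name: String, count: Int): Long = synchronized {
    val start = next.getOrElse(name, 1L)
    next.update(name, start + count)
    start
  }
}

// Hands out ids from a locally held block and returns to the store
// only when that block is exhausted.
class BlockSequence(store: SequenceStore, name: String, blockSize: Int) {
  private var current = 0L
  private var limit = 0L // exclusive upper bound of the current block

  def nextId(): Long = synchronized {
    if (current >= limit) {
      current = store.reserve(name, blockSize)
      limit = current + blockSize
    }
    val id = current
    current += 1
    id
  }
}
```

Two processes sharing one store get disjoint blocks, so ids stay unique without a round trip per insert. Ids are not globally ordered by insertion time, which is the usual trade-off of block allocation.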
23. Migration & Testing
• Expect data access pattern iterations
• So much more flexibility!
• Reach into objects
> db.dictionary_entry.find({"hdr.sr":"cmu"})
• Access to a whole object tree at query time
• Overwrite a whole object at once… when desired
• Not always! This clobbers the whole record
> db.foo.save({_id:18727353,foo:"bar"})
• Update a single field:
> db.foo.update({_id:18727353},{$set:{foo:"bar"}})
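The difference between the two shell commands above can be modeled in a few lines of Scala. This is only a model of the semantics using plain immutable maps, not driver code; `UpdateSemantics`, `save` and `set` are hypothetical names chosen to mirror the shell operations.

```scala
// A toy model of document updates using immutable maps.
object UpdateSemantics {
  type Doc = Map[String, Any]

  // Models save(): the stored document is replaced wholesale,
  // so any field missing from the replacement is lost.
  def save(stored: Doc, replacement: Doc): Doc = replacement

  // Models update with $set: only the named fields change,
  // everything else in the stored document survives.
  def set(stored: Doc, fields: Doc): Doc = stored ++ fields
}
```

With a stored document that also carries a `count` field, `save` with `{_id, foo}` drops `count`, while `set` with `{foo}` keeps it.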
24. Flip the Switch
• Migrate production with zero downtime
• We temporarily halted loading data
• Added a switch to flip between MySQL/MongoDB
• Instrument, monitor, flip it, analyze, flip back
• Profiling your code is key
• What is slow?
• Build this in your app from day 1
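Built-in profiling can start as a small timing wrapper around each storage call. The following is a minimal sketch, not Wordnik's instrumentation: `Profiler`, `timed`, `calls` and `averageMillis` are hypothetical names, and a real deployment would likely export these numbers to a monitoring system.

```scala
import scala.collection.mutable

// Hypothetical in-process profiler: records call count and total time per name.
object Profiler {
  // name -> (call count, total nanoseconds)
  private val stats = mutable.Map.empty[String, (Long, Long)]

  // Run `block`, record how long it took under `name`, and return its result.
  // The finally clause records the timing even if the block throws.
  def timed[T](name: String)(block: => T): T = {
    val start = System.nanoTime()
    try block
    finally {
      val elapsed = System.nanoTime() - start
      stats.synchronized {
        val (count, total) = stats.getOrElse(name, (0L, 0L))
        stats.update(name, (count + 1, total + elapsed))
      }
    }
  }

  def calls(name: String): Long =
    stats.synchronized(stats.get(name).map(_._1).getOrElse(0L))

  def averageMillis(name: String): Double = stats.synchronized {
    stats.get(name).map { case (count, total) => total / 1e6 / count }.getOrElse(0.0)
  }
}
```

Wrapping both the MySQL and the MongoDB code paths with the same names makes the before/after comparison direct when the switch is flipped.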
26. Flip the Switch
• Storage selected at runtime
val h = shouldUseMongoDb match {
case true => new MongoDbSentenceDAO
case _ => new MySQLDbSentenceDAO
}
h.find(...)
• Hot-swappable storage via configuration
• It worked!
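One way to make the switch flippable without a restart is to resolve the DAO on every call from a volatile flag. This is a sketch under assumptions: the DAO class names come from the slide, but the `SentenceDAO` trait, the `find(id: Long)` signature, the `backend` field and the `SentenceStorage` object are hypothetical, and the DAO bodies are stand-ins for real driver calls.

```scala
// Common interface so callers do not care which store answers.
trait SentenceDAO {
  def backend: String
  def find(id: Long): Option[String]
}

// Stand-ins for the real DAOs; each would wrap its own driver.
class MongoDbSentenceDAO extends SentenceDAO {
  val backend = "mongodb"
  def find(id: Long): Option[String] = Some(s"mongo sentence $id")
}

class MySQLDbSentenceDAO extends SentenceDAO {
  val backend = "mysql"
  def find(id: Long): Option[String] = Some(s"mysql sentence $id")
}

object SentenceStorage {
  // Set from configuration or an admin endpoint. Volatile so that every
  // request thread sees a flip on its next call.
  @volatile var useMongoDb: Boolean = false

  private val mongo = new MongoDbSentenceDAO
  private val mysql = new MySQLDbSentenceDAO

  // Resolved per call, so flipping the flag takes effect immediately.
  def dao: SentenceDAO = if (useMongoDb) mongo else mysql
}
```

Callers write `SentenceStorage.dao.find(...)` and never hold on to a DAO instance, which keeps "flip it, analyze, flip back" cheap.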
27. Then What?
• Watched our deployment, made many iterations to the
mapping layer
• Settled on in-house, type-safe mapper
https://github.com/fehguy/mongodb-benchmark-tools
• Some gotchas (of course)
• Locking issues on long-running updates (more in a
minute)
• We want more of this!
• Migrated shared files to Mongo GridFS
• Easy-IT
28. Performance + Optimization
• Loading data is fast!
• Fixed collection padding, similarly-sized records
• Tail of collection is always in memory
• Append faster than MySQL in every case tested
• But... random access started getting slow
• Indexes in RAM? Yes
• Data in RAM? No, > 2TB per server
• Limited by disk I/O and seek performance
• EC2 + EBS for storage?
29. Performance + Optimization
• Moved to physical data center
• DAS & 72GB RAM => great uncached
performance
• Good move? Depends on use case
• If "access anything anytime", not many options
• You want to support this?
30. Performance + Optimization
• Inserts are fast, how about updates?
• Well… update => find object, update it, save
• Lock acquired at "find", released after "save"
• If hitting disk, lock time could be large
• Easy answer, pre-fetch on update
• Oh, and NEVER do "update all records" against a
large collection
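The pre-fetch idea can be sketched as "read the document first, then update it", so that any disk access happens during the read rather than while the update holds its lock. This is an illustration only: `DocCollection`, `InMemoryCollection` and `PrefetchingUpdate` are hypothetical names, not a real driver API, and the in-memory stand-in only records the order of operations. Whether the read actually shortens lock time depends on the server version and storage engine.

```scala
import scala.collection.mutable

// Hypothetical minimal collection interface; not a real driver API.
trait DocCollection {
  def findOne(id: Long): Option[Map[String, Any]]
  def setField(id: Long, field: String, value: Any): Boolean
}

// In-memory stand-in that records the order of operations.
class InMemoryCollection extends DocCollection {
  val log = mutable.ArrayBuffer.empty[String]
  private val docs = mutable.Map.empty[Long, Map[String, Any]]

  def insert(id: Long, doc: Map[String, Any]): Unit = docs.update(id, doc)

  def findOne(id: Long): Option[Map[String, Any]] = {
    log += s"find $id"
    docs.get(id)
  }

  def setField(id: Long, field: String, value: Any): Boolean = {
    log += s"update $id"
    docs.get(id) match {
      case Some(doc) =>
        docs.update(id, doc + (field -> value))
        true
      case None => false
    }
  }
}

object PrefetchingUpdate {
  // Read first so a cold document is paged in by the read, then issue the
  // update. Ids that do not exist are skipped without an update call.
  def apply(coll: DocCollection, id: Long, field: String, value: Any): Boolean =
    coll.findOne(id).isDefined && coll.setField(id, field, value)
}
```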
31. Performance + Optimization
• Indexes
• Can't always keep index in RAM. MMF "does its
thing"
• Right-balanced b-tree keeps necessary index hot
• Indexes hit disk => mute your pager
32. More Mongo, Please!
• We modeled our word graph in mongo
• 50M Nodes
• 80M Edges
• 80 µs edge fetch
33. More Mongo, Please!
• Analytics rolled-up from aggregation jobs
• Send to Hadoop, load to mongo for fast access
34. What’s next
• Liberate our models
• stop worrying about how to store them (for the
most part)
• New features almost always NR
• Some MySQL left
• Less with each release
35. Questions?
• See more about Wordnik APIs
http://developer.wordnik.com
• Migrating from MySQL to MongoDB
http://www.slideshare.net/fehguy/migrating-from-mysql-to-mongodb-at-wordnik
• Maintaining your MongoDB Installation
http://www.slideshare.net/fehguy/mongo-sv-tony-tam
• Swagger API Framework
http://swagger.wordnik.com
• Mapping Benchmark
https://github.com/fehguy/mongodb-benchmark-tools
• Wordnik OSS Tools
https://github.com/wordnik/wordnik-oss