
MongoDB at Sailthru: Scaling and Schema Design

Sailthru provides all your website email delivery needs, ensuring Inbox delivery for transactional and mass mail. Sailthru started out as a MySQL-powered transactional-mail service. Starting in 2009, we migrated to the document-oriented "nosql" database MongoDB. Moving entirely to MongoDB has allowed us to build complex user profiles to power behavioral-targeted mass emails and onsite recommendations. How and why we made the move, and how we use MongoDB today.

Published in: Technology

  1. MongoDB at Sailthru: Scaling and Schema Design
     Ian White, @eonwhite
     NoSQL Now! 8/25/11
     Sunday, August 7, 2011
  2. Sailthru
     • API-based transactional email led to...
     • Mass campaign email led to...
     • Intelligence and user behavior
     • Three engineers built the ESP we always wanted to use
     • Some Clients: Huffpo-AOL, Thrillist, Refinery 29, Flavorpill, Business Insider, Fab, Totsy, New York Observer
  3. How We Got To MongoDB from SQL
     • JSON was part of Sailthru infrastructure from start (SQL columns and S3)
     • Kept a close eye on CouchDB project
     • MongoDB felt like natural fit
     • Used for user profiles and analytics initially
     • Migrated one table at a time (very, very carefully)
  4. Sailthru Architecture
     • User interface to display stats, build campaigns and templates, etc (PHP/EC2)
     • API, link rewriting, and onsite endpoints (PHP/EC2)
     • Core mailer engine (Java/EC2 and colo)
     • Modified-postfix SMTP servers (colo)
     • 11 database servers on EC2 (for now)
  5. MongoDB Overview
     • 13 instances on EC2 (6 two-member replica sets, 1 backup server)
     • About 40 collections
     • About 1TB
     • Largest single collection is 500m docs
  6. Users are Documents
     • Users aren’t records split among multiple tables
     • End user’s lists, clickstream interests, geolocation, browser, time of day, purchase history become one ever-growing document
  7. Profiles Accessible Everywhere
     • Put abandoned shopping cart notifications within a mass email
         {if profile.purchase_incomplete}
           <p>This is what’s in your cart:</p>
           {foreach profile.purchase_incomplete.items as item}
             {item.qty} <a href="{item.url}">{item.title}</a><br/>
           {/foreach}
         {/if}
  8. Profiles Accessible Everywhere
     • Show a section of content conditional on the user’s location
         {if profile.geo.city['New York, NY US']}
           <div>Come to the New York Meetup on the 27th!</div>
         {/if}
  9. Profiles Accessible Everywhere
     • Show different content depending on user interests as measured by on-site behavior
         {select}
           {case horizon_interest(black,dark)}
             <img src="http://example.com/dress-image-black.jpg" />
           {/case}
           {case horizon_interest(green)}
             <img src="http://example.com/dress-image-green.jpg" />
           {/case}
           {case horizon_interest(purple,polka_dot,pattern)}
             <img src="http://example.com/dress-image-polkadot.jpg" />
           {/case}
         {/select}
  10. Profiles Accessible Everywhere
      • Pick top content from a data feed based on tags
          {content = horizon_select(content,10)}
          {foreach content as c}
            <a href="{c.url}">{c.title}</a><br/>
          {/foreach}
  11. Other Advantages of MongoDB
      • High performance
      • Take any parameters from our clients
      • Really flexible development
      • Great for analytics (internal and external)
      • No more downtime for schema migrations or reindexing
  12. How We Run mongod
      • mongod --dbpath /path/to/db --logpath /path/to/log/mongodb.log --logappend --fork --rest --replSet main1 --journal
      • Don’t ever run without replication
      • Don’t ever kill -9
      • Don’t run without writing to a log
      • Run behind a firewall
      • Use journaling now that it’s there
      • Use --rest, it’s handy
  13. Separate DBs By Collections
      • Lower-effort than auto-sharding
      • Separate databases for different usage patterns
      • Consider consequences of database failure/unavailability
      • But make sure your backup and monitoring strategy is prepared for multiple DBs
  14. Our Five Replica Sets
      • main: most of the stuff on the UI, lots of small/medium collections
      • horizon: realtime onsite browsing data
      • profile: user profile data (60m user docs)
      • message: last three months of emails
      • archive: emails older than three months
  15. Monitoring
      • Some stuff to monitor: faults/sec, index misses, % locked, queue size, load average
      • We check basic status once a minute on all database servers (SMS alerts if down), and send email warnings on thresholds every 10 minutes
      • Have been beta-ing 10gen’s MMS product
  16. Backups
      • Used to use mongodump - don’t do that anymore
      • Have single node of each replica set on a backup server
      • Two-hour slave delay
      • fsync/lock, freeze xfs file system, EBS snapshot, unfreeze, unlock
  17. The Great EC2 EBS Outage Adventure
      • We survived
      • Most of our nodes unavailable for 2-4 days
      • Were able to spin up new instances from backup server, snapshots, and get operational within hours
      • Wasn’t fun
  18. DESIGN
  19. Develop Your Mental Model of MongoDB
      • You don’t need to look at the internals
      • But try to gain a working understanding of how MongoDB operates, especially RAM and indexes
  20. Big-Picture Design Questions
      • What is the data I want to store?
      • How will I want to use that data later?
      • How big will the data get?
      • If the answers are “I don’t know yet”, guess with your best YAGNI
  21. “But premature optimization is evil”
      • Knuth said that about code, which is flexible and easy to optimize later
      • Data is not as flexible as code
      • So doing some planning for performance is usually good when it comes to your data
  22. Specific MongoDB Design Questions
      • Embed vs top-level collection?
      • Denormalize (double-store data)?
      • How many/which indexes?
      • Arrays vs hashes for embedding?
      • Implicit schema (field names and types)
  23. Short Field Names?
      • Disk space: cheap
      • RAM: not cheap
      • Developer Time: expensive
      • Err towards compact, readable fieldnames
      • Might be worth writing a mapper
      • Probably wish we’d used c instead of client_id
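The "mapper" idea can be sketched roughly as follows: application code keeps readable names while the stored document uses short ones. This is only an illustration, not Sailthru's implementation. The mapping table and the names `FIELD_MAP`, `toStored`, and `fromStored` are assumptions, and it only handles top-level keys.

```javascript
// Hypothetical mapping from readable names (used in code) to short names (stored).
const FIELD_MAP = { client_id: 'c', email: 'e', template: 't' };

// Build the reverse mapping once: short name -> readable name.
const REVERSE_MAP = Object.fromEntries(
  Object.entries(FIELD_MAP).map(([longName, shortName]) => [shortName, longName])
);

// Rename top-level keys using the given map; unmapped keys pass through unchanged.
function renameKeys(doc, map) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    const mapped = Object.prototype.hasOwnProperty.call(map, key) ? map[key] : key;
    out[mapped] = value;
  }
  return out;
}

const toStored = (doc) => renameKeys(doc, FIELD_MAP);
const fromStored = (doc) => renameKeys(doc, REVERSE_MAP);

// Example round trip with a made-up document.
const stored = toStored({ client_id: 76, email: 'user@example.com', _id: 1 });
const restored = fromStored(stored);
console.log(stored, restored);
```

The tradeoff named on the slide still applies: the mapper costs developer time and adds a layer to debug, so it is mainly worth it on the largest collections.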
  24. Favor Human-Readable Foreign Keys
      • DBRefs are a bit cumbersome
      • Referencing by MongoId often means doing extra lookups
      • Build human-readable references to save you doing lookups and manual joins
  25. Example
      • Store the Template and the Email as strings on the message object
      • { template: "Internal - Blast Notify", email: "support-alerts@sailthru.com" }
      • No external reference lookups required
      • The tradeoff is basically just disk space
  26. Embed vs Top-Level Collections?
      • Major question of MongoDB schema design
      • If you can ask the question at all, you might want to err on the side of embedding
      • Don’t embed if the embedding could get huge
      • Don’t feel too bad about denormalizing by embedding AND storing in a top-level collection
  27. Typical Properties of Top-Level Collections
      • Independence: They don’t “belong” conceptually to another collection
      • Nouns: the building blocks of your system
      • Easily referenceable and updatable
  28. Embedding Pros
      • Super-fast retrieval of document with related data
      • Atomic updates
      • “Ownership” of embedded document is obvious
      • Usually maps well to code structures
  29. Embedding Cons
      • Harder to get at, do mass queries
      • Does not size up infinitely, will hit 16MB limit
      • Hard to create references to embedded object
      • Limited ability to indexed-sort the embedded objects
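To make the size limit concrete, here is a back-of-the-envelope sketch of how many embedded entries fit in one document. The entry size and base document size are made-up numbers, and `maxEmbeddedEntries` is a hypothetical helper. Real BSON sizes depend on field names and types, so treat the result as an order-of-magnitude estimate.

```javascript
// The 16MB document limit from the slide, in bytes.
const MAX_DOC_BYTES = 16 * 1024 * 1024;

// Rough number of embedded entries that fit, given an assumed average size
// per entry and an assumed size for the rest of the document.
function maxEmbeddedEntries(avgEntryBytes, baseDocBytes = 0) {
  return Math.floor((MAX_DOC_BYTES - baseDocBytes) / avgEntryBytes);
}

// Hypothetical numbers: 100-byte entries in a document with 1 KB of other fields.
const estimate = maxEmbeddedEntries(100, 1024);
console.log(estimate);
```

If the expected number of embedded entries is anywhere near such an estimate, that is a sign the data belongs in a top-level collection instead.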
  30. If You Think You Can Embed
      • You probably should
      • I take advantage of embedding in my designs more often now than I did three years ago
      • It’s a gift MongoDB gives you in exchange for giving up your joins
  31. Design Example: User Permissions
      • Users can have various broad permission levels for any number of clients
      • For example, user 'ploki' might have permission level 'admin' for client 76 and permission level 'reports_only' for client 450
  32. How Will We Use This Data?
      • Retrieve all clients for a given user
      • Retrieve all users for a given client
      • Retrieve a permission level for a given client for a given user
  33. How Will This Data Grow?
      • In the medium term, it will stay small
      • Number of clients and number of users can both grow infinitely
  34. Back in SQL-land
      • There’s a fairly standard way to do it
      • It’s a many-many relationship, so
      • Use a join table (client_user)
  35. Should We Use a New Top-Level Collection?
        db.client.user.save( {
            client_id: 76,
            username: 'ploki',
            permission: 'admin',
        });
        db.client.user.save( {
            client_id: 450,
            username: 'ploki',
            permission: 'reports_only',
        });
        db.client.user.ensureIndex( { client_id: 1 } );
        db.client.user.ensureIndex( { username: 1 } );
        // get all users belonging to a client
        db.client.user.find( { client_id: 76 } );
        // get all clients a user has access to
        db.client.user.find( { username: 'ibwhite' } );
        // get permissions for our current user
        db.client.user.findOne( { username: user.name } );
  36. Probably Not
      • Only needed if we have lots of clients per user AND lots of users per client
      • This is a case where we can embed, so let’s do so
  37. Three Ways to Embed
      • Not good: Object
          'clients': {
              '76': 'admin',
              '450': 'reports_only',
          },
          can’t do a multikeys index on the keys of a hash
          index: ???
      • Okay: Array
          'clients': [
              {'_id': 76, 'access': 'admin'},
              {'_id': 450, 'access': 'reports_only'}
          ],
          but have to search through array of objects to find by _id on retrieved doc
          index: { 'clients._id': 1 }
      • Our approach: Array and object
          'clients': [ 76, 450 ],
          'clients_access': {
              '76': 'admin',
              '450': 'reports_only',
          }
          Fields next to each other alphabetically
          index: { clients: 1 }
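A minimal sketch of how the "array and object" layout serves the three access patterns listed earlier in the deck. A plain in-memory array stands in for the collection here. The second user, the `username` field on the user document, and the helper names are assumptions for illustration.

```javascript
// In-memory stand-in for the user collection, using the slide's
// "array and object" layout. 'example_user' is made up.
const users = [
  { username: 'ploki', clients: [76, 450],
    clients_access: { '76': 'admin', '450': 'reports_only' } },
  { username: 'example_user', clients: [76],
    clients_access: { '76': 'reports_only' } },
];

// All clients for a given user: read the array off the retrieved document.
function clientsForUser(user) {
  return user.clients;
}

// All users for a given client. In MongoDB this would be a query against the
// indexed array, e.g. find({ clients: 76 }); here it is a linear filter.
function usersForClient(clientId) {
  return users.filter((u) => u.clients.includes(clientId)).map((u) => u.username);
}

// Permission level for one user on one client: a direct key lookup on the
// retrieved document, with no scan through an array of objects.
function permissionFor(user, clientId) {
  const level = user.clients_access[String(clientId)];
  return level === undefined ? null : level;
}

console.log(clientsForUser(users[0]));
console.log(usersForClient(76));
console.log(permissionFor(users[0], 450));
```

The cost of this layout is that the client ID is stored twice per user, so both fields have to be updated together when access changes.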
  38. Indexes
      • Index all highly frequent queries
      • Do less-indexed queries only on secondaries
      • Reduce the size of indexes wherever you can on big collections
      • Don’t sweat the medium-sized collections, focus on the big wins
  39. Take Advantage of Multiple-Field Indexes
      • Order matters
      • If you have an index on { client_id: 1, email: 1 }
      • Then you also have the { client_id: 1 } index “for free”
      • but not { email: 1 }
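The "for free" rule can be illustrated with a deliberately simplified model: a compound index helps a query only through its leading fields. The function below is a teaching sketch with a hypothetical name, not how MongoDB's query planner is implemented, and it only considers which fields a query constrains.

```javascript
// Count how many leading fields of a compound index a query constrains.
// 0 means the index cannot be used as a prefix match for this query.
function usablePrefixLength(indexFields, queryFields) {
  let n = 0;
  for (const field of indexFields) {
    if (!queryFields.includes(field)) break;
    n += 1;
  }
  return n;
}

// Field order of the compound index from the slide.
const compoundIndex = ['client_id', 'email'];

console.log(usablePrefixLength(compoundIndex, ['client_id', 'email']));
console.log(usablePrefixLength(compoundIndex, ['client_id']));
console.log(usablePrefixLength(compoundIndex, ['email']));
```

In this model a query on `client_id` alone still matches the first field of the index, while a query on `email` alone matches nothing, which is why field order matters.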
  40. Use your _id
      • You must use an _id for every collection, which will cost you index size
      • So do something useful with _id
  41. Take advantage of fast ^indexes
      • Messages have _ids like: 32423.00000341
      • Need all messages in blast 32423:
      • db.message.blast.find( { _id: /^32423./ } );
      • (Yeah, I know the . is ugly. Don’t use a dot if you do this.)
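A sketch of why prefix-structured _ids are index-friendly: a leading-prefix match is equivalent to a contiguous range over sorted keys. The range form below is my illustration rather than the speaker's query, `prefixRange` is a hypothetical helper, and every id except the one from the slide is made up. The sorted array stands in for the _id index.

```javascript
// Turn a string prefix into a half-open range [prefix, upper).
// Assumes the prefix is non-empty and its last character is not the
// maximum UTF-16 code unit.
function prefixRange(prefix) {
  const last = prefix.charCodeAt(prefix.length - 1);
  const upper = prefix.slice(0, -1) + String.fromCharCode(last + 1);
  return { $gte: prefix, $lt: upper };
}

// Sorted ids stand in for the _id index.
const ids = [
  '32422.00000009',
  '32423.00000001',
  '32423.00000341',
  '324231.00000002',
  '32424.00000001',
].sort();

// Range form of "all messages in blast 32423".
const range = prefixRange('32423.');
const inBlast = ids.filter((id) => id >= range.$gte && id < range.$lt);
console.log(range, inBlast);

// The unescaped dot in the regex matches any character, so the pattern
// from the slide also accepts an id from blast 324231.
const looseMatch = /^32423./.test('324231.00000002');
console.log(looseMatch);
```

The last check shows one reason the dot separator is awkward: as a regex metacharacter it widens the match unless it is escaped.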
  42. Manual Range Partitioning
      • We moved a big message.blast collection into per-day collections:
      • message.blast.20110605, message.blast.20110606, message.blast.20110607, etc...
      • Keeps working set indexes smaller
      • When we move data into the archive, drop() is much faster than remove()
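A small sketch of the naming side of per-day partitioning: deriving the collection name for a date and picking out the collections old enough to archive. The helper names are hypothetical, and bucketing by UTC day is an assumption; the slides do not say which time zone defines a "day".

```javascript
// Collection name for a given day, following the slide's
// message.blast.YYYYMMDD pattern. Uses the UTC calendar day (an assumption).
function blastCollectionName(date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, '0');
  const d = String(date.getUTCDate()).padStart(2, '0');
  return `message.blast.${y}${m}${d}`;
}

// From a list of collection names, pick the per-day collections strictly
// older than the cutoff day. Fixed-width date suffixes sort in date order,
// so a plain string comparison is enough.
function collectionsOlderThan(names, cutoffDate) {
  const cutoff = blastCollectionName(cutoffDate);
  return names.filter((n) => n.startsWith('message.blast.') && n < cutoff);
}

const existing = [
  'message.blast.20110605',
  'message.blast.20110606',
  'message.blast.20110607',
];
const toArchive = collectionsOlderThan(existing, new Date(Date.UTC(2011, 5, 7)));
console.log(toArchive);
// Once archived, each of these could be removed in the mongo shell with
// db.getCollection(name).drop().
```

Dropping a whole collection avoids deleting documents one by one, which matches the slide's point about drop() versus remove().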
  43. Questions? Looking for a job?
      ian@sailthru.com
      twitter.com/eonwhite
